IT Operations Technical Lead
Leads IT operations for hybrid cloud and on-premises infrastructure, managing Linux/Windows systems, ITIL processes, incident response, automation, and AI/ML workloads. Requires 5+ years leading teams and 10+ years of hands-on Linux experience, along with experience in cloud and DevOps tools.
Responsibilities
- Lead and manage IT operations aligned with ITIL processes including Incident, Problem, Change, and Release Management
- Provide hands-on leadership in managing Linux and Windows environments across cloud and on-premises infrastructure
- Own and drive incident response, root cause analysis, and service restoration for mission-critical systems
- Design, build, and maintain golden images, patching strategies, and system hardening standards
- Lead patch management and vulnerability remediation programs ensuring compliance and system integrity
- Develop and implement automation solutions using modern approaches including Vibe Coding (AI-assisted development) to accelerate operational efficiency and reduce toil
- Support and optimize infrastructure for AI/ML workloads, including provisioning, scaling, and performance tuning
- Manage and maintain GPU-enabled environments and instances for high-performance computing and machine learning use cases
- Oversee and optimize infrastructure monitoring, logging, alerting, and observability frameworks
- Manage and mentor a team of systems engineers; provide technical guidance and performance oversight
- Collaborate with architecture, security, and development teams to improve reliability, scalability, and operational efficiency
- Support hybrid environments including cloud platforms and on-premises data centers
- Ensure proper documentation, runbooks, SOPs, and operational readiness
- Stay abreast of new technologies and standards, including US Federal Standards, NIST publications, cloud computing and deployment, site reliability engineering, security standards, and compliance best practices
Requirements
- 5+ years of experience leading operations teams, with hands-on experience driving operational process improvements and technological advancements
- Proven experience implementing and operating within ITIL frameworks
- 10+ years of hands-on Unix/Linux experience, including CentOS / Red Hat systems administration in large-scale distributed environments
- Hands-on experience with incident management, patching, system hardening, and production support
- Experience building and maintaining golden images and standardized environments
- Strong scripting/automation skills (Python, Bash, PowerShell or similar)
- Experience with configuration management and automation tools (Ansible, Terraform, Puppet, Chef, or similar)
- Strong understanding of networking fundamentals (DNS, TCP/IP, firewalls, load balancing)
- Experience with monitoring and logging tools (Nagios, Splunk, ELK, Prometheus, Grafana)
- Cloud build-out or migration experience with at least one of the following providers: Amazon Web Services (AWS), Google Cloud Platform (GCP), or Microsoft Azure
- 2+ years with CI/CD and automation tools such as Terraform, Ansible, Chef, Puppet, Jenkins, GitHub
- Experience supporting AI/ML workloads or data-intensive platforms
- Familiarity with GPU-based compute environments (e.g., NVIDIA GPU instances)
- Willingness to learn and adopt new technologies and to adapt to changing needs from project to project
Preferred
- Knowledge of security best practices and compliance frameworks such as NIST 800-53, FedRAMP, FISMA
- Certifications such as ITIL, Linux, AWS, Azure, or Kubernetes (CKA/CKAD)
- Networking certifications (CCNA/CCNP)
Senior Network Engineer
Design, deploy, and operate enterprise network infrastructure for corporate facilities and hybrid cloud environments with zero-trust architecture and compliance requirements. Requires 5+ years of enterprise networking experience and the ability to obtain TS/SCI clearance.
Site Reliability Engineer
Senior or Staff Site Reliability Engineer focused on continuous delivery infrastructure using Argo Workflows, ArgoCD, and Kubernetes. Owns deployment tooling, onboarding flows, and participates in 24/7 on-call. Requires 6+ years building and operating distributed systems.