IT Operations Technical Lead

150k – 170k | Frederick, MD | DevOps / SRE | Hybrid | 5+ YOE
Summary

Leads IT operations for hybrid cloud and on-premises infrastructure, managing Linux/Windows systems, ITIL processes, incident response, automation, and AI/ML workloads. Requires 5+ years leading teams and 10+ years of hands-on Linux experience with cloud and DevOps tools.

About the role

Responsibilities

  • Lead and manage IT operations aligned with ITIL processes including Incident, Problem, Change, and Release Management
  • Provide hands-on leadership in managing Linux and Windows environments across cloud and on-premises infrastructure
  • Own and drive incident response, root cause analysis, and service restoration for mission-critical systems
  • Design, build, and maintain golden images, patching strategies, and system hardening standards
  • Lead patch management and vulnerability remediation programs ensuring compliance and system integrity
  • Develop and implement automation solutions using modern approaches including Vibe Coding (AI-assisted development) to accelerate operational efficiency and reduce toil
  • Support and optimize infrastructure for AI/ML workloads, including provisioning, scaling, and performance tuning
  • Manage and maintain GPU-enabled environments and instances for high-performance computing and machine learning use cases
  • Oversee and optimize infrastructure monitoring, logging, alerting, and observability frameworks
  • Manage and mentor a team of systems engineers; provide technical guidance and performance oversight
  • Collaborate with architecture, security, and development teams to improve reliability, scalability, and operational efficiency
  • Support hybrid environments including cloud platforms and on-premises data centers
  • Ensure proper documentation, runbooks, SOPs, and operational readiness
  • Stay abreast of new technologies and standards, including US federal standards, NIST publications, cloud computing and deployment, site reliability engineering, security standards, and compliance best practices

Requirements

  • 5+ years of experience leading operations teams, with hands-on experience driving operational process improvements and technological advancements
  • Proven experience implementing and operating within ITIL frameworks
  • 10+ years of hands-on Unix/Linux experience, including CentOS / Red Hat systems administration for large-scale distributed environments
  • Hands-on experience with incident management, patching, system hardening, and production support
  • Experience building and maintaining golden images and standardized environments
  • Strong scripting/automation skills (Python, Bash, PowerShell or similar)
  • Experience with configuration management and automation tools (Ansible, Terraform, Puppet, Chef, or similar)
  • Strong understanding of networking fundamentals (DNS, TCP/IP, firewalls, load balancing)
  • Experience with monitoring and logging tools (Nagios, Splunk, ELK, Prometheus, Grafana)
  • Cloud build-out or migration experience with at least one of the following providers: AWS, Google Cloud (GCP), or Microsoft Azure
  • 2+ years of experience with CI/CD and automation tools such as Terraform, Ansible, Chef, Puppet, Jenkins, GitHub
  • Experience supporting AI/ML workloads or data-intensive platforms
  • Familiarity with GPU-based compute environments (e.g., NVIDIA GPU instances)
  • Willingness to learn new technologies and to adapt to emerging technologies or needs from project to project

Preferred

  • Knowledge of security best practices and compliance frameworks such as NIST 800-53, FedRAMP, FISMA
  • Certifications such as ITIL, Linux, AWS, Azure, or Kubernetes (CKA/CKAD)
  • Networking certifications (CCNA/CCNP)
Skills
Linux, ITIL, Ansible, Terraform, AWS, Python, Bash, PowerShell, Splunk, Prometheus, Grafana, Kubernetes, CentOS, Red Hat, CI/CD