IT Operations Technical Lead

150k – 170k | Frederick, MD | DevOps / SRE | Hybrid | 5+ YOE | Apr 8

Summary

Leads IT operations for hybrid cloud and on-premise infrastructure, managing Linux/Windows systems, ITIL processes, incident response, automation, and AI/ML workloads. Requires 5+ years leading teams and 10+ years hands-on Linux experience with cloud and DevOps tools.

About the role

Responsibilities

Lead and manage IT operations aligned with ITIL processes including Incident, Problem, Change, and Release Management
Provide hands-on leadership in managing Linux and Windows environments across cloud and on-premises infrastructure
Own and drive incident response, root cause analysis, and service restoration for mission-critical systems
Design, build, and maintain golden images, patching strategies, and system hardening standards
Lead patch management and vulnerability remediation programs ensuring compliance and system integrity
Develop and implement automation solutions using modern approaches including Vibe Coding (AI-assisted development) to accelerate operational efficiency and reduce toil
Support and optimize infrastructure for AI/ML workloads, including provisioning, scaling, and performance tuning
Manage and maintain GPU-enabled environments and instances for high-performance computing and machine learning use cases
Oversee and optimize infrastructure monitoring, logging, alerting, and observability frameworks
Manage and mentor a team of systems engineers; provide technical guidance and performance oversight
Collaborate with architecture, security, and development teams to improve reliability, scalability, and operational efficiency
Support hybrid environments including cloud platforms and on-premise data centers
Ensure proper documentation, runbooks, SOPs, and operational readiness
Stay abreast of new technologies and standards, including US federal standards, NIST publications, cloud computing and deployment, site reliability engineering, security standards, and compliance best practices

Requirements

5+ years of experience leading operations teams, with hands-on experience driving operational process improvements and technological advancements
Proven experience implementing and operating within ITIL frameworks
10+ years of hands-on Unix/Linux experience, including CentOS / Red Hat systems administration for large-scale distributed environments
Hands-on experience with incident management, patching, system hardening, and production support
Experience building and maintaining golden images and standardized environments
Strong scripting/automation skills (Python, Bash, PowerShell or similar)
Experience with configuration management and automation tools (Ansible, Terraform, Puppet, Chef, or similar)
Strong understanding of networking fundamentals (DNS, TCP/IP, firewalls, load balancing)
Experience with monitoring and logging tools (Nagios, Splunk, ELK, Prometheus, Grafana)
Cloud build-out or migration experience with at least one of the following providers: AWS, Google Cloud (GCP), or Microsoft Azure
2+ years of experience with CI/CD and automation tools such as Terraform, Ansible, Chef, Puppet, Jenkins, and GitHub
Experience supporting AI/ML workloads or data-intensive platforms
Familiarity with GPU-based compute environments (e.g., NVIDIA GPU instances)
Willingness to learn new technologies and adapt to emerging needs from project to project

Preferred

Knowledge of security best practices and compliance frameworks such as NIST 800-53, FedRAMP, FISMA
Certifications such as ITIL, Linux, AWS, Azure, or Kubernetes (CKA/CKAD)
Networking certifications (CCNA/CCNP)

Skills

Linux, ITIL, Ansible, Terraform, AWS, Python, Bash, PowerShell, Splunk, Prometheus, Grafana, Kubernetes, CentOS, Red Hat, CI/CD
