Site Reliability Engineer

140k – 155k | Frederick, MD | DevOps / SRE | Onsite | 6+ YOE
Summary

Builds scalable multi-cloud infrastructure using Kubernetes and AI/ML tooling, implements observability with Grafana/Splunk, automates IaC with Terraform/Ansible, and ensures compliance/reliability for scientific programs. Requires 6+ years DevOps/SRE experience.

About the role

Responsibilities

  • Design and implement enterprise-grade monitoring and observability frameworks (metrics, logs, traces) across distributed systems using enterprise Splunk, Grafana, and OpenTelemetry tools
  • Establish and manage SLIs, SLOs, and error budgets to drive reliability improvements
  • Develop and maintain real-time asset inventory systems across cloud, on-prem, and hybrid environments
  • Automate workload onboarding and offboarding processes, ensuring standardization and governance
  • Track system ownership, dependencies, and lifecycle states for operational transparency
  • Build proactive detection mechanisms using AIOps and intelligent alerting to minimize incident impact
  • Design and operate scalable, resilient, and secure infrastructure platforms across cloud and hybrid environments
  • Implement automated compliance tracking and enforcement aligned with organizational and regulatory standards (e.g., NIST, FISMA, FedRAMP)
  • Embed ITIL processes (incident, change, problem, configuration management) into SRE workflows
  • Build and maintain automated deployment environments and pipelines that enforce security, compliance, and operational standards
  • Develop "golden paths" and standardized platform templates for consistent workload deployment
  • Automate provisioning, patching, configuration management, and environment lifecycle
  • Leverage AI/ML coding assistants and vibe coding practices to rapidly develop automation scripts, tools, and internal platforms
  • Integrate AI-driven tooling into DevOps pipelines for code quality, security scanning, and operational insights
  • Lead adoption of AI-enhanced SRE practices, including intelligent remediation and predictive operations
  • Champion DevOps and SRE practices including Infrastructure as Code, CI/CD, observability, and reliability engineering
  • Build developer-friendly platforms ("golden paths") that simplify deployments, reduce friction, and improve velocity
  • Enable and optimize infrastructure for AI/ML workloads, including data pipelines, storage systems, inference environments, and GPU-enabled and high-performance compute workloads
  • Build and manage containerized and orchestrated platforms (Docker, Kubernetes)
  • Support cloud migration, modernization, and platform standardization initiatives
  • Ensure systems meet security, compliance, backup, and disaster recovery requirements
  • Evangelize and promote best practices in DevOps, SRE, and platform engineering to developer communities

Requirements

  • Must have a total of 6+ years of experience in DevOps / SRE roles working with monitoring and observability tools (Prometheus, Grafana, ELK, or cloud-native equivalents) for on-prem and cloud-hosted workloads
  • Must have 4+ years of hands-on Linux experience, including Ubuntu/CentOS/Red Hat operating systems, containers, dependency management, and administration support
  • Must have 4+ years of experience automating Infrastructure-as-Code (IaC) deployments to at least one of the following cloud platforms: Amazon AWS, Google GCP, or Microsoft Azure
  • Must have 4+ years with CI/CD and automation tools such as Terraform, Ansible, Chef, Puppet, Jenkins, GitHub Actions
  • Strong scripting skills (Python, Bash, PowerShell or similar)
  • Must be proficient in using vibe coding and coding assistants to develop scripts, tools, and applications for DevOps and SRE use cases
  • Must be proficient in debugging, troubleshooting, and/or deploying SQL and/or NoSQL databases, object storage, and web servers; experience with open-source programming stacks for Node.js, R, Python, .NET Core, and Java is desired but not mandatory
  • Must be willing to learn new technologies and adapt to emerging technologies or needs from project to project
  • Cloud certifications are preferred
  • Certifications in Grafana, Splunk, Docker, and Kubernetes are preferred but optional

Compensation

  • Salary Range: $140,000 – $155,000 USD
Skills
Kubernetes, Docker, Terraform, Ansible, AWS, Azure, Google Cloud, Grafana, Splunk, Prometheus, Linux, Python, Bash, Jenkins, GitHub Actions