Senior Site Reliability Engineer (Arlington, VA)
180k – 220k | Arlington, VA | Chantilly, VA | DevOps / SRE | Onsite | 5+ YOE
Summary
Senior SRE owns reliability, scalability, and security of mission-critical deployments in DoD on-prem and AWS environments. Leads incident response, builds observability with Prometheus/Grafana, and automates IaC with Terraform/Ansible. Requires 5+ years of experience and an active Top Secret clearance.
About the role
What You'll Do
- Implementing a World-Class Observability Platform: Design, implement, and manage our monitoring, logging, and alerting stack (e.g., Prometheus, Loki, Alloy, and Grafana). Create actionable insights and automated alerting to identify and resolve issues before they impact users.
- Defining and Upholding Reliability: Define, measure, and own alerting that feeds into our Service Level Indicators (SLIs) and Service Level Objectives (SLOs).
- Leading Incident Response: Act as the incident responder and potentially incident commander during critical incidents. Lead blameless post-mortems / After Action Reviews (AARs) that identify root causes and drive automated, long-term solutions.
- Automating for Scale and Security: Partner with platform engineers to design, build, and manage secure, resilient Kubernetes clusters and cloud/on-prem environments using Infrastructure-as-Code (Terraform, Ansible). Embed security and compliance controls (RMF, STIGs) into automation.
- Eliminating Toil and Scaling the Team: Proactively identify and eliminate operational toil by building automation. Partner with other teams to share best practices for air-gapped environments.
What We Look For
- Active Top Secret clearance.
- 5+ years in Platform, DevOps, or Site Reliability Engineering with an infrastructure and operations focus.
- Proven partner to DevOps/Platform and application teams.
- Deep understanding of incident response processes, root cause analyses, and continuous improvement.
Technical Expertise
- Infrastructure as Code: Terraform (or CloudFormation), Ansible.
- Containers and orchestration: Kubernetes design, deployment, and operations.
- CI/CD: Building and maintaining pipelines (GitLab CI/CD, Jenkins, GitHub Actions).
- Scripting: Proficiency with at least one of Python, Go, or Bash.
- Cloud: Familiarity with AWS or AWS GovCloud.
- Observability: Grafana stack, ELK stack, or Datadog.
- Networking fundamentals: core protocols and secure configurations.
Bonus Points (Nice to Have)
- Experience in DoD environments and compliance frameworks (RMF, STIGs, ICD 503).
- GitOps practices and toolchains.
- Security-minded design for sensitive environments.
- Experience designing and implementing meaningful SLIs/SLOs (including error budgets) for complex, distributed systems.
- Familiarity with on-prem virtualization (VMware, Proxmox, Nutanix, Hyper-V, etc.).
- Service mesh exposure (Istio, Linkerd).
- Relevant certifications (e.g., AWS DevOps Engineer, CKA/CKAD).
- Active Security+ or another DoD 8570.01-approved security credential.
Skills
Kubernetes, Terraform, Ansible, Prometheus, Grafana, AWS, Python, GitLab CI/CD, Incident Response, SLOs