Senior Site Reliability Engineer (Arlington, VA)

180k – 220k | Arlington, VA | Chantilly, VA | DevOps / SRE | Onsite | 5+ YOE
Summary

The Senior SRE owns the reliability, scalability, and security of mission-critical deployments in DoD on-prem and AWS environments. The role leads incident response, builds observability with Prometheus/Grafana, and automates IaC with Terraform/Ansible. It requires 5+ years of experience and an active Top Secret clearance.

About the role

What You'll Do

  • Implementing a World-Class Observability Platform: Design, implement, and manage our monitoring, logging, and alerting stack (e.g., Prometheus, Loki, Alloy, and Grafana). Create actionable insights and automated alerting to identify and resolve issues before they impact users.
  • Defining and Upholding Reliability: Define, measure, and own alerting that feeds into our Service Level Indicators (SLIs) and Service Level Objectives (SLOs).
  • Leading Incident Response: Act as an incident responder and, when needed, incident commander during critical incidents. Lead blameless post-mortems / After Action Reviews (AARs) that identify root causes and drive automated, long-term solutions.
  • Automating for Scale and Security: Partner with platform engineers to design, build, and manage secure, resilient Kubernetes clusters and cloud/on-prem environments using Infrastructure-as-Code (Terraform, Ansible). Embed security and compliance controls (RMF, STIGs) into automation.
  • Eliminating Toil and Scaling the Team: Proactively identify and eliminate operational toil by building automation. Partner with other teams to share best practices for air-gapped environments.

What We Look For

  • Active Top Secret clearance.
  • 5+ years in Platform, DevOps, or Site Reliability Engineering with an infrastructure and operations focus.
  • Proven partner to DevOps/Platform and application teams.
  • Deep understanding of incident response processes, root cause analyses, and continuous improvement.

Technical Expertise

  • Infrastructure as Code: Terraform (or CloudFormation), Ansible.
  • Containers and orchestration: Kubernetes design, deployment, and operations.
  • CI/CD: Building and maintaining pipelines (GitLab CI/CD, Jenkins, GitHub Actions).
  • Scripting: Proficiency with at least one of Python, Go, or Bash.
  • Cloud: Familiarity with AWS or AWS GovCloud.
  • Observability: Grafana stack, ELK stack, or Datadog.
  • Networking fundamentals: core protocols and secure configurations.

Bonus Points (Nice to Have)

  • Experience in DoD environments and compliance frameworks (RMF, STIGs, ICD 503).
  • GitOps practices and toolchains.
  • Security-minded design for sensitive environments.
  • Experience designing and implementing meaningful SLIs/SLOs (including error budgets) for complex, distributed systems.
  • Familiarity with on-prem virtualization (VMware, Proxmox, Nutanix, Hyper-V, etc.).
  • Service mesh exposure (Istio, Linkerd).
  • Relevant certifications (e.g., AWS DevOps Engineer, CKA/CKAD).
  • Active Security+ or another DoD 8570.01-approved security credential.

Skills

Kubernetes, Terraform, Ansible, Prometheus, Grafana, AWS, Python, GitLab CI/CD, Incident Response, SLOs