Senior Site Reliability Engineer (Arlington, VA)

180k – 220k | Arlington, VA | Chantilly, VA | DevOps / SRE | Onsite | 5+ YOE | Apr 8

Summary

The Senior SRE owns the reliability, scalability, and security of mission-critical deployments in DoD on-prem and AWS environments. The role leads incident response, builds observability with Prometheus/Grafana, and automates IaC with Terraform/Ansible. Requires 5+ years of experience and an active Top Secret clearance.

About the role

What You'll Do

Implementing a World-Class Observability Platform: Design, implement, and manage our monitoring, logging, and alerting stack (e.g., Prometheus, Loki, Alloy, and Grafana). Create actionable insights and automated alerting to identify and resolve issues before they impact users.
Defining and Upholding Reliability: Define, measure, and own alerting that feeds into our Service Level Indicators (SLIs) and Service Level Objectives (SLOs).
Leading Incident Response: Act as the incident responder and potentially incident commander during critical incidents. Lead blameless post-mortems / After Action Reviews (AARs) that identify root causes and drive automated, long-term solutions.
Automating for Scale and Security: Partner with platform engineers to design, build, and manage secure, resilient Kubernetes clusters and cloud/on-prem environments using Infrastructure-as-Code (Terraform, Ansible). Embed security and compliance controls (RMF, STIGs) into automation.
Eliminating Toil and Scaling the Team: Proactively identify and eliminate operational toil by building automation. Partner with other teams to share best practices for air-gapped environments.

What We Look For

Active Top Secret clearance.
5+ years in Platform, DevOps, or Site Reliability Engineering with an infrastructure and operations focus.
Proven partner to DevOps/Platform and application teams.
Deep understanding of incident response processes, root cause analyses, and continuous improvement.

Technical Expertise

Infrastructure as Code: Terraform (or CloudFormation), Ansible.
Containers and orchestration: Kubernetes design, deployment, and operations.
CI/CD: Building and maintaining pipelines (GitLab CI/CD, Jenkins, GitHub Actions).
Scripting: Proficiency with at least one of Python, Go, or Bash.
Cloud: Familiarity with AWS or AWS GovCloud.
Observability: Grafana stack, ELK stack, or Datadog.
Networking fundamentals: core protocols and secure configurations.

Bonus Points (Nice to Have)

Experience in DoD environments and compliance frameworks (RMF, STIGs, ICD 503).
GitOps practices and toolchains.
Security-minded design for sensitive environments.
Experience designing and implementing meaningful SLIs/SLOs (including error budgets) for complex, distributed systems.
Familiarity with on-prem virtualization (VMware, Proxmox, Nutanix, Hyper-V, etc.).
Service mesh exposure (Istio, Linkerd).
Relevant certifications (e.g., AWS DevOps Engineer, CKA/CKAD).
Active Security+ or another DoD 8570.01-approved security credential.

Skills

Kubernetes, Terraform, Ansible, Prometheus, Grafana, AWS, Python, GitLab CI/CD, Incident Response, SLOs
