Forward Deployed Site Reliability Engineer (TS/SCI Required)

On-site SRE ensuring the reliability of a mission-critical platform in an air-gapped AWS environment for a government customer. Defines SLOs, leads incident response, manages deployments with Docker and Terraform, and relays operational feedback to the engineering team. Requires 5+ years of SRE experience and a TS/SCI clearance.

Arlington, VA · DevOps / SRE · Onsite · 5+ YOE

About the role

What You'll Do

Reliability Engineering

Define, track, and report on SLIs and SLOs for platform services running in the customer environment.
Use error budgets to drive reliability conversations with the Arlington engineering team, translating operational data into prioritized engineering work.
Identify and eliminate toil: build automation for repetitive operational tasks within the constraints of the secure environment.
Conduct post-incident reviews, own root cause analysis, and drive durable fixes in partnership with the engineering team.

Observability & Incident Response

Own the observability posture for the on-site deployment — dashboards, alerting thresholds, and log pipelines using the LGTM stack (Grafana, Loki, Tempo, Mimir).
Lead incident response on-site: triage, containment, coordination with Arlington, and customer communication.
Maintain and continuously improve runbooks for operational procedures and emergency response protocols.
Serve as the on-call anchor for the customer environment, with clear escalation paths to the engineering team.

Deployment & Infrastructure Operations

Work with the customer deployment team to get Twenty's platform stood up and updated within the restricted environment.
Manage containerized services (Docker, Docker Compose) across the deployment lifecycle: configuration, updates, and rollbacks.
Apply and validate Terraform-based infrastructure changes within the enclave, in coordination with the DSO (DevSecOps) engineer who owns IaC policy and guardrails.
Perform capacity planning and flag scaling requirements to the Arlington team before they become incidents.

Customer Liaison & Engineering Feedback

Serve as the primary technical interface between the government customer and Twenty's engineering team — translating operational requirements, constraints, and issues in both directions.
Represent the operational environment accurately in engineering discussions: what the team in Arlington can't see, you make visible.
Partner with the DevSecOps engineer on compliance, logging, and audit requirements specific to the customer environment.
Provide technical guidance and support to customer stakeholders on system behavior and troubleshooting procedures.

Must Have

5+ years of professional experience in site reliability engineering, production operations, or a closely related infrastructure role.
Proven experience defining and tracking SLIs, SLOs, and error budgets in a production environment.
Hands-on experience with Docker, Docker Compose, and AWS (EC2, ECS, RDS, VPCs, security groups) in production deployments.
Solid Linux/Unix systems administration skills; productive in constrained environments where GUI tooling may be limited or unavailable.
Experience with Terraform for infrastructure provisioning and configuration, working within DSO-provided policy guardrails.
Experience with the LGTM observability stack or equivalent (Grafana, Loki, Prometheus/Mimir, distributed tracing).
Strong incident response experience: you've led responses, written post-mortems and runbooks, and shipped the preventive fix.
Scripting proficiency in Python or Bash for operational automation (familiarity with Go is a plus), and experience with PagerDuty or equivalent on-call tooling.
Experience working in or directly supporting government or defense environments, including air-gapped or enclave deployments.

Nice To Have

Experience with NATS or similar pub/sub messaging systems in production.
Background in cyber operations, intelligence systems, or signals environments.
AWS certifications (Solutions Architect, SysOps, or DevOps Engineer).

Skills

Docker, Docker Compose, AWS, Terraform, Linux, Grafana, Loki, Tempo, Mimir, Python, Bash, SLIs, SLOs, Incident Response, PagerDuty
