# Forward Deployed Site Reliability Engineer (TS/SCI Required)
**Company:** [Twenty](https://hotfix.jobs/companies/twenty)
**Location:** Arlington, VA
**Experience:** 5+ years
**Skills:** Docker, Docker Compose, AWS, Terraform, Linux, Grafana, Loki, Tempo, Mimir, Python, Bash, SLIs, SLOs, Incident Response, PagerDuty
**Posted:** 2026-05-04
> On-site SRE ensuring the reliability of a mission-critical platform in an air-gapped AWS environment for a government customer. Defines SLOs, leads incident response, manages deployments with Docker and Terraform, and relays operational feedback to the engineering team. Requires 5+ years of SRE experience and a TS/SCI clearance.
## Job Description
## What You'll Do

### Reliability Engineering
- Define, track, and report on SLIs and SLOs for platform services running in the customer environment.
- Use error budgets to drive reliability conversations with the Arlington engineering team, translating operational data into prioritized engineering work.
- Identify and eliminate toil: build automation for repetitive operational tasks within the constraints of the secure environment.
- Conduct post-incident reviews, own root cause analysis, and drive durable fixes in partnership with the engineering team.
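
For candidates less familiar with the term, the error-budget idea in the bullets above can be sketched in a few lines of Python. This is an illustration only, not Twenty's tooling: the function name, the request-based SLI, and the 99.9% target are all assumptions made for the example.

```python
def error_budget_remaining(slo_target: float, total_requests: int, failed_requests: int) -> float:
    """Return the unspent fraction of the error budget (negative if overspent).

    Hypothetical helper: assumes a request-based availability SLI,
    where slo_target is the required success ratio (e.g. 0.999).
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    if total_requests <= 0:
        # No traffic in the window means no budget has been spent.
        return 1.0
    # The budget is the number of failures the SLO tolerates in this window.
    allowed_failures = (1.0 - slo_target) * total_requests
    return 1.0 - failed_requests / allowed_failures


# Assumed example: 99.9% SLO, 1,000,000 requests, 250 failures in the window.
remaining = error_budget_remaining(0.999, 1_000_000, 250)
print(f"Error budget remaining: {remaining:.1%}")
```

A result near zero or below it is the kind of signal that would shift a conversation from feature work toward reliability work.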

### Observability & Incident Response
- Own the observability posture for the on-site deployment — dashboards, alerting thresholds, and log pipelines using the **LGTM stack** (Grafana, Loki, Tempo, Mimir).
- Lead incident response on-site: triage, containment, coordination with Arlington, and customer communication.
- Maintain and continuously improve runbooks for operational procedures and emergency response protocols.
- Serve as the on-call anchor for the customer environment, with clear escalation paths to the engineering team.

### Deployment & Infrastructure Operations
- Work with the customer deployment team to get Twenty's platform stood up and updated within the restricted environment.
- Manage containerized services (**Docker**, **Docker Compose**) across the full deployment lifecycle: configuration, updates, and rollbacks.
- Apply and validate **Terraform**-based infrastructure changes within the enclave, in coordination with the DSO engineer who owns IaC policy and guardrails.
- Perform capacity planning and flag scaling requirements to the Arlington team before they become incidents.

### Customer Liaison & Engineering Feedback
- Serve as the primary technical interface between the government customer and Twenty's engineering team — translating operational requirements, constraints, and issues in both directions.
- Represent the operational environment accurately in engineering discussions: what the team in Arlington can't see, you make visible.
- Partner with the DevSecOps engineer on compliance, logging, and audit requirements specific to the customer environment.
- Provide technical guidance and support to customer stakeholders on system behavior and troubleshooting procedures.

## Must Have
- 5+ years of professional experience in site reliability engineering, production operations, or a closely related infrastructure role.
- Proven experience defining and tracking SLIs, SLOs, and error budgets in a production environment.
- Hands-on experience with **Docker**, **Docker Compose**, and **AWS** (EC2, ECS, RDS, VPCs, security groups) in production deployments.
- Solid **Linux/Unix** systems administration skills; productive in constrained environments where GUI tooling may be limited or unavailable.
- Experience with **Terraform** for infrastructure provisioning and configuration, working within DSO-provided policy guardrails.
- Experience with the **LGTM observability stack** or equivalent (**Grafana**, **Loki**, **Prometheus/Mimir**, distributed tracing).
- Strong incident response experience: you've led responses, written post-mortems and runbooks, and shipped the preventive fix.
- **Scripting** proficiency in **Python** or **Bash** for operational automation (familiarity with **Go** is a plus), and experience with **PagerDuty** or equivalent on-call tooling.
- Experience working in or directly supporting government or defense environments, including air-gapped or enclave deployments.

## Nice To Have
- Experience with **NATS** or similar pub/sub messaging systems in production.
- Background in cyber operations, intelligence systems, or signals environments.
- **AWS** certifications (Solutions Architect, SysOps, or DevOps Engineer).

**Apply:** https://hotfix.jobs/jobs/forward-deployed-site-reliability-engineer-ts-sci-required-at-twenty-8ee25ad7-3721-4ba2-af40-60e644711f8a
**Canonical:** https://hotfix.jobs/jobs/forward-deployed-site-reliability-engineer-ts-sci-required-at-twenty-8ee25ad7-3721-4ba2-af40-60e644711f8a