# SRE - Infra
**Company:** [PostHog](https://hotfix.jobs/companies/posthog)
**Location:** Remote
**Skills:** Kubernetes, AWS, EKS, Terraform, Terragrunt, Linux, Argo CD, Karpenter, Cilium, GitOps, IAM, GitHub Actions
**Posted:** 2026-04-09
> Owns and automates production infrastructure on multi-region AWS with EKS clusters, focusing on scaling, reliability, and self-healing systems. Requires deep Kubernetes, Terraform, and Linux expertise for large-scale stateful workloads.
## Job Description
## Responsibilities
- Operating EKS clusters across several environments with Karpenter autoscaling, Cilium networking, and Argo CD-driven GitOps deployments
- Managing and evolving a multi-account AWS organization: provisioning, networking, access control, and cross-account connectivity
- Maintaining the Terraform/Terragrunt IaC platform: modules, automated plan-on-PR / apply-on-merge pipelines, and safe patterns for shared infrastructure
- Improving operational tooling around deploys, schema changes, backups, restores, and incident response
- Reducing operational load by identifying repeat pain points and eliminating them through code and self-healing automation
- Optimizing cloud spend as you go
- Participating in on-call and incident response, with a strong focus on making incidents rarer over time

## Requirements
- Deep hands-on experience with **Kubernetes** in production (EKS preferred). You've debugged node pressure, networking issues, and deployment failures at scale (thousands of nodes)
- Strong experience operating production infrastructure on **AWS**: not just a single account, but an understanding of organizational boundaries, **IAM**, and networking across many accounts
- Experience automating infrastructure using **Terraform** or **Terragrunt** at scale, including module design and state management
- Solid understanding of **Linux** systems (disk, memory, networking, failure modes)
- Experience supporting stateful systems (databases, queues, storage systems, etc.)
- Ability to debug and reason about performance and reliability issues in production
- Comfortable owning systems end-to-end, including on-call responsibilities

## Nice to Have
- Experience with **GitOps** workflows (**Argo CD**) and CI/CD pipelines (**GitHub Actions**)
- Experience building AI-agent-enabled, base-level infrastructure services for fast-moving teams
- Familiarity with multi-region infrastructure and the consistency/availability tradeoffs that come with it

**Apply:** https://hotfix.jobs/jobs/sre-infra-at-posthog-fbd633be-70bd-4c15-bcb7-e7d51305e80e
**Canonical:** https://hotfix.jobs/jobs/sre-infra-at-posthog-fbd633be-70bd-4c15-bcb7-e7d51305e80e