SRE - Infra

Owns and automates production infrastructure on multi-region AWS with EKS clusters, focusing on scaling, reliability, and self-healing systems. Requires deep Kubernetes, Terraform, and Linux expertise for large-scale stateful workloads.

United States | DevOps / SRE | Remote

About the role

Responsibilities

Operating EKS clusters across several environments with Karpenter autoscaling, Cilium networking, and ArgoCD-driven GitOps deployments
Managing and evolving a multi-account AWS organization: provisioning, networking, access control, and cross-account connectivity
Maintaining the Terraform/Terragrunt IaC platform: modules, automated plan-on-PR / apply-on-merge pipelines, and safe patterns for shared infrastructure
Improving operational tooling around deploys, schema changes, backups, restores, and incident response
Reducing operational load by identifying repeat pain points and eliminating them through code and self-healing automation
Optimizing cloud spend as you go
Participating in on-call and incident response, with a strong focus on making incidents rarer over time

Requirements

Deep hands-on experience with Kubernetes in production (EKS preferred). You've debugged node pressure, networking issues, and deployment failures at scale (thousands of nodes)
Strong experience operating production infrastructure on AWS across many accounts, not just one, including organizational boundaries, IAM, and cross-account networking
Experience automating infrastructure using Terraform or Terragrunt at scale, including module design and state management
Solid understanding of Linux systems (disk, memory, networking, failure modes)
Experience supporting stateful systems (databases, queues, storage systems, etc.)
Ability to debug and reason about performance and reliability issues in production
Comfortable owning systems end-to-end, including on-call responsibilities

Nice to Have

Experience with GitOps workflows (ArgoCD) and CI/CD pipelines (GitHub Actions)
Experience building foundational, AI-agent-enabled infrastructure services for fast-moving teams
Familiarity with multi-region infrastructure and the consistency/availability tradeoffs that come with it

Skills

Kubernetes, AWS, EKS, Terraform, Terragrunt, Linux, Argo CD, Karpenter, Cilium, GitOps, IAM, GitHub Actions
