SRE - Infra

Owns and automates production infrastructure on multi-region AWS with EKS clusters, focusing on scaling, reliability, and self-healing systems. Requires deep Kubernetes, Terraform, and Linux expertise for large-scale stateful workloads.

United States · DevOps / SRE · Remote

About the role

Responsibilities

  • Operating EKS clusters across several environments with Karpenter autoscaling, Cilium networking, and ArgoCD-driven GitOps deployments
  • Managing and evolving a multi-account AWS organization: provisioning, networking, access control, and cross-account connectivity
  • Maintaining the Terraform/Terragrunt IaC platform: modules, automated plan-on-PR / apply-on-merge pipelines, and safe patterns for shared infrastructure
  • Improving operational tooling around deploys, schema changes, backups, restores, and incident response
  • Reducing operational load by identifying repeat pain points and eliminating them through code and self-healing automation
  • Optimizing cloud spend as you go
  • Participating in on-call and incident response, with a strong focus on making incidents rarer over time

Requirements

  • Deep hands-on experience with Kubernetes in production (EKS preferred). You've debugged node pressure, networking issues, and deployment failures at scale (thousands of nodes)
  • Strong experience operating production infrastructure on AWS across many accounts, not just one, including organizational boundaries, IAM, and cross-account networking
  • Experience automating infrastructure using Terraform or Terragrunt at scale, including module design and state management
  • Solid understanding of Linux systems (disk, memory, networking, failure modes)
  • Experience supporting stateful systems (databases, queues, storage systems, etc.)
  • Ability to debug and reason about performance and reliability issues in production
  • Comfortable owning systems end-to-end, including on-call responsibilities

Nice to Have

  • Experience with GitOps workflows (ArgoCD) and CI/CD pipelines (GitHub Actions)
  • Experience building foundational, AI-agent-enabled infrastructure services for fast-moving teams
  • Familiarity with multi-region infrastructure and the consistency/availability tradeoffs that come with it

Skills

Kubernetes, AWS, EKS, Terraform, Terragrunt, Linux, Argo CD, Karpenter, Cilium, GitOps, IAM, GitHub Actions
