
SWE - Backend Infrastructure Engineer

175k – 280k | San Francisco, CA | Bellevue, WA | New York, NY | DevOps / SRE | Onsite | 3+ YOE

Summary

Builds and scales core infrastructure including ML training/serving, Kubernetes clusters, and low-latency voice/audio pipelines. Requires 3+ years in infrastructure/ML systems, hands-on reliability engineering, and Kubernetes expertise.

About the role

Responsibilities

  • Design and build secure, maintainable, self-serve core infrastructure that engineering teams can rely on and operate independently
  • Architect and evolve a modern ML training infrastructure — scalable, reproducible, and built for rapid experimentation
  • Build and operate a modern model serving architecture with a focus on reliability, cost efficiency, and low latency
  • Own and scale the low-latency voice interface and audio processing pipeline — a technically demanding, performance-sensitive system at the core of Sesame's product
  • Build developer tooling, server infrastructure, and data infrastructure that is high leverage and low maintenance
  • Set technical direction within your domain, bring others along through clear communication and well-reasoned proposals, and raise the engineering bar across the team

Required Qualifications

  • A strong systems thinker who is equally comfortable setting direction and getting hands-on with implementation
  • Hands-on reliability engineering experience — you have well-formed convictions about observability, monitoring, deployment systems, and loosely coupled architectures, and you've put them into practice at scale
  • Proven track record of shipping services at scale, with all the operational complexity that comes with it
  • Kubernetes — significant production experience operating and scaling Kubernetes clusters
  • Experience designing and shipping flexible domain models and APIs — you think carefully about boundaries, contracts, and long-term maintainability
  • A default toward automation — you've consistently delivered efficiency gains through automation and have the track record to show it
  • Strong communication skills — you can set your own direction, write clearly about tradeoffs, and bring engineers and stakeholders along with you
  • 3+ years of software engineering experience, with significant time in infrastructure, platform, or ML systems roles

Preferred Qualifications

  • Infrastructure as Code at scale — significant IaC experience, preferably Terraform; CloudFormation, Pulumi, or Kubernetes-based approaches also welcome
  • ML infrastructure — PyTorch experience, especially model optimization for serving; ML training or serving experience; building ML serving and/or training infrastructure (TorchServe, Seldon, KServe, Ray Serve); large-scale distributed training and serving systems
  • Data engineering — pipeline design, dataset management, or data platform experience
  • Database design — complex schema design, query optimization, and hard data modeling decisions across relational and non-relational stores
  • Real-time communication systems — low-latency audio, video, or streaming infrastructure

Benefits

  • 401(k) max employer match: 3.5% of compensation
  • 100% employer-paid health, vision, and dental benefits for you and your dependents
  • Unlimited PTO and sick time
  • Flexible spending account with employer matching up to $1,650/year (medical FSA)
  • Guardian Employee Assistance Program (EAP)
  • Opportunity to share in the company's success with competitive stock options

Skills

Kubernetes, Terraform, PyTorch, ML Infrastructure, Infrastructure as Code, Observability, Monitoring, APIs, Automation, TorchServe, Seldon, KServe, Ray Serve, Data Pipelines, Real-time Systems