SWE - Backend Infrastructure Engineer

175k – 280k | San Francisco, CA | Bellevue, WA | New York, NY | DevOps / SRE | Onsite | 3+ YOE | May 5

Summary

Builds and scales core infrastructure including ML training/serving, Kubernetes clusters, and low-latency voice/audio pipelines. Requires 3+ years in infrastructure/ML systems, hands-on reliability engineering, and Kubernetes expertise.

About the role

Responsibilities

Design and build secure, maintainable, self-serve core infrastructure that engineering teams can rely on and operate independently
Architect and evolve a modern ML training infrastructure — scalable, reproducible, and built for rapid experimentation
Build and operate a modern model serving architecture with a focus on reliability, cost efficiency, and low latency
Own and scale the low-latency voice interface and audio processing pipeline — a technically demanding, performance-sensitive system at the core of Sesame's product
Build developer tooling, server infrastructure, and data infrastructure that is high leverage and low maintenance
Set technical direction within your domain, bring others along through clear communication and well-reasoned proposals, and raise the engineering bar across the team

Required Qualifications

A strong systems thinker who is equally comfortable setting direction and getting hands-on with implementation
Hands-on reliability engineering experience — you have well-formed convictions about observability, monitoring, deployment systems, and loosely coupled architectures, and you've put them into practice at scale
Proven track record of shipping services at scale, with all the operational complexity that comes with it
Kubernetes — significant production experience operating and scaling Kubernetes clusters
Experience designing and shipping flexible domain models and APIs — you think carefully about boundaries, contracts, and long-term maintainability
A default toward automation — you've consistently delivered efficiency gains through automation and have the track record to show it
Strong communication skills — you can set your own direction, write clearly about tradeoffs, and bring engineers and stakeholders along with you
3+ years of software engineering experience, with significant time in infrastructure, platform, or ML systems roles

Preferred Qualifications

Infrastructure as Code at scale — significant IaC experience, preferably Terraform; CloudFormation, Pulumi, or Kubernetes-based approaches also welcome
ML infrastructure — PyTorch experience, especially model optimization for serving; ML training or serving experience; building ML serving and/or training infrastructure (TorchServe, Seldon, KServe, Ray Serve); large-scale distributed training and serving systems
Data engineering — pipeline design, dataset management, or data platform experience
Database design — complex schema design, query optimization, and hard data modeling decisions across relational and non-relational stores
Real-time communication systems — low-latency audio, video, or streaming infrastructure

Benefits

401(k) max employer match: 3.5% of compensation
100% employer-paid health, vision, and dental benefits for you and your dependents
Unlimited PTO and sick time
Flexible spending account with employer matching up to $1,650/year (medical FSA)
Guardian Employee Assistance Program (EAP)
Opportunity to share in the company's success with competitive stock options

Skills

Kubernetes, Terraform, PyTorch, ML Infrastructure, Infrastructure as Code, Observability, Monitoring, APIs, Automation, TorchServe, Seldon, KServe, Ray Serve, Data Pipelines, Real-time Systems
