Forward Deployed SRE

135k – 285k · San Francisco, CA · New York, NY · DevOps / SRE · Hybrid · May 11

Summary

The Site Reliability Engineer owns the reliability of multi-cloud Kubernetes infrastructure for an AI/ML platform: building observability tooling as code, automating mitigations, leading incident response, and defining SLOs and SLIs. The role requires extensive Kubernetes and observability experience.

About the role

Responsibilities

Own the reliability of Baseten's multi-cloud Kubernetes infrastructure, including incident response, post-mortems, and remediation tracking.
Build and maintain observability infrastructure — metrics, logging, dashboards, and alerting — as code.
Author, validate, and improve runbooks for recurring failure patterns, ensuring they're structured for low-context, safe execution.
Identify high-frequency failure patterns and convert them into automated mitigations or self-healing automations.
Diagnose and resolve runtime issues related to latency, memory behavior, GPU utilization, concurrency, and model lifecycle management.
Define and instrument SLOs and SLIs across customer workloads and internal services.
Navigate ambiguity, make principled tradeoffs, and avoid unnecessary complexity in the systems you build and the processes you define.

Requirements

Extensive hands-on experience with Kubernetes (multi-cloud experience across EKS, GKE, or similar is a strong plus).
Experience in building and maintaining scalable infrastructure.
Strong foundation in observability tooling: metrics (VictoriaMetrics, Prometheus), logging (Loki, ELK), dashboards (Grafana), and alerting pipelines. Observability-as-code experience is a plus.
Experience with infrastructure-as-code (Terraform, Helm) and GitOps workflows (Flux CD, ArgoCD).
Experience writing and improving runbooks, leading incident response, and doing post-mortem analysis.
Comfort working at the intersection of engineering and operations — you write code, but you also think deeply about process, escalation paths, and operational leverage.
Familiarity with incident management platforms (incident.io or similar) is a plus.
No prior ML experience required, but curiosity about how ML models are deployed and served at scale will serve you well.

Benefits

Competitive compensation, including meaningful equity.
100% coverage of medical, dental, and vision insurance for employees and dependents.
Flexible PTO policy, including a company-wide winter break.
Paid parental leave.
Fertility and family-building stipend through Carrot.
Company-facilitated 401(k).
Exposure to a variety of ML startups, offering unparalleled learning and networking opportunities.

Skills

Kubernetes, Prometheus, Grafana, Terraform, Helm, EKS, GKE, VictoriaMetrics, Loki, ArgoCD
