Senior Site Reliability Engineer, Platform Infrastructure

Senior SRE building and scaling control and data plane infrastructure for distributed AI/ML workloads on Ray. Requires 3+ years production experience, strong distributed systems background, Kubernetes, cloud platforms, Go/Python, and observability expertise.

San Francisco, CA | Palo Alto, CA | California | DevOps / SRE | Hybrid | 3+ YOE

About the role

Responsibilities

Design, build, and scale services that orchestrate Ray clusters across cloud and on-prem environments, supporting both VM-based and Kubernetes-based deployments
Optimize control plane components for large-scale, distributed AI/ML workloads
Build intelligent scheduling and resource management systems for heterogeneous compute clusters
Develop features to enhance the reliability, performance, scalability, and observability of Anyscale-managed Ray workloads
Support and optimize accelerator integration (e.g., GPUs, TPUs)
Handle container image management and dependency resolution for distributed workloads
Participate in code reviews and in design and architecture discussions
Provide on-call support, working closely with customer and field teams to troubleshoot infrastructure issues
Collaborate with leading distributed systems and machine learning experts to push the boundaries of AI infrastructure

Requirements

Bachelor's degree in Computer Science, Engineering, or equivalent practical experience
3+ years of experience writing high-quality production code
Hands-on experience building and maintaining highly available, scalable, and performant distributed systems
Expertise in cloud-native technologies (AWS, Azure, GCP) and Kubernetes-based deployments
Deep understanding of networking, security, and authentication mechanisms in cloud environments
Familiarity with observability stacks (Prometheus, Grafana, etc.)
Proficiency in Go and Python
Knowledge of low-level operating system foundations (Linux kernel, file systems, containers)

Skills

Kubernetes, AWS, Azure, GCP, Go, Python, Prometheus, Grafana, Linux, Containers
