Senior Site Reliability Engineer, Platform Infrastructure

Senior SRE building and scaling control and data plane infrastructure for distributed AI/ML workloads on Ray. Requires 3+ years production experience, strong distributed systems background, Kubernetes, cloud platforms, Go/Python, and observability expertise.

San Francisco, CA | Palo Alto, CA | California | DevOps / SRE | Hybrid | 3+ YOE

About the role

Responsibilities

  • Design, build, and scale services that orchestrate Ray clusters across cloud and on-prem environments, supporting both VM-based and Kubernetes-based deployments
  • Optimize control plane components for large-scale, distributed AI/ML workloads
  • Build intelligent scheduling and resource management systems for heterogeneous compute clusters
  • Develop features to enhance the reliability, performance, scalability, and observability of Anyscale-managed Ray workloads
  • Support and optimize accelerator integration (e.g., GPUs, TPUs)
  • Handle container image management and dependency resolution for distributed workloads
  • Participate in code reviews and in design and architecture discussions
  • Provide on-call support, working closely with customer and field teams to troubleshoot infrastructure issues
  • Collaborate with leading distributed systems and machine learning experts to push the boundaries of AI infrastructure

Requirements

  • Bachelor's degree in Computer Science, Engineering, or equivalent practical experience
  • 3+ years of experience writing high-quality production code
  • Hands-on experience building and maintaining highly available, scalable, and performant distributed systems
  • Expertise in cloud-native technologies (AWS, Azure, GCP) and Kubernetes-based deployments
  • Deep understanding of networking, security, and authentication mechanisms in cloud environments
  • Familiarity with observability stacks (Prometheus, Grafana, etc.)
  • Proficiency in Go and Python
  • Knowledge of low-level operating system foundations (Linux kernel, file systems, containers)

Skills

Kubernetes, AWS, Azure, GCP, Go, Python, Prometheus, Grafana, Linux, Containers

Similar roles

DevOps / SRE jobs

Senior Site Reliability Engineer

Senior Site Reliability Engineer building and operating highly reliable, scalable Kubernetes-based cloud services in Okta's Emerging Products Group. Lead incident response, define SLOs, develop automation in Go/Python/Terraform, improve observability, and mentor on reliability best practices.

San Francisco, CA | DevOps / SRE | Hybrid | 5+ YOE | Go | AWS

Senior Software Engineer, Infrastructure

Senior engineer building and standardizing AWS/GCP cloud infrastructure, networking, and self-service tooling for Coinbase's multi-cloud platform.

186k – 219k | United States | DevOps / SRE | Remote | 5+ YOE | Go | AWS

Senior Software Engineer - Snowpark Container Service

Senior engineer to design, build, and lead development of Snowpark Container Services, a Kubernetes-based container compute platform. Requires 7+ years building large-scale distributed systems and strong coding skills in Java, C++, or Go.

200k – 288k | Bellevue, WA | DevOps / SRE | Hybrid | 7+ YOE | Go | C++

Senior DevOps Engineer

Senior DevOps Engineer building and operating Kubernetes-based ephemeral environments and cloud infrastructure on AWS to improve developer productivity and platform reliability.

153k – 231k | United States | DevOps / SRE | Remote | 4+ YOE | Go | AWS

Senior Site Reliability Engineer - Government Cloud

Build and operate AWS GovCloud infrastructure for federal customers, owning IaC, container pipelines, compliance documentation, and operational tooling. Requires 5+ years AWS experience and FedRAMP familiarity.

210k – 220k | United States | DevOps / SRE | Remote | 5+ YOE | AWS | CDK