# Senior Site Reliability Engineer, Platform Infrastructure
**Company:** [Anyscale](https://hotfix.jobs/companies/anyscale)
**Location:** San Francisco, CA; Palo Alto, CA; California
**Experience:** 3+ years
**Skills:** Kubernetes, AWS, Azure, GCP, Go, Python, Prometheus, Grafana, Linux, Containers
**Posted:** 2026-06-17
> Senior SRE building and scaling control plane and data plane infrastructure for distributed AI/ML workloads on Ray. Requires 3+ years of production experience, a strong distributed systems background, and expertise in Kubernetes, cloud platforms, Go/Python, and observability.
## Job Description
## Responsibilities
- Design, build, and scale services that orchestrate Ray clusters across cloud and on-prem environments, supporting both VM-based and Kubernetes-based deployments
- Optimize control plane components for large-scale, distributed AI/ML workloads
- Build intelligent scheduling and resource management systems for heterogeneous compute clusters
- Develop features to enhance the reliability, performance, scalability, and observability of Anyscale-managed Ray workloads
- Support and optimize accelerator integration (e.g., GPUs, TPUs)
- Handle container image management and dependency resolution for distributed workloads
- Participate in code reviews and in design and architecture discussions
- Provide on-call support, working closely with customer and field teams to troubleshoot infrastructure issues
- Collaborate with leading distributed systems and machine learning experts to push the boundaries of AI infrastructure

## Requirements
- Bachelor's degree in Computer Science, Engineering, or equivalent practical experience
- 3+ years of experience writing high-quality production code
- Hands-on experience in building and maintaining highly available, scalable, and performant distributed systems
- Expertise in cloud-native technologies and platforms (AWS, Azure, GCP) and Kubernetes-based deployments
- Deep understanding of networking, security, and authentication mechanisms in cloud environments
- Familiarity with observability stacks (Prometheus, Grafana, etc.)
- Proficiency in Go and Python
- Knowledge of low-level operating system foundations (Linux kernel, file systems, containers)

**Apply:** https://hotfix.jobs/jobs/senior-site-reliability-engineer-platform-infrastructure-at-anyscale-4501ca74-99c6-494f-93a1-0184fd51bfa5
**Canonical:** https://hotfix.jobs/jobs/senior-site-reliability-engineer-platform-infrastructure-at-anyscale-4501ca74-99c6-494f-93a1-0184fd51bfa5