# Site Reliability Engineer (SRE)
**Company:** [Mithril](https://hotfix.jobs/companies/mithril)
**Location:** Palo Alto, CA; San Francisco, CA
**Salary:** $170K-$230K
**Experience:** 3+ years
**Skills:** Kubernetes, Terraform, Pulumi, Python, Go, AWS, GCP, Azure, Prometheus, Grafana, OpenTelemetry, Linux
**Posted:** 2026-04-22
> Builds automation, observability, and tooling for Mithril's multi-cloud GPU orchestration platform, ensuring reliability, SLOs, and capacity management. Requires 3+ years SRE experience, Kubernetes proficiency, cloud expertise, and Python/Go coding skills.
## Job Description
## Core Responsibilities

### Reliability & SLOs
- Implement and own SLIs and SLOs across Mithril's API layer and internal orchestration services to ensure customer commitments are met.
- Partner with Product and Platform teams to ensure new features are designed for operability, reliability, and performance from the start.
- Participate in an on-call rotation; drive root cause analysis (RCA) for production incidents and implement durable fixes to prevent recurrence.

### Observability & Monitoring
- Build and maintain dashboards, alerts, and distributed tracing within Mithril's monitoring stack to provide fine-grained visibility into our multi-cloud infrastructure.
- Own the tooling that surfaces supply-side signal — GPU availability, provider health, reservation fill rates — to both engineering and operations teams.

### Infrastructure as Code & Automation
- Develop and maintain **Terraform/Pulumi** modules and **Kubernetes** configurations to manage Mithril's growing multi-cloud provider footprint.
- Write clean, maintainable **Python** (or **Go**) to automate repetitive operational tasks — from provider API reconciliation to automated health checks and capacity rebalancing.

### Capacity Support
- Assist in managing GPU capacity across providers, ensuring the marketplace can dynamically respond to supply fluctuations and customer demand signals.
- Surface capacity constraints and failure modes early; contribute to the tooling that enables Mithril to make fast, data-driven supply decisions.

## Requirements
- **3+ years** of experience in SRE, Production Engineering, or Infrastructure roles at a high-growth technology company.
- Hands-on **Kubernetes** experience: comfortable managing clusters, deployments, and troubleshooting production incidents in a multi-tenant environment.
- Cloud proficiency in at least one major provider (**AWS**, **GCP**, or **Azure**), including practical understanding of cloud networking fundamentals (**VPC**, **DNS**, load balancing, security groups).
- Coding ability: proficiency in **Python** or equivalent (**Go**, **Rust**, etc.) — you build tools and services, not just scripts. Willing to pick up new languages as needed.
- **Linux** fundamentals: strong command of Linux systems, **TCP/IP** networking, and security best practices.
- Disciplined troubleshooter: calm under pressure during production incidents, with a rigorous approach to RCA and long-term remediation.
- Clear communicator: able to document processes and explain technical trade-offs to engineering and non-engineering teammates.

## Nice to Have
- Experience with **GPU/TPU**-accelerated workloads or **AI/ML** infrastructure.
- Exposure to multi-cloud deployments or niche/specialized cloud providers (e.g., **CoreWeave**, **Lambda Labs**, **Nebius**).
- Familiarity with distributed systems concepts — service discovery, circuit breakers, consensus protocols.
- Production experience with **Prometheus**, **Grafana**, or **OpenTelemetry**.
- Prior experience in a high-growth startup environment where infrastructure scope expands faster than headcount.

## Benefits
- Health, dental, and vision coverage for you and your dependents
- 401(k) plan with 4% company match
- 21 days of PTO and 14 company holidays, including 2 floating holidays

**Salary Range:** $170,000 - $230,000
**Apply:** https://hotfix.jobs/jobs/site-reliability-engineer-sre-at-mithril-93421fae-3680-44f8-bca3-628c0e4a2386
**Canonical:** https://hotfix.jobs/jobs/site-reliability-engineer-sre-at-mithril-93421fae-3680-44f8-bca3-628c0e4a2386