Site Reliability Engineer (SRE)

Builds automation, observability, and tooling for Mithril's multi-cloud GPU orchestration platform to support reliability, SLOs, and capacity management. Requires 3+ years of SRE experience, Kubernetes proficiency, cloud expertise, and Python/Go coding skills.

170k – 230k | Palo Alto, CA | San Francisco, CA | DevOps / SRE | Hybrid | 3+ YOE

About the role

Core Responsibilities

Reliability & SLOs

Implement and own SLIs and SLOs across Mithril's API layer and internal orchestration services to ensure customer commitments are met.
Partner with Product and Platform teams to ensure new features are designed for operability, reliability, and performance from the start.
Participate in an on-call rotation; drive root cause analysis (RCA) for production incidents and implement durable fixes to prevent recurrence.

Observability & Monitoring

Build and maintain dashboards, alerts, and distributed tracing within Mithril's monitoring stack to provide high-granularity visibility into our multi-cloud infrastructure.
Own the tooling that surfaces supply-side signal — GPU availability, provider health, reservation fill rates — to both engineering and operations teams.

Infrastructure as Code & Automation

Develop and maintain Terraform/Pulumi modules and Kubernetes configurations to manage Mithril's growing multi-cloud provider footprint.
Write clean, maintainable Python (or Go) to automate repetitive operational tasks — from provider API reconciliation to automated health checks and capacity rebalancing.

Capacity Support

Assist in managing GPU capacity across providers, ensuring the marketplace can dynamically respond to supply fluctuations and customer demand signals.
Surface capacity constraints and failure modes early; contribute to the tooling that enables Mithril to make fast, data-driven supply decisions.

Requirements

3+ years of experience in SRE, Production Engineering, or Infrastructure roles at a high-growth technology company.
Hands-on Kubernetes experience: comfortable managing clusters and deployments and troubleshooting production incidents in a multi-tenant environment.
Cloud proficiency in at least one major provider (AWS, GCP, or Azure), including practical understanding of cloud networking fundamentals (VPC, DNS, load balancing, security groups).
Coding ability: proficiency in Python or equivalent (Go, Rust, etc.) — you build tools and services, not just scripts. Willing to pick up new languages as needed.
Linux fundamentals: strong command of Linux systems, TCP/IP networking, and security best practices.
Disciplined troubleshooter: calm under pressure during production incidents, with a rigorous approach to RCA and long-term remediation.
Clear communicator: able to document processes and explain technical trade-offs to engineering and non-engineering teammates.

Nice to Have

Experience with GPU/TPU-accelerated workloads or AI/ML infrastructure.
Exposure to multi-cloud deployments or niche/specialized cloud providers (e.g., CoreWeave, Lambda Labs, Nebius).
Familiarity with distributed systems concepts — service discovery, circuit breakers, consensus protocols.
Production experience with Prometheus, Grafana, or OpenTelemetry.
Prior experience in a high-growth startup environment where infrastructure scope expands faster than headcount.

Benefits

Health, dental, and vision coverage for you and your dependents
401(k) plan with 4% company match
21 days of PTO and 14 company holidays, including 2 floating holidays

Salary Range: $170,000 - $230,000

Skills

Kubernetes, Terraform, Pulumi, Python, Go, AWS, GCP, Azure, Prometheus, Grafana, OpenTelemetry, Linux
