Reliability Engineer

Designs and maintains reliable infrastructure for petabyte-scale video processing across multi-cloud environments. Owns incident response, observability (Prometheus, OpenTelemetry), security, and CI/CD tooling. Requires 3+ years of experience with scalable systems.

150k – 300k | San Francisco, CA | DevOps / SRE | Onsite | 3+ YOE

About the role

What You’ll Do

Work with engineering to design and validate the infrastructure powering PB-scale workloads
Build and maintain Terraform-managed multi-cloud deployments
Improve cloud and data security (SSO, IAM, least privilege, auditability)
Own incident response and harden systems against failure
Develop CI/CD systems that minimize user error and maximize safety
Build monitoring + alerting platforms (Prometheus, OpenTelemetry, VictoriaMetrics)
Wrap internal reliability tooling with simple UIs for engineers

Requirements

3+ years building internal infrastructure at scale
Experience on-call for Sev 0 / Sev 1 production incidents (L3 preferred)
Strong cloud experience (GCP, AWS, Oracle, Cloudflare, etc.)
Deep Infrastructure-as-Code experience (Terraform preferred)
Familiarity with Argo, Helm, Kustomize, or similar deployment tools
Experience operating observability systems (Prometheus, OTel, VictoriaMetrics)
Backend fundamentals in Python, Go, Rust, or C++
Strong networking + security intuition, including SSO implementation
High ownership mindset over critical systems

Bonus

Experience building lightweight internal tooling (APIs, dashboards, Svelte)
Familiarity with object storage systems (“buckets”)
Active GitHub or portfolio projects

Benefits

401k + Full Health Insurance
Breakfast, lunch, and dinner covered, plus your choice of snacks
Uber rides home covered

Skills

Terraform, GCP, AWS, Prometheus, OpenTelemetry, VictoriaMetrics, Argo, Helm, Kustomize, Python, Go, Rust, C++, Cloudflare, Svelte
