
Reliability Engineer

Designs and maintains reliable infrastructure for petabyte-scale video processing across multi-cloud environments. Owns incident response, observability (Prometheus, OpenTelemetry), security, and CI/CD tooling. Requires 3+ years of experience with scalable systems.

150k – 300k | San Francisco, CA | DevOps / SRE | Onsite | 3+ YOE

About the role

What You’ll Do

  • Work with engineering to design and validate the infrastructure powering PB-scale workloads
  • Build and maintain Terraform-managed multi-cloud deployments
  • Improve cloud and data security (SSO, IAM, least privilege, auditability)
  • Own incident response and harden systems against failure
  • Develop CI/CD systems that minimize user error and maximize safety
  • Build monitoring + alerting platforms (Prometheus, OpenTelemetry, VictoriaMetrics)
  • Wrap internal reliability tooling with simple UIs for engineers

Requirements

  • 3+ years building internal infrastructure at scale
  • Experience on-call for Sev 0 / Sev 1 production incidents (L3 preferred)
  • Strong cloud experience (GCP, AWS, Oracle, Cloudflare, etc.)
  • Deep Infrastructure-as-Code experience (Terraform preferred)
  • Familiarity with Argo, Helm, Kustomize, or similar deployment tools
  • Experience operating observability systems (Prometheus, OTel, VictoriaMetrics)
  • Backend fundamentals in Python, Go, Rust, or C++
  • Strong networking + security intuition, including SSO implementation
  • High ownership mindset over critical systems

Bonus

  • Experience building lightweight internal tooling (APIs, dashboards, Svelte)
  • Familiarity with object storage systems (“buckets”)
  • Active GitHub or portfolio projects

Benefits

  • 401(k) and full health insurance
  • Breakfast, lunch, and dinner covered, plus your choice of snacks
  • Uber rides home covered

Skills

Terraform, GCP, AWS, Prometheus, OpenTelemetry, VictoriaMetrics, Argo, Helm, Kustomize, Python, Go, Rust, C++, Cloudflare, Svelte
