
Member of Technical Staff - Reliability Engineering

Define and implement reliability systems for a growing AI cloud infrastructure platform, including architectural improvements, operational processes, monitoring, and incident response. Requires 5+ years of production coding experience, 2+ years of on-call experience, and strong cloud skills.

150k – 350k · New York, NY · San Francisco, CA · DevOps / SRE · Onsite · 5+ YOE

About the role

Requirements

  • 5+ years of experience writing high-quality production code.
  • 2+ years of on-call experience for critical production services.
  • Strong cloud skills, and deep familiarity with at least one hyperscaler cloud (AWS preferred).
  • Familiarity with auto scaling, fleet management, and capacity planning at scale.
  • Experience owning and scaling Kubernetes clusters to thousands of nodes a plus.
  • Experience with systems safety research (e.g. STAMP) and control theory a plus.
  • Ability to work in-person in our NYC, SF or Stockholm offices.

Skills

AWS · Kubernetes · Auto Scaling · Fleet Management · Capacity Planning · Monitoring Systems

Similar roles


Staff Site Reliability Engineer

Founding Staff SRE for Kong's internal developer platform (Volcano). Define reliability posture, build multi-region Kubernetes infrastructure, establish GitOps/CI-CD, and scale managed data services.

150k – 210k · United States · DevOps / SRE · Remote · 7+ YOE · SRE · Helm

Staff Engineer, DevOps (4797)

Owns and maintains C++ build systems for autonomous aircraft software, improves developer velocity by optimizing CI/CD pipelines, integrates testing with simulations, and implements monitoring to resolve issues quickly. Requires 7+ years of experience and deep expertise in build tools and DevOps practices.

150k – 220k · San Diego, CA +2 · DevOps / SRE · On-site · 7+ YOE · C++ · Cpm

Staff Platform Engineer

Designs, implements, and maintains scalable infrastructure using Kubernetes and Terraform. Architects GitOps pipelines, drives security initiatives, and mentors teams to enhance developer velocity and platform reliability. Requires 7+ years of experience and a bachelor's degree.

150k – 170k · Denver, CO · DevOps / SRE · Hybrid · 7+ YOE · C# · Iac

Staff Engineer, Software Integration (R4483)

Integrates the autonomy software stack for AI robotics platforms, including multi-agent systems, sensor processing, and hardware deployment across simulation, HIL, and flight environments. Requires 7+ years of experience, Python/C++ proficiency, CI/CD expertise, and strong systems integration skills.

150k – 220k · San Diego, CA +1 · DevOps / SRE · On-site · 7+ YOE · C++ · ROS

Infrastructure Engineer (Senior/Staff Level)

Senior/Staff Infrastructure Engineer accelerates developer workflows, manages AWS infrastructure and CI/CD pipelines, and ensures HIPAA/SOC2 compliance through automation and security best practices in a healthcare AI platform.

150k – 225k · New York, NY · DevOps / SRE · Remote · 5+ YOE · Nx · AWS