Member of Technical Staff, Infrastructure

Senior infrastructure engineer owning core platform reliability, evolving Kubernetes and AWS systems with Terraform, defining SLOs, and reducing operational toil through automation and paved paths. Requires 10+ years of experience and strong software engineering skills.

250k – 280k · San Francisco, CA · DevOps / SRE · Onsite · 10+ YOE

About the role

Responsibilities

Own and evolve Envoy's core infrastructure platform to be opinionated and well-bounded, ensuring shared systems are not the dominant source of customer-facing downtime.
Define and own infrastructure SLOs, error budgets, and operational signals for core Kubernetes and cloud services, using them to guide prioritization, incident response, and platform investment.
Design and lead the architecture and standards for AWS infrastructure, with Terraform as the source of truth and a strong focus on consistency, safety, and auditability.
Own the Kubernetes platform and its surrounding ecosystem, including cluster architecture, scaling strategies, observability, and operational best practices.
Build and maintain paved paths for service deployment, CI/CD, networking, observability, and access control, reducing ad-hoc infrastructure decisions and the need for direct production access.
Partner closely with application, security, and developer experience teams to design infrastructure that meets product needs while enforcing clear guardrails and ownership boundaries.
Drive meaningful reductions in operational toil by improving defaults, automation, and platform ergonomics, including pragmatic adoption of AI-assisted workflows where they demonstrably improve signal, safety, or speed.
Participate in the on-call rotation and ensure incidents result in durable fixes and systemic improvements, not repeated manual intervention.

Requirements

10+ years of experience building and operating cloud infrastructure at scale, with deep expertise in AWS.
Strong hands-on experience with Terraform or similar infrastructure-as-code tools, and a clear point of view on declarative infrastructure and guardrails.
Deep experience designing, operating, and evolving Kubernetes platforms in production environments.
Proven ability to design systems that balance reliability, security, developer experience, and cost.
Experience influencing CI/CD systems and deployment workflows at an organizational or platform level.
Strong software engineering skills in at least one general-purpose language such as Go or Python, and comfort with shell tooling.
A solid understanding of distributed systems, networking, and production operations.
Excellent communication skills and the ability to lead through influence rather than authority.

Nice to Have

Experience in high-growth or startup environments where platforms evolve rapidly.
Familiarity with SOC 2, ISO 27001, or similar compliance frameworks.
Experience partnering closely with security teams on secure-by-default infrastructure and access models.
Exposure to service mesh, advanced networking patterns, or multi-region architectures.
Experience applying AI-assisted tools or workflows to improve infrastructure safety, speed, or operational signal.

Skills

AWS, Kubernetes, Terraform, CI/CD, Go, Python, Distributed Systems, Observability, SLOs, Service Mesh
