Staff Infrastructure Engineer

The Staff Infrastructure Engineer architects and owns scalable cloud infrastructure (AWS/GCP) powering AI-driven financial operations, optimizes GPU workloads, drives reliability through SLOs and monitoring, and improves developer velocity through CI/CD and platform tools. Requires 5+ years of experience with distributed systems.

200k – 300k | San Francisco, CA | DevOps / SRE | Onsite | 5+ YOE

About the role

What You'll Do

Lead architectural decisions and technical reviews for infrastructure-critical initiatives.
Design, build, and own the cloud infrastructure (AWS/GCP) that runs Salient - from compute and networking to storage and observability.
Develop scalable harnesses that enable coding agents to operate reliably without compromising system stability or code quality.
Partner closely with the modeling team to optimize the serving and performance of GPU-intensive workloads.
Drive reliability and performance across the stack by defining SLOs, building robust monitoring and alerting, and leading incident response and postmortems.
Own developer platform investments that materially improve engineering velocity, including CI/CD, deployment tooling, environments, and internal infrastructure abstractions.
Establish infrastructure best practices, patterns, and standards as a technical authority across the engineering org.
Identify and reduce technical debt across infrastructure systems, with a focus on long-term scalability and operational health.

What You'll Bring

5+ years of software engineering experience, with 2+ years at the senior or staff level in infrastructure/platform roles, working on large-scale distributed systems.
Deep expertise in cloud platforms (AWS or GCP) - compute, networking, storage, IAM, and cost optimization.
Expert in infrastructure-as-code, with a strong track record of building scalable automation systems.
Extensive experience owning and scaling Kubernetes and CI/CD systems in high-throughput, production environments.
Track record of building and operating high-availability, high-throughput distributed systems with mature observability practices.
Strong technical communication - able to document architecture clearly and influence engineering decisions across teams.

Nice to Have

Background in security.
Exposure to serving AI/ML workloads.
Combination of big tech and startup experience.

Skills

AWS, GCP, Kubernetes, CI/CD, Infrastructure as Code, Distributed Systems, Observability, SLOs, IAM, Cost Optimization
