Staff+ Software Engineer, Developer Productivity

405k – 625k | San Francisco, CA | New York, NY | Seattle, WA | Hybrid | May 5

Summary

Leads technical strategy and builds scalable developer infrastructure, including build systems, CI/CD pipelines, and tooling for large monorepo environments. Requires 3+ years leading large-scale, complex projects, proficiency in Python, Rust, and/or Go, and experience with container orchestration.

About the role

Responsibilities

Own the technical strategy and roadmap for your area, translating team-level goals into concrete execution plans
Define infrastructure architecture, ensuring the hardest problems get solved — whether by you directly or by working through others
Design and build scalable, reliable distributed infrastructure and shared libraries that support high-volume workloads across all engineering teams
Own and evolve build environments, package management, and dependency systems to enable fast, reproducible builds
Define and implement language ecosystem standards, tooling, and frameworks that drive developer productivity across research and production workloads

Requirements

3+ years (not including internships or co-ops) of experience leading large-scale, complex projects or teams as an engineer or tech lead
Deep experience with build systems, CI/CD pipelines, and/or developer tooling in a large monorepo environment
Strong proficiency in Python, Rust, and/or Go
Obsessed with developer productivity and reducing friction in the software development lifecycle
Experience with container orchestration and infrastructure at scale
Excellent communicator who enjoys helping internal partners improve their development experience
Excited about designing foundational systems and comfortable working independently on ambiguous, high-impact technical challenges

Nice-to-Haves

15+ years (not including internships or co-ops) of experience in a Software Engineer role, building and operating large-scale developer infrastructure
Experience with CI orchestration tools (Buildkite, Jenkins, GitHub Actions, or similar) and merge queue management at scale
Experience building or operating remote build execution systems (Bazel Remote Execution API, BuildBarn, BuildBuddy, or similar)
Experience with Nix/NixOS/Docker and managing large image / package sets at scale
Experience building CLI tools, developer-facing services, and GitHub API integrations and automation workflows

Skills

Python, Rust, Go, CI/CD, Build Systems, Bazel, Buildkite, Jenkins, GitHub Actions, Nix, Docker, Kubernetes, Monorepo, Remote Build Execution
