Staff+ Software Engineer, Developer Productivity

405k – 625k | San Francisco, CA | New York, NY | Seattle, WA | Hybrid
Summary

Leads technical strategy and builds scalable developer infrastructure including build systems, CI/CD pipelines, and tooling for large monorepo environments. Requires 3+ years leading complex projects, proficiency in Python/Rust/Go, and experience with container orchestration.

About the role

Responsibilities

  • Own the technical strategy and roadmap for your area, translating team-level goals into concrete execution plans
  • Define infrastructure architecture and ensure the hardest problems get solved, whether by you directly or by working through others
  • Design and build scalable, reliable distributed infrastructure and shared libraries that support high-volume workloads across all engineering teams
  • Own and evolve build environments, package management, and dependency systems to enable fast, reproducible builds
  • Define and implement language ecosystem standards, tooling, and frameworks that drive developer productivity across research and production workloads

Requirements

  • 3+ years (not including internships or co-ops) of experience leading large-scale, complex projects or teams as an engineer or tech lead
  • Deep experience with build systems, CI/CD pipelines, and/or developer tooling in a large monorepo environment
  • Strong proficiency in Python, Rust, and/or Go
  • Obsessed with developer productivity and reducing friction in the software development lifecycle
  • Experience with container orchestration and infrastructure at scale
  • Excellent communicator who enjoys supporting internal partners to improve their development experience
  • Excited about designing foundational systems and comfortable working independently on ambiguous, high-impact technical challenges

Nice-to-Haves

  • 15+ years (not including internships or co-ops) of experience in a Software Engineer role, building and operating large-scale developer infrastructure
  • Experience with CI orchestration tools (Buildkite, Jenkins, GitHub Actions, or similar) and merge queue management at scale
  • Experience building or operating remote build execution systems (Bazel Remote Execution API, BuildBarn, BuildBuddy, or similar)
  • Experience with Nix/NixOS/Docker and managing large image and package sets at scale
  • Experience building CLI tools, developer-facing services, and GitHub API integrations and automation workflows
Skills
Python, Rust, Go, CI/CD, Build Systems, Bazel, Buildkite, Jenkins, GitHub Actions, Nix, Docker, Kubernetes, Monorepo, Remote Build Execution