Staff+ Software Engineer, Developer Productivity
Leads technical strategy and builds scalable developer infrastructure including build systems, CI/CD pipelines, and tooling for large monorepo environments. Requires 3+ years leading complex projects, proficiency in Python/Rust/Go, and experience with container orchestration.
Responsibilities
- Own the technical strategy and roadmap for your area, translating team-level goals into concrete execution plans
- Define infrastructure architecture, ensuring the hardest problems get solved — whether by you directly or by working through others
- Design and build scalable, reliable distributed infrastructure and shared libraries that support high-volume workloads across all engineering teams
- Own and evolve build environments, package management, and dependency systems to enable fast, reproducible builds
- Define and implement language ecosystem standards, tooling, and frameworks that drive developer productivity across research and production workloads
Requirements
- 3+ years (not including internships or co-ops) of experience leading large-scale, complex projects or teams as an engineer or tech lead
- Deep experience with build systems, CI/CD pipelines, and/or developer tooling in a large monorepo environment
- Strong proficiency in Python, Rust, and/or Go
- Obsessed with developer productivity and reducing friction in the software development lifecycle
- Experience with container orchestration and infrastructure at scale
- Excellent communication skills and enjoyment of supporting internal partners to improve their development experience
- Excited about designing foundational systems and comfortable working independently on ambiguous, high-impact technical challenges
Nice-to-Haves
- 15+ years (not including internships or co-ops) of experience in a Software Engineer role, building and operating large-scale developer infrastructure
- Experience with CI orchestration tools (Buildkite, Jenkins, GitHub Actions, or similar) and merge queue management at scale
- Experience building or operating remote build execution systems (Bazel Remote Execution API, BuildBarn, BuildBuddy, or similar)
- Experience with Nix/NixOS/Docker and managing large image / package sets at scale
- Experience building CLI tools, developer-facing services, and GitHub API integrations and automation workflows
Performance Engineer, Inference Systems
Performance engineer focused on cross-layer investigations of Anthropic's inference fleet for Claude, optimizing throughput, latency, reliability, and correctness while building observability and partnering with kernel and serving teams.
Tech Lead, Deployment & Operations — Custom Infrastructure
Lead deployment and operations for OpenAI’s custom silicon and systems into data center environments. Drive hardware bring-up, validation, production deployment, and fleet reliability at scale while leading a technical team.
Software Engineer, Developer Productivity, AI Tools
Builds and maintains AI-powered developer productivity tools, including coding agents, secure sandboxes, and standardized environments to accelerate internal software development workflows while ensuring security and quality.
Site Reliability Engineer (SRE)
Site Reliability Engineer who drives end-to-end reliability for the AI fine-tuning platform Tinker, including SLOs, monitoring, incident response, and multi-tenant GPU scheduling. Requires distributed systems experience, software proficiency applied to reliability work, and experience handling production incidents.
Research Engineer, Infrastructure, Training Systems
Designs and optimizes distributed training systems scaling across thousands of GPUs for large AI models. Requires strong systems engineering, PyTorch/JAX expertise, and a collaborative mindset to boost research productivity.