Staff+ Software Engineer, Platform

405k – 485k | San Francisco, CA | New York, NY | Seattle, WA | Hybrid | 8+ YOE
Summary

Staff-level software engineer who builds and scales platform infrastructure across teams, including dev tools, service infra, multicloud, auth, connectivity, API distributability, and ML adaptation systems. Requires 8+ years of full-stack experience with Staff-level leadership, focusing on robust, scalable solutions in a fast-paced AI environment.

About the role

What you'll do

Platform Acceleration

  • Architect and optimize critical development infrastructure including dev environments, observability, and CI/CD pipelines
  • Partner with product teams to understand workflows and eliminate friction points

Service Infra

  • Build and maintain core infrastructure: service mesh, observability systems, deployment pipelines, shared libraries
  • Enable product teams to build and operate reliable services at scale

Multicloud

  • Build infrastructure spanning multiple cloud providers: cloud-agnostic tooling, cross-cloud networking, multi-region deployments

Auth & Identity

  • Build scalable solutions for user authentication, authorization, RBAC, SSO
  • Work with product teams, security, support, trust & safety

Connectivity

  • Own MCP proxy, OAuth/token management, MCP spec, Python/TypeScript SDKs
  • Handle token refresh at scale, admin controls, proxy infrastructure

API Distributability

  • Transform the Claude API into a cloud-native managed product: cross-cloud, on-prem, enterprise security/compliance

Platform Intelligence

  • Build training systems for customer-specific Claude adaptation
  • Work on ML training infra, production ML pipelines

You might be a good fit if you

  • Have 8-10+ years of practical full-stack engineering experience, ideally with 2+ years at Staff level
  • Have led the design and delivery of complex user-facing products across the full stack
  • Are a technical expert in modern frontend/backend technologies (e.g., React, TypeScript)
  • Are product-focused and build robust, scalable, easy-to-use solutions
  • Have experience in fast-moving environments building 0-to-1 products
  • Invest in peer mentorship and growth
  • Drive cross-team alignment and influence without authority
  • Have established engineering standards, component architectures, and best practices
  • Thrive in fast-paced, ambiguous environments

Strong candidates may also have experience with

  • Serving as technical lead or architect for foundational platform systems
  • Designing and scaling billing/payments systems at high volumes
  • Containerization and secure execution environments
  • Identity and access management (auth, SSO, RBAC) at enterprise scale
  • ML/AI systems, LLM inference, and model serving
  • Multi-cloud, cross-region architectures
  • API design focused on developer experience
Skills

React, TypeScript, CI/CD, Kubernetes, OAuth, Service Mesh, Observability, Multi-cloud, RBAC, SSO, ML Training, LLM Inference, API Gateways, Python, TypeScript SDK