Staff Engineer, Datacenter Server Lifecycle

320k – 405k | San Francisco, CA | New York, NY | Hybrid | 8+ YOE
Summary

Owns the end-to-end server lifecycle in datacenters at scale, from provisioning to decommissioning, with a strong focus on automation, trusted compute security, and hardware operations for AI workloads. Requires hands-on server hardware experience, proficiency in at least one programming language (e.g., Python, Rust, Go, or Java), and working knowledge of cloud infrastructure such as Kubernetes, AWS, and GCP.

About the role

Key Responsibilities

  • Lead the build-out of automation to support datacenters containing tens of thousands of servers.
  • Define and own the end-to-end server lifecycle strategy — from provisioning and deployment through operation, maintenance, refresh, and decommissioning — and maintain automation and operational procedures for common lifecycle events (e.g., hardware failures, firmware upgrades, fleet rotations).
  • Partner closely with Infrastructure Security to design and enforce trusted compute standards across the server lifecycle.
  • Work closely with our Networking team to ensure end-to-end connectivity across all sites.
  • Build and maintain tooling to track machine health, configuration, and operational status across the full datacenter fleet.

Minimum Qualifications

  • Hands-on experience with server hardware, including rack deployment, cabling, troubleshooting, and understanding failure modes at scale.
  • End-to-end understanding of hardware lifecycle management: asset tracking, provisioning workflows, maintenance scheduling, and decommissioning practices.
  • Proficiency in at least one programming language (e.g., Python, Rust, Go, or Java).
  • Working knowledge of modern cloud infrastructure, including Kubernetes, Infrastructure as Code, AWS, and GCP.
  • Ability to communicate clearly and build consensus with a wide range of stakeholders.
  • Comfort navigating ambiguity and making progress on complex, cross-functional problems.
  • Willingness to travel occasionally to datacenter sites across North America.

Preferred Qualifications

  • 8+ years of experience in datacenter operations, hardware infrastructure management, or a closely related discipline.
  • Hands-on experience with GPU or AI accelerator hardware (e.g., NVIDIA A100/H100, AMD MI300, Google TPUs, or AWS Trainium) and an understanding of their operational demands.
  • Familiarity with modern provisioning tooling such as coreboot, LinuxBoot, or u-root.
  • Experience building or contributing to datacenter automation or fleet management platforms.
  • Experience building and deploying server operating system distributions across thousands of hosts.
  • Background in large-scale capacity planning and hardware refresh strategy, ideally at a hyperscaler or large cloud provider.
  • Experience with trusted compute and hardware security concepts such as secure boot, TPM, hardware attestation, and firmware verification — or a strong desire to develop deep expertise in this area.
Skills
Python, Rust, Go, Java, Kubernetes, Infrastructure as Code, AWS, GCP, NVIDIA A100/H100, AMD MI300, Google TPUs, AWS Trainium, coreboot, LinuxBoot, u-root