Software Engineer, Compute Infrastructure

230k – 405k | San Francisco, CA | New York, NY | Seattle, WA | Hybrid | Apr 27

Summary

Builds and optimizes large-scale compute infrastructure for AI workloads, spanning hardware automation, distributed systems, Kubernetes orchestration, networking, storage, and developer tools. Requires strong systems engineering experience in performance, reliability, and production infrastructure.

About the role

Responsibilities

Build and deeply optimize reliable system software for large-scale compute systems that run some of the world's most demanding AI workloads
Design and operate infrastructure across accelerators, CPUs, NICs, switches, networking protocols, storage, data centers, cluster orchestration, scheduling, and fleet health
Profile, benchmark, and optimize training workloads across compute, memory, storage, networking, NCCL and collective communication, and cluster scheduling bottlenecks
Create hardware-aware automation that makes provisioning, firmware and driver upgrades, incident response, and day-to-day operations faster and less error-prone
Build CaaS, agent infrastructure, profiling, observability, benchmarking, and platform tools that help researchers, product engineers, and operators launch, debug, and optimize workloads with less friction
Turn operational lessons into better systems, stronger abstractions, and clearer ownership boundaries across teams
Collaborate across research, engineering, security, networking, hardware, and data center teams to make compute capacity more capable and easier to use

Qualifications

Strong software engineering skills and experience building, operating, or improving production infrastructure systems
Experience in one or more relevant areas such as distributed systems, operating systems, networking protocols, RDMA, NCCL or collective communication, storage, Kubernetes, scheduling, observability, reliability engineering, high-performance computing, GPU infrastructure, CaaS, agent infrastructure, hardware-aware performance optimization, benchmarking, developer experience, or infrastructure tooling
Ability to debug complex system behavior across software, hardware, networking, and workload layers, then turn findings into robust improvements
Comfort with ambiguity, strong ownership, and a bias toward practical, durable solutions

Skills

Kubernetes, NCCL, RDMA, distributed systems, high-performance computing, GPU infrastructure, observability, reliability engineering, storage systems, networking protocols

Similar roles at this salary range

Crusoe

Jun 8

Staff Software Engineer, Developer Experience

Staff-level engineer building developer tools, infrastructure, and automation to accelerate Crusoe engineering productivity. Requires Go, Kubernetes, CI/CD, and strong DevOps/SRE experience.

209k – 253k | San Francisco, CA +1 | DevOps / SRE | On-site | Go | Git

Stuut

Jun 8

Lead Site Reliability Engineer

Lead SRE driving reliability strategy, infrastructure architecture, observability, and incident response for a B2B fintech platform on AWS and Kubernetes. Requires 7+ years building production-grade distributed systems.

200k – 275k | San Francisco, CA | DevOps / SRE | On-site | AWS | EKS

Crusoe

Jun 5

Staff Network Engineer, Operations

Staff-level network operations engineer responsible for production reliability, incident response, and operational excellence across Crusoe's global edge, backbone, data center, and GPU cluster networks supporting AI workloads.

195k – 235k | San Francisco, CA | DevOps / SRE | On-site | BGP | QoS

Ditto

Jun 5

Senior Software Engineer, Platform

Lead architecture and implementation of a multi-cloud Kubernetes platform across AWS, Azure, and GCP. Own infrastructure provisioning, access management, networking, and lifecycle systems while mentoring engineers and defining org-wide standards.

185k – 305k | United States | DevOps / SRE | Remote | AWS | GCP

Snowflake

Jun 5

Senior Software Engineer - Internal Observability

Senior engineer building AI-powered observability systems and large-scale telemetry pipelines for Snowflake's multi-cloud data platform. Requires 7+ years focused on distributed systems and cloud services.

200k – 288k | Menlo Park, CA | DevOps / SRE | On-site | C++ | AWS
