Senior Software Engineer, AI Infrastructure

126k – 189k | Seattle, WA | Onsite | 8+ YOE | Jun 8

Summary

Senior engineer building and operating large-scale HPC infrastructure for AI model training. Owns job scheduling, automation, and performance optimization across GPU clusters.

About the role

Responsibilities

Independently design and deliver critical systems spanning the full stack, from the Beaker job scheduler to the execution runtime
Build innovative tooling and software-defined infrastructure to accelerate researcher velocity and automate cluster health management
Conduct root-cause analysis on complex distributed system failures and implement optimizations for distributed workloads
Provide input into the roadmap for managing large-scale HPC systems, including deployment of compute, networking, and storage
Review code/design docs, mentor team members, and drive process improvements
Communicate and collaborate with internal research staff to share system designs and support implementation

Requirements

8+ years of professional experience developing business-critical software and operating large-scale compute infrastructure
Proficiency in Go and/or Python
Bachelor’s degree in a related field (advanced degree may substitute for experience)
Expert-level knowledge of Linux internals and container runtimes (Docker)
Proven track record designing, debugging, and optimizing high-scale distributed systems and databases
Exceptional writing skills and ability to drive consensus across researchers and engineers
Principled approach to engineering and excitement for a non-profit research environment

Nice-to-Haves

Experience with workload schedulers (Kubernetes, Slurm) and high-performance networking (NCCL, InfiniBand)
Prior experience training or fine-tuning frontier AI models
Deep systems administration or SRE background in an HPC context
Contributions to open-source infrastructure or orchestration projects
Familiarity with on-prem storage systems (WEKA, Ceph)

Skills

Go, Python, Linux, Docker, Distributed Systems, Kubernetes, Slurm, NCCL, InfiniBand, SRE
