Senior Software Engineer, AI Infrastructure

126k – 189k · Seattle, WA · Onsite · 8+ YOE
Summary

Senior engineer building and operating large-scale HPC infrastructure for AI model training. Owns job scheduling, automation, and performance optimization across GPU clusters.

About the role

Responsibilities

  • Independently design and deliver critical systems spanning the full stack, from the Beaker job scheduler to the execution runtime
  • Build innovative tooling and software-defined infrastructure to accelerate researcher velocity and automate cluster health management
  • Conduct root-cause analysis on complex distributed system failures and implement optimizations for distributed workloads
  • Provide input into the roadmap for managing large-scale HPC systems, including deployment of compute, networking, and storage
  • Review code/design docs, mentor team members, and drive process improvements
  • Communicate and collaborate with internal research staff to share system designs and support implementation

Requirements

  • 8+ years of professional experience developing business-critical software and operating large-scale compute infrastructure
  • Proficiency in Go and/or Python
  • Bachelor’s degree in a related field (an advanced degree may substitute for experience)
  • Expert-level knowledge of Linux internals and container runtimes (Docker)
  • Proven track record designing, debugging, and optimizing high-scale distributed systems and databases
  • Exceptional writing skills and ability to drive consensus across researchers and engineers
  • Principled approach to engineering and enthusiasm for a non-profit research environment

Nice-to-Haves

  • Experience with workload schedulers (Kubernetes, Slurm) and high-performance networking (NCCL, InfiniBand)
  • Prior experience training or fine-tuning frontier AI models
  • Deep systems administration or SRE background in an HPC context
  • Contributions to open-source infrastructure or orchestration projects
  • Familiarity with on-prem storage systems (WEKA, Ceph)

Skills

Go, Python, Linux, Docker, Distributed Systems, Kubernetes, Slurm, NCCL, InfiniBand, SRE