Senior Software Engineer, AI Infrastructure
Senior engineer building and operating large-scale HPC infrastructure for AI model training. Owns job scheduling, automation, and performance optimization across GPU clusters.
Responsibilities
- Independently design and deliver critical systems spanning the full stack, from the Beaker job scheduler to the execution runtime
- Build innovative tooling and software-defined infrastructure to accelerate researcher velocity and automate cluster health management
- Conduct root-cause analysis on complex distributed system failures and implement optimizations for distributed workloads
- Provide input into the roadmap for managing large-scale HPC systems, including deployment of compute, networking, and storage
- Review code/design docs, mentor team members, and drive process improvements
- Communicate and collaborate with internal research staff to share system designs and support implementation
Requirements
- 8+ years of professional experience developing business-critical software and operating large-scale compute infrastructure
- Proficiency in Go and/or Python
- Bachelor’s degree in a related field (advanced degree may substitute for experience)
- Expert-level knowledge of Linux internals and container runtimes (Docker)
- Proven track record designing, debugging, and optimizing high-scale distributed systems and databases
- Exceptional writing skills and ability to drive consensus across researchers and engineers
- Principled approach to engineering and excitement for a non-profit research environment
Nice-to-Haves
- Experience with workload schedulers (Kubernetes, Slurm) and high-performance networking (NCCL, InfiniBand)
- Prior experience training or fine-tuning frontier AI models
- Deep systems administration or SRE background in an HPC context
- Contributions to open-source infrastructure or orchestration projects
- Familiarity with on-prem storage systems (WEKA, Ceph)
Senior Infrastructure Engineer
Build analytics infrastructure, observability tooling, and developer platforms to support real-time AI agents for 911 centers. Requires 4+ years infrastructure/platform/backend experience and comfort across the full stack.
Senior Site Reliability Engineer
Senior SRE to operate and evolve the EKS Kubernetes platform, CI/CD pipelines, and observability stack for Thunderbird's open-source infrastructure. Requires 7+ years of infrastructure experience and strong production Kubernetes and IaC skills.
Senior Cloud Engineer
Design, develop, and secure ClickHouse Cloud platforms for regulated and mission-critical environments across cloud, hybrid, and on-prem deployments. Requires 6+ years building scalable distributed systems, Kubernetes expertise, and proficiency in Go or Python.
Senior Network Engineer
Design and operate large-scale AI data center networks using spine-leaf architectures, EVPN/VXLAN, BGP, and automation tools. Requires 5+ years of data center networking experience and hands-on work with Cumulus NOS, SONiC, and Junos.