Performance Engineer
The Performance Engineer optimizes the throughput and robustness of large-scale distributed ML systems by solving novel performance issues. The role requires significant software engineering experience at supercomputing scale and an interest in ML.
You may be a good fit if you:
- Have significant software engineering or machine learning experience, particularly at supercomputing scale
- Are results-oriented, with a bias towards flexibility and impact
- Pick up slack, even if it goes outside your job description
- Enjoy pair programming (we love to pair!)
- Want to learn more about machine learning research
- Care about the societal impacts of your work
Strong candidates may also have experience with:
- High performance, large-scale ML systems
- GPU/Accelerator programming
- ML framework internals
- OS internals
- Language modeling with transformers
Representative projects:
- Implement low-latency, high-throughput sampling for large language models
- Implement GPU kernels to adapt our models to low-precision inference
- Write a custom load-balancing algorithm to optimize serving efficiency
- Build quantitative models of system performance
- Design and implement a fault-tolerant distributed system running with a complex network topology
- Debug kernel-level network latency spikes in a containerized environment
Principal Infrastructure Engineer
The Principal Infrastructure Engineer builds and operates secure cloud-native and edge platforms for military collaboration software. The role requires 8+ years of production infrastructure experience, deep Kubernetes expertise, and the ability to obtain a SECRET clearance.
Staff Engineer, Distributed Storage and HPC & AI Infrastructure
Design and operate multi-petabyte distributed storage systems for large-scale AI training and inference, integrating parallel filesystems and building Kubernetes-native storage platforms.
Director of Platform & Reliability Engineering
The Director of Platform & Reliability Engineering will lead an engineering organization responsible for secure, scalable, and highly reliable products. This role involves setting the vision for internal platforms, cloud infrastructure, developer enablement, and production operations.
Staff Software Engineer, Infrastructure Asset Systems
As a Staff Software Engineer, you will build and extend systems for tracking, governing, and reporting on infrastructure assets. This involves designing data models, workflow engines, and integrations with financial and procurement systems, ensuring compliance and auditability.