Research Engineer, Infrastructure, RL Systems

350k – 475k | San Francisco, CA | DevOps / SRE | Onsite | May 4

Summary

Designs and optimizes infrastructure for scalable reinforcement learning training of large models, improving reliability, observability, and throughput. Collaborates with researchers to productionize RL algorithms; requires strong engineering skills and deep learning framework knowledge.

About the role

What You’ll Do

Design, build, and optimize the infrastructure that powers large-scale reinforcement learning and post-training workloads.
Improve the reliability, scalability, and throughput of RL training pipelines and distributed RL workloads.
Develop shared monitoring and observability tools to ensure high uptime, debuggability, and reproducibility for RL systems.
Collaborate with researchers to translate algorithmic ideas into production-grade training pipelines.
Build evaluation and benchmarking infrastructure that measures model progress on helpfulness, safety, and factuality.
Publish and share learnings through internal documentation, open-source libraries, or technical reports that advance the field of scalable AI infrastructure.

Skills and Qualifications

Minimum qualifications:

Bachelor’s degree or equivalent experience in computer science, electrical engineering, statistics, machine learning, physics, robotics, or similar.
Strong engineering skills, including the ability to contribute performant, maintainable code and to debug complex codebases.
Understanding of deep learning frameworks (e.g., PyTorch, JAX) and their underlying system architectures.
Ability to thrive in a highly collaborative environment involving many different cross-functional partners and subject matter experts.
A bias for action: you take the initiative to work across different stacks and teams wherever you spot an opportunity to make sure something ships.

Preferred qualifications:

Experience training or supporting large-scale language models with tens of billions of parameters or more.
Experience working with reinforcement learning workloads (e.g., PPO, DPO, RLHF, or reward modeling).
Background in high-performance or reliability engineering, including distributed training frameworks and cluster orchestration (Kubernetes, Slurm).
Familiarity with monitoring and observability tools (Prometheus, Grafana, OpenTelemetry).
Contributions to large-scale ML research or infrastructure, open-source frameworks, or internal performance optimization efforts.

Logistics

Compensation: Depending on background, skills and experience, the expected annual salary range for this position is $350,000 - $475,000 USD.

Benefits: Generous health, dental, and vision benefits, unlimited PTO, paid parental leave, and relocation support as needed.

Skills

PyTorch, JAX, Kubernetes, Slurm, Prometheus, Grafana, OpenTelemetry, PPO, DPO, RLHF
