Performance Engineer

280k – 850k | San Francisco, CA | New York, NY | Seattle, WA | Hybrid | Feb 12

Summary

The Performance Engineer optimizes the throughput and robustness of large-scale distributed ML systems by solving novel performance issues. The role requires significant software engineering experience at supercomputing scale and an interest in ML.

About the role

You may be a good fit if you:

Have significant software engineering or machine learning experience, particularly at supercomputing scale
Are results-oriented, with a bias towards flexibility and impact
Pick up slack, even if it goes outside your job description
Enjoy pair programming (we love to pair!)
Want to learn more about machine learning research
Care about the societal impacts of your work

Strong candidates may also have experience with:

High performance, large-scale ML systems
GPU/accelerator programming
ML framework internals
OS internals
Language modeling with transformers

Representative projects:

Implement low-latency high-throughput sampling for large language models
Implement GPU kernels to adapt our models to low-precision inference
Write a custom load-balancing algorithm to optimize serving efficiency
Build quantitative models of system performance
Design and implement a fault-tolerant distributed system running with a complex network topology
Debug kernel-level network latency spikes in a containerized environment

Skills

Machine Learning, GPU Programming, Distributed Systems, ML Frameworks, OS Internals, Transformers, Kubernetes, Load Balancing, Performance Optimization, High-Throughput Systems
