Lead Research Engineer

Leads development of performance optimizations for ML models at the graph, kernel, and system levels, advances the Thunder compiler with new passes and tools, and ensures seamless PyTorch Lightning integration. Requires strong PyTorch expertise, experience with optimization techniques, and knowledge of distributed systems.

225k – 275k | New York, NY | San Francisco, CA | ML Engineering | Remote

About the role

What You’ll Do

Develop performance-oriented model optimizations at multiple levels:

Graph-level (e.g., operator fusion, kernel scheduling, memory planning)
Kernel-level (CUDA, Triton, custom operators for specialized hardware)
System-level (distributed training across GPUs/TPUs, inference serving at scale)

Advance the Thunder compiler by building optimization passes, graph transformations, and integration hooks to accelerate training and inference workloads.

Work across the software stack to ensure optimizations are accessible to end users through clean APIs, automated tooling, and seamless integration with PyTorch Lightning.

Design and implement profiling and debugging tools to analyze model execution, identify bottlenecks, and guide optimization strategies.

Collaborate with hardware vendors and ecosystem partners to ensure Thunder runs efficiently across diverse backends (NVIDIA, AMD, TPU, specialized accelerators).

Contribute to open-source projects by developing new features, improving documentation, and supporting community adoption.

Engage with researchers and engineers in the community, providing guidance on performance tuning and advocating for Thunder as the go-to optimization layer in ML workflows.

Work cross-functionally with Lightning’s product and engineering teams to ensure compiler and optimization improvements align with the broader product vision.

What You’ll Need

Strong expertise with deep learning frameworks such as PyTorch.
Hands-on experience with model optimization techniques, including graph-level optimizations, quantization, pruning, mixed precision, or memory-efficient training.
Knowledge of distributed systems and parallelism strategies (data/model/pipeline parallelism, checkpointing, elastic scaling).
Familiarity with software engineering practices: designing APIs, building robust tooling, testing, CI/CD for performance-sensitive systems.
Excellent collaboration and communication skills, with the ability to partner across research, engineering, and external contributors.
Bachelor’s degree in Computer Science, Engineering, or a related field.

Nice-to-Haves

Experience with CUDA, Triton, or other GPU programming models for developing custom kernels.
Deep understanding of deep learning compiler internals (IR design, operator fusion, scheduling, optimization passes) or proven work in performance-critical software.
Proven track record contributing to open-source projects in ML, HPC, or compiler domains.
Advanced degree (Master’s or PhD) in machine learning, compilers, or systems highly preferred.

Compensation

Anticipated annual base salary range: $225,000 to $275,000 USD

In addition to base salary, the total rewards package includes a discretionary bonus, an equity component, and comprehensive benefits.

Skills

PyTorch, CUDA, Triton, PyTorch Lightning, Model Optimization, Quantization, Pruning, Mixed Precision, Distributed Training, CI/CD, GPU Programming, Deep Learning Compilers, Operator Fusion, Kernel Scheduling
