Research Engineer

Develops performance optimizations for ML models at the graph, kernel, and system levels using PyTorch and the Thunder compiler. Builds tools, collaborates with partners, and contributes to open source. Requires strong PyTorch expertise and optimization experience.

180k – 250k · New York, NY · San Francisco, CA · ML Engineering · Remote

About the role

What You'll Do

Develop performance-oriented model optimizations at multiple levels:

Graph-level (e.g., operator fusion, kernel scheduling, memory planning)
Kernel-level (CUDA, Triton, custom operators for specialized hardware)
System-level (distributed training across GPUs/TPUs, inference serving at scale)

Advance the Thunder compiler by building optimization passes, graph transformations, and integration hooks to accelerate training and inference workloads.
Work across the software stack to ensure optimizations are accessible to end users through clean APIs, automated tooling, and seamless integration with PyTorch Lightning.
Design and implement profiling and debugging tools to analyze model execution, identify bottlenecks, and guide optimization strategies.
Collaborate with hardware vendors and ecosystem partners to ensure Thunder runs efficiently across diverse backends (NVIDIA, AMD, TPU, and specialized accelerators).
Contribute to open-source projects by developing new features, improving documentation, and supporting community adoption.
Engage with researchers and engineers in the community, providing guidance on performance tuning and advocating for Thunder as the go-to optimization layer in ML workflows.
Work cross-functionally with Lightning's product and engineering teams to ensure compiler and optimization improvements align with the broader product vision.

What You’ll Need

Strong expertise in deep learning frameworks such as PyTorch.
Hands-on experience with model optimization techniques, including graph-level optimizations, quantization, pruning, mixed precision, or memory-efficient training.
Knowledge of distributed systems and parallelism strategies (data/model/pipeline parallelism, checkpointing, elastic scaling).
Familiarity with software engineering practices: designing APIs, building robust tooling, testing, and CI/CD for performance-sensitive systems.
Excellent collaboration and communication skills, with the ability to partner across research, engineering, and external contributors.
Bachelor’s degree in Computer Science, Engineering, or a related field.

Nice-to-Haves

Experience with CUDA, Triton, or other GPU programming models for developing custom kernels.
Deep understanding of deep learning compiler internals (IR design, operator fusion, scheduling, optimization passes) or proven work in performance-critical software.
Proven track record of contributing to open-source projects in ML, HPC, or compiler domains.
Advanced degree (Master’s or PhD) in machine learning, compilers, or systems highly preferred.

Skills

PyTorch, PyTorch Lightning, CUDA, Triton, Model Optimization, Quantization, Pruning, Mixed Precision, Distributed Training, GPUs, TPUs, Compiler Optimization, Operator Fusion, Kernel Scheduling, CI/CD