Senior GenAI Research Engineer - Optimization and Kernels

Develops advanced optimization techniques and high-performance GPU kernels for scaling LLM training. Requires expertise in CUDA, PyTorch, distributed frameworks, and NVIDIA GPU architecture.

166k – 225k | San Francisco, CA | AI Research | Onsite

About the role

Job Description

As a research engineer on the Scaling team, you will advance the scientific frontier in deep learning by creating new optimization techniques beyond the state of the art. You will collaborate with researchers and engineers to build that scientific expertise into products that help customers succeed with state-of-the-art LLMs and AI systems.

The Impact You Will Have

  • Drive performance improvements through advanced optimization techniques including kernel fusion, mixed precision, memory layout optimization, tiling strategies, and tensorization for training-specific patterns
  • Design, implement, and optimize high-performance GPU kernels for training workloads (e.g., attention mechanisms, custom layers, gradient computation, activation functions) targeting NVIDIA architectures
  • Design and implement distributed training frameworks for large language models, including parallelism strategies (data, tensor, pipeline, ZeRO-based) and optimized communication patterns for gradient synchronization and collective operations
  • Profile, debug, and optimize end-to-end training workflows to identify and resolve performance bottlenecks, applying memory optimization techniques like activation checkpointing, gradient sharding, and mixed precision training

What We Look For

  • BS/MS/PhD in Computer Science or related field with hands-on experience writing and tuning CUDA kernels for ML training applications, or hands-on experience in distributed training frameworks (PyTorch DDP, DeepSpeed, Megatron-LM, FSDP)
  • Strong understanding of NVIDIA GPU architecture (memory hierarchy, tensor cores, warp scheduling, SM occupancy) and proficiency with CUDA debugging/profiling tools (Nsight, nvprof)
  • Deep understanding of parallelism techniques and memory optimization strategies for large-scale model training, with proven ability to debug and optimize distributed workloads
  • Strong software engineering skills in Python and PyTorch, with experience supporting production training workflows and knowledge of LLM training dynamics including hyperparameter tuning and optimization strategies

Skills

CUDA, PyTorch, NVIDIA GPU, DeepSpeed, Megatron-LM, FSDP, Nsight, nvprof, Distributed Training, Kernel Optimization

Similar roles

AI Research jobs

Senior Applied Researcher AI/ML (US)

Senior Applied Researcher develops AI/ML solutions for healthcare challenges, applying GenAI, LLMs, and techniques like RAG to build production-ready models in SaaS environments. Requires Master's in relevant field, healthcare data experience, Python proficiency, and ML frameworks expertise.

178k – 198k | United States | AI Research | Remote | SQL | AWS

Research Scientist

Conduct research on training and scaling models for web indexing, focusing on convergence of search, recommendations, and transformer architectures. Requires deep intuition in modern ML models and a focus on applied research.

150k – 300k | California | AI Research | On-site | 7+ YOE | Research | Transformers

Senior ML Research Scientist, End-to-End Autonomous Driving

Senior ML Research Scientist advances end-to-end ML models for autonomous driving that map sensor data to driving behaviors, using large-scale datasets and compute. Requires MS/PhD, 3+ years ML experience, Python expertise, and top-tier research publications.

184k – 276k | Mountain View, CA | AI Research | On-site | 3+ YOE | C++ | Lidar

Senior Research Scientist

As a Senior Research Scientist on the Video team, you will drive research initiatives and translate advanced computer vision and deepfake detection models into scalable enterprise solutions. You will focus on audio-visual deepfake detection, synthetic media identification, and real-time video processing.

185k – 215k | United States | AI Research | Remote | 5+ YOE | Keras | Python

Applied Research Scientist / Engineer

Work as a full-stack applied researcher adapting multimodal video foundation models for production. Focus on controllability, personalization, and end-user quality using SFT, RL, and data-driven refinement.

200k – 450k | New York, NY +1 | AI Research | Hybrid | 7+ YOE | RL | SFT