Key Responsibilities
- Help fal maintain its frontier position in performance for generative media models.
- Design and implement novel approaches to model serving architecture on top of our in-house inference engine, focusing on maximizing throughput while minimizing latency and resource usage.
- Develop performance monitoring and profiling tools to identify bottlenecks and optimization opportunities.
- Work closely with our Applied ML team and customers (frontier labs in the media space) to make sure their workloads benefit from our accelerator.
Requirements
- Strong foundation in systems programming with expertise in identifying and fixing bottlenecks.
- Deep understanding of the cutting-edge ML infrastructure stack (PyTorch, TensorRT, TransformerEngine, Nsight), including model compilation, quantization, and serving architectures.
- Fundamental understanding of the underlying hardware (NVIDIA-based systems), including custom GEMM kernels with CUTLASS.
- Proficiency in Triton, or comparable experience in lower-level accelerator programming.
- Experience with multi-dimensional model parallelism (TP with context/sequence parallelism).
- Familiarity with the internals of Ring Attention, FA3, and FusedMLP implementations.
Compensation
$180,000 - $250,000 + equity + comprehensive benefits package