Performance Engineer
California | DevOps / SRE | Remote
Summary
Develop and optimize high-performance kernels for wafer-scale AI hardware, implementing ML and linear algebra operations in low-level assembly and custom languages to maximize compute utilization.
About the role
Responsibilities
- Develop design specifications for new machine learning and linear algebra kernels and map them to the Cerebras WSE system using parallel programming algorithms.
- Develop and debug a kernel library of highly optimized routines in low-level assembly and a custom C-like domain-specific language (CSL), implementing algorithms optimized for the Cerebras hardware system.
- Use mathematical models and analysis to measure software performance and inform design decisions.
- Develop and integrate unit and system testing methodologies to verify correct functionality and performance of kernel libraries.
- Study emerging trends in machine learning applications and help evolve the kernel library architecture to address the computational challenges of state-of-the-art neural networks.
- Interact with chip and system architects to optimize the instruction sets, microarchitecture, and I/O of next-generation systems.
Requirements
- Bachelor’s, Master’s, or PhD degree, or foreign equivalent, in Computer Science, Computer Engineering, Mathematics, or a related field.
- Understanding of hardware architecture concepts; must be comfortable learning the details of a new hardware architecture.
- Skilled in C++ and Python programming languages.
- Good knowledge of library and/or API development best practices.
- Strong debugging skills, including experience debugging complex software stacks.
Preferred Qualifications
- Experience in kernel development and/or testing.
- Familiarity with parallel algorithms and distributed memory systems.
- Experience in programming accelerators such as GPUs and FPGAs.
- Familiarity with neural networks and machine learning frameworks such as TensorFlow and PyTorch.
- Familiarity with HPC kernels and their optimization.
Skills
C++, Python, CUDA, OpenCL, parallel programming, kernel development, low-level assembly, machine learning frameworks, TensorFlow, PyTorch