Model Serving Engineer
United States | ML Engineering | Remote | 5+ YOE
Summary
Builds and optimizes production inference pipelines for large tabular models using Triton Inference Server. Requires 5+ years in ML infrastructure, expert Python skills, and deep knowledge of inference frameworks and optimization techniques.
About the role
Key Responsibilities
- Design, build, and maintain production model serving infrastructure using Triton Inference Server as the primary framework
- Implement and optimize inference pipelines including custom backends, dynamic batching strategies, and model ensemble configurations in Triton
- Optimize Python inference code for performance, with a strong focus on GIL contention, multi-threading, and concurrency patterns
- Tune throughput and latency across the full serving stack: batching policies, thread pool sizing, model instance groups, and memory layout
- Work closely with the research team to understand new model architectures at a computational level, including batching behavior, dynamic shapes, and memory access patterns
- Own the full resource observability and control loop for production inference: instrument GPU memory, CPU, batch queue depth, and latency metrics, and actively tune model instance groups, concurrency limits, memory budgets, and batching configuration in response to observed behavior
- Evaluate and integrate alternative inference frameworks and runtimes as the model ecosystem evolves
- Contribute to GPU utilization improvements and resource efficiency across the serving fleet
Must Have
- Bachelor's or Master's degree in Computer Science, Engineering, or a related field (or equivalent practical experience)
- 5+ years of experience in model serving, ML infrastructure, or a closely related backend engineering role
- Deep, production-level experience with Triton Inference Server, including custom Python backends, batching configuration, and model repository management
- Expert-level Python skills with a thorough understanding of the GIL, multi-threading, multiprocessing, and async concurrency patterns
- Strong understanding of neural network inference mechanics: forward passes, batching strategies, memory management, and numerical precision tradeoffs
- Hands-on experience with other inference frameworks (TorchServe, TensorFlow Serving, ONNX Runtime, vLLM, etc.) and the ability to evaluate tradeoffs between them
- Experience profiling and optimizing inference code for latency and throughput at production scale
Nice to Have
- Experience with GPU kernel-level optimizations or CUDA profiling tools
- Familiarity with model quantization, pruning, or compilation toolchains (TensorRT, torch.compile, ONNX)
- Experience with KServe or other Kubernetes-native serving platforms
- Experience serving tabular or structured data models, including classical ML models such as XGBoost and CatBoost
- Experience with observability tooling such as Prometheus, Grafana, or Datadog in the context of inference monitoring
Skills
Triton Inference Server, Python, GIL, multi-threading, multiprocessing, async concurrency, TorchServe, TensorFlow Serving, ONNX Runtime, vLLM, Kubernetes, KServe, TensorRT, CUDA, XGBoost