Model Serving Engineer

United States | ML Engineering | Remote | 5+ YOE
Summary

Builds and optimizes production inference pipelines for large tabular models using Triton Inference Server. Requires 5+ years in ML infrastructure, expert Python skills, and deep knowledge of inference frameworks and optimization techniques.

About the role

Key Responsibilities

  • Design, build, and maintain production model serving infrastructure using Triton Inference Server as the primary framework
  • Implement and optimize inference pipelines including custom backends, dynamic batching strategies, and model ensemble configurations in Triton
  • Optimize Python inference code for performance, with a strong focus on GIL contention, multi-threading, and concurrency patterns
  • Tune throughput and latency across the full serving stack, including batching policies, thread pool sizing, model instance groups, and memory layout
  • Work closely with the research team to understand new model architectures at a computational level, including batching behavior, dynamic shapes, and memory access patterns
  • Own the full resource observability and control loop for production inference: instrument GPU memory, CPU, batch queue depth, and latency metrics, and actively tune model instance groups, concurrency limits, memory budgets, and batching configuration in response to observed behavior
  • Evaluate and integrate alternative inference frameworks and runtimes as the model ecosystem evolves
  • Contribute to GPU utilization improvements and resource efficiency across the serving fleet

Must Have

  • Bachelor's or Master's degree in Computer Science, Engineering, or a related field (or equivalent practical experience)
  • 5+ years of experience in model serving, ML infrastructure, or a closely related backend engineering role
  • Deep, production-level experience with Triton Inference Server, including custom Python backends, batching configuration, and model repository management
  • Expert-level Python skills with a thorough understanding of the GIL, multi-threading, multiprocessing, and async concurrency patterns
  • Strong understanding of neural network inference mechanics, including forward passes, batching strategies, memory management, and numerical precision tradeoffs
  • Hands-on experience with other inference frameworks (TorchServe, TensorFlow Serving, ONNX Runtime, vLLM, etc.) and the ability to evaluate tradeoffs between them
  • Experience profiling and optimizing inference code for latency and throughput at production scale

Nice to Have

  • Experience with GPU kernel-level optimizations or CUDA profiling tools
  • Familiarity with model quantization, pruning, or compilation toolchains (TensorRT, torch.compile, ONNX)
  • Experience with KServe or other Kubernetes-native serving platforms
  • Experience serving tabular or structured data models, including classical ML models such as XGBoost and CatBoost
  • Experience with observability tooling such as Prometheus, Grafana, or Datadog in the context of inference monitoring
Skills
Triton Inference Server, Python, GIL, multi-threading, multiprocessing, async concurrency, TorchServe, TensorFlow Serving, ONNX Runtime, vLLM, Kubernetes, KServe, TensorRT, CUDA, XGBoost