Principal AI Platform Engineer (US)

179k – 199k | United States | Remote

Summary

Principal AI/ML Platform Engineer building and maintaining GenAI infrastructure including model gateways, vector DBs, observability, and secure access controls for production LLM workloads.

About the role

Key Responsibilities

  • Design, build, and maintain the core infrastructure layer supporting GenAI products, including model gateways, prompt/versioning stores, vector databases, and LLM evaluation tools.
  • Implement secure access controls and authentication mechanisms integrated by default into the AI platform components.
  • Develop and manage observability, monitoring, and logging solutions for GenAI workloads and infrastructure.
  • Collaborate closely with product and engineering teams to integrate GenAI infrastructure with agent frameworks and downstream applications.
  • Optimize infrastructure for scalability, high availability, and cost efficiency for production workloads.

Qualifications & Skills

  • Extensive experience with Kubernetes and container security, and with building and maintaining AI platform infrastructure.
  • Demonstrated expertise in observability and monitoring frameworks, with a focus on real-time performance (e.g., OpenTelemetry, MLflow).
  • Experience with AI infrastructure components such as vector databases, prompt/versioning stores, and AI IDEs.

Preferred Experience

  • Familiarity with vLLM, SGLang, or a similar framework for hosting LLM inference workloads.
  • Experience with CI/CD pipelines and automation for AI model deployment and platform operations.
  • Strong knowledge of authentication and authorization frameworks integrated into AI platforms.

Skills

Kubernetes, OpenTelemetry, MLflow, Vector Databases, Prompt Stores, LLM Evaluation Tools, vLLM, SGLang, CI/CD, Authentication and Authorization Frameworks