Principal AI Platform Engineer (US)
Principal AI/ML Platform Engineer building and maintaining GenAI infrastructure including model gateways, vector DBs, observability, and secure access controls for production LLM workloads.
Key Responsibilities
- Design, build, and maintain the core infrastructure layer supporting GenAI products, including model gateways, prompt/versioning stores, vector databases, and LLM evaluation tools.
- Implement secure access controls and authentication mechanisms integrated by default into the AI platform components.
- Develop and manage observability, monitoring, and logging solutions for GenAI workloads and infrastructure.
- Collaborate closely with product and engineering teams to integrate GenAI infrastructure with agent frameworks and downstream applications.
- Optimize infrastructure for scalability, high availability, and cost efficiency in production workloads.
Qualifications & Skills
- Extensive experience building and maintaining AI platform infrastructure, including Kubernetes and container security.
- Demonstrated expertise in observability and monitoring frameworks, with a focus on real-time performance (e.g., OpenTelemetry, MLflow).
- Experience with AI infrastructure components such as vector databases, prompt/versioning stores, and AI IDEs.
Preferred Experience
- Familiarity with vLLM, SGLang, or a similar framework for hosting LLM inference workloads.
- Experience with CI/CD pipelines and automation for AI model deployment and platform operations.
- Strong knowledge of authentication and authorization frameworks integrated into AI platforms.
Staff Software Engineer, AI Runtime
Staff Software Engineer building and scaling Databricks' managed large-scale GPU training platform (AIR). Focus on distributed training performance, scheduling, fault tolerance, and developer experience for thousands of accelerators.
Senior Software Engineer, AI Runtime
Senior Software Engineer building and scaling Databricks' managed GPU training platform (AI Runtime) for large-scale distributed AI model training. Requires 5+ years in distributed systems and hands-on experience with GPU training frameworks.
Sr. Machine Learning Engineer, Computer Vision
Build and prototype diffusion-based text-to-image generative models (Pinterest Canvas) using large-scale visual-text datasets. Requires 5+ years industry computer vision experience and an M.S. or Ph.D.
Senior AI/ML Engineer
Senior AI/ML Engineer building transformer and deep learning models on financial and behavioral data to power personalized growth and marketing experiences at Chime. Requires strong production ML experience with PyTorch, AWS, and large-scale data infrastructure.