Senior Platform AI Engineer
San Francisco, CA | ML Engineering | Hybrid | 7+ YOE
Summary
Senior engineer building and operating production AI/ML infrastructure, including MCP servers, agent orchestration, LLM workflows, and RAG systems. Requires 7+ years of software engineering experience, including 2+ years in AI/ML infrastructure.
About the role
What you'll do
MCP Server Development & AI-Optimized API Design
- Design and build MCP (Model Context Protocol) servers that expose Drata's platform to AI agents
- Make architectural decisions about tool granularity, naming conventions for agent disambiguation, response compression for LLM context windows, and workspace isolation for multi-tenant access
- Write semantic parameter descriptions, contextual hints, and tool schemas optimized for model comprehension
Agent Orchestration & Workflow Infrastructure
- Build and operate the infrastructure for deploying multi-step agent workflows
- Manage state across complex reasoning chains, tool routing and execution runtimes, and long-running agentic processes
- Design systems that handle agent failure modes gracefully: retries on ambiguous tool outputs, fallback strategies, and observability into multi-step execution traces
LLM Operations & Model Lifecycle Management
- Own the operational side of LLM workflows: model upgrades across production pipelines, prompt versioning and A/B testing, and AI workflow deployment
- Manage token capacity planning, model costs, context limits, batching strategies, and rate governance
- Investigate AI workflow failures to determine whether the cause is a prompt template issue, a model behavior change, or an infrastructure problem
Production AI Infrastructure & RAG Systems
- Operate and evolve the production AI stack: vector storage and indexing, document parsing pipelines, multi-region deployment, and cost optimization
- Make RAG architecture decisions around embedding strategies, retrieval filtering, and data model coordination
- Implement caching layers and token-aware request routing to manage spend
Platform Enablement & Developer Experience
- Build CI/CD patterns specific to AI workflows (reproducible deployments, SDK version compatibility, workflow rollback semantics)
- Own AI-specific observability: token usage dashboards, response quality metrics, agent execution traces, and cost-per-workflow tracking
- Enable product engineering teams to ship AI features faster
What you'll bring
- 7+ years of software engineering experience, with 2+ years building or operating AI/ML infrastructure in production
- Strong Python skills (our AI services are built in Python); TypeScript/Node.js is a plus
- Experience with LLM APIs, vector databases, or AI orchestration platforms
- Experience across the stack: writing Terraform, debugging prompt templates, designing agent orchestration frameworks
- Experience with cloud infrastructure (AWS preferred: ECS, S3, Bedrock), container orchestration, infrastructure-as-code, CI/CD pipeline design, API design, workflow orchestration engines, and distributed systems
- Experience with AI-specific tooling: LLM APIs (Claude, OpenAI, etc.), model serving frameworks (vLLM, SageMaker), vector databases, embedding pipelines, prompt management platforms, or agent frameworks
- Clear communication about technical tradeoffs, especially AI-specific infrastructure decisions
Skills
Python, TypeScript, AWS, Terraform, Docker, Kubernetes, LLM APIs, Vector Databases, RAG, CI/CD