Senior Platform AI Engineer

San Francisco, CA | ML Engineering | Hybrid | 7+ YOE
Summary

Senior engineer building and operating production AI/ML infrastructure, including MCP servers, agent orchestration, LLM workflows, and RAG systems. Requires 7+ years of experience, including 2+ years in AI/ML infrastructure.

About the role

What you'll do

MCP Server Development & AI-Optimized API Design

  • Design and build MCP (Model Context Protocol) servers that expose Drata's platform to AI agents
  • Make architectural decisions about tool granularity, naming conventions for agent disambiguation, response compression for LLM context windows, and workspace isolation for multi-tenant access
  • Write semantic parameter descriptions, contextual hints, and tool schemas optimized for model comprehension

Agent Orchestration & Workflow Infrastructure

  • Build and operate the infrastructure for deploying multi-step agent workflows
  • Manage state across complex reasoning chains, tool routing and execution runtimes, and long-running agentic processes
  • Design systems that handle agent failure modes gracefully: retries on ambiguous tool outputs, fallback strategies, and observability into multi-step execution traces

LLM Operations & Model Lifecycle Management

  • Own the operational side of LLM workflows: model upgrades across production pipelines, prompt versioning and A/B testing, AI workflow deployment
  • Manage token capacity planning, model costs, context limits, batching strategies, and rate governance
  • Investigate AI workflow failures to distinguish among prompt template issues, model behavior changes, and infrastructure problems

Production AI Infrastructure & RAG Systems

  • Operate and evolve the production AI stack: vector storage and indexing, document parsing pipelines, multi-region deployment, and cost optimization
  • Make RAG architecture decisions around embedding strategies, retrieval filtering, and data model coordination
  • Implement caching layers and token-aware request routing to manage spend

Platform Enablement & Developer Experience

  • Build CI/CD patterns specific to AI workflows (reproducible deployments, SDK version compatibility, workflow rollback semantics)
  • Own AI-specific observability: token usage dashboards, response quality metrics, agent execution traces, and cost-per-workflow tracking
  • Enable product engineering teams to ship AI features faster

What you'll bring

  • 7+ years of software engineering experience, with 2+ years building or operating AI/ML infrastructure in production
  • Strong Python skills (our AI services are built in Python); TypeScript/Node.js is a plus
  • Experience with LLM APIs, vector databases, or AI orchestration platforms
  • Experience across the stack: writing Terraform, debugging prompt templates, designing agent orchestration frameworks
  • Experience in cloud infrastructure (AWS preferred: ECS, S3, Bedrock), container orchestration, infrastructure-as-code, CI/CD pipeline design, API design, workflow orchestration engines, and distributed systems
  • Experience with AI-specific tooling: LLM APIs (Claude, OpenAI, etc.), model serving frameworks (vLLM, SageMaker), vector databases, embedding pipelines, prompt management platforms, or agent frameworks
  • Clear communication about technical tradeoffs, especially AI-specific infrastructure decisions
Skills
Python, TypeScript, AWS, Terraform, Docker, Kubernetes, LLM APIs, Vector Databases, RAG, CI/CD