What you'll do

Design and build MCP (Model Context Protocol) servers that expose Drata's platform to AI agents
Make architectural decisions about tool granularity, naming conventions for agent disambiguation, response compression for LLM context windows, and workspace isolation for multi-tenant access
Write semantic parameter descriptions, contextual hints, and tool schemas optimized for model comprehension

Build and operate the infrastructure for deploying multi-step agent workflows
Manage state across complex reasoning chains, tool routing and execution runtimes, and long-running agentic processes
Design systems that handle agent failure modes gracefully: retries on ambiguous tool outputs, fallback strategies, and observability into multi-step execution traces

Own the operational side of LLM workflows: model upgrades across production pipelines, prompt versioning and A/B testing, and AI workflow deployment
Manage token capacity planning, model costs, context limits, batching strategies, and rate governance
Investigate AI workflow failures to determine whether the cause is a prompt template issue, a model behavior change, or an infrastructure problem

Operate and evolve the production AI stack: vector storage and indexing, document parsing pipelines, multi-region deployment, and cost optimization
Make RAG architecture decisions around embedding strategies, retrieval filtering, and data model coordination
Implement caching layers and token-aware request routing to manage spend

Build CI/CD patterns specific to AI workflows (reproducible deployments, SDK version compatibility, workflow rollback semantics)
Own AI-specific observability: token usage dashboards, response quality metrics, agent execution traces, and cost-per-workflow tracking
Enable product engineering teams to ship AI features faster

What you'll bring

7+ years of software engineering experience, with 2+ years building or operating AI/ML infrastructure in production
Strong Python skills (AI services are built in Python); TypeScript/Node.js is a plus
Experience with LLM APIs, vector databases, or AI orchestration platforms
Experience across the stack: writing Terraform, debugging prompt templates, and designing agent orchestration frameworks
Experience in cloud infrastructure (AWS preferred: ECS, S3, Bedrock), container orchestration, infrastructure-as-code, CI/CD pipeline design, API design, workflow orchestration engines, and distributed systems
Experience with AI-specific tooling: LLM APIs (Claude, OpenAI, etc.), model serving frameworks (vLLM, SageMaker), vector databases, embedding pipelines, prompt management platforms, or agent frameworks
Clear communication about technical tradeoffs, especially AI-specific infrastructure decisions