Staff Applied Machine Learning Engineer
Build and operate production ML systems for ranking, recommendations, search, and customer intelligence signals used across product, growth, risk, and decisioning teams. Requires 12+ years of production ML experience and deep expertise in intelligent systems.
Responsibilities
- Build and operate production ML systems that turn customer and product context into trusted signals, rankings, recommendations, and decision capabilities.
- Design production data and signal contracts that define intended use, freshness, provenance, confidence, eligibility, and calibration for downstream consumers.
- Own ranking, retrieval, recommendation, search, propensity, and next-best-action systems end to end, from feature and candidate generation through serving, experimentation, monitoring, and feedback loops.
- Evaluate customer and business impact beyond short-term conversion, including trust, fairness, access, risk, compliance, long-term engagement, and segment-level performance.
- Partner across product, growth, data, platform, modeling, risk, and compliance to translate ambiguous goals into measurable ML system designs.
- Use AI and agents to accelerate development, analysis, testing, documentation, and operations while exposing reusable capabilities to product services, internal tools, and AI-assisted workflows.
Requirements
- 12+ years building and operating production software and ML systems for business-critical products.
- Deep expertise in intelligent systems such as ranking/retrieval, recommendations, search, personalization, growth and lifecycle ML, customer intelligence, propensity/churn/LTV, next-best-action, or model-derived risk signals.
- Strong production ML judgment across feature pipelines, model serving, experimentation, monitoring, feedback loops, online/offline consistency, and reliable signal interfaces.
- Ability to evaluate impact beyond short-term conversion, including trust, fairness, access, risk, compliance, and long-term engagement.
- Experience using AI-assisted engineering tools with appropriate verification, testing, and review for customer-impacting systems.
Nice to Have
- Experience with semantic retrieval, embeddings, two-tower models, graph features, LLM-powered retrieval or decision systems, entity resolution, or real-time personalization.
- Experience with experimentation, online evaluation, interleaving, counterfactual evaluation, multi-objective optimization, or long-term holdouts.
- Experience building reusable feature/signal platforms, decision services, customer intelligence layers, model-derived data products, or agent-assisted operations.
Technologies
- Python, Java, Kotlin, SQL
- TensorFlow, PyTorch, XGBoost/LightGBM, ranking/retrieval systems, embeddings, semantic search, recommendation frameworks
- Event streams, batch pipelines, feature stores, model-serving infrastructure, workflow orchestration, experimentation systems, and data warehouses/lakehouses
- Cloud infrastructure, Kubernetes, observability tooling, coding agents, evaluation harnesses, and agent-assisted operations tooling
Staff Software Engineer, Cribl AI
Staff-level AI/ML engineer building and productionizing generative AI features across backend and frontend for Cribl's observability platform. Requires 6+ years of experience, an AI/ML and MLOps background, and TypeScript/JavaScript proficiency.
Member of Technical Staff — Model Optimization and Inference
Optimize inference for real-time multimodal AI avatars. Specialize in LLM and diffusion model serving, KV cache strategies, quantization, and low-latency frameworks like vLLM and TensorRT-LLM.
Researcher: Agent Post-Training, API & Power-Users
Improve agentic model capabilities for API and power users by designing experiments, building evals from real workflows, and driving post-training interventions from discovery through launch.
Staff Applied Scientist - Dashboards
Staff Applied Scientist defining evaluation strategy and quality metrics for Datadog's AI-native Dashboards product. Owns ML/GenAI evaluation systems, builds datasets and harnesses, and drives improvements in retrieval, tool selection, and agent performance.