
Sr. Machine Learning Engineer

Architects high-scale distributed systems and agentic AI for a cybersecurity platform, processing large data volumes with Kafka, Spark/Flink, and Kubernetes. Requires 5–8 years of backend experience in Java, Python, or Go and expertise in agentic frameworks, RAG, and cloud infrastructure.

191k – 220k · Sunnyvale, CA · ML Engineering · Onsite · 5+ YOE

About the role

Your Impact

  • Asynchronous Systems: Architect and optimize high-throughput, event-driven systems using Apache Kafka to handle real-time data flows.
  • Data Processing at Scale: Build and maintain large-scale data pipelines using Apache Spark or Flink to provide the high-volume analytics that power our AI.
  • Agentic Systems at Scale: Design sophisticated AI Agents capable of autonomous planning, memory management, and high-reliability tool-use across distributed environments.
  • Infrastructure & Orchestration: Lead the architectural design of containerized services on Kubernetes, ensuring high availability and scalability across Cloud Infrastructure (AWS/Azure/GCP).

Your Toolkit

  • 5–8 years of experience in backend engineering using Java, Python, or Go.
  • Expertise in distributed systems, asynchronous architectures (Kafka), and large-scale data processing (Spark/Flink).
  • Hands-on experience with agentic frameworks (e.g., AutoGen, CrewAI, or custom orchestration layers), RAG, MCP, model fine-tuning, and prompt engineering.
  • Experience with agentic observability using Langfuse and with evaluation (evals) frameworks for testing and resilience.

Bonus Points

  • Advanced IaC: Expertise in building reusable Terraform modules and managing complex multi-region cloud deployments.
  • Vector DB Optimization: Deep experience in indexing strategies (HNSW vs IVF) and performance tuning for high-concurrency vector databases at scale.
  • AI Ops: Experience with LLM deployment optimization (e.g., vLLM, TensorRT-LLM) or managing proprietary model inference endpoints.

Skills

Java, Python, Go, Apache Kafka, Spark, Apache Flink, Kubernetes, AutoGen, CrewAI, RAG, Langfuse, Terraform, AWS, Azure, GCP

Similar roles

ML Engineering jobs

Senior Software Engineer, GenAI Platform

Leads development of Reddit's large-scale GenAI Platform, including LLM Gateway, RAG applications, and agentic AI workflows. Requires 5+ years in ML/AI platform engineering with expertise in cloud, Kubernetes, and MLOps practices.

191k – 267k · United States · ML Engineering · Remote · 5+ YOE · Go, AWS

Senior Software Engineer II, ML/AI Platform

As a Senior Software Engineer II on the ML/AI Platform team, you will build and define the internal platform for training and deploying AI models across the organization. This role involves developing SDKs and supporting infrastructure for AI model fine-tuning and batch inference.

192k – 242k · United States · ML Engineering · Remote · 3+ YOE · Go, C++

Senior Software Engineer II, AI Labs & Foundations

Instacart is seeking a Senior Software Engineer II for their AI Labs & Foundations team to design, build, and operate high-scale production AI systems. This role involves working on cutting-edge AI experiences like conversational shopping agents and voice AI, requiring expertise in robust software engineering and production AI/ML.

192k – 242k · United States · ML Engineering · Remote · 5+ YOE · AI, RAG

Senior Applied Research Engineer 2

Drive AI system quality through experimentation and applied research on RAG, agents, and retrieval systems for Drata's compliance platform. Own evaluation frameworks, prototype GenAI workflows, and collaborate with engineers to productionize validated approaches.

192k – 260k · San Francisco, CA · ML Engineering · Hybrid · 6+ YOE · RAG, NLP

Senior Machine Learning Research Engineer

Build cutting-edge speech and audio ML models, production inference systems, and resilient pipelines. Own full ML lifecycle from research to deployment on terabytes of audio data.

190k – 260k · San Francisco, CA · ML Engineering · On-site · 5+ YOE · DSP, Python