ML Infrastructure Engineer
160k – 200k | San Francisco, CA | Onsite | 3+ YOE
Summary
Builds and maintains ML infrastructure for training pipelines that handle massive 3D datasets and for real-time inference serving integrated with CAD software. Requires 3+ years of experience with Python, PyTorch, ML orchestration tools, data versioning, and inference optimization.
About the role
Responsibilities
- Design and build a centralized system for versioning training data, generated datasets, and model artifacts, with full lineage tracking from raw source data through to trained model outputs.
- Develop and maintain reliable, reproducible ML training and data generation pipelines.
- Refactor and harden existing training and data generation scripts into composable, testable, and maintainable components.
- Create CI/CD workflows for validating data pipelines and model training runs, including automated correctness checks and regression detection.
- Build tooling that enables ML engineers to launch, monitor, and debug training jobs with minimal friction.
- Optimize and scale real-time model inference services to meet latency and throughput requirements in production, including profiling, batching strategies, and resource-efficient serving.
- Own the deployment path from trained model artifact to production endpoint, ensuring reliable rollouts, rollbacks, and monitoring.
Requirements
- 3+ years of work experience in relevant fields.
- Bachelor's or Master's degree in Computer Science, Engineering, or equivalent experience.
- Strong communication skills and the ability to work closely with ML researchers and engineers to understand their workflows and translate them into robust systems.
- Experience designing and building data versioning, artifact management, or dataset lineage systems (e.g., DVC, LakeFS, Weights & Biases, or custom solutions).
- Hands-on experience with ML pipeline orchestration tools (e.g., Airflow, Prefect, Metaflow, or similar).
- Experience with model serving and inference optimization — profiling latency, reducing memory footprint, or scaling serving infrastructure to meet real-time constraints.
- Ability to read and refactor ML training code — you don't need to design model architectures, but you need to understand what training pipelines are doing well enough to make them reliable.
- Proficiency in Python and PyTorch.
Bonus Qualifications
- Familiarity with AWS infrastructure services.
- Experience with containerized ML workflows and GPU-accelerated training environments.
- Experience with model optimization techniques (e.g., quantization, TensorRT, ONNX Runtime, distillation).
- Knowledge of infrastructure-as-code tools (e.g., AWS CDK, Terraform).
- Experience building or operating ML systems that handle large unstructured datasets (imagery, 3D data, sensor data).
Skills
Python, PyTorch, Airflow, Prefect, Metaflow, DVC, LakeFS, Weights & Biases, AWS, TensorRT, ONNX Runtime, AWS CDK, Terraform