Member of Technical Staff, AI Training Infrastructure

Builds and optimizes scalable infrastructure for AI model training on large GPU clusters. Requires expertise in distributed systems, Python/C++, and ML frameworks.

175k – 220k | San Mateo, CA | DevOps / SRE | On-site

About the role

Responsibilities

Design, build, and maintain scalable infrastructure for AI model training pipelines.
Optimize distributed training systems for large-scale GPU clusters.
Collaborate with AI researchers and engineers to deploy training workflows efficiently.

Requirements

Strong experience with distributed systems and high-performance computing.
Proficiency in Python, C++, and infrastructure-as-code tools.
Deep knowledge of AI training frameworks and GPU optimization.

Nice-to-Haves

Experience with Kubernetes and cloud platforms like AWS or GCP.
Background in machine learning operations (MLOps).

Skills

Python, C++, Kubernetes, Distributed Systems, GPU Programming, PyTorch, TensorFlow, MLOps, AWS, GCP

Similar roles

DevOps / SRE jobs

Sage

Senior/Staff Site Reliability Engineer

Leads design, operation, and evolution of highly reliable, scalable production infrastructure including cloud, databases, and observability. Drives incident response, SRE practices, automation, and capacity planning for large-scale distributed systems. Requires 7-12+ years in SRE/infrastructure engineering.

175k – 230k | New York, NY | DevOps / SRE | Hybrid | 7+ YOE | Go | AWS

Fireworks AI

Member of Technical Staff, Cloud Infrastructure

Builds and maintains scalable cloud infrastructure, focusing on reliability and performance. Requires expertise in cloud platforms, IaC and orchestration tools like Terraform and Kubernetes, and systems programming.

175k – 220k | New York, NY +1 | DevOps / SRE | Hybrid | AWS | GCP

Fireworks AI

Member of Technical Staff, Performance Optimization

Optimizes performance of high-scale systems by analyzing latency, throughput, and resource usage. Requires expertise in profiling, systems programming, and distributed scaling techniques.

175k – 220k | San Mateo, CA | DevOps / SRE | On-site | Go | C++

Rad AI

Staff Software Engineer, Infrastructure

Designs and operates scalable cloud infrastructure on AWS, focusing on Kubernetes orchestration, reliability practices, and observability for AI healthcare products. Requires 8+ years experience with IaC, containerization, and cross-team leadership.

175k – 230k | United States | DevOps / SRE | Remote | 8+ YOE | AWS | GCP

Grafana Labs

Staff Software Engineer - Platform, SysEng

Staff Backend Engineer on the Platform SysEng squad building and scaling the internal engineering platform that powers Grafana Cloud services. Owns distributed systems design, Kubernetes infrastructure, reliability/SLOs, and performance at massive scale.

175k – 210k | United States | DevOps / SRE | Remote | 7+ YOE | Go | IaC