Capacity Ops Associate

Manages GPU fleet operations, including node maintenance, capacity fulfillment, and technical orchestration between SRE/infra teams and customers. Requires 2+ years of experience, Kubernetes familiarity, and strong communication skills.

120k – 160k | San Francisco, CA | New York, NY | DevOps / SRE | Hybrid | 2+ YOE

About the role

Example Initiatives

The "Lost Node" Investigation: Debugging cluster-level blockers to determine why pods aren't scheduling despite available capacity.
Regional Compliance Guard: Auditing and correcting scheduling policies to ensure customer data stays within specified geographical constraints (e.g., EU-only vs US-only).
High-Stakes Maintenance Orchestration: Coordinating critical maintenance cycles both externally (with vendors) and internally (with Baseten SREs) to evacuate workloads from unhealthy nodes and integrate replacement hardware with zero customer disruption.

Responsibilities

Fleet Maintenance: Manage daily node operations including tainting/untainting, node draining, and PVC repairs to ensure GPU fleet health and operational cost control.
GTM & Capacity Fulfillment: Partner with Sales and account teams to scope and fulfill customer capacity requests, translating complex timelines into concrete infrastructure actions and clear ETAs.
Process & Observability Engineering: Identify recurring gaps in the capacity lifecycle (intake, triage, comms) and drive fixes by defining lightweight processes and improving system observability.
Technical Orchestration: Act as the operational bridge between SRE and Infra teams, executing discrete changes and verifying system status during high-stakes maintenance windows.
Technical Documentation: Contribute to the internal knowledge base for GPU-specific issues (H100/A100/B200) to accelerate future incident resolution.
Automation & Tooling: Identify repetitive workflows and partner with engineering to build scripts, dashboards, and internal tools that reduce manual intervention and shorten time-to-mitigation.
Knowledge Excellence: Maintain a living database of GPU-specific intelligence (H100/B200) and market moves to accelerate incident resolution and support strategic briefings for leadership.

Requirements

Bachelor's or Master's degree in Computer Science, Engineering, or a related field.
2+ years of professional work experience, ideally in a customer-facing technical role or as a junior SRE/Cloud Engineer.
Strong familiarity with Kubernetes and the lifecycle of cloud-based container orchestration.
Strong ownership mindset and attention to detail, demonstrated through fast detection, clear communication, and reliable follow-through.
Demonstrated ability to communicate complex technical blockers clearly to both internal engineering teams and external vendors.

Benefits

Competitive compensation, including meaningful equity.
100% coverage of medical, dental, and vision insurance for employees and dependents.
Generous PTO policy, including a company-wide Winter Break.
Paid parental leave.
Company-facilitated 401(k).
Exposure to a variety of ML startups.

Skills

Kubernetes, GPU, H100, A100, B200, SRE, Cloud Engineering, Node Operations, PVC, Container Orchestration
