# Capacity Ops Associate
**Company:** [Baseten](https://hotfix.jobs/companies/baseten)
**Location:** San Francisco, CA; New York, NY
**Salary:** $120K-$160K
**Experience:** 2+ years
**Skills:** Kubernetes, GPU, H100, A100, B200, SRE, Cloud Engineering, Node Operations, PVC, Container Orchestration
**Posted:** 2026-03-10
> Manages GPU fleet operations, including node maintenance, capacity fulfillment, and technical orchestration between SRE/infra teams and customers. Requires 2+ years of experience, Kubernetes familiarity, and strong communication skills.
## Job Description
## Example Initiatives
- **The "Lost Node" Investigation:** Debugging cluster-level blockers to determine why pods aren't scheduling despite available capacity.
- **Regional Compliance Guard:** Auditing and correcting scheduling policies to ensure customer data stays within specified geographical constraints (e.g., EU-only vs US-only).
- **High-Stakes Maintenance Orchestration:** Coordinating critical maintenance cycles both externally (with vendors) and internally (with Baseten SREs) to evacuate workloads from unhealthy nodes and integrate replacement hardware with zero customer disruption.

## Responsibilities
- **Fleet Maintenance:** Manage daily node operations including tainting/untainting, node draining, and PVC repairs to ensure GPU fleet health and operational cost control.
- **GTM & Capacity Fulfillment:** Partner with Sales and account teams to scope and fulfill customer capacity requests, translating complex timelines into concrete infrastructure actions and clear ETAs.
- **Process & Observability Engineering:** Identify recurring gaps in the capacity lifecycle (intake, triage, comms) and drive fixes by defining lightweight processes and improving system observability.
- **Technical Orchestration:** Act as the operational bridge between SRE and Infra teams, executing discrete changes and verifying system status during high-stakes maintenance windows.
- **Technical Documentation:** Contribute to the internal knowledge base for GPU-specific issues (H100/A100/B200) to accelerate future incident resolution.
- **Automation & Tooling:** Identify repetitive workflows and partner with engineering to build scripts, dashboards, and internal tools that reduce manual intervention and shorten time-to-mitigation.
- **Knowledge Excellence:** Maintain a living database of GPU-specific intelligence (H100/B200) and market moves to accelerate incident resolution and support strategic briefings for leadership.

## Requirements
- Bachelor's or Master's degree in Computer Science, Engineering, or a related field.
- 2+ years of professional work experience, ideally in a customer-facing technical role or as a junior SRE/Cloud Engineer.
- Strong familiarity with Kubernetes and the lifecycle of cloud-based container orchestration.
- Strong ownership mindset and attention to detail, demonstrated through fast detection, clear communication, and reliable follow-through.
- Demonstrated ability to communicate complex technical blockers clearly to both internal engineering teams and external vendors.

## Benefits
- Competitive compensation, including meaningful equity.
- 100% coverage of medical, dental, and vision insurance for employees and dependents.
- Generous PTO policy, including a company-wide Winter Break.
- Paid parental leave.
- Company-facilitated 401(k).
- Exposure to a variety of ML startups.

**Apply:** https://hotfix.jobs/jobs/capacity-ops-associate-at-baseten-60dadeef-de65-4b6d-b53b-583b5e52b744
**Canonical:** https://hotfix.jobs/jobs/capacity-ops-associate-at-baseten-60dadeef-de65-4b6d-b53b-583b5e52b744