AI Infrastructure Engineer

Builds and maintains AI infrastructure using Ansible, Terraform, and Kubernetes, ensuring scalability, reliability, and high availability. Handles on-call incident response, monitoring, debugging, and infrastructure growth planning. Requires 5+ years of experience and a bachelor's degree in Computer Science or a related field, or equivalent work experience.

190k – 270k · San Francisco, CA · DevOps / SRE · Onsite · 5+ YOE

About the role

Responsibilities

  • Participate in the on-call rotation (PagerDuty) to respond to production incidents
  • Build and run our infrastructure with Ansible, Terraform, and Kubernetes to enable scaling to a massive number of concurrent users
  • Build monitoring systems to ensure the highest quality service for our customers
  • Design and implement operational processes (such as deployments and upgrades)
  • Debug production issues across all services and levels of the stack
  • Identify improvements to the product architecture from reliability, performance, and availability perspectives
  • Plan the growth of Together AI's infrastructure

Requirements

  • 5+ years of professional AI infrastructure or related experience
  • Bachelor's degree in Computer Science or a related field, or equivalent work experience
  • Knowledge of Ansible (roles, playbooks), Terraform, and Kubernetes
  • Proficiency in programming/scripting languages
  • Direct experience in monitoring and observability practices
  • Knowledge of cloud services
  • Ability to thrive in a collaborative environment involving different stakeholders and subject matter experts

Compensation

US base salary range: $190,000 - $270,000 + equity + benefits.

Skills

Ansible · Terraform · Kubernetes · PagerDuty · Monitoring · Observability · Cloud Services · Distributed Systems · Linux · Networking
