
Software Engineer, Productivity - Inference Runtime

230k – 385k | San Francisco, CA | Onsite
Summary

Builds and improves CI/CD, testing, validation, and release tooling for OpenAI's inference runtime teams to ensure reliable, performant model deployments across ChatGPT, API, and research workloads. Requires strong Python skills, developer productivity experience, and high ownership in ambiguous environments.

About the role

Responsibilities

  • Improve systems that ensure inference engine releases are correct, performant, and regression-free by evolving tooling and infrastructure for deploy gate validation
  • Bring rigor to release, validation, branching, and deployment processes across the inference stack
  • Improve canary, async, and large-scale validation workflows for inference systems
  • Harden CI, testing, and validation infrastructure so failures are actionable and trustworthy
  • Reduce noisy or flaky failures caused by infrastructure instability, GPU scheduling, or test environment issues
  • Build automation for failure triage, ownership detection, debugging, and escalation
  • Partner closely with inference teams, research developer productivity, engine acceleration, and infrastructure teams to improve release quality and rollout safety
  • Reduce developer friction in testing, debugging, and release workflows so engineers can move faster with confidence

Requirements

  • Strong experience with CI/CD systems, testing infrastructure, release tooling, developer productivity, or large-scale build and validation systems
  • Comfortable working in Python-heavy environments and debugging complex distributed systems
  • C++ experience is helpful but not required, especially for working near inference engine code, CI build issues, or performance-sensitive systems
  • High ownership and developer empathy, with a pragmatic, collaborative working style
  • Comfortable operating in ambiguous areas

Nice-to-Haves

  • Excited to learn about large-scale inference systems
  • Prior experience in inference environments (not required)
Skills
Python, CI/CD, Kubernetes, C++, Testing Infrastructure, Release Engineering, Observability, Distributed Systems, GPU, Automation
Similar roles at this salary range
Crusoe

Staff Software Engineer, Developer Experience

Staff-level engineer building developer tools, infrastructure, and automation to accelerate Crusoe engineering productivity. Requires Go, Kubernetes, CI/CD, and strong DevOps/SRE experience.

209k – 253k | San Francisco, CA +1 | DevOps / SRE | On-site | Go, Git
Stuut

Lead Site Reliability Engineer

Lead SRE driving reliability strategy, infrastructure architecture, observability, and incident response for a B2B fintech platform on AWS and Kubernetes. Requires 7+ years building production-grade distributed systems.

200k – 275k | San Francisco, CA | DevOps / SRE | On-site | AWS, EKS
Crusoe

Staff Network Engineer, Operations

Staff-level network operations engineer responsible for production reliability, incident response, and operational excellence across Crusoe's global edge, backbone, data center, and GPU cluster networks supporting AI workloads.

195k – 235k | San Francisco, CA | DevOps / SRE | On-site | BGP, QoS
Ditto

Senior Software Engineer, Platform

Lead architecture and implementation of multi-cloud Kubernetes platform across AWS, Azure, and GCP. Own infrastructure provisioning, access management, networking, and lifecycle systems while mentoring engineers and defining org-wide standards.

185k – 305k | United States | DevOps / SRE | Remote | AWS, GCP
Snowflake

Senior Software Engineer - Internal Observability

Senior engineer building AI-powered observability systems and large-scale telemetry pipelines for Snowflake's multi-cloud data platform. Requires 7+ years focused on distributed systems and cloud services.

200k – 288k | Menlo Park, CA | DevOps / SRE | On-site | C++, AWS