OpenAI DevOps / SRE Jobs
Open DevOps / SRE roles at OpenAI, pulled live from their hiring system.
DevOps / SRE roles at OpenAI cluster around $255k, with most listings between $230k and $293k. 56% of open DevOps / SRE roles call out Kubernetes; Python and distributed systems each appear in roughly a third. Most of these roles are on-site or hybrid; 5% are fully remote.
Tech Lead, Deployment & Operations — Custom Infrastructure
Lead deployment and operations for OpenAI’s custom silicon and systems into data center environments. Drive hardware bring-up, validation, production deployment, and fleet reliability at scale while leading a technical team.
Datacenter NetDeploy Lead - Stargate
Leads end-to-end physical network deployments in data centers, overseeing vendor execution, fiber/cabling installation, testing, validation, and handover to operations. Requires 10+ years in data center network infrastructure delivery, strong cabling and topology knowledge, and cross-team coordination skills.
Software Engineer, Frontier Systems
Builds infrastructure to monitor, detect, remediate, and verify hardware health across global GPU/CPU clusters at hyperscale. Owns node lifecycle workflows and partners with teams to ensure compute reliability for AI training and inference. Requires 7+ years experience with Python, distributed systems, and operational tooling.
Software Engineer, Productivity - Inference Runtime
Builds and improves CI/CD, testing, validation, and release tooling for OpenAI's inference runtime teams to ensure reliable, performant model deployments across ChatGPT, API, and research workloads. Requires strong Python skills, developer productivity experience, and high ownership in ambiguous environments.
Software Engineer, Core Network Engineering
Builds and operates high-performance networking infrastructure for OpenAI's large-scale AI training and inference, focusing on host networking, datacenter fabrics, and WAN systems. Optimizes latency, reliability, and scalability using technologies like RDMA, InfiniBand, and RoCE; requires strong systems programming in C++, Python, or Go.
Networking Operating System Firmware Engineer
Develops and maintains custom networking operating system firmware for AI supercomputers, integrating Linux kernel, switch ASICs, and control-plane services. Requires deep expertise in SONiC, SAI, routing protocols, and platform bring-up across hardware and software boundaries.
Performance & Systems Engineer, Codex
Optimizes performance across the Codex AI system's stack, including LLM inference, cloud orchestration, and agent behavior, to reduce latency and costs. Collaborates with researchers and engineers on high-impact improvements in a high-ownership role.
Software Engineer, Productivity - Model Performance
Builds and improves developer tools, CI/CD pipelines, and testing workflows to boost productivity for OpenAI's model performance engineering teams. Requires strong Python skills, experience with developer infrastructure, and ability to work in ambiguous environments.
Software Engineer, Productivity - Networking
Enhances developer productivity for OpenAI's networking team by improving build systems, CI/CD pipelines, test harnesses, and workflows for C++ and Python codebases in multi-server environments. Requires experience with developer tools and infrastructure automation.
Compute Optimization Researcher/Engineer
Develops optimization models, forecasting frameworks, and planning systems to maximize compute capacity utilization across GPU clusters, data centers, and cloud providers. Requires PhD and 5+ years in optimization or infrastructure planning with strong Python and solver expertise.
Tokens-as-a-Service (TaaS) Software Engineer
Builds systems and tooling to measure, monitor, and optimize token throughput from GPU infrastructure for OpenAI workloads. Integrates partner compute environments, benchmarks performance, analyzes tokenomics, and develops operational metrics and dashboards. Requires strong distributed systems and infrastructure engineering experience.
Software Engineer, Compute Infrastructure
Builds and optimizes large-scale compute infrastructure for AI workloads, spanning hardware automation, distributed systems, Kubernetes orchestration, networking, storage, and developer tools. Requires strong systems engineering experience in performance, reliability, and production infrastructure.
Systems Engineer (Network / Storage / Systems)
Architects, validates, and operationalizes networking, storage, and hardware infrastructure for large-scale AI compute environments. Requires 7+ years in systems engineering with expertise in hardware bring-up, debugging, and vendor management in fast-paced settings.
CPU Storage Tech Lead
Leads technical strategy for CPU platforms, memory, and storage architectures in large-scale AI data centers. Evaluates vendor roadmaps, drives platform decisions, and ensures optimization for AI training and inference. Requires 10+ years of experience in server hardware and hyperscale infrastructure.
CPU/Storage/PoP-WAN Program Manager
Leads execution of CPU, storage, PoP, and WAN infrastructure programs to activate compute clusters and expand global networks. Requires 8+ years in technical program management with deep knowledge of hardware, networking, and data center deployments.
Data Center Controls Network Engineer
Designs, validates, and scales secure OT network architectures for high-density AI data centers, including controls systems, telemetry, and integration with IT infrastructure. Requires 8+ years in OT networking, industrial protocols, and resilient topologies in mission-critical environments.
Workload Porting & Performance Engineer
Evaluates new hardware platforms by porting benchmarks and workloads, analyzes performance across compute/memory/networking, identifies bottlenecks, and optimizes for AI systems. Requires expertise in performance analysis, system architecture, and debugging across hardware/software boundaries.
3P Architect
Defines rack- and cluster-level reference architectures for AI infrastructure, translates workload requirements into designs, collaborates with partners and modeling teams to evaluate tradeoffs, and drives vendor roadmaps to address technology gaps.
Performance Modeling Engineer ~2
Develops and maintains performance modeling tools to analyze AI system behavior and evaluate tradeoffs in compute, memory, networking, and storage. Requires 1-2 years of experience in software engineering or systems analysis, plus strong programming and analytical skills.
Performance Modeling Engineer
Develops and maintains performance modeling tools and frameworks to evaluate AI system behavior and analyze tradeoffs in compute, memory, networking, and storage. Collaborates with architects on simulations and insights for infrastructure design; requires a strong software/modeling background and system architecture knowledge.
Software Engineer, Engineering Acceleration | Consumer Devices
Builds and operates CI/CD systems, developer workflows, and internal platforms to accelerate engineering velocity for consumer device software across device and cloud. Requires 7+ years experience with deep CI/CD and platform expertise.
Software Engineer, Kernel Performance & AI Tooling
Develops kernel performance optimizations, AI-assisted tooling, and observability infrastructure for AI-native hardware. Requires strong low-level systems experience, kernel/accelerator expertise, and familiarity with AI workflows for engineering acceleration.
Software Engineer, Infrastructure, Consumer Devices
Designs and builds scalable cloud infrastructure platforms powering OpenAI's consumer products, focusing on Kubernetes orchestration, reliability, and growth. Requires 8+ years experience leading large-scale systems with strong systems thinking.
ChatGPT Performance Engineer
Optimizes infrastructure and application performance for ChatGPT and the OpenAI API, focusing on latency, throughput, and efficiency at scale. Requires 7+ years in high-scale systems with expertise in profiling, tracing, and cross-layer optimizations.