Software Engineer, Core Network Engineering

230k – 342k | San Francisco, CA | Onsite

Summary

Builds and operates high-performance networking infrastructure for OpenAI's large-scale AI training and inference, focusing on host networking, datacenter fabrics, and WAN systems. Optimizes latency, reliability, and scalability using technologies like RDMA, InfiniBand, and RoCE; requires strong systems programming in C++, Python, or Go.

About the role

Responsibilities

  • Design, build, and operate networking systems that support large-scale AI training and inference infrastructure
  • Improve performance, reliability, and scalability across host networking, datacenter fabrics, and WAN systems
  • Develop automation for provisioning, configuration management, validation, upgrades, and lifecycle management of networking infrastructure
  • Build tooling and observability systems for network health, performance analysis, debugging, and automated remediation
  • Optimize network performance across technologies such as RDMA, RoCE, InfiniBand, Ethernet, and high-performance GPU interconnects
  • Define and operationalize networking protocols, readiness criteria, and continuous validation systems
  • Partner closely with compute, storage, hardware, and infrastructure teams to ensure networking scales predictably with fleet growth
  • Contribute to architecture decisions around topology design, capacity planning, failure domains, and network reliability
  • Diagnose complex distributed systems and networking issues across large heterogeneous compute environments

Requirements

  • Experience building or operating large-scale networking or distributed systems infrastructure
  • Comfortable working close to the hardware/software boundary
  • Experience with Linux networking, kernel systems, NICs, RDMA, or performance-sensitive infrastructure software
  • Experience working with high-performance networking technologies such as InfiniBand, RoCE, DPDK, or large-scale Ethernet fabrics
  • Experience with datacenter networking, WAN systems, or host networking stacks
  • Enjoy debugging complex systems and performance bottlenecks across multiple layers of the stack
  • Comfortable writing production software in languages such as C++, Python, or Go
  • Strong systems fundamentals across networking, operating systems, distributed systems, or infrastructure engineering

Skills

Linux networking, RDMA, InfiniBand, RoCE, DPDK, C++, Python, Go, Kubernetes, Ethernet