Software Engineer, Core Network Engineering
Builds and operates high-performance networking infrastructure for OpenAI's large-scale AI training and inference, focusing on host networking, datacenter fabrics, and WAN systems. Optimizes latency, reliability, and scalability using technologies like RDMA, InfiniBand, and RoCE. Requires strong systems programming skills in C++, Python, or Go.
Responsibilities
- Design, build, and operate networking systems that support large-scale AI training and inference infrastructure
- Improve performance, reliability, and scalability across host networking, datacenter fabrics, and WAN systems
- Develop automation for provisioning, configuration management, validation, upgrades, and lifecycle management of networking infrastructure
- Build tooling and observability systems for network health, performance analysis, debugging, and automated remediation
- Optimize network performance across technologies such as RDMA, RoCE, InfiniBand, Ethernet, and high-performance GPU interconnects
- Define and operationalize networking protocols, readiness criteria, and continuous validation systems
- Partner closely with compute, storage, hardware, and infrastructure teams to ensure networking scales predictably with fleet growth
- Contribute to architecture decisions around topology design, capacity planning, failure domains, and network reliability
- Diagnose complex distributed systems and networking issues across large heterogeneous compute environments
Requirements
- Experience building or operating large-scale networking or distributed systems infrastructure
- Comfortable working close to the hardware/software boundary
- Experience with Linux networking, kernel systems, NICs, RDMA, or performance-sensitive infrastructure software
- Experience working with high-performance networking technologies such as InfiniBand, RoCE, DPDK, or large-scale Ethernet fabrics
- Experience with datacenter networking, WAN systems, or host networking stacks
- Enjoys debugging complex systems and performance bottlenecks across multiple layers of the stack
- Comfortable writing production software in languages such as C++, Python, or Go
- Strong systems fundamentals across networking, operating systems, distributed systems, or infrastructure engineering
Lead Site Reliability Engineer
Lead SRE driving reliability strategy, infrastructure architecture, observability, and incident response for a B2B fintech platform on AWS and Kubernetes. Requires 7+ years building production-grade distributed systems.
Staff Network Engineer, Operations
Staff-level network operations engineer responsible for production reliability, incident response, and operational excellence across Crusoe's global edge, backbone, data center, and GPU cluster networks supporting AI workloads.
Senior Software Engineer, Platform
Leads architecture and implementation of a multi-cloud Kubernetes platform across AWS, Azure, and GCP. Owns infrastructure provisioning, access management, networking, and lifecycle systems while mentoring engineers and defining org-wide standards.
Senior Software Engineer - Internal Observability
Senior engineer building AI-powered observability systems and large-scale telemetry pipelines for Snowflake's multi-cloud data platform. Requires 7+ years focused on distributed systems and cloud services.