Network Engineer, Capacity and Efficiency
Network engineer focused on observability, telemetry, cost modeling, and efficiency optimization for large-scale AI infrastructure networks across data centers, backbones, and cloud providers. Requires 5+ years of production networking experience, fluency in BGP, EVPN, and QoS, and the ability to build tooling in Python or Go.
Responsibilities
- Build the network observability stack. Design and deploy telemetry pipelines (sFlow/IPFIX, gNMI streaming, eBPF host probes) that turn packet counters into per-flow, per-tenant, per-workload cost and utilization data. Own the SLIs for backbone and DCN fabric health.
- Hunt for efficiency. Analyze inter-region traffic patterns, identify hot links and stranded capacity, and quantify the dollar impact. Build the models that tell us whether to buy more capacity or move the workload.
- Own QoS and traffic engineering. Design and operate traffic classification, marking, and shaping across the backbone. Make sure bulk checkpoint transfers don’t starve latency-sensitive inference, and that we’re not paying premium cross-region rates for traffic that could take the cheap path.
- Drive cost attribution. Tie network spend (egress, interconnect ports, transit, optical leases) back to the teams and workloads that generate it. Make network cost a first-class input to capacity planning and workload placement decisions.
- Influence decisions you don't own. Convince other teams to act on data: research on traffic patterns, finance on interconnects, Systems Networking on QoS.
- Automate. Extend intent-based network configuration systems and write tooling for efficiency findings.
Requirements
- 5+ years operating large-scale production networks: data center fabrics (spine-leaf, Clos), backbone/WAN, or hyperscaler-adjacent environments.
- Fluent across the stack: BGP (including policy and communities), ECMP, VXLAN/EVPN or equivalent overlays, QoS (DSCP, queuing, shaping), and L1/optical basics (DWDM, coherent, LAGs).
- Deep knowledge of at least one major CSP’s networking model: AWS (VPC, TGW, Direct Connect, Gateway Load Balancer) or GCP (Shared VPC, Interconnect, Cloud Router, Network Connectivity Center).
- Have built or operated network telemetry at scale: streaming telemetry (gNMI/OpenConfig), flow export (sFlow, IPFIX, NetFlow), or eBPF-based host-side instrumentation.
- Comfortable writing Python or Go to build tooling, telemetry pipelines, infrastructure-as-code, and configuration management and automation for network devices.
- Think quantitatively by default, and can build cost models from counter data.
- Communicate crisply with audiences ranging from finance to network engineers.
Nice-to-haves
- SRE experience with large-scale network infrastructure (SLOs/SLIs, capacity planning, incident response).
- Background on a cloud provider's networking team or a cloud networking product team.
- Familiarity with AI/ML infrastructure traffic patterns (all-reduce, checkpoint/weight transfer, inference serving).
- Experience with HPC fabrics such as InfiniBand, RoCE v2, or lossless Ethernet.
- Traffic engineering for large backbones.
- Multi-cloud connectivity and billing models.
- Cost/chargeback systems or FinOps exposure.
Staff Software Engineer, Infrastructure Asset Systems
As a Staff Software Engineer, you will build and extend systems for tracking, governing, and reporting on infrastructure assets. This involves designing data models, workflow engines, and integrations with financial and procurement systems, ensuring compliance and auditability.
Senior Manager, Network Engineering & Infrastructure
Lead and mentor a network engineering team responsible for designing, deploying, and operating multi-site enterprise network infrastructure across data centers, cloud, offices, and vehicle facilities. Requires 10+ years of network experience with 5+ years in senior leadership.
Performance Engineer, Inference Systems
Performance engineer focused on cross-layer investigations of Anthropic's inference fleet for Claude, optimizing throughput, latency, reliability, and correctness while building observability and partnering with kernel and serving teams.
Tech Lead, Deployment & Operations — Custom Infrastructure
Lead deployment and operations for OpenAI’s custom silicon and systems into data center environments. Drive hardware bring-up, validation, production deployment, and fleet reliability at scale while leading a technical team.
Staff Fiber Network Engineer
Owns the end-to-end physical layer of a private global dark-fiber backbone network, including route design, fiber acquisition, vendor management, acceptance testing, and lifecycle management. Requires deep OSP/fiber expertise, optical transport knowledge, and 8+ years of experience building fiber programs.