Network Engineer, Design & Engineering

180k – 300k | New York, NY | San Francisco, CA | California | Austin, TX | DevOps / SRE | Onsite | 5+ YOE | Jun 10

Summary

Design end-to-end datacenter network architectures for AI training and inference workloads. Own topology selection, fabric design, and physical infrastructure integration, and produce deployable HLDs/LLDs across multiple GPU platforms and customer requirements.

About the role

Focus

End-to-End Network Design: Own the design lifecycle from customer requirements through deployable architecture. Produce topology designs, IP/addressing schemes, routing policy, and fabric configuration specifications for AI training and inference fabrics. Design front-end (out-of-band management, customer access), back-end (GPU-to-GPU training fabric), and storage network architectures.

Multi-Customer Architecture Adaptability: Design network architectures that adapt to different GPU platforms (NVIDIA, AMD, custom accelerators), server form factors, and workload profiles. Each customer engagement may require a different rack layout, power envelope, cable infrastructure approach, and fabric topology.

Physical Infrastructure Integration: Translate logical network designs into physical reality. Work cross-functionally on rack elevation planning, power distribution constraints, structured cabling architecture (fiber trunk design, patch panel layouts, cable pathway routing), and cooling/airflow considerations that impact network equipment placement. Ensure designs are buildable within the physical constraints of each facility.

Design Documentation & Handover: Produce comprehensive design packages that enable deployment teams to execute independently. This includes High-Level Designs (HLDs), Low-Level Designs (LLDs), cutsheet specifications, bill of materials, cabling matrices, and design decision records. Your documentation is the contract between design intent and deployment execution.

RDMA & High-Performance Fabric Design: Design lossless Ethernet fabrics optimized for RDMA (RoCEv2) workloads, including PFC configuration, ECN tuning, traffic class design, and congestion management. Understand the relationship between fabric topology, ECMP behavior, and collective communication patterns in distributed training workloads.

Cross-Functional Design Collaboration: Partner with Hardware Engineering on server/GPU platform integration, DC Operations on facility constraints and power planning, ICT on structured cabling feasibility and fiber budgets, Software Engineering on automation requirements and DCIM data modeling, and Validation teams on test plans and acceptance criteria. Your designs must satisfy constraints across all of these domains.

Design Review & Standards: Participate in and lead design review sessions. Contribute to the development of reference architectures, design standards, and reusable design patterns that accelerate future deployments. Challenge assumptions — both your own and others’ — to ensure designs are technically rigorous and operationally sound.

About You

Design-First Network Engineer: 5+ years of network engineering experience with a demonstrated focus on network design and architecture rather than purely operational roles. You’ve designed datacenter network fabrics from requirements through deployment — not just configured them. You can articulate why a design decision was made, what tradeoffs were considered, and what constraints drove the outcome.

Deep L1–L3 Expertise: Strong command of datacenter network fundamentals including Clos/fat-tree topologies, BGP (eBGP underlay, iBGP/eBGP overlay), EVPN/VXLAN, IP addressing and subnetting at scale, and physical layer design (optics selection, fiber types, link budgets). You understand how L1 decisions cascade into L2/L3 behavior and design accordingly.

RDMA & AI Fabric Understanding: Working knowledge of RDMA network design (InfiniBand and/or RoCEv2), lossless Ethernet configuration (PFC, ECN, DCQCN), and the network performance requirements of distributed AI training workloads. You understand why fabric design decisions directly impact training job completion time.

GPU Cluster Architecture Exposure: Experience designing networks around specific GPU platforms (NVIDIA DGX/HGX, AMD MI-series, custom accelerator platforms). Understanding of how GPU topology, NVLink/NVSwitch architecture, and host networking configuration interact with fabric design.

Physical Infrastructure Fluency: Ability to reason about network design in the context of physical constraints. You’ve worked through rack layout planning, power budget allocation, structured cabling architecture, and equipment placement decisions. You don’t design networks in a vacuum — you understand that every logical decision has a physical consequence.

First Principles Thinker: You break complex design problems into fundamental components and reason through them systematically. When faced with a new GPU platform, an unfamiliar facility constraint, or a novel customer requirement, you decompose the problem rather than reaching for the nearest template. You challenge assumptions — including your own — and can defend your design decisions with rigorous reasoning.

Documentation Rigor: You produce design documentation that is clear, complete, and actionable. Your HLDs and LLDs enable deployment teams to execute without requiring you in the room. You see documentation as a design artifact, not an afterthought.

Cross-Functional Collaboration: Excellent at working across engineering disciplines. You communicate design intent clearly to non-network stakeholders (hardware, facilities, cabling) and incorporate their constraints into your designs. You earn trust through technical depth and follow-through.

Nice to Haves

Hyperscale or Large-Scale Design Background: Experience designing networks at hyperscale companies (Meta, Google, Microsoft, AWS) or large AI infrastructure providers. You’ve seen what disciplined design processes look like at scale and can adapt those patterns to a fast-growing...

Skills

BGP, EVPN/VXLAN, Clos/fat-tree topologies, RDMA, RoCEv2, PFC, ECN, DCQCN, InfiniBand, NVIDIA DGX/HGX, AMD MI-series, High-Level Design (HLD), Low-Level Design (LLD), Structured cabling, IP addressing
