Network Engineer, Design & Engineering
Design end-to-end datacenter network architectures for AI training and inference workloads. Own topology selection, fabric design, and physical infrastructure integration, and produce deployable HLDs/LLDs across multiple GPU platforms and customer requirements.
Focus
End-to-End Network Design: Own the design lifecycle from customer requirements through deployable architecture. Produce topology designs, IP/addressing schemes, routing policy, and fabric configuration specifications for AI training and inference fabrics. Design front-end (out-of-band management, customer access), back-end (GPU-to-GPU training fabric), and storage network architectures.
Multi-Customer Architecture Adaptability: Design network architectures that adapt to different GPU platforms (NVIDIA, AMD, custom accelerators), server form factors, and workload profiles. Each customer engagement may require a different rack layout, power envelope, cable infrastructure approach, and fabric topology.
Physical Infrastructure Integration: Translate logical network designs into physical reality. Work cross-functionally on rack elevation planning, power distribution constraints, structured cabling architecture (fiber trunk design, patch panel layouts, cable pathway routing), and cooling/airflow considerations that impact network equipment placement. Ensure designs are buildable within the physical constraints of each facility.
Design Documentation & Handover: Produce comprehensive design packages that enable deployment teams to execute independently. These include High-Level Designs (HLDs), Low-Level Designs (LLDs), cutsheet specifications, bills of materials, cabling matrices, and design decision records. Your documentation is the contract between design intent and deployment execution.
RDMA & High-Performance Fabric Design: Design lossless Ethernet fabrics optimized for RDMA (RoCEv2) workloads including PFC configuration, ECN tuning, traffic class design, and congestion management. Understand the relationship between fabric topology, ECMP behavior, and collective communication patterns in distributed training workloads.
Cross-Functional Design Collaboration: Partner with Hardware Engineering on server/GPU platform integration, DC Operations on facility constraints and power planning, ICT on structured cabling feasibility and fiber budgets, Software Engineering on automation requirements and DCIM data modeling, and Validation teams on test plans and acceptance criteria. Your designs must satisfy constraints across all of these domains.
Design Review & Standards: Participate in and lead design review sessions. Contribute to the development of reference architectures, design standards, and reusable design patterns that accelerate future deployments. Challenge assumptions — both your own and others’ — to ensure designs are technically rigorous and operationally sound.
About You
Design-First Network Engineer: 5+ years of network engineering experience with a demonstrated focus on network design and architecture rather than purely operational roles. You’ve designed datacenter network fabrics from requirements through deployment — not just configured them. You can articulate why a design decision was made, what tradeoffs were considered, and what constraints drove the outcome.
Deep L1–L3 Expertise: Strong command of datacenter network fundamentals including Clos/fat-tree topologies, BGP (eBGP underlay, iBGP/eBGP overlay), EVPN/VXLAN, IP addressing and subnetting at scale, and physical layer design (optics selection, fiber types, link budgets). You understand how L1 decisions cascade into L2/L3 behavior and design accordingly.
RDMA & AI Fabric Understanding: Working knowledge of RDMA network design (InfiniBand and/or RoCEv2), lossless Ethernet configuration (PFC, ECN, DCQCN), and the network performance requirements of distributed AI training workloads. You understand why fabric design decisions directly impact training job completion time.
GPU Cluster Architecture Exposure: Experience designing networks around specific GPU platforms (NVIDIA DGX/HGX, AMD MI-series, custom accelerator platforms). Understanding of how GPU topology, NVLink/NVSwitch architecture, and host networking configuration interact with fabric design.
Physical Infrastructure Fluency: Ability to reason about network design in the context of physical constraints. You’ve worked through rack layout planning, power budget allocation, structured cabling architecture, and equipment placement decisions. You don’t design networks in a vacuum — you understand that every logical decision has a physical consequence.
First Principles Thinker: You break complex design problems into fundamental components and reason through them systematically. When faced with a new GPU platform, an unfamiliar facility constraint, or a novel customer requirement, you decompose the problem rather than reaching for the nearest template. You challenge assumptions — including your own — and can defend your design decisions with rigorous reasoning.
Documentation Rigor: You produce design documentation that is clear, complete, and actionable. Your HLDs and LLDs enable deployment teams to execute without requiring you in the room. You see documentation as a design artifact, not an afterthought.
Cross-Functional Collaboration: Excellent at working across engineering disciplines. You communicate design intent clearly to non-network stakeholders (hardware, facilities, cabling) and incorporate their constraints into your designs. You earn trust through technical depth and follow-through.
Nice-to-Haves
Hyperscale or Large-Scale Design Background: Experience designing networks at hyperscale companies (Meta, Google, Microsoft, AWS) or large AI infrastructure providers. You’ve seen what disciplined design processes look like at scale and can adapt those patterns to a fast-growing...