Product Manager, Networking
Own the product roadmap and tooling for internal networking systems supporting large-scale GPU clusters, including design automation, observability, digital twins, and performance profiling. Requires 5+ years PM experience with deep hands-on knowledge of data center networking and infrastructure tooling.
Responsibilities
- Own the product roadmap for all internal networking tooling: design automation, provisioning, observability, performance analysis, and incident remediation workflows across frontend, backend, out-of-band (OOB), and building management system (BMS) networks.
- Drive the strategy and requirements for digital twin tooling that models physical fabric topology, enabling engineers to validate designs, simulate failures, and test config changes before touching production.
- Define and ship BOM generators that produce accurate, version-controlled bills of materials for frontend Ethernet, backend Ethernet, InfiniBand, and OOB networks, each tied directly to cluster topology specs.
- Own the configuration generation pipeline: translate high-level cluster designs into device-ready configs across switches, routers, and OOB management infrastructure, with correctness guarantees and rollback support.
- Define the requirements for the observability stack: network telemetry ingestion (gNMI, SNMP, streaming telemetry) feeding dashboards and alerting systems that give operators sub-minute visibility into fabric health and performance degradation.
- Define performance profiling tooling that surfaces InfiniBand and RoCEv2 congestion, all-reduce bottlenecks, and east-west bandwidth saturation at the GPU job level.
- Work with network engineers and site operations to map the full lifecycle of a network event from detection through remediation, then build the tooling that compresses mean time to resolution.
- Partner with infrastructure and software engineering teams to integrate networking tooling into the broader cluster lifecycle: from site design through rack-and-stack, burn-in, and steady-state operations.
- Define the data model and schema standards that sit underneath all networking tools, ensuring BOM data, topology data, telemetry data, and config state are coherent and queryable across systems.
- Conduct working sessions with network engineers, site leads, and operations staff to identify the highest-friction workflows, then prioritize ruthlessly based on operational impact.
Requirements
- 5+ years of product management experience, with at least 3 years focused on infrastructure, networking, or platform tooling.
- Direct working knowledge of data center networking technologies: spine-leaf topology, EVPN/VXLAN, BGP, 400G/800G Ethernet, and high-radix switch platforms such as Arista, Cisco Nexus, or NVIDIA Spectrum.
- Hands-on familiarity with high-performance interconnects: InfiniBand (HDR/NDR), RoCEv2, and the operational realities of running large-scale RDMA fabrics under AI training workloads.
- Working knowledge of network telemetry protocols and frameworks: gNMI/gRPC streaming, SNMP, OpenConfig, and at least one observability stack built on top of them (Prometheus, InfluxDB, Grafana, or equivalent).
- Experience shipping internal tooling or developer-facing platform products.
- Ability to write detailed technical specifications that engineering teams can execute against without follow-up clarification.
- Demonstrated track record of reducing operational toil through automation: config generation, provisioning workflows, or similar.
Nice-to-Haves
- Experience at a hyperscaler, neocloud, or large-scale GPU infrastructure company where you owned networking tooling end-to-end (AWS, Azure, GCP, Oracle, CoreWeave, Lambda, or equivalent).
- Prior background as a network engineer or network automation engineer before moving into product: you have personally written Ansible playbooks, Nornir scripts, YANG models, or equivalent configuration automation.
- Familiarity with digital twin or network simulation frameworks: emulated environments built on tools like GNS3, EVE-NG, Containerlab, or proprietary fabric simulation systems.
- Experience defining or operating out-of-band management networks: IPMI/BMC, console servers, and the tooling used to reach devices when the in-band network is down.
- Understanding of BMS integration patterns in hyperscale facilities: BACnet, Modbus, SNMP-based BMS interfaces, and the data normalization challenges that come with multi-vendor BMS environments.
- Exposure to AI/ML workload network requirements: collective communication libraries (NCCL, RCCL), all-reduce topologies, and how fabric decisions impact training throughput and model FLOPs utilization (MFU).
- Familiarity with network source-of-truth and IPAM systems: NetBox, Nautobot, or internal equivalents used to drive automation.
Senior Product Manager, New Verticals
Senior PM to own and grow new verticals including local dining rewards, app campaigns, and surveys. Drive product strategy, roadmap, and execution for consumer-facing, revenue-generating products.
Staff Product Engineer
Own product outcomes end-to-end as a Staff Product Engineer at a mission-driven AI startup. Build features across the stack, act as PM for your area, and work directly with 911 dispatchers to define and ship what matters most.
Senior Product Engineer
Own product outcomes end-to-end as a Senior Product Engineer at a mission-driven AI startup. Build features across the stack, act as PM for your area, and work directly with 911 dispatchers to define and ship what matters most.
Senior Product Manager, Scaled Abuse
Lead product vision and roadmap for scaled abuse prevention at Discord, covering fake accounts, account takeover (ATO), and scraping. Drive cross-functional execution with ML, security, and product teams to protect users.
Principal Product Manager, Supplier Platform
Lead product vision and roadmap for ezCater's restaurant partner platform, building data structures, services, and experiences that help partners grow their business. Requires 8+ years PM experience with marketplace and platform expertise.