Senior Network & Site Reliability Engineer

210k – 240k · San Francisco, CA · Onsite · 8+ YOE

Summary

Design, operate, and automate the global network and reliability layer for a high-performance NVIDIA DGX SuperPOD supporting ML workloads. Own architecture, observability, incident response, and security for mission-critical infrastructure.

About the role

What You'll Do

  • Design and operate a scalable, secure network that meets high-security requirements and supports large-scale machine learning workloads.
  • Own network device configuration management end to end, ensuring consistency and reliability across the fleet.
  • Improve system and network reliability and performance through automation, observability, and proactive capacity planning.
  • Implement and manage complex network protocols and connectivity, including BGP, VPNs, WAN circuits, and external peering.
  • Build and maintain comprehensive monitoring, alerting, and incident response — SLOs, runbooks, and on-call rotations — and drive post-incident analysis and continuous improvement.
  • Ensure security, compliance, and operational readiness across our network and cloud infrastructure.
  • Partner across engineering and data science to drive a culture of performance and reliability.

What Will Help You Succeed

  • 8+ years in network or infrastructure engineering, including 5+ years in datacenter operations and/or systems and network administration.
  • A strong background in network security, architecture, design, and operations.
  • Extensive hands-on experience with network devices (firewalls, switches, load balancers) and large-scale architectures and protocols — BGP, QoS, MPLS, and IPsec VPNs.
  • Experience designing and operating modern datacenter network fabrics (spine-leaf, EVPN/VXLAN, ECMP).
  • Network automation and IaC tooling (Ansible, Terraform, Nornir, or similar), plus IPAM/DCIM platforms (NetBox, Infoblox, or similar).
  • WAN engineering — carrier circuit provisioning and external network peering.
  • Familiarity with Kubernetes networking (CNI plugins, ingress, service networking, network policy) and strong operational experience with Linux-based production infrastructure.
  • Experience with monitoring and observability stacks (Prometheus, Grafana, Datadog, ELK, OpenTelemetry).
  • Solid scripting (Python, Bash) to debug complex network and system issues and automate solutions, plus excellent cross-functional communication.

Also Helpful

  • NVIDIA networking technologies — Cumulus Linux, InfiniBand, Spectrum-X, and BlueField DPUs.
  • Familiarity with data-intensive platforms (Spark, Airflow, Kafka) and storage network protocols (NFS, LustreFS, iSCSI).
  • Security practices for applications and infrastructure, and experience in high-compliance or SOC 2 environments.

Skills

BGP, VPN, MPLS, IPsec, Ansible, Terraform, Kubernetes, Python, Prometheus, Grafana