Staff Network Engineer, Operations

195k – 235k · San Francisco, CA · Onsite · 8+ YOE · Jun 5

Summary

Staff-level network operations engineer responsible for production reliability, incident response, and operational excellence across Crusoe's global edge, backbone, data center, and GPU cluster networks supporting AI workloads.

About the role

What You'll Be Working On

Production Reliability

Help own uptime across Crusoe's global edge, backbone, data center, and GPU cluster networks, directly supporting AI workloads at scale.

Incident Response

Lead and contribute to end-to-end response for high-severity network events, including mitigation, stakeholder communication, and postmortem documentation.

Root Cause Analysis

Drive RCAs for production incidents, identify systemic issues, and author remediation plans tracked through to closure.

Observability Improvements

Contribute to and improve Crusoe's network monitoring stack using streaming telemetry, SNMP, NetFlow, and tools such as Kentik, Grafana, Prometheus, and ThousandEyes.

Operational Standards

Author and maintain runbooks, escalation playbooks, and SOPs used across the operations team.

Operational Automation

Write Python-based tooling to reduce toil, automate common remediation workflows, and accelerate mean time to resolution.

SLI/SLO Contribution

Partner with Architecture and SRE teams to define and track network reliability metrics and service level objectives backed by real-time dashboards.

Mentorship

Provide technical guidance to Senior engineers and contribute to a culture of operational excellence and continuous learning.

What You'll Bring to the Team

8+ years of production network engineering experience with a focus on operations, incident response, and reliability in large-scale or internet-scale environments.
Hands-on experience with observability and monitoring tools including streaming telemetry, SNMP, NetFlow/sFlow, Grafana, Prometheus, and ThousandEyes.
Experience operating RDMA/RoCE lossless fabrics for GPU or HPC workloads, including familiarity with PFC, ECN, and DCQCN tuning.
Expert hands-on knowledge of BGP, EVPN-VXLAN, IS-IS, OSPF, MPLS, QoS, and TCP/IP in production data center environments.
Proficiency with Arista (EOS) and Juniper (Junos) platforms in leaf-spine Clos architectures across multi-vendor environments.
Python proficiency for writing auto-remediation scripts, diagnostic tooling, and operational automation.
Comfort operating large device fleets across multi-region environments with on-call responsibility, including experience as an escalation point during critical events.
Bachelor's degree in Computer Science, Electrical Engineering, or a related field, or equivalent practical experience.

Bonus Points

Experience with NVIDIA/Mellanox networking platforms in GPU cluster environments.
Familiarity with Kentik or Arbor for traffic analysis and DDoS visibility.
Experience defining or contributing to SLIs and SLOs in partnership with SRE or product teams.
Exposure to operating 10K+ device fleets across hyperscale or cloud environments.
Background contributing to post-incident learning programs or operational excellence initiatives org-wide.

Benefits

Competitive compensation and equity packages
Restricted Stock Units
Paid time off, paid holidays & leave of absence programs
Comprehensive health, dental & vision insurance
Employer HSA contributions
Paid parental leave
Paid life insurance, short-term and long-term disability
Professional development & tuition reimbursement
Mental health & wellness support
Commuter benefits (parking & transit)
Cell phone stipend
401(k) retirement plan with company match up to 4% of salary
Volunteer time off
Global travel insurance & emergency assistance
Daily meals allowance
Additional perks & programs specific to location

Skills

BGP, EVPN-VXLAN, IS-IS, OSPF, MPLS, QoS, TCP/IP, Arista EOS, Junos, Python, Grafana, Prometheus, ThousandEyes, SNMP, NetFlow/sFlow
