Staff Software Engineer, Network Automation
Design and deliver automation frameworks, observability platforms, and self-healing workflows for Crusoe's global network fleet. Requires 8+ years of network engineering experience, strong Python or Go skills, and expertise in model-driven automation.
What You'll Be Working On
Network Automation Platform
- Contribute to the technical roadmap for Crusoe's automation stack, from source of truth and config generation through day-2 operations and closed-loop remediation across our global fleet
Source of Truth
- Help design and maintain the authoritative data model (NetBox, Nautobot, or equivalent) that drives network configuration, validation, and operational state across teams
Intent-Based Configuration Pipelines
- Build and maintain declarative, model-driven configuration systems using Python, Nornir, Ansible, and Jinja2, treating the network as code and eliminating configuration drift
Model-Driven Automation
- Contribute to Crusoe's gNMI, OpenConfig, and NETCONF/YANG strategy for telemetry collection, configuration management, and state validation across multi-vendor fabrics
Self-Healing Workflows
- Build and maintain event-driven auto-remediation systems that detect faults, correlate telemetry, and resolve known failure modes without human escalation
Observability Platform
- Help build and improve Crusoe's telemetry, metrics, alerting, and dashboarding stack including Prometheus, Grafana, and streaming gNMI collectors
Architecture Partnership
- Work closely with Network Architecture to ensure designs are automation-first: deployable, validatable, and operable programmatically at scale
What You'll Bring to the Team
- 8+ years of network engineering experience with a demonstrated focus on production network automation and infrastructure as code
- Production-quality software engineering skills in Python or Go, with CI/CD integration and platform-level thinking
- Hands-on experience with model-driven automation including gNMI, OpenConfig, NETCONF, and YANG
- Experience contributing to or owning a network source of truth platform such as NetBox or Nautobot
- Strong knowledge of Arista (EOS) and/or Juniper (Junos) in leaf-spine DC fabric environments
- Solid understanding of BGP, EVPN-VXLAN, and LLDP at data center scale
- Experience building or contributing to observability platforms using Prometheus, Grafana, and streaming telemetry tooling
Bonus Points
- Experience with NVIDIA/Mellanox platforms in production environments
- Familiarity with operating at fleet scale across thousands of network devices in multi-region environments
- Exposure to closed-loop, event-driven automation and auto-remediation systems
- Experience in hyperscale or internet-scale infrastructure (cloud providers, large CDNs, or AI/ML infrastructure companies)
Benefits
- Competitive compensation and equity packages
- Restricted Stock Units
- Paid time off, paid holidays & leave of absence programs
- Comprehensive health, dental & vision insurance
- Employer contributions to HSA
- Paid parental leave
- Paid life insurance, short-term and long-term disability
- Professional development & tuition reimbursement
- Mental health & wellness support
- Commuter benefits (parking & transit)
- Cell phone stipend
- 401(k) retirement plan with company match up to 4% of salary
- Volunteer time off
- Global travel insurance & emergency assistance
- Daily meal allowance
- Additional perks & programs specific to location