Staff Software Engineer, Systems Engineering Focus
Designs, builds, and scales customer-facing managed services with a focus on edge agents running on customer infrastructure. Provides technical oversight for high-reliability systems using eBPF, Kubernetes, and low-level Linux metrics; leads cross-team collaboration and mentors engineers.
What You'll Be Working On
- Customer-Facing Feature Development: Build and scale core platform services end-to-end — from greenfield 0-to-1 projects to scaling systems handling growing production traffic.
- Edge & Agents Technical Oversight: Serve as the team's subject matter expert on edge software. Review existing agent architectures, provide technical guidance on in-flight designs, and shape how we build and operate software at the system level.
- Edge Agent Development: Build and maintain lightweight, high-reliability agents deployed on customer VMs. Minimize CPU/memory footprint without sacrificing observability coverage.
- Linux Kernel Metrics & eBPF: Instrument low-level system metrics using eBPF and procfs to power Crusoe's monitoring and telemetry pipeline.
- Packaging & Distribution: Own agent packaging and deployment via Helm charts, ensuring smooth delivery across customer environments.
- Pull-Based Scraping Architecture: Design and evolve the "pull" scraping logic that collects metrics from customer infrastructure with minimal operational overhead.
- Cross-Functional Collaboration: Partner with Control Plane, Storage, and SRE teams to ensure agent data feeds are reliable, well-structured, and operationally sound.
- Cross-Team Bridge: Serve as the technical bridge between the Managed Platform Services team and adjacent infrastructure teams — including SRE and Compute — who work on the cloud hypervisor and lower-level platform layers.
- Technical Leadership: Set patterns and frameworks adopted across the team. Mentor senior engineers, contribute to architecture decisions, and help scope quarterly roadmap items with engineering and product leadership.
What You'll Bring to the Team
- Systems Programming Expertise: Strong proficiency in Python, Go, and/or Shell scripting, with comfort working across languages as the problem demands.
- Linux Kernel Metrics: Experience instrumenting low-level system metrics. Comfort working at the procfs level.
- Kubernetes & Helm: Strong understanding of Kubernetes internals and experience packaging and deploying workloads via Helm charts.
- Operational Mindset: On-call experience on a customer-facing team is required. You proactively identify gaps in monitoring, alerting, and tooling — and close them.
- Reliability-First Engineering: You design for crash safety, low resource footprint, and graceful degradation. You think about what happens when things go wrong before writing a line of code.
- Scalable Design Thinking: You plan how systems evolve under traffic growth. You consider resiliency, HA, and disaster recovery from the start.
- Staff-Level Impact: You lead cross-domain efforts, own team architecture, and coach multiple engineers across and beyond your team.
- Communication: You communicate concisely and proactively, anticipate blockers, and keep stakeholders informed without needing to be asked.
Compensation
Compensation range: $210,000 - $255,000 plus bonus. Restricted Stock Units included.