Staff Software Engineer, Systems Engineering Focus

210k – 255k · San Francisco, CA · Onsite · Apr 10

Summary

Designs, builds, and scales customer-facing managed services with a focus on edge agents running on customer infrastructure. Provides technical oversight for high-reliability systems using eBPF, Kubernetes, and low-level Linux metrics; leads cross-team collaboration and mentors engineers.

About the role

What You'll Be Working On

Customer-Facing Feature Development: Build and scale core platform services end-to-end — from greenfield 0-to-1 projects to scaling systems handling growing production traffic.
Edge & Agents Technical Oversight: Serve as the team's subject matter expert on edge software. Review existing agent architectures, provide technical guidance on in-flight designs, and shape how we build and operate software at the system level.
Edge Agent Development: Build and maintain lightweight, high-reliability agents deployed on customer VMs. Minimize CPU/memory footprint without sacrificing observability coverage.
Linux Kernel Metrics & eBPF: Instrument low-level system metrics using eBPF and procfs to power Crusoe's monitoring and telemetry pipeline.
Packaging & Distribution: Own agent packaging and deployment via Helm charts, ensuring smooth delivery across customer environments.
Pull-Based Scraping Architecture: Design and evolve the "pull" scraping logic that collects metrics from customer infrastructure with minimal operational overhead.
Cross-Functional Collaboration: Partner with Control Plane, Storage, and SRE teams to ensure agent data feeds are reliable, well-structured, and operationally sound.
Cross-Team Bridge: Serve as the technical bridge between the Managed Platform Services team and adjacent infrastructure teams — including SRE and Compute — who work on the cloud hypervisor and lower-level platform layers.
Technical Leadership: Set patterns and frameworks adopted across the team. Mentor senior engineers, contribute to architecture decisions, and help scope quarterly roadmap items with engineering and product leadership.

What You'll Bring to the Team

Systems Programming Expertise: Strong proficiency in Python, Go, and/or Shell scripting, with comfort working across languages as the problem demands.
Linux Kernel Metrics: Experience instrumenting low-level system metrics. Comfort working at the procfs level.
Kubernetes & Helm: Strong understanding of Kubernetes internals and experience packaging and deploying workloads via Helm charts.
Operational Mindset: On-call experience on a customer-facing team is required. You proactively identify gaps in monitoring, alerting, and tooling — and close them.
Reliability-First Engineering: You design for crash safety, low resource footprint, and graceful degradation. You think about what happens when things go wrong before writing a line of code.
Scalable Design Thinking: You plan how systems evolve under traffic growth. You consider resiliency, HA, and disaster recovery from the start.
Staff-Level Impact: You lead cross-domain efforts, own team architecture, and coach multiple engineers across and beyond your team.
Communication: You communicate concisely and proactively, anticipate blockers, and keep stakeholders informed without needing to be asked.

Compensation

Compensation range: $210,000 to $255,000, plus bonus. Restricted Stock Units included.

Skills

Python, Go, Shell scripting, Kubernetes, Helm, eBPF, procfs, Linux kernel, monitoring, telemetry
