Responsibilities
- Build fleet-scale pipelines that turn noisy onboard signals into actionable, high-confidence investigations.
- Develop automated triage and correlation systems that deduplicate issues, route them to the right owning teams, and attach up-to-date priority signals and diagnostic context.
- Partner with engineering teams and subject matter experts to turn investigation outcomes into better instrumentation, automation, and signal quality over time.
- Build internal tools and workflows that reduce duplicate effort and increase situational awareness as the fleet scales (self-service debugging, standardized metrics, shared templates, securely scoped access).
- Lead reliability investigations to identify contributing factors and ensure learnings turn into durable engineering changes.
Requirements
- Experience writing and shipping software that runs in production, with an ownership mindset and attention to how it behaves in real-world conditions.
- Ability to build and maintain tools and automation that enable other engineers: internal tools, instrumentation, and visualizations (languages: Python, Go, Bash, C++).
- Strong debugging fundamentals across the stack, including using system signals and live troubleshooting to form hypotheses and identify contributing factors.
- Strong interest in reliability engineering as a growth path: motivated by making complex systems understandable, resilient, and easier to run as they scale.
Nice-to-Haves
- Background in distributed systems or real-world deployed systems (vehicles, robotics, IoT, or similar).
- Familiarity with production telemetry and observability.
- Experience applying reliability metrics and operational feedback loops to drive improvements.
- Exposure to cross-team reliability work in mission-critical environments.
Compensation
Base pay range: $145,830 - $219,000 (depending on experience, qualifications, education, location, and skills). Eligible for an annual performance bonus, equity, and a competitive benefits package.