Sr. Staff Software Engineer, Observability

235k – 295k | Mountain View, CA | Onsite | 15+ YOE | Feb 1

Summary

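Develops observability platforms that handle billions of active time series and petabytes of logs daily across nearly a hundred cloud regions. Requires 15+ years of production experience in a language such as Go, Python, Java, Scala, Rust, or C++, plus experience with large-scale distributed systems and cloud technologies. The role also involves mentoring engineers.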

About the role

The impact you'll have:

Build the next generation of observability platforms that support billions of active time series and process petabytes of logs daily.
Manage infrastructure across nearly a hundred cloud regions, enabling all Databricks engineers and customers to monitor the reliability of our product.
Develop advanced workflows that accelerate incident diagnosis for Bricksters, allowing engineers to quickly derive insights from logs and metrics.
Uplevel monitoring and reliability practices across Databricks engineering, developing opinionated tools that set common standards for managing structured logs, metrics, alerts, dashboards, and oncall rotations.
Mentor and uplevel engineers, fostering a culture of technical excellence within the team and broader observability community.

What we look for:

BS (or higher) in Computer Science or a related field.
15+ years of production-level experience in one of: Go, Python, Java, Scala, Rust, C++, or similar languages.
Experience developing software for large-scale distributed systems.
Experience driving large projects involving multiple teams.
Experience with cloud technologies, e.g. AWS, Azure, GCP, Docker, or Kubernetes.
Familiarity with observability infrastructure, monitoring patterns, and reliability practices.

Skills

Go, Python, Java, Scala, Rust, C++, Kubernetes, Docker, AWS, GCP
