Sr. Staff Software Engineer, Observability

235k – 295k | Mountain View, CA | Onsite | 15+ YOE
Summary

Develops observability platforms handling billions of time series and petabytes of logs across global cloud regions. Requires 15+ years of experience with production programming languages, distributed systems, and cloud technologies, plus experience mentoring engineers.

About the role

The impact you'll have:

  • Build the next generation of observability platforms that support billions of active time series and process petabytes of logs daily.
  • Manage infrastructure across nearly a hundred cloud regions, enabling all Databricks engineers and customers to monitor the reliability of our product.
  • Develop advanced workflows that accelerate incident diagnosis for Bricksters, allowing engineers to quickly derive insights from logs and metrics.
  • Uplevel monitoring and reliability practices across Databricks engineering, developing opinionated tools that set common standards for managing structured logs, metrics, alerts, dashboards, and oncall rotations.
  • Mentor and uplevel engineers, fostering a culture of technical excellence within the team and broader observability community.

What we look for:

  • BS (or higher) in Computer Science or a related field.
  • 15+ years of production-level experience in one of: Go, Python, Java, Scala, Rust, C++, or similar languages.
  • Experience developing software for large-scale distributed systems.
  • Experience driving large projects involving multiple teams.
  • Experience with cloud technologies, e.g. AWS, Azure, GCP, Docker, or Kubernetes.
  • Familiarity with observability infrastructure, monitoring patterns, and reliability practices.
Skills
Go, Python, Java, Scala, Rust, C++, Kubernetes, Docker, AWS, GCP