Sr. Staff Software Engineer, Observability
Develops observability platforms handling billions of time series and petabytes of logs across global cloud regions. Requires 15+ years of experience with systems languages, distributed systems, and cloud technologies, as well as experience mentoring engineers.
The impact you'll have:
- Build the next generation of observability platforms that support billions of active time series and process petabytes of logs daily.
- Manage infrastructure across nearly a hundred cloud regions, enabling all Databricks engineers and customers to monitor the reliability of our product.
- Develop advanced workflows that accelerate incident diagnosis for Bricksters, allowing engineers to quickly derive insights from logs and metrics.
- Uplevel monitoring and reliability practices across Databricks engineering, developing opinionated tools that set common standards for managing structured logs, metrics, alerts, dashboards, and on-call rotations.
- Mentor and uplevel engineers, fostering a culture of technical excellence within the team and broader observability community.
What we look for:
- BS (or higher) in Computer Science or a related field.
- 15+ years of production-level experience in one of Go, Python, Java, Scala, Rust, C++, or a similar language.
- Experience developing software for large-scale distributed systems.
- Experience driving large projects involving multiple teams.
- Experience with cloud technologies, e.g. AWS, Azure, GCP, Docker, or Kubernetes.
- Familiarity with observability infrastructure, monitoring patterns, and reliability practices.
Lead Site Reliability Engineer
Lead SRE driving reliability strategy, infrastructure architecture, observability, and incident response for a B2B fintech platform on AWS and Kubernetes. Requires 7+ years building production-grade distributed systems.
Staff Network Engineer, Operations
Staff-level network operations engineer responsible for production reliability, incident response, and operational excellence across Crusoe's global edge, backbone, data center, and GPU cluster networks supporting AI workloads.
Senior Software Engineer - Internal Observability
Senior engineer building AI-powered observability systems and large-scale telemetry pipelines for Snowflake's multi-cloud data platform. Requires 7+ years focused on distributed systems and cloud services.
Platform Engineer
Own AWS infrastructure, Pulumi IaC, deployment pipelines, and security baseline for an AI research platform serving financial institutions. First dedicated platform hire defining enterprise deployment, SOC 2 controls, and developer experience.