
Cloud Software Engineer - Observability Platform

Builds and operates scalable observability platforms handling trillions of events daily, focusing on reliability, performance, and automation. Requires 5+ years of experience, Golang proficiency, and experience with Kubernetes, IaC tools, cloud providers, and an observability stack such as OpenTelemetry or Prometheus.

141k – 230k | United States | DevOps / SRE | Remote | 5+ YOE

About the role

What you’ll do

  • Design, build, and operate distributed systems that power observability across ClickHouse Cloud
  • Own reliability, performance, and cost-efficiency of our telemetry pipeline and storage systems
  • Take part in the on-call rotation and help drive root-cause resolution and long-term fixes
  • Build tooling and automation to eliminate repetitive operational work
  • Help shape the roadmap for observability by identifying bottlenecks and scaling challenges
  • Collaborate with other engineering teams to improve their observability posture
  • Contribute to design discussions and architecture reviews, and mentor teammates

What we’re looking for

  • Strong bias for action and ownership — you ship, fix, and improve systems proactively
  • Great production debugging skills and a problem-solving mindset
  • Strong communication skills; comfortable working in a remote, async-friendly team
  • Experience balancing system performance, reliability, and cost
  • Ability to iterate quickly: build MVPs, collect feedback, and improve continuously

Requirements

  • 5+ years building and running production systems at scale
  • Proficiency in Golang
  • Experience with Kubernetes, Helm, Argo CD, and Terraform or similar IaC tools
  • Comfortable working with at least one major cloud provider (AWS, GCP, Azure)
  • Experience with OpenTelemetry, Prometheus, Grafana, or similar tools
  • Experience with ClickHouse preferred

Skills

Go, Kubernetes, Helm, Argo CD, Terraform, AWS, GCP, Azure, OpenTelemetry, Prometheus, Grafana, ClickHouse
