Infrastructure Engineer (Observability)

180k – 200k | New York, NY | San Francisco, CA | Seattle, WA | DevOps / SRE | Remote | 5+ YOE

Summary

Builds and operates scalable observability platforms for metrics, logs, and traces across GPU and HPC infrastructure. Designs telemetry pipelines, alerting, and multi-tenant systems using Prometheus, Grafana, and Kafka. Requires 5+ years of SRE or infrastructure experience.

About the role

What You’ll Do

Observability Platform & Productization

  • Own and evolve a scalable observability platform spanning metrics, logs, traces, and events
  • Drive the productization of observability capabilities for both internal teams and external customers
  • Design multi-tenant observability systems with scoped access, RBAC, and customer-facing visibility
  • Continuously improve observability systems to keep pace with rapid infrastructure buildouts

Telemetry & Data Pipelines

  • Design and operate telemetry pipelines ingesting data from GPUs, CPUs, networking (Ethernet & InfiniBand), containers, APIs, and BMC/Redfish
  • Build systems to correlate signals across infrastructure layers to enable faster debugging and root cause analysis
  • Implement streaming and real-time data pipelines using tools such as Kafka, OTEL, Promtail, or similar

Alerting, Reliability & Insights

  • Design and implement noise-resistant alerting systems to improve signal quality and reduce operational load
  • Create dashboards and alerting for InfraOps, Engineering, and Customer Success teams
  • Build automated insights and enable proactive detection, forecasting, and system health visibility at scale

Systems & Infrastructure Engineering

  • Contribute to broader infrastructure engineering projects beyond observability
  • Partner with infrastructure and platform teams to embed observability into core systems and workflows
  • Support large-scale, distributed systems across compute, networking, and storage environments

Cross-Functional Collaboration

  • Work closely with customer-facing teams to deliver external observability experiences
  • Collaborate with engineering, operations, and support teams to improve system transparency and reliability
  • Help define best practices for observability across the organization

What You’ll Need

Required Qualifications

  • 5+ years of experience in infrastructure engineering, SRE, or observability-focused roles
  • Strong experience with monitoring systems such as Prometheus, Grafana, ELK, or VictoriaMetrics
  • Experience building and operating observability platforms at scale
  • Proficiency in Python, Go, or Bash for automation and data integration
  • Familiarity with containerized environments and Kubernetes observability
  • Experience with streaming telemetry pipelines (Kafka, OTEL, Promtail, or equivalent)
  • Experience with multi-tenant monitoring architectures
  • Strong written and verbal communication skills

Ideal Experience

  • Experience with GPU observability, particularly NVIDIA DCGM
  • Experience monitoring large-scale GPU or HPC clusters
  • Familiarity with InfiniBand fabric observability
  • Experience building customer-facing or productized infrastructure systems
  • Experience with correlation engines, RCA workflows, or predictive alerting systems
  • Broad exposure to infrastructure domains including networking, storage, and provisioning

Skills

Prometheus, Grafana, Kubernetes, Kafka, OTEL, Promtail, Python, Go, ELK, VictoriaMetrics, NVIDIA DCGM, InfiniBand