Infrastructure Engineer (Observability)
180k – 200k | New York, NY; San Francisco, CA; Seattle, WA | DevOps / SRE | Remote | 5+ YOE
Summary
Builds and operates scalable observability platforms for metrics, logs, and traces across GPU and HPC infrastructure. Designs telemetry pipelines, alerting, and multi-tenant systems using Prometheus, Grafana, and Kafka. Requires 5+ years of SRE or infrastructure experience.
About the role
What You’ll Do
Observability Platform & Productization
- Own and evolve a scalable observability platform spanning metrics, logs, traces, and events
- Drive the productization of observability capabilities for both internal teams and external customers
- Design multi-tenant observability systems with scoped access, RBAC, and customer-facing visibility
- Continuously improve observability systems to keep pace with rapid infrastructure buildouts
Telemetry & Data Pipelines
- Design and operate telemetry pipelines ingesting data from GPUs, CPUs, networking (Ethernet & InfiniBand), containers, APIs, and BMC/Redfish
- Build systems to correlate signals across infrastructure layers to enable faster debugging and root cause analysis
- Implement streaming and real-time data pipelines using tools such as Kafka, OTEL, Promtail, or similar
Alerting, Reliability & Insights
- Design and implement noise-resistant alerting systems to improve signal quality and reduce operational load
- Create dashboards and alerting for InfraOps, Engineering, and Customer Success teams
- Build automated insights and enable proactive detection, forecasting, and system health visibility at scale
Systems & Infrastructure Engineering
- Contribute to broader infrastructure engineering projects beyond observability
- Partner with infrastructure and platform teams to embed observability into core systems and workflows
- Support large-scale, distributed systems across compute, networking, and storage environments
Cross-Functional Collaboration
- Work closely with customer-facing teams to deliver external observability experiences
- Collaborate with engineering, operations, and support teams to improve system transparency and reliability
- Help define best practices for observability across the organization
What You’ll Need
Required Qualifications
- 5+ years of experience in infrastructure engineering, SRE, or observability-focused roles
- Strong experience with monitoring systems such as Prometheus, Grafana, ELK, or VictoriaMetrics
- Experience building and operating observability platforms at scale
- Proficiency in Python, Go, or Bash for automation and data integration
- Familiarity with containerized environments and Kubernetes observability
- Experience with streaming telemetry pipelines (Kafka, OTEL, Promtail, or equivalent)
- Experience with multi-tenant monitoring architectures
- Strong written and verbal communication skills
Ideal Experience
- Experience with GPU observability, particularly NVIDIA DCGM
- Experience monitoring large-scale GPU or HPC clusters
- Familiarity with InfiniBand fabric observability
- Experience building customer-facing or productized infrastructure systems
- Experience with correlation engines, RCA workflows, or predictive alerting systems
- Broad exposure to infrastructure domains including networking, storage, and provisioning
Skills
Prometheus, Grafana, Kubernetes, Kafka, OTEL, Promtail, Python, Go, ELK, VictoriaMetrics, NVIDIA DCGM, InfiniBand