
Software Engineer, Site Reliability

160k – 300k · New York, NY · San Francisco, CA · DevOps / SRE · Onsite · 5+ YOE

Summary

Site Reliability Engineer who owns production services end-to-end, writes production code, builds observability and internal tooling, and embeds with product teams to improve reliability and performance. Requires 5+ years of experience and strong systems programming skills.

About the role

Responsibilities

  • Own critical production services end-to-end, from design and code review through deployment, operation, and incident response
  • Profile, benchmark, and rewrite hot paths to eliminate bottlenecks as Hebbia scales
  • Lead incident response and drive post-mortem culture, translating findings into code changes and architectural improvements rather than runbooks
  • Design and build observability frameworks from scratch, writing custom instrumentation, alerting logic, and debugging tooling
  • Define and enforce SLOs across platform services and build the feedback loops that keep engineering teams accountable
  • Own capacity planning and cost efficiency: model growth, right-size infrastructure, and write automation that prevents over-provisioning
  • Build robust, well-tested internal platforms and deployment tooling held to the same engineering standards as customer-facing code
  • Own and continuously improve CI/CD systems
  • Embed with product engineering teams as a peer software engineer, contributing directly to production codebases and co-designing systems for reliability
  • Partner on infrastructure security through threat modeling, hardening, and automated compliance tooling

Requirements

  • 5+ years of software development with a track record of writing, shipping, and maintaining production services
  • Production-grade proficiency in at least one systems or backend language: Go, Python, C++, or Rust
  • Proven experience as a Production Engineer, SRE, or software engineer with a deep infrastructure focus
  • Deep understanding of distributed systems
  • Container orchestration expertise and hands-on experience debugging complex distributed failures in production
  • Working knowledge of OS-level concepts
  • Cloud platform fluency (AWS preferred)
  • Experience building and maintaining observability stacks
  • Strong CI/CD pipeline expertise and a track record of improving developer velocity without sacrificing safety

Nice-to-Haves

  • Background at a company with a Production Engineering or software-focused SRE culture
  • Experience building platforms for AI/ML workloads or high-throughput document processing pipelines

Skills

Go, Python, C++, Rust, AWS, Kubernetes, Docker, CI/CD, Observability, Distributed Systems