
Product Reliability Engineer

Owns end-to-end system reliability, incident response, observability, and proactive stability improvements in a serverless AWS environment. Requires 2+ years of software engineering experience in production-facing roles, strong debugging skills, and hands-on experience with AWS and Go or TypeScript.

100k – 160k | New York, NY | DevOps / SRE | Onsite | 2+ YOE

About the role

Reliability & Incident Response

  • Respond quickly to automated alerts and customer-reported issues
  • Triage, diagnose, and resolve production incidents with a bias toward permanent fixes over workarounds
  • Build and maintain incident response playbooks and postmortem processes
  • Coordinate cross-functionally with customer success managers and key account stakeholders to maintain customer trust in the event of an incident

Observability & Prevention

  • Design and instrument telemetry, logging, and alerting across our serverless AWS stack
  • Build dashboards and health metrics that surface issues before customers feel them
  • Identify recurring failure patterns and drive systemic fixes into the codebase
  • Reduce operational toil through automation

Product Stability

  • Contribute directly to the codebase: improve resilience, reduce tech debt, and build automation so that bugs are resolved quickly and with little human intervention
  • Partner with engineers on new feature launches to assess reliability risks before they ship
  • Make data-driven recommendations on where to invest in stability

What We're Looking For

  • 2+ years of software engineering experience, with meaningful time spent in reliability, platform, or production-facing roles
  • Strong debugging instincts and comfort tracing failures across distributed systems using logs, traces, and metrics
  • Hands-on experience with AWS (Lambda, SQS, RDS, CloudWatch or equivalent)
  • Comfortable reading and writing Go, TypeScript, or similar backend languages
  • Experience building or improving observability infrastructure (alerting, dashboards, telemetry)
  • High ownership mentality: you close the loop, you write the postmortem, you ship the fix
  • Strong plus: experience in legaltech, fintech, healthtech, or other high-sensitivity, always-on environments

Skills

AWS, AWS Lambda, SQS, RDS, CloudWatch, Go, TypeScript, Observability, Telemetry, Logging, Alerting, Dashboards, Distributed Systems, Debugging, Incident Response
