Director of SRE (FTE)

Leads SRE strategy, reliability, observability, incident management, and QA for cloud-native healthcare EMR platform. Oversees Azure AKS, Kubernetes, CI/CD, and vendor teams; requires 12+ years SRE experience with 5+ years leadership.

175k – 200kUnited StatesDevOps / SRERemote12+ YOE

Apply

About the role

Key Responsibilities

Own and execute the SRE strategy and multi-quarter roadmap across reliability, observability, incident management, QA maturity, and release engineering.
Define, measure, and continuously improve SLAs, SLOs, error budgets, uptime, performance, and operational health metrics across all products and services.
Lead production reliability for the full platform, including monitoring, alerting, on-call operations, incident response, root cause analysis, and MTTR reduction.
Establish release readiness standards, deployment safety controls, and quality gates to ensure stable and predictable product releases.
Manage external SRE vendors and partners, including service delivery, SLA governance, escalations, performance reviews, and compliance expectations.
Lead QA engineering strategy with a focus on automation, regression prevention, test coverage, and reducing escaped defects in production.
Partner with Security and Engineering leaders to ensure cloud infrastructure, CI/CD pipelines, and operational tooling meet HIPAA, SOC2, and internal security standards.
Oversee core platform operations including Azure AKS environments, Kubernetes, GitOps workflows, CI/CD pipelines, GitHub Actions, secrets management, access controls, and audit readiness.
Drive observability maturity using tools such as Grafana, Prometheus, logging platforms, tracing tools, and automated alerting frameworks.
Collaborate with Product, Platform, and Engineering teams to embed reliability and quality best practices throughout the software development lifecycle.
Build, mentor, and scale high-performing SRE and QA teams while fostering a culture of ownership, accountability, learning, and continuous improvement.
Drive adoption of AI-enabled automation and intelligent tooling to reduce manual toil, improve productivity, and strengthen operational excellence.

Technical Experience

Strong hands-on experience with cloud infrastructure, preferably Microsoft Azure, including AKS, networking, storage, IAM, and security services.
Deep expertise in Kubernetes, containerized workloads, and production-scale distributed systems.
Experience building and managing CI/CD pipelines using GitHub Actions, ArgoCD, Terraform, or similar DevOps tooling.
Strong background in monitoring, logging, tracing, and observability platforms such as Grafana, Prometheus, Datadog, Splunk, or equivalent.
Experience with scripting and automation using Python, Bash, PowerShell, or similar languages.
Strong understanding of release engineering, automated testing frameworks, QA tooling, and shift-left quality practices.
Experience supporting SaaS applications with uptime, scalability, and security requirements in regulated industries such as healthcare.
Knowledge of HIPAA, SOC2, vulnerability management, access controls, and infrastructure security best practices.
Familiarity with databases, APIs, networking, and troubleshooting across modern web application stacks.
Exposure to AI-powered DevOps / AIOps tooling for incident management, automation, and engineering productivity is a plus.

Requirements

12+ years of SRE, infrastructure, or platform engineering experience, with 5+ years of engineering leadership roles.
Proven track record owning site reliability for complex, multi-tenant SaaS platforms with demanding availability requirements.
Demonstrated experience defining SLA and SLO frameworks, error budgets, and incident management processes at scale.
Experience managing vendor relationships for managed infrastructure or SRE services, including SLA governance and performance management.
Track record leading QA or quality engineering functions, including test automation maturity and release gate ownership.
Strong communication and cross-functional influence skills — able to represent reliability to both technical and non-technical audiences.

Preferred Qualifications

Experience in healthcare technology, HIPAA-compliant environments, or other highly regulated SaaS industries.
Familiarity with FHIR-native or EMR/EHR platform architectures and their specific reliability requirements.
Experience implementing AI-assisted SRE automation including runbook generation, anomaly detection, or incident triage tooling.
Background working with Playwright or equivalent test automation frameworks in a QA leadership capacity.
Experience building internal SRE capability alongside a managed services provider.

Compensation

Base salary range: $175k-$200k, with variable component and stock options.

Skills

KubernetesMicrosoft AzureGrafanaPrometheusGitHub ActionsArgo CDTerraformDatadogSplunkPython

Similar roles

DevOps / SRE jobs

Ai2

Director of AI Infrastructure

Leads AI infrastructure including on-prem GPU clusters, hybrid cloud orchestration, storage, and resource allocation to support frontier AI research. Requires 12+ years experience with HPC leadership, GPU systems, Kubernetes, and distributed storage.

176k – 265kSeattle, WADevOps / SREOn-site12+ YOEGoAWS

Okta

Platform Staff Engineer- Universal Directory

Staff engineer on Universal Directory Platform team builds scalable distributed systems for identity management using Java and cloud infrastructure. Requires 7+ years Java experience, expertise in Spring Boot/Hibernate, and 3+ years deploying services on AWS/GCP.

194k – 243kSan Francisco, CADevOps / SREHybrid7+ YOEGoAWS

Stellar

Director of Site Reliability Engineering

Lead and develop a distributed SRE team, setting vision and operating model for reliability, infrastructure, and service ownership across engineering. Own core infrastructure services and drive operational maturity, incident response, and developer productivity.

210k – 310kSan Francisco, CADevOps / SREHybrid10+ YOESREAWS

Stellar

Director of Site Reliability Engineering

Lead and develop a distributed SRE team, setting vision and operating model for reliability practices. Own core infrastructure services (Kubernetes, CI/CD, observability) and drive service ownership frameworks across engineering teams.

210k – 310kNew York, NYDevOps / SREHybrid10+ YOESREAWS

Forge

Director of Platform & Reliability Engineering

The Director of Platform & Reliability Engineering will lead an engineering organization responsible for secure, scalable, and highly reliable products. This role involves setting the vision for internal platforms, cloud infrastructure, developer enablement, and production operations.

235k – 245kSan Francisco, CADevOps / SREHybrid8+ YOECI/CDKubernetes