Lead Site Reliability Engineer

170k – 200k | United States | DevOps / SRE | Remote | 5+ YOE | Mar 31

Summary

Leads the SRE function and architects reliable infrastructure for a secure collaboration platform, driving scalability, observability, and automation in cloud and hybrid environments. Requires 5+ years of SRE/DevOps experience, expertise in Kubernetes, Terraform, and AWS, and leadership experience in regulated sectors.

About the role

Responsibilities

Define the strategy, architecture, and roadmap for Mattermost's site reliability engineering function, aligning infrastructure initiatives with product and business goals.
Lead the design, deployment, and optimization of production-grade containerized workloads, infrastructure-as-code, and compliant cloud environments for regulated domains (e.g., FedRAMP, DoD).
Establish and evolve observability, monitoring, and alerting frameworks to ensure performance, reliability, and capacity planning at scale.
Drive incident management processes, including on-call rotations, root cause analysis, and systemic reliability improvements.
Partner with security and compliance teams to meet data sovereignty, security, and regulatory requirements.
Champion automation and operational excellence to improve efficiency, reduce risk, and scale operations.
Oversee cloud cost management and capacity planning to optimize infrastructure spending while meeting performance targets.
Build and maintain a developer platform that enables fast, secure software delivery and improves application stability in production.
Mentor and coach SRE team members, fostering a culture of learning, collaboration, and technical excellence.

Requirements

BS in Computer Science, Cybersecurity, Software Engineering, or a related technical field, or equivalent experience, with 5+ years of relevant experience in site reliability engineering, DevOps, or cloud infrastructure roles.
Proven expertise in container orchestration platforms, ideally Kubernetes.
Extensive experience with infrastructure-as-code, ideally Terraform.
Strong background in cloud platforms, ideally AWS.
Demonstrated experience designing and implementing monitoring, alerting, and performance optimization strategies.
Exceptional troubleshooting and incident management skills for distributed systems.
Proficiency in at least one scripting or programming language for automation.
Excellent communication skills with a track record of influencing cross-functional teams.
Experience leading globally distributed teams in a remote-first environment.

Preferences

Familiarity with observability stacks such as Grafana and Prometheus.
Experience designing high-availability, disaster recovery, and scaling architectures.
Exposure to GCP and Azure cloud environments.
Leadership experience in highly regulated industries such as defense, finance, or critical infrastructure.
Experience with U.S. federal compliance frameworks and authorization processes, including FedRAMP, DoD ATO, NIST 800-53, and related government standards.
Experience preparing, delivering, and maintaining software offerings through AWS Marketplace and other cloud provider marketplaces.
Open-source contributions in reliability, DevOps, or infrastructure tooling.
Certifications in cloud infrastructure, reliability, or DevOps engineering (e.g., CKA, CKAD, AWS Certified Solutions Architect).

Compensation

Posting Range: $170,000 – $200,000 USD

Skills

Kubernetes, Terraform, AWS, Grafana, Prometheus, GCP, Azure, Infrastructure as Code, Container Orchestration, Monitoring, Alerting, Incident Management, Automation, Distributed Systems
