Lead Site Reliability Engineer

170k – 200k | United States | DevOps / SRE | Remote | 5+ YOE

Summary

Leads the SRE function and architects reliable infrastructure for a secure collaboration platform, driving scalability, observability, and automation in cloud and hybrid environments. Requires 5+ years of SRE/DevOps experience, expertise in Kubernetes, Terraform, and AWS, and experience leading distributed teams; leadership experience in regulated sectors is preferred.

About the role

Responsibilities

  • Define the strategy, architecture, and roadmap for Mattermost's site reliability engineering function, aligning infrastructure initiatives with product and business goals.
  • Lead the design, deployment, and optimization of production-grade containerized workloads, infrastructure-as-code, and compliant cloud environments for regulated domains (e.g., FedRAMP, DoD).
  • Establish and evolve observability, monitoring, and alerting frameworks to ensure performance, reliability, and capacity planning at scale.
  • Drive incident management processes, including on-call rotations, root cause analysis, and systemic reliability improvements.
  • Partner with security and compliance teams to meet data sovereignty, security, and regulatory requirements.
  • Champion automation and operational excellence to improve efficiency, reduce risk, and scale operations.
  • Oversee cloud cost management and capacity planning to optimize infrastructure spending while meeting performance targets.
  • Build and maintain a developer platform that enables fast, secure software delivery and improves application stability in production.
  • Mentor and coach SRE team members, fostering a culture of learning, collaboration, and technical excellence.

Requirements

  • BS in Computer Science, Cybersecurity, Software Engineering, or a related technical field, or equivalent experience, with 5+ years of relevant experience in site reliability engineering, DevOps, or cloud infrastructure roles.
  • Proven expertise in container orchestration platforms, ideally Kubernetes.
  • Extensive experience with infrastructure-as-code, ideally Terraform.
  • Strong background in cloud platforms, ideally AWS.
  • Demonstrated experience designing and implementing monitoring, alerting, and performance optimization strategies.
  • Exceptional troubleshooting and incident management skills for distributed systems.
  • Proficiency in at least one scripting or programming language for automation.
  • Excellent communication skills with a track record of influencing cross-functional teams.
  • Experience leading globally distributed teams in a remote-first environment.

Preferences

  • Familiarity with observability stacks such as Grafana and Prometheus.
  • Experience designing high-availability, disaster recovery, and scaling architectures.
  • Exposure to GCP and Azure cloud environments.
  • Leadership experience in highly regulated industries such as defense, finance, or critical infrastructure.
  • Experience with U.S. federal compliance frameworks and authorization processes, including FedRAMP, DoD ATO, NIST 800-53, and related government standards.
  • Experience preparing, delivering, and maintaining software offerings through AWS Marketplace and other cloud provider marketplaces.
  • Open-source contributions in reliability, DevOps, or infrastructure tooling.
  • Certifications in cloud infrastructure, reliability, or DevOps engineering (e.g., CKA, CKAD, AWS Certified Solutions Architect).

Compensation

  • Posting Range: $170,000 – $200,000 USD

Skills

Kubernetes, Terraform, AWS, Grafana, Prometheus, GCP, Azure, Infrastructure as Code, Container Orchestration, Monitoring, Alerting, Incident Management, Automation, Distributed Systems