Senior Site Reliability Engineer

Senior Site Reliability Engineer building and operating highly reliable, scalable Kubernetes-based cloud services in Okta's Emerging Products Group. Lead incident response, define SLOs, develop automation in Go/Python/Terraform, improve observability, and mentor on reliability best practices.

San Francisco, CA · DevOps / SRE · Hybrid · 5+ YOE

About the role

What You'll Be Doing

Reliability & Operations

  • Design, build, and operate large-scale cloud infrastructure and production services.
  • Participate in an on-call rotation supporting highly available customer-facing systems.
  • Lead incident response efforts and drive post-incident reviews focused on systemic improvements.
  • Define, measure, and improve Service Level Indicators (SLIs), Service Level Objectives (SLOs), and error budgets.
  • Partner with engineering teams to improve service availability, scalability, performance, and resilience.
  • Continuously improve observability through metrics, logging, tracing, dashboards, and alerting.

Engineering & Automation

  • Develop software, automation, and infrastructure using Go, Python, Terraform, and related technologies.
  • Eliminate operational toil through automation, tooling, and platform engineering.
  • Improve deployment safety and operational workflows through CI/CD and GitOps practices.
  • Collaborate on modernizing existing workloads and aligning them with evolving platform capabilities.
  • Build self-service platforms, operational guardrails, and automation that improve developer velocity while maintaining reliability and security.

Technical Leadership

  • Contribute to and drive reliability initiatives within the product group.
  • Guide engineers in adopting operational best practices and reliability engineering principles.
  • Mentor engineers through technical collaboration, design reviews, incident analysis, and knowledge sharing.
  • Support architecture and operational decisions through data-driven recommendations and engineering expertise.
  • Execute projects from conception through production rollout and long-term operational ownership.

Innovation

  • Explore and apply AI-assisted engineering techniques to improve operational efficiency, incident response, troubleshooting, and automation.
  • Identify opportunities to leverage emerging technologies to reduce toil and improve engineering productivity.

Our Tech Stack

  • Infrastructure/Orchestration: Kubernetes (EKS/GKE), Terraform, Helm, Git, Argo CD, GitOps
  • Programming: Golang, Python
  • Observability: Datadog, Splunk
  • Data Stores: PostgreSQL, Redis, OpenSearch

What We Are Looking For

Technical Excellence

  • Strong experience operating large-scale production services in AWS and/or GCP.
  • Deep expertise with Kubernetes in production environments.
  • Experience troubleshooting Kubernetes networking, storage, scheduling, scaling, and workload lifecycle issues.
  • Extensive experience with Infrastructure as Code technologies such as Terraform and Helm.
  • Strong software engineering skills in Golang and/or Python.
  • Experience building automation and internal engineering platforms.
  • Experience operating and troubleshooting distributed data platforms such as PostgreSQL, Redis, OpenSearch, MySQL, Cassandra, or similar technologies.
  • Strong understanding of cloud networking fundamentals including DNS, load balancing, ingress, TLS, service networking, and traffic management.
  • Experience with observability platforms, monitoring strategies, and production telemetry.
  • Experience with or strong interest in AI-assisted engineering and operational automation.

Operational Excellence

  • Strong expertise operating customer-facing production systems.
  • Experience leading incident response and driving operational improvements.
  • Deep understanding of reliability engineering concepts including SLIs, SLOs, error budgets, and capacity planning.
  • Strong understanding of CI/CD pipelines, deployment strategies, and automation-first operational practices.
  • Proven ability to balance reliability, scalability, security, and engineering velocity.

Security & Compliance

  • Understanding of cloud security fundamentals, IAM, secrets management, and secure infrastructure design.
  • Experience implementing operational controls and best practices in regulated or security-sensitive environments is a plus.

Leadership

  • Demonstrated experience contributing to complex engineering initiatives.
  • Strong collaboration and communication skills.
  • Experience working effectively within globally distributed engineering organizations spanning multiple time zones and cultures.
  • Experience mentoring engineers and elevating technical capabilities within an organization.
  • Ability to collaborate on technical direction through expertise, partnership, and execution.

Preferred Qualifications

  • Experience operating SaaS platforms serving large-scale customer workloads.
  • Experience working within Kubernetes-based microservices environments.
  • Experience supporting globally distributed production environments.
  • Experience with GitOps and ArgoCD.
  • Experience implementing AI-assisted operational tooling or automation workflows.

Skills

Kubernetes, Terraform, Go, Python, AWS, GCP, Postgres, Redis, OpenSearch, Datadog, Splunk, Argo CD, Helm, GitOps
