# Senior Site Reliability Engineer
**Company:** [Okta](https://hotfix.jobs/companies/okta)
**Location:** San Francisco, CA
**Experience:** 5+ years
**Skills:** Kubernetes, Terraform, Go, Python, AWS, GCP, Postgres, Redis, OpenSearch, Datadog, Splunk, Argo CD, Helm, GitOps
**Posted:** 2026-06-29
> Senior Site Reliability Engineer building and operating highly reliable, scalable Kubernetes-based cloud services in Okta's Emerging Products Group. Lead incident response, define SLOs, develop automation in Go/Python/Terraform, improve observability, and mentor on reliability best practices.
## Job Description
## What You'll Be Doing

### Reliability & Operations
- Design, build, and operate large-scale cloud infrastructure and production services.
- Participate in an on-call rotation supporting highly available customer-facing systems.
- Lead incident response efforts and drive post-incident reviews focused on systemic improvements.
- Define, measure, and improve Service Level Indicators (SLIs), Service Level Objectives (SLOs), and error budgets.
- Partner with engineering teams to improve service availability, scalability, performance, and resilience.
- Continuously improve observability through metrics, logging, tracing, dashboards, and alerting.

### Engineering & Automation
- Develop software, automation, and infrastructure using Go, Python, Terraform, and related technologies.
- Eliminate operational toil through automation, tooling, and platform engineering.
- Improve deployment safety and operational workflows through CI/CD and GitOps practices.
- Collaborate on modernizing existing workloads and aligning them with evolving platform capabilities.
- Build self-service platforms, operational guardrails, and automation that improve developer velocity while maintaining reliability and security.

### Technical Leadership
- Contribute to and drive reliability initiatives within the product group.
- Guide engineers in adopting operational best practices and reliability engineering principles.
- Mentor engineers through technical collaboration, design reviews, incident analysis, and knowledge sharing.
- Support architecture and operational decisions through data-driven recommendations and engineering expertise.
- Execute projects from conception through production rollout and long-term operational ownership.

### Innovation
- Explore and apply AI-assisted engineering techniques to improve operational efficiency, incident response, troubleshooting, and automation.
- Identify opportunities to leverage emerging technologies to reduce toil and improve engineering productivity.

## Our Tech Stack
- **Infrastructure/Orchestration**: Kubernetes (EKS/GKE), Terraform, Helm, Git, Argo CD, GitOps
- **Programming**: Golang, Python
- **Observability**: Datadog, Splunk
- **Data Stores**: PostgreSQL, Redis, OpenSearch

## What We Are Looking For

### Technical Excellence
- Strong experience operating large-scale production services in AWS and/or GCP.
- Deep expertise with Kubernetes in production environments.
- Experience troubleshooting Kubernetes networking, storage, scheduling, scaling, and workload lifecycle issues.
- Extensive experience with Infrastructure as Code technologies such as Terraform and Helm.
- Strong software engineering skills in Golang and/or Python.
- Experience building automation and internal engineering platforms.
- Experience operating and troubleshooting distributed data platforms such as PostgreSQL, Redis, OpenSearch, MySQL, Cassandra, or similar technologies.
- Strong understanding of cloud networking fundamentals including DNS, load balancing, ingress, TLS, service networking, and traffic management.
- Experience with observability platforms, monitoring strategies, and production telemetry.
- Experience with or strong interest in AI-assisted engineering and operational automation.

### Operational Excellence
- Strong expertise in operating customer-facing production systems.
- Experience leading incident response and driving operational improvements.
- Deep understanding of reliability engineering concepts including SLIs, SLOs, error budgets, and capacity planning.
- Strong understanding of CI/CD pipelines, deployment strategies, and automation-first operational practices.
- Proven ability to balance reliability, scalability, security, and engineering velocity.

### Security & Compliance
- Understanding of cloud security fundamentals, IAM, secrets management, and secure infrastructure design.
- Experience implementing operational controls and best practices in regulated or security-sensitive environments is a plus.

### Leadership
- Demonstrated experience contributing to complex engineering initiatives.
- Strong collaboration and communication skills.
- Experience working effectively within globally distributed engineering organizations spanning multiple time zones and cultures.
- Experience mentoring engineers and elevating technical capabilities within an organization.
- Ability to collaborate on technical direction through expertise, partnership, and execution.

### Preferred Qualifications
- Experience operating SaaS platforms serving large-scale customer workloads.
- Experience working within Kubernetes-based microservices environments.
- Experience supporting globally distributed production environments.
- Experience with GitOps and Argo CD.
- Experience implementing AI-assisted operational tooling or automation workflows.

**Apply:** https://hotfix.jobs/jobs/senior-site-reliability-engineer-at-okta-20426341-7e13-4e4e-af68-a6960013ba6b
**Canonical:** https://hotfix.jobs/jobs/senior-site-reliability-engineer-at-okta-20426341-7e13-4e4e-af68-a6960013ba6b