Sr/Staff Site Reliability Engineer

Senior/Staff SRE responsible for building and maintaining cloud-native infrastructure on GCP using Terraform, Kubernetes, Istio, and observability tools to ensure reliability and scalability of fintech systems.

Chicago, IL | Redwood City, CA | DevOps / SRE | Hybrid | 6+ YOE

About the role

Responsibilities

Write Terraform modules for deploying infrastructure resources via GitLab pipelines
Develop Helm charts for deploying services and jobs in Kubernetes clusters
Define metrics, network policies, and routing rules for Istio service mesh
Monitor and maintain GCP BigQuery and Spanner databases
Pipe metrics to Google-managed Prometheus and build Grafana dashboards and alerts
Experiment with GCP offerings, third-party vendors, and open-source tools to automate and secure operations
Leverage LLMs in developing infrastructure and tooling
Pair with engineering leads to instrument and monitor critical functionality
Add automation to existing and new systems to reduce reliance on manual processes
Participate in architecture design and capacity planning discussions
Build, maintain, and improve CI/CD pipelines

Requirements

6+ years of experience building and maintaining large-scale cloud-native infrastructure (AWS and/or GCP)
Experience with Docker, Kubernetes, and Istio or similar service mesh
Experience with SQL databases such as MySQL, Google BigQuery, and Google Spanner
Experience with streaming technologies such as Kafka and Amazon Kinesis
Experience with pub/sub technologies such as AWS SNS and Google Pub/Sub
Experience with serverless technologies such as AWS Lambda and Google Cloud Functions/Cloud Run
Experience with Terraform
Experience with observability tools such as Datadog, Prometheus, and Grafana
Strong computer science and software engineering fundamentals
Experience with SOC2 compliance processes

Nice-to-Haves

Comfortable wearing many hats in a fast-paced environment
Willingness to learn, teach, and provide/receive feedback
Desire to automate processes and tinker with new technologies

Skills

Terraform, Kubernetes, Docker, Istio, GCP, BigQuery, Spanner, Prometheus, Grafana, Kafka
