Site Reliability Engineer II

114k – 235k | San Francisco, CA | DevOps / SRE | Remote | 4+ YOE
Summary

Operate and scale a cloud-native CTV advertising platform on AWS and Kubernetes. Focus on reliability, GitOps workflows, infrastructure automation, observability, and incident response.

About the role

What you’ll do

  • Ensure the reliability, availability, and performance of production infrastructure and platform services
  • Operate and scale Kubernetes platforms, including governance and support for multi-tenant workloads
  • Manage GitOps-based deployment workflows using ArgoCD and Helm
  • Support infrastructure provisioning and change management through Terraform/Terragrunt
  • Build and support CI/CD automation and deployment workflows using GitHub Actions
  • Participate in incident response, root cause analysis, and post-incident improvement initiatives
  • Reduce operational toil through scripting, tooling, and process automation
  • Advance observability practices across logs, metrics, traces, dashboards, and alerting
  • Support secure secrets integration, IAM-aware operations, and platform guardrails
  • Partner closely with application, security, and platform teams to improve reliability and delivery outcomes

What we're looking for

  • 4+ years of experience in Site Reliability Engineering, DevOps, Platform Engineering, or Cloud Infrastructure
  • Strong hands-on experience operating AWS in production environments
  • Solid expertise in Kubernetes, including cluster operations, troubleshooting, workload reliability, and platform administration
  • Experience with Kubernetes multi-tenancy, including namespaces, RBAC, quotas, policies, and tenant isolation patterns
  • Experience implementing and operating ArgoCD within a GitOps delivery model
  • Strong hands-on experience with Helm
  • Experience with Terraform/Terragrunt for infrastructure provisioning and environment management
  • Solid scripting and automation skills using Bash and/or Python
  • Experience building, maintaining, or supporting CI/CD pipelines, ideally using GitHub Actions
  • Strong troubleshooting skills across Linux, containers, IAM, networking, and distributed systems
  • Experience with monitoring, alerting, and observability in production environments
  • Demonstrated ownership mindset with experience handling incidents and resolving production issues
  • Strong collaboration and communication skills, with the ability to work effectively across engineering, security, and platform teams
  • Bachelor’s degree in computer science, engineering, or a related field, or equivalent experience
  • Demonstrated ability to use AI to improve the speed and quality of your day-to-day work
  • Strong track record of critical evaluation and verification of AI-assisted work (e.g., testing, source-checking, data validation, peer review)
  • High integrity and ownership: you protect sensitive data, avoid over-reliance on AI, and remain accountable for final decisions and deliverables
Skills
AWS, Kubernetes, EKS, ArgoCD, Helm, Terraform, Terragrunt, GitHub Actions, Bash, Python, Linux, IAM, CI/CD, Observability