Staff DevOps Engineer

Owns the full DevOps and infrastructure layer for an AI agent platform: Kubernetes on AWS, CI/CD, observability, email deliverability at scale, security, and cost optimization. Requires 6+ years of experience and deep Kubernetes and AWS expertise.

San Francisco, CA · New York · DevOps / SRE · Onsite · 6+ YOE

About the role

Responsibilities

  • Own Kubernetes cluster management, scaling, resource optimization, and compute layer for AI workloads, data pipelines, and customer infrastructure on AWS.
  • Build and maintain CI/CD pipelines with automated testing, staging, rollbacks, and feature flags.
  • Implement observability: monitoring, alerting, logging, and tracing with tools such as Datadog, Grafana, and Prometheus.
  • Manage email deliverability infrastructure: sender reputation, domain warming, DNS, IP management, SMTP.
  • Handle security and compliance: secrets management, access controls, network policies, vulnerability scanning, SOC 2.
  • Optimize costs for AI and data-intensive workloads.
  • Improve developer experience: local dev environments, fast builds, reliable deploys, documentation.

Requirements

  • 6+ years in DevOps, SRE, or infrastructure engineering owning production systems.
  • Deep expertise in Kubernetes, container orchestration, and cloud-native infrastructure on AWS.
  • Strong CI/CD pipeline experience.
  • Infrastructure-as-code (Terraform, Pulumi) and GitOps.
  • Networking, DNS, load balancing, security fundamentals.
  • Monitoring/observability tooling experience.
  • Scripting in Python, Bash, or Go.
  • Operating databases (PostgreSQL, Redis) in production.
  • Reliability, incident response, on-call practices.

Nice-to-Haves

  • Email infrastructure at scale.
  • ML/AI workloads in production.
  • Startup infrastructure from scratch.

Skills

Kubernetes, AWS, CI/CD, Terraform, GitOps, Datadog, Grafana, Prometheus, Postgres, Redis, Python, Bash, Go
