Cloud Infrastructure Engineer

135k – 240k | San Francisco, CA | California | DevOps / SRE | Hybrid | 5+ YOE | Apr 14

Summary

Designs, deploys, and improves scalable blockchain infrastructure using Kubernetes, Terraform, and cloud tools. Drives AI enablement, builds observability with Prometheus/Grafana, manages multi-cloud networks, and leads incident response. Requires 5+ years in SRE/infrastructure with strong automation focus.

About the role

What You'll Do

Architect and operate scalable, self-healing infrastructure leveraging Kubernetes, Terraform, and cloud-native tools across multi-region deployments.
Drive AI enablement across engineering — ensuring repos, tooling, and workflows are optimized for agentic development with tools like Claude Code, Cursor, and Codex.
Build AI-powered infrastructure tooling and automation (e.g., automated K8s upgrades, IaC plan analysis, cost optimization advisors, MCP servers, n8n workflows).
Build and maintain internal developer platform (IDP) capabilities for self-service deployments, observability, and reliability.
Develop observability frameworks using Prometheus and Grafana for metrics, dashboards, and alerting.
Lead incident management with blameless post-mortems; define and enforce SLIs, SLOs, and error budgets across services.
Design and manage multi-cloud, multi-region network architecture: VPC design, IPAM, DNS (Cloudflare), cross-cloud connectivity, security groups, and edge-proxy/Istio gateway configuration.
Collaborate with security teams to embed compliance into infrastructure, including IaC scanning and runtime protection.
Provide technical leadership and mentorship to elevate the team's operational capabilities.

What We're Looking For

5+ years as an Infrastructure Engineer focused on reliability (SRE, Production Engineer, Platform Engineer).
Experience driving company-wide reliability efforts, including SLO frameworks and error budget policies.
Strong proficiency with observability stacks: OpenTelemetry, Prometheus/Grafana.
Deep experience with cloud infrastructure (AWS/GCP), Kubernetes, and multi-region architectures.
Skilled with Terraform, Helm, and GitOps workflows (e.g., ArgoCD) with an automation-first mindset.
Experience leveraging agentic development tools (Claude Code, Cursor, Codex) and workflow automation (n8n) to accelerate IaC and build internal tooling is a strong plus.
Solid networking fundamentals: VPC design, DNS, IPAM, security groups, and cross-cloud connectivity. Service mesh experience (e.g., Istio) is a plus.
Calm and effective incident responder with a focus on systemic improvement.
Strong cross-functional communicator across SRE, security, and product engineering.
Experience with blockchain infrastructure, distributed systems, or high-throughput RPC is a plus but not required.

Benefits and Perks

Medical, Dental, & Vision
Gym Reimbursement
Home Office Build-out Budget
In-Office Group Meals
Wellbeing & Mental Health Perks
Learning & Development Stipend
Company Sponsored Conferences & Events
HSA and FSA Plans
Fertility Benefits
Competitive compensation including base salary and equity
401(k)
Unlimited flexible time off

Skills

Kubernetes, Terraform, AWS, GCP, Prometheus, Grafana, OpenTelemetry, Helm, ArgoCD, Istio, GitOps, Cloudflare
