Software Engineer, Infrastructure

Owns and evolves Kubernetes-based infrastructure for secure, compliant AI deployments in financial services, including observability with Datadog, IaC with Terraform, and incident response. Requires 8+ years of experience with Docker, Kubernetes, AWS, and Python at scale.

$168k – $213k | San Francisco, CA | DevOps / SRE | Onsite | 8+ YOE

About the role

What You’ll Do

Own and evolve our Kubernetes infrastructure, including cluster management, service mesh configuration, and container security policies.
Design and implement progressive delivery pipelines with canary deployments, automated rollbacks, and deployment health validation.
Build and maintain our observability infrastructure in Datadog, including dashboards, monitors, SLOs, and distributed tracing.
Drive incident response for high-severity outages and proactively model capacity needs for low-latency AI inference.
Architect and automate secure infrastructure using Infrastructure-as-Code for VPCs, IAM policies, Kubernetes manifests, and private cloud deployments.
Maintain and improve the infrastructure controls that support our SOC 2 compliance posture.
Lead customer engagements for enterprise rollouts and mentor mid-level engineers on infrastructure best practices.

What We’re Looking For

Must-Haves:

8+ years in infrastructure engineering or DevOps at high-growth or hyperscale companies.
Experience with Docker and Kubernetes, including production cluster management, Helm, and service mesh technologies.
A proven track record of architecting and operating AWS (preferred), GCP, or Azure at enterprise scale.
Experience with observability platforms, preferably Datadog (metrics, logs, APM, distributed tracing).
A strong background in Infrastructure-as-Code (Terraform, Helm, Kustomize) and safe deployment practices (progressive delivery, canary deployments, GitOps, automated rollbacks).
"Battle scars" from leading outages, capacity events, and large-scale incident reviews.
Strong programming skills in Python.

Bonus Points:

Familiarity with TypeScript.
Direct involvement in SOC 2 or other compliance audit preparation or remediation.
Direct experience with private-cloud or on-premises deployments for regulated customers.
Previous experience at startups scaling infrastructure from the early stages to the enterprise level.
A background in fintech or building systems for highly regulated industries.
Experience with AI/ML infrastructure and model deployment at scale.

Compensation & Benefits

$168k - $213k + equity
Comprehensive healthcare, 401k matching, commuter benefits
15 days PTO + holidays, unlimited sick days
Flexible leave options

Skills

Kubernetes, Docker, Helm, Datadog, Terraform, AWS, Python, GitOps, Service Mesh, Kustomize
