Software Engineer, Infrastructure

Owns and evolves Kubernetes-based infrastructure for secure, compliant AI deployments in financial services, including observability with Datadog, IaC with Terraform, and incident response. Requires 8+ years of experience with Docker, Kubernetes, AWS, and Python at scale.

$168k – $213k · San Francisco, CA · DevOps / SRE · Onsite · 8+ YOE

About the role

What You’ll Do

  • Own and evolve our Kubernetes infrastructure, including cluster management, service mesh configuration, and container security policies.
  • Design and implement progressive delivery pipelines with canary deployments, automated rollbacks, and deployment health validation.
  • Build and maintain our observability infrastructure in Datadog, including dashboards, monitors, SLOs, and distributed tracing.
  • Drive incident response for high-severity outages and proactively model capacity needs for low-latency AI inference.
  • Architect and automate secure infrastructure using Infrastructure-as-Code for VPCs, IAM policies, Kubernetes manifests, and private cloud deployments.
  • Maintain and improve the infrastructure controls that support our SOC 2 compliance posture.
  • Lead customer engagements for enterprise rollouts and mentor mid-level engineers on infrastructure best practices.

What We’re Looking For

Must-Haves:

  • 8+ years in infrastructure engineering or DevOps at high-growth or hyperscale companies.
  • Experience with Docker and Kubernetes, including production cluster management, Helm, and service mesh technologies.
  • A proven track record of architecting and operating AWS (preferred), GCP, or Azure at enterprise scale.
  • Experience with observability platforms, preferably Datadog (metrics, logs, APM, distributed tracing).
  • A strong background in Infrastructure-as-Code (Terraform, Helm, Kustomize) and safe deployment practices (progressive delivery, canary deployments, GitOps, automated rollbacks).
  • "Battle scars" from leading the response to outages and capacity events, and from running large-scale incident reviews.
  • Strong programming skills in Python.

Bonus Points:

  • Familiarity with TypeScript.
  • Direct involvement in SOC 2 or other compliance audit preparation or remediation.
  • Direct experience with private-cloud or on-premises deployments for regulated customers.
  • Previous experience at startups scaling infrastructure from the early stages to the enterprise level.
  • A background in fintech or building systems for highly regulated industries.
  • Experience with AI/ML infrastructure and model deployment at scale.

Compensation & Benefits

  • $168k - $213k + equity
  • Comprehensive healthcare, 401(k) matching, commuter benefits
  • 15 days PTO + holidays, unlimited sick days
  • Flexible leave options

Skills

Kubernetes, Docker, Helm, Datadog, Terraform, AWS, Python, GitOps, Service Mesh, Kustomize
