# Software Engineer, Infrastructure
**Company:** [Bretton AI](https://hotfix.jobs/companies/bretton-ai)
**Location:** San Francisco, CA
**Salary:** $168K-$213K
**Experience:** 8+ years
**Skills:** Kubernetes, Docker, Helm, Datadog, Terraform, AWS, Python, GitOps, Service Mesh, Kustomize
**Posted:** 2026-01-28
> Owns and evolves Kubernetes-based infrastructure for secure, compliant AI deployments in financial services, including observability with Datadog, IaC with Terraform, and incident response. Requires 8+ years of experience with Docker, K8s, AWS, and Python at scale.
## Job Description
## What You’ll Do
- Own and evolve our Kubernetes infrastructure, including cluster management, service mesh configuration, and container security policies.
- Design and implement progressive delivery pipelines with canary deployments, automated rollbacks, and deployment health validation.
- Build and maintain our observability infrastructure in Datadog, including dashboards, monitors, SLOs, and distributed tracing.
- Drive incident response for high-severity outages and proactively model capacity needs for low-latency AI inference.
- Architect and automate secure infrastructure using Infrastructure-as-Code for VPCs, IAM policies, Kubernetes manifests, and private cloud deployments.
- Maintain and improve the infrastructure controls that support our SOC 2 compliance posture.
- Lead customer engagements for enterprise rollouts and mentor mid-level engineers on infrastructure best practices.

## What We’re Looking For
**Must-Haves:**
- 8+ years in infrastructure engineering or DevOps at high-growth or hyperscale companies.
- Experience with **Docker** and **Kubernetes**, including production cluster management, **Helm**, and service mesh technologies.
- A proven track record of architecting and operating **AWS** (preferred), **GCP**, or **Azure** at enterprise scale.
- Experience with observability platforms, preferably **Datadog** (metrics, logs, APM, distributed tracing).
- A strong background in **Infrastructure-as-Code** (**Terraform**, **Helm**, **Kustomize**) and safe deployment practices (progressive delivery, canary deployments, **GitOps**, automated rollbacks).
- "Battle scars" from leading outages, capacity events, and large-scale incident reviews.
- Strong programming skills in **Python**.

**Bonus Points:**
- Familiarity with **TypeScript**.
- Direct involvement in SOC 2 or other compliance audit preparation or remediation.
- Direct experience with private-cloud or on-premises deployments for regulated customers.
- Previous experience at startups scaling infrastructure from the early stages to the enterprise level.
- A background in fintech or building systems for highly regulated industries.
- Experience with AI/ML infrastructure and model deployment at scale.

## Compensation & Benefits
- $168k - $213k + equity
- Comprehensive healthcare, 401(k) matching, commuter benefits
- 15 days PTO + holidays, unlimited sick days
- Flexible leave options

**Apply:** https://hotfix.jobs/jobs/software-engineer-infrastructure-at-bretton-ai-8b9c1c26-933e-43cb-9dc8-0527aa71733b
**Canonical:** https://hotfix.jobs/jobs/software-engineer-infrastructure-at-bretton-ai-8b9c1c26-933e-43cb-9dc8-0527aa71733b