DevOps Engineer

Designs, builds, and operates reliable cloud infrastructure for real-time voice AI systems. Owns Kubernetes clusters, CI/CD pipelines, observability, and security using AWS and IaC tools. Requires 5+ years of DevOps experience and strong Python and async programming skills.

180k – 260k | San Francisco, CA | DevOps / SRE | Onsite | 5+ YOE

About the role

Responsibilities

Design, build, and operate highly reliable cloud infrastructure that powers real-time voice AI systems with extremely low latency and high availability.
Own Kubernetes clusters end-to-end: provisioning, scaling, upgrades, networking, and debugging production incidents under real customer load.
Build, maintain, and evolve infrastructure as code using tools like Terraform, Pulumi, or CloudFormation to ensure repeatable, auditable, and secure environments across staging and production.
Create and operate CI/CD pipelines that enable fast, safe iteration across multiple microservices and teams.
Design and maintain observability systems (metrics, logs, traces, alerting) to detect failures early and rapidly diagnose production issues.
Partner with backend engineers to translate application requirements into scalable, secure infrastructure and clean deployment workflows.
Harden systems through strong security practices including IAM, secrets management, network isolation, and least-privilege access controls.
Optimize cloud performance and costs while maintaining reliability, developer velocity, and customer experience.
Implement and operate GitOps-driven deployment workflows, using Git as the source of truth for infrastructure and application state, enabling safe, auditable, and automated rollouts.
Lead incident response: investigate outages, coordinate fixes, write postmortems, and drive systemic reliability improvements.
Continuously improve resilience through load testing, chaos testing, capacity planning, and proactive infrastructure upgrades.

Qualifications

5+ years of experience as a DevOps engineer
Experience writing async web applications in Python with FastAPI
Track record of building APIs, cloud infrastructure, and CI/CD pipelines
Experience with IaC, AWS, and database management at scale
Solid understanding of sound architecture and security practices
Strong technical and communication skills
Extensive experience with AWS and Kubernetes

Software Stack

Backend: Python, microservices, async programming
Cloud & Infrastructure: AWS, GCP, Kubernetes, Redis, Argo CD, GitOps
Databases: Firebase, Supabase (PostgreSQL)
Frontend: Next.js
Observability & Monitoring: Datadog, logging, metrics, tracing
Telephony & Voice AI: SIP, voice APIs, real-time call handling
Other tools & practices: CI/CD, automated testing, resilient architecture

Skills

Kubernetes, AWS, Terraform, CI/CD, GitOps, Python, FastAPI, Argo CD, Datadog, Redis, Postgres, GCP, Pulumi, CloudFormation, IAM
