
DevOps Engineer

Designs, builds, and operates reliable cloud infrastructure for real-time voice AI systems. Owns Kubernetes clusters, CI/CD pipelines, observability, and security using AWS and IaC tools. Requires 5+ years of DevOps experience and strong Python and async programming skills.

180k – 260k · San Francisco, CA · DevOps / SRE · Onsite · 5+ YOE

About the role

Responsibilities

  • Design, build, and operate highly reliable cloud infrastructure that powers real-time voice AI systems with extremely low latency and high availability.
  • Own Kubernetes clusters end-to-end: provisioning, scaling, upgrades, networking, and debugging production incidents under real customer load.
  • Build, maintain, and evolve infrastructure as code using tools like Terraform, Pulumi, or CloudFormation to ensure repeatable, auditable, and secure environments across staging and production.
  • Create and operate CI/CD pipelines that enable fast, safe iteration across multiple microservices and teams.
  • Design and maintain observability systems (metrics, logs, traces, alerting) to detect failures early and rapidly diagnose production issues.
  • Partner with backend engineers to translate application requirements into scalable, secure infrastructure and clean deployment workflows.
  • Harden systems through strong security practices including IAM, secrets management, network isolation, and least-privilege access controls.
  • Optimize cloud performance and costs while maintaining reliability, developer velocity, and customer experience.
  • Implement and operate GitOps-driven deployment workflows, using Git as the source of truth for infrastructure and application state, enabling safe, auditable, and automated rollouts.
  • Lead incident response: investigate outages, coordinate fixes, write postmortems, and drive systemic reliability improvements.
  • Continuously improve resilience through load testing, chaos testing, capacity planning, and proactive infrastructure upgrades.

Qualifications

  • 5+ years as a DevOps engineer
  • Experience writing async web apps using FastAPI in Python
  • Track record of building APIs, cloud infrastructure, and CI/CD pipelines
  • Experience with IaC, AWS, and database management at scale
  • Understanding of sound architecture and security practices
  • Strong technical and communication skills
  • Extensive experience with AWS and Kubernetes

Software Stack

  • Backend: Python, microservices, async programming
  • Cloud & Infrastructure: AWS, GCP, Kubernetes, Redis, Argo CD, GitOps
  • Databases: Firebase, Supabase (PostgreSQL)
  • Frontend: Next.js
  • Observability & Monitoring: Datadog, logging, metrics, tracing
  • Telephony & Voice AI: SIP, voice APIs, real-time call handling
  • Other tools & practices: CI/CD, automated testing, resilient architecture

Skills

Kubernetes, AWS, Terraform, CI/CD, GitOps, Python, FastAPI, Argo CD, Datadog, Redis, Postgres, GCP, Pulumi, CloudFormation, IAM
