
DevOps Team Lead

United States | Engineering Management | Remote | 7+ YOE
Summary

Leads the DevOps team in architecting cloud infrastructure, managing Kubernetes for GPU/ML workloads, and implementing IaC with Terraform and GitOps via ArgoCD. Requires 7+ years of DevOps experience, including 3+ years in leadership, plus multi-cloud expertise and strong CI/CD skills.

About the role

Key Responsibilities

  • Lead and mentor a team of DevOps engineers, fostering technical growth and collaboration
  • Define and drive the infrastructure roadmap aligned with company objectives
  • Architect and oversee cloud infrastructure design and implementation
  • Establish best practices, standards, and processes for infrastructure development and operations
  • Partner with Engineering, Research, and FDE to align infrastructure capabilities with business needs
  • Drive the evolution of Kubernetes clusters optimized for GPU workloads, production SaaS hosting, and varied enterprise deployment models
  • Champion GitOps practices using ArgoCD for continuous deployment
  • Establish infrastructure as code standards using Terraform
  • Define monitoring and observability strategy for distributed systems
  • Collaborate with ML engineers to optimize infrastructure for model training and serving
  • Own infrastructure reliability, performance, and security posture
  • Implement and maintain cost optimization strategies (FinOps) for cloud resources

Must Have

  • 7+ years of experience in cloud infrastructure and DevOps, with 3+ years in a technical leadership role
  • Proven track record of building and leading high-performing infrastructure teams
  • Strong experience with AWS, GCP, and Azure
  • Deep expertise in Kubernetes, including multi-cluster management, GPU workload optimization, resource scheduling and autoscaling, and network policies and security
  • Extensive experience with cloud networking, including VPC design, load balancer configuration, network security and segmentation, and cross-cloud networking solutions
  • Strong CI/CD expertise, preferably with GitHub Actions
  • Proficiency in Terraform
  • Proficiency with GitOps tools (ArgoCD preferred)
  • 3+ years of experience with Python
  • Experience with monitoring and observability tools
  • Experience with FinOps practices and cloud cost optimization
  • Excellent communication skills, with the ability to translate technical concepts for diverse audiences

Nice to Have

  • Experience with ML workflow tooling (MLflow, Kubeflow, or similar)
  • Experience with FastAPI and backend applications
  • Familiarity with data platforms like Databricks or Snowflake
  • SRE practices experience or cloud security certifications
  • Hands-on experience with Prometheus, Grafana, or Datadog

Benefits

  • Competitive compensation with salary and equity
  • Comprehensive health coverage, including medical, dental, and vision, plus a 401(k) plan
  • Paid parental leave for all new parents, inclusive of adoptive and surrogate journeys
  • Relocation support for employees moving to join the team in one of our office locations
Skills
Kubernetes, AWS, GCP, Azure, Terraform, ArgoCD, GitHub Actions, Python, GitOps, FinOps, Prometheus, Grafana, Datadog, MLflow, Kubeflow