DevOps Team Lead
United States | Engineering Management | Remote | 7+ YOE
Summary
Leads the DevOps team in architecting cloud infrastructure, managing Kubernetes for GPU/ML workloads, and implementing IaC with Terraform and GitOps via ArgoCD. Requires 7+ years of DevOps experience, including 3+ years in leadership, along with multi-cloud expertise and strong CI/CD skills.
About the role
Key Responsibilities
- Lead and mentor a team of DevOps engineers, fostering technical growth and collaboration
- Define and drive the infrastructure roadmap aligned with company objectives
- Architect and oversee cloud infrastructure design and implementation
- Establish best practices, standards, and processes for infrastructure development and operations
- Partner with Engineering, Research, and FDE to align infrastructure capabilities with business needs
- Drive the evolution of Kubernetes clusters optimized for GPU workloads, production SaaS hosting, and varied enterprise deployment models
- Champion GitOps practices using ArgoCD for continuous deployment
- Establish infrastructure as code standards using Terraform
- Define monitoring and observability strategy for distributed systems
- Collaborate with ML engineers to optimize infrastructure for model training and serving
- Own infrastructure reliability, performance, and security posture
- Implement and maintain cost optimization strategies (FinOps) for cloud resources
Must Have
- 7+ years of experience in cloud infrastructure and DevOps, with 3+ years in a technical leadership role
- Proven track record of building and leading high-performing infrastructure teams
- Strong experience with AWS, GCP, and Azure
- Deep expertise in Kubernetes, including multi-cluster management, GPU workload optimization, resource scheduling and autoscaling, and network policies and security
- Extensive experience with cloud networking, including VPC design, load balancer configuration, network security and segmentation, and cross-cloud networking solutions
- Strong CI/CD expertise, preferably with GitHub Actions
- Proficiency in Terraform
- Proficiency with GitOps tools (ArgoCD preferred)
- 3+ years of experience with Python
- Experience with monitoring and observability tools
- Experience with FinOps practices and cloud cost optimization
- Excellent communication skills, with the ability to translate technical concepts for diverse audiences
Nice to Have
- Experience with ML workflow tooling (MLflow, Kubeflow, or similar)
- Experience with FastAPI and backend applications
- Familiarity with data platforms like Databricks or Snowflake
- Experience with SRE practices, or cloud security certifications
- Hands-on experience with Prometheus, Grafana, or Datadog
Benefits
- Competitive compensation with salary and equity
- Comprehensive health coverage, including medical, dental, and vision, plus a 401(k)
- Paid parental leave for all new parents, inclusive of adoptive and surrogate journeys
- Relocation support for employees moving to join the team in one of our office locations
Skills
Kubernetes, AWS, GCP, Azure, Terraform, ArgoCD, GitHub Actions, Python, GitOps, FinOps, Prometheus, Grafana, Datadog, MLflow, Kubeflow