DevOps Engineer
United States | DevOps / SRE | Remote | 5+ YOE
Summary
Designs and maintains cloud infrastructure and Kubernetes clusters for GPU/ML workloads, and implements GitOps with ArgoCD and infrastructure as code with Terraform. Requires 5+ years of DevOps experience, Kubernetes expertise, AWS/GCP proficiency, and Python.
About the role
Key Responsibilities
- Design and implement cloud infrastructure from the ground up
- Build and maintain Kubernetes clusters optimized for GPU workloads and ML applications, as well as production SaaS hosting
- Implement GitOps practices using ArgoCD for continuous deployment
- Develop infrastructure as code using Terraform
- Create and maintain CI/CD pipelines for infrastructure and application deployment
- Implement monitoring and observability solutions for distributed systems
- Automate infrastructure management with Python and Bash
- Collaborate with ML engineers to optimize infrastructure for model training and serving
- Implement and maintain cost optimization strategies (FinOps) for cloud resources
- Monitor and optimize cloud spending, especially for GPU-intensive workloads
Must Have
- 5+ years of experience in cloud infrastructure and DevOps
- 3+ years of experience with Python
- Strong experience with AWS and GCP cloud platforms
- Deep expertise in Kubernetes, including multi-cluster management, GPU workload optimization, resource scheduling and autoscaling, and network policies and security
- Experience with GitOps tools (ArgoCD preferred)
- Extensive experience with cloud networking, including VPC design, load balancer configuration, network security and segmentation, and cross-cloud networking solutions
- Strong CI/CD expertise, preferably with GitHub Actions
- Proficiency in infrastructure as code (Terraform)
- Experience with monitoring and observability tools
- Experience with FinOps practices and cloud cost optimization
Nice to Have
- Experience with ML workflow tooling (MLflow, Kubeflow, or similar)
- Experience with FastAPI and backend applications
- Familiarity with data platforms like Databricks or Snowflake
- Exposure to SRE practices or cloud security certifications
- Hands-on experience with Prometheus, Grafana, or Datadog
Benefits
- Competitive compensation with salary and equity
- Comprehensive benefits, including medical, dental, and vision coverage, plus a 401(k)
- Paid parental leave for all new parents, inclusive of adoptive and surrogate journeys
- Relocation support for employees moving to join the team in one of our office locations
- A mission-driven, low-ego culture that values diversity of thought, ownership, and bias toward action
Skills
Kubernetes, AWS, GCP, Terraform, ArgoCD, Python, GitHub Actions, GitOps, CI/CD, Prometheus, Grafana, Datadog, MLflow, Kubeflow, FinOps