DevOps Engineer

United States | DevOps / SRE | Remote | 5+ YOE
Summary

Designs and maintains cloud infrastructure and Kubernetes clusters for GPU/ML workloads, and implements GitOps with ArgoCD and infrastructure as code with Terraform. Requires 5+ years of DevOps experience, Kubernetes expertise, AWS/GCP proficiency, and Python.

About the role

Key Responsibilities

  • Design and implement cloud infrastructure from the ground up
  • Build and maintain Kubernetes clusters optimized for GPU workloads and ML applications, as well as for production SaaS hosting
  • Implement GitOps practices using ArgoCD for continuous deployment
  • Develop infrastructure as code using Terraform
  • Create and maintain CI/CD pipelines for infrastructure and application deployment
  • Implement monitoring and observability solutions for distributed systems
  • Automate infrastructure management with Python and Bash
  • Collaborate with ML engineers to optimize infrastructure for model training and serving
  • Implement and maintain cost optimization strategies (FinOps) for cloud resources
  • Monitor and optimize cloud spending, especially for GPU-intensive workloads

Must Have

  • 5+ years of experience in cloud infrastructure and DevOps
  • 3+ years of experience with Python
  • Strong experience with AWS and GCP cloud platforms
  • Deep expertise in Kubernetes, including multi-cluster management, GPU workload optimization, resource scheduling and autoscaling, and network policies and security
  • Experience with GitOps tools (ArgoCD preferred)
  • Extensive experience with cloud networking, including VPC design, load balancer configuration, network security and segmentation, and cross-cloud networking solutions
  • Strong CI/CD expertise, preferably with GitHub Actions
  • Proficiency in infrastructure as code (Terraform)
  • Experience with monitoring and observability tools
  • Experience with FinOps practices and cloud cost optimization

Nice to Have

  • Experience with ML workflow tooling (MLflow, Kubeflow, or similar)
  • Experience with FastAPI and backend applications
  • Familiarity with data platforms like Databricks or Snowflake
  • Exposure to SRE practices or cloud security certifications
  • Hands-on experience with Prometheus, Grafana, or Datadog

Benefits

  • Competitive compensation with salary and equity
  • Comprehensive benefits, including medical, dental, and vision coverage and a 401(k)
  • Paid parental leave for all new parents, inclusive of adoptive and surrogate journeys
  • Relocation support for employees moving to join the team in one of our office locations
  • A mission-driven, low-ego culture that values diversity of thought, ownership, and bias toward action

Skills

Kubernetes, AWS, GCP, Terraform, ArgoCD, Python, GitHub Actions, GitOps, CI/CD, Prometheus, Grafana, Datadog, MLflow, Kubeflow, FinOps