
Software Engineer, Cloud Infrastructure

180k – 300k | Redwood City, CA | Hybrid
Summary

Designs, builds, and operates scalable multi-cloud infrastructure powering ML training, inference, and data curation pipelines. Collaborates with teams on AWS-focused systems using Kubernetes and IaC tools like Terraform.

About the role

What You'll Work On

  • Architect and maintain our multi-cloud infrastructure (primarily AWS, potentially Azure/GCP), with a focus on reliability, security, and scalability
  • Define and implement infrastructure-as-code best practices using Terraform, CloudFormation, Pulumi, and similar technologies
  • Design and manage Kubernetes-based systems for model training, inference, and data processing workloads
  • Optimize our CI/CD pipelines and streamline deployment of services across environments
  • Build monitoring, alerting, and logging systems to ensure high system availability and observability
  • Collaborate with research and engineering teams to provide infrastructure support for training large-scale ML models
  • Ensure our infrastructure supports various deployment models (cloud, on-prem, hybrid) for enterprise use cases
  • Drive cost-efficiency strategies across compute and storage resources
  • Respond to and resolve infrastructure-related incidents with a sense of ownership and urgency

About You

  • You've led or helped build robust infrastructure systems at a startup or fast-moving engineering organization
  • Deep experience working with cloud providers (especially AWS), and ideally exposure to multi-cloud or hybrid-cloud setups
  • Strong with Kubernetes, Terraform, and containerized architectures
  • Confident with systems-level debugging—networking issues, memory leaks, resource bottlenecks, etc.
  • Comfortable writing clean, maintainable scripts in Bash, Python, or Go
  • You care deeply about building secure and scalable systems and take pride in reliable infrastructure
  • You're collaborative, humble, and ready to own high-impact projects end-to-end

Nice to Have

  • Experience supporting infrastructure for ML workloads (training pipelines, inference clusters, GPU orchestration)
  • Experience building or scaling infrastructure for teams working with large-scale datasets
  • Exposure to cost monitoring and optimization tools in cloud environments
  • Background supporting compliance and security in enterprise deployments

Compensation

Salary ranges from $180,000 to $300,000.

Comprehensive benefits: 100% covered health benefits, 401(k) with 4% match, unlimited PTO, wellness and learning stipends, daily lunches/snacks, relocation assistance.

Skills
AWS, Kubernetes, Terraform, CI/CD, Python, Go, Bash, CloudFormation, Pulumi, GPU