Sr Software Engineer, Storage

175k – 205k | United States | Remote | 5+ YOE
Summary

Senior Software Engineer on the Storage team building autoscaling, self-healing infrastructure-as-code systems that manage petabyte-scale telemetry storage on AWS.

About the role

Responsibilities

  • Design and build autoscaling systems for storage clusters — automated provisioning, scale-up/scale-down policies, cluster rebalancing, and node lifecycle management.
  • Own the infrastructure-as-code stack (Terraform) that defines and deploys storage infrastructure end-to-end on AWS.
  • Build self-healing automation: health checks, automated failover, capacity rebalancing, and remediation controllers that resolve issues before they page anyone.
  • Develop the CI/CD pipelines and deployment tooling for storage services — safe rollouts, canary deployments, automated rollback.
  • Design and implement observability for the entire storage platform — metrics, dashboards, SLOs, alerting, and capacity forecasting that drive automated scaling decisions.
  • Own cluster management tooling: provisioning new tenants, managing cluster topology, coordinating upgrades and migrations with zero downtime.
  • Drive performance and cost optimization across the storage data path: ingest pipelines, compaction, partitioning, and query execution.
  • Partner with product engineering to define scalability limits, load test new features, and harden the system for production readiness.
  • Contribute to incident response and lead blameless post-mortems, turning operational surprises into systemic automation.

Requirements

  • Significant experience building platform/infrastructure systems that manage, scale, and operate distributed services autonomously.
  • Strong software engineering skills in TypeScript/Node.js, Go, or similar languages.
  • Deep hands-on experience with infrastructure-as-code (Terraform) and AWS services (EC2, ECS/EKS, ASGs, DynamoDB, S3, CloudWatch).
  • Experience designing and implementing autoscaling systems, cluster orchestration, or automated provisioning for stateful workloads.
  • Track record operating data-intensive systems at scale — OLAP databases, NoSQL stores, or distributed storage platforms.
  • Strong platform engineering fundamentals: SLOs, error budgets, capacity planning, incident response, and a bias toward eliminating toil through software.
  • Comfortable working with high autonomy in a remote, distributed team.
  • Strong understanding of Linux systems, networking, and performance profiling at the infrastructure level.

Nice-to-Haves

  • Experience with DynamoDB or similar NoSQL databases at high throughput — partition design, capacity management, GSI optimization.
  • Background in cluster management for OLAP or analytical databases — automated provisioning, rolling upgrades, replication topology.
  • Experience with object storage and data lake architectures (S3, Parquet/ORC formats).
  • Knowledge of data pipeline optimization: batching strategies, write amplification reduction, partition pruning, compaction policies.
  • Background in capacity planning, cost optimization, and resource forecasting for storage-heavy workloads on AWS.
  • Experience building internal platforms or developer tooling that other engineers consume.
  • Opinions about what makes a great on-call experience and a track record of making on-call better for everyone.

This position will require stand-by, on-call, or off-hours duties.

Skills
TypeScript, Node.js, Go, Terraform, AWS, EC2, ECS, EKS, ASG, DynamoDB, S3, CloudWatch, Linux, CI/CD, autoscaling