Sr Software Engineer, Storage

175k – 205k | United States | Remote | 5+ YOE | Jun 17

Summary

Senior Software Engineer on the Storage team building autoscaling, self-healing infrastructure-as-code systems that manage petabyte-scale telemetry storage on AWS.

About the role

Responsibilities

Design and build autoscaling systems for storage clusters — automated provisioning, scale-up/scale-down policies, cluster rebalancing, and node lifecycle management.
Own the infrastructure-as-code stack (Terraform) that defines and deploys storage infrastructure end-to-end on AWS.
Build self-healing automation: health checks, automated failover, capacity rebalancing, and remediation controllers that resolve issues before they page anyone.
Develop the CI/CD pipelines and deployment tooling for storage services — safe rollouts, canary deployments, automated rollback.
Design and implement observability for the entire storage platform — metrics, dashboards, SLOs, alerting, and capacity forecasting that drive automated scaling decisions.
Own cluster management tooling: provisioning new tenants, managing cluster topology, coordinating upgrades and migrations with zero downtime.
Drive performance and cost optimization across the storage data path: ingest pipelines, compaction, partitioning, and query execution.
Partner with product engineering to define scalability limits, load test new features, and harden the system for production readiness.
Contribute to incident response and lead blameless post-mortems, turning operational surprises into systemic automation.

Requirements

Significant experience building platform/infrastructure systems that manage, scale, and operate distributed services autonomously.
Strong software engineering skills in TypeScript/Node.js, Go, or similar languages.
Deep hands-on experience with infrastructure-as-code (Terraform) and AWS services (EC2, ECS/EKS, ASGs, DynamoDB, S3, CloudWatch).
Experience designing and implementing autoscaling systems, cluster orchestration, or automated provisioning for stateful workloads.
Track record operating data-intensive systems at scale — OLAP databases, NoSQL stores, or distributed storage platforms.
Strong platform engineering fundamentals: SLOs, error budgets, capacity planning, incident response, and a bias toward eliminating toil through software.
Comfortable working with high autonomy in a remote, distributed team.
Strong understanding of Linux systems, networking, and performance profiling at the infrastructure level.

Nice-to-Haves

Experience with DynamoDB or similar NoSQL databases at high throughput — partition design, capacity management, GSI optimization.
Background in cluster management for OLAP or analytical databases — automated provisioning, rolling upgrades, replication topology.
Experience with object storage and data lake architectures (S3, Parquet/ORC formats).
Knowledge of data pipeline optimization: batching strategies, write amplification reduction, partition pruning, compaction policies.
Background in capacity planning, cost optimization, and resource forecasting for storage-heavy workloads on AWS.
Experience building internal platforms or developer tooling that other engineers consume.
Opinions about what makes a great on-call experience and a track record of making on-call better for everyone.

This position will require stand-by, on-call, or off-hours duties.

Skills

TypeScript, Node.js, Go, Terraform, AWS, EC2, ECS, EKS, ASG, DynamoDB, S3, CloudWatch, Linux, CI/CD, autoscaling
