Senior Software Engineer, Infrastructure
180k – 250k | Boston, MA | DevOps / SRE | Hybrid | 5+ YOE
Summary
Senior Infrastructure Engineer responsible for building and operating platform primitives including Kubernetes, CI/CD, observability, and developer tooling at a high-growth AI and data platform company.
About the role
Responsibilities
- Steward core platform services: Implement container orchestration, service mesh, ingress, and secrets management at scale.
- Cross-functional partnership: Collaborate with Product, Engineering, Data, and Security to deliver external and internal value.
- Harden reliability: Improve observability (logging, metrics, tracing) and automated remediation to increase availability and reduce latency.
- Automate everything: Use infrastructure-as-code and configuration management to make systems and processes repeatable, auditable, and secure.
- Scale cost-effectively: Optimize cluster utilization, autoscaling, and storage/networking to balance performance, reliability, and spend.
- Level-up developer experience: Build internal tooling, templates, and golden paths that reduce cognitive load and time-to-first-deploy for product teams.
- On-call & incident response: Participate in a sustainable on-call rotation, drive post-mortems, eliminate toil, and reduce MTTR via automation.
- Enable fast, safe delivery: Evolve CI/CD pipelines (build/test/release) and environment strategies (dev/stage/prod).
- AI: Build using agentic tools (Claude Code, Codex, etc.) and push the boundaries of agentic development.
Requirements
- 5+ years of experience in software engineering with a focus on infrastructure, DevOps, and/or platform engineering.
- Team-focused mindset, with solid collaboration and communication skills and an emphasis on enabling others.
- Pragmatic problem-solver who communicates clearly, documents well, and thrives in fast-moving, high-ownership environments.
- Experience working with cloud infrastructure, specifically Kubernetes.
- Understanding of observability: metrics, logs, traces, and building actionable alerts/SLOs.
- Familiarity with infrastructure-as-code tools.
- Some programming experience in at least one modern programming language.
- Awareness of security fundamentals: IAM, workload identity, network policies, encryption, and secrets management.
Nice to Haves
- Open source contributions.
- Experience at a company transitioning from startup to high growth.
- Google Cloud Platform.
- Terraform.
- Python, Go, and/or JavaScript (TypeScript).
- Building and managing CI/CD systems and developer tooling.
Skills
Kubernetes, Terraform, Google Cloud Platform, Python, Go, JavaScript, TypeScript, CI/CD, Infrastructure as Code, Observability