Staff Software Engineer, Infrastructure

Designs and operates scalable cloud infrastructure on AWS, focusing on Kubernetes orchestration, reliability practices, and observability for AI healthcare products. Requires 8+ years of experience with IaC, containerization, and cross-team leadership.

175k – 230k | United States | DevOps / SRE | Remote | 8+ YOE

About the role

What You’ll Be Doing

Influence the technical direction for infrastructure and platform capabilities that support our rapidly growing AI product suite.
Architect and evolve our cloud infrastructure (primarily on AWS) across container orchestration (Kubernetes, Elastic Container Service), serverless (e.g., Lambda), virtual machines (e.g., EC2), and data stores to support current and future products.
Work closely with Platform leadership, product engineering, data, and ML teams to design systems that are robust, observable, and compliant in a healthcare environment.
Define and drive infrastructure strategy for the Platform org—partnering with engineering leadership to align roadmaps, set standards, and sequence work for maximum business impact.
Secure networking, identity, and access patterns across environments.
Improve reliability and operational excellence by defining SLOs, SLIs, and error budgets for core platform services.
Lead and participate in blameless post-incident reviews, and translate learnings into systemic improvements.
Own observability and monitoring strategy across logging, metrics, and tracing, ensuring we can detect, debug, and prevent issues efficiently.
Mentor and level up engineers across Platform and product teams—reviewing design docs, guiding architecture decisions, and modeling high standards for reliability, security, and maintainability.
Partner with security and compliance stakeholders to ensure our infrastructure and operational practices meet HIPAA and other healthcare requirements.
Advocate for and implement developer experience improvements, such as better CI/CD workflows, faster feedback loops, and tooling that reduces cognitive load for product teams.

Who We’re Looking For

Bring 8+ years of hands-on infrastructure / platform development experience (or equivalent practical experience) in modern, cloud-native environments, with a track record of owning critical systems in production.
Have deep expertise with AWS (preferred) and/or GCP, including core networking, compute, storage, and managed services.
Are highly proficient in at least one programming/scripting language used for infrastructure work (Python preferred).
Have extensive experience building tooling and automation for other engineers.
Have strong experience with Kubernetes, containers (Docker), and container orchestration, and understand how to operate these systems reliably at scale.
Are comfortable with Infrastructure as Code (Terraform preferred; Pulumi or similar also welcome) and Git-based workflows.
Possess solid Linux fundamentals and are comfortable debugging issues at the OS, networking, and application layers.
Have demonstrable experience leading complex, cross-team initiatives from design through rollout—communicating tradeoffs, aligning stakeholders, de-risking launches, and measuring impact.
Communicate clearly and empathetically with both technical and non-technical partners, and enjoy mentoring engineers at multiple levels.
Take a data-informed, pragmatic approach to decision-making—balancing ideal architecture with business needs, delivery timelines, and team capacity.

Nice to Haves

Experience in regulated environments (e.g., HIPAA) or prior work in healthcare or healthtech.
Background in platform or security engineering, especially around access control, encryption, auditability, and compliance.
Experience working closely with ML / data teams or with ML platforms (e.g., Airflow, Ray, ML pipelines, model serving stacks).
Familiarity with observability stacks (CloudWatch, New Relic, Grafana, OpenTelemetry, etc.).
Experience designing or operating internal developer platforms, SDKs, or reusable frameworks that standardize how services are built and deployed.
Prior experience at a fast-growing startup where you’ve helped scale infrastructure, processes, and teams.

Skills

AWS, Kubernetes, Docker, Terraform, Python, Linux, GCP, CI/CD, HIPAA, SLOs
