Lead the infrastructure engineering team that builds scalable cloud platforms on AWS and Kubernetes, driving reliability, security, and developer productivity in a regulated biotech environment. Requires 5+ years of people leadership and hands-on infrastructure experience.
About the role
Responsibilities
- Lead and grow a team of infrastructure engineers at all levels, including hiring, coaching, performance management, and career development.
- Own delivery and outcomes for critical infrastructure and services-platform initiatives, including reliability, scalability, cost, security posture, and compliance-aligned engineering practices.
- Drive reliability improvements using pragmatic SRE principles: service-level thinking (SLIs/SLOs), operational readiness, automation, and resiliency patterns.
- Strengthen incident response: improve on-call health, incident processes, postmortems, and follow-up execution in collaboration with product engineering and platform teams.
- Build and evolve cloud foundations on AWS, including network and compute patterns and Kubernetes-based services.
- Partner cross-functionally with Platform Engineering, Security, and Product Engineering to align priorities and deliver shared roadmaps.
- Operate in a regulated environment (GxP/biotech): help ensure infrastructure changes are traceable, auditable, and appropriately controlled, without slowing teams down unnecessarily.
Qualifications
- 5+ years of people leadership experience including managing a team across a range
- Strong technical depth with previous direct, hands-on experience in software engineering and infrastructure/platform engineering.
- Production experience operating on AWS (e.g., EKS/ECS, RDS, VPC, EC2, S3) and building scalable internal platforms.
- Strong experience with Kubernetes and services-platform patterns (e.g., service mesh such as Istio, ingress, service discovery, workload isolation).
- Experience building or evolving observability and operational tooling (Datadog, FireHydrant, Sentry, or similar).
- Ability to collaborate across teams and influence without authority; strong written and verbal communication.
- Comfort working in a stack built largely on Python (Go experience is also highly valued), with TypeScript in the broader ecosystem (especially for platform tooling and integrations).
Nice to Have Qualifications
- Direct ownership of incident management programs (on-call design, incident command, postmortems, and reliability governance).
- Experience managing or partnering closely with SRE teams.
- Experience in regulated environments (GxP, biotech, healthcare, finance), including change control, auditability, and SDLC/process rigor.
Skills
AWS, Kubernetes, Python, Go, TypeScript, Istio, Datadog, SRE, Observability, GxP