Responsibilities
- Own reliability and operational excellence for production systems
- Design and implement monitoring, alerting, and incident response processes
- Build tooling to improve engineering team effectiveness
- Establish on-call rotations and runbooks
- Ensure the platform handles the demands of a regulated financial product
- Spend 50%+ of your time writing code: infrastructure tooling, automation, reliability improvements, developer productivity tools
Requirements (Must-haves)
- 4+ years of experience in SRE, infrastructure, or platform engineering
- Experience on a team of SREs at a company with mature SRE practices
- Hands-on on-call experience at scale in a large production environment
- Deep AWS expertise (ECS, RDS, networking, security)
- Strong experience with declarative infrastructure (Terraform, CDK, or similar)
- Nix experience
- Track record of building reliability tooling and automation
- Ability to design and implement monitoring, alerting, and observability systems from first principles
- Comfortable working in a regulated environment
Nice-to-haves
- Experience at companies with strong SRE cultures (Google, Replit, Stripe, etc.)
- Background in fintech, healthtech, or other regulated domains
- Experience migrating monitoring systems or implementing SLOs
- Contributions to infrastructure tooling or open source projects
Technology Stack
Infrastructure: AWS (ECS, RDS, CloudFront, Lambda), CDK
Observability: Honeycomb, OpenTelemetry
CI/CD: GitHub Actions, Nix
Core platform: TypeScript/Node, PostgreSQL, React
Languages: TypeScript, Python, Nix, SQL
Compensation & Benefits
- Stock options
- Health insurance, dental, 401(k)