Senior Site Reliability Engineer
Senior SRE responsible for incident response, infrastructure reliability, database operations, and scaling production systems on AWS and Kubernetes.
Responsibilities
- Act as a first responder for system incidents and outages
- Own and evolve monitoring, alerting, and log management systems
- Manage and optimize database infrastructure (MySQL, PostgreSQL, ClickHouse, Redis)
- Maintain and improve server infrastructure and deployment pipelines
- Collaborate with engineering teams to build scalable, resilient systems
- Contribute to internal SRE tooling and automation efforts
Requirements
- Deep expertise with AWS and Kubernetes
- 5+ years of experience in a Site Reliability, DevOps, or Infrastructure Engineering role
- Proven experience scaling production systems in a high-growth environment
- Practical experience using AI tools to improve engineering productivity
- Experience scaling an early-stage product to 1M+ monthly active users
- Experience managing incident response and production system outages
- Hands-on experience with database operations and optimization
- Familiarity with observability tooling, monitoring, and logging best practices
- Based in North or South America (AMER region)
Nice-to-Haves
- Experience with SOC 2 compliance or building secure infrastructure
- Experience with ClickHouse or similar technologies
Compensation & Benefits
- $130,000 - $140,000 USD per year
- Fully remote
- 35 days of PTO annually + paid sabbatical after 5 years
- 100% medical coverage for you and your family (or reimbursement)
- Parental leave
- Home office stipend
- Learning & development stipend
- Annual bonus potential
- Company retreats twice a year