Senior Site Reliability Engineer who builds tooling and automation to improve production system reliability and sets monitoring standards for microservices, Kubernetes, and ML platforms. Requires 6+ years in software engineering, SRE, or DevOps, plus proficiency in Python/Go, IaC, and observability tools.
About the role
How you’ll make an impact
- Embody and share SRE principles at Upstart
- Exercise state-of-the-art SRE practices throughout the company
- Uphold a culture of visibility, ownership, and responsibility around service reliability
- Implement standards for monitoring microservices, web apps, mobile apps, databases, Kubernetes clusters, and machine learning platforms, in a fast-paced environment
- Improve incident response practices, both within SRE and throughout the company
- Automate away toil wherever automation makes sense
What we’re looking for
Minimum requirements:
- At least 6 years of combined experience across Software Engineering, Site Reliability, and/or DevOps Engineering, including CI/CD, TDD, internal tooling, observability, and other agile development practices
- Proficiency coding in Python, Go, JavaScript/TypeScript
- Proficiency with Infrastructure as Code (Terraform, CDK, CloudFormation, etc.)
- Software engineering background, including experience building internal tooling from scratch and applying agile development techniques
- Strong software design & architecture skills
- Sound fundamentals in data structures & algorithms
- Experience with on-call and incident management environments
- Experience with observability, monitoring, and reporting tools (e.g., Datadog, Sumo Logic)
- Experience supporting SaaS software in a microservice-oriented cloud environment
- Ability to work with multiple teams for enterprise-wide deliverables
- Data/metrics-driven mindset
Preferred qualifications:
- Experience with service mesh
- Full-stack development skills
- Experience building tooling for an observability platform
- Experience leveraging LLM/GenAI to improve SRE efficiency and processes
Skills
Ruby on Rails, React, AWS, Docker, GitHub Actions, Distributed Systems, Service-Oriented Architecture, CI/CD, Infrastructure as Code, A/B Testing