Site Reliability Engineering Tech Lead
Palo Alto, CA | DevOps / SRE | Remote | 8+ YOE
Summary
Leads SRE efforts to ensure reliability, scalability, and operational excellence for DataHub Cloud and enterprise deployments. Requires 8+ years of SRE/DevOps experience, 3+ years of technical leadership, and expertise in cloud platforms, Kubernetes, infrastructure as code (IaC), and observability tools.
About the role
Key Responsibilities
Technical Leadership & Architecture
- Design and implement robust, scalable infrastructure solutions for DataHub Cloud and enterprise deployments
- Lead the technical vision for multi-cloud deployment strategies and distributed system integrations
- Architect monitoring, observability, and alerting systems across diverse environments
- Drive best practices for infrastructure as code, configuration management, and deployment automation
Enterprise Platform Development
- Partner with product and engineering teams to influence the development of advanced deployment capabilities
- Collaborate with cross-functional teams to build systems for seamless installation, upgrade, and rollback processes across various environments
- Influence the design and help implement comprehensive monitoring and health check systems for distributed deployments
- Partner with engineering teams to develop self-healing and automated remediation capabilities
Platform Reliability & Operations
- Establish and maintain SLAs/SLOs for both cloud and enterprise offerings
- Lead incident response and post-mortem processes to drive continuous improvement
- Implement chaos engineering practices to proactively identify system weaknesses
- Optimize system performance, capacity planning, and cost efficiency
Team Leadership & Collaboration
- Mentor and guide a team of SREs and collaborate with platform engineering teams
- Work closely with product, engineering, and customer success teams to ensure reliable product delivery
- Improve on-call practices, runbooks, and knowledge sharing processes
- Drive cross-functional initiatives to improve overall system reliability
Required Qualifications
- 8+ years of experience in Site Reliability Engineering, Platform Engineering, or DevOps roles
- 3+ years of technical leadership experience managing engineering teams
- Strong expertise with cloud platforms (AWS, GCP, Azure) and infrastructure automation tools
- Proficiency in containerization technologies (Docker, Kubernetes) and orchestration
- Experience with infrastructure as code tools (Terraform, CloudFormation, Pulumi)
- Strong programming skills in Python, Java, or similar languages
- Deep understanding of monitoring and observability tools (Prometheus, Grafana, Datadog, etc.)
- Experience with CI/CD pipelines and deployment automation
- Strong knowledge of networking, security, and database operations in cloud environments
Preferred Qualifications
- Experience building and operating multi-tenant SaaS platforms
- Background in developing customer-facing deployment and management tools
- Knowledge of data infrastructure and metadata management systems
- Experience with service mesh technologies and microservices architectures
- Previous experience in a customer-facing technical role or working with enterprise clients
- Experience with data governance or data catalog platforms
Skills
Kubernetes, Docker, AWS, GCP, Azure, Terraform, Prometheus, Grafana, Datadog, Python, Java, CI/CD, Chaos Engineering