Site Reliability Engineering Tech Lead

Palo Alto, CA | DevOps / SRE | Remote | 8+ YOE
Summary

Leads SRE efforts to ensure reliability, scalability, and operational excellence for DataHub Cloud and enterprise deployments. Requires 8+ years of SRE/DevOps experience, 3+ years of technical leadership, and expertise in cloud platforms, Kubernetes, infrastructure as code (IaC), and observability tools.

About the role

Key Responsibilities

Technical Leadership & Architecture

  • Design and implement robust, scalable infrastructure solutions for DataHub Cloud and enterprise deployments
  • Lead the technical vision for multi-cloud deployment strategies and distributed system integrations
  • Architect monitoring, observability, and alerting systems across diverse environments
  • Drive best practices for infrastructure as code, configuration management, and deployment automation

Enterprise Platform Development

  • Partner with product and engineering teams to influence the development of advanced deployment capabilities
  • Collaborate with cross-functional teams to build systems for seamless installation, upgrade, and rollback processes across various environments
  • Influence the design and help implement comprehensive monitoring and health check systems for distributed deployments
  • Partner with engineering teams to develop self-healing and automated remediation capabilities

Platform Reliability & Operations

  • Establish and maintain SLAs/SLOs for both cloud and enterprise offerings
  • Lead incident response and post-mortem processes to drive continuous improvement
  • Implement chaos engineering practices to proactively identify system weaknesses
  • Optimize system performance, capacity planning, and cost efficiency

Team Leadership & Collaboration

  • Mentor and guide a team of site reliability engineers and collaborate with platform engineering teams
  • Work closely with product, engineering, and customer success teams to ensure reliable product delivery
  • Improve on-call practices, runbooks, and knowledge sharing processes
  • Drive cross-functional initiatives to improve overall system reliability

Required Qualifications

  • 8+ years of experience in Site Reliability Engineering, Platform Engineering, or DevOps roles
  • 3+ years of technical leadership experience managing engineering teams
  • Strong expertise with cloud platforms (AWS, GCP, Azure) and infrastructure automation tools
  • Proficiency in containerization technologies (Docker, Kubernetes) and orchestration
  • Experience with infrastructure as code tools (Terraform, CloudFormation, Pulumi)
  • Strong programming skills in Python, Java, or similar languages
  • Deep understanding of monitoring and observability tools (Prometheus, Grafana, Datadog, etc.)
  • Experience with CI/CD pipelines and deployment automation
  • Strong knowledge of networking, security, and database operations in cloud environments

Preferred Qualifications

  • Experience building and operating multi-tenant SaaS platforms
  • Background in developing customer-facing deployment and management tools
  • Knowledge of data infrastructure and metadata management systems
  • Experience with service mesh technologies and microservices architectures
  • Previous experience in a customer-facing technical role or working with enterprise clients
  • Experience with data governance or data catalog platforms
Skills

Kubernetes, Docker, AWS, GCP, Azure, Terraform, Prometheus, Grafana, Datadog, Python, Java, CI/CD, Chaos Engineering