Senior Site Reliability Engineer, Cloud Platform

The Senior SRE ensures the reliability, availability, and performance of distributed database systems in cloud-native environments. Requires 4+ years of experience with Kubernetes, Docker, cloud platforms (AWS/GCP/Azure), IaC tools, and scripting in Python/Go/Java.

175k – 225k | Redwood City, CA | DevOps / SRE | Hybrid | 4+ YOE

About the role

Responsibilities

  • Work at the intersection of development and site reliability, creating SRE tools and systems while supporting existing infrastructure and platforms.
  • Ensure the reliability, availability, and performance of Zilliz’s distributed database systems.
  • Develop and implement strategies for monitoring, incident management, and disaster recovery.
  • Automate system operations and maintenance tasks to improve efficiency and reduce manual intervention.
  • Design and build tools to manage and monitor infrastructure, ensuring scalability and robustness.
  • Collaborate with software engineers to enhance system reliability, scalability, and performance.
  • Maintain and improve the CI/CD pipeline to ensure smooth and rapid deployment of changes.
  • Actively contribute to the Milvus Vector Database open-source community, focusing on improving reliability and operational efficiency.

Requirements

  • 4+ years of experience in site reliability engineering or similar roles with a focus on cloud-native systems.
  • Proficiency in scripting languages such as Python, Go, or Java.
  • Strong knowledge of container and orchestration technologies such as Docker and Kubernetes.
  • Expertise with cloud platforms such as AWS, GCP, or Azure, and their respective monitoring and management tools.
  • Experience with infrastructure as code tools such as Terraform or Ansible.
  • Familiarity with CI/CD tools such as Jenkins, GitLab CI, or Argo.
  • Proven ability to troubleshoot complex distributed systems and resolve issues promptly.
  • Bachelor’s degree or above in computer science, software engineering, or other relevant disciplines.
  • Ability to thrive in a fast-paced startup environment and handle multiple projects simultaneously.

Nice-to-Haves

  • Experience with the open-source Milvus vector database.

Skills

Python, Go, Java, Kubernetes, Docker, AWS, GCP, Azure, Terraform, Ansible, Jenkins, GitLab CI, Argo, Milvus
