Software Engineer, Compute Infrastructure

Build and operate Kubernetes-based compute and runtime infrastructure powering AI search, assistant, and agent workloads across multi-cloud environments. Own reliability, scalability, cost-efficiency, and on-call for production platform services.

140k – 220k | Mountain View, CA | DevOps / SRE | Hybrid | 5+ YOE

About the role

Responsibilities

  • Design, build, and own backend/platform services that power Glean’s runtime infrastructure, with a focus on reliability, scalability, and performance for AI and search workloads.
  • Develop and evolve Kubernetes-based runtime primitives (e.g., service orchestration, scheduling integrations, autoscaling patterns) across a multi-cloud foundation (GCP, AWS, Azure).
  • Collaborate with platform, data, and product engineering teams to make it easy and safe to spin up new services and batch workloads, with clear golden paths for deployment, configuration, and runtime operations.
  • Drive end-to-end improvements in latency, resource utilization, and cost for core platform services, including multitenant runtime environments and experimental AI workloads.
  • Implement and harden infrastructure-as-code patterns, observability, and guardrails so teams can confidently ship and run services in production (e.g., SLOs, dashboards, alerts, safe rollout/rollback).
  • Partner with the Costs and Runtime teams to build shared mechanisms for attribution, guardrails, and automation that keep the runtime layer efficient as the company scales.
  • Participate in an on-call rotation for critical platform services, lead incident response when needed, and translate learnings into better reliability, tooling, and documentation.
  • Contribute to technical direction for Runtime Infra: help define roadmaps around multitenancy, autoscaling, capacity/placement, and platformized patterns.

Requirements

  • Strong distributed systems fundamentals and experience operating high-throughput, low-latency services or batch pipelines in production environments.
  • Comfortable owning systems end-to-end: design, implementation, testing, deployment, observability, and ongoing operations.
  • Experience thinking in terms of reliability and guardrails: SLOs, incident response, safe deployment strategies, and clear operational runbooks.
  • Pragmatic and execution-oriented: balance ideal architectures with the constraints of a fast-moving startup and ship iterative improvements.
  • Clear communication with both infra and product engineers; enjoy collaborating across teams to understand requirements and translate them into platform capabilities.
  • Excited to work in a multi-cloud, multi-tenant environment and help define best practices for running AI workloads efficiently at scale.

Nice-to-Haves

  • Experience with Kubernetes-based runtime systems and multi-cloud infrastructure (GCP, AWS, Azure).
  • Background in cost-efficient, low-latency execution for production services and pipelines.

Skills

Kubernetes, GCP, AWS, Azure, Infrastructure as Code, Distributed Systems, Observability, SLOs, Autoscaling, Multi-Tenancy
