Site Reliability Engineer
Owns digital infrastructure for AI research, managing compute access, auto-scaling, resource visibility, and reproducibility using Kubernetes and observability tools. Requires systems intuition, operational rigor, and pragmatism for experimental workloads.
Responsibilities
- Own digital infrastructure powering research, including compute resources from third parties, container registries, and dashboards.
- Ensure resources are reliable, accessible, and easy to share efficiently.
- Provide compute access and visibility into resource utilization and cluster health.
- Enable auto-scaling of compute resources based on demand.
- Manage access so the right people have appropriate permissions.
- Drive deterministic deployments and reproducible research environments.
- Automate operational processes for efficiency.
Current stack: Ansible, Kubernetes, Docker, Tailscale, Python, Grafana, Prometheus, Talos Linux.
Qualifications
- Ownership: Comfortable being accountable for cluster health and capacity.
- Systems Intuition: Understand schedulers, containers, networking, storage, and hardware interactions; reason about failure modes.
- Operational Rigor: Value observability, reproducibility, and clear boundaries; leave understandable systems.
- Pragmatism: Support experimental workloads without rigid production constraints.
Site Reliability Engineer II (Remote, US)
DevOps/SRE II building and maintaining infrastructure for an insurance platform using GCP, Kubernetes, and Terraform. Focus on automation, monitoring, incident response, and security best practices.
Senior Network Engineer
Senior Network Engineer building and supporting carrier interconnects, private circuits, NNIs, and cloud connectivity for a managed network services provider. Requires hands-on service provider experience with Layer 2/3 protocols and direct carrier coordination.
Site Reliability Engineer - AI Agents
Design, build, and operate reliable infrastructure for AI agent workflows and model serving on AWS and Kubernetes. Build platform APIs, SDKs, and self-service tooling while ensuring observability and incident response for production AI systems.