# Staff Site Reliability Engineer - Data Center
**Company:** [PathAI](https://hotfix.jobs/companies/pathai)
**Location:** Boston, MA
**Salary:** $166K-$224K
**Experience:** 8+ years
**Skills:** Ansible, Python, Go, Datadog, Grafana, Prometheus, Terraform, CloudFormation, iDRAC, IPMI, NVIDIA UFM, Juniper Systems, Quobyte, S3, FSx for Lustre
**Posted:** 2026-04-22
> Designs, builds, and operates data centers and hybrid cloud infrastructure for ML workloads, implementing SRE practices with automation, monitoring, and reliability improvements. Requires 8+ years of experience in infrastructure operations, IaC, and observability tools.
## Job Description
## What You’ll Do

- Advancing the state of our operations by implementing SRE best practices, with a focus on users, monitoring, and automation.
- Engineering infrastructure patterns for cloud environments in Amazon Web Services, building in security, reliability, and scalability.
- Designing, building, and operating our data center to support our rapidly growing Machine Learning team.
- Integrating on-premises datacenter environments with existing cloud infrastructure to create a seamless hybrid cloud environment.
- Improving the reliability and resilience of our infrastructure through root-cause analysis and by reviewing gaps in its design and implementation.
- Participating in platform on-call rotations and assisting with urgent incident response.

## What You Bring

- 8+ years of relevant experience.
- **Automation**: You work hard to eliminate toil by automating everything through scripting, configuration management tools (**Ansible**), and code (**Python**/**Go**).
- You’ve built monitoring infrastructure with modern observability tools (**Datadog**/**Grafana**/**Prometheus**).
- You’ve worked with infrastructure as code (**Terraform**/**CloudFormation**).
- You’ve administered physical hardware stacks in production settings (**iDRAC**/**IPMI**/**Nvidia UFM**/**Juniper Systems**).
- You’re opinionated on storage solutions and how they can be optimized for high performance workloads (**Quobyte**/**S3**/**FSx**/**EFS**).
- Familiarity with modern network designs and comfort operating across network layers.
- Some experience with, and opinions on, virtualization, containerization, or container orchestration platforms (**EKS**/**ClusterAPI**/**KVM**).
- Operations experience: You’ve managed critical production infrastructure and are familiar with incident response, scaling, and the challenges of rapid growth.
- A bachelor's degree in Computer Science or equivalent experience.
- An insatiable intellectual curiosity and the ability to learn quickly in a complex space.
- Travel: Willingness to travel up to 25% of the time.

**Annual Pay Range: $165,750 - $224,450**
Not Overtime Eligible
Eligible for Equity
**Apply:** https://hotfix.jobs/jobs/staff-site-reliability-engineer-data-center-at-pathai-131f28e9-5848-4f12-8d0b-61ae34487246
**Canonical:** https://hotfix.jobs/jobs/staff-site-reliability-engineer-data-center-at-pathai-131f28e9-5848-4f12-8d0b-61ae34487246