Responsibilities

Design, develop, and deploy scalable code and services (primarily in Python and Rust) to automate reliability workflows, including monitoring, alerting, incident response, and infrastructure provisioning.
Implement and maintain observability tools and practices, such as metrics collection, logging, tracing, and dashboards, to provide real-time insights into system health across multiple data centers.
Collaborate with cross-functional teams (software development, network engineering, site operations, and facility operations) to identify reliability bottlenecks and automate solutions for fault tolerance, disaster recovery, capacity planning, and physical/environmental risk mitigation.
Troubleshoot and resolve complex issues in data center environments, including hardware failures, environmental anomalies, software bugs, and network-related problems.
Optimize Linux-based systems for performance, security, and reliability, including kernel tuning, container orchestration, and scripting for automation.
Understand network topologies and concepts in large-scale, multi-data center environments to effectively troubleshoot connectivity, routing, redundancy, and performance issues.
Participate in on-call rotations, post-incident reviews (blameless postmortems), and continuous improvement initiatives.
Mentor junior team members and document processes.

Requirements

Bachelor's degree in Computer Science, Computer Engineering, Electrical Engineering, or a closely related technical field (or equivalent professional experience).
3+ years of hands-on experience in site reliability engineering (SRE), infrastructure engineering, DevOps, or systems engineering, preferably supporting large-scale, distributed, or production environments.
Strong programming skills with proven production experience in Python, and solid coding fundamentals in at least one general-purpose or systems-level language (e.g., Python, Go, C++). Experience with Rust, or willingness to work in Rust, is a plus.
Solid experience with Linux systems administration, performance tuning, kernel-level understanding, and scripting/automation in production environments.
Practical knowledge of containerization and orchestration technologies, such as Docker and Kubernetes (or similar systems).
Experience implementing observability solutions, including metrics, logging, tracing, monitoring tools (e.g., Prometheus, Grafana, or alternatives), alerting, and dashboards.
Familiarity with troubleshooting complex issues in distributed systems, including software bugs, hardware failures, network problems, and environmental factors.
Understanding of networking fundamentals (TCP/IP, routing, redundancy, DNS) in large-scale or multi-site environments.
Experience participating in on-call rotations, incident response, post-incident reviews, and reliability practices such as error budgets or SLAs.
Ability to collaborate effectively with cross-functional teams (software engineers, network teams, site/facility operations, mechanical/electrical teams).

Nice-to-Haves

5+ years of experience in SRE or infrastructure roles, ideally in hyperscale, cloud, or AI/ML training infrastructure environments with multi-data center setups.
Hands-on experience operating or scaling Kubernetes clusters (or equivalent orchestration) at large scale, including automation for provisioning, lifecycle management, and high availability.
Proficiency in Rust for systems programming and performance-critical components.
Direct experience integrating software reliability tools with physical data center infrastructure (e.g., power, cooling, environmental monitoring, facility controls).