Sr. Software Engineer

Memphis, TN | Onsite | 5+ YOE
Summary

Sr. Software Engineer focused on automating reliability workflows, observability, and incident response across multi-data center AI infrastructure. Requires strong Python skills, Linux expertise, and 3+ years SRE/infrastructure experience.

About the role

Responsibilities

  • Design, develop, and deploy scalable code and services (primarily in Python and Rust) to automate reliability workflows, including monitoring, alerting, incident response, and infrastructure provisioning.
  • Implement and maintain observability tools and practices, such as metrics collection, logging, tracing, and dashboards, to provide real-time insights into system health across multiple data centers.
  • Collaborate with cross-functional teams (software development, network engineering, site operations, and facility operations) to identify reliability bottlenecks and automate solutions for fault tolerance, disaster recovery, capacity planning, and physical/environmental risk mitigation.
  • Troubleshoot and resolve complex issues in data center environments, including hardware failures, environmental anomalies, software bugs, and network-related problems.
  • Optimize Linux-based systems for performance, security, and reliability, including kernel tuning, container orchestration, and scripting for automation.
  • Understand network topologies and concepts in large-scale, multi-data center environments to effectively troubleshoot connectivity, routing, redundancy, and performance issues.
  • Participate in on-call rotations, post-incident reviews (blameless postmortems), and continuous improvement initiatives.
  • Mentor junior team members and document processes.

Requirements

  • Bachelor's degree in Computer Science, Computer Engineering, Electrical Engineering, or a closely related technical field (or equivalent professional experience).
  • 3+ years of hands-on experience in site reliability engineering (SRE), infrastructure engineering, DevOps, or systems engineering, preferably supporting large-scale, distributed, or production environments.
  • Strong programming skills with proven production experience in Python; strong coding fundamentals in at least one general-purpose or systems-level language (e.g., Python, Go, C++) are essential. Experience with Rust, or willingness to work in Rust, is a plus.
  • Solid experience with Linux systems administration, performance tuning, kernel-level understanding, and scripting/automation in production environments.
  • Practical knowledge of containerization and orchestration technologies, such as Docker and Kubernetes (or similar systems).
  • Experience implementing observability solutions, including metrics, logging, tracing, monitoring tools (e.g., Prometheus, Grafana, or alternatives), alerting, and dashboards.
  • Familiarity with troubleshooting complex issues in distributed systems, including software bugs, hardware failures, network problems, and environmental factors.
  • Understanding of networking fundamentals (TCP/IP, routing, redundancy, DNS) in large-scale or multi-site environments.
  • Experience participating in on-call rotations, incident response, post-incident reviews, and reliability practices such as error budgets or SLAs.
  • Ability to collaborate effectively with cross-functional teams (software engineers, network teams, site/facility operations, mechanical/electrical teams).

Nice-to-Haves

  • 5+ years of experience in SRE or infrastructure roles, ideally in hyperscale, cloud, or AI/ML training infrastructure environments with multi-data center setups.
  • Hands-on experience operating or scaling Kubernetes clusters (or equivalent orchestration) at large scale, including automation for provisioning, lifecycle management, and high availability.
  • Proficiency in Rust for systems programming and performance-critical components.
  • Direct experience integrating software reliability tools with physical data center infrastructure (e.g., power, cooling, environmental monitoring, facility controls).
Skills

Python, Rust, Linux, Kubernetes, Docker, Prometheus, Grafana, Observability, Site Reliability Engineering, Networking