Infrastructure Engineer (Mid/Senior/Staff)

Builds and maintains Python-based tooling for managing large-scale GPU server fleets, including provisioning, health monitoring, AI-driven recovery, Linux tuning, and security hardening. Requires 3+ years managing server fleets at scale with strong Python and Linux expertise.

180k – 250k | San Francisco, CA | DevOps / SRE | Hybrid

About the role

Key Responsibilities

  • Build and maintain a Python fleet tracking system that manages the full lifecycle of servers, including contracting and procurement, target use, pricing, availability, health, RMAs, etc.
  • Build server management tooling that automates provisioning, health checks, GPU diagnostics, recovery and alerting
  • Create and maintain metrics, dashboards, and alerting for hardware health across the fleet (GPU errors, disk failures, network issues, thermals)
  • Use AI extensively to build tools and automate alerting and recovery
  • Implement and enforce OS-level security: hardening baselines, SELinux/AppArmor policies, SSH key management, vulnerability scanning, and compliance automation
  • Manage and optimize distributed and local storage systems supporting model weights, checkpoints, and ephemeral scratch: NVMe arrays, NFS, parallel file systems, and object storage
  • Tune Linux systems for AI workloads: kernel parameters, NUMA topology, CPU pinning, hugepages, I/O schedulers, and GPU driver stack optimization (NVIDIA drivers, CUDA, container runtimes)
  • Develop a suite of automated error detection and recovery processes
  • Work with partners to solve technical issues

Requirements

  • 3+ years of experience managing bare-metal and cloud-based server fleets at scale (100+ nodes)
  • Strong software engineering skills in Python; you write production tooling, not scripts
  • Deep Linux systems knowledge: boot process, kernel tuning, networking, storage, systemd, cgroups, namespaces, performance profiling
  • Strong experience with configuration management and infrastructure-as-code: Ansible, Terraform, cloud-init
  • Solid understanding of storage technologies: LVM, RAID, NVMe, NFS, Lustre or GPFS, and Linux I/O stack tuning
  • Familiarity with hardware diagnostics and failure modes (GPUs, NVMe, NICs, memory)
  • Experience building internal tools or dashboards for infrastructure visibility
  • Excellent communication and ability to drive technical decisions across teams
  • Self-starter who executes quickly, takes ownership, and constantly seeks improvement

Nice to Have

  • Familiarity with network configuration and diagnostics (VLAN, VXLAN, ECMP, BGP, tcpdump)
  • Experience with NVIDIA GPU infrastructure: driver management, health monitoring, DCGM, NVLink/NVSwitch diagnostics, RDMA, InfiniBand/RoCEv2
  • Experience with AMD GPUs
  • Experience with bare-metal and VM provisioning (PXE/iPXE, Kickstart, libvirt, QEMU/KVM)
  • Experience with compliance frameworks relevant to cloud providers (SOC 2, ISO 27001)

Compensation

$180,000 to $250,000 plus equity and benefits

Skills

Python, Linux, Ansible, Terraform, NVIDIA GPU, CUDA, Kubernetes, NVMe, NFS, SELinux, Prometheus, Grafana, DCGM, LVM, RAID
