Member of Technical Staff, AI Reliability & Monitoring Engineering Lead

Lead AI reliability engineering for Postman's API and agentic systems, building monitoring, observability, and automation for high availability. Requires strong SRE/DevOps background in large-scale AI infrastructure and cloud platforms.

256k – 276k · San Francisco, CA · DevOps / SRE · Hybrid

About the role

The Opportunity

Postman is seeking an experienced AI Systems Reliability Engineer to help define, build, and maintain the infrastructure and processes that ensure the reliability, scalability, and performance of Postman’s AI-powered API and agentic systems in production. This role focuses on monitoring, availability, incident response, and automation to support AI services and tools trusted by millions of developers globally.

What You’ll Do

  • Develop and manage service level objectives (SLOs) and the reliability metrics behind them for AI-driven API services and agentic AI platform features
  • Implement comprehensive observability and monitoring systems for real-time performance and fault detection
  • Design and drive automated failover, recovery, and incident response strategies for high-availability AI infrastructure
  • Optimize resource utilization, particularly GPU/accelerator efficiency, ensuring cost-effective AI system operation
  • Collaborate closely with engineering, platform, and product teams to align reliability efforts with broader organizational goals
  • Lead efforts to build internal tooling and automation focused on AI system stability and operational excellence
  • Drive continuous improvement in deployment practices, monitoring approaches, and incident management processes

About You

  • Have a strong background in AI reliability engineering, SRE, or DevOps for distributed systems
  • Understand the unique challenges of maintaining large-scale AI systems and integrating AI-specific metrics into reliability frameworks
  • Are experienced with cloud platforms, monitoring tools, and incident response automation
  • Are comfortable collaborating across teams to influence best practices for AI system reliability and operational health
  • Thrive in dynamic, fast-paced environments focused on delivering reliable, safe AI-powered services

Bonus Skills and Experiences

  • Hands-on experience with AI/ML infrastructure, including GPU/xPU optimization and scaling
  • Familiarity with API platform operations and large-scale distributed services
  • Prior experience building or operating observability tools tailored for AI and agentic systems
  • Contribution to open-source projects or reliability engineering thought leadership

Compensation

The reasonably estimated base salary for this role ranges from $256,000 to $276,000, plus a competitive equity package. Actual compensation is based on the candidate's skills, qualifications, and experience.

Skills

SRE · DevOps · AI/ML Infrastructure · GPU Optimization · Observability · Monitoring · Cloud Platforms · Incident Response · SLOs · Distributed Systems

Similar roles

DevOps / SRE jobs

Member of Technical Staff, AI Platform & Architecture (Infrastructure)

Builds and maintains distributed AI infrastructure for model training, inference, and data pipelines. Requires experience in GenAI systems, distributed computing, Python/Go, and scaling AI workloads on GPUs/cloud.

256k – 276k · San Francisco, CA +3 · DevOps / SRE · Hybrid · Go · GPU

Staff Engineer, Engineering Productivity & AI Quality

As a Staff Engineer, you will build and scale engineering productivity and AI quality systems, focusing on CI/CD gates, integration test harnesses, and agent instructions. This role is critical for enabling a small engineering team to operate with high leverage by encoding architectural taste into mechanical rules.

253k – 308k · San Francisco, CA · DevOps / SRE · On-site · 8+ YOE · CI/CD · AI/ML Systems

Staff Site Reliability Engineer

Lead EarnIn's AI-first reliability engineering strategy. Define SLOs/SLIs, build AI agents for incident response and on-call automation, and partner with engineering teams to embed AI-assisted operations across production systems on AWS.

252k – 308k · Mountain View, CA · DevOps / SRE · Hybrid · 7+ YOE · Go · SRE

Senior Staff Software Engineer, Infrastructure

Designs and implements large-scale public cloud infrastructure and builds complex distributed systems and microservices. Requires 10+ years of experience; expert skills in performance tuning, concurrency, and multiple cloud providers such as AWS, GCP, and Azure; and a graduate degree or equivalent.

260k – 325k · United States · DevOps / SRE · Remote · 10+ YOE · Go · AWS

Member of Technical Staff

Hands-on technical role building AI-powered tools, infrastructure, and processes to accelerate engineering velocity and product delivery at an AI search company.

250k – 405k · San Francisco, CA +1 · DevOps / SRE · Hybrid · 5+ YOE · Go · Rust