Incident Response Manager - Product & Engineering

290k – 365k | New York, NY | San Francisco, CA | Seattle, WA | Hybrid | 5+ YOE
Summary

Leads incident response operations for product and engineering, serving as on-call commander to coordinate cross-functional teams, manage communications, and improve processes during high-stakes incidents. Requires 5+ years in incident management with technical depth in infrastructure and cloud systems.

About the role

Responsibilities

  • Build the incident response management function, establishing the processes, tooling, and operational standards that define how we handle incidents at scale
  • Serve as an on-call incident commander, driving coordinated response across technical and non-technical stakeholders during incidents of varying severity, including managing multiple active incidents simultaneously
  • Engage the right people at the right time, with a strong sense of urgency, bringing order and direction to fast-moving, ambiguous situations
  • Own incident communications end-to-end, from real-time internal coordination to external channels like status pages, direct customer outreach, and stakeholder updates, ensuring they reflect Anthropic's commitments to safety, transparency, and accuracy
  • Participate in blameless incident reviews, contributing operational context and helping drive follow-through on critical remediations so the same class of incident does not recur
  • Partner with engineering teams to develop and maintain incident response policies, procedures, and escalation frameworks that scale with Anthropic's growth
  • Partner with engineering, product, security, legal, and go-to-market teams to continuously improve how the organization detects, responds to, and learns from incidents

You May Be a Good Fit If You

  • Have 5+ years of experience in incident management, with direct experience managing technical product or infrastructure incidents (not exclusively security or trust and safety)
  • Have built or significantly shaped an incident response program, ideally at a high-growth startup or in an environment where you had to create structure rather than inherit it
  • Demonstrate a strong sense of ownership and urgency, with the ability to operate independently and make sound decisions under pressure without waiting for direction
  • Are comfortable working in unprecedented situations where processes are still being defined and guidance may be incomplete or conflicting, and leave things better than you found them
  • Have a track record of effective cross-functional collaboration, particularly with engineering, security, legal, communications, go-to-market, and executive leadership
  • Bring a blameless, learning-oriented mindset to incident reviews, focused on systemic improvement rather than individual fault
  • Have experience with cloud infrastructure incidents and enough technical depth across the stack to engage meaningfully with engineering teams during response, including comfort navigating distributed systems, monitoring tools, and logs
  • Are analytically minded, with experience using data (incident metrics, queries, trend analysis) to inform decisions during response and to drive operational improvements over time
  • Communicate clearly and calmly under pressure, both in real-time coordination and in post-incident written communications
  • Thrive in high-volume, fast-paced environments and are energized by bringing operational discipline to complex, evolving situations

Annual Salary: $290,000 – $365,000 USD

Skills
Incident Management, Cloud Infrastructure, Distributed Systems, Monitoring Tools, Logs Analysis, Incident Metrics, Trend Analysis, On-Call Management, Incident Reviews, Escalation Frameworks