Incident Response Manager - Product & Engineering

290k – 365k | New York, NY | San Francisco, CA | Seattle, WA | Hybrid | 5+ YOE | Apr 29

Summary

Leads incident response operations for product and engineering, serving as on-call commander to coordinate cross-functional teams, manage communications, and improve processes during high-stakes incidents. Requires 5+ years in incident management with technical depth in infrastructure and cloud systems.

About the role

Responsibilities

Build the incident response management function, establishing the processes, tooling, and operational standards that define how we handle incidents at scale
Serve as an on-call incident commander, driving coordinated response across technical and non-technical stakeholders during incidents of varying severity, including managing multiple active incidents simultaneously
Engage the right people at the right time, with a strong sense of urgency, bringing order and direction to fast-moving, ambiguous situations
Own incident communications end-to-end, from real-time internal coordination to external channels like status pages, direct customer outreach, and stakeholder updates, ensuring they reflect Anthropic's commitments to safety, transparency, and accuracy
Participate in blameless incident reviews, contributing operational context and helping drive follow-through on critical remediations so the same class of incident does not recur
Partner with engineering teams to develop and maintain incident response policies, procedures, and escalation frameworks that scale with Anthropic's growth
Partner with engineering, product, security, legal, and go-to-market teams to continuously improve how the organization detects, responds to, and learns from incidents

You May Be a Good Fit If You

Have 5+ years of experience in incident management, with direct experience managing technical product or infrastructure incidents (not exclusively security or trust and safety)
Have built or significantly shaped an incident response program, ideally at a high-growth startup or in an environment where you had to create structure rather than inherit it
Demonstrate a strong sense of ownership and urgency, with the ability to operate independently and make sound decisions under pressure without waiting for direction
Are comfortable working in unprecedented situations where processes are still being defined and guidance may be incomplete or conflicting, and leave things better than you found them
Have a track record of effective cross-functional collaboration, particularly with engineering, security, legal, communications, go-to-market, and executive leadership
Bring a blameless, learning-oriented mindset to incident reviews, focused on systemic improvement rather than individual fault
Have experience with cloud infrastructure incidents and enough technical depth across the stack to engage meaningfully with engineering teams during response, including comfort navigating distributed systems, monitoring tools, and logs
Are analytically minded, with experience using data (incident metrics, queries, trend analysis) to inform decisions during response and to drive operational improvements over time
Communicate clearly and calmly under pressure, both in real-time coordination and in post-incident written communications
Thrive in high-volume, fast-paced environments and are energized by bringing operational discipline to complex, evolving situations

Annual Salary: $290,000 – $365,000 USD

Skills

Incident Management, Cloud Infrastructure, Distributed Systems, Monitoring Tools, Logs Analysis, Incident Metrics, Trend Analysis, On-Call Management, Incident Reviews, Escalation Frameworks
