External Resources

This is a collection of external resources that may be useful for learning more about elements of incident response. Please feel free to submit PRs to add new resources if you find something particularly interesting.

Incident Response Procedures

Articles

  • PagerDuty Incident Response Guide

    This is the full (slightly sanitized) version of PagerDuty's internal

    incident response documentation, and it is very comprehensive. It is

    an excellent resource for seeing how to apply our general principles

    to a specific service.

  • Remote Incident Response

    This article by Ryan Frantz with help from Dr. Laura Maguire discusses

    the unique challenges of handling incident response on a

    distributed team.

Talks and Videos

  • nrrd 911 ic me: The Incident Commander Role

    This is a talk by Alice Goldfuss from SRECon 2016 where she talked

    about the incident response process at New Relic; this includes a

    discussion of severity levels and how they used a chatbot to

    automate elements of the process.

Books

  • Incident Management for Operations

    This book explains how to apply the Incident Command System (ICS) to

    IT operations; it is a good introduction to the topic and

    describes how this actually looks in practice.

Incident Retrospectives (aka Postmortems)

Articles

  • Each Necessary, But Only Jointly Sufficient

    This 2012 blog post from John Allspaw provides a short description of

    why the notion of a "root cause" is fundamentally flawed, and why

    learning must be the driving force behind incident analysis, not fixing.

  • Etsy Debriefing Facilitation Guide

    This is the incident retrospective guide used by Etsy and open-sourced

    in 2016; it's an excellent resource for conducting your own debriefings

    and the basis for a lot of similar guides throughout the industry.

  • The Infinite Hows

    This article by John Allspaw talks about the issues with the

    commonly used "Five Whys" system of incident analysis, and does an

    excellent job providing an overview of an alternative approach.

Talks and Videos

  • Incidents As We Imagine Them Versus How They Actually Are

    This is a talk by John Allspaw at PagerDuty Summit 2018 which is an

    excellent summary of the thorny issues around doing incident response

    and how what actually happened often gets oversimplified in a desire

    to make incidents fit in standardized boxes. If you watch nothing else

    about incident analysis, watch this.

  • Who Destroyed Three Mile Island?

    This talk by Nickolas Means at LeadDev Austin 2018 covers the 1979

    Three Mile Island disaster and is an excellent walkthrough of the

    difference between first stories and second stories, and the dangers of

    hindsight and outcome bias.

Books

  • The Field Guide to Understanding Human Error

    This is one of the best texts for incident analysis, written by Sidney

    Dekker, a leader in the field. While not specifically about software

    services, the guidance in this book is applicable to almost any

    technical system.
