
TIP 3: Post Mortems #4

Open · wants to merge 1 commit into master

Conversation

bartek
Contributor

@bartek bartek commented Apr 11, 2015

No description provided.

@bartek bartek changed the title Post Mortems TIP 3: Post Mortems Apr 13, 2015
@silent1mezzo silent1mezzo self-assigned this Jun 23, 2015
@jonprindiville
Member

Watched a talk called Finding the Order in Chaos about lessons learned, tools, etc. from examining past postmortems at Google. Obviously we are not Google, but there are perhaps some useful learnings.


To do analysis of those postmortems you need data; the most important points they like to collect are:

Timing data:

  • start time (e.g. the time the bad commit goes into production and users experience it)
  • detection time (when did someone internally first notice a problem?)
  • end time (issue fully resolved)

They're also interested in escalation time (when the incident was brought to wider attention, e.g. a formal incident response procedure starts), mitigation time (drain/failover/push/rollback), when the impact was fully understood, and when the root cause was fully understood.
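
A tiny sketch of how those timestamps turn into derived metrics like time-to-detect and time-to-resolve; the timestamps here are made up for illustration (Python):

from datetime import datetime

fmt = "%Y-%m-%d %H:%M:%S"

# made-up example incident
started   = datetime.strptime("2015-04-11 14:00:00", fmt)  # bad commit hits production
detected  = datetime.strptime("2015-04-11 14:20:00", fmt)  # someone internally notices
mitigated = datetime.strptime("2015-04-11 14:45:00", fmt)  # rollback pushed
ended     = datetime.strptime("2015-04-11 15:30:00", fmt)  # fully resolved

print("time to detect:  ", detected - started)   # 0:20:00
print("time to mitigate:", mitigated - started)  # 0:45:00
print("time to resolve: ", ended - started)      # 1:30:00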

Incident metadata:

  • root cause categories (e.g. capacity management, deployment problem, software interfaces, etc.)
  • trigger categories (e.g. config push, binary push, user behaviour, third-party service, etc.)

They're also interested in severity dimensions and the detection method (e.g. got an alert, log tailing, user reported).

They built tooling for:

  • generating a blank postmortem template
  • simple tags to use in postmortem text
  • processing/analysis of postmortems
  • automated capturing of IRC/email traffic during incidents

Try to make capturing this data easy for postmortem authors. Simple tools can do a lot -- they started with Google Sheets and BigQuery!

Q: Threshold for postmortems?

A: Depends on the organization. Google has a strong "postmortem culture": almost every issue that reaches customers gets postmortem'd, probably thousands of people inside the org write postmortems, and there's lots of reading of and commenting on each other's postmortems.

@Nagyman
Contributor

Nagyman commented Feb 9, 2017

I like the idea of postmortem'ing all issues that reach customers, for future prevention. Is there a good way to make this part of our culture? How can we encourage this? Where do the findings go? Postmortem repository? :)

@jonprindiville
Member

Is there a good way to make this part of our culture? How can we encourage this? Where do the findings go? Postmortem repository? :)

I like the idea of a repo for this -- encourages commenting on other people's postmortems and makes collaborating on a PM easy.

IMO a repo for these things plus a markdown/rst/whatever template gets us pretty far.

In terms of what that template looks like... you think we should workshop that here in this PR/repo, or do that over in the postmortem repo?

@silent1mezzo
Contributor

@jonprindiville @Nagyman makes sense. I've sent a couple out recently and could provide a template for them. In regards to the repo, how do you see it structured? By app? Flat with the app names in the filename?

@jonprindiville
Member

@silent1mezzo re repo structure: could be very simple, like this TIPs repo, a bunch of pm-<identifier>-<slug>.md files. E.g. yesterday perhaps we would have had pm-20170405-tincan-db-drop.md

Obviously the <identifier> bit is negotiable; we could add another sequence number/letter if we do more than one PM in a day.

In terms of the template, I think the focus should be on ease of writing/reading for humans and not on designing some exhaustive form that captures every possible corner case and tries to enumerate all possible dimensions of data...

(But! having said that, a nice-to-have would be some conventions/tags that make it possible for us to mine this for data at some point... e.g. is it possible to write a tool that would read each incident start/resolution time and graph durations?)
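
(Also, for what it's worth, the "generating a blank postmortem template" tooling mentioned in the talk could be tiny for us. A rough Python sketch using the pm-<identifier>-<slug>.md naming above; the metadata stub is just a placeholder until we settle on a template:)

import sys
from datetime import date

# Usage: python new_pm.py tincan-db-drop
slug = sys.argv[1]
filename = "pm-{}-{}.md".format(date.today().strftime("%Y%m%d"), slug)

stub = (
    "---\n"
    "# metadata goes here once we settle on a template\n"
    "---\n"
    "\n"
    "What happened, in order: first we did this, then we did that...\n"
)

with open(filename, "x") as f:  # "x" mode refuses to clobber an existing postmortem
    f.write(stub)

print("Created", filename)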

@Nagyman
Contributor

Nagyman commented Apr 6, 2017

@jonprindiville read my mind RE: repo structure.

Some basic metadata could be encoded with a bit of convention... For example, Hugo supports some basic YAML-based metadata ... Perhaps something similar in Markdown?

---
title: "spf13-vim 3.0 release and new website"
description: "spf13-vim is a cross platform distribution of vim plugins and resources for Vim."
tags: [ ".vimrc", "plugins", "spf13-vim", "vim" ]
lastmod: 2015-12-23
date: "2012-04-06"
categories:
  - "Development"
  - "VIM"
slug: "spf13-vim-3-0-release-and-new-website"
---

Content of the file goes Here

@jonprindiville
Member

That could work nicely, I think. Something like:

author: [email protected]
system: gadventures.com
description: this thing broke because reasons

started: 1970-01-01 00:00:00
detected: 1970-01-01 00:00:00
fixed: 1970-01-01 00:00:00
---

Words go here... first we did this, then we did that other thing

How to identify the system at fault? (git repo name? maybe just free-text? how granular? maybe it's a list instead of a single thing?)

Is a brief description in this metadata block useful?

Other metadata to add without getting carried away? General tags? Tags for causes? Maybe that's optional?
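
Re the graph-the-durations idea from my earlier comment: if we stick to a metadata block like the one sketched above, pulling the numbers out is pretty easy. A rough sketch, assuming PyYAML and the started/detected/fixed field names proposed here (none of this exists yet):

import glob

import yaml  # PyYAML

def parse_frontmatter(path):
    """Return the metadata block between the first pair of '---' lines."""
    text = open(path).read()
    _, meta, _body = text.split("---", 2)
    return yaml.safe_load(meta)

for path in sorted(glob.glob("pm-*.md")):
    meta = parse_frontmatter(path)
    # PyYAML turns unquoted "1970-01-01 00:00:00" style values into datetime objects
    started, detected, fixed = meta["started"], meta["detected"], meta["fixed"]
    print(path)
    print("  time to detect: ", detected - started)
    print("  time to resolve:", fixed - started)

Graphing from there is just a matter of feeding those timedeltas into whatever we like (a spreadsheet, matplotlib, ...).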

@silent1mezzo
Contributor

I'd say the description would be the TL;DR

The body would be a markdown version of the email that was sent out. I'll create a repo and work on getting the template together.

@jonprindiville
Member

Maybe we can include some pointers about keeping things blameless. PMs are not meant to be about assigning fault; they're meant to help us learn as an organization how to avoid making the same mistakes in the future.

Some tips below (emphasis mine) from
https://codeascraft.com/2012/05/22/blameless-postmortems/ ...

... investigating mistakes in a way that focuses on the situational aspects of a failure’s mechanism and the decision-making process of individuals proximate to the failure, an organization can come out safer than it would normally be if it had simply punished the actors involved as a remediation.

... engineers whose actions have contributed to an accident can give a detailed account of:

  • what actions they took at what time,
  • what effects they observed,
  • expectations they had,
  • assumptions they had made,
  • and their understanding of timeline of events as they occurred.

... take a hard look at how the accident actually happened, treat the engineers involved with respect, and learn from the event.

@silent1mezzo
Contributor

@jonprindiville haha I already included that link in the README as well as the video you posted above. I'm going to take some time to write more about PMs as a TIP and as a blog post.
