
TIP 3: Post Mortems #4

Open · wants to merge 1 commit into master

Conversation

bartek
Contributor

@bartek bartek commented Apr 11, 2015

No description provided.

@bartek bartek changed the title Post Mortems TIP 3: Post Mortems Apr 13, 2015
@silent1mezzo silent1mezzo self-assigned this Jun 23, 2015
@jonprindiville
Member

Watched a talk called Finding the Order in Chaos about lessons learned, tools, etc. from examining past postmortems at Google. Obviously we are not Google, but there are perhaps some useful learnings.


To do analysis of those postmortems you need data; the most important points they like to collect are:

Timing data:

  • start time (e.g. the time the bad commit goes into production and users experience it)
  • detection time (when did someone internally first notice a problem?)
  • end time (issue fully resolved)

They're also interested in escalation time (when the incident was brought to wider attention, e.g. a formal incident response procedure starts), mitigation time (drain/failover/push/rollback), when the impact was fully understood, and when the root cause was fully understood.
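
A tiny sketch of how those timestamps turn into derived metrics like time-to-detect and time-to-resolve; the timestamps here are made up for illustration (Python):

from datetime import datetime

fmt = "%Y-%m-%d %H:%M:%S"

# made-up example incident
started   = datetime.strptime("2015-04-11 14:00:00", fmt)  # bad commit hits production
detected  = datetime.strptime("2015-04-11 14:20:00", fmt)  # someone internally notices
mitigated = datetime.strptime("2015-04-11 14:45:00", fmt)  # rollback pushed
ended     = datetime.strptime("2015-04-11 15:30:00", fmt)  # fully resolved

print("time to detect:  ", detected - started)   # 0:20:00
print("time to mitigate:", mitigated - started)  # 0:45:00
print("time to resolve: ", ended - started)      # 1:30:00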

Incident metadata:

  • root cause categories (e.g. capacity management, deployment problem, software interfaces, etc.)
  • trigger categories (e.g. config push, binary push, user behaviour, third-party service, etc.)

They're also interested in severity dimensions and the detection method (e.g. got an alert, log tailing, user reported).

They built tooling for:

  • generating a blank postmortem template
  • simple tags to use in postmortem text
  • processing/analysis of postmortems
  • automated capturing of IRC/email traffic during incidents

Try to make capturing this data easy for postmortem authors. Simple tools can do a lot -- they started with Google Sheets and BigQuery!

Q: Threshold for postmortems?

A: Depends on the organization. Google has a strong "postmortem culture": almost every issue that reaches customers gets postmortem'd, probably thousands of people inside the org write postmortems, and there's lots of reading of and commenting on each other's postmortems.

@Nagyman
Contributor

Nagyman commented Feb 9, 2017

I like the idea of postmortem'ing all issues that reach customers, for future prevention. Is there a good way to make this part of our culture? How can we encourage this? Where do the findings go? Postmortem repository? :)

@jonprindiville
Member

Is there a good way to make this part of our culture? How can we encourage this? Where do the findings go? Postmortem repository? :)

I like the idea of a repo for this -- encourages commenting on other people's postmortems and makes collaborating on a PM easy.

IMO a repo for these things plus a markdown/rst/whatever template gets us pretty far.

In terms of what that template looks like... you think we should workshop that here in this PR/repo, or do that over in the postmortem repo?

@silent1mezzo
Contributor

@jonprindiville @Nagyman makes sense. I've sent a couple out recently and could provide a template for them. In regards to the repo, how do you see it structured? By app? Flat with the app names in the filename?

@jonprindiville
Member

@silent1mezzo re repo structure: could be very simple, like this TIPs repo, a bunch of pm-<identifier>-<slug>.md files. E.g. yesterday perhaps we would have had pm-20170405-tincan-db-drop.md

Obviously the <identifier> bit is negotiable; we could add another sequence number/letter if we do more than one PM in a day.

In terms of the template, I think the focus should be on ease of writing/reading for humans and not on designing some exhaustive form that captures every possible corner case and tries to enumerate all possible dimensions of data...

(But! having said that, a nice-to-have would be some conventions/tags that make it possible for us to mine this for data at some point... e.g. is it possible to write a tool that would read each incident start/resolution time and graph durations?)
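
(Also, for what it's worth, the "generating a blank postmortem template" tooling mentioned in the talk could be tiny for us. A rough Python sketch using the pm-<identifier>-<slug>.md naming above; the metadata stub is just a placeholder until we settle on a template:)

import sys
from datetime import date

# Usage: python new_pm.py tincan-db-drop
slug = sys.argv[1]
filename = "pm-{}-{}.md".format(date.today().strftime("%Y%m%d"), slug)

stub = (
    "---\n"
    "# metadata goes here once we settle on a template\n"
    "---\n"
    "\n"
    "What happened, in order: first we did this, then we did that...\n"
)

with open(filename, "x") as f:  # "x" mode refuses to clobber an existing postmortem
    f.write(stub)

print("Created", filename)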

@Nagyman
Contributor

Nagyman commented Apr 6, 2017

@jonprindiville read my mind RE: repo structure.

Some basic metadata could be encoded with a bit of convention... For example, Hugo supports some basic YAML-based metadata ... Perhaps something similar in Markdown?

---
title: "spf13-vim 3.0 release and new website"
description: "spf13-vim is a cross platform distribution of vim plugins and resources for Vim."
tags: [ ".vimrc", "plugins", "spf13-vim", "vim" ]
lastmod: 2015-12-23
date: "2012-04-06"
categories:
  - "Development"
  - "VIM"
slug: "spf13-vim-3-0-release-and-new-website"
---

Content of the file goes Here

@jonprindiville
Member

That could work nicely, I think. Something like:

author: [email protected]
system: gadventures.com
description: this thing broke because reasons

started: 1970-01-01 00:00:00
detected: 1970-01-01 00:00:00
fixed: 1970-01-01 00:00:00
---

Words go here... first we did this, then we did that other thing

How to identify the system at fault? (git repo name? maybe just free-text? how granular? maybe it's a list instead of a single thing?)

Is a brief description in this metadata block useful?

Other metadata to add without getting carried away? General tags? Tags for causes? Maybe that's optional?
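
Re the graph-the-durations idea from my earlier comment: if we stick to a metadata block like the one sketched above, pulling the numbers out is pretty easy. A rough sketch, assuming PyYAML and the started/detected/fixed field names proposed here (none of this exists yet):

import glob

import yaml  # PyYAML

def parse_frontmatter(path):
    """Return the metadata block between the first pair of '---' lines."""
    text = open(path).read()
    _, meta, _body = text.split("---", 2)
    return yaml.safe_load(meta)

for path in sorted(glob.glob("pm-*.md")):
    meta = parse_frontmatter(path)
    # PyYAML turns unquoted "1970-01-01 00:00:00" style values into datetime objects
    started, detected, fixed = meta["started"], meta["detected"], meta["fixed"]
    print(path)
    print("  time to detect: ", detected - started)
    print("  time to resolve:", fixed - started)

Graphing from there is just a matter of feeding those timedeltas into whatever we like (a spreadsheet, matplotlib, ...).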

@silent1mezzo
Contributor

I'd say the description would be the TL;DR

The body would be a markdown version of the email that was sent out. I'll create a repo and work on getting the template together.

@jonprindiville
Member

Maybe we can include some pointers about keeping things blameless. PMs are not meant to be about assigning fault; they're meant to help us learn as an organization how to avoid making the same mistakes in the future.

Some tips below (emphasis mine) from
https://codeascraft.com/2012/05/22/blameless-postmortems/ ...

... investigating mistakes in a way that focuses on the situational aspects of a failure’s mechanism and the decision-making process of individuals proximate to the failure, an organization can come out safer than it would normally be if it had simply punished the actors involved as a remediation.

... engineers whose actions have contributed to an accident can give a detailed account of:

  • what actions they took at what time,
  • what effects they observed,
  • expectations they had,
  • assumptions they had made,
  • and their understanding of timeline of events as they occurred.

... take a hard look at how the accident actually happened, treat the engineers involved with respect, and learn from the event.

@silent1mezzo
Contributor

@jonprindiville haha I already included that link in the README as well as the video you posted above. I'm going to take some time to write more about PMs as a TIP and as a blog post.
