TIP 3: Post Mortems #4
Watched a talk called Finding the Order in Chaos, which covers lessons learned and tooling from an examination of past postmortems at Google. Obviously we are not Google, but there may be some useful learnings.

To analyze those postmortems you need data. The most important points that they like to collect are:

Timing data:
Also interested in escalation time (when the incident was brought to wider attention, e.g. when a formal incident response procedure starts), mitigation time (drain/failover/push/rollback), when the impact was fully understood, and when the root cause was fully understood.

Incident metadata:
Also interested in severity dimensions and detection method (e.g. got an alert, log tailing, user reported).

They built tooling for:
Try to make capturing this data easy for postmortem authors. Simple tools can do a lot: they started with Google Sheets and BigQuery!

Q: Threshold for postmortems?
A: Depends on the organization. Google has a strong "postmortem culture": almost every issue that reaches customers gets postmortem'd, probably thousands of people inside the org write postmortems, and there is lots of reading of and commenting on each other's postmortems.
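To make the timing points above a bit more concrete, here is a minimal sketch of how they could be recorded and turned into intervals. The class name, field names, and interval names are all hypothetical choices for illustration, not anything taken from the talk or from Google's tooling.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Dict, Optional


@dataclass
class IncidentTimeline:
    """Timestamps for one incident. All field names are hypothetical."""
    started: datetime
    escalated: Optional[datetime] = None
    mitigated: Optional[datetime] = None
    resolved: Optional[datetime] = None

    def intervals(self) -> Dict[str, timedelta]:
        """Elapsed time from incident start to each milestone that was recorded."""
        milestones = {
            "time_to_escalate": self.escalated,
            "time_to_mitigate": self.mitigated,
            "time_to_resolve": self.resolved,
        }
        return {
            name: moment - self.started
            for name, moment in milestones.items()
            if moment is not None  # skip milestones the author did not fill in
        }


# Made-up example data: only the mitigation time was recorded.
timeline = IncidentTimeline(
    started=datetime(2018, 1, 1, 12, 0),
    mitigated=datetime(2018, 1, 1, 12, 45),
)
for name, delta in timeline.intervals().items():
    print(name, delta)
```

Keeping every milestone except the start optional fits the "make it easy for authors" goal: a postmortem with partial timing data still yields whatever intervals it can.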
I like the idea of postmortem'ing all issues that reach customers, to help prevent them in the future. Is there a good way to make this part of our culture? How can we encourage this? Where do the findings go? Postmortem repository? :)
I like the idea of a repo for this -- encourages commenting on other people's postmortems and makes collaborating on a PM easy. IMO a repo for these things plus a markdown/rst/whatever template for this gets us pretty far.

In terms of what that template looks like... you think we should workshop that here in this PR/repo, or do that over in the postmortem repo?
@jonprindiville @Nagyman makes sense. I've sent a couple out recently and could provide a template for them. As for the repo, how do you see it structured? By app? Flat, with the app names in the filename?
@silent1mezzo re repo structure: could be very simple, like this TIPs repo, bunch of Obviously

In terms of template I think that the focus should be on ease of writing/reading for humans and not on designing some exhaustive form that captures every possible corner case and tries to enumerate all possible dimensions of data...

(But! having said that, a nice-to-have would be some conventions/tags that make it possible for us to mine this for data at some point... e.g. is it possible to write a tool that would read each incident start/resolution time and graph durations?)
@jonprindiville read my mind RE: repo structure. Some basic metadata could be encoded with a bit of convention... For example, Hugo supports some basic YAML-based metadata ... Perhaps something similar in Markdown?
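As a rough sketch of how that convention could be mined later, assuming each postmortem is a Markdown file that opens with a front matter block between two `---` lines (the delimiter Hugo uses for YAML front matter). The keys `title`, `start`, and `resolved` and both function names are hypothetical, and the parser only understands flat `key: value` lines, so a real tool would probably hand the block to a proper YAML library instead.

```python
from datetime import datetime, timedelta
from typing import Dict, Tuple


def split_front_matter(text: str) -> Tuple[Dict[str, str], str]:
    """Split a document into (metadata, body).

    Handles only flat `key: value` lines between two `---` lines.
    """
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}, text  # no front matter at all
    meta: Dict[str, str] = {}
    for index, line in enumerate(lines[1:], start=1):
        if line.strip() == "---":
            return meta, "\n".join(lines[index + 1:])
        if ":" in line:
            # Split on the first colon only, so timestamps keep theirs.
            key, _, value = line.partition(":")
            meta[key.strip()] = value.strip()
    return {}, text  # no closing delimiter, treat everything as body


def incident_duration(meta: Dict[str, str]) -> timedelta:
    """Duration between the hypothetical `start` and `resolved` keys."""
    start = datetime.fromisoformat(meta["start"])
    resolved = datetime.fromisoformat(meta["resolved"])
    return resolved - start


# Made-up postmortem used only to exercise the functions above.
EXAMPLE = """---
title: Example outage
start: 2018-01-01T12:00:00
resolved: 2018-01-01T13:30:00
---
Body of the postmortem goes here.
"""

meta, body = split_front_matter(EXAMPLE)
print(meta["title"], incident_duration(meta))
```

Running something like this over every file in the repo would give the list of durations to graph, while the body stays free-form prose for humans.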
That could work nicely, I think. Something like:
How to identify the Is a brief Other metadata to add without getting carried away? General tags? Tags for causes? Maybe that's optional?
I'd say the description would be the TL;DR. The body would be a markdown version of the email sent out. I'll create a repo and work on getting the template together.
Maybe we can include some pointers about keeping things blameless. PMs are not meant to be about assigning fault; they are meant to help us learn as an organization how to avoid making the same mistakes in the future. Some tips below (emphasis mine) from
@jonprindiville haha, I already included that link in the README, as well as the video you posted above. I'm going to take some time to write more about PMs as a TIP and as a blog post.