Incident Management Protocol

Dani Donisa edited this page Sep 14, 2020 · 31 revisions

Effective Incident Management is vital to limit the disruption caused by an incident and to return to normal business operation as fast as possible.

When an incident happens, the priorities are:

  1. Stop the bleeding
  2. Restore the service
  3. Preserve the evidence in order to find the root-cause

Key elements:

  • Continuous communication to keep stakeholders and users up to date.
  • A role to coordinate communication between the different parties.
  • A role to think about the big picture and longer-term tasks, to offload those duties from the people working on the incident resolution.
  • A predefined communication channel where all the communication goes through.

Clear Roles

It's very important that every team member working on the incident resolution knows their role. Role separation makes clear what each team member should and should not do, which avoids confusion and chaos around who's responsible for what.

Every role should have full autonomy inside its boundaries and complete trust from the rest of the team members.

Every team member should ask their Planning Lead for help when the workload starts to feel excessive or overwhelming.

Every lead can delegate components of work to other colleagues as they see fit.

Every team member should be comfortable with every role, so shuffle the roles around when possible.

The Incident Manager (IM) Role

The duties of the Incident Manager are:

  • Keep the high level state of the incident at all times via the Living Incident State Document.
  • Structure the task force and assign/delegate roles according to necessities and priorities.
  • Hold all roles that are not explicitly delegated.
  • Remove roadblocks that interfere with Ops duties, if needed.
  • Clearly hand over this role to someone else when logging off, so everybody knows who's responsible for the Response at all times.

They should be capable of answering the following questions:

  • What is the impact to the users?
  • What are the users seeing?
  • How many/which users are affected (all, logged-in users, beta)?
  • When did it all start?
  • How many related issues have users opened?
  • Is there any security implication? Is there any data loss?

The Living Incident State Document starts to be filled in during this assessment phase.

Their challenges are:

  • Keep the team communication effective.
  • Stay up to date with the current theories about the incident, the observations and the team's lines of work.
  • Clearly assign the roles. Escalate as needed.
  • Are effective decisions being made?
  • Ensure that changes to the system are made carefully and intentionally.
  • Is the team exhausted? Can we hand over the incident management?

Other roles:

Roles we can use:

  • Ops: The Ops team should be the only one making changes to the system during the incident.
  • Communications: Ensure the rest of the team and the users are up to date via the designated external communication channel.
  • Support: Supports Ops by taking care of long-term tasks such as filing bugs and creating Trello cards or GitHub Issues, and keeps track of changes to the system (monkey patches, hotfixes, etc.) so they can be reverted once the incident is resolved.
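
The rule that the IM holds every role that is not explicitly delegated can be sketched as a small lookup. This is only an illustration: the `IncidentRoster` class, its method names and the people in the usage note are hypothetical, not part of any tool we use.

```python
from dataclasses import dataclass, field
from typing import Dict

# Roles taken from the list above; the IM role itself is tracked separately.
ROLES = ("Ops", "Communications", "Support")


@dataclass
class IncidentRoster:
    """Hypothetical helper that tracks who holds which role."""

    incident_manager: str
    delegated: Dict[str, str] = field(default_factory=dict)

    def delegate(self, role: str, person: str) -> None:
        """Explicitly delegate a role to a team member."""
        if role not in ROLES:
            raise ValueError(f"unknown role: {role}")
        self.delegated[role] = person

    def holder(self, role: str) -> str:
        """Return the delegate, or the IM when the role was never delegated."""
        if role not in ROLES:
            raise ValueError(f"unknown role: {role}")
        return self.delegated.get(role, self.incident_manager)

    def hand_over(self, new_incident_manager: str) -> None:
        """Hand the IM role over; undelegated roles follow the new IM."""
        self.incident_manager = new_incident_manager
```

For example, with `roster = IncidentRoster("alice")` and `roster.delegate("Ops", "bob")`, `roster.holder("Ops")` is `"bob"` while `roster.holder("Communications")` is still `"alice"`. After `roster.hand_over("carol")`, the undelegated roles belong to `"carol"`.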

Living Incident State Document

We need a place to track the incident response. The Living Incident State Document is the place to do it.

It is the duty of the IM to keep this document live and up-to-date.

It should be a functional document, with the most important information at the top.

It should be a document editable by everyone in the team, ideally in real time. It should be readable by everyone interested in how the incident is evolving.

Document example: https://landing.google.com/sre/sre-book/chapters/incident-document/

When to Declare an Incident

It's better to declare an incident early and call it off later, than to spin up an incident response team when everything is messed up by unorganized tinkering.

In case of doubt, we follow this guideline:

  • Do we have a service disruption, such as the main page not being reachable or users not being able to log in?
  • Do we have 500 errors piling up in our monitoring tools?
  • Do we have unresponsiveness or page slowdowns?
  • Is the issue visible to the users?
  • Is the problem still unresolved after an hour of focused analysis and work on the issue?
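
The guideline above could be sketched as a simple checklist. Note the assumption: this page does not say how many "yes" answers are needed, so the sketch treats a single "yes" as enough, in line with declaring early and calling it off later. The question keys and the function name are hypothetical.

```python
from typing import Dict

# Short, hypothetical keys for the five guideline questions above.
GUIDELINE_QUESTIONS = (
    "service_disruption",        # main page not reachable, users cannot log in
    "errors_piling_up",          # 500 errors piling up in monitoring tools
    "unresponsiveness",          # unresponsiveness or page slowdowns
    "visible_to_users",          # the issue is visible to the users
    "unresolved_after_an_hour",  # still unresolved after an hour of focused work
)


def should_declare_incident(answers: Dict[str, bool]) -> bool:
    """Return True when at least one guideline question is answered 'yes'.

    Assumption: one 'yes' is enough. Unanswered questions count as 'no'.
    """
    unknown = set(answers) - set(GUIDELINE_QUESTIONS)
    if unknown:
        raise ValueError(f"unknown questions: {sorted(unknown)}")
    return any(answers.get(question, False) for question in GUIDELINE_QUESTIONS)
```

With this sketch, `should_declare_incident({"visible_to_users": True})` returns `True`, and an empty or all-"no" set of answers returns `False`.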

When to Close an Incident

The incident is closed when the involved services are back to normal operation. This does not include those long-term tasks created during the incident response.

Incident Post Mortem

Once the incident is resolved, a post mortem is needed to find the root cause of the incident and take next steps to ensure the incident does not happen again.

This does not mean the root cause can't happen again. It means the next time, we can detect it earlier and react before it spins out of control.

Tools

Internal Communication Channel

We use Rocket Chat daily, so we'll use it for internal communication.

External Communication Channel

The external communication channel will be the existing public mailing list:

In each of those channels we point to the Status Page, which will keep a timeline of the incident.

TBD: Figure out how to write timelines in Status Page

Incident Document Template

We'll use our private Etherpad instance.

Incident document template is taken from the Google SRE Book.

Communications Templates

Coming up with sentences for communication updates is not something we should do during an incident. Having predefined and previously agreed-on communication templates removes that burden and allows the Communications role to focus on the what instead of the how.

Service Disruption

Title: Open Build Service Service Disruption
We are currently experiencing a service disruption.
Our team is working to identify the root cause and implement a solution. 
**ADD_GENERAL_IMPACT** users may be affected.
We will send an additional update in **NEXT_UPDATE_TIME** minutes.

General Unresponsiveness

Title: Open Build Service Page Unresponsiveness
The site is currently experiencing a higher than normal amount of load, which may cause pages to be slow or unresponsive.
**ADD_GENERAL_IMPACT** users may be affected.
We’re investigating the cause and will provide an update in **NEXT_UPDATE_TIME** minutes.
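
Filling in the placeholders by hand is easy to get wrong under pressure, so a small helper could do it. The sketch below is only an illustration: the `render_update` function and its parameter names are hypothetical, and it assumes the placeholders are written exactly as `**ADD_GENERAL_IMPACT**` and `**NEXT_UPDATE_TIME**` as in the templates above.

```python
import re

# The Service Disruption template from above, copied as plain text.
SERVICE_DISRUPTION = (
    "Title: Open Build Service Service Disruption\n"
    "We are currently experiencing a service disruption.\n"
    "Our team is working to identify the root cause and implement a solution.\n"
    "**ADD_GENERAL_IMPACT** users may be affected.\n"
    "We will send an additional update in **NEXT_UPDATE_TIME** minutes.\n"
)

# Matches any placeholder of the form **UPPER_CASE_NAME**.
PLACEHOLDER = re.compile(r"\*\*[A-Z_]+\*\*")


def render_update(template: str, general_impact: str, next_update_minutes: int) -> str:
    """Fill the two placeholders and refuse to return a half-filled message."""
    message = template.replace("**ADD_GENERAL_IMPACT**", general_impact)
    message = message.replace("**NEXT_UPDATE_TIME**", str(next_update_minutes))
    leftover = PLACEHOLDER.search(message)
    if leftover:
        raise ValueError(f"unfilled placeholder: {leftover.group(0)}")
    return message
```

For example, `render_update(SERVICE_DISRUPTION, "Logged-in", 30)` produces a message containing "Logged-in users may be affected." and "We will send an additional update in 30 minutes." The leftover check is there so that a new placeholder added to a template later cannot slip into a published update unnoticed.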

More examples:

https://support.atlassian.com/statuspage/docs/incident-template-library/

Training

TBD: In order to have a fast and smooth reaction to an incident, we could war-game the incident management with the team periodically. Pick something already resolved and role-play the response.
