Incident Management Protocol
Effective Incident Management is vital to limit the disruption caused by an incident and to return to normal business operation as fast as possible.
When an incident happens, the priorities are:
- Stop the bleeding
- Restore the service
- Preserve the evidence in order to find the root cause
Key elements:
- Continuous communication to keep stakeholders and users up to date.
- A role to coordinate communication between the different parties.
- A role to think about the big picture and longer-term tasks, offloading those duties from the people working on the incident resolution.
- A predefined communication channel that all communication goes through.
It's very important that every team member working on the incident resolution knows their role. The role separation makes clear what each team member should and should not do, avoiding confusion and chaos around who's responsible for what.
Every role should have full autonomy inside its boundaries and complete trust from the rest of the team members.
Every team member should ask their Planning Lead for help when the workload starts to feel excessive or overwhelming.
Every lead can delegate components of work to other colleagues as they see fit.
Every team member should be comfortable with every role, so shuffle the roles around when possible.
The duties of the Incident Manager are:
- Keep the high-level state of the incident at all times via the Living Incident State Document.
- Structure the task force and assign/delegate roles according to needs and priorities.
- Hold all roles that are not explicitly delegated.
- Remove roadblocks that interfere with Ops duties, if needed.
- Clearly hand over this role to someone else when logging off, so everybody knows who's responsible for the response at all times.
They should be capable of answering the following questions:
- What is the impact to the users?
- What are the users seeing?
- How many/which users are affected (all, logged-in users, beta)?
- When did it all start?
- How many related issues have users opened?
- Is there any security implication? Is there any data loss?
This assessment phase is when the Living Incident State Document starts to be filled in.
Their challenges are:
- Keep the team communication effective.
- Stay up to date on the current theories about the incident, the observations and the team's lines of work.
- Clearly assign the roles. Escalate as needed.
- Are effective decisions being made?
- Ensure changes to the system are made carefully and intentionally.
- Is the team exhausted? Can we hand over the incident management?
Roles we can use:
- Ops: The Ops team should be the only one making changes to the system during the incident.
- Communications: Ensure the rest of the team and the users are up to date via the designated external communication channel.
- Support: Works as support to Ops, taking care of other long-term tasks like filing bugs, creating Trello cards, GitHub issues, etc., and keeping track of changes made to the system (things like monkey patches, hotfixes, etc.) so they can be reverted once the incident is resolved.
We need a place to track the incident response. The Living Incident State Document is the place to do it.
It is the duty of the IM to keep this document live and up-to-date.
It should be a functional document, with the most important information at the top.
It should be a document editable by everyone in the team. Ideally, editable in real time. It should be readable for everyone interested in how the incident is evolving.
Document example: https://landing.google.com/sre/sre-book/chapters/incident-document/
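As a reference, here is a minimal sketch of how the Living Incident State Document could be structured. The section names below are an assumption, loosely following the Google SRE example linked above, and should be adapted by the team as needed:

```markdown
# Incident: <short summary> (LIVE / RESOLVED)

## Current Status
- Impact: <what users are seeing, who is affected>
- Start time: <when it all started>

## Roles
- Incident Manager: <name>
- Ops: <name(s)>
- Communications: <name>
- Support: <name>

## Live Timeline (most recent first)
- <time>: <observation, action taken, or decision>

## Current Theories / Lines of Work
- <theory or task> (<who is working on it>)

## Exit Criteria
- <what "back to normal operation" means for this incident>
```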
It's better to declare an incident early and call it off later, than to spin up an incident response team when everything is messed up by unorganized tinkering.
In case of doubt, we follow this guideline:
- Do we have a service disruption, like the main page not being reachable or users not being able to log in?
- Do we have 500 errors piling up in our monitoring tools?
- Do we have unresponsiveness, page slowdown?
- Is the issue visible to the users?
- Is the problem still unresolved after an hour of focused analysis and work on the issue?
The incident is closed when the involved services are back to normal operation. This does not include those long-term tasks created during the incident response.
Once the incident is resolved, a post mortem is needed to find the root cause of the incident and take next steps to ensure the incident does not happen again.
This does not mean the root cause can't happen again. It means the next time, we can detect it earlier and react before it spins out of control.
We use Rocket Chat daily, so we'll use that channel for internal communication.
The external communication channels will be the existing public ones:
- [email protected]
- OBS Announcements
- Rocket Chat
- IRC
In each of those channels we point to the Status Page, which will keep a timeline of the incident.
TBD: Figure out how to write timelines in Status Page
We'll use our private Etherpad instance.
Incident document template is taken from the Google SRE Book.
Having to come up with sentences to use in communication updates is not something we should do during an incident, so having predefined and previously agreed-on communication templates removes that burden and allows the Communications role to focus on the what instead of the how.
Title: Open Build Service Service Disruption
We are currently experiencing a service disruption.
Our team is working to identify the root cause and implement a solution.
**ADD_GENERAL_IMPACT** users may be affected.
We will send an additional update in **NEXT_UPDATE_TIME** minutes.
Title: Open Build Service Page Unresponsiveness
The site is currently experiencing a higher than normal amount of load, which may be causing pages to be slow or unresponsive.
**ADD_GENERAL_IMPACT** users may be affected.
We’re investigating the cause and will provide an update in **NEXT_UPDATE_TIME** minutes.
More examples:
https://support.atlassian.com/statuspage/docs/incident-template-library/
TBD: In order to react quickly and smoothly to an incident, we could war-game the incident management with the team periodically. Pick something already resolved and role-play the response.