“Root cause analysis (RCA),” as traditionally defined, is a problem-solving method for identifying, well, root causes. As a DevOps guy, I have a couple of problems of my own with this.
First, this definition implies there is, indeed, one root cause. In practice, we often find a cocktail of contributing causes, as well as negative (and sometimes positive) outcomes.
Second, the term implies that we are on a hunt for a cause. Yes, we are hunting for causes, but only to help us identify preventative actions, not to solve a mystery or, worse, to find an offender to punish.
Step 1: Establish “The Motive” by asking two questions
Do we think anyone on our team did something deliberately malicious to cause this issue? For example, did someone consciously carry out actions they knew would cause this, or something with similarly negative consequences? Or did they clearly understand the risks but care so little that they weren’t deterred?
Does anyone think someone outside our team could have deliberately caused this?
The assumption here is that the answer is “no” to both questions. If it is “no,” we can now proceed with a blameless mindset; for example, never stopping our analysis at a point where a person should (or could) have done something different.
If either answer is “yes,” we’re beyond the scope of this approach.
Step 2: Restate our meaning of “blameless”
Read aloud the following to everyone participating in the RCA:
“We have established that we don’t blame any individual either internal or external to our organization for the incident that has triggered this exercise. Our process has failed us and needs our collective input to improve it. If at any point during the process, anyone starts to doubt this statement or act like they no longer believe it, we must return to Step 1. Everyone is responsible for enforcing this.
“What is at stake here is not just getting to the bottom of this incident; it is getting to the bottom of this incident and of every future occurrence of it. If anyone feels mistreated by this process, human nature means they will disguise their actions in future to limit blame, and that will damage our ability to continuously improve.”
Step 3: Restate the following rules
Facts must not be subjective. If an assertion of fact cannot be fully validated, we should agree on and capture our confidence level (for example, high, medium or low). We must also capture the actions that we could take to validate it.
If we don’t have enough facts, we will prioritize the facts that we need to go away and validate before reconvening to continue. Before suspending the process, agree on a full list of “things we wish we knew, but don’t know,” capture the actions that we could take to validate them, and prioritize the discovery.
If anyone feels uncomfortable during the process because of blame, concern with the process, language, tone of voice or not having their voice heard, they must speak up immediately.
We are looking for causes only to inform what we can do to prevent recurrence, not to apportion blame.
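If it helps to make the first two rules concrete, here is a minimal sketch of a fact register in Python. Everything in it is my own illustration rather than a prescribed tool: the names `Fact`, `Confidence` and `discovery_backlog` are hypothetical, the example facts are made up, and sorting the least-confident facts first is just one assumed way to prioritize the discovery.

```python
from dataclasses import dataclass, field
from enum import IntEnum
from typing import List, Optional


class Confidence(IntEnum):
    """Agreed confidence level for a fact we could not fully validate."""
    LOW = 1
    MEDIUM = 2
    HIGH = 3


@dataclass
class Fact:
    statement: str
    validated: bool = False
    confidence: Optional[Confidence] = None  # required when not validated
    validation_actions: List[str] = field(default_factory=list)

    def __post_init__(self):
        # Rule: an assertion we cannot validate must carry a confidence level.
        if not self.validated and self.confidence is None:
            raise ValueError("an unvalidated fact needs an agreed confidence level")


def discovery_backlog(facts):
    """Return unvalidated facts, least confident first (an assumed priority order)."""
    pending = [f for f in facts if not f.validated]
    return sorted(pending, key=lambda f: f.confidence)


# Hypothetical example facts, for illustration only.
facts = [
    Fact("The deploy started at 14:02", validated=True),
    Fact("The cache was cold", confidence=Confidence.MEDIUM,
         validation_actions=["check cache metrics"]),
    Fact("The alert fired late", confidence=Confidence.LOW,
         validation_actions=["compare alert and log timestamps"]),
]

for fact in discovery_backlog(facts):
    print(fact.confidence.name, fact.statement, fact.validation_actions)
```

The backlog doubles as the list of “things we wish we knew, but don’t know” to agree on before suspending the session.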
Step 4: Agree on a statement to describe the incident that warranted this RCA
Using an open discussion, attempt to reach a consensus on a statement that describes the incident that warranted this RCA. This must identify the thing (or things) that we don’t want to happen again (including all negative side-effects). Don’t forget the impact on people; for example, having to work late to fix something. Don’t forget to capture the problem from all perspectives. Write this down somewhere everyone can see.
Step 5: Mark up the problem statement
Look at the problem statement and underline every aspect of it that someone could ask “why” about. Try to take an outsider’s view: even if you know the answer or think something cannot be challenged, it is still in scope for underlining.
Step 6: Perform the analysis
Document the “why” question related to each underlined aspect in the problem statement. For each why, attempt to agree on one direct answer. If you find you have more than one direct answer, split the why question into more specific why questions until each one has a single direct answer.
Mark up the answers as you did in Step 5. Repeat this step until you’ve built up a tree with at least three branches and at least five answers per branch. If you can’t find at least three branches, you need to ask more fundamental why questions about your problem statement and answers. If you can’t ask and answer at least five whys per branch, you are probably taking steps that are too large.
Do not stop this process with any branch ending on a statement that could be classified as “human error” (refer to what we agreed in Step 1).
Do not stop this process at something that could be described as a “third-party error.” The actions of third parties may not be directly under our control, but we remain accountable for the problem statement: where necessary, we should have put measures in place to protect ourselves from the third party.
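The stopping rules in this step can be sketched as a small check over the why tree. This is only an illustration under my own assumptions: the `Why` class, the `human_error` and `third_party_error` flags and the `check_tree` function are hypothetical names, and I am treating a “branch” as a path from a top-level why question down to an answer with no further whys beneath it.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Why:
    question: str
    answer: str
    human_error: bool = False        # answer could be classified "human error"
    third_party_error: bool = False  # answer could be described as "third-party error"
    children: List["Why"] = field(default_factory=list)


def paths(node, prefix=()):
    """Yield every path from this node down to a leaf answer."""
    prefix = prefix + (node,)
    if not node.children:
        yield prefix
    for child in node.children:
        yield from paths(child, prefix)


def check_tree(roots, min_answers=5, min_branches=3):
    """Return a list of reasons why the analysis is not finished yet."""
    problems = []
    branches = [p for root in roots for p in paths(root)]
    if len(branches) < min_branches:
        problems.append(
            f"only {len(branches)} branch(es); need at least {min_branches}"
        )
    for branch in branches:
        leaf = branch[-1]
        if len(branch) < min_answers:
            problems.append(
                f"branch ending at {leaf.answer!r} has {len(branch)} answer(s); "
                f"need at least {min_answers}"
            )
        if leaf.human_error or leaf.third_party_error:
            problems.append(
                f"branch ends on human or third-party error: {leaf.answer!r}"
            )
    return problems
```

An empty list from `check_tree` means only that the tree meets the minimum shape described above. Whether the answers are any good is still for the group to judge.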
Step 7: Form countermeasure hypotheses
Review the end points of your analysis tree and hypothesize about actions that could prevent recurrence. Each hypothesis should be specific and testable.
Use whatever mechanism you already have for capturing and prioritizing work to track the identified actions and get them implemented. State acceptance criteria the way you normally do, and don’t close an action until its tests show that it has been effective.
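As a rough sketch of that last rule, an action can carry its acceptance tests with it and refuse to close until they all pass. The `Countermeasure` class and its methods are hypothetical names of my own, and each test is assumed to be a simple callable returning `True` or `False`; your own tracker will have its own way of recording this.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict


@dataclass
class Countermeasure:
    hypothesis: str  # specific, testable statement of what the action should prevent
    acceptance_tests: Dict[str, Callable[[], bool]] = field(default_factory=dict)

    def failing_tests(self):
        """Names of acceptance tests that do not currently pass."""
        return [name for name, check in self.acceptance_tests.items() if not check()]

    def can_close(self):
        # An action with no tests is not testable, so it cannot be closed either.
        return bool(self.acceptance_tests) and not self.failing_tests()
```

Treating an action without any tests as not closable is a design choice on my part; it follows from the requirement that every hypothesis be testable.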