The 5 whys: Another attempt
Towards the end of the week before last and the beginning of last week we'd been having quite a few problems with our QA environment, to the point where we were unable to deploy anything to it for 3 days.
A few weeks ago I wrote about a 5 whys exercise that we did in a retrospective, and in our weekly code review we decided to give the technique another go and see what we could learn.
We started with the question 'Why was there a mess?' and then branched out the first-level whys, since it was fairly clear that more than one thing had contributed to our problems.
We ended up with 5 answers to the first why:
- There was a DNS change
- Volume was deleted from our QA server
- System tests failing
- Change in one project hanging QA deployment
- Main build broken for a while
We then worked across the whiteboard taking each of these in turn.
I think our approach allowed us to avoid part of 'the cult of the root cause' which Don Reinertsen wrote about.
It still wasn't quite spot on due to some mistakes I made while facilitating, but these were my observations:
- Once we got to answering the whys for the 4th and 5th first-level whys the whiteboard was way too cluttered and it had become quite difficult to see exactly where we'd got up to. As a result we lost the discipline of answering the question 'why?' and drifted off into general discussion around the original question, but stopped drilling down further looking for a potential root cause. Next time I think it would probably work better to look for the first why and collect any other potential whys on the same level in a 'parking lot' type area which we could then come back to later on.
- Having said that, a neat thing about having the whys alongside each other was that we were able to see that the first two whys were linked to each other. Both changes had been made by someone on the operations team based on conversations they'd had with people on our team. We realised that our communication with the operations team hadn't been entirely clear and had left room for doubt, which had led to unexpected changes being made to the servers. This was an example of us stopping before we'd drilled down to 5 levels, having realised that we could influence the situation positively even if we hadn't found the root cause of the problem.
- Drilling down into the 'System tests failing' why led to the most interesting insights:
  - System tests failing
    - No-one cares about them
      - We can push to QA even if they're broken
      - Used to them failing
        - Perception amongst devs that they're flaky
          - There had previously been a time when data changed frequently and broke them.
      - Seen as being owned by the QAs
        - The tests were defined by QAs
      - The time from checkin to system tests failing is quite long
- Looking back at this now we probably should have drilled a bit further down on some of the whys. We actually ended up discussing the perception amongst the developers that the tests were flaky, and it was pointed out that most of the failures were actually real. We don't currently have a 'stop the line' mentality if the system tests fail, but we've agreed to adopt that approach for the next iteration and will check at the end of this week to see whether we've improved; a rough sketch of what such a gate could look like follows this list.
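As an aside, here's a very rough sketch of what a 'stop the line' gate could look like if the QA deployment were scripted. This isn't what we actually run: `run-system-tests.sh` and `deploy-to-qa.sh` are made-up placeholders for whatever commands drive the system tests and the QA deployment, and the point is simply that the deployment refuses to go ahead while the system tests are red.

```python
#!/usr/bin/env python
"""Hypothetical 'stop the line' gate for QA deployments.

Assumes (purely for illustration) that the system tests and the QA
deployment can each be driven by a single command.
"""
import subprocess
import sys

# Made-up placeholder commands, not anything from our actual build.
SYSTEM_TEST_COMMAND = ["./run-system-tests.sh"]
QA_DEPLOY_COMMAND = ["./deploy-to-qa.sh"]


def main() -> int:
    # Run the system tests; a non-zero exit code means at least one failure.
    tests = subprocess.run(SYSTEM_TEST_COMMAND)
    if tests.returncode != 0:
        # Stop the line: block the deployment and fail the pipeline so the
        # failure has to be looked at before anything else goes to QA.
        print("System tests failed - QA deployment blocked.", file=sys.stderr)
        return tests.returncode

    # Tests passed, so the deployment is allowed to proceed.
    deploy = subprocess.run(QA_DEPLOY_COMMAND)
    return deploy.returncode


if __name__ == "__main__":
    sys.exit(main())
```

In practice the same effect could just as easily come from configuring the build server to only trigger the QA deployment off a green system test run.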
Even though I didn't facilitate the exercise perfectly I think there was still a far greater level of analysis done by the team in this exercise than in others that I've seen. I've noticed that a lot of retrospective-type exercises tend to only encourage surface-level analysis, so we never really go deeper into a subject and see whether we can actually make some useful changes to the way that we work.