I find it quite silly, though.
Well, he was wrong. I think I have some interesting pseudo-calculations around what defects are and how we can manage them.
The Problem
You might have heard this in your own team-space and thought, “yeah, what is wrong with no defects?” The problem is that it encourages a bias toward inertia. The logic is simple: the only way to guarantee no defects is to never release any software. Though no one is likely to accept this extreme, the idea seeps its way into our minds every time we hear “no defects.”
We are in a climate where every day, sometimes every hour, that we can get something valuable out the door gives us a competitive advantage. So even the slight delay of “well, are you sure we should deploy this? Are you certain it won’t cause a bug?” is a lost opportunity.
Risk Gives Us Profit
Negative Business Impact
As I see it, the root of most fear over defects is what I call Negative Business Impact. This takes the form of things like service downtime, customers being unable to submit their orders, or a button showing up misaligned at the bottom of the screen.
All of these have varying levels of business impact. The misaligned button, for example, is annoying but does not stop a customer from achieving their goals. A service being down, however, can have large ramifications.
Here is how I view it:
Negative Business Impact (NBI) = Frequency of Incident * Severity of Incident.
Incident: this is an occurrence of a potential defect.
Frequency: This is the number of times that incident occurs in a time period.
Severity: This is the juicy one. This is how badly even one occurrence of an incident stops your software from generating revenue or saving costs.
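The calculation above can be sketched in a few lines of code. This is a minimal illustration; the function name, units, and example numbers are mine, not fixed parts of the formula.

```python
def negative_business_impact(frequency_per_month: float, severity_in_dollars: float) -> float:
    """NBI = Frequency of Incident * Severity of Incident."""
    return frequency_per_month * severity_in_dollars

# A misaligned button: frequent but nearly harmless.
button_nbi = negative_business_impact(frequency_per_month=100, severity_in_dollars=1)

# A service outage: rare but expensive.
outage_nbi = negative_business_impact(frequency_per_month=0.5, severity_in_dollars=50_000)

# Despite being two hundred times rarer, the outage dominates the impact.
```

The point of running the numbers is exactly this comparison: frequency alone tells you very little until you multiply it by severity.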
A Dive Into Severity
While an initial glance shows that not all defects or incidents are created equal, I feel this calculation is worth a deeper dive.
First, Severity is a very subjective variable. We can rank incidents, or even assign a dollar value for types of incidents. I find that having buckets of severities is the simplest thing to do. Later we will see why it is useful to have some quantity of value lost per each bucket.
There are two types of severity:
Severity (Constant) = Value Lost. This is for things that, once they happen, there is nothing you can do about them. If you submit your order and you get the wrong parts, it has already happened. It does not get worse over time.
Severity (Duration-Based) = Value Lost/Time * Time to Recover. This is where the duration of the incident impacts your value lost. For example, if your service is down, the value lost increases the longer it stays down. You could try to tag each failed attempt to connect as its own incident, but since you are down you probably cannot count those, so it's best to calculate or estimate an average value lost per unit of time.
Value Lost: revenue lost or a more abstract quantity, like a customer delight score.
Time to Recovery = Time to Triage + Time to Resolve. This is how long it takes to both triage/investigate an incident and then resolve it so the system works again. For common incidents, it is often calculated as a Mean Time to Recover (MTTR). Plugging MTTR into Severity gives you the mean Severity for the incident.
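Here is how the two severity types and the recovery-time breakdown might look as code. The function names and the dollar/minute figures are illustrative assumptions, chosen to mirror the wrong-parts and outage examples above.

```python
def constant_severity(value_lost: float) -> float:
    """Severity (Constant) = Value Lost: a one-shot loss that does not grow over time."""
    return value_lost

def duration_severity(value_lost_per_minute: float, time_to_recover_minutes: float) -> float:
    """Severity (Duration-Based) = Value Lost/Time * Time to Recover."""
    return value_lost_per_minute * time_to_recover_minutes

def time_to_recovery(time_to_triage: float, time_to_resolve: float) -> float:
    """Time to Recovery = Time to Triage + Time to Resolve."""
    return time_to_triage + time_to_resolve

# Wrong parts shipped: a fixed loss per occurrence, however long it takes to notice.
wrong_parts = constant_severity(value_lost=200)

# Service outage: losing $1,000/minute, with 10 minutes to triage and 20 to resolve.
outage = duration_severity(1_000, time_to_recovery(10, 20))
```

Notice that only the second type rewards faster recovery: shaving minutes off triage or resolution directly shrinks the severity, which is why MTTR gets so much attention.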
The Sneakiness of Customer Trust
A common retort from those who have not thought through the above calculations is “well, any incident will reduce customer trust.” Will it really, though?
Take two video games as examples. The first is a one-man show that a kid developed in their spare time. Or maybe it is an EA game. The mechanics are clunky and not fun at all. On top of that, there is a bug where it sometimes autosaves during combat, so a player may die and then reload into a sticky situation. In this case, you can bet customers will not trust the game developer at all.
But let’s say a AAA game was just released and fans clamor to get it. They play it and love it. It also has the same bug: it sometimes autosaves during combat. But the game is so fun fans just shrug and say “They will fix it in the next patch” and continue on with their gameplay.
We can boil down these examples into the following pseudo-calculation:
Customer Trust = Value Delivered – Negative Business Impact
Value Delivered: Again, this could be revenue or a more abstract quantity, but it covers all the things that have been delivered that have also caused our current set of incidents.
This calculation lets me respond easily to the customer trust objection by asking, “And how much more trust is lost if you are not putting anything valuable into their hands?” People will be more forgiving the more valuable things you give them. Admittedly, knowing in advance what value you will deliver can be tricky, even impossible.
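The two-games example can be expressed with the trust formula directly. The numbers here are made up for illustration; only the relationship between them matters.

```python
def customer_trust(value_delivered: float, negative_business_impact: float) -> float:
    """Customer Trust = Value Delivered - Negative Business Impact."""
    return value_delivered - negative_business_impact

# Both games share the same bug, so they share the same NBI (arbitrary units).
autosave_bug_nbi = 30

# The clunky indie game delivers little value: trust goes negative.
clunky_game = customer_trust(value_delivered=5, negative_business_impact=autosave_bug_nbi)

# The beloved AAA game delivers far more value: trust stays strongly positive
# despite the identical bug.
beloved_game = customer_trust(value_delivered=500, negative_business_impact=autosave_bug_nbi)
```

The identical bug costs the same in both cases; what differs is how much value sits on the other side of the subtraction.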
Hedge Your Risk
If we take these pseudo-calculations to heart, we can derive what our goals should be: optimize for Value Delivered while keeping Negative Business Impact as low as possible. This means we want mechanisms to continually deliver value.
But we also want to find ways to keep Severity low. This usually means focusing on reducing Mean Time to Recover, through mechanisms like self-healing services. It can also mean applying techniques like canary releasing so that the cost and/or frequency of incidents stays low.
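To see why these two levers work, we can plug hypothetical numbers into the earlier formulas. The 5% canary split, the per-minute loss, and the recovery times below are all invented for illustration.

```python
def nbi(frequency: float, severity: float) -> float:
    """NBI = Frequency of Incident * Severity of Incident."""
    return frequency * severity

value_lost_per_minute = 1_000.0

# Baseline: a bad release hits all traffic and takes 60 minutes to recover.
baseline = nbi(frequency=1, severity=value_lost_per_minute * 60)

# Canary release: only 5% of traffic sees the incident, same recovery time,
# so the value lost per minute (and thus the severity) shrinks accordingly.
canary = nbi(frequency=1, severity=value_lost_per_minute * 0.05 * 60)

# Self-healing service: full traffic is hit, but recovery drops to 5 minutes.
self_healing = nbi(frequency=1, severity=value_lost_per_minute * 5)
```

Either lever alone cuts the impact by an order of magnitude here, and they compound when combined, which is why both show up in most risk-hedging strategies.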
Conclusion
Even though these calculations are not strictly scientific, I still find them useful. It is good for us to question “defects are bad” as a maxim. We need to understand the deeper context of the tradeoffs we make when managing defect risk. And I believe these pseudo-calculations, like Negative Business Impact, can help us form a framework to do that. With that in hand, we can explore even further into testing in production.