In my earlier blog on Debugging Watchdog Timeouts I mentioned the dreaded No Trouble Found (NTF) problem. Some have asked why NTF matters. The answer is that NTF is a huge cost to companies, compounded by the fact that it is extremely difficult to quantify and address. Something as simple as an errant wedding ring can cost companies millions of dollars. Let me explain…
A study by Accenture1 found that, within the consumer electronics industry, product returns range from 11 to 20 percent, and more than two-thirds of these can be characterized as NTF. The cost throughout the value chain (the user, retailer, service provider, and OEM) runs into the billions of dollars. NTF specifically describes a product that fails (or appears to fail) in the field but, when returned to the OEM for testing, does not fail. In other words, the failure occurs (or seems to occur) in the field but cannot be reproduced in the lab. NTFs tend to be extremely difficult to pin down precisely because of this elusive nature: after all, if the issue cannot be reproduced, how do we really know it occurred in the first place?
Perhaps the most visible recent manifestation of the NTF problem has been Toyota's unintended-acceleration issues. At the most fundamental level, problems were occurring in the "real world" for which sufficient diagnostic information did not exist to correlate them to a root cause – after all, when a car crashes, much of the evidence may have been destroyed.
Alas, NTFs will never completely go away, because their sources are many-faceted and often involve human error. Due to the multi-dimensional nature of the problem, many companies create task forces to address NTFs. I was part of one of these myself when I worked for Nortel Networks back in the mid-1980s. One particular voice switching system in Bell Canada's network kept failing in the middle of the night. By the time we rushed in with our logic analyzers and other tools, the system had recovered, but with no useful log information. We spent months of effort and millions of dollars troubleshooting the transient problems in this one switch. We even brought in Geiger counters to test whether spurious radiation or solar flares were causing the problem. We finally discovered that a disgruntled Bell employee had been sneaking behind the system and running his wedding ring up and down the backplane. He just enjoyed watching the alarms go off and all of the excitement that ensued.
Nortel subsequently spent millions more improving the diagnostics of its voice switching systems so they could log forensic data on the specific nature of a problem (i.e. transient short circuits) rather than simply crashing.
1 Accenture, Big Trouble with “No Trouble Found” Returns, 2008.