Perils of MTBF
Every organization talks about product reliability in some manner. Sometimes our customers provide explicit reliability requirements. Sometimes our customers have an expected metric to report reliability expectations. Our industry may have a ‘standard’ means to discuss reliability. Or, we have a local ‘tradition’.
One of the most common is MTBF.
MTBF, or Mean Time Between Failure, and the many variations of this term have one thing in common. It is the most misunderstood four letter acronym in engineering. For the purpose of this discussion I am using MTBF, and most of the comments apply equally to MTTF, MTBUR, etc.
During a presentation on this subject to a group of reliability professionals, I asked if anyone in the room had encountered trouble with MTBF. Nearly every one of the more than 100 people in attendance quickly raised a hand. We spent the next hour sharing horror stories resulting from the misuse of MTBF.
What is MTBF?
Technically, MTBF (MTTF actually, more on that later) is an unbiased estimator of the exponential distribution parameter, theta. This follows from how we calculate the value from either test or field data.
Suppose we have 50 units that all run for 100 hours, and right at the end of the 100 hours one of the units fails. We can calculate the MTBF as follows. First, determine the total hours all the units operated. That’s easy: 50 units times 100 hours is 5,000 hours. Then divide the total operating hours by the number of failures. In this simple example, that is one, for a resulting MTBF = 5,000/1 = 5,000 hours.
Note: if we had 100 units run for 50 hours and had one failure at the end of 50 hours, the result is the same. Or, if one unit runs for 5,000 hours before failing. Or, 5,000 units each running for one hour, then one fails. Weird, right?
Well, not so strange if the underlying failure mechanism has an equal chance of causing a failure every hour (or moment). If the chance of failure is constant, or as we say, the hazard rate or failure rate is constant, then the above method to estimate MTBF is valid.
There are better ways to estimate MTBF when the assumption of a constant failure rate is not true. Yet, most often MTBF is calculated as described above.
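The calculation described above can be sketched in a few lines of Python. This is only an illustration of the total-hours-over-failures arithmetic; the function name `mtbf_estimate` is made up for this example, and the sketch assumes a constant failure rate.

```python
def mtbf_estimate(unit_hours, failures):
    """Total operating hours divided by the number of failures."""
    if failures <= 0:
        raise ValueError("at least one failure is needed for this estimate")
    return sum(unit_hours) / failures


# Four very different tests, each ending with a single failure.
scenarios = {
    "50 units x 100 hours": [100] * 50,
    "100 units x 50 hours": [50] * 100,
    "1 unit x 5,000 hours": [5000],
    "5,000 units x 1 hour": [1] * 5000,
}

for label, hours in scenarios.items():
    print(f"{label}: MTBF = {mtbf_estimate(hours, failures=1):,.0f} hours")
```

Every scenario reports the same 5,000 hours, because the estimate only sees total operating hours and the failure count, not how the hours were accumulated.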
Light Bulbs & Smoke Detectors
How often do you change incandescent light bulbs? Randomly, right? When a bulb burns out (why does it always seem to be the hardest to reach and least available bulb?) you find a spare bulb and replace the burned out one. Do you then think about changing the rest of the similar light bulbs in the house? Probably not.
Note: Light bulbs really do not follow the exponential distribution for time to failure. Bill Meeker did the experiments and conveyed this information to me after reviewing this site. I’ll update this note once I find a good example. In the meantime, let’s assume light bulbs follow a random pattern with respect to time to failure.
Incandescent light bulbs tend to follow the exponential life distribution. (This is not actually true, yet in my experience and limited data the time-to-failure distribution in my home is close enough.) And as such there is no rationale to conduct preventative maintenance. The memoryless feature of the distribution suggests the new bulb has exactly the same chance of failure in the next hour as the existing working light bulb. So there is no time or cost benefit to the preventative replacement.
Note: In talking with Professor Bill Meeker, I discovered that he had a group of students run a life test on incandescent light bulbs. They found that the life distribution is actually best modeled by a normal distribution, not exponential. Bill recommended that we think of the failure rate of a ceramic mug, which typically fails by dropping to the floor and shattering or by being struck and chipped. The failure rate may well be exponential in that case.
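The memoryless property mentioned above can be checked numerically for any item that really does follow the exponential distribution. The sketch below is a minimal illustration; the 1,000 hour MTBF is a hypothetical value chosen only for the example.

```python
import math


def reliability(t, mtbf):
    """Exponential reliability: probability a unit survives to time t."""
    return math.exp(-t / mtbf)


def conditional_reliability(extra, age, mtbf):
    """Probability of surviving `extra` more hours, given survival to `age`."""
    return reliability(age + extra, mtbf) / reliability(age, mtbf)


mtbf = 1000.0  # hypothetical value, for illustration only

for age in (0, 500, 5000):
    p = conditional_reliability(1, age, mtbf)
    print(f"age {age:>5} h: chance of surviving the next hour = {p:.6f}")
```

The printed chance is the same at every age, which is why replacing a working unit with a new one gains nothing under this assumption.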
Now, if your community is like mine, you receive annual reminders to change the batteries in your smoke detectors. Those 9V batteries do tend to wear out. Ignoring the preventative maintenance leads to middle of the night low battery power ‘beeps’ from the smoke detector.
This then leads to the annual effort to change the 9V batteries in all of the smoke detectors. I’ve seen the same behavior in office buildings using fluorescent tube lighting. The maintenance crew tends to replace entire banks of tubes. When I asked why, I learned of their experience. “When one goes, then all will fail soon after. So, while we have the ladder out, we just replace them all.”
A few common ‘issues’
In support of the statement ‘worst four letter acronym’, consider each of the four letters.
M – stands for Mean
Speaking statistically, this is the expected value or the first moment of the distribution. Each distribution has a mean value.
The issue stems, in my opinion, from those undergraduate statistics classes most would rather forget. The normal (Gaussian) distribution dominated those lectures. Many sections and test questions started with the phrase
“Assuming a normal distribution….”
It was drilled into our engineering minds. The learned response was ‘mean’ is ‘average’ is the 50th percentile of a normal distribution. One half of values are above and one half are below.
Therein dwells the root of a mistaken understanding of MTBF. Not all distributions have the same properties concerning mean values, which was most likely not mentioned during the undergraduate statistics course. For example, the exponential distribution has an expected value, or mean, that falls at the 63.2nd percentile. Roughly one third (36.79%) of values are above and roughly two thirds (63.21%) are below.
Let’s assume we have 1000 light bulbs with an MTBF of 100 hours. How many will still be working at the end of 100 hours of operation?
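Only about 368 of them (36.8%). This is as expected if using the reliability function of the exponential distribution, R(t) = exp(-t/MTBF), which equals exp(-1) ≈ 0.368 when t equals the MTBF.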
If we run the time out a little further, the number of working bulbs follows what we commonly call exponential decay. The chance of failure each hour for each light bulb is the same. It just takes more time to have the same number of failures. In the first hour of the experiment with 1000 light bulbs, about 10 failed (1000 x 1/100 = 10 failures in one hour). When there are only 500 light bulbs remaining, it takes about two hours to incur 10 failures (500 x 1/100 = 5 failures in one hour).
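The survivor counts in this light bulb example can be reproduced with a short Python sketch. It assumes the exponential reliability function R(t) = exp(-t/MTBF); the function name `survivors` is made up for this illustration.

```python
import math


def survivors(n_units, hours, mtbf):
    """Expected number of units still working, assuming exponential life."""
    return n_units * math.exp(-hours / mtbf)


n_units, mtbf = 1000, 100  # values from the light bulb example

for hours in (1, 50, 69.3, 100, 200, 300):
    count = survivors(n_units, hours, mtbf)
    print(f"{hours:>6} h: about {count:6.1f} still working")
```

Half of the bulbs are gone at roughly 69 hours, well before the 100 hour MTBF, which is the practical meaning of the mean not being the 50th percentile.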
T – stands for Time
Hours, cycles, years, pages and many more ways of counting some form of use are common. Recall that the MTBF is the inverse of the failure rate. The failure rate units are the number of failures per unit time. Inverting this gives us units of time (hours, cycles, years, …) per failure.
I am not sure why (tend to think it was a marketing decision) someone decided to invert the negative connotation of ‘failures/hour’ into the positive sounding ‘hours/failure’.
Therein lies another issue with MTBF. The units of MTBF, often hours, are often confused with clock or calendar time. It really is a confusing unit of measure for conveying the probability of failure. Instead of stating that a light bulb has a 0.01 chance of failure per hour of operation, we avoid our dislike for numbers between 0 and 1 (recall probability and stats classes!) by inverting the failure rate. Now it reads 100 hours MTBF. Sounds much better.
B – stands for Between (or Before?)
Either way, between or before, when linked with the rest of the acronym it conveys a failure free period. It would have been better to state MTF, Mean Time of Failures. While that suggestion isn’t really that good, the point is that the idea of a failure free period is not part of the definition.
I heard one design team manager explain MTBF as the time to expect from one failure to the next. The time between failures. So, once a failure occurs, we have the MTBF hours before we would expect the next failure.
MTTF, the closely associated metric, uses To instead of Between, and creates the same confusion. Whether it is To, Before or Between, about two thirds of the light bulbs will have failed by the 100 hour mark.
When the MTBF value is very large, say 1 million hours, it may seem like a failure free period is occurring. Not really; it is just that the probability of failure is very small, a 1 in a million chance of failure per hour. Running a test of 10 light bulbs for 1000 hours with an actual 1 million hour MTBF would most likely result in ZERO failures. The expected number of failures is 10 x 1000 / 1,000,000 = 0.01, so there is only about a 1% chance of seeing any failure in a given test, and it may take an average of about 100 runs of the test for a single bulb to fail.
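A quick check of the arithmetic for such a test, assuming a constant failure rate and no replacement of failed units, might look like the sketch below. The function names are made up for this illustration.

```python
import math


def expected_failures(n_units, hours, mtbf):
    """Expected number of failed units, each failing with rate 1/mtbf."""
    return n_units * (1.0 - math.exp(-hours / mtbf))


def prob_any_failure(n_units, hours, mtbf):
    """Chance that at least one of the units fails during the test."""
    return 1.0 - math.exp(-n_units * hours / mtbf)


n_units, hours, mtbf = 10, 1000, 1_000_000  # the test described above

exp_fail = expected_failures(n_units, hours, mtbf)
p_any = prob_any_failure(n_units, hours, mtbf)

print(f"expected failures per test: {exp_fail:.4f}")
print(f"chance of any failure in one test: {p_any:.2%}")
print(f"average number of tests until one shows a failure: {1 / p_any:.0f}")
```

With about a 1% chance of any failure per test, a run of failure-free tests says very little about whether a failure free period exists.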
F – stands for Failure
Who defines this in your organization? Do your customers return products that are then classified as No Trouble Found? In a classical sense a product failure is when the product does not meet stated performance specifications. Yet, customers will return products that fail to meet their expectations and it still creates warranty expenses.
Many forms of product testing apply only one form of stress, which promotes only a subset of all possible failure mechanisms. Basing an MTBF calculation on a single stress test, while possibly accurate enough, often misses important life-cycle conditions, stresses and failure mechanisms.
The simple issue here is the internal definition of failure, which may well be different than your customer’s definition. The internal definition is generally limited to product specifications. Be clear and concise, plus open to new definitions of failure.
History of Use
Karl Pearson first mentioned the ‘negative exponential distribution’ in 1895. The exponential distribution has a number of interesting properties, one of which took advantage of the tools available in the 1950s and 60s.
Specifically, failure rates can be added (the inverse of MTBF is the failure rate). Adding was rather easy at the time using mechanical and later electric adding machines. Using a slide rule and tables for the exponents was cumbersome with possibly 100s or 1,000s of calculations.
In 1961, the first issue of MIL-HDBK-217 detailed how to perform parts count predictions. The method relied on the ability to add failure rates. Work continues to this day to update and revise the methodology. These efforts may take us out of the era of mechanical adders, as today doing complex calculations is as easy as turning the crank.
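The idea of adding constant failure rates can be sketched as follows. This is a simplified illustration of the additive principle only, not the handbook's actual procedure, and the part names, rates and quantities are invented for the example.

```python
# Hypothetical component data: (failures per million hours each, quantity).
# These numbers are invented for illustration, not taken from any handbook.
parts = {
    "resistor": (0.002, 40),
    "capacitor": (0.010, 25),
    "connector": (0.050, 4),
    "fan": (2.000, 1),
}


def system_failure_rate(parts):
    """Sum of rate x quantity, assuming a series system of constant rates."""
    return sum(rate * qty for rate, qty in parts.values())


total = system_failure_rate(parts)  # failures per million hours
print(f"system failure rate: {total:.3f} failures per million hours")
print(f"system MTBF: {1e6 / total:,.0f} hours")
```

The sum is only meaningful if every part truly has a constant failure rate and any single part failure fails the system, which is the assumption questioned in the rest of this section.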
Today we have models and distributions for the complex array of failure mechanisms and should take advantage of this knowledge. Limiting the combination of failure rate information to a constant for each component distorts and misleads those attempting to make decisions based on the prediction or data analysis.
Examples of Misuse of MTBF
So, while it is a convenient assumption to say the component, product or system has a constant failure rate, this is often not true. And, this assumption does lead to very poor understanding, modeling and decisions related to real products.
The obvious misuse stems from the various ways individuals misunderstand MTBF. For example, if an electrical engineer believes MTBF to be a failure free period, the resulting selection of components will have a significantly less desirable field failure rate.
Another simple issue is the advertising of product or component reliability by simply stating an MTBF value. Without stating the conditions, environment, usage period, and other reliability related bits of information, the reader is left to wonder what the MTBF really means. For a component that has an increasing failure rate over time, like a cooling fan that experiences bearing wear out, the MTBF is a valid approximation of the fan failure rate only over some specific period of time. The fan datasheet often does not state the expected duration over which the constant failure rate applies. If the vendor is designing and evaluating fan life for an expected one year of use, then the life data over that year may actually look exponential. If the application for which the electrical engineer is considering the cooling fan is to operate for 10 years, the engineer will be surprised when product qualification or field performance shows higher than expected fan failures.
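The cooling fan situation can be illustrated with a small sketch that compares a wear-out (Weibull) life model with the constant failure rate implied by a quoted MTBF. The Weibull shape and scale values are hypothetical, chosen only to show the effect, and the quoted MTBF is assumed to be the one that matches the first year of life.

```python
import math


def weibull_fraction_failed(t, shape, scale):
    """Fraction failed by time t for a Weibull life distribution."""
    return 1.0 - math.exp(-((t / scale) ** shape))


def exponential_fraction_failed(t, mtbf):
    """Fraction failed by time t for a constant failure rate of 1/mtbf."""
    return 1.0 - math.exp(-t / mtbf)


# Hypothetical wear-out behavior for a fan (illustrative values only).
shape, scale = 3.0, 50_000.0  # shape > 1 means an increasing failure rate
one_year, ten_years = 8_760.0, 87_600.0  # hours of continuous operation

# Assumed vendor MTBF: the constant rate that reproduces the Weibull
# fraction failed at exactly one year of operation.
mtbf = one_year / (one_year / scale) ** shape
print(f"quoted MTBF: {mtbf:,.0f} hours")

for label, t in (("1 year", one_year), ("10 years", ten_years)):
    w = weibull_fraction_failed(t, shape, scale)
    e = exponential_fraction_failed(t, mtbf)
    print(f"{label:>8}: wear-out model {w:.2%} failed, MTBF-based estimate {e:.2%}")
```

Under these assumed values the two models agree at one year by construction, yet the MTBF-based estimate badly understates the failures at ten years, which is the surprise described above.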