Popular Reliability Measures and Their Problems
Mean time between failures (MTBF), sometimes expanded as mean time before failure, is very common. The usual definition describes MTBF as a reliability measure calculated by tallying operating hours and dividing by the number of failures. Intuitively this is the average time until a failure occurs. Mathematically it is the inverse of the failure rate. It is generally used for repairable systems.
Readers of NoMTBF know MTBF is commonly misunderstood as a failure-free period, the average of a normal distribution, and so on. Without knowing more about the dispersion of the time-to-failure data, users of MTBF assume an exponential distribution or homogeneous Poisson process (constant failure rate behavior), which is rarely true in my experience.
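To make the arithmetic concrete, here is a minimal Python sketch of the tally described above, and of what an MTBF implies if one does assume a constant failure rate. The fleet numbers and function names are hypothetical, made up for illustration only.

```python
import math

def mtbf(total_operating_hours, failures):
    """Tally operating hours and divide by the number of failures."""
    if failures <= 0:
        raise ValueError("need at least one failure to compute MTBF")
    return total_operating_hours / failures

def exponential_reliability(t, mtbf_hours):
    """Probability of surviving to time t, ASSUMING a constant failure rate."""
    return math.exp(-t / mtbf_hours)

# Hypothetical fleet: 100 units run 500 hours each, with 5 failures observed.
fleet_mtbf = mtbf(100 * 500, 5)
r_at_mtbf = exponential_reliability(fleet_mtbf, fleet_mtbf)

print(f"MTBF = {fleet_mtbf:.0f} hours")
print(f"Reliability at t = MTBF (exponential assumption): {r_at_mtbf:.3f}")
```

Under the exponential assumption only about 37% of units survive to the MTBF, which is why reading MTBF as a failure-free period is so misleading.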
MTBF stated alone provides at best a crude, misleading, and confusing statement about reliability. Without a duration and underlying distribution information I maintain it is less than useful for any application. Even with sufficient additional information, there are better reliability summary metrics available.
Mean time to failure (MTTF), like MTBF, is a common measure. It is generally used for non-repairable items or when the focus is on first failures only. MTTF is a reliability metric calculated by tallying the operating hours and dividing by the number of failures. Because the units are not repaired and placed back in service, the measure is a little different than MTBF.
As with MTBF, all the same problems surround the MTTF measure. I do not recommend using MTBF (obviously), MTTF, or any of the many variations (like MTBUR).
Annualized failure rate (AFR) is the average failure rate over a year. One calculation gathers the number of failures over a year and divides by the number of units that could have failed. Another approach is to collect failure data over a shorter or longer time period and adjust the results to appear as the average failure rate over a year. For example, collect data for a month, then multiply the fraction of units that failed by 12.
The basic problem is that the resulting failure rate is an average, and without the underlying dispersion of the data it most likely obscures when to expect failures. If failures do not arrive truly at random, with an equal chance of occurring in each time unit, then AFR provides only a crude, inaccurate measure.
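A small sketch may help show how much an AFR can hide. The two monthly failure patterns below are invented for illustration, as are the function names. The sketch also keeps the number of units at risk fixed, which is a simplification.

```python
def afr_from_year(failures_in_year, units_at_risk):
    """Failures observed over a full year divided by units that could have failed."""
    return failures_in_year / units_at_risk

def afr_from_period(failures, units_at_risk, period_months):
    """Scale the fraction failed in a shorter or longer period to 12 months.

    ASSUMES failures arrive at the same rate in every month.
    """
    return (failures / units_at_risk) * (12 / period_months)

UNITS = 1000  # hypothetical population, held constant for simplicity

# Two hypothetical patterns of failures per month, same yearly total.
early_heavy = [20, 10, 5, 3, 2, 2, 1, 1, 1, 1, 1, 1]
late_heavy = list(reversed(early_heavy))

afr_early = afr_from_year(sum(early_heavy), UNITS)
afr_late = afr_from_year(sum(late_heavy), UNITS)
afr_from_first_month = afr_from_period(early_heavy[0], UNITS, 1)

print(f"AFR, early-failure pattern: {afr_early:.1%}")
print(f"AFR, late-failure pattern:  {afr_late:.1%}")
print(f"AFR extrapolated from month 1 of the early pattern: {afr_from_first_month:.1%}")
```

Both patterns report the same 4.8% AFR even though one is dominated by early failures and the other by late ones, and extrapolating from the first month alone gives 24%. The single number says nothing about when the failures occur.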
Jon G. Elerath wrote and presented the paper, “AFR: problems of definition, calculation and measurement in a commercial environment,” at the Reliability and Maintainability Symposium, 2000. I agree with Jon: there is only a slight chance this measure will provide a useful summary of your reliability performance.
When asked about the organization’s reliability metrics, some said, “Warranty”. With a little exploration, they meant either the cost of warranty claims per month (or other time period) or the duration of the offered warranty. The first provides the financial impact of field failures. The second implies a relationship between product performance and warranty durations set by policy and marketing, which in general doesn’t exist.
A better warranty-related metric is the cost of warranty per unit shipped. This provides a means to relate the field failure rate, the cost to the organization (although only a fraction of the total cost), and the number of shipments. It converts the failure rate and warranty expense into the same units as the bill of material item costs. This allows us to compare the component cost of a part with the warranty cost of a failure for a unit. We can then make decisions on purchasing more expensive, more reliable components and calculate the expected savings on warranty expenses.
One issue is that warranty-type measures rely on failure rates and average warranty costs per failure. Both are averages, and averages tend to smooth out and obscure the very information we need to make informed decisions.
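The comparison described above can be sketched in a few lines. Every figure here (failure fractions, cost per failure, component price premium) is a hypothetical placeholder, not data from any real product.

```python
def warranty_cost_per_unit(failure_fraction, avg_cost_per_failure):
    """Expected warranty cost per unit shipped.

    failure_fraction: fraction of shipped units that fail within warranty.
    avg_cost_per_failure: average cost to the organization of one claim.
    """
    return failure_fraction * avg_cost_per_failure

# Hypothetical: the current part leads to 5% warranty failures; a more
# reliable part costing $1.50 more per unit is expected to cut that to 2%.
COST_PER_FAILURE = 120.00
COMPONENT_PREMIUM = 1.50

current = warranty_cost_per_unit(0.05, COST_PER_FAILURE)
upgraded = warranty_cost_per_unit(0.02, COST_PER_FAILURE)
net_saving = (current - upgraded) - COMPONENT_PREMIUM

print(f"Warranty cost per unit, current part:  ${current:.2f}")
print(f"Warranty cost per unit, upgraded part: ${upgraded:.2f}")
print(f"Net saving per unit after the price premium: ${net_saving:.2f}")
```

Because the warranty cost is expressed per unit shipped, it sits directly beside the bill of material cost and the trade-off is a simple subtraction.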
Grab one of the component data sheets and look for the element that describes the reliability claim. On some data sheets, more than I think makes sense, you may find something like ‘2,000 hour Life’ or ‘5 year Life’. Now what does that mean? (No pun intended.)
The evidence behind the claim, whether calculations, testing, experiments, or field data, can range from a simple guess to an overly simplified average. For example, incandescent light bulbs may claim a 2,000 hour life. William Meeker, author and professor, tested this claim with his students. They found the bulbs did have an average operating life of 2,000 hours and the time to failure distribution was normal, which means roughly half the bulbs failed before 2,000 hours.
Did the 5 year life claim imply that the unit would last failure free for 5 years, did it mean an average (unknown failure distribution) of 5 years, or was it a 5 year MTBF value? The sales folks liked the 5 years with little or no chance of failure. The evidence supporting the claim was a poorly done Mil Hdbk 217 based parts count prediction, with components not found in the handbook excluded from the analysis.
When you see the ‘life’ metric you really should be asking a lot of questions to find out what is meant.
The American Bearing Manufacturers Association prefers L10 life (also called B10), which is the number of hours in service that 90% of bearings survive. It is also used in toxicology studies as the time till 10 of the 100 fish have died. The L10 provides the 10th percentile of the unknown time to failure distribution. It provides either a single experimental result or a tabulated average.
L1, or the first percentile, is similar and has the same issues. Neither provides information on the dispersion of the time to failure data. Thus L10, like MTBF, AFR, and other average-based metrics, tends to obscure the information we need to make rational decisions.
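To illustrate the dispersion point, the sketch below builds two Weibull distributions that share the same L10 but have different shape parameters. The L10 value and the two shape values are assumptions chosen for illustration, and the helper names are hypothetical.

```python
import math

def weibull_eta_from_l10(l10, beta):
    """Scale parameter (eta) that puts the 10th percentile at l10."""
    return l10 / (-math.log(0.90)) ** (1.0 / beta)

def weibull_percentile(p, beta, eta):
    """Time by which a fraction p of units has failed."""
    return eta * (-math.log(1.0 - p)) ** (1.0 / beta)

L10_HOURS = 1000.0  # hypothetical claim shared by both populations

results = {}
for beta in (0.8, 3.0):  # assumed shapes: early-failure vs wear-out behavior
    eta = weibull_eta_from_l10(L10_HOURS, beta)
    results[beta] = {
        p: weibull_percentile(p, beta, eta) for p in (0.01, 0.10, 0.50)
    }
    row = results[beta]
    print(
        f"beta={beta}: L1={row[0.01]:.0f} h, "
        f"L10={row[0.10]:.0f} h, L50={row[0.50]:.0f} h"
    )
```

Both populations report an L10 of 1,000 hours, yet the first percent of failures and the median life differ widely between them. The L10 alone cannot tell the two apart.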
I should mention that a plot of the L10 or most any metric on a monthly basis is commonly done to ‘spot’ trends. Using more of a bad metric doesn’t help make it useful.
Why not just measure reliability directly? Assuming we understand or can find the documentation for the functions and environment, reliability includes the probability of successful operation over a duration. So, 98% of the inkjet printers in a home office will survive 2 years. That would work as a goal. We can then either conduct experiments or track field performance and (maybe using a Weibull distribution with the data) compare the results to the goal.
Of course, we can set goals for other durations, say 99.9% reliable over the first month of operation. Or, we can set and monitor reliability over the same duration as the warranty period. Reliability is just the probability that a unit will survive the stated duration, or equivalently the percentage of units that survive the duration.
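As a rough sketch of comparing results to goals, the code below evaluates a Weibull reliability function at the durations mentioned above. The shape and scale values stand in for parameters one might estimate from test or field data. They are hypothetical, as are the names used.

```python
import math

def weibull_reliability(t, beta, eta):
    """Probability that a unit survives to time t under a Weibull model."""
    return math.exp(-((t / eta) ** beta))

# Hypothetical fitted parameters, with time measured in years.
BETA = 1.5
ETA_YEARS = 30.0

# Goals from the text: 98% over 2 years, 99.9% over the first month.
goals = {
    "2 years": (2.0, 0.98),
    "first month": (1.0 / 12.0, 0.999),
}

r_2yr = weibull_reliability(2.0, BETA, ETA_YEARS)
outcome = {}
for name, (duration, target) in goals.items():
    r = weibull_reliability(duration, BETA, ETA_YEARS)
    outcome[name] = r >= target
    status = "meets" if outcome[name] else "misses"
    print(f"{name}: R = {r:.4f}, goal = {target:.3f} ({status} goal)")
```

The same function answers the question for any duration of interest, including the warranty period, which a single average cannot do.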
For repairable systems, like a car or plant machinery, we are interested in availability which combines reliability and time to repair.
Get beyond averages and use the probability and statistics you probably know you should know.
What metrics does your organization use and why? If you use MTBF because everyone else does, please take a look at how much money your organization could save by using reliability instead. Seriously, leave a comment and just state the metric you use.
4 thoughts on “Popular Reliability Measures and Their Problems”
Well said, Fred. Thanks for mentioning life metrics. L10, if I recall my history, came from studies of bearing failures, which are often (and sometimes successfully) modeled using the Weibull distribution. Engineers made the observation that once 10% of the bearings had failed, then the rest of them failed “rapidly,” and that knowing L50 (the median life) didn’t make much practical difference when it came to ordering spares and keeping equipment up and running. My recollection of the history is that engineering experience indicated that in this specific situation, it was possible to develop a useful rule of thumb. Unless you know that L10 is a good rule of thumb because there’s other evidence, then it’s just guesswork. And as you point out, doing the wrong thing with greater precision and efficiency is still doing the wrong thing.
Thanks for the comment and kind words, Paul. If my memory serves, didn’t W. Weibull’s study of bearings lead to the Weibull distribution? Interesting that it led to the use of L10 instead of the distribution, which would actually be more useful to describe the reliability performance of bearings over time. Thanks for the insight and background on L10.
Thanks for a tidy explanation Fred, on what happens when people stop at the surface of data.
We used MTBF for some components on entry into service due to their fresh design, however use of MTBF was specified as a guide only.
In the future we will graduate to proper reliability modelling when we have the data, which raises another important issue: RECORD EVERYTHING!
totally agree Mick, thanks for the comment. cheers, Fred