When Your Supplier Converts Reliability to MTBF
Oh, the trouble that will occur. The mistakes, mishaps and errors and most certainly the inability of the supplier to provide a reliability solution.
If you provide the supplier with a straightforward and complete reliability goal, and they convert it to a single number as an MTBF value, what really could go wrong? Also, why would the supplier degrade the requirement to an MTBF value?
Let’s say you have a piece of equipment that you want to have a supplier design and build for you. It is a complex piece of equipment, yet very little if any of the product is expected to be repairable. Let’s say it’s an electronics box that provides communication capabilities.
A complete reliability requirement may be summarized as:
Com box xyz shall provide communication capabilities (reference spec a for protocol, range, etc.) located in unmanned outdoor and unsheltered environments worldwide (see spec b for weather and use profile details), and do so with 98% reliability over 20 years.
You might add other couplets of probability of success, reliability, and durations as needed. Yet, basically we want this box to work for 20 years with a relatively low chance of a unit failing.
This requirement is clear, measurable, and sufficient for any reliability-related requirement.
The easiest, and totally incorrect, way to convert the reliability objective of 98% reliable over 20 years is to ignore the probability part and use the 20 years as the MTBF goal. Under the exponential distribution only about 36.8% of units survive to the MTBF, so instead of 98% reliable the new target is 36.8% or so. A much easier target to achieve, and still 20 years.
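As a quick check of the 36.8% figure, here is a minimal Python sketch. It assumes the exponential distribution (a constant failure rate), and the function name is only illustrative.

```python
import math

def exponential_reliability(t, mtbf):
    """Fraction of units expected to survive to time t,
    assuming a constant failure rate (exponential distribution)."""
    return math.exp(-t / mtbf)

# Misreading "98% reliable over 20 years" as "MTBF = 20 years":
r_at_mtbf = exponential_reliability(t=20.0, mtbf=20.0)
print(f"Reliability at 20 years with a 20-year MTBF: {r_at_mtbf:.1%}")
```

Whenever the time equals the MTBF, the result is exp(-1), whatever the MTBF value happens to be.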
The second is to do a little math with the exponential cumulative distribution function (of course we’ll assume the exponential function applies as it’s so easy to work with and do calculations, predictions, and test planning). We set time to 20 years, and F(20 years) is given as 0.02 (2% of units fail over 20 years, or 98% of units do not fail), then solve for theta (or MTBF). Theta works out to roughly 990 years, or about 8.7 million hours, so setting MTBF to about 10,000,000 hours is about right.
Sounds impressive, too, 10 million hours MTBF.
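The arithmetic behind that number can be sketched in a few lines of Python. This assumes the exponential distribution, and also assumes 8,760 hours per year of continuous operation. The function and constant names are illustrative only.

```python
import math

HOURS_PER_YEAR = 8760  # assumption: 365-day year, unit runs continuously

def required_mtbf(reliability, duration):
    """Solve R(t) = exp(-t / theta) for theta, given R and t.
    The result is in the same time units as duration."""
    return -duration / math.log(reliability)

mtbf_years = required_mtbf(reliability=0.98, duration=20.0)
mtbf_hours = mtbf_years * HOURS_PER_YEAR
print(f"Required MTBF: about {mtbf_years:,.0f} years "
      f"or {mtbf_hours:,.0f} hours")
```

The result is an MTBF of close to a thousand years for a box that only has to last 20, which is a hint at how poorly the number describes what is actually wanted.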
Why Do the Conversion?
Why would anyone convert a reliability of 98% at 20 years into an MTBF value? Ignorance mostly. Here’s what I’ve heard over the years:
- We want to have a single number to represent reliability.
- MTBF is reliability (certainly not the definition of reliability itself?)
- We always assume the exponential.
- MTBF is so much easier to work with.
- We only understand MTBF (really? Let’s test that…. Evil grin)
- Our MTBF value is the same as your goal.
What have you heard? Do any of these make any business or common sense?
Mostly the conversion is done to simplify any calculations concerning reliability. This was standard practice in the 1960s, when calculations relied on manual methods, slide rules, log tables, and mechanical adders. We do not have those limitations today, so what is the need for simplification?
If your vendor does a similar conversion, ask them why, and why again, until they see the folly in the effort to use MTBF. Remember, if you specify 20 years with 98% reliability, that is what you want. Using MTBF is likely to befuddle the supplier team enough that they willingly or mistakenly will not achieve your specification.
What Should You Do Now?
Find a new supplier – while not always possible, the risk to your program has gone up beyond the cost of awarding the work to a new supplier. Seriously, this supplier really isn’t worth your organization’s time.
Double-check and challenge the use of MTBF on every element of the program. Very few components or elements of a system actually exhibit a constant failure rate, that is, exponentially distributed time-to-failure behavior. Make them show, as they should be able to do anyway, the data supporting the assumed validity of MTBF.
Remind them you are interested in the 2nd percentile of failures being at or above 20 years, not anything to do with the point in time when any other number of failures occurs (if exponential, the MTBF is at about the 63rd percentile of failures; if another distribution, the percentile will vary dramatically and likely not be near 2%).
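To illustrate how the percentile at the mean life shifts with the distribution, here is a small Python sketch. It uses the Weibull distribution purely as an example of an alternative, and the shape values (beta) are hypothetical, not taken from any real product.

```python
import math

def fraction_failed_at_mean(beta):
    """Weibull CDF evaluated at the distribution's own mean life.
    The mean is eta * gamma(1 + 1/beta), so the scale eta cancels
    and only the shape parameter beta matters."""
    mean_over_eta = math.gamma(1.0 + 1.0 / beta)
    return 1.0 - math.exp(-(mean_over_eta ** beta))

# beta = 1.0 is the exponential case; the others are hypothetical shapes.
for beta in (0.5, 1.0, 2.0, 3.0):
    print(f"beta = {beta}: {fraction_failed_at_mean(beta):.1%} "
          "of units have failed by the mean life")
```

For none of these shapes does the mean life land anywhere near the 2% point that the requirement actually cares about.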
Refuse any results based on erroneous assumptions.
Double-check all testing sample size calculations. If they assume the exponential distribution, they can stack sample run times to represent actual time. For example, running two units for 100 hours each would represent 200 hours of run time for one unit. That is only true if the failure mechanisms are actually described by the exponential distribution, which is almost never the case.
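The stacking problem can be shown with a short Python sketch. It compares the chance of seeing no failures when running two units for 100 hours each against one unit for 200 hours, using a Weibull model. The characteristic life (ETA) and the shape values are hypothetical, chosen only for illustration.

```python
import math

def weibull_reliability(t, eta, beta):
    """Probability that a unit survives to time t (Weibull model)."""
    return math.exp(-((t / eta) ** beta))

def prob_no_failures(units, hours_each, eta, beta):
    """Chance that every one of `units` independent units survives
    `hours_each` hours of testing."""
    return weibull_reliability(hours_each, eta, beta) ** units

ETA = 1000.0  # hypothetical characteristic life, in hours

# beta = 1.0 is the exponential case; 0.5 and 3.0 are hypothetical
# early-life and wear-out shapes.
for beta in (0.5, 1.0, 3.0):
    two_short = prob_no_failures(2, 100.0, ETA, beta)
    one_long = prob_no_failures(1, 200.0, ETA, beta)
    print(f"beta = {beta}: 2 units x 100 h -> {two_short:.3f}, "
          f"1 unit x 200 h -> {one_long:.3f}")
```

The two test plans agree only when beta is 1. With a wear-out shape, two short runs look better than one long run because neither unit is run long enough to approach wear-out. With an early-life shape the comparison reverses.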
Reliability growth, prediction, and many other reliability calculations are simple only because they rely on the assumed exponential distribution – expunge the use of the assumption.
And, finally, if you have to work with this team, be prepared to teach, educate and encourage them to actually understand the risks, model the failure mechanisms, evaluate tests that actually have failures, do exploration of designs and prototypes for failures, and actually abandon the shackles of MTBF.