Have you ever wondered why we use the assumption of a constant failure rate? Or considered why we assume our system is ‘in the flat part of the [bathtub] curve’?
Where did this silliness first arise?
In part, I lay blame on Mil Hdbk 217 and parts count prediction practices. Yet there is theoretical support for the notion that, for large, complex systems, the overall system time to failure will approach an exponential distribution.
Thanks go to Wally Tubell Jr., a professor of systems engineering and test. He recently sent me his analysis of Drenick’s theorem and its connection to the notion of a flat section of a bathtub curve.
Using MTTF or MTBF carries a simplifying assumption: a constant hazard rate. Some assume we’re in the useful life section of the bathtub curve. Others do not understand what assumptions they are making.
Using MTTF or MTBF has many problems and, as regular readers here know, we should avoid using these metrics.
By using MTTF or MTBF we also lose information. We are unable to measure or track the rate of change of our equipment or system’s failure rates (hazard rate). The simple average is just an average and does not contain the essential information we need to make decisions.
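As a hypothetical illustration of the information we lose: two populations can share exactly the same MTBF yet behave completely differently in the field. The numbers below are invented for the example.

```python
import math

def weibull_reliability(t, eta, beta):
    """Probability of surviving to time t for a Weibull distribution."""
    return math.exp(-((t / eta) ** beta))

MTBF = 1000.0  # hours, identical for both populations

# Choose the scale eta so each Weibull has mean (MTBF) = eta * Gamma(1 + 1/beta)
eta_early = MTBF / math.gamma(1 + 1 / 0.5)  # beta = 0.5: early-life failures
eta_wear = MTBF / math.gamma(1 + 1 / 3.0)   # beta = 3.0: wear-out failures

print(f"R(100 h), early-life population: {weibull_reliability(100, eta_early, 0.5):.3f}")
print(f"R(100 h), wear-out population:   {weibull_reliability(100, eta_wear, 3.0):.3f}")
```

Same 1,000-hour MTBF, yet one population loses over a third of its units in the first 100 hours while the other loses almost none. The average alone cannot distinguish them.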
Creating a product or system that lasts as long as expected, or longer, is a challenge.
It’s a common challenge that reliability engineers and the entire engineering team face on a regular basis. It’s also not our only challenge.
We face and solve a myriad of technical, political, and engineering challenges. Some of our challenges are born and carried forward by our own industry. We have tools suitable for a given purpose altered to ‘fit’ another situation (inappropriately and creating misleading results). We have terms that we, and our peers, struggle to understand.
Sometimes, we, as reliability engineers have set up challenges that thwart our best efforts to make progress.
Is there any useful result from a parts count prediction?
In most cases where I’ve seen parts count predictions used, they are absolutely worthless. Worse, the folks receiving the results believe they are accurate estimates of reliability performance (or at least use the results as such).
In my opinion, the range of parts count prediction methods and databases harm the field of reliability engineering.
In today’s complex product environment, with products becoming more and more electronic, do the designers and manufacturers really understand what reliability is?
It is NOT simply following standards to test in RD or focusing only on design robustness: there is too much risk in prediction confidence, which only deals with the ‘intrinsic’ failure period and rarely has sufficient test strength to stimulate failures.
I recently heard from a reader of NoMTBF. She wondered about a supplier’s argument that they meet the reliability or MTBF requirements. She was right to wonder.
Estimating the reliability performance of a new design is difficult.
There are good and better practices to justify claims about future reliability performance. Likewise, there are just plain poor approaches, too. Plus, there are approaches that should never be used.
The Vendor Calculation to Support Claim They Meet Reliability Objective
Let’s say we contract with a vendor to create a navigation system for our vehicle. The specification includes functional requirements, form factor, and a long list of other requirements. It also clearly states the reliability specification. Let’s say the unit should last, with 95% probability, over 10 years of use within our vehicle. We provide environmental and functional requirements in detail.
The vendor first converts the 95% probability of success over 10 years into MTBF, claiming they are ‘more familiar’ with MTBF. They ignore the requirement for probability of success over the first month of operation. Likewise, they ignore the 5-year reliability target or, as they would convert it, MTBF requirement.
[Note: if you were tempted to calculate the equivalent MTBF, please don’t. It’s not useful, nor relevant, and a poor practice. Suffice it to say it would be a large and meaningless number]
RED FLAG By converting the requirement into MTBF it suggests they may be making simplifying assumptions. This may permit easier use of estimation, modeling, and testing approaches.
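To see why the converted number would be large and meaningless, here is the arithmetic under the constant failure rate assumption (the very conversion the note above warns against):

```python
import math

reliability, years = 0.95, 10.0

# Under the exponential assumption R(t) = exp(-t / MTBF), so:
mtbf = -years / math.log(reliability)
print(f"Equivalent MTBF: {mtbf:.0f} years")  # ~195 years

# The big number adds nothing: it merely restates 95% over 10 years,
# and it silently discards the first-month and 5-year requirements.
print(f"R(10 yr) back from MTBF: {math.exp(-years / mtbf):.2f}")
```

A 195-year MTBF sounds impressive to a customer, yet it carries exactly the same (and only the same) information as the original requirement, minus the other duration targets.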
The Vendor’s Approach to ‘Prove’ They Meet the MTBF Requirement
The vendor reported they met the reliability requirement using the following logic:
Of the 1,000 (more, actually) components, we selected 6 at random for accelerated life testing. We estimated the lower 60% confidence bound on the probability of surviving 10 years given the ALT results, then converted the ALT results to an MTBF for each part.
We then added the Mil Hdbk 217 failure rate estimate to the ALT result for each of the 6 parts.
RED FLAG This one has me wondering about the rationale for adding the failure rates from an ALT and a parts count prediction. It would make the failure rate higher. Maybe it was a means to add a bit of margin to cover the uncertainty? I’m not sure; do you have any idea why someone would do this? Are they assuming the ALT did not actually measure anything relevant or any specific failure mechanisms, or did they use a benign stress? ALT details were not provided.
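For reference, a standard way to put a lower confidence bound on MTBF from time-terminated test data, under the same exponential assumption the vendor is already making, uses the chi-squared distribution. This is a stdlib-only sketch (a bisection stands in for scipy’s `chi2.ppf`); the vendor’s actual ALT math was not provided, and the unit counts and hours below are invented.

```python
import math

def chi2_ppf_even_df(q, df):
    """Chi-squared quantile for even df, by bisecting the closed-form
    CDF: P(X <= x) = 1 - exp(-x/2) * sum_{j<df/2} (x/2)^j / j!."""
    k = df // 2
    def cdf(x):
        return 1.0 - math.exp(-x / 2) * sum((x / 2) ** j / math.factorial(j) for j in range(k))
    lo, hi = 0.0, 10.0 * df + 100.0
    for _ in range(200):
        mid = (lo + hi) / 2
        lo, hi = (mid, hi) if cdf(mid) < q else (lo, mid)
    return (lo + hi) / 2

def mtbf_lower_bound(total_unit_hours, failures, confidence):
    """One-sided lower bound on MTBF from a time-terminated test,
    assuming an exponential (constant failure rate) model."""
    return 2 * total_unit_hours / chi2_ppf_even_df(confidence, 2 * failures + 2)

# e.g. 6 units run 2,000 h each with 1 failure, at 60% confidence
print(f"{mtbf_lower_bound(12_000, 1, 0.60):,.0f} h lower bound on MTBF")
```

Note that even a correct bound like this only says something about a constant rate; it says nothing about the first-month or 10-year behavior unless the exponential assumption actually holds.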
The Approach Gets Weird Here
Then the vendor used a 217 parts count prediction, along with the modified failure rates for the 6 components, to estimate the system failure rate, and with a simple inversion estimated the MTBF. They then claimed the system design will meet the field reliability performance requirements.
Hence, a reliability prediction should never be assumed to represent the expected field reliability …
If you are going to use a standard, any standard, one should read it. Read to understand when and why it is useful or not useful.
What Should the Vendor Have Done Instead?
There are a lot of ways to create a new design and meet reliability requirements.
The build, test, fix approach or reliability growth approach works well in many circumstances.
Using failure data from similar, actually fielded systems. It may provide a reasonable bound for an estimate of a new system. It may also limit the focus of accelerated testing to only the novel, new, or high-risk areas of the new design, given much of the design is (or may be) similar to past products.
Using a simple reliability block diagram or fault tree analysis model to assemble the estimates, test results, and engineering stress/strength analyses (all better estimation tools than parts count, in my opinion) and calculate a system reliability estimate.
Using a risk of failure approach with FMEA and HALT to identify the likely failure mechanisms then characterize those mechanisms to determine their time to failure distributions. If there is one or a few dominant failure mechanisms, that work would provide a reasonable estimate of the system reliability.
In all cases, focus on failure mechanisms and how the time to failure distribution changes given changes in stress, environment, and use conditions. Monte Carlo may provide a suitable means to analyze a great mixture of data to determine an estimate. Use reliability: the probability of success over a duration.
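The Monte Carlo idea can be sketched as follows. Assume, purely for illustration, that characterization found two dominant mechanisms with known time-to-failure distributions; the system fails at whichever mechanism occurs first (a series model):

```python
import math
import random

random.seed(1)  # reproducible sketch
N, YEARS = 100_000, 10.0

def system_life():
    # Two hypothetical dominant mechanisms (parameters invented):
    # wear-out as a Weibull (eta = 25 yr, beta = 3) and a defect-driven
    # mechanism as a lognormal (median 40 yr, sigma = 1).
    wear_out = random.weibullvariate(25.0, 3.0)
    defect = random.lognormvariate(math.log(40.0), 1.0)
    return min(wear_out, defect)

reliability = sum(system_life() > YEARS for _ in range(N)) / N
print(f"Estimated R({YEARS:.0f} yr) = {reliability:.3f}")
```

Here the analytic answer is just the product of the two survival probabilities, but the simulation approach scales to mixtures of test data, field data, and stress-dependent distributions that have no closed form.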
In short, do the work to understand the design, its weaknesses, and the time to failure behavior under different use/condition scenarios, and make justifiable assumptions only when necessary.
We engage vendors to supply custom subsystems given their expertise and ability to deliver the units we need for our vehicle. We expect them to justify that they meet reliability requirements in a rational and defensible manner. While we do not want to dictate the approach to the design or the estimate of reliability performance, we certainly have to judge the acceptability of their claims that they meet the requirements.
What do you report when a customer asks if your product will meet the reliability requirements? Add to the list of possible approaches in the comments section below.
If failure rates really were constant, the exponential distribution would be the only time to failure distribution we’d need. We wouldn’t need Weibull or other complex multi-parameter models. Knowing the failure rate for one hour would be all we would need to know, over any time frame.
Sample size and test planning would be simpler. Just run the samples at hand long enough to accumulate enough hours to provide a reasonable estimate of the failure rate.
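Under the constant failure rate assumption, a zero-failure demonstration test shows just how simple the planning becomes; the MTBF target and confidence below are arbitrary examples:

```python
import math

def zero_failure_test_hours(mtbf_target, confidence):
    """Total unit-hours needed to demonstrate mtbf_target at the given
    one-sided confidence, assuming zero failures and an exponential
    (constant failure rate) model."""
    return mtbf_target * math.log(1 / (1 - confidence))

# e.g. demonstrate a 50,000 h MTBF at 90% confidence
hours = zero_failure_test_hours(50_000, 0.90)
print(f"{hours:,.0f} unit-hours")  # ~115,129 unit-hours
```

Notice that only the total matters: 12 units for 9,600 hours each or 1,000 units for 115 hours each would count the same, which is exactly the (dubious) convenience the exponential assumption buys.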
Would the Design Process Change?
Yes, I suppose it would. The effects of early life and wear out would not exist. Once a product is placed into service, the chance of failing in the first hour would be the same as in any hour of its operation. It would fail eventually, and the chance of failing before a year would depend solely on the chance of failure per hour.
A higher failure rate would mean a lower chance of surviving very long, although the product could still fail in the first hour of use. And if it had survived for one million hours, its chance of failing in the next hour would still be the same.
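That memoryless behavior is easy to demonstrate. Here is a small sketch contrasting a constant rate (beta = 1) with a hypothetical wear-out mechanism; the parameters are invented:

```python
import math

def p_fail_next_hour(age, eta, beta):
    """P(fail in the next hour | survived to `age`) for a Weibull time
    to failure; beta = 1 is the exponential (constant rate) case."""
    surv = lambda t: math.exp(-((t / eta) ** beta))
    return 1 - surv(age + 1) / surv(age)

# Constant rate: the same chance in hour one and in hour one million.
print(p_fail_next_hour(0, 10_000, 1.0))
print(p_fail_next_hour(1_000_000, 10_000, 1.0))

# Wear-out (beta = 3): the chance climbs as the unit ages.
print(p_fail_next_hour(100, 10_000, 3.0))
print(p_fail_next_hour(9_000, 10_000, 3.0))
```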
Would Warranty Make Sense?
Since, in this world, we could not create a product with a distinct, lower initial failure rate, we would focus only on the overall failure rate: the chance of failing in any given hour, the first hour being convenient and easy to test, yet still meaningful. Any single failure in a customer’s hands could occur at any time and would not, alone, suggest the failure rate has changed.
Maybe a warranty would make sense based on customer satisfaction. We could estimate the number of failures over a time period and set aside funds for warranty expenses. I suppose it would place a burden on the design team to create products with a lower failure rate per hour. Maybe warranty would still make sense.
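A back-of-envelope sketch of that warranty reserve estimate; the fleet size, failure rate, and claim cost are all invented for illustration:

```python
import math

# Hypothetical fleet and rates, purely for illustration.
fleet = 10_000
failure_rate = 1 / 50_000    # failures per hour (constant, by assumption)
warranty_hours = 8_760       # one year of continuous operation
cost_per_claim = 120.0       # assumed repair/replace cost

p_fail = 1 - math.exp(-failure_rate * warranty_hours)
expected_claims = fleet * p_fail
print(f"~{expected_claims:,.0f} claims; reserve = ${expected_claims * cost_per_claim:,.0f}")
```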
How About Maintenance?
If there were no wear-out mechanisms (this is a make-believe world), changing the oil in your car would not make any economic sense. The existing oil would have the same chance of causing an engine-seize failure as new oil. The lubricant doesn’t break down. Seals do not leak. Metal-on-metal movement doesn’t cause damaging heat or abrasion.
You may have to replace a car tire due to a nail puncture, yet an accident due to worn tire tread would not occur any more often than with new tires. We wouldn’t need to monitor tire tread or brake pad wear. Those failures wouldn’t occur.
If a motor is running now and we know the failure rate, we can calculate the chance of it running for the rest of the shift, even when the motor is as old as the building.
The concepts of reliability centered maintenance, predictive maintenance, or even preventive maintenance would not make sense. There would be no advantage to swapping a part for a new one, as the chance of failure would remain the same.
Physics of Failure and Prognostic Health Management – would they make sense?
Understanding failure mechanisms so we could reduce the chance of failure would remain important. Yet when failures provide no indication before they occur, much of the predictive power of PoF and PHM would not be relevant. We wouldn’t need sensors to monitor conditions that lead to failure, as no specific failure would show a sign or indication before it occurred. Nothing would indicate it was about to fail, as that would imply its chance of failure had changed.
No more tune-ups or inspections; we would pursue repairs when a failure occurs, not before.
A world of random failures, one where every failure occurs at a constant rate, would be quite different from our world. So, why do we so often make this assumption?
Spotted a Current Reference to Mil Hdbk 217 Recently
After a short convulsion of disbelief, I was shocked. This was a guide for a government agency advising design teams and supporting reliability professionals. It suggested using 217 to create the estimate of field reliability performance during the development phase.
Have we made no progress in 30 years?
What would you do?
Let’s say you are reviewing a purchase contract and find a request for a reliability estimate based on Mil Hdbk 217F (the latest revision, which has also been obsolete for many years). What would you do? Would you contact the author and request an update to the document? Would you pull out 217 and create the estimate? Would you estimate the reliability of the product using the best available current methods, then convert that work to an MTBF and adjust the 217 inputs to create a similar result? Or would you ignore the 217 requirement and provide a reliability case instead?
Requirements are the requirements
When a customer demands a parts count prediction as a condition of the purchase, is that useful for either the development team or the customer?
So, given the contract is signed and we are in the execution phase, what are your options?
Do the prediction and send over the report while moving on with other work.
Ask the customer to adjust the agreement to include a meaningful estimate.
Ignore the 217 requirement and provide a complete reliability case detailing the reliability performance of the product.
Find a new position that will not include MTBF parts count prediction.
The choice is yours.
I hope you would call out the misstep in the contract and help all parties get the information concerning reliability that they can actually use to make meaningful decisions.
A troublesome question arrived via email the other day. The author wanted to know if I knew how and could help them adjust the parameters of a parts count prediction such that they arrived at the customer’s required MTBF value.
In the section on predictions you mention Dr. Box’s oft-quoted statement that “…all models are wrong, but some are useful.” In the same book Dr. Box also wrote, “Remember that all models are wrong; the practical question is how wrong do they have to be to not be useful.” [see these and other quotes by Dr. George Box here]
Reliability predictions are intended to be used as risk and resource management tools. For example, a prediction can be used to:
Compare alternative designs.
Guide improvement by showing the highest contributors to failure.
Evaluate the impact of proposed changes.
Evaluate the need for environmental controls.
Evaluate the significance of reported failures.
None of these require that the model provide an accurate prediction of field reliability. The absolute values aren’t important for any of the above tasks; the relative values are. This is true whether you express the result as a hazard rate/MTBF or as a reliability. Handbook methods provide a common basis for calculating these relative values; a standard, as it were. The model is wrong, but if used properly it can be useful.
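For instance, comparing two candidate designs needs only the relative totals. The per-part failure rates below are invented for illustration; only the comparison matters, not the absolute values:

```python
# Hypothetical per-part failure rates (failures per 10^6 h), assuming
# a series (parts count) model where rates simply add.
design_a = {"mcu": 0.12, "dc_dc": 0.35, "connector": 0.08, "flash": 0.20}
design_b = {"mcu": 0.12, "ldo": 0.15, "connector": 0.08, "flash": 0.20}

rate_a = sum(design_a.values())
rate_b = sum(design_b.values())
print(f"Design B lowers the predicted failure rate by {(1 - rate_b / rate_a):.0%}")
```

Whether the handbook’s absolute numbers are right matters far less here than whether they rank the two power-supply choices consistently.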
Think about the use of RPNs in certain FMEAs. The absolute value of the RPN is meaningless; the relative value is what’s important. For sure, an RPN of 600 is high, unless every other RPN is greater than 600. Similarly, an RPN of 100 isn’t very large, unless every other RPN is less than 100. The RPN is wrong as a model of risk, but it can be useful.
I once worked at an industrial facility where the engineers would dump a load of process data into a spreadsheet. Then they would fit a polynomial trend line to the raw data. They would increase the order of the polynomial until R^2 = 1 or they reached the maximum order supported by the spreadsheet software. The engineers and management used these “models” to support all sorts of decision making. They were often frustrated because they seemed to be dealing with the same problems over and over. The problem wasn’t with the method; it was with the organization’s misunderstanding, and subsequent misuse, of regression and model building. In this case, the model was so wrong it wasn’t just useless, it was often a detriment.
Reliability predictions often get bad press. In my experience, this is mostly the result of misunderstanding their purpose and misusing the results. I haven’t used every handbook method out there, but each that I have used states somewhere that the prediction is not intended to represent actual field reliability. For example, MIL-HDBK-217 states, “…a reliability prediction should never be assumed to represent the expected field reliability.”
I think the term “prediction” misleads the consumer into believing the end result is somehow an accurate representation of fielded reliability. When this ends up not being the case, rather than reflecting internally, we prefer to conclude the model must be flawed.
All that said, I would be one of the first to admit the handbooks could and should be updated and improved. We should strive to make the models less wrong, but we should also strive to use them properly. Using them as estimators of field reliability is wrong whether the results are expressed as MTBF or reliability.