What is Reliability?

Guest Post by Martin Shaw

As today’s products become more and more electronic, do the designers and manufacturers really understand what reliability IS?

It is NOT simply following standards for testing in R&D with a focus only on design robustness. That approach carries too much risk in prediction confidence: it deals only with the ‘intrinsic’ failure period and rarely applies sufficient test strength to stimulate failures. Continue reading “What is Reliability?”

The Magic Math of Meeting MTBF Requirements


I recently heard from a reader of NoMTBF. She wondered about a supplier’s argument that they meet the reliability or MTBF requirements. She was right to wonder.

Estimating the reliability performance of a new design is difficult.

There are good and better practices to justify claims about future reliability performance. Likewise, there are plain poor approaches, too. Plus there are approaches that should never be used.

The Vendor Calculation to Support Claim They Meet Reliability Objective

Let’s say we contract with a vendor to create a navigation system for our vehicle. The specification includes functional requirements, form factor, and a long list of other requirements. It also clearly states the reliability specification: the unit should survive 10 years of use within our vehicle with 95% probability. We provide environmental and functional requirements in detail.

The vendor first converts the 95% probability of success over 10 years into MTBF, claiming they are ‘more familiar’ with MTBF. They ignore the requirement for the probability of success over the first month of operation. Likewise, they ignore the 5-year reliability target, or, as they would convert it, the 5-year MTBF requirement.

[Note: if you were tempted to calculate the equivalent MTBF, please don’t. It’s not useful, nor relevant, and a poor practice. Suffice it to say it would be a large and meaningless number]
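To see just how large and meaningless that number would be, here is the arithmetic the conversion implies. This is a sketch assuming the exponential model the MTBF conversion rests on; the hours-per-year figure is my own addition.

```python
import math

# Under the (questionable) exponential assumption R(t) = exp(-t / MTBF),
# the MTBF "equivalent" to 95% reliability over 10 years is:
reliability = 0.95
years = 10.0

mtbf_years = -years / math.log(reliability)   # ≈ 195 years
mtbf_hours = mtbf_years * 8760                # ≈ 1.7 million hours

print(round(mtbf_years), round(mtbf_hours))
```

An MTBF near 195 years says nothing about surviving a 10-year life; it is purely an artifact of the exponential assumption.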

RED FLAG: By converting the requirement into MTBF, the vendor suggests they may be making simplifying assumptions. This may permit easier use of estimation, modeling, and testing approaches.

The Vendor’s Approach to ‘Prove’ They Meet the MTBF Requirement

The vendor reported they met the reliability requirement using the following logic:

Of the 1,000 (actually more) components, we selected 6 at random for accelerated life testing (ALT). We estimated the lower 60% confidence bound on the probability of surviving 10 years given the ALT results, then converted the ALT results to an MTBF for each part.

We then added the Mil Hdbk 217 failure rate estimate to the ALT result for each of the 6 parts.

RED FLAG: This one has me wondering about the rationale for adding the failure rates from an ALT and from a parts count prediction. It would make the failure rate higher. Maybe it was a means to add a bit of margin to cover the uncertainty? I’m not sure; do you have any idea why someone would do this? Are they assuming the ALT did not actually measure anything relevant or any specific failure mechanism, or that they used a benign stress? ALT details were not provided.

The Approach Gets Weird Here

Then they used a 217 parts count prediction, along with the six modified component failure rates, to estimate the system failure rate, and with a simple inversion estimated the MTBF. They then claimed the system design will meet the field reliability performance requirements.
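For illustration only, the vendor’s arithmetic amounts to something like the following sketch. Every failure rate here is made up, and the sum-then-invert step assumes constant failure rates and a simple series system, neither of which was justified.

```python
# Hypothetical illustration of the vendor's math: sum constant failure
# rates (failures per hour) for every part in series, then invert.
alt_derived = [2e-7, 5e-8, 1e-7, 3e-8, 8e-8, 1.2e-7]   # the 6 ALT'd parts (made-up rates)
parts_count = [1e-8] * 994                              # the remaining parts (made-up rate)

system_failure_rate = sum(alt_derived) + sum(parts_count)
system_mtbf_hours = 1.0 / system_failure_rate

# The inversion presumes every part fails at a constant rate and that
# any single part failure fails the system; neither was demonstrated.
print(system_mtbf_hours)
```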

RED FLAG: MIL-HDBK-217F, section 3.3, states:

Hence, a reliability prediction should never be assumed to represent the expected field reliability …

If you are going to use a standard, any standard, you should read it. Read to understand when and why it is, or is not, useful.

What Should the Vendor Have Done Instead?

There are a lot of ways to create a new design and meet reliability requirements.

  • The build, test, fix approach or reliability growth approach works well in many circumstances.
  • Using failure data from similar, actually fielded systems. It may provide a reasonable bound for an estimate of a new system. It may also limit the focus of accelerated testing to only the novel, new, or high-risk areas of the new design, given much of the design is (or may be) similar to past products.
  • Using a simple reliability block diagram or fault tree analysis model to assemble the estimates, test results, and engineering stress/strength analyses (all better estimation tools than parts count, in my opinion) and calculate a system reliability estimate.
  • Using a risk of failure approach with FMEA and HALT to identify the likely failure mechanisms then characterize those mechanisms to determine their time to failure distributions. If there is one or a few dominant failure mechanisms, that work would provide a reasonable estimate of the system reliability.

In all cases, focus on failure mechanisms and how the time to failure distribution changes given changes in stress, environment, and use conditions. Monte Carlo may provide a suitable means to analyze a mixture of data sources to determine an estimate. Use reliability: the probability of success over a duration.
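As a sketch of the Monte Carlo idea, here is a minimal example assuming two made-up Weibull failure mechanisms acting in series; all parameters are invented for illustration.

```python
import random

random.seed(1)

# Two dominant failure mechanisms, each characterized by a Weibull
# time-to-failure distribution (scale in years, shape). Made-up values.
mechanisms = [(40.0, 3.0),   # a wear-out-dominated mechanism
              (120.0, 1.2)]  # a weaker, more random mechanism

trials = 100_000
survived = 0
for _ in range(trials):
    # Series assumption: the system fails at the earliest mechanism failure.
    ttf = min(random.weibullvariate(scale, shape) for scale, shape in mechanisms)
    if ttf > 10.0:
        survived += 1

print(survived / trials)  # estimated reliability at 10 years
```

The same simulation structure extends to mixed data sources: swap in whatever time-to-failure distribution best fits each mechanism’s test or field data.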

In short, do the work to understand the design, its weaknesses, and the time to failure behavior under different use/condition scenarios, and make justifiable assumptions only when necessary.

Summary

We engage vendors to supply custom subsystems given their expertise and ability to deliver the units we need for our vehicle. We expect them to justify that they meet reliability requirements in a rational and defensible manner. While we do not want to dictate the approach to the design or to the estimate of reliability performance, we certainly have to judge the acceptability of their claims to meet the requirements.

What do you report when a customer asks if your product will meet the reliability requirements? Add to the list of possible approaches in the comments section below.

Related

How to Calculate MTBF

Questions to ask a vendor

MTBF: According to a Component Supplier

A World of Constant Failure Rates

What if all failures occurred truly randomly?

The math would be easier.

The exponential distribution would be the only time to failure distribution. We wouldn’t need Weibull or other complex multi-parameter models. Knowing the failure rate for an hour would be all we would need to know, over any time frame.

Sample size and test planning would be simpler. Just run the samples at hand long enough to accumulate enough hours to provide a reasonable estimate of the failure rate.

Would the Design Process Change?

Yes, I suppose it would. The effects of early life and wear out would not exist. Once a product is placed into service, the chance of failing in the first hour would be the same as in any hour of its operation. It would fail eventually, and the chance of failing before a year would depend solely on the chance of failure per hour.

A higher failure rate would suggest a lower chance of surviving very long. Yet the product could still fail in its first hour of use; and if it had survived for one million hours, its chance of failing in the next hour would still be the same.
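The memoryless property behind this can be shown in a few lines; the failure rate here is purely illustrative.

```python
import math

failure_rate = 1e-5  # failures per hour (made-up value)

def p_fail_next_hour(hours_survived: float) -> float:
    """P(fail before t+1 | survived to t) under the exponential model."""
    # = 1 - R(t+1)/R(t), which reduces to 1 - exp(-failure_rate)
    r_now = math.exp(-failure_rate * hours_survived)
    r_next = math.exp(-failure_rate * (hours_survived + 1))
    return 1 - r_next / r_now

# Brand new unit vs. one that has run a million hours: same answer.
print(p_fail_next_hour(0), p_fail_next_hour(1_000_000))
```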

Would Warranty Make Sense?

Since by design we could not create a product with a low initial failure rate, we would focus only on the overall failure rate: the chance of failing over any hour (the first hour being convenient and easy to test, yet still meaningful). Any single failure in a customer’s hands could occur at any time and would not, alone, suggest the failure rate has changed.

Maybe a warranty would make sense based on customer satisfaction. We could estimate the number of failures over a time period and set aside funds for warranty expenses. I suppose it would place a burden on the design team to create products with a lower failure rate per hour. Maybe warranty would still make sense.
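A sketch of the warranty-reserve arithmetic in this constant-failure-rate world; every number here is invented for illustration.

```python
import math

fleet_size = 10_000
failure_rate = 2e-5      # failures per hour (made-up)
warranty_hours = 8760    # a one-year warranty

# With a constant rate, the chance any unit fails within warranty:
p_fail_in_warranty = 1 - math.exp(-failure_rate * warranty_hours)
expected_claims = fleet_size * p_fail_in_warranty

cost_per_claim = 150.0   # made-up repair/replace cost
warranty_reserve = expected_claims * cost_per_claim

print(round(expected_claims), round(warranty_reserve))
```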

How About Maintenance?

If there are no wear out mechanisms (this is a make-believe world), changing the oil in your car would not make any economic sense. The existing oil has the same chance of engine-seize failure as any new oil. The lubricant doesn’t break down. Seals do not leak. Metal-on-metal movement doesn’t cause damaging heat or abrasion.

You may have to replace a car tire due to a nail puncture, yet an accident due to worn tire tread would not occur any more often than with new tires. We wouldn’t need to monitor tire tread or brake pad wear. Those wear-out mechanisms wouldn’t occur.

If a motor is running now and we know the failure rate, we can calculate the chance of it running for the rest of the shift, even when the motor is as old as the building.

The concepts of reliability centered maintenance, predictive maintenance, or even preventative maintenance would not make sense. There would be no advantage to swapping a part for a new one, as the chance to fail would remain the same.

Physics of Failure and Prognostic Health Management – would they make sense?

Understanding failure mechanisms so we could reduce the chance of failure would remain important. Yet when the failures do not

  • Accumulate damage
  • Drift
  • Wear
  • Abrade
  • Diffuse
  • Degrade
  • Etc.

Then much of the predictive power of PoF and PHM would not be relevant. We wouldn’t need sensors to monitor conditions that lead to failure, as no failure would show a sign or indication before it occurred. Nothing would indicate it was about to fail, as that would imply its chance of failure had changed.

No more tune-ups or inspections, we would pursue repairs when a failure occurs, not before.

A world of random failures, a world in which every failure occurs at a constant rate, would be quite different from our world. So, why do we so often make this assumption?

The Constant Failure Rate Myth


Have you said or have you heard someone say,

  • “Let’s assume it’s in the flat part of the curve”
  • “Assuming constant failure rate…”
  • “We can use the exponential distribution because we are in the useful life period.”

Or something similar? Did you cringe? Well, you should have. Continue reading “The Constant Failure Rate Myth”

Spotted a Current Reference to Mil Hdbk 217


After a short convulsion of disbelief, I was shocked. This was a guide for a government agency advising design teams and supporting reliability professionals. It suggested using 217 to create the estimate of field reliability performance during the development phase.

Have we made no progress in 30 years?

What would you do?

Let’s say you are reviewing a purchase contract and find a request for a reliability estimate based on MIL-HDBK-217F (the latest revision, which has also been obsolete for many years). What would you do? Would you contact the author and request an update to the document? Would you pull out 217 and create the estimate? Would you estimate the reliability of the product using the best available current methods, then convert that work to an MTBF and adjust the 217 inputs to create a similar result? Or would you ignore the 217 requirement and provide a reliability case instead?

Requirements are the requirements

When a customer demands a parts count prediction as a condition of the purchase, is that useful for either the development team or the customer?

No.

So, given the contract is signed and we are in the execution phase, what are your options?

  1. Do the prediction and send over the report while moving on with other work.

  2. Ask the customer to adjust the agreement to include a meaningful estimate.

  3. Ignore the 217 requirement and provide a complete reliability case detailing the reliability performance of the product.

  4. Find a new position that will not include MTBF parts count prediction.

The choice is yours.

I hope you would call out the misstep in the contract and help all parties get the information concerning reliability that they can actually use to make meaningful decisions.

Adjusting Parameters to Achieve MTBF Requirement


A troublesome question arrived via email the other day. The author wanted to know whether I could help them adjust the parameters of a parts count prediction so that they arrived at the customer’s required MTBF value.

I was blunt with my response. Continue reading “Adjusting Parameters to Achieve MTBF Requirement”

How to Estimate MTBF


Every now and then I receive an interesting question from a connection, colleague, or friend. When a question makes me think, or the discussion may be of value to you, I write a blog post.

In this case, there are a couple of interesting points to consider. Hopefully you are not facing a similar question. Continue reading “How to Estimate MTBF”

What is the Purpose of Reliability Predictions

In Response to ‘What was the Original Purpose of MTBF Predictions?’


Guest Post by Andrew Rowland, Executive Consultant, ReliaQual Associates, LLC, www.reliaqual.com in response to the ‘Reliability Predictions‘ article.

Hi Fred,

In the section on predictions you mention Dr. Box’s oft-quoted
statement that “...all models are wrong, but some are useful.”  In the
same book Dr. Box also wrote, “Remember that all models are wrong; the
practical question is how wrong do they have to be to not be useful.” [see these and other quotes by Dr. George Box here]

Reliability predictions are intended to be used as risk and resource
management tools.  For example, a prediction can be used to:

  • Compare alternative designs.
  • Guide improvement by showing the highest contributors to failure.
  • Evaluate the impact of proposed changes.
  • Evaluate the need for environmental controls.
  • Evaluate the significance of reported failures.

None of these require that the model provide an accurate prediction of
field reliability.  The absolute values aren’t important for any of the
above tasks, the relative values are.  This is true whether you express
the result as a hazard rate/MTBF or as a reliability.  Handbook methods
provide a common basis for calculating these relative values; a
standard as it were.  The model is wrong, but if used properly it can
be useful.

Think about the use of RPNs in certain FMEAs.  The absolute value of
the RPN is meaningless; the relative value is what’s important.  For
sure, an RPN of 600 is high, unless every other RPN is greater than
600.  Similarly, an RPN of 100 isn’t very large, unless every other RPN
is less than 100.  The RPN is wrong as a model of risk, but it can be
useful.

I once worked at an industrial facility where the engineers would dump
a load of process data into a spreadsheet.  Then they would fit a
polynomial trend line to the raw data.  They would increase the order
of the polynomial until R^2 = 1 or they reached the maximum order
supported by the spreadsheet software.  The engineers and management
used these “models” to support all sorts of decision making.  They were
often frustrated because they seemed to be dealing with the same
problems over and over.  The problem wasn’t with the method, it was
with the organization’s misunderstanding, and subsequent misuse, of
regression and model building.  In this case, the model was so wrong it
wasn’t just useless, it was often a detriment.

Reliability predictions often get bad press.  In my experience, this is
mostly the result of misunderstanding of their purpose and misuse of
the results.  I haven’t used every handbook method out there, but each
that I have used states somewhere that the prediction is not intended
to represent actual field reliability.  For example, MIL-HDBK-217 states,

“…a reliability prediction should never be assumed to represent the expected field reliability.”

I think the term “prediction” misleads
the consumer into believing the end result is somehow an accurate
representation of fielded reliability.  When this ends up not being the
case, rather than reflecting internally, we prefer to conclude the
model must be flawed.

All that said, I would be one of the first to admit the handbooks could
and should be updated and improved.  We should strive to make the
models less wrong, but we should also strive to use them properly.
Using them as estimators of field reliability is wrong whether the
results are expressed as MTBF or reliability.

Best Regards,

Andrew

 

Reliability Growth without MTBF


Really? Is MTBF the only way to work with reliability growth?

Received this question via LinkedIn (feel free to connect with me there) and hadn’t given it much thought before. I am familiar with a few growth models and have regularly seen MTBF in use, and thus had discounted the modeling as an approach of little interest to me or my clients.

MTBF measures the inverse of the average failure rate, when in many cases we really want to know about the first or tenth percentile of time to failure. Measuring and tracking the average time to failure provides little information about the onset of the first few failures.
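A quick illustration of the gap between the mean and the early percentiles, using made-up Weibull parameters:

```python
import math

# For a Weibull time-to-failure distribution with wear-out behavior,
# compare the mean life (what MTBF reports) to the B10 life (the time
# by which 10% of units have failed). Both parameters are assumed.
shape = 2.0       # beta > 1 implies wear-out (assumed)
scale = 1000.0    # eta, characteristic life in hours (assumed)

mean_life = scale * math.gamma(1 + 1 / shape)          # ≈ 886 hours
b10_life = scale * (-math.log(0.90)) ** (1 / shape)    # ≈ 325 hours

print(round(mean_life), round(b10_life))
```

One in ten units has failed long before the mean life arrives, which is exactly the information an average-only metric hides.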

Reliability Growth Models

I did just a quick check of common reliability growth models and found a few in the NIST Engineering Statistics Handbook: http://www.itl.nist.gov/div898/handbook/apr/section1/apr19.htm.

The Homogeneous Poisson Process (HPP) applies when the failure rate is constant over the time period of interest. It relies on the exponential distribution and the assumption of a stable, random arrival of failures, which is almost always untrue (in my experience). It’s a convenient assumption as it makes the math a lot simpler, yet it provides only a crude model and poor results.

The Non-Homogeneous Poisson Process (NHPP) Power Law and Exponential Law models provide information based on the cumulative number of failures over time. These models rely on the notion that any system has a finite number of design errors that, once resolved, leave a system that exhibits HPP behavior.

A Duane plot provides a graphical means to show cumulative failures over time. When the arrival of failures slows, the curve decreases in slope, effectively bending over. This provides a means to estimate the final failure rate (an average, unfortunately).
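A sketch of reading the Duane slope numerically, using synthetic failure times that follow an exact power law, N(t) = a·tᵇ; the parameters are invented for illustration.

```python
import math

# Generate the time of the n-th failure from N(t) = a * t^b, b < 1.
a, b = 0.5, 0.6
failure_times = [(n / a) ** (1 / b) for n in range(1, 21)]

# On log-log axes the cumulative failure curve is a straight line;
# a least-squares fit of log N vs log t recovers the exponent b.
xs = [math.log(t) for t in failure_times]
ys = [math.log(n) for n in range(1, 21)]

n = len(xs)
x_bar, y_bar = sum(xs) / n, sum(ys) / n
slope = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / \
        sum((x - x_bar) ** 2 for x in xs)

print(round(slope, 3), round(1 - slope, 3))  # b, and growth slope 1 - b
```

With real data the points scatter around the line; a shallower fitted slope over time indicates the arrival of failures is slowing.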

What I use instead

Given my dislike of all things MTBF, I’ve not used these models to estimate MTBF. Instead, I stay with the Duane plot and graphically track whether the team is finding and fixing enough faults in the design.

I also tend to use reliability block diagrams (RBD), with each block modeled with the appropriate reliability distribution. For a series model, all we need to do is multiply the reliability values from each block at time t (say, the warranty period or mission time) to estimate the system reliability at time t.
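A minimal sketch of that series multiplication; the block distributions and parameters are invented for illustration, not taken from any real product.

```python
import math

def weibull_r(t, scale, shape):
    """Weibull reliability at time t (hours)."""
    return math.exp(-(t / scale) ** shape)

def exponential_r(t, mtbf):
    """Exponential reliability at time t, for a truly random-failure block."""
    return math.exp(-t / mtbf)

def system_reliability(t):
    # Series RBD: every block must survive for the system to survive.
    blocks = [
        weibull_r(t, scale=50_000, shape=1.8),   # e.g. a fan (made-up)
        weibull_r(t, scale=120_000, shape=3.0),  # e.g. solder fatigue (made-up)
        exponential_r(t, mtbf=500_000),          # a genuinely random block
    ]
    r = 1.0
    for block_r in blocks:
        r *= block_r
    return r

print(round(system_reliability(8760), 4))  # system reliability at one year
```

Note that only one block uses the exponential; the others keep their own time-to-failure behavior, which is the whole point of the RBD approach.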

For complex systems with some amount of redundancy, the RBD gets a bit more complicated. For very complex systems with degraded modes of operation or significant repair times, use Petri nets or Markov models to model the system properly.

In the vast majority of cases a simple RBD is sufficient to capture and understand the reliability of a system. This allows the team to focus on improving weak areas and to reduce uncertainty through improved reliability estimates. An RBD does not require nor assume an exponential distribution, and the math is easy enough to manage, often even in your favorite spreadsheet.

Summary

Reliability growth starts with a model of the estimated number of failures over a time period. Testing then provides a value for that estimate. This does not require the use of MTBF; instead of assuming a constant failure rate, focus on the failure mechanisms and use a simple RBD to build a system model. The reliability growth is the result of identifying areas for improvement and making the improvements. An RBD, in my experience, provides a great way to communicate to the team where to focus improvements.

No excuse to use parts count to estimate field reliability

How to Estimate Reliability Early in a Program

In a few discussions about the perils of MTBF, individuals have asked about estimating MTBF (reliability) early in a program. They quickly referred to various parts count prediction methods as the only viable means to estimate MTBF.

One motivation to create reliability estimates is to provide feedback to the team. The reliability goal exists and the early design work is progressing, so estimating the performance of the product’s functions is natural. The mechanical engineers may use finite element analysis to estimate responses of the structure to various loads. Electrical engineers may use SPICE models for circuit analysis.

Customers expect a reliable product. If they are investing in the development of the product (military vehicle, custom production equipment, or solar power plant, for examples) they may also want an early estimate of reliability performance.

Engineers and scientists estimate reliability during the concept phase as they determine the architecture, materials, and major components. The emphasis is often on creating a concept that will deliver the features in the expected environment. The primary method for reliability estimation is engineering judgement.

With the first set of designs, there is more information available on specific material, structures, and components, thus it should be possible to create an improved reliability estimate.

Is testing the true way to estimate MTBF?

Early in a program means there are no prototypes available for testing, just a bill of materials and drawings. So, what is a reliability engineer to do?

One could argue that without prototypes or production units available for testing (exercising or aging the system to simulate use conditions) we do not really know how the system will respond to use conditions. While it is true it is difficult to know what we do not know, we often do know quite a bit about the system and the major elements and how they individually will respond to use conditions.

Even with testing, we often use engineering judgement to focus the stresses employed to age a system. We apply prior knowledge of failure mechanism models to design accelerated tests. And, we use FMEA tools to define the areas most likely to fail, thus guiding our test development.

Creating a reliability estimate without a prototype

Engineering judgement is the starting point. Include the information from FMEA and other risk assessment methods to identify the elements of a product that are most likely to fail, thus limit the system reliability. Then there are a few options available to estimate reliability, even without a prototype.
First, it is rare to create a new product using all new materials, assembly methods, and components.

Often a new product is approximately 80% the same as previous or similar products. The new design may be a new form factor, thus mostly a structural change. It may include new electronic elements, often just one or two components, with the remaining components in the circuit already in regular use. Or it may involve a new material, reusing known structures and circuits.

Use the field history of similar products or subsystems and engineering judgement for the new elements to create an estimate. A simple reliability block diagram may be helpful to organize the information from various sources.

For the new elements of a design, base the engineering judgement on analysis of the potential failure mechanisms, employ any existing reliability models, or use simulations to compare known similar solutions to the new solution.

Second, for the elements without existing similar solutions and without existing failure mechanism models, we would have to rely on engineering judgement or component or test coupon level testing. Rather than wait for the system prototypes, early in a program it is often possible to obtain samples of the materials, structures, or components for evaluation.

The idea is to use our engineering judgement and risk analysis tools to define the most likely failure mechanisms for the elements with unknown reliability performance. Let’s say we are exploring a new surface finish technique. We estimate that exposure to solar radiation may degrade the finish. Therefore, obtain some small swatches of material, apply the surface finish, and expose them to UV radiation. While not the full product using fully developed production processes, it is a way to evaluate the concept.

Another example is a new solder joint attachment technique. Again, use your judgement and risk analysis tools to estimate the primary failure mechanisms, say thermal cycling and power cycling, then obtain test packages with the same physical structures (the IC or active elements do not have to be functional) and design appropriate tests for the suspected failure mechanisms.

Combine the available knowledge into an estimate

With a little creativity we can provide a range of estimates for elements of a design that have little or no field history. We do not need to rely on a tabulated list of failure rates for dissimilar products created by a wide range of teams for diverse solutions. We can draw from the actual field performance of our team’s prior designs for the bulk of the estimate, then fill in the remaining elements with engineering judgement, comparative analysis, published reliability models, or coupon or test structure failure mechanism evaluations.

In general, we will understand the bulk of the reliability performance and have rational estimates for the rest. It’s an estimate and the exercise will help us and the team focus on which areas may require extensive testing.

Solving Type III problems


There are occasions when we perfectly solve the wrong problem. This is a Type III error.

Following the statistics idea of Type I and Type II errors, where a sample provides incorrect information about a population, a Type III error is the error of asking the wrong question to start.

Solving the wrong problem, even perfectly, is still an error.

So, how do you know you’re in a Type III situation?

Continue reading “Solving Type III problems”

Reliability Predictions


Who are you fooling with MTBF Predictions?

All models are wrong, some are useful. ~ George E. P. Box

If you know me, you know I do not like MTBF. Trying to predict MTBF, which I consider a worthless metric, is folly.

So, why the article on predicting MTBF?

Predicting MTBF or creating an estimate is often requested by your customer or organization. You are being specifically asked for MTBF for a new product.

You have to come up with something.

Continue reading “Reliability Predictions”

Calculating reliability from data

A short example on calculating reliability from data. And, a comparison to the calculation assuming the failure rate is constant.

In the last note, we calculated MTBF using some test data. Now let’s start with the same situation and calculate reliability instead. As before: There are occasions when we have either field or test data that includes the duration of operation and whether or not the unit failed.
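As a preview of the comparison, here is a sketch with invented data standing in for the field or test records described above.

```python
import math

# Hypothetical records: (hours accumulated, failed?) for 10 units.
# Unfailed units were removed from test at 500 hours.
data = [(150, True), (400, True), (500, False), (500, False), (500, False),
        (320, True), (500, False), (500, False), (90, True), (500, False)]

t = 300  # evaluate reliability at 300 hours

# Direct estimate: fraction of units still operating at t (units that
# failed after t were alive at t, so they count as survivors of t).
survivors = sum(1 for hours, _ in data if hours > t)
r_direct = survivors / len(data)

# Constant-failure-rate shortcut: MTBF = total hours / failures,
# then R(t) = exp(-t / MTBF). Note how different the answer is.
total_hours = sum(hours for hours, _ in data)
failures = sum(1 for _, failed in data if failed)
mtbf = total_hours / failures
r_exponential = math.exp(-t / mtbf)

print(round(r_direct, 2), round(r_exponential, 2))
```

The two numbers disagree because the exponential shortcut spreads the observed failures evenly over all accumulated hours, regardless of when they actually occurred.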

Continue reading “Calculating reliability from data”

Eliminating early life failures

Finding and eliminating early life failures

MTBF for electronics life entitlement measurements is a meaningless term. It says nothing about the distribution of failures or the cause of failures, and it is valid only for a constant failure rate, which almost never occurs in the real world. It is a term that should be eliminated, along with reliability predictions of electronics systems with no moving parts. Continue reading “Eliminating early life failures”

Electronics Failure Prediction Methodology does not work

Posted 12-11-2012 by Kirk Gray, Accelerated Reliability Solutions, L.L.C.

“When the number of factors coming into play in a phenomenological complex is too large, scientific method in most cases fails.  One need only think of the weather, in which case the prediction even for a few days ahead is impossible.” ― Albert Einstein

“Prediction is very difficult, especially about the future.” – Niels Bohr*

We have always had a quest to reduce future uncertainties and know what is going to happen to us, how long we will live, and what may impact our lives.  Horoscopes, Tarot

Continue reading “Electronics Failure Prediction Methodology does not work”