Give me a place to stand on, and I will move the Earth.
Archimedes
It is well known that HALT is an effective way to find the weaknesses in your product during a reliability improvement program. Used this way, HALT is a qualitative test only: we cannot determine the reliability and lifetime of the product from its results. So, unfortunately, we cannot use HALT for Type Certification, to confirm the lifetime of Critical Parts, or to predict warranty and maintenance costs, all of which are required, for example, in aviation.
If we could combine the effectiveness of HALT (high acceleration of testing) with the benefits of quantitative testing, we would get a very powerful tool for the Reliability Demonstration and the Reliability Development of new products.
Three prototypes survive the gauntlet of stresses and none fail. That is great news, or is it? No failure testing is what I call success testing.
We want to create a successful design and therefore enjoy successful test results, i.e., no failures means we are successful, right?
Another aspect of success testing is that in pass/fail testing we can minimize the sample size by planning for every prototype to pass the test. If we plan on running the test until we have a failure or two, we need more samples. While that improves the statistics of the results, we have to spend more to achieve them. We nearly always have limited resources for testing.
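To put rough numbers on that tradeoff, here is a minimal sketch using the standard binomial model for pass/fail testing. The function names and the 90% reliability / 90% confidence target are assumptions chosen for illustration, not values from a specific test plan.

```python
from math import comb

def binomial_confidence(n, reliability, allowed_failures):
    """Confidence that true reliability >= `reliability` after testing
    n units and seeing no more than `allowed_failures` failures."""
    q = 1.0 - reliability
    # Probability of seeing this few failures if reliability were exactly at the target
    p_pass = sum(comb(n, k) * q**k * reliability**(n - k)
                 for k in range(allowed_failures + 1))
    return 1.0 - p_pass

def min_sample_size(reliability, confidence, allowed_failures=0):
    """Smallest n that demonstrates the target at the requested confidence."""
    n = allowed_failures + 1
    while binomial_confidence(n, reliability, allowed_failures) < confidence:
        n += 1
    return n

# Illustrative target: demonstrate 90% reliability with 90% confidence
for f in (0, 1, 2):
    print(f"allowed failures = {f}: n = {min_sample_size(0.90, 0.90, f)}")
```

For this illustrative target the plan grows from 22 units with no failures allowed, to 38 with one, to 52 with two, which is the extra cost of planning for failures.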
Let’s say we want to characterize the reliability performance of a vendor’s device. We’re considering including the device within our system, if and only if, it will survive 5 years reasonably well.
The vendor’s data sheet lists an MTBF value of 200,000 hours. A call to the vendor and search of their site doesn’t reveal any additional reliability information. MTBF is all we have.
We don’t trust it. Which is wise.
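To see how little the data sheet value tells us, here is a quick sketch of what it would imply if we took it at face value. It assumes a constant failure rate (the exponential distribution, which the MTBF figure alone cannot confirm) and continuous operation at 8,760 hours per year; both are assumptions, not vendor data.

```python
from math import exp

HOURS_PER_YEAR = 8760  # assumption: the device runs 24/7

def exponential_reliability(t_hours, mtbf_hours):
    """Probability of surviving t_hours, assuming a constant failure rate."""
    return exp(-t_hours / mtbf_hours)

mtbf = 200_000                    # hours, from the vendor's data sheet
mission = 5 * HOURS_PER_YEAR      # 5 years of continuous use
r5 = exponential_reliability(mission, mtbf)
print(f"Implied 5-year reliability: {r5:.3f}")
```

Under those assumptions roughly 80% of units survive 5 years, so about one in five fails. Whether that counts as "reasonably well" is a separate question, and the number is only as good as the constant failure rate assumption behind it.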
Now we want to run an ALT to estimate a time to failure distribution for the device. The intent is to use an acceleration model to accelerate the testing and a time to failure model to adjust to our various expected use conditions.
If you have been a reliability engineer for a week or more, or worked with a reliability engineer for a day or more, someone has asked you about test planning. The conversation may have started with “how many samples and how long will the test take?”
I recently heard from a reader of NoMTBF. She wondered about a supplier’s argument that they meet the reliability or MTBF requirements. She was right to wonder.
Estimating the reliability performance of a new design is difficult.
There are good and better practices to justify claims about future reliability performance. Likewise, there are just plain poor approaches, too. Plus there are approaches that should never be used.
The Vendor Calculation to Support Claim They Meet Reliability Objective
Let’s say we contract with a vendor to create a navigation system for our vehicle. The specification includes functional requirements, the form factor, and a long list of other requirements. It also clearly states the reliability specification. Let’s say the unit should last with 95% probability over 10 years of use within our vehicle. We provide environmental and functional requirements in detail.
The vendor first converts the 95% probability of success over 10 years into MTBF, claiming they are ‘more familiar’ with MTBF. They ignore the requirement for the probability of success in the first month of operation. Likewise, they ignore the 5 year targeted reliability or, as they would convert it, MTBF requirement.
[Note: if you were tempted to calculate the equivalent MTBF, please don’t. It’s not useful, nor relevant, and a poor practice. Suffice it to say it would be a large and meaningless number]
RED FLAG Converting the requirement into MTBF suggests they may be making simplifying assumptions. This may permit easier use of estimation, modeling, and testing approaches.
The Vendor’s Approach to ‘Prove’ They Meet the MTBF Requirement
The vendor reported they met the reliability requirement using the following logic:
Of the 1,000 (more actually) components we selected 6 at random for accelerated life testing. We estimated the lower 60% confidence bound on the probability of surviving 10 years given the ALT results. Then we converted the ALT results to MTBF for the part.
We then added the MIL-HDBK-217 failure rate estimate to the ALT result for each of the 6 parts.
RED FLAG This one has me wondering about the rationale for adding the failure rates of an ALT and a parts count prediction. It would make the failure rate higher. Maybe it was a means to add a bit of margin to cover the uncertainty? I’m not sure. Do you have any idea why someone would do this? Are they assuming the ALT did not actually measure anything relevant or any specific failure mechanisms, or that they used a benign stress? ALT details were not provided.
The Approach Gets Weird Here
Then we used a 217 parts count prediction along with the 6 modified component failure rates to estimate the system failure rate and, with a simple inversion, estimated the MTBF. The vendor then claimed the system design will meet the field reliability performance requirements.
Hence, a reliability prediction should never be assumed to represent the expected field reliability …
If you are going to use a standard, any standard, you should read it. Read it to understand when and why it is useful or not useful.
What Should the Vendor Have Done Instead?
There are a lot of ways to create a new design and meet reliability requirements.
The build, test, fix approach or reliability growth approach works well in many circumstances.
Using failure data from similar, actually fielded systems. It may provide a reasonable bound for an estimate of a new system. It may also limit the focus of the accelerated testing to only the novel, new, or high risk areas of the new design, given much of the design is (or may be) similar to past products.
Using a simple reliability block diagram or fault tree analysis model to assemble the estimates, test results, and engineering stress/strength analysis (all better estimation tools than parts count, in my opinion) and calculate a system reliability estimate.
Using a risk of failure approach with FMEA and HALT to identify the likely failure mechanisms then characterize those mechanisms to determine their time to failure distributions. If there is one or a few dominant failure mechanisms, that work would provide a reasonable estimate of the system reliability.
In all cases focus on failure mechanisms and how the time to failure distribution changes given changes in stress / environment / use conditions. Monte Carlo may provide a suitable means to analyze a great mixture of data to determine an estimate. Use reliability, the probability of success over a duration.
In short, do the work to understand the design, its weaknesses, the time to failure behavior under different use/condition scenarios, and make justifiable assumptions only when necessary.
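As a rough illustration of the Monte Carlo idea, here is a minimal sketch that combines time to failure distributions for a few failure mechanisms into a system reliability estimate. The mechanism names and Weibull parameters are entirely hypothetical, and the model assumes independent mechanisms acting in series (the first one to fail takes the system down).

```python
import random
from math import exp

# Hypothetical mechanisms: (name, Weibull shape beta, Weibull scale eta in years)
MECHANISMS = [
    ("solder joint fatigue", 2.5, 40.0),
    ("connector corrosion", 1.5, 80.0),
    ("capacitor wear-out", 3.0, 30.0),
]

def simulate_system_reliability(mechanisms, mission_years, trials=20_000, seed=1):
    """Fraction of simulated systems with no mechanism failing before mission_years."""
    rng = random.Random(seed)
    survived = 0
    for _ in range(trials):
        # Series system: it fails at the earliest mechanism failure time.
        # random.weibullvariate takes (scale, shape) in that order.
        first_failure = min(rng.weibullvariate(eta, beta)
                            for _, beta, eta in mechanisms)
        if first_failure > mission_years:
            survived += 1
    return survived / trials

def analytic_series_reliability(mechanisms, mission_years):
    """Closed-form check, available only because the mechanisms are independent Weibulls."""
    r = 1.0
    for _, beta, eta in mechanisms:
        r *= exp(-((mission_years / eta) ** beta))
    return r

print(f"Monte Carlo: {simulate_system_reliability(MECHANISMS, 10.0):.3f}")
print(f"Analytic:    {analytic_series_reliability(MECHANISMS, 10.0):.3f}")
```

The simulation earns its keep when the inputs are a mixture of test results, field data, and stress/strength estimates that have no tidy closed form. The analytic line is here only as a sanity check on the sketch.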
Summary
We engage vendors to supply custom subsystems given their expertise and ability to deliver the units we need for our vehicle. We expect them to justify that they meet reliability requirements in a rational and defensible manner. While we do not want to dictate the approach to the design or the estimate of reliability performance, we certainly have to judge the acceptability of the claims that they meet the requirements.
What do you report when a customer asks if your product will meet the reliability requirements? Add to the list of possible approaches in the comments section below.
Reliability testing is expensive. The results are often not conclusive.
Yet we spend billions on environmental, accelerated, growth, step stress and other types of reliability tests. We bake, shake, rattle and roll prototypes and production units alike. We examine the collected data in hopes of glimpsing the future. Continue reading Is Your Reliability Testing Adding Value?→
A result of life testing can be measurement or evaluation of the lifetime.
Measurement of the lifetime requires a lot of testing to failure. The results provide us with the life (time-to-failure) distribution of the product itself. It is long and expensive.
Evaluation of the lifetime does not require as many test samples and these tests can be without failures. It is faster and cheaper [1]. A drawback of the evaluation is that it does not give us the lifetime distribution. The evaluation checks the lower bound of reliability only, and interpretation of the results depends on the method of evaluation (the number of samples, test conditions, and the test time). Continue reading Lifetime Evaluation vs. Measurement. Part 2.→
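To illustrate how the number of samples and the test time shape what a failure-free evaluation can claim, here is a minimal sketch. It assumes the zero-failure binomial bound and a Weibull life distribution with a known shape parameter beta; the function name and the numbers are illustrative assumptions, not taken from the article.

```python
def demonstrated_reliability(n_units, test_time, mission_time, confidence, beta):
    """Lower bound on reliability at mission_time after n_units each
    survive test_time with zero failures (Weibull shape beta assumed known)."""
    # Zero-failure bound on reliability at the test time itself
    r_at_test_time = (1.0 - confidence) ** (1.0 / n_units)
    # Weibull scaling: ln R(t) = ln R(T) * (t / T) ** beta
    return r_at_test_time ** ((mission_time / test_time) ** beta)

# Same 22 units, 90% confidence, mission of 1000 hours, assumed beta = 2
for test_hours in (500, 1000, 2000):
    r = demonstrated_reliability(22, test_hours, 1000, 0.90, 2.0)
    print(f"test time {test_hours:>4} h: lower bound {r:.3f}")
```

The same units and the same zero failures support very different claims depending on how long the test ran, and the result is only a lower bound, not a lifetime distribution.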
Not our personal or moral standards, rather the set of documents we rely upon as a foundation for reliability engineering tools and techniques.
We have a wide array of standards for everything from reporting reliability test data to calculating confidence intervals on field returns. We have standards that describe various environmental conditions and appropriate testing levels suitable to evaluate your product. We define terms, concepts, processes, and techniques.
A Missing Element
Despite the many documents and impressive titles of numbers and abbreviations or acronyms, most of the standards related to reliability engineering fail to include sufficient context and rationale concerning when and why to use or modify the standard. If a specific test is meant to determine the expected lifetime of solder joints, well, for which types of solder joints (shape, size, configuration, material, and process) is the standard appropriate, and when does it not apply? Make the boundaries of applicability clear.
No single test works for all situations.
For example, a wrist watch standard defining how to test for specific water resistance claims does not evaluate the effects of corrosion. The standard has the watch or similar device exposed to a set of water conditions, then evaluates whether the system is operating nearly immediately after the water exposure.
We know that water encourages corrosion, yet corrosion takes time to occur. Water alone on a circuit board is no big deal (much of the time); it’s when the water facilitates the creation of additional and unwanted current paths that there is a problem. Metal migration and rusting take time to occur.
If the standard for water resistance doesn’t evaluate corrosion, and it’s one of the ways your product fails, too bad. You can ‘pass’ the test, meet the standard, add it to your data sheet, and the customer will still experience a failure.
The same goes for many environmental testing, FMEA, life testing, field data analysis, and other standards. They do not include the critical information necessary for appropriate application of the standard to your particular situation.
Connection to Value
Many, not all, standards provide a recipe to accomplish a task or evaluation. One of the values of the standard is that different teams may replicate the results of one team by repeating the steps outlined in the standard.
One of the issues with standards is they do not include how and why to actually accomplish the set of tasks and what to do with the results. In part, we need to clearly connect a task, say testing a product across a range of temperature and humidity conditions, to the meaningful information it will provide.
Don’t run the test if the information is unnecessary or meaningless.
For example, if we expect that exposure to high temperature and humid conditions may increase the chance of product failure, we may want to know
how many failures will occur;
how the product will actually fail;
how the failure will initiate and progress;
when the failures occur under use conditions;
Or there may be any number of other reasons to use the results of the testing. Often we run a standard test with very few samples, experience no failures, and erroneously conclude all is good. Then we are surprised that failures occur anyway when the product is in use.
The standard let us down.
The standard provided only a recipe or outline for a procedure, not the guidance and rationale on how it may or may not help us and our team resolve very real questions. Testing 3 units that all pass does not mean your solar panel will survive hot and humid conditions for 20 years with no failures. It doesn’t.
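A quick bit of arithmetic shows why three passing units say so little. This sketch assumes each unit independently survives the test with the same probability; the reliability values are hypothetical.

```python
def prob_all_pass(true_reliability, n_units):
    """Chance that every one of n independent units passes the test."""
    return true_reliability ** n_units

# Hypothetical true reliabilities under the test conditions
for r in (0.95, 0.80, 0.60):
    print(f"true reliability {r:.2f}: "
          f"P(all 3 pass) = {prob_all_pass(r, 3):.3f}")
```

A product that fails the test conditions 20% of the time still sails through a three-unit test about half the time. And that is before asking whether the test conditions represent 20 years in the field at all.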
Run a test or work through a process only if it is tied to answering a question. Focus on business decisions and the questions we have to resolve in order to make better decisions (i.e., be wrong less often).
Summary
Let’s change the way we read and use standards. You may need to add the how and why, the boundaries, and the connection to value for your situation. It’s not always easy. The people writing the standard often have sufficient experience to include guidelines to help you. When possible, contact them and ask what their thinking was and what the limitations are.
If enough of us avoid simply meeting the requirements of the standard, we will
Enjoy reliable product performance
Create value for our organization with each test or task
In some cases we have to conduct testing and are asked to not break the product. Now, that isn’t all that fun as a reliability engineer. We want to find what fails and understand it. Or we want to confirm that what we expect to fail actually does.
So, what do we do when confronted with a very small sample size (that is one issue) and are expected to conduct failure free testing (second issue)? Let’s explore each issue separately and come up with a few suggestions on how to proceed.
I endured a difficult conversation with a project manager yesterday. The meeting agenda included an initial discussion about the product development reliability plan. She agreed that we needed to identify risks and provide feedback to the team concerning product reliability. Continue reading Is Reliability Just Testing?→
Sometimes making an assumption is a good thing. You can achieve more with less. A well placed assumption saves you time, work, and worry. The right assumption may even be left unstated, it’s so good.
Have you ever assumed the failures for a system follow an exponential distribution? Did you assume tallying up the total hours and dividing by the number of failures was appropriate? Did you even check? (You don’t need to answer.) Continue reading The Convenient Use of MTBF→
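Here is a small sketch of why that check matters. Both data sets are made up for illustration: one is dominated by early failures, the other by wear-out, yet the tally-and-divide calculation cannot tell them apart.

```python
from math import exp

def mtbf_estimate(failure_hours):
    """Total operating hours divided by the number of failures."""
    return sum(failure_hours) / len(failure_hours)

def fraction_failed_by(failure_hours, t):
    """Observed fraction of units that failed at or before time t."""
    return sum(1 for h in failure_hours if h <= t) / len(failure_hours)

# Hypothetical times to failure in hours, every unit run to failure
early_failures = [10, 20, 30, 40, 4900]
wear_out = [980, 990, 1000, 1010, 1020]

for name, data in (("early failures", early_failures), ("wear-out", wear_out)):
    mtbf = mtbf_estimate(data)
    predicted = 1 - exp(-100 / mtbf)   # what the exponential assumption implies
    observed = fraction_failed_by(data, 100)
    print(f"{name}: MTBF = {mtbf:.0f} h, "
          f"failed by 100 h: predicted {predicted:.2f}, observed {observed:.2f}")
```

Both sets report the same MTBF, and the exponential assumption predicts the same early failure fraction for each, which matches neither. Plotting the data, or fitting a distribution with a shape parameter, would reveal the difference.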
In an ideal world the designer of a product or system will have perfect knowledge of all the risks and failure mechanisms. The design is then built perfectly, without any errors or unexpected variation, and will simply function as expected for the customer.
Wouldn’t that be nice.
The assumption that we have perfect knowledge is the kicker though, along with perfect manufacturing and materials. We often do not know enough about:
Customer requirements
Operating environment
Frequency of use
Impact of design tradeoffs
Material variability
Process variability
We do know that we do not know everything we need to create a perfect product, thus we conduct experiments.
Normally, we life test a sample of products in order to make sure the products will last as long as expected. We assume that the sample we select will represent the total population of products that we eventually ship. It is not a perfect system, and there is some risk involved. Continue reading Sample size and MTBF→