In some cases we have to conduct testing and are asked not to break the product. Now, that isn’t all that fun as a reliability engineer. We want to find what fails and understand it. Or we want to confirm that what we expect to fail actually fails as expected.
So, what do we do when confronted with a very small sample size (one issue) and are expected to conduct failure-free testing (a second issue)? Let’s explore each issue separately and come up with a few suggestions on how to proceed.
Thanks to Олег (@OlegV_Ivanov) via Twitter for the article suggestion and the idea.
Sample Size Limitation
When doing product or reliability testing, the first step is to be clear about the objective of the testing. If the objective is to prove 98% reliability over 5 years with at least 90% statistical confidence, then we have a problem. A sample size of one or two will not be sufficient to demonstrate that reliability with that level of confidence.
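To put a number on that problem, the zero-failure binomial relation C = 1 - R^n gives the confidence C that reliability is at least R when all n units survive the full test. The minimal Python sketch below assumes each unit is stressed for the full 5-year equivalent with no failures; the function name is hypothetical.

```python
def zero_failure_confidence(reliability: float, n: int) -> float:
    """Confidence that true reliability is at least `reliability`
    when n units all pass with zero failures: C = 1 - R**n."""
    return 1.0 - reliability ** n


TARGET_R = 0.98  # 98% reliability over the 5-year mission
TARGET_C = 0.90  # 90% statistical confidence

# What one or two passing units can actually support
for units in (1, 2):
    conf = zero_failure_confidence(TARGET_R, units)
    print(f"{units} unit(s), zero failures: {conf:.1%} confidence")

# Smallest zero-failure sample that reaches the confidence target
n = 1
while zero_failure_confidence(TARGET_R, n) < TARGET_C:
    n += 1
print(f"Units needed for {TARGET_R:.0%} reliability at {TARGET_C:.0%} confidence: {n}")
```

One or two passing units support only about 2% and 4% confidence in 98% reliability, while the 90% confidence target takes 114 units with no failures.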
On the other hand, if we want to check the functionality of the system integration, and not make claims about the population of future builds, then we’re fine. Not all testing has the same objective, and the better you understand how constraints limit what testing can show, the better you can craft and achieve meaningful objectives.
HALT (highly accelerated life testing) is a great tool to discover limitations or weaknesses of a design or assembly. And it can be accomplished with just one or two samples. So, if the objective is to determine margins, weaknesses, and the like, then HALT may well be the approach for the situation.
Even when not limited by sample size, it is a good idea to fully model the expected failure mechanisms and the system reliability. Populating the model with the best available information, plus experimental results on elements of the system, allows us to estimate system reliability performance.
Furthermore, a solid reliability model sets up the testing well. It allows us to focus on what is likely to fail or change over time, which makes it possible to test a single unit with the purpose of confirming the model. Ideally, run the test to failure to confirm the failure mechanism models. I’m thinking of the Boeing 777 wing testing as an example.
Subsystem testing is another approach when facing limited system-level samples. If you know the compressor is a critical element for the reliability of the overall unit, then testing just the compressor may allow enough samples for a statistically based test.
Another benefit of subsystem testing is the ability to focus on specific failure mechanisms and generally apply the appropriate stresses without the many other elements of the system confounding or obscuring the results. Accelerated testing of a compressor outside a system may, for example, permit operating at higher loads and temperatures than otherwise possible.
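When the applied stress is temperature and the dominant failure mechanism is thermally activated, one common way to relate test hours to field hours is the Arrhenius acceleration factor, AF = exp[(Ea/k)(1/T_use - 1/T_test)], with temperatures in kelvin. The Arrhenius model is my choice of illustration here, not something this article prescribes, and the activation energy and temperatures below are assumed values rather than compressor data. The model only applies if the mechanism really does follow Arrhenius behavior.

```python
import math

BOLTZMANN_EV_PER_K = 8.617333262e-5  # Boltzmann constant in eV/K


def arrhenius_af(ea_ev: float, t_use_c: float, t_test_c: float) -> float:
    """Arrhenius acceleration factor between use and test temperatures (deg C)."""
    t_use_k = t_use_c + 273.15
    t_test_k = t_test_c + 273.15
    return math.exp((ea_ev / BOLTZMANN_EV_PER_K) * (1.0 / t_use_k - 1.0 / t_test_k))


# ASSUMED values for illustration only: 0.7 eV, 40 C in use, 85 C in test
af = arrhenius_af(ea_ev=0.7, t_use_c=40.0, t_test_c=85.0)
field_hours = 5 * 8760  # 5 years of continuous operation
print(f"AF = {af:.1f}; test hours to cover 5 years = {field_hours / af:.0f}")
```

Because AF is exponential in Ea, a modest error in the assumed activation energy changes the required test time a lot, which is one more reason the failure mechanism has to be well understood before accelerating it.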
Monitoring the performance of one or two units may provide the necessary reliability information, provided the monitored performance degrades in a way that leads to or indicates system failure. The catch is understanding which measures relate to the failure mechanisms of interest.
For example, if a motor’s primary failure mechanism is brush wear, we may be able to monitor current draw as a means to track the brush wear and anticipate eventual failure. Current draw may also change as impedance, rotating friction, or some other mechanism changes over time, hence this approach works well when you really understand the dominant failure mechanisms.
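A minimal sketch of that idea: fit a straight line to periodic current-draw readings from a single motor and extrapolate to the level at which we would call the brushes worn out. The readings, the 2.5 A threshold, and the assumption that current rises linearly with brush wear are all hypothetical; a real analysis would use a degradation model tied to the actual mechanism.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def hours_to_threshold(hours, amps, threshold_amps):
    """Extrapolate the fitted trend to the failure threshold (None if no upward trend)."""
    intercept, slope = fit_line(hours, amps)
    if slope <= 0:
        return None  # current is not rising, so this sketch cannot extrapolate
    return (threshold_amps - intercept) / slope


# HYPOTHETICAL readings from one motor under test
hours = [0, 250, 500, 750, 1000]
amps = [1.50, 1.55, 1.60, 1.65, 1.70]
predicted = hours_to_threshold(hours, amps, threshold_amps=2.5)
print(f"Predicted hours to the 2.5 A threshold: {predicted:.0f}")
```

Even without a failure during the test, the extrapolated crossing point (about 5000 hours for these made-up readings) can be compared with what the brush wear model predicts.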
In most cases, treating one or two units as a statistically representative sample of the population is not feasible. So, think through what you are trying to do with the testing and what is possible. Using a sample to make a statement about the population may not be possible with any reasonable statistical confidence. Be clear about that when designing the test.
A common approach for no-failure testing (success testing is what I call it) is Binomial (pass/fail) testing with zero failures. The success testing sample size formula is
$$ n = \frac{\ln(1 - C)}{\ln(R)} $$
where n is the sample size needed to show at least R reliability with C confidence, rounded up to the next whole unit. For example, you would need to run 22 samples over a lifetime of stress, with no failures, to show 90% reliability with 90% confidence.
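The zero-failure sample size relation n = ln(1 - C) / ln(R) translates directly into a small helper. This is a sketch under the usual success testing assumptions (independent units, each run for the full lifetime of stress, zero failures allowed); the function name is hypothetical.

```python
import math


def success_test_sample_size(reliability: float, confidence: float) -> int:
    """Zero-failure (success) test sample size: n = ln(1 - C) / ln(R), rounded up."""
    if not (0.0 < reliability < 1.0 and 0.0 < confidence < 1.0):
        raise ValueError("reliability and confidence must be strictly between 0 and 1")
    ratio = math.log(1.0 - confidence) / math.log(reliability)
    # A small tolerance keeps floating-point noise from rounding an exact
    # integer result up to the next unit.
    return math.ceil(ratio - 1e-9)


print(success_test_sample_size(0.90, 0.90))  # 22
```

Raising the reliability target to 95% at the same 90% confidence pushes the requirement to 45 units, which shows how quickly the sample size grows.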
With small sample sizes, success testing can only confirm rather poor lower limits for reliability. Setting confidence below 50% implies the sample’s results are more likely not to represent the population, thus I recommend not going lower than 60% confidence.
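Solving the zero-failure binomial relation C = 1 - R^n for R gives the lowest reliability that n passing units can demonstrate, R = (1 - C)^(1/n). A minimal sketch (the function name is hypothetical) makes the "rather poor lower limits" concrete for one and two units:

```python
def reliability_lower_bound(n: int, confidence: float) -> float:
    """Lower reliability bound demonstrated by n units with zero failures."""
    return (1.0 - confidence) ** (1.0 / n)


for n in (1, 2):
    for c in (0.60, 0.90):
        print(f"n={n}, C={c:.0%}: R >= {reliability_lower_bound(n, c):.1%}")
```

One passing unit demonstrates only 40.0% reliability at 60% confidence (10.0% at 90%); two units raise that to 63.2% and 31.6%.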
What can we learn from a test with no failures? Not much, is my initial response. Even with a solid set of failure mechanism models, without a failure we have no confirmation that the testing aged the product as expected or that the system would fail as expected.
Of course, if we do have a failure while running a success test, that is usually a surprise and contains all kinds of great information.
So, what can we reasonably expect to learn? If the testing and models are all in good shape, we gain a little confirmation that our assumptions are ok. That is about it.
If we are able to tear down the units after testing and examine the wear or aging effects, we gain model confirmation that is now supported by evidence. If we can monitor degradation paths to failure, even if there is no failure during the testing, we again learn about the system’s ability to follow the failure mechanism models (or we gain some information that we can use to update our models).
Reliability monitoring and measurements on even one sample under success testing help to create a profile of what to expect during operation. It’s a start. As we run units in production or in actual use, continuing to collect reliability data helps us eventually build and confirm the models of reliability performance.
Reliability block diagrams, Petri net or Markov modeling, along with subsystem testing and analysis, allow us to understand and estimate system-level performance even for complex systems. Simply testing one or two units using success testing without an understanding of the failure mechanisms is, in my opinion, a waste of time.
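As a small illustration of how subsystem results roll up through a reliability block diagram, the sketch below combines blocks in series (all must work) and in parallel (at least one must work), assuming independent failures. The subsystem names, the redundant fan arrangement, and the reliability values are hypothetical placeholders, not results from any test.

```python
import math


def series(*reliabilities: float) -> float:
    """All blocks must survive: R = R1 * R2 * ... * Rn (independent failures)."""
    return math.prod(reliabilities)


def parallel(*reliabilities: float) -> float:
    """At least one block must survive: R = 1 - (1 - R1)(1 - R2)...(1 - Rn)."""
    return 1.0 - math.prod(1.0 - r for r in reliabilities)


# HYPOTHETICAL 5-year reliability estimates for each subsystem
compressor = 0.95  # e.g. from a statistically based compressor test
controller = 0.99
fan = 0.97         # two redundant fans, either one is sufficient

system = series(compressor, controller, parallel(fan, fan))
print(f"Estimated system reliability: {system:.4f}")
```

With these placeholder numbers the estimate is about 0.94, dominated by the compressor, which is the kind of insight that tells us where limited test samples are best spent.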
2 Samples and No Failures
Is this really reliability testing? In some cases it is not. It is simply a basic demonstration that the prototype works as expected for the duration of the testing. Using any form of acceleration requires a fundamental understanding of the failure mechanisms and how they respond to the applied stress.
When you are asked to set up a reliability test and are limited to one unit that you should not break, your first step is to come to a clear understanding with those needing the test results. What do they expect the test results will mean? What decisions are going to be made based on the results? And does everyone understand the limitations of reliability testing in general, and specifically given the constraints of limited samples and success testing?
Make it clear what can and cannot be said based on the results. Then go out and run the test in the best way you can. Have fun; it might break, and you and your team will certainly learn something.