Posted 12-11-2012 by Kirk Gray,
Accelerated Reliability Solutions, L.L.C.
“When the number of factors coming into play in a phenomenological complex is too large, scientific method in most cases fails. One need only think of the weather, in which case the prediction even for a few days ahead is impossible.” ― Albert Einstein
“Prediction is very difficult, especially about the future.” – Niels Bohr*
We have always had a quest to reduce future uncertainties and to know what is going to happen to us, how long we will live, and what may impact our lives. Horoscopes, Tarot cards, tea leaves, and crystal balls have been used as specialized “tools” by fortune tellers to gaze into the future. The paradox of fortune telling is that by knowing the future, we can change it. The risk of believing we know the future is that if we incorrectly guess (assume) the causes of a future event, our preventive action may create additional costs or a higher risk of an even worse event.
We all want to know the future – what crystal ball do you use?
This is also true when making predictions of the future life of electronics. Without clear traceability to the actual physics of failure in electronics, assumptions about the causes of failures have added costs without benefits. Still today, the long-established belief persists that electronics systems failures are mostly driven by thermal stress. This belief persists in spite of the fact that there is no traceability to an intrinsic physical mechanism in components or systems driven by thermal stress in today’s electronics.
Many in management and marketing of electronics companies want to believe, and wish, that reliability engineers could predict the life of electronics systems. By knowing the future failure rates, we could budget warranty costs and the correct number of spare parts and replacement units before the product is launched.
In 1996 my friend Professor Michael Pecht wrote and published the article “Why traditional reliability prediction models do not work – is there an alternative?”. In it he provides the history of one of the foundational documents of electronics reliability engineering, Military Handbook 217 (MIL HDBK 217), and explains why it cannot predict electronic system failure rates. It was removed as a military reference document in 1995, largely due to the work of Prof. Pecht. It is amazing that MIL HDBK 217, removed almost 17 years ago, is still being referenced and that its progeny are still being used for reliability predictions in many electronics companies today. Needless to say, electronics materials and manufacturing methods have changed tremendously in the last 17 years, but the continued belief that electronics systems reliability can be predicted has changed little in that time. Electronics reliability cannot be predicted at a system level.
The vast majority of failures of electronics hardware are due to design margin errors, component misapplication, errors in manufacturing processes, and customer misuse or abuse. It is very easy to confirm this is the case if you have access to the root causes of real field failures in real electronics products.
“All models are wrong, but some are useful” – George E. P. Box
Mathematical models to predict future events are, in many cases, valid and useful. Computer models and measurement systems that are used in meteorology to forecast weather conditions are improving, yet the ability to predict the weather more than a few days ahead has been elusive. There would be huge benefits, economically and in human lives, if we could project more than a few hours in advance when and where extreme weather events such as tornadoes or hurricanes will occur. With more inputs of contributing atmospheric conditions and better computer algorithms, weather forecasting is getting better. Yet extreme weather prediction is still limited to a few hours for tornadoes, or a few days for hurricanes, before we know where they will hit.
Of course, reliability prediction could be performed more accurately if we knew all of the many inherent potential failure mechanisms in an electronics system and their fatigue responses to the life cycle environmental profile (LCEP) stresses. Even if we could know all the inherent failure mechanisms in components, we would also need information on the time distributions of manufacturing variations and excursions that would modify the strength or rate of degradation of those mechanisms during manufacturing.
In many mechanical and electromechanical systems we do have physical wear mechanisms that can be mathematically modeled, and from those models we can mathematically project the “life” of the mechanism. We know that in electric motors, wear of contact brushes, evaporation of lubricants, and wear of ball bearings eventually use up life, leading to failures due to wear out. Mechanical switches and hinges have a limited fatigue life. Through those models we can extend the life of mechanical systems by increasing the reservoir of material or reducing the driving stress conditions.
In electronics there are a few devices, such as batteries, that do have wear-out modes that are short relative to technological obsolescence, and for these, modeling life is very useful and necessary. It is much more difficult to determine the underlying life-limiting mechanisms of solid state electronics components such as ICs in a complex system, much less those in a PWB. Not only must the intrinsic physics causing component degradation and failure be known, but the PWB and solder fatigue mechanisms must also be known for each package. BGA solder joints and PTH (plated through hole) vias do not fatigue at the same rate under the same stress inputs. Of course, the stresses for all the mechanisms on the PWB and components can vary widely depending on their locations on the PWB.
LCEP for most electronics systems is a very rough guesstimate
Reliability prediction must also determine the Life Cycle Environmental Profile (LCEP) and the LCEP distributions for the future field population. We must know, to some precision, the actual LCEP stress distributions along with the inherent product “strength entitlement” distributions in order to know where the strength distribution overlaps the stress distribution, resulting in product failures. Please see my blog post “Reliability Paradigm Shift From Time to Stress Metrics” for more explanation of the Stress/Strength relationship in reliability.
Many electronics systems have a wide variety of LCEPs, and new applications of those systems result in new LCEPs that were never considered. Take the example of the VGA projectors that we see in many conference and meeting rooms. Some projectors are permanently mounted on the ceiling and many others are mobile. The ceiling-mounted units’ fatigue stress most likely comes from thermal cycling during power cycling, and the mobile units have that stress plus the shock and vibration from transporting. The mobile units’ populations have a much wider distribution of LCEPs. I doubt the manufacturers of these products know the distribution of the LCEP for these two distinct end use environments. End users will expect the same reliability in both, regardless of the very different LCEPs.
Of course, some of the mobile units will break instantaneously from an accidental drop. If and when a unit breaks from an accidental drop, will the user blame their own mishandling, or blame the manufacturer for making a “fragile” projector and never buy from that manufacturer again? Certainly we do not expect our cell phones to fail after a waist-high drop, but at what drop height would we accept that the failure was caused by us?
When it comes to electronics systems reliability modeling and prediction, we really cannot know all the mechanisms or the distributions of the LCEP. Even if all the degradation models were known and all the combinations of stress distributions and effects in the assembly were known, the challenge of reliability prediction is compounded by variations over time in manufacturing.
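The stress/strength overlap discussed in this section can be illustrated with a minimal sketch. It assumes, purely for illustration, that both stress and strength are independent and normally distributed; the function names and every number below are hypothetical and do not come from any real projector or field data. The point is not to predict a failure rate, but to show how strongly the result depends on the assumed width of the LCEP stress distribution.

```python
from math import erf, sqrt


def normal_cdf(z: float) -> float:
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))


def interference_probability(stress_mean: float, stress_sd: float,
                             strength_mean: float, strength_sd: float) -> float:
    """P(stress > strength) for independent, normally distributed
    stress and strength (an assumption made for this sketch only)."""
    z = (stress_mean - strength_mean) / sqrt(stress_sd ** 2 + strength_sd ** 2)
    return normal_cdf(z)


# Hypothetical values in arbitrary stress units, for illustration only.
STRENGTH_MEAN, STRENGTH_SD = 100.0, 10.0

# Same mean stress, but the mobile population has a much wider spread.
ceiling_mounted = interference_probability(50.0, 5.0, STRENGTH_MEAN, STRENGTH_SD)
mobile = interference_probability(50.0, 25.0, STRENGTH_MEAN, STRENGTH_SD)

print(f"ceiling-mounted overlap: {ceiling_mounted:.2e}")
print(f"mobile overlap:          {mobile:.2e}")
```

With these made-up inputs, the only change between the two cases is the assumed spread of the stress distribution, yet the overlap differs by several orders of magnitude. A guess about the LCEP distribution therefore dominates whatever number comes out.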
Focus on real weakness discovery – less on guessing a very uncertain future
We have even less time to model partial or whole systems and the resulting fatigue damage and degradation as the design and manufacturing cycle times for new electronics continue to decrease. Even if we are able to model the degradation and fatigue damage of every potential failure mechanism in a PWB, the models must be based on the units from capable manufacturing, not variations, and we know there will be variations. Additionally, modeling can only establish a failure rate based on inherent wear out mechanisms around the center of the distribution of known LCEPs, even though there may be new applications and different future LCEPs that were not known when the product was designed.
“The best way to predict the future is to create it.” – Peter Drucker
Just as with predictions of our own future, many would like to know what the future holds for the electronics we make and use. Yet for complex electronic systems, most failures are not due to intrinsic wear out mechanisms, but are due to a wide variety of assignable causes that are mostly errors in design or manufacturing. Because of the very diverse nature of the causes of electronics failures, there has been no evidence that we can model and predict future failure rates, regardless of the fact that many still want to believe it can be done and want it to be true.
Deterministic empirical stress limit and weakness discovery is a vastly more efficient tool for building a reliable electronic system. Using stepped stress to limits methods (such as HALT) and focusing on discovery of potential weaknesses that could be a reliability risk, we can very quickly find the strength limits of complex electronic systems under stress conditions in order to establish a benchmark of strength based on current standard electronics technologies. By knowing empirical stress limits, we can develop safe and efficient ongoing accelerated reliability testing to precipitate and detect manufacturing errors or excursions that result in latent defects.
Unfortunately, there is still the wish that accurate prediction is possible, and many are still feeding the wish that reliability of electronics hardware can be predicted based on past invalid documents. Without the ability to share real field reliability data, that belief is likely to continue.
For an additional, humorous look at reliability prediction, see Electronics Cooling magazine’s article “Nostradamitech” and CALCE’s “Mil-Hdbk-217 Alternate Methods”.
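The stepped stress to limits idea described in this section can be sketched in a few lines. This is only a rough illustration of the stepping logic, not a HALT procedure: the function `find_operating_limit`, the stand-in `simulated_unit`, and all temperature values are hypothetical. In a real test, each failure found would be investigated to root cause rather than simply recorded.

```python
from typing import Callable, Optional, Tuple


def find_operating_limit(unit_passes: Callable[[float], bool],
                         start: float, step: float,
                         max_level: float) -> Tuple[Optional[float], Optional[float]]:
    """Raise the stress level one step at a time until the unit first
    fails its functional test.

    Returns (last_passing_level, first_failing_level). The second value
    is None if no failure occurs up to max_level.
    """
    last_pass = None
    level = start
    while level <= max_level:
        if not unit_passes(level):
            return last_pass, level
        last_pass = level
        level += step
    return last_pass, None


def simulated_unit(temp_c: float) -> bool:
    """Hypothetical stand-in for a real functional test in a chamber.
    The 97 degC threshold is made up for this example."""
    return temp_c < 97


last_pass, first_fail = find_operating_limit(
    simulated_unit, start=40, step=10, max_level=150)

SPEC_MAX_C = 60  # hypothetical specified maximum operating temperature
print(f"last passing level:  {last_pass} C")
print(f"first failing level: {first_fail} C")
print(f"margin above spec:   {last_pass - SPEC_MAX_C} C")
```

The output of interest is an empirical limit and the margin between that limit and the specification, which can then be compared across designs or production lots. No time-to-failure estimate is involved.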
[Blog Addendum 4-29-2014] Since this blog was originally posted, an excellent US Government article reinforcing the fallacy of using reliability prediction as a tool for building reliable systems was presented at the 2013 RAMS. The paper “Reliability Predictions – Continued Reliance on a Misleading Approach” was written by Christopher Jais, US Army Materiel Systems Analysis Activity; Benjamin Werner, US Army Materiel Systems Analysis Activity; and Diganta Das, Center for Advanced Life Cycle Engineering, University of Maryland. You may download the article “Reliability Predictions – Continued Reliance on a Misleading Approach”. It is a public domain document, so please distribute it widely to all those who still believe there is validity and value in using reliability predictions for electronics systems development.
It is most unfortunate that the same arguments against the use of reliability predictions for electronics systems development have to be repeated almost two decades after Michael Pecht’s 1996 article “Why traditional reliability prediction models do not work – is there an alternative?”.
From Reliasoft’s Reliability Discussion Forum
leeping: It is said that someone can use very little data for analysis in ALT, and the result is perfect. The method is called the ‘small sample technique’. Is it possible?
Pantelis: Does it use a magic crystal ball?
Hi Oleg,
I’m not sure if Pantelis follows this forum, yet small sample ALT is the best we can do sometimes and does provide some evidence. Perfect – NO! – that is why we also compute and discuss the error bounds for the results as a gauge of the risk taken using so few samples.
cheers,
Fred
Great article Kirk. Humans apply components wrong, commit errors in manufacturing, and abuse products — then we blame it on the component and give it some failure rate.
Thanks Chet, and thanks for your help.