No Evidence of Correlation: Field failures and Traditional Reliability Engineering

Historically, reliability engineering of electronics has been dominated by two beliefs: 1) that the life of complex hardware, or the percentage of it that fails over time, can be estimated, predicted, or modeled, and 2) that the reliability of electronic systems can be calculated or estimated through statistical and probabilistic methods in order to improve hardware reliability. The amazing thing is that, during the many decades in which reliability engineers have been taught this and have believed it to be true, little if any empirical field data from the vast majority of verified failures has shown any correlation with calculated predictions of failure rates.

Probabilistic statistical predictions based on broad assumptions about the underlying physical causes began with the first electronics reliability prediction guide, RCA release TR-1100, “Reliability Stress Analysis for Electronic Equipment”, published in November 1956, which presented models for computing rates of component failures. This publication was followed by the “RADC Reliability Notebook” in October 1959, and then by the publication of a military reliability prediction handbook format known as MIL-HDBK-217.

The practice continues today with various software applications that are descendants of MIL-HDBK-217. Underlying these “reliability prediction assessment” methods and calculations is the assumption that the main driver of unreliability is components that have intrinsic failure rates moderated by absolute temperature. It has been assumed that component failure rates follow the Arrhenius equation and that they approximately double for every 10 °C rise in temperature.
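For readers who have not worked with the assumption described above, here is a minimal sketch of what it looks like numerically. It is only an illustration of the Arrhenius rate ratio, exp((Ea/k)(1/T_low - 1/T_high)), and not the calculation from MIL-HDBK-217 or any prediction tool. The function names and the temperature pairs (25/35 °C and 75/85 °C) are hypothetical choices made for this example.

```python
import math

BOLTZMANN_EV_PER_K = 8.617333262e-5  # Boltzmann constant in eV/K


def celsius_to_kelvin(t_c):
    """Convert a temperature from degrees Celsius to kelvin."""
    return t_c + 273.15


def arrhenius_acceleration(t_low_c, t_high_c, activation_energy_ev):
    """Ratio of the Arrhenius rate at t_high_c to the rate at t_low_c."""
    t_low = celsius_to_kelvin(t_low_c)
    t_high = celsius_to_kelvin(t_high_c)
    exponent = (activation_energy_ev / BOLTZMANN_EV_PER_K) * (1.0 / t_low - 1.0 / t_high)
    return math.exp(exponent)


def activation_energy_for_doubling(t_low_c, t_high_c):
    """Activation energy (eV) at which the rate exactly doubles between the two temperatures."""
    t_low = celsius_to_kelvin(t_low_c)
    t_high = celsius_to_kelvin(t_high_c)
    return BOLTZMANN_EV_PER_K * math.log(2.0) / (1.0 / t_low - 1.0 / t_high)


# Illustrative values only: pick the activation energy that gives a doubling
# from 25 C to 35 C, then reuse it for a 10 C step at a higher temperature.
ea = activation_energy_for_doubling(25.0, 35.0)
ratio_25_35 = arrhenius_acceleration(25.0, 35.0, ea)
ratio_75_85 = arrhenius_acceleration(75.0, 85.0, ea)

print(f"Activation energy for doubling between 25 C and 35 C: {ea:.3f} eV")
print(f"Rate ratio, 25 C -> 35 C: {ratio_25_35:.3f}")
print(f"Rate ratio, 75 C -> 85 C: {ratio_75_85:.3f}")
```

Under these assumed values, an activation energy of about 0.55 eV gives a doubling between 25 °C and 35 °C, but the same activation energy gives a ratio below 2 for the 10 °C step from 75 °C to 85 °C. So "doubling every 10 °C" is a rule of thumb even inside the Arrhenius model, and the sketch says nothing about whether that model describes real field failures.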

MIL-HDBK-217 was removed as a military reference document in 1996 and has not been updated since that time. It is still referenced unofficially by military contractors and is still believed to have some validity, even without any supporting evidence.

Much of the slow change in the industry is due to the fact that electronics reliability engineering has a fundamental “knowledge distribution” problem: real field failure data, and the root causes of those failures, can never be shared with the larger reliability engineering community. Reliability data is some of the most confidential and sensitive data a manufacturer has, and short of a court order it will never be published. Without this real data and information being disseminated and shared, one can expect little change in the beliefs of the vast majority of the electronics reliability engineering community.

Even though the probabilistic prediction approach to reliability has been practiced and applied for decades, any engineer who has seen the root causes of verified field failures will observe that almost all failures that occur before the electronic system is technologically obsolete are caused by 1) errors in manufacturing, 2) overlooked design margins, or 3) accidental overstress or abuse by the customer. The timing of the root causes of these failures, which are often driven by multiple events or stresses, is random and inconsistent. Therefore there is no basis for applying statistical or probabilistic predictive methods. Most users of predictions have observed the lack of correlation between estimated and actual failure rates.

It is long past time for electronics design and manufacturing organizations to abandon these invalid and misleading approaches, acknowledge that reliability cannot be estimated from assumptions and calculations, and start using “stress to limits” to find latent failure mechanisms before a product is released to market. It is true that stressing to limits cannot derive a time to failure for most systems, but then no test can provide an actual field “life” estimate for a complex electronic system, nor do we need one. There is more life than needed in most electronics for most applications.

Fortunately, there is an alternative. A much more pragmatic and effective approach is to put most engineering and testing resources into discovering overlooked design margins or weakest links early in the design process (HALT), and then to use the resulting strength and durability to quickly screen (HASS) for errors during manufacturing. HALT and HASS have little to do with a specific type of chamber or chamber capabilities. They represent a fundamental change in the frame of reference for reliability development, moving from time metrics to stress/limit metrics. Many have already adopted this new frame of reference. Since they have found these methods much more efficient and cost effective for developing robust electronics systems, the methods give them a competitive advantage. They are not about to let the world or their competitors know how successful these methods are.

About Kirk Gray

Kirk Gray, Founder and Principal Consultant of Accelerated Reliability Solutions, L.L.C., has over thirty-two years of experience in the electronics manufacturing industry. Mr. Gray began his career in electronics at the semiconductor level and followed the manufacturing process through systems level testing. As a field engineer for Accelerators Inc. and Veeco Instruments from 1977 to 1982, he installed and serviced helium mass spectrometers (leak detection), ion implantation systems, and many other thin-film, high vacuum systems used in semiconductor fabrication. As a Sales Engineer for Veeco Instruments and CVC from 1982 through 1986, he worked with semiconductor process engineers to solve thin-film application and etching process issues and equipment applications. As the Environmental Stress Screening (ESS) Process Engineering Manager in manufacturing test at Storage Technology from 1989 to 1992, he worked with Dr. Gregg K. Hobbs, the inventor of the terms and techniques of Highly Accelerated Life Test (HALT) and Highly Accelerated Stress Screening (HASS). In 1994 he formed AcceleRel Engineering, Inc., a consulting company. He has led a wide variety of electronics companies, including biomedical, telecommunications, power supply, and other electronic systems producers, to methods of HALT and HASS and to rapidly improving the reliability of electronic and electromechanical hardware. From 2003 until 2010 Kirk was a Sr. Reliability Engineer at Dell, Inc., where he created new HALT based test processes for desktop and portable computers and a HASA process required for all Dell Power Supply providers. He is a Senior Member of the IEEE, a charter member of the IEEE/CPMT Technical Committee on Accelerated Stress Testing and Reliability (ASTR), and the 2012 General Chair of the IEEE/CPMT Workshop on ASTR to be held in Toronto, Canada in the fall of 2012. He is now Principal Consultant at Accelerated Reliability Solutions, L.L.C., dedicated to leading companies to rapid development of reliability in electronics and electromechanical systems. He is also a senior collaborator with the University of Maryland's CALCE consortium.

4 thoughts on “No Evidence of Correlation: Field failures and Traditional Reliability Engineering”

  1. Good post! Keep on improving reliability by understanding why and how components and systems fail and what the worst scenarios are, and by eliminating the (root) hazards and/or protecting against overload and system failures. Physics based engineering approaches and HALT / HASS testing should be encouraged, not empirical accounting studies (this is already done far too often, may result in “The Numbers Game”, and promotes reactive management…). Reliability for new, complex systems cannot be predicted!

  2. Hi,
    Thanks for a thought-provoking article. I can easily agree that SR-332 predictions don’t match observed reliability from field data, and I’ve seen several studies that show this to be the case. I can also agree that wear-out isn’t important for electronics products; these days even the fans have an expected life that is longer than the product is likely to remain in service. On a related question, what is the evidence that the Arrhenius law is or isn’t valid for electronics? I’d guess there is some relationship between temperature and failure rate, so what is that relationship? Are there any studies or tests that cover this? It is an important question if you want to calculate acceleration factors for an accelerated life test.
    Regards
    Mike

    1. Hi Mike, thanks for your comments and concurrence with my assertions.
      For certain there are physical failure mechanisms that have a chemical reaction element and therefore may have an Arrhenius law relationship. That being said, the vast majority of physical failure mechanisms in electronics at the system level have no relation to Arrhenius (i.e., loose connectors, solder defects, via cracking), yet it has been widely assumed and misapplied in reliability development. In many cases it has added unnecessary costs and possibly made a system less reliable. You can get a PDF copy of a paper by Michael Pecht and me on long term high temperature testing of PCs here http://www.acceleratedreliabilitysolutions.com/images/Long-Term_Overstressing_of_Computers.pdf from my website, http://www.acceleratedreliabilitysolutions.com.
      You might also be interested in another paper too! It is written by the US Government and is public domain so please reprint and distribute widely – http://www.acceleratedreliabilitysolutions.com/images/Reliability_Predictions_Continued_Reliance_on_a_Misleading_Approach.pdf
