What Will Advance Reliability Engineering?

Kirk Gray, Accelerated Reliability Solutions, L.L.C.

In all aspects of engineering, we make improvements and innovations in technology only by building on previous knowledge. Yet in the field of reliability engineering (and in particular for electronics assemblies and systems), sharing of knowledge about field failures of electronics hardware and their true root causes is extremely limited. Without the ability to share data and teach what we know about the real causes of “un-reliability” in the field, it is easy to understand why the belief that we can model and predict the future life and MTBF of electronics continues to dominate the field of electronics reliability engineering. Please note that I refer to solid state electronics and assemblies. I make a distinction between Failure Prediction Methodologies (FPM) for electronic assemblies (PWBA and systems), which typically have more life than needed (or really known) for most applications, and those for mechanical systems (e.g. motors, gears, switches), which can have a more limited life attributable to friction and wear out and can be modeled to some degree.


I have been teaching HALT and HASS methods for over 20 years, and a common question I have heard many times is “If HALT is so great, why aren’t there more examples published on its benefits?” There are several reasons why details of real HALT case histories, as well as any other actual empirical electronics reliability data, are rarely published.

Three key reasons are:

  • Competitive advantages of not sharing the most effective reliability practices
  • Potential legal liability for disclosing real causes of field failures
  • Engineers have little time to write and publish

When a company discovers a new product development process that leads to significantly faster times and lower costs to release a mature product to market, it is not likely to tell its competitors. Doing so would cost it the competitive advantages of those new processes. Does your company publish its best methods for product reliability development? So why expect it from any other company?

Legal liability is a huge risk for manufacturers. Failures of electronics systems might lead to significant loss of property or, in the worst case, human lives. Publishing the cause of electronics failures might provide evidence of liability, leading to costly judgments against the product designer or manufacturer. For this reason, most companies will never voluntarily allow failure data to become public. Reliability engineers who want to help the industry by publishing real failure data typically face many challenges in getting permission from their legal departments. Even if they are able to publish something on actual reliability, the paper has so much redacted and “sanitized” for public disclosure that the most significant data may not be published.

It takes time to write and publish any technical paper. In today’s challenging economy, many electronics companies have trimmed engineering departments down to the bone. Engineers struggle to find enough time in the day to complete projects and meet timelines. Few are motivated to take the extra time necessary, and to face their companies’ legal obstacles, to publish the case histories of real field failures of electronics. Without engineers being willing or able to publish the real case histories, details on the root causes of the failures, and the best methods to prevent them, little can be expected in the advancement of the science of electronics reliability development and testing.

As I have stated many times in previous blog posts, if reliability engineers really looked at the root causes of field failures in their own products, they would see the same confirming evidence that, in general, the reliability of an electronics system cannot be predicted from statistics and probabilities. In the 22 years I have spent working with companies to improve the reliability of electronics hardware, I have seen many root causes of field unreliability in electronics systems. I do not recall ever seeing a wear out mode of a nominal electronic component as a cause of field failures. The cause has always been an overlooked design margin (which may appear to be an early wear out), misapplication of a component, an error in manufacturing, or misuse or abuse by the user. If I could publish the real evidence, data, and root causes of all the electronics failures in the wide variety of equipment and companies I have worked with, or heard about through colleagues in the field, the case for using empirical stress limit methods of finding weaknesses would be much clearer.

The conditions that limit the sharing of real field reliability data are not likely to change in the foreseeable future. This is why many companies are still doing FPM fundamentally derived from the invalid MIL-HDBK-217 methods and still use the meaningless term MTBF. While statistical or probabilistic methods are used for many valid engineering design and analysis applications, they have few applications to predicting the random errors and combinations of events that cause failures in solid state electronics. Most electronics systems become technologically obsolete before any inherent wear out modes cause failures.

We can still make progress in the field of electronics reliability, but we must validate the methodology and results on an engineering basis. We must use our knowledge of physics and material science in electronics, as well as the lessons that have been learned about the causes of real field failures. We must make sure, when electronics reliability is developed now and in the future, that there is traceability and reference to the physics of the failure mechanisms. For instance, traditional electronics FPM has used the Arrhenius equation and the broad assumption of 0.7 eV for the activation energy in silicon components as a major factor in predicting the failure rate of components. This belief continues today, even though there is little or no evidence of traceability to physical mechanisms in today’s components and it has been over 16 years since MIL-HDBK-217 was removed as a DoD reference document.
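To illustrate how much a temperature-based prediction hinges on that single assumed number, here is a minimal sketch of the standard Arrhenius acceleration factor, AF = exp[(Ea/k)(1/T_use − 1/T_stress)], with temperatures in kelvin. The 55 °C use and 125 °C stress temperatures, and the alternative activation energies of 0.3 eV and 1.1 eV, are illustrative assumptions chosen for this example only; they are not taken from any handbook or field data.

```python
import math

BOLTZMANN_EV_PER_K = 8.617333262e-5  # Boltzmann constant in eV/K


def acceleration_factor(ea_ev, use_temp_c, stress_temp_c):
    """Arrhenius acceleration factor between a use and a stress temperature.

    ea_ev is the assumed activation energy in eV; temperatures are in deg C.
    """
    t_use_k = use_temp_c + 273.15
    t_stress_k = stress_temp_c + 273.15
    exponent = (ea_ev / BOLTZMANN_EV_PER_K) * (1.0 / t_use_k - 1.0 / t_stress_k)
    return math.exp(exponent)


if __name__ == "__main__":
    # Hypothetical temperatures and activation energies, for illustration only.
    for ea in (0.3, 0.7, 1.1):
        af = acceleration_factor(ea, use_temp_c=55.0, stress_temp_c=125.0)
        print(f"Ea = {ea:.1f} eV -> acceleration factor = {af:.1f}")
```

Under these assumed temperatures, moving the activation energy from 0.3 eV to 1.1 eV changes the computed acceleration factor by roughly two orders of magnitude, so a prediction built on an untraceable 0.7 eV is only as good as that guess.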

When we find a weakness in an electronics system through stepped-stress methods, we should know enough about the materials to know whether the weakness is due to a fundamental limit of technology (FLT), such as the melting of plastics or solder or the limits of LCD operation at temperature, or whether it is due to the in-circuit application of a particular component. After uncovering the causes, we can understand what physics drove the failure and which element to change to increase the system’s strength or capability. Usually it is only necessary to strengthen one or two “weak links” to bring a product’s strength up to the FLT. Sometimes those weak links are software, not hardware, and changing code may be the only change necessary to add significant thermal strength capability and margins. Occasionally the system as designed and built reaches the stress FLT with no change needed, and this becomes a benchmark for subsequent designs. However, if you are not testing to empirical stress limits, you will never find out.
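The stepped-stress idea above can be sketched in a few lines. This is only a schematic of the logic, not a test procedure: the function name `find_operating_limit`, the `passes_at` callback standing in for a functional test of the unit at a given temperature, and all temperature values are hypothetical.

```python
from typing import Callable, Optional


def find_operating_limit(
    passes_at: Callable[[float], bool],
    start_c: float,
    step_c: float,
    flt_c: float,
) -> Optional[float]:
    """Step the stress upward and return the highest level (deg C) that passed.

    Stepping stops at the first failure or once the assumed fundamental
    limit of technology (flt_c) is reached. Returns None if the unit
    fails at the very first step.
    """
    last_pass = None
    level = start_c
    while level <= flt_c:
        if not passes_at(level):
            break  # weak link found; root cause analysis starts here
        last_pass = level
        level += step_c
    return last_pass


if __name__ == "__main__":
    flt_c = 130.0  # hypothetical FLT, e.g. a plastic softening point

    # Hypothetical unit with a weak link that fails above 95 C.
    before_fix = find_operating_limit(lambda t: t <= 95.0, 50.0, 10.0, flt_c)
    # Same unit after the weak link is strengthened.
    after_fix = find_operating_limit(lambda t: t <= 130.0, 50.0, 10.0, flt_c)

    print("limit before fix:", before_fix, "gap to FLT:", flt_c - before_fix)
    print("limit after fix:", after_fix, "gap to FLT:", flt_c - after_fix)
```

The point of the sketch is the comparison against the FLT: the gap between the found limit and the FLT is the margin that strengthening a weak link can recover, and it is only visible if the stress is actually stepped until something fails.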

The causes of the inherent limits on sharing field and test lab reliability data are not likely to change anytime soon. Yet we can change our orientation and approach to electronics reliability development. Realizing the random and unpredictable nature of most electronics failures in the first 5-7 years of use would result in a major shift in activities for many companies developing electronics systems. It is a change that is slowly being adopted by more companies, but they are not going to spread the word to competitors.

The development of reliability based on empirical stress limits still has a long way to go before it becomes the dominant paradigm and activity in electronics reliability engineering. The limited data available from your own products’ failures, together with physics and material science, must be the basis for validating the use of empirical stress limit methodologies. The need for faster reliability development demands it. I have seen the evidence, though I cannot share most of it, that HALT discovery methods have remained valid despite the rapidly changing materials and manufacturing processes in electronics components and systems, and I know they will remain valid in the future.

“A new scientific truth does not triumph by convincing its opponents and making them see the light, but rather because its opponents eventually die and a new generation grows up that is familiar with it.” - Max Planck, Scientific Autobiography

About Kirk Gray

Kirk Gray, Founder and Principal Consultant of Accelerated Reliability Solutions, L.L.C., has over thirty-two years of experience in the electronics manufacturing industry. Mr. Gray began his career in electronics at the semiconductor level and followed the manufacturing process through systems level testing. As a field engineer for Accelerators Inc. and Veeco Instruments from 1977 to 1982, he installed and serviced helium mass spectrometers (leak detection), ion implantation systems, and many other thin-film, high vacuum systems used in semiconductor fabrication. As a Sales Engineer for Veeco Instruments and CVC from 1982 through 1986, he worked with semiconductor process engineers to solve thin-film application and etching process issues and equipment applications. As the Environmental Stress Screening (ESS) Process Engineering Manager in manufacturing test at Storage Technology from 1989 to 1992, he worked with Dr. Gregg K. Hobbs, the inventor of the terms and techniques of Highly Accelerated Life Test (HALT) and Highly Accelerated Stress Screening (HASS). In 1994 he formed AcceleRel Engineering, Inc., a consulting company. He led a wide variety of electronics companies, including bio-medical, telecommunications, power supply, and other electronic systems producers, to HALT and HASS methods and to rapidly improving the reliability of electronic and electromechanical hardware. From 2003 until 2010 Kirk was a Sr. Reliability Engineer at Dell, Inc., where he created new HALT-based test processes for desktop and portable computers and a HASA process required for all Dell power supply providers. He is a Senior Member of the IEEE, a charter member of the IEEE/CPMT Technical Committee on Accelerated Stress Testing and Reliability (ASTR), and the 2012 General Chair of the IEEE/CPMT Workshop on ASTR, to be held in Toronto, Canada in the fall of 2012. He is now Principal Consultant at Accelerated Reliability Solutions, L.L.C., dedicated to leading companies to rapid development of reliability in electronics and electromechanical systems. He is also a senior collaborator with the University of Maryland's CALCE consortium.
