Kirk Gray, Accelerated Reliability Solutions, L.L.C.
Many reliability engineers have discovered that HALT quickly finds the weaknesses and reliability risks in electronic and electromechanical systems, because thermal cycling and vibration create rapid mechanical fatigue in electronic assemblies. Assemblies that have latent defects such as cold or cracked solder joints, loose connectors or mechanical fasteners, or component package defects can be brought to a detectable, or patent, condition that we can observe and use to improve the robustness of an electronics system. Thermal cycling creates expansion and contraction, stressing interfaces between materials with mismatched thermal coefficients of expansion (TCE). Applying vibration to an assembly, especially the pneumatic repetitive shock of HALT chambers, creates very rapid mechanical fatigue. When Gregg Hobbs, Ph.D., PE created the HALT and HASS methods back in the 1980s, digital systems were not as prevalent and bus speeds were much slower than in today’s electronics. As signal speeds continue to increase and circuit features get smaller, HALT has a potentially significant new benefit for signal integrity (SI) and operational reliability during new product development.
Today’s electronics require bus timing to be resolved about ten times more finely than the time it takes light to bounce off your nose and hit your eye, which is about 85 picoseconds. As data bus speeds increase, effects in data transmission that were once second- and third-order are becoming dominant in SI issues. These new variables may be difficult, if not impossible, to model accurately. The continued decrease in metallization dimensions and the increase in bus frequencies will result in greater sensitivity to fabrication variations, so SI issues are likely to become more dominant in hardware reliability. Yet the effect of these developments on operational reliability may also be more difficult to find and reproduce before thousands or millions of units are sent to the field.
SI failures often result in marginal operational reliability or “soft failures”, where a system can be reset and then operates normally. Depending on the frequency of these operational failure events, the user may or may not tolerate their occurrence. When they are too frequent, intermittent operation may result in the system being returned to the manufacturer. The returned system may then be broken down and all of its subassemblies subjected to failure analysis. Once divided up, the subsystems tested will likely be declared “No Fault Found” (NFF), because the marginality may come only from the stack-up of parametric variations or from unique environmental conditions of the original system in its end-use environment. To modify an old adage, “If you cannot find what broke, you cannot fix it”, and so the cause of the marginality and the returns will continue. The result is a churn of “good” returned parts being sent out to replace other “good” parts. The returned parts may be sent to a repair depot to be used for repair or replacement. Those parts may or may not work in a different system, depending on that system’s stack-up, and it is likely the manufacturer will never come to know one of the potential real contributors to the high NFF rate. Of course, there are many other causes of NFF returns that are not necessarily related to hardware issues. If the issues come from SI and timing marginality, thermal stress to operational limits can be a very useful tool for discovering them before mass production.
We know that in mass manufacturing of anything there will be variation in any parameter that is measured. We know that some dimensional variations will occur during mass manufacturing of PWBs, although hopefully the variations are small. Dimensional variations in a PWB can affect impedance, crosstalk, noise, and EMI in the system. Dimensional expansion and contraction of the PWBA is, of course, what induces the thermo-mechanical fatigue damage during thermal cycling that has been a primary focus of HALT and HASS methodology, but the dimensional variations also affect SI quality. We know from the SPC teachings of Dr. W. Edwards Deming that reducing manufacturing variation is the path to making a defect-free product, and “six sigma” production capability is the goal. When we design and build a complex high-speed digital electronics system, we cannot necessarily know how the stack-up of all the real future variations in component manufacturing, circuit board fabrication, solder quality, and second sources of these will impact operational reliability. Yet we do know for sure that parametric variations will be created at all levels of assembly, and that their effect on operational reliability may only be discovered after large numbers of units are produced and sold.
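To make the idea of a stack-up concrete, here is a minimal sketch comparing two common ways of combining contributor tolerances: a worst-case sum and a root-sum-square (RSS) estimate. The contributor names and picosecond values are purely hypothetical, and the RSS figure assumes the contributors vary independently, which may not hold in a real system.

```python
import math

def worst_case_stack(tolerances_ps):
    """Worst-case stack-up: every contributor sits at its extreme at once."""
    return sum(abs(t) for t in tolerances_ps)

def rss_stack(tolerances_ps):
    """Root-sum-square stack-up, assuming independent contributors."""
    return math.sqrt(sum(t * t for t in tolerances_ps))

# Hypothetical timing contributors, in picoseconds (illustrative values only).
contributors = {
    "component propagation delay": 30.0,
    "PWB trace impedance and length": 40.0,
}

values = list(contributors.values())
print("worst-case stack-up (ps):", worst_case_stack(values))
print("RSS stack-up (ps):", rss_stack(values))
```

The gap between the two numbers is the region where only a small percentage of units will ever land, which is why a handful of early samples rarely reveals it.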
The challenge of finding marginal operation during early product development is illustrated graphically in figure 1. Early samples of a new electronics product are typically expensive and scarce, and all development teams want the limited samples. The graphic on the left side of figure 1 represents the parametric timing distribution found with a limited number of units. With so few units, parametric variation near the upper and lower limits would likely remain undiscovered before the product is released from development to mass manufacturing.
The graph on the right side illustrates the larger variation that may be found during mass manufacturing, and the higher probability that the stack-up of parametric variations falls near the operational limits and results in soft operational failures.
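A rough way to put numbers on the two graphs is to ask how likely it is that at least one unit in a build falls outside a given limit. The sketch below assumes the timing parameter is normally distributed and that the operational limits sit a chosen number of standard deviations from the mean. Both are illustrative assumptions, not measured data, and the sample sizes are arbitrary.

```python
import math

def tail_fraction(k_sigma):
    """Fraction of a normal population beyond +/- k_sigma from the mean."""
    return math.erfc(k_sigma / math.sqrt(2.0))

def prob_at_least_one(n_units, k_sigma):
    """Probability that at least one of n_units falls beyond +/- k_sigma."""
    p = tail_fraction(k_sigma)
    return 1.0 - (1.0 - p) ** n_units

def expected_marginal_units(n_units, k_sigma):
    """Expected count of units beyond +/- k_sigma in a build of n_units."""
    return n_units * tail_fraction(k_sigma)

# Hypothetical build sizes: a few development samples versus mass production.
for n in (10, 1000, 100000):
    print(n, prob_at_least_one(n, 3.0), expected_marginal_units(n, 3.0))
```

Under these assumptions, a ten-unit development build will very rarely contain a unit near the limits, while a large production run almost certainly will.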
The benefits of thermal stress in inducing mechanical fatigue to expose mechanical and material weaknesses are well established, but there is another aspect of thermal stimulation that may become more important in the future for assuring reliable operation of high-speed digital systems. A little-known fact to those who have not performed real thermal HALT on digital electronics is that it almost always ends with finding an operational limit only. It is very rare to find a thermal destruct level in digital systems such as IT hardware. Hot and cold thermal stress causes impedance shifts and signal propagation shifts in conductors and semiconductors, resulting in “skewing” of signals throughout the system. This is probably why thermal HALT on most digital systems results in finding an operational limit and not a destruct limit. At the thermal operational limits the SI fails and a lock-up or shutdown occurs, but the system can easily be reset when the stress is removed.
The graphic in figure 2 represents how, by stressing a small number of samples to their empirical thermal limits, we can skew the system’s signal propagation timings. Higher temperatures slow signals and cold increases signal speeds. By thermally stressing a small number of samples we can observe the hot and cold operating limits, and this can be repeated many times without causing catastrophic damage. Marginal operational reliability may be realized later, from the worst-case stack-up of parametric variations in a small percentage of products, when thousands or millions are produced. As manufacturing volumes ramp up, a wider distribution of parametric variations may then extend near or over the stable operational limit, as previously shown in the right graphic of figure 1.
Of course, stimulating timing variations with thermal stress on a system moves the parametric skew of all the components in the same direction, either slower or faster. In the larger mass manufacturing population, the lot-to-lot and second-source parametric variation of components is a mix of high-speed and low-speed distributions. The rapid thermal cycling stress found in HALT chambers helps discover more of this mixing of timing variations by differentially skewing timings across a PWBA. The differential skew is created by very fast air temperature transitions that produce thermal gradients across the PWBA. Low-mass components have higher thermal transition rates than larger-mass or high-wattage components, resulting in a mix of temperatures across a PWBA. An even more detailed understanding of the risk from variations in timing distributions could be gained by individually heating and cooling active components. Individual heating and cooling of components is also a good way to isolate a limiting component found during thermal HALT.
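The relationship between thermal skew and operating limits can be sketched with a deliberately simple model. It assumes a signal delay that changes linearly with temperature (slower when hot, faster when cold, as described above) and a fixed timing window inside which the system operates. The linear model, the temperature coefficient, and all numeric values are hypothetical and chosen only for illustration. Real behavior would have to be found empirically in the chamber.

```python
def delay_ps(nominal_ps, temp_c, tempco_per_c=0.001, ref_temp_c=25.0):
    """Assumed linear model: delay grows with temperature above ref_temp_c."""
    return nominal_ps * (1.0 + tempco_per_c * (temp_c - ref_temp_c))

def thermal_operating_limits(nominal_ps, min_ps, max_ps,
                             tempco_per_c=0.001, ref_temp_c=25.0):
    """Temperatures at which the modeled delay reaches the window edges.

    Returns (cold_limit_c, hot_limit_c) under the linear model above.
    """
    cold = ref_temp_c + (min_ps / nominal_ps - 1.0) / tempco_per_c
    hot = ref_temp_c + (max_ps / nominal_ps - 1.0) / tempco_per_c
    return cold, hot

# Hypothetical timing window of 900 to 1100 ps.
typical_unit = thermal_operating_limits(1000.0, 900.0, 1100.0)
slow_unit = thermal_operating_limits(1050.0, 900.0, 1100.0)
print("typical unit limits (C):", typical_unit)
print("slow unit limits (C):", slow_unit)
```

In this model a unit whose nominal delay is already on the slow side reaches the window edge at a much lower hot temperature, which is the sense in which a measured thermal operating margin can stand in for margin against parametric variation.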
Examples of the benefits of HALT techniques for finding software issues have been documented by Allied Telesis. Donovan Johnson and Ken Franks of Allied Telesis published a white paper several years ago on how the use of HALT has benefited their discovery of reliability issues due to software. In the paper they give examples of significantly increasing thermal operational margins and limits through software changes alone. The paper, “Software Fault Isolation using HALT and HASS”, is well worth downloading and reading. Most companies have not realized that thermal HALT has so much potential for rapid discovery of operational issues, not just catastrophic hardware failures.
The benefits of HALT for finding mechanical issues in electronics assemblies have been well established over the last several decades. As the speed and density of electronics continue to increase, operational reliability may become more sensitive to manufacturing variations that produce parametric variations, leading to marginal SI and marginal operational reliability. Along with the traditional, established benefits of HALT, there is a growing benefit in using thermal HALT to find how the parametric variations that will ultimately occur in mass manufacturing over time may affect operational reliability.
2 thoughts on “Why Parametric Variation Can Lead to Failures and HALT Can Help”
My head is not that big. Light travels about 1 foot in 1 nanosecond.
Mark, Thanks for catching that error. I have corrected it. It should have been 85 picoseconds.