Have you ever wondered why we use the assumption of a constant failure rate? Or considered why we assume our system is ‘in the flat part of the curve [bathtub curve]’?
Where did this silliness first arise?
In part, I lay blame on MIL-HDBK-217 and parts count prediction practices. Yet there is theoretical support for the notion that, for large, complex systems, the overall system time to failure will approach an exponential distribution.
Thanks go to Wally Tubell Jr., a professor of systems engineering and test. He recently sent me his analysis of Drenick’s theorem and its connection to the notion of a flat section of a bathtub curve.
Wally did a little research and found the theorem lacking for practical use. I agree and will explain below.
What is Drenick’s Theorem?
Dr. Kececioglu devotes chapter 13 of Reliability Engineering Handbook, vol. 2, to Drenick’s theorem, which he describes as a ‘limit law of the time-to-failure distribution of a complex system’.
Here is Dr. Kececioglu’s definition of the theorem:
Consider a complex system with n units connected in series reliability wise and each unit has its own pattern of malfunction and replacement. Further assume that (1) the components are independent, (2) every unit failure causes system failure, and that (3) a failed unit is replaced immediately with a new one of the same kind. Then, under some reasonable general conditions, the distribution of the time between failures of the whole system tends to the exponential as the complexity and the time of operation increase.
R. F. Drenick in his 1960 paper, “The Failure Law of Complex Equipment”, opens the paper with a brief explanation of the ‘law’.
In theoretical studies of equipment reliability, one is often concerned with systems consisting of many components, each subject to an individual pattern of malfunction and replacement, and all parts together making up the failure pattern of the equipment as a whole. The present note is concerned with that overall pattern and more particularly with the fact that it grows the more simple, statistically speaking, the more complex the equipment. Under some reasonably general conditions, the distribution of the time between failures tends to the exponential as the complexity and the time of operation increases; and somewhat less generally, so does the time up to the first failure of the equipment.
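A small simulation can make the limit behavior described in the two statements above concrete. The following is a minimal sketch, not Drenick's or Kececioglu's method. It assumes every unit has the same Weibull life distribution (shape 3, a clear wear-out pattern, with a scale of 100 hours), replaces each failed unit instantly, pools the failure times of all units, and reports the coefficient of variation (CV) of the time between system failures. An exponential distribution has a CV of exactly 1, while a single shape-3 Weibull has a CV of about 0.36. The function names and all parameter values are illustrative assumptions.

```python
import random
import statistics

def system_gaps(n_units, shape, scale, horizon, warmup, seed=1):
    """Pool the failure times of n_units independent renewal processes and
    return the gaps between consecutive system failures after the warmup."""
    rng = random.Random(seed)
    events = []
    for _ in range(n_units):
        t = 0.0
        while True:
            # Life of the current unit; a new unit is installed instantly.
            t += rng.weibullvariate(scale, shape)
            if t > horizon:
                break
            if t > warmup:  # skip the start-up period when all units are new
                events.append(t)
    events.sort()
    return [b - a for a, b in zip(events, events[1:])]

def coeff_of_variation(xs):
    """Standard deviation divided by the mean; 1.0 for an exponential."""
    return statistics.pstdev(xs) / statistics.fmean(xs)

for n in (1, 5, 50):
    gaps = system_gaps(n, shape=3.0, scale=100.0, horizon=20000.0, warmup=2000.0)
    print(n, "units: CV of time between failures =",
          round(coeff_of_variation(gaps), 2))
```

Under these idealized assumptions, the printed CV should climb toward 1 as the number of units grows, even though no individual unit is anywhere near exponential. The warmup period matters too, which echoes the "time of operation" condition in both statements of the theorem.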
Cautions from Drenick Concerning the Theorem
In section 6 of Drenick’s paper, he provides a few comments. These comments highlight the limitations for the use of the work, plus outline potential extensions.
On the assumption that components within the system are statistically independent, Drenick mentions “this assumption is debatable.” I agree: it is rare for the failure rate of an individual component to be unaffected by the behavior of other components or by the immediate environment and use conditions.
The theorem also assumes that it “makes good sense to lump the failures of many, presumably dissimilar, devices into one collection pattern.” It focuses on the overall system failure pattern, which is made up of many different failure mechanisms within the many components. Drenick states, “The fact is that this may sometimes be quite inappropriate.” He goes on to explain that the failure of some components may be rather inconsequential while other component failures may be catastrophic. Not all failures have the same impact on the system.
Drenick cautions that this “simple and comprehensive statement of a complex state of affairs…. can also be interpreted.” In particular, he cautions against assuming that one needs only the means (MTBF) of the components. I agree with Drenick that we need more information to make meaningful decisions concerning the design, sourcing, and maintenance of a system.
Other Concerns About the Theorem
From the email exchange with Wally, here are a few more ‘issues’ with the practical use of the theorem.
- Assuming all components are in series and any component failure causes a system failure. This is rarely true, even in a simple circuit. Decoupling capacitors may fail or even just fall off and the circuit may operate just fine. Other designs deliberately include components in parallel such that a failure of one component may only degrade performance, not cause a system failure. Also, for large, complex systems, there are likely multiple levels of redundancy, safeguards, and supporting subsystems. A combat vehicle often carries multiple communication systems such that the loss of one radio may only degrade the vehicle’s ability to accomplish its mission.
- Assuming failed components are replaced immediately. This is in part due to the theorem’s reliance on renewal process theory. In practice, it may take time to identify failed components and execute a replacement. Some replacements are quick if the diagnostics are effective and an appropriate set of spares is readily available.
- Assuming replacement parts are identical (albeit new) components. We use refurbished, upgraded, or otherwise similar parts, not just identical replacements.
- Assuming just the failed component is replaced, leaving all other components in place. We replace subsystems which may have dozens to thousands of components. Thus, we replace aged but not yet failed components, robbing the system of other potential imminent failures.
- Assuming there are no components that fail significantly more or less often than other components. There is an allowance for different failure patterns, yet the theorem doesn’t work if one or two components contribute the majority of system failures. Recall Pareto and the notion that a few components will cause the most failures.
- Assuming the failures occur independently of use or environmental conditions. Sure, some components fail more often due to storage stress, while others fail due to use. Yet the assumption of statistical independence includes the notion that the chance of failure does not change for a set of components when in storage or in use.
- Assuming the system level failure pattern is useful. To me, Fred, this is purely an academic exercise and not useful for any practical decision making or modeling. If I need to estimate the cost of operation, cost of spares, availability, etc., I need more than a system MTBF value to make a meaningful decision.
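The concern in the list above about one or two components dominating the failure count can be illustrated with a rough simulation. This is only a sketch under idealized assumptions: a series system, instant like-for-like replacement, and Weibull lives with shape 3. It compares a system of 21 similar units with one in which a single unit fails far more often than the rest. The function name, the scales, and the other parameter values are hypothetical. The coefficient of variation (CV) of the time between system failures is used as a crude check, since an exponential distribution has a CV of 1.

```python
import random
import statistics

def pooled_gap_cv(scales, shape=3.0, horizon=20000.0, warmup=2000.0, seed=7):
    """CV of the time between system failures for a series system whose
    units have Weibull lives with the given scales (hypothetical values)."""
    rng = random.Random(seed)
    events = []
    for scale in scales:
        t = 0.0
        while True:
            # Each failed unit is replaced instantly by an identical new one.
            t += rng.weibullvariate(scale, shape)
            if t > horizon:
                break
            if t > warmup:  # skip the start-up period when all units are new
                events.append(t)
    events.sort()
    gaps = [b - a for a, b in zip(events, events[1:])]
    return statistics.pstdev(gaps) / statistics.fmean(gaps)

balanced = [100.0] * 21            # 21 similar units
dominated = [2.0] + [100.0] * 20   # one unit fails about 50 times more often

print("balanced :", round(pooled_gap_cv(balanced), 2))
print("dominated:", round(pooled_gap_cv(dominated), 2))
```

In this sketch the balanced system should show a CV close to 1, while the dominated system should stay well below 1 because its failure pattern largely mirrors the wear-out behavior of the one troublesome unit. Adding more units does not rescue the exponential assumption when a few of them cause most of the failures.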
A Most Grave Misuse of Drenick’s Theorem
Assuming it is true for your system and further assuming your system is in the ‘flat part of the curve’. This line of reasoning leads to the baseless assumption that every component follows the exponential distribution as well.
Most of you, kind readers of the NoMTBF blog, know this to be untrue, yet not everyone is as enlightened.
All too often we hear, along with a wave of a hand, that we’re in the flat part of the curve, or that this is a large complex system so we can assume exponential… cringe.
Drenick’s Theorem does not justify the use of MTBF. Drenick mentions that we need more than just the mean life to do our work as reliability engineers.
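A quick calculation shows why the mean alone is not enough. The sketch below compares three Weibull life distributions that share the same mean life and differ only in shape. The 1,000-hour mean, the 200-hour mission time, and the function names are hypothetical values chosen for illustration.

```python
import math

def weibull_scale_for_mean(mean_life, shape):
    """Weibull scale parameter that yields the requested mean life."""
    return mean_life / math.gamma(1.0 + 1.0 / shape)

def weibull_reliability(t, shape, scale):
    """Probability of surviving to time t."""
    return math.exp(-((t / scale) ** shape))

mtbf = 1000.0  # hours, the same mean life for all three cases (hypothetical)
t = 200.0      # hours, mission time of interest (hypothetical)

# Shape 1 is the exponential, 3 is wear-out, 0.5 is early-life failures.
for label, shape in (("exponential", 1.0), ("wear-out", 3.0), ("early-life", 0.5)):
    scale = weibull_scale_for_mean(mtbf, shape)
    print(label, round(weibull_reliability(t, shape, scale), 3))
```

All three cases have the same MTBF, yet the chance of surviving the first 200 hours is roughly 0.82, 0.99, and 0.53 respectively. Decisions about spares, warranty, or maintenance would differ a great deal across those three cases.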
How have you seen this theorem misused? Add your thoughts to the comments section below.
Drenick, R. F. 1960. The failure law of complex equipment. Journal of the Society for Industrial and Applied Mathematics 8 (4): 680-690.
Kececioglu, Dimitri. 1991. Reliability Engineering Handbook, Vol. 2. Englewood Cliffs, NJ: Prentice-Hall.
Email exchange with Wallace Tubell, October 30, 2017