A note from Scott – providing feedback on the NoMTBF site.
Hi Fred,
Your website has generated quite a bit of valid conversation about MTBF. I applaud you for that. Honestly, though, I have mixed feelings about some of what you present and thought I’d write this lengthy e-mail to provide some feedback. I hope you take this in the right light, as constructive criticism from someone who, overall, appreciates your efforts.
Clarifications
Let me start with a point I disagree with. In your opening slide show “Thinking about MTBF” I think the “Common Confusion” slide could be better presented. Many viewers would interpret that slide to say that the MTBF is not the mean. Of course MTBF is the mean. Your point is that, while it is the mean, the distribution is not Gaussian. Fair enough. Funny thing is I’ve actually had quality engineers try and tell me the MTBF is not the mean of the distribution and I’m afraid your slide may perpetuate that misunderstanding.
In the same vein, later in the talk, and in the other sections on your site, you seem to indicate that the MTBF is not the expected value (see Perils: “I heard one design team manager explain MTBF as the time to expect from one failure to the next.”). Of course the MTBF is the expected value, in the pure mathematical sense (as you discuss earlier in this section). So I’m confused about your point here. I guess you are commenting on the layman’s feeling for “expected” value. Which leads me to my next section.
Lack Of Understanding of Statistics
It almost appears that one of the premises of NoMTBF is that many people do not understand statistics and therefore we should not confuse them by using MTBF. I disagree with this. For example, many people don’t understand the difference between median and mean but no one is suggesting we remove those terms. Similarly, the fact that many people incorrectly assume a Gaussian distribution when they hear the term mean is hardly justification for removing the term MTBF. The problem is education, not the definition. The same point holds for expectation: that the average is some value does not imply all samples will be equal to that value. Anyone who thinks that, in my opinion, needs more education in statistics, and we shouldn’t try to “simplify” to account for lack of education.
Constant Failure Rate
I don’t really accept your implication that using MTBF implies constant failure rate. The proper definition is the integral form you present in a number of spots but I agree that many tie these two together. I think one of the themes of your website is that the constant failure rate assumption is not valid. In that, I’m in 100% agreement and applaud your efforts. (I guess the site name would not have the same panache if it was called NoConstantFailureRate). Clearly the constant failure rate model often does not apply and reducing all of reliability to one number is a gross simplification.
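A minimal sketch of the integral definition mentioned above, under stated assumptions: the life distribution is taken to be a two-parameter Weibull (an assumption chosen for illustration, not something the site prescribes), and the names `beta` (shape), `eta` (scale), and all function names are hypothetical. It integrates the reliability function numerically to get the mean life and checks the result against the standard closed form `eta * Gamma(1 + 1/beta)`. It also shows that two distributions with very different shapes can be compared on the same footing without assuming a constant failure rate.

```python
import math


def weibull_reliability(t, beta, eta):
    """Fraction of units surviving to time t for a two-parameter Weibull."""
    return math.exp(-((t / eta) ** beta))


def mean_life_by_integration(beta, eta, steps=100_000, upper_multiple=20.0):
    """Mean life as the integral of R(t) from 0 to infinity (trapezoid rule).

    The upper limit is truncated at upper_multiple * eta, which is adequate
    for beta >= 1; heavier-tailed cases (beta < 1) need a larger limit.
    """
    upper = upper_multiple * eta
    dt = upper / steps
    total = 0.5 * (weibull_reliability(0.0, beta, eta)
                   + weibull_reliability(upper, beta, eta))
    for i in range(1, steps):
        total += weibull_reliability(i * dt, beta, eta)
    return total * dt


def mean_life_closed_form(beta, eta):
    """Standard closed form for the Weibull mean."""
    return eta * math.gamma(1.0 + 1.0 / beta)


if __name__ == "__main__":
    eta = 1000.0  # hypothetical characteristic life in hours
    for beta in (1.0, 2.0):  # beta = 1 is the constant failure rate case
        mean = mean_life_by_integration(beta, eta)
        surviving = weibull_reliability(mean, beta, eta)
        print(f"beta={beta}: mean life={mean:.1f} h, "
              f"fraction surviving at the mean={surviving:.3f}")
```

The fraction surviving at the mean differs between the two shapes, which is one way to see why a single mean value does not describe reliability over time.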
Leadership
So where should people go instead? Just bashing something is not a solution. Your website really has had an impact but, in a strange way, sometimes it has had the opposite impact from what I think we would both like. I’ve had quality managers who did not want to gather the data on field failures with, in part, the justification that MTBF is bogus statistics. OK, MTBF is not perfect, but I’m sure we agree that the way to improve reliability is to gather data as a first step.
You have quite a following and, personally, I’d like to see you lead more. Yes, MTBF is a simplification, but I also don’t expect to pick up a data sheet and see a physics of failure paper stapled to the back of it or a chart of reliability over time. Fact is many complex things get reduced to a few key numbers (e.g. horsepower, MPG, 0 to 60 time for a car). I think your Actions/Alternative Metric is addressing this. Stating a reliability percentage over a time interval is an intriguing alternative. I like it. If that is your alternative then, personally, I’d like to see it more clearly emphasized across the site. I’d also like to see you develop it more. How does one determine reliability % and duration from the Weibull parameters? How would one put together a reliability block diagram and estimate overall reliability if subcomponents were specified in this manner? I don’t know the answer to these questions and I’d be interested in reading more.
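One possible way to approach the two questions above, offered as a sketch rather than as the site's method. It assumes a two-parameter Weibull with hypothetical shape `beta` and scale `eta`, and it assumes that components in a block diagram fail independently and are all rated over the same duration. Every function name and numeric value here is made up for illustration.

```python
import math


def reliability_at(t, beta, eta):
    """Reliability (fraction surviving) at duration t: R(t) = exp(-(t/eta)^beta)."""
    return math.exp(-((t / eta) ** beta))


def duration_for_reliability(target, beta, eta):
    """Duration over which reliability stays at or above `target` (0 < target < 1).

    Obtained by inverting R(t): t = eta * (-ln(target))^(1/beta).
    """
    if not 0.0 < target < 1.0:
        raise ValueError("target reliability must be strictly between 0 and 1")
    return eta * (-math.log(target)) ** (1.0 / beta)


def series_reliability(reliabilities):
    """Series blocks: the system works only if every block works."""
    return math.prod(reliabilities)


def parallel_reliability(reliabilities):
    """Parallel (redundant) blocks: the system fails only if every block fails."""
    return 1.0 - math.prod(1.0 - r for r in reliabilities)


if __name__ == "__main__":
    # Hypothetical component: wear-out behavior (beta > 1), eta in hours.
    beta, eta = 1.5, 20_000.0
    print("R(5000 h) =", reliability_at(5_000.0, beta, eta))
    print("hours at 95% reliability =", duration_for_reliability(0.95, beta, eta))

    # Hypothetical system: a controller in series with two redundant fans,
    # each specified as a reliability over the same 2-year duration.
    controller = 0.98
    fans = parallel_reliability([0.90, 0.90])
    print("system reliability over 2 years =", series_reliability([controller, fans]))
```

If the subcomponents are specified over different durations, each one would first need to be converted to a common duration, which requires knowing (or assuming) its life distribution.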
As I stated in the beginning, I hope you take this in the right light. While obviously I don’t agree with everything on your site you have many extremely valid points and you are doing a great job stimulating discussion. Thanks for your efforts.
Scott Diamond
Vice President of Quality and Customer Excellence
Surveillance Group
FLIR Systems Inc.
— Ed note:
Thanks Scott for the insightful and meaningful feedback – I will be making some adjustments and improvements. Thanks for the careful reading and for taking time to provide your suggestions and comments. Very much appreciated. Fred
Scott,
I completely agree that most people don’t understand statistics. Most people don’t understand what MTBF is.
Strictly speaking, MTBF **only** applies when a constant hazard rate applies, and does not apply to other distributions. Many people believe that you can compute MTBF by computing the number of unit-hours in some interval and dividing by the number of failures during that interval. This procedure is only ever an approximation.
In the constant hazard rate model, the MTBF is the time at which 36.8% of units survive [i.e., R(t) = exp(-t / MTBF), which equals exp(-1) ≈ 0.368 at t = MTBF]. Similarly, the Weibull and Extreme Value distributions have a characteristic life at the same reliability. As with MTBF, you don’t find the characteristic life by dividing unit-hours by the number of failures. The Normal (Gaussian) distribution has a measure of central tendency that could be used as a typical or average life, and that is the 50th percentile of survival.
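The percentages in this paragraph can be checked numerically. The sketch below is illustrative only: the parameter values are hypothetical, and the function names are not from any reliability library. It uses the standard exponential and Weibull survival functions plus Python's `statistics.NormalDist`.

```python
import math
from statistics import NormalDist


def exponential_survival(t, mtbf):
    """Constant hazard rate model: R(t) = exp(-t / MTBF)."""
    return math.exp(-t / mtbf)


def exponential_median_life(mtbf):
    """Time at which half the units have failed under a constant hazard rate."""
    return mtbf * math.log(2.0)


def weibull_survival(t, beta, eta):
    """Two-parameter Weibull: R(t) = exp(-(t / eta)^beta)."""
    return math.exp(-((t / eta) ** beta))


if __name__ == "__main__":
    mtbf = 10_000.0  # hypothetical, in hours
    print("exponential: surviving at t = MTBF:", exponential_survival(mtbf, mtbf))
    print("exponential: median life (h):", exponential_median_life(mtbf))

    eta = 5_000.0  # hypothetical characteristic life, in hours
    for beta in (0.5, 1.0, 3.0):
        # The surviving fraction at t = eta does not depend on the shape beta.
        print(f"Weibull beta={beta}: surviving at t = eta:",
              weibull_survival(eta, beta, eta))

    # Hypothetical Normal life distribution: the mean is the 50% survival point.
    life = NormalDist(mu=8_000.0, sigma=1_000.0)
    print("Normal: surviving at the mean:", 1.0 - life.cdf(8_000.0))
```

Under the constant hazard rate model the median life is about 69% of the MTBF, so well over half the units are expected to fail before the MTBF is reached.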
There is no mystery as to how to compute reliability. Much of the necessary guidance and math is available at the NIST web site: http://www.itl.nist.gov/div898/handbook/ or in Wayne Nelson’s very fine book “Life Data Analysis.”
I share Fred’s frustration. MTBF, though often leading to ineffective reliability engineering decisions, has become a kind of standard because the math is easy and several common approximations can be applied. As Charlie Munger often says, “If your only tool is a hammer, then every problem looks like a nail.”
The reality is that we need (i.e., it’s not just a good idea, we actually require) multiple models to cope with real world situations. We need effective ways of thinking about early life failures and their causes. Likewise, we need effective models for useful life failure modes as well as end of life failure modes. We need various ways of modeling how failure modes develop over time. Sometimes this will be empirical; other times we may have physical models (e.g., vibration, fatigue, or corrosion) to help guide us. Test and field return data is almost always useful, especially when there’s some analysis of the failures that occur. We need cost of reliability models. We sometimes even need qualitative models. But if an organization (or an individual engineer) only has a constant failure rate model, then it is absolutely certain that reality will be distorted so as to fit the model, and it is a good bet that ineffective decisions will be made.
All of this requires a couple things from managers. First, you have to be willing to pay for quality and reliability processes that work. And you have to be willing to pay for this even when you are successful at avoiding failures. It’s easy to justify the cost of solving a problem: you can point to the cost of the problem, and the solution becomes a savings. When processes are effective enough that problems are prevented, quality and reliability engineers look like a cost and the benefit in avoided cost doesn’t show up in reduction in warranty cost or similar accounting measures.
Managers also have to be willing to accept the notion that dashboards, while useful, are summaries, and don’t tell the whole story. Euclid told Ptolemy that there is no royal road to geometry. Likewise, there’s no shortcut to understanding how to make a design robust against a failure mode or how to structure a warranty or how to devise a service level agreement. The other issue with dashboards is that they only tell you what, and not why. Your speedometer tells you how fast you are going, but not why you are going that fast. And the gear you are in tells you more about how likely you are to have an accident (being in reverse is when most accidents occur) than your speed (which tells you that if you have an accident while going fast, then the people involved are more likely to be injured or killed). Metrics have to be correlated to your business, and this falls out of the approach to modeling discussed above.
This is why I support Fred’s views. Using only one model is a setup for an unhappy ending. Putting more than one tool in the box requires both effort and expense.
Good analysis and reaction.
The words about NoConstantFailureRate are fine!
Thanks for the comments – ever since the first discussion with peers on issues around MTBF, it’s been clear that we, as a community of reliability professionals, needed to do something about the use of MTBF.
Education, awareness, involvement, updated standards, alternatives that work, connection to value, etc., all matter, and each of us can help make a difference. Defaulting to the use of MTBF is a cop-out – you know better, so stand up and use the appropriate metrics, explain why using MTBF is not useful, and help those around you make better decisions.
Cheers,
Fred
PS: I do appreciate the feedback and insights – it will help to make this a much better site.