Please don’t remove MTBF, part 2
This note is the second part of my response to a forum entry by HL concerning two arguments he is attempting to refute. Of course, my arguments for the eradication of MTBF may stir up some resistance. My plea to use a better approach may challenge the status quo or ruffle a few feathers. So be it. That is expected.
Change has to happen and airing contrary opinions and arguments is part of the process. If you see a flaw in my rationale (some may say I’m not rational – which is different) then please let me and others know. Let’s sort out the best path forward, to increase our professional ability to talk about and measure reliability.
“MTBF is often misused”
The basic argument that MTBF is often misused led me to start this campaign to eradicate the use of MTBF. It was one too many vendors claiming MTBF and meaning a failure-free period of time. It was one too many colleagues exasperated at trying to explain MTBF to a design engineer. It was one too many million-dollar mistakes made due to the use of MTBF. I had enough. Something had to be done.
Imagine you’re about to start a technical talk to peers at a reliability conference and a good friend and colleague asks, “What works to quickly explain what MTBF is and is not?” Imagine that you know a few in the audience that have shared similar frustrations with you. Imagine that you then ask, out of curiosity, how many of the 150 in attendance have explained MTBF to someone due to some misunderstanding of what it is or is not. What would you do when every single person in the room raises their hand?
In that situation I ditched my prepared presentation and instead led a discussion on what could go wrong when someone uses MTBF unwittingly and without knowledge of the underlying assumptions. That hour went by very quickly and was probably the best presentation I’ve made at a conference. It is from that experience that I began my campaign to eradicate the use of MTBF from our profession.
A short time later I took the notes gathered during that discussion and wrote the Perils essay outlining the many ways MTBF is often mistaken for being something it is not.
The HL argument
HL begins the refutation of my argument by agreeing that MTBF is often misused. While that is a great opening statement, and one debaters employ to set up a contrary point, HL proceeds to claim that MTBF is what it is and that the misunderstandings are not due to the nature of the metric. Rather, it is the people who lack knowledge of the proper use of MTBF.
I do blame the metric, and in the Perils article I break down the words and their common meanings along with how the name itself leads to much of the MTBF confusion. I’ll not repeat it here other than to ask you if the mean for a distribution has to be at the 50th percentile? (It doesn’t, and it isn’t for the exponential distribution.)
The mean is just the mean and when commonly understood it is useful. Yet, most professionals in and about our world have enjoyed a class or two in probability & statistics. Many hope to not repeat that experience. Unfortunately what most recall from those statistics classes is rooted in the study of the normal distribution. My argument is that this widespread education leaves a widespread belief that a mean or average sits at the 50th percentile, which, for the distributions used to model life data, is simply not true.
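To put a number on that, here is a minimal sketch in Python, assuming an exponential life distribution (the one MTBF quietly implies). The 1,000-hour MTBF is a made-up value for illustration, and the function names are mine, not from any particular library.

```python
import math

def exponential_cdf(t, mtbf):
    """Fraction of units expected to have failed by time t,
    assuming an exponential life distribution with mean = mtbf."""
    return 1.0 - math.exp(-t / mtbf)

def exponential_median(mtbf):
    """Time by which half the units are expected to have failed."""
    return mtbf * math.log(2.0)

mtbf = 1000.0  # hypothetical MTBF in hours, for illustration only

fraction_failed_at_mtbf = exponential_cdf(mtbf, mtbf)
median_life = exponential_median(mtbf)

print(f"Fraction failed by t = MTBF: {fraction_failed_at_mtbf:.3f}")
print(f"Median life (50th percentile): {median_life:.1f} hours")
```

Under that assumption the mean falls at roughly the 63rd percentile, not the 50th, and half the units are expected to have failed well before the MTBF is reached.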
Any metric we use should be obvious, understood, meaningful and accurate enough for the task or decision. MTBF fails on all counts, over and over.
HL also postulates that we misuse the mean value for other summaries of data, like the 400 richest people on the Forbes list, then use it for a mortgage calculation, thus we should protest the use of mean values by Forbes. He also suggests that the average is just a use of statistics and since many do not understand statistics, we should thus campaign to eliminate statistics (I just heard a few rallying cheers for that idea). He also suggests that much of our science is based on inaccurate measures, therefore science is flawed and should be stopped.
What a Luddite.
In the measure of the central tendency of a dataset we are taught three measures: mean, median, and mode. Mode is not commonly used, as it is often not meaningful. It is still there, though, and we can calculate the mode for a dataset, although it rarely will be useful when faced with a decision based on the data.
Statistics is a way to discuss, measure, and understand the variation that occurs. It is a method to permit us to sample a population and make informed decisions. With just a little understanding of basic statistics and the occasional assistance of a professional statistician (or reliability engineer, since we should have the working professional knowledge of reliability statistics) we enable better use of data and better decisions.
In science we continue to learn and ask questions. We use measures for describing various dimensions. Reliability, or the probability of survival over a period of time, is one of these measures. MTBF is not the only way to describe this measure. MTBF is one of the more information-poor, insufficient, inaccurate, and misunderstood measures for reliability that exists. In the science of reliability engineering related to how to describe reliability we have many other measures that are easy to use, easy to calculate, and that are also more accurate and informative than MTBF. The Weibull distribution has only been around for about 60 years and has been trivial to calculate for about 30 years. Let’s all stop using MTBF because it is easy, simple, or common, and avoid the mistakes, errors, and loss due to not communicating clearly.
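As a rough illustration of how little the mean alone tells you, here is a small Python sketch that holds the mean life fixed and varies only the Weibull shape parameter. The 1,000-hour mean, the 200-hour mission time, and the three beta values are assumptions picked for the example, not data from any real product.

```python
import math

def weibull_eta_for_mean(mean_life, beta):
    """Scale parameter (eta) that gives a two-parameter Weibull
    distribution the requested mean life for a given shape (beta)."""
    return mean_life / math.gamma(1.0 + 1.0 / beta)

def weibull_reliability(t, beta, eta):
    """Probability of surviving to time t under a two-parameter Weibull."""
    return math.exp(-((t / eta) ** beta))

mean_life = 1000.0    # hypothetical mean life ("MTBF") in hours
mission_time = 200.0  # hypothetical period of interest in hours

reliability_by_beta = {}
for beta in (0.5, 1.0, 3.0):  # early-life, constant-rate, wear-out shapes
    eta = weibull_eta_for_mean(mean_life, beta)
    reliability_by_beta[beta] = weibull_reliability(mission_time, beta, eta)
    print(f"beta = {beta}: R({mission_time:.0f} h) = {reliability_by_beta[beta]:.3f}")
```

All three cases report the same mean life, yet the chance of surviving the first 200 hours ranges from a little over one half to better than 99 percent. The shape parameter carries the information the mean leaves out.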
Were you in that conference room when I asked about your experience with MTBF? What are your issues with MTBF and how it is used? What have you experienced as the worst misunderstanding of MTBF?
4 thoughts on “Please don’t remove MTBF, part 2”
Good point about the mean being only one measure of central tendency. A measure of central tendency is a single value that attempts to describe a set of data by identifying the central position within that set of data. The mean is particularly susceptible to the influence of outliers.
With data from a symmetrical distribution such as the Gaussian, the mean is an appropriate statistic (and is the same as the median and mode). With skewed distributions, we run into a similar problem as the skewed data “drags” the mean away from the most typical value. Those professional statisticians you refer to would likely recommend using the median as a measure of central tendency for skewed distributions.
It would be rare to find Gaussian distributed survival data. Thus, it would be rare that the mean is a good statistic as a measure of central tendency for survival data. It may not be a good statistic for Forbes’ 400 richest either, but what risk-informed decision is anyone trying to make for which the Forbes’ 400 richest is important?
The above states, “It would be rare to find Gaussian distributed survival data.”
Unless the component is a bearing, or a thousand other moving parts all around us. Or it is an incandescent light bulb, or a rechargeable battery, or a connector.
Here I tend to disagree. Bearings were, if I’m not mistaken, the source data for Weibull in his paper on his famous distribution, with a beta of about 2 and not the beta of about 3.4 where the Weibull distribution mimics the Gaussian.
Bill Meeker reported that students did a study of incandescent light bulbs and found they did have a normal life distribution. Yet, in my experience very few moving parts show a normal distribution, especially when governed by a dominant wear mechanism.
My experience is similar to Fred’s in that Gaussian distributed survival data is rare. If you have some data sets you could share, especially if you could share details about the source of the data, I’d be interested in “playing around” with them. Always looking to add to my repertoire of knowledge about how hardware behaves in different applications, etc.
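For anyone who wants to check the beta of about 3.4 remark, here is a quick Python sketch comparing the Weibull mean and median at a few shape values. It looks at only one symptom of symmetry (mean close to median), so treat it as a rough check rather than a test of normality. The beta values and the unit scale parameter are choices made for the example.

```python
import math

def weibull_mean(beta, eta=1.0):
    """Mean of a two-parameter Weibull distribution."""
    return eta * math.gamma(1.0 + 1.0 / beta)

def weibull_median(beta, eta=1.0):
    """Median (50th percentile) of a two-parameter Weibull distribution."""
    return eta * math.log(2.0) ** (1.0 / beta)

# Ratio of mean to median; a value near 1 suggests a roughly symmetric shape.
mean_to_median = {}
for beta in (1.0, 2.0, 3.44):
    mean_to_median[beta] = weibull_mean(beta) / weibull_median(beta)
    print(f"beta = {beta}: mean / median = {mean_to_median[beta]:.4f}")
```

At a beta of 1 the mean sits well above the median, at a beta of 2 the gap is still several percent, and near 3.4 the two nearly coincide.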