Illuminating MTBF’s Lack of Information
Here’s a simple illustration of how MTBF oversimplifies data, concealing essential information.
By convention, we tend to use MTBF for repairable data. That is fine.
You may also be aware of my dislike for MTBF, for many different reasons. If you find yourself urging your organization, customer, industry, or anyone else to stop using MTBF, you may want to use this simple example to illustrate the ‘value’ of MTBF.
Three Data Sets of Collected Time of Repair Data
Let’s say we have three assets on the shop floor that have each been running for 1,000 hours. Each has experienced 10 failures requiring repair. The repair time is typically less than an hour (short compared to run time, which keeps the analysis simple).
Machine 1 experienced the failures at the following hours of operation:
The times are in hours since the equipment was installed. The first failure occurred 112 hours after installation. The second occurred 198 hours after installation, and so on.
Machine 2 experienced the failures at the following hours of operation:
Machine 3 experienced the failures at the following hours of operation:
Given this data, what would you typically do to glean a better understanding of your equipment?
Just reviewing the data, you can detect the differences between the three machines. Given the differences, you may adjust your maintenance program, or work to determine why the differences exist.
A Simple DotPlot View
One way to view the data is with a one-dimensional plot. The dot plot provides the location of each failure along the timeline. Here is machine 1’s dot plot:
Plus the plots for machines 2 and 3:
This provides a little more visibility over the table of numbers. Machine 1 seems to have evenly spaced failures. Machine 2 has more failures as the equipment ages (like my car did when I was in high school). And Machine 3 seems to be running longer between failures as it ages.
These simple plots reduce the work necessary when just viewing a table of numbers. They further illustrate the differences in the datasets.
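One hedged way to put numbers on what the dot plots show is to look at the gaps between failures along with a simple trend statistic such as the Laplace test. The Python sketch below assumes a 1,000-hour observation window per machine. The failure times are invented to mimic the three patterns described here (they are not the original tables), and `gaps` and `laplace_u` are hypothetical helper names.

```python
import math

# ASSUMPTION: illustrative failure times (hours since installation),
# invented to mimic the three patterns described in the text.
# They are NOT the original data tables.
OBSERVED_HOURS = 1000.0
failure_times = {
    "Machine 1": [112, 198, 287, 377, 465, 551, 642, 730, 819, 905],  # roughly even
    "Machine 2": [112, 345, 520, 662, 770, 851, 912, 953, 979, 995],  # gaps shrinking
    "Machine 3": [112, 130, 155, 195, 250, 330, 440, 580, 750, 950],  # gaps growing
}


def gaps(times):
    """Hours between consecutive failures; the first gap starts at installation."""
    previous = [0.0] + list(times[:-1])
    return [t - p for t, p in zip(times, previous)]


def laplace_u(times, observed_hours):
    """Laplace trend statistic for a time-truncated observation window.

    Near zero: no clear trend. Positive: failures bunch up late
    (deteriorating). Negative: failures bunch up early (improving).
    """
    n = len(times)
    mean_time = sum(times) / n
    return (mean_time - observed_hours / 2) / (observed_hours * math.sqrt(1 / (12 * n)))


for name, times in failure_times.items():
    u = laplace_u(times, OBSERVED_HOURS)
    print(f"{name}: gaps={gaps(times)} Laplace U={u:.2f}")
```

Under the usual normal approximation, a statistic beyond roughly ±1.96 hints at a trend at about the 95% level. With only 10 failures per machine, treat it as a rough screen that supports the plot rather than replaces it.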
What If You Calculate MTBF?
Considering MTBF is so popular and widely used, you may feel compelled to calculate MTBF for these three examples.
It’s easy to do. Each machine has run for 1,000 hours and enjoyed 10 failures, thus all three machines have a 100-hour MTBF.
- Machine 1 has a 100-hour MTBF
- Machine 2 has a 100-hour MTBF
- Machine 3 has a 100-hour MTBF
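The arithmetic can be sketched in a few lines of Python. The function name `mtbf` is a hypothetical helper. The inputs are the 1,000 operating hours and 10 failures stated in the example, with repair time ignored as the example assumes.

```python
def mtbf(operating_hours, failure_count):
    """Mean time between failures: total operating time divided by failure count."""
    return operating_hours / failure_count


# Each machine ran 1,000 hours and had 10 failures (from the example).
machines = {
    "Machine 1": (1000, 10),
    "Machine 2": (1000, 10),
    "Machine 3": (1000, 10),
}

results = {name: mtbf(hours, count) for name, (hours, count) in machines.items()}

for name, value in results.items():
    print(f"{name}: {value:.0f} hour MTBF")
```

Notice that the failure times never enter the calculation, only the count. Any arrangement of 10 failures within 1,000 hours returns the same number.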
The use of MTBF suggests there is no difference. We have reduced the information available for consideration. Using MTBF we would treat the three machines exactly the same.
The MTBF values limit the value of the data and preclude our ability to identify differences, take appropriate action, or understand what is happening.
I suggest that is not a good metric.
So, stop using it.
Please feel free to use this example, maybe change it to fit your industry or situation. Help those around you understand their data.
Please let me know of any examples you use to make the point that MTBF is not helping. Let’s collect and post what helps us get the message across.
6 thoughts on “Illuminating MTBF’s Lack of Information”
You might also have asked: “Why did they all fail after 112 hours?” !!
Good point. I didn’t get too creative with the first time to failure in my examples… yet I totally missed asking that question. Good eye. Cheers, Fred
Since MTBF is intended for the constant failure rate period, one could argue that at least two of those data sets would not have warranted characterization by MTBF in the first place.
Sure, one could do that… I suppose. Yet without plotting the data, and potentially fitting it to a curve (line), would we really be able to tell whether it was showing a constant failure rate?
Keep in mind there really isn’t such a thing as a constant failure rate period. There are many types of failure mechanisms, some with increasing or decreasing failure rates. The closest to a ‘flat part of the curve’ we see is when the changes are small enough to not matter much for the decisions the data is supporting. Assuming a constant failure rate doesn’t change the actual failure rate. Also, keep in mind that it is rare for a system to remain with little change in failure rate for very long.
PS: been meaning to reply to this comment for a few days – pending the site recovering from an attack which took us offline for 3 days.
Are these three machines all the same type/model?
They could be, yet these were just set up as examples. I have seen similar behavior when different teams install equipment, one with faulty instructions. One group fails early, the other wears out nicely. Cheers, Fred