I’ll have some Pi, you can have the MTBF


Picture of a pie from Wikipedia

Do you know what an irrational number is? It is a number that cannot be written as a ratio of two whole numbers, so its decimal digits go on forever without repeating, but it is often a useful shortcut in performing complex mathematical calculations. Pi is an irrational number that provides a very useful shortcut in calculating the circumference, area, surface area, and volume of round things. Pi happens to be my favorite irrational number because you get to celebrate it, if you follow the western calendar, every March 14th (3.14 are the first three digits in Pi) by eating a nice big piece of pie (Pi sounds like pie and pies are round).

Do you know any other irrational numbers? I do. Mean Time Between Failure (MTBF) and variants of it such as Mean Time To Failure (MTTF) are irrational numbers. But they are not irrational in a good and useful way like Pi is. Sure, MTBF once had some usefulness to it and provided a useful shortcut for some reliability, maintenance, and logistics applications, but it has become so misused that it is now irrational in the primary sense of the word: not logical, not reasonable, groundless, baseless, and not justifiable.

So how did MTBF, a once useful thing, get to be so irrational?

Here are some reasons:

  1. Apparently, to make the logistics for large populations of items simpler, people took the failure rate of the item and inverted it to create MTBF. They did this mostly out of convenience when dealing with large populations such as fleets of vehicles to address the random failures that were being experienced and to make the mathematics simple. And this approach worked fairly well before better approaches came into play. But this approach also worked fairly well because other reliability and maintainability practices were also enforced, namely planned/preventive/scheduled maintenance whereby serviceable items were serviced to keep them in proper operating condition, wearable items were replaced or restored, life limited items were replaced and good operating and failure data was kept. Without enforcing the maintainability and good data side of this, MTBF becomes very misleading.
  2. Then people who didn’t understand that MTBF was the failure rate of an item inverted began to take the “mean time” in MTBF a bit too literally, ignoring the fact that most items have a limited useful life, and began thinking that MTBF was some sort of indication of the mean life of the item. You can have an electrolytic capacitor that has a failure rate of 0.0000001 failures per operating hour and invert that to get an MTBF of 10,000,000 hours. Does that mean that a single capacitor will last for 10,000,000 hours, or roughly 1,142 years? Of course not, because the capacitor may only have a useful life of 5 to 20 years before it leaks, dries out, and fails. Whenever you use MTBF or even Failure Rate, you need to know not only that number but also the useful life over which the number is valid.
  3. Then people started collecting failure rate data, putting it in databases, and selling reliability analysis packages that enabled people to predict the MTBF of complex systems with hundreds or thousands of components in them. That made MTBF predictions very easy to do, and people got lazy and stopped indicating the useful life limits of the life-limited components in the system. But the MTBF numbers that the computer models spit out were big numbers and that made people very happy. Naïve and unaware, but happy. Except, that is, for the poor guys who had to use the systems, who struggled when the systems didn’t perform as promised and then got blamed for it.
  4. Then people stopped collecting failure rate data and now the databases underlying many of the computer models still in use today not only have misleading data but also have outdated and obsolete data.
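
The arithmetic behind the capacitor example in reason 2 can be sketched in a few lines of Python. This is only an illustration: the failure rate is the one quoted above, the 20 year figure is the upper end of the useful life range mentioned above, and the function and variable names are made up for this sketch.

```python
HOURS_PER_YEAR = 8760  # 24 hours x 365 days, ignoring leap years


def mtbf_hours(failure_rate_per_hour: float) -> float:
    """Return MTBF as the inverse of a constant failure rate."""
    return 1.0 / failure_rate_per_hour


failure_rate = 0.0000001   # failures per operating hour (capacitor example)
useful_life_years = 20     # upper end of the 5 to 20 year useful life range

mtbf = mtbf_hours(failure_rate)
mtbf_years = mtbf / HOURS_PER_YEAR

print(f"MTBF: {mtbf:,.0f} hours, about {mtbf_years:,.0f} years")
print(f"Useful life: at most {useful_life_years} years")
print(f"MTBF is roughly {mtbf_years / useful_life_years:.0f} times the useful life")
```

The MTBF comes out dozens of times longer than the capacitor could ever last, which is exactly why reading it as a "mean life" is so misleading.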

Irrational numbers indeed. Even to me, a self-professed Reliability subject matter expert, MTBF causes nothing but confusion. So I say stay away from it as much as you can.

So, what should you do?

The best thing to do is to not use MTBF and instead use Failure Rate. And when you use failure rate, make sure that you are using and representing it properly by stating the failure rate during the intended time period. Most of the time, people are interested in knowing the expected failure rate of something over its useful life. So, you may indicate that an item has an expected failure rate of 0.000001 failures per operating hour over its 10-year expected useful life. Some people write this as a failure rate of 1E-6 per hour over its 10-year useful life (there are other failure rate conventions used such as FIT rate that I won’t go into). If the customer knows the failure rate over the expected useful life, they then know two very useful things: how long they should expect the product to last and how reliable they can expect the product to be. And if customers know these two things, they can plan for the support, spares, maintenance, and replacement of items they need to be doing to keep their products or systems up and running.
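
To show what a customer can do with those two pieces of information, here is a rough Python sketch using the 1E-6 per hour, 10-year example. It rests on assumptions that are not stated above: the failure rate is constant over the useful life (the exponential model), the item runs around the clock, and failed units are replaced. The fleet size of 1,000 units and all names are hypothetical.

```python
import math

HOURS_PER_YEAR = 8760  # assumes continuous, 24/7 operation


def survival_probability(failure_rate_per_hour: float, years: float) -> float:
    """Chance that one unit runs for `years` without failing,
    assuming a constant failure rate (exponential model)."""
    hours = years * HOURS_PER_YEAR
    return math.exp(-failure_rate_per_hour * hours)


def expected_failures(failure_rate_per_hour: float, years: float, units: int) -> float:
    """Expected failure count across a fleet over `years`,
    assuming failed units are replaced so the fleet size stays constant."""
    return failure_rate_per_hour * years * HOURS_PER_YEAR * units


failure_rate = 1e-6        # failures per operating hour (example above)
useful_life_years = 10     # expected useful life (example above)
fleet_size = 1000          # hypothetical number of units in service

reliability = survival_probability(failure_rate, useful_life_years)
failures = expected_failures(failure_rate, useful_life_years, fleet_size)

print(f"Chance a unit survives its useful life: {reliability:.1%}")
print(f"Expected failures in a fleet of {fleet_size}: {failures:.1f}")
```

Under these assumptions a little over 9 in 10 units make it through the 10 years, and the fleet figure gives a starting point for planning spares. Neither number says anything about what happens after the useful life ends.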

One example is that you may use a non-repairable power supply in your system that has an expected usage life of 10 years and a very low failure rate during those 10 years. But what if you need your system to run for 20 or even 30 years? You either need to find a power supply with a longer life or be prepared to replace the power supply proactively before it nears its end of life. You should also design your system so that it is easy to replace the power supply.
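
A proactive replacement plan like this can be roughed out as follows. This is a sketch, not a maintenance standard: the `margin` parameter (replace at 80% of the rated life) is a hypothetical safety factor chosen for illustration, as is the function name.

```python
def replacement_schedule(system_life_years: float,
                         component_life_years: float,
                         margin: float = 0.8) -> list[float]:
    """Return the years (counted from system start) at which a
    life-limited component should be swapped out proactively.

    `margin` is the fraction of rated life used before replacing,
    a hypothetical safety factor for this sketch.
    """
    interval = component_life_years * margin
    schedule = []
    next_swap = interval
    while next_swap < system_life_years:
        schedule.append(next_swap)
        next_swap += interval
    return schedule


# Power supply rated for 10 years, system required to run 20 or 30 years.
plan_20 = replacement_schedule(20, 10)
plan_30 = replacement_schedule(30, 10)

print(f"20-year system: {len(plan_20)} replacements at years {plan_20}")
print(f"30-year system: {len(plan_30)} replacements at years {plan_30}")
```

Setting `margin=1.0` runs each power supply to the very end of its rated life, which means fewer replacements but a higher risk of wear-out failures in service.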

When repairable items are involved, the maintenance required should be indicated so that the customer knows what they need to do to preserve the performance of their product or system. One example is that you should expect your car to last for 200,000 miles, but you need to stick to the recommended maintenance schedule to ensure this. If you decide to never change the oil in your car, you should not expect it to last for 200,000 miles and certainly should not expect it to perform reliably.

How do you get failure rate?

You can get failure rate a few ways:

  1. Most component data sheets indicate Failure Rate or how to calculate it based on certain use and environmental parameters. Some data sheets indicate MTBF instead, so make sure to invert it to get Failure Rate. And do not forget to look for information that shows or explains the useful life that you can expect for the component so that you have both pieces of information that you need: the failure rate and the expected useful life over which it applies. This gives you a decent engineering estimate for useful life and reliability until you have actual data for your product.
  2. You can conduct testing or even accelerated testing on products to determine their failure rate. However, you may need a lot of samples and incur a lot of cost to test to demonstrate a certain reliability or failure rate.
  3. The best way to get failure rate, in my opinion, is to get it from your own products in service. You need to collect data either on the entire product population or a large enough sample population to know the actual number of units in service, operating hours, and failures. You can then develop your own failure rates for your products that reflect the markets you serve and how your product is used.
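
As a rough sketch of the first and third approaches, the Python below inverts a data sheet MTBF and estimates a failure rate from field data as failures divided by total operating hours. All of the numbers are invented for illustration, and the function names are hypothetical.

```python
def failure_rate_from_mtbf(mtbf_hours: float) -> float:
    """Invert a data sheet MTBF to get failures per operating hour."""
    if mtbf_hours <= 0:
        raise ValueError("MTBF must be positive")
    return 1.0 / mtbf_hours


def field_failure_rate(failures: int,
                       units_in_service: int,
                       avg_operating_hours_per_unit: float) -> float:
    """Estimate failures per operating hour from field data."""
    total_hours = units_in_service * avg_operating_hours_per_unit
    if total_hours <= 0:
        raise ValueError("total operating hours must be positive")
    return failures / total_hours


# Hypothetical data sheet value.
datasheet_rate = failure_rate_from_mtbf(500_000)

# Hypothetical field data: 12 failures across 5,000 units
# that each ran an average of 4,000 hours.
field_rate = field_failure_rate(failures=12,
                                units_in_service=5000,
                                avg_operating_hours_per_unit=4000)

print(f"Data sheet failure rate: {datasheet_rate:.1e} per hour")
print(f"Field failure rate:      {field_rate:.1e} per hour")
```

Either number is only meaningful alongside the useful life it applies to, and the field estimate is only as good as the records of units in service, operating hours, and failures behind it.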

Move away from the irrational numbers

As you move away from the irrational numbers of MTBF and towards knowing the real failure rates and reliability of your products in the markets you serve and how your products are used, you will be better able to drive reliability improvement when needed, understand and correctly price warranties and service agreements, and provide confidence and satisfaction to your customers. You can then reward yourself with a nice piece of pie.