Considerations When You Calculate MTBF
It is deceptively easy to calculate MTBF given a count of failures and an estimate of operating hours. Just tally up the total hours the various systems operate and divide by the number of failures. Easy.
This simple calculation is the unbiased estimator of theta (MTBF), which is the inverse of the parameter lambda of the exponential distribution. We use theta to represent 1 / lambda.
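As a minimal sketch of the calculation just described, the Python below divides total operating hours by the failure count and inverts the result to get lambda. The function name and the example numbers are hypothetical, chosen only for illustration.

```python
def mtbf_point_estimate(total_operating_hours: float, failure_count: int) -> float:
    """Return total operating hours divided by the number of failures (theta)."""
    if failure_count <= 0:
        # The zero-failure case needs a separate convention; see later in the article.
        raise ValueError("failure_count must be positive for this simple estimate")
    return total_operating_hours / failure_count


# Hypothetical numbers for illustration only.
theta_hat = mtbf_point_estimate(total_operating_hours=50_000, failure_count=4)
lambda_hat = 1 / theta_hat  # estimated failure rate per hour

print(theta_hat)   # 12500.0
print(lambda_hat)  # 8e-05
```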
What could go wrong with such a simple calculation?
What is a failure?
Let’s start with what we count or do not count as a failure. This directly changes the resulting MTBF value. If we only count confirmed hardware failures, and do not count intermittent, unreproducible, or software failures, are we undercounting what the customer experiences as a failure?
Over what duration do we count the failures? Should we focus only on the first month of operation, the first year, the warranty or service contract period or the entire operating life of the system? How do you calculate MTBF?
Some organizations only count failures they expect to occur. The unexpected ones are ‘special’ causes and require further study before they officially count as failures.
Another organization only counted failures that completely shut down the system. A partial loss of functionality, a degradation of capability, or the failure of a redundant element did not count as a system failure.
In my opinion if the customer calls it a failure, it’s a failure. If a failure, by any definition, costs your organization time and money to address, acknowledge, resolve or repair, it’s a failure.
What is operating time?
This one is tricky. If the system includes the appropriate sensors and tracking mechanisms (such as an hour meter) and a way to gather the operating time of units, both failed and still operating, then we have a pretty good way to track total operating hours. Some situations and systems make this easy.
Most do not.
Let’s say we ship 100 systems a month for 10 months. At the end of ten months the first shipments have accumulated 10 months of operating time. IF….
… They are all placed into service immediately
… They are all operated full time for the full 10 months
… They have each failure reported, including downtime
In general, we do have to make a few assumptions to determine the operating time for shipped systems. We tend to be conservative and err on the side that would make the MTBF value a little smaller than if we had the full set of carefully tracked data. Or do we?
- Some organizations count from the date/time of shipment, ignoring shipping and installation time.
- Some organizations assume all systems are installed and operated 24/7.
- Some organizations assume no news is good news and the systems with no information are still operating.
And a few organizations assume systems run indefinitely: unless notified that a system is decommissioned, they assume it is still running full tilt, even if it is 20 years old, i.e. there is no retirement or replacement policy.
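To see how much these assumptions matter, here is a rough sketch using the shipment example above (100 systems a month for 10 months). The 730 hours per month, the one-month installation delay, the 50% duty cycle, and the count of 10 failures are all hypothetical values picked for illustration, not data from any real fleet.

```python
HOURS_PER_MONTH = 730  # assumed average: 8,760 hours per year / 12


def estimated_operating_hours(units_per_month, months,
                              install_delay_months=0, duty_cycle=1.0):
    """Sum operating hours over monthly shipment batches.

    The oldest batch is assumed to be `months` old and the newest one month old.
    """
    total = 0.0
    for batch_age in range(1, months + 1):
        months_in_service = max(batch_age - install_delay_months, 0)
        total += units_per_month * months_in_service * HOURS_PER_MONTH * duty_cycle
    return total


failures = 10  # hypothetical count of reported failures

# Optimistic case: in service on shipment and running 24/7.
ideal_hours = estimated_operating_hours(100, 10)
# Less optimistic case: one month to install, running half the time.
adjusted_hours = estimated_operating_hours(100, 10, install_delay_months=1, duty_cycle=0.5)

print(ideal_hours / failures)     # 401500.0
print(adjusted_hours / failures)  # 164250.0
```

With the same failure count, the optimistic assumptions more than double the reported MTBF in this made-up case, which is the opposite of erring on the conservative side.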
How about when you calculate MTBF?
By convention, when there are no failures we assume that in the next instant there will be one failure. This avoids dividing by zero, which causes fits for calculators, spreadsheets, and mathematicians.
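A minimal sketch of that convention, assuming it is implemented by treating a zero failure count as one failure. The function name and the hours used are hypothetical.

```python
def mtbf_with_zero_failure_convention(total_operating_hours: float,
                                      failure_count: int) -> float:
    """Divide by the failure count, treating zero failures as one failure."""
    effective_failures = max(failure_count, 1)
    return total_operating_hours / effective_failures


print(mtbf_with_zero_failure_convention(2_000, 0))  # 2000.0
print(mtbf_with_zero_failure_convention(2_000, 4))  # 500.0
```

Note that under this convention zero failures and exactly one failure give the same MTBF value, so the number alone cannot tell the two situations apart.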
Another issue is how often are the calculations made? Do we gather data hourly, daily, weekly, monthly, annually? Some use a rolling set of data, for example only units shipped in the last year count for both operating time and failures. This result will ignore or discount the longer term wear out failures as the bulk of the units are young.
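The effect of a rolling data set can be sketched as follows. The record layout, the three-unit fleet, and the dates are hypothetical; the calculation also assumes the divide-by-one convention when the window contains no failures.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass
class UnitRecord:
    ship_date: date
    operating_hours: float
    failures: int


def rolling_mtbf(records, as_of, window_days=365):
    """MTBF using only units shipped within the last `window_days` days."""
    cutoff = as_of - timedelta(days=window_days)
    recent = [r for r in records if r.ship_date >= cutoff]
    hours = sum(r.operating_hours for r in recent)
    failures = sum(r.failures for r in recent)
    return hours / max(failures, 1)  # divide by one when there are no failures


# Hypothetical fleet: one old unit with wear-out failures, two young units.
fleet = [
    UnitRecord(date(2020, 1, 15), 30_000, 6),
    UnitRecord(date(2023, 6, 1), 4_000, 0),
    UnitRecord(date(2023, 9, 1), 2_000, 1),
]
as_of = date(2024, 1, 1)

print(rolling_mtbf(fleet, as_of))                               # 6000.0
print(round(rolling_mtbf(fleet, as_of, window_days=10_000), 1))  # 5142.9
```

In this made-up fleet the one-year window drops the old unit and its six failures entirely, so the rolling value looks better than the value computed over all units.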
Some organizations do the calculations weekly in order to detect trends. If there are trends, you probably should not be using MTBF. If it’s changing, if there are early life or wear-out failure mechanisms, you should not be using MTBF.
Even though you can calculate MTBF easily, working through all the complexities of getting it right still does not provide a useful metric. Instead, focus on getting better data, including time to failure information, so you can explore and report the data with other tools and methods. Treat the data appropriately and make better decisions.
Sure, better data will also improve your ability to calculate the MTBF value. If you’d like to be like some organizations, that is fine.
How have you seen MTBF calculated poorly? Share your thoughts and stories in the comments below.
We calculate our MTBF / MTTF using only the confirmed failures, since those most often represent the ones most likely to be design controllable. That said, we and our customer also monitor the MTBUR (mean time between unscheduled removals), which reflects the pain the end user feels, including installation and handling damage, the NFF count, and other causes that may need attention but aren’t strictly the functional failures. Thanks to more and better data collection going on in the commercial aero industry, better fleet utilization data is available, so we are using Weibull analysis to calculate MTTF, since the beta is a better indicator of where to go looking for cause or causes, and the shape of the plotted data can also hold some clues.
Hi Kevin,
Thanks for the note. So, if a failure is not confirmed, it is assumed to be outside the design’s ability to control? Doesn’t that leave a large area for intermittent failures to lurk, which often can be designed out directly if desired?
And, if using Weibull analysis, why bother to calculate MTTF? Why strip your data of information by going from a Weibull CDF, for example, or % surviving so many air miles, to MTTF? It seems a waste of good data.
Cheers,
Fred
We’re on the same path, I just didn’t put the full story into the first message. I can’t count failures if I don’t know what failed. In many cases, NFFs get additional scrutiny like a run thru production ESS, a hot soak or a vibe test to try and get the failure to recur. We also recognize that sometimes troubleshooting at the next level is a shotgun approach and good parts are removed, so NFF is not always an intermittent or lurking condition; it’s an expected outcome when the part gets back to us.
As far as why I calculate an MTTF from Weibull, it’s because it’s what was expected or requested, not because I like it any more than you. That’s why I also report a time to 1% failures along with it, because that’s really what they wanted to know, they just didn’t know to ask for it.
Cheers,
Kevin
Thanks Kevin, and you know I like the idea of reporting the way you describe: MTBF or MTTF along with time to first percentile or similar useful information. Cool! Cheers, Fred
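For readers following this exchange, here is a small sketch of reporting both values from a fitted two-parameter Weibull distribution, using the standard formulas MTTF = eta * Gamma(1 + 1/beta) and t_p = eta * (-ln(1 - p))^(1/beta). The beta and eta values are hypothetical, not from Kevin's data, and fitting the distribution is outside the scope of the sketch.

```python
import math


def weibull_mttf(beta: float, eta: float) -> float:
    """Mean time to failure for shape beta and scale eta."""
    return eta * math.gamma(1 + 1 / beta)


def weibull_time_to_fraction_failed(beta: float, eta: float, fraction: float) -> float:
    """Time by which the given fraction of units is expected to have failed."""
    return eta * (-math.log(1 - fraction)) ** (1 / beta)


# Hypothetical fits with the same scale and different shapes.
eta = 10_000.0
for beta in (0.7, 1.0, 3.0):
    mttf = weibull_mttf(beta, eta)
    t_1_percent = weibull_time_to_fraction_failed(beta, eta, 0.01)
    print(f"beta={beta}: MTTF={mttf:.0f} h, time to 1% failures={t_1_percent:.0f} h")
```

The MTTF values for these three shapes stay within the same order of magnitude, while the time to 1% failures ranges from a handful of hours to a couple of thousand, which is why reporting the early percentile alongside the mean is more informative.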
How do I find the MTBF value for an hour meter (Part no.: 20018)? Otherwise, give me an equivalent formula to find the MTBF for an hour meter.
First off, do not use MTBF to describe the reliability of the part. You can ask the vendor, yet better to ask for reliability information instead. You can calculate MTBF by dividing the total time by the number of failures… which is typically not very useful.
Cheers,
Fred
I am curious about how to calculate MTBF for a day when there are no stops. We currently use a system called plant focus which automatically calculates our MTBF and Reliability coefficient. We track this daily, rolling 30 day, and rolling year. If we have a day with no failures then the MTBF is calculated as 0 by the software. However, this 0 will bring down our overall 30 day and yearly values. This does not make sense, as we are looking to increase our MTBF and not having a failure should not decrease this measurement.
Hi Zachary,
One of many reasons to never use MTBF.
By convention, you should divide by 1 when you have no failures over some time period.
Actually, if you want to avoid the issue entirely, and make the use of MTBF rather more useless, just use the total time from some point in the distant past over which you have a count of failures. This has the same effect as a rolling average, and if you have enjoyed at least one failure then you can always avoid the divide by zero issue.
Use reliability or operational availability directly and skip using MTBF. For repairable systems you may also want to consider using the Mean Cumulative Function instead.
Cheers,
Fred
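As a rough illustration of the Mean Cumulative Function mentioned above, here is a sketch of the usual nonparametric estimate: at each failure age, add one divided by the number of systems still under observation at that age. The data layout and the three example systems are hypothetical, and confidence bounds are left out.

```python
def mean_cumulative_function(systems):
    """Estimate the MCF from repairable-system histories.

    `systems` is a list of (failure_ages, end_of_observation_age) pairs,
    with all ages measured from each system's own start of operation.
    Returns a list of (age, cumulative mean number of failures per system).
    """
    failure_ages = sorted(age for ages, _ in systems for age in ages)
    mcf = 0.0
    curve = []
    for age in failure_ages:
        # Systems still being observed at this age form the risk set.
        at_risk = sum(1 for _, end_age in systems if end_age >= age)
        mcf += 1 / at_risk
        curve.append((age, mcf))
    return curve


# Hypothetical histories: failure ages in hours, then last observed age.
histories = [
    ([100, 300], 400),
    ([250], 400),
    ([], 200),
]
for age, value in mean_cumulative_function(histories):
    print(age, round(value, 3))
```

Plotting the resulting curve against age shows whether failures are arriving faster or slower as the systems age, which a single MTBF value cannot show.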
Another way to handle it would be to invert your metric from MTBF to failure rate, and track that failures divided by hours. Smaller is better, and your zero is the ideal condition. It can stay zero for any length of time and the calculation and charting still will be correct. It’s a mindset change from bigger is better to wanting small, but it could work.
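A short sketch of the failure rate approach, assuming the daily log is a list of (operating hours, failures) pairs and that the rolling value is computed by pooling hours and failures over the window rather than averaging daily values. The log below is hypothetical.

```python
def pooled_failure_rate(daily_log):
    """Failures per operating hour, pooled over every day in the log."""
    total_hours = sum(hours for hours, _ in daily_log)
    total_failures = sum(failures for _, failures in daily_log)
    if total_hours <= 0:
        raise ValueError("no operating hours in the log")
    return total_failures / total_hours


# Hypothetical 30-day window: 28 days without a stop, 2 days with one failure each.
thirty_days = [(24.0, 0)] * 28 + [(24.0, 1)] * 2

print(pooled_failure_rate([(24.0, 0)]))         # 0.0 for a single failure-free day
print(round(pooled_failure_rate(thirty_days), 5))  # 0.00278
```

A failure-free day contributes hours and no failures, so it lowers the pooled rate instead of dragging the metric in the wrong direction.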
Hi. I wanna ask, the equation of MTBF is running hours / number of failures. If some equipment did not fail, what is the MTBF value? Is it infinity?
Given the problem with dividing by zero, the convention is to divide by one instead, assuming the first failure will occur in the next instant. Cheers, Fred