There are occasions when we have either field or test data that includes the duration of operation and whether or not the unit failed. This can be, say, 10 large motors. For sake of argument, the test ran each motor for 1,000 hours and when a motor failed it was repaired quickly and returned to the test. There were 3 failures.

Sadly, this is all we need to calculate an estimate for the motor MTBF.

MTBF is the total operating time divided by the number of failures. In this case, 10 motors times 1,000 hours gives a total time of 10,000 hours. Divide 10,000 by the three failures to find an MTBF of 3,333 hours.

What I find interesting is that I could find the same MTBF value using 10,000 motors each run for one hour, or one motor run for 10,000 hours. If in each case there were three failures, we would find an MTBF of 3,333 hours.
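As a small sketch of that arithmetic, the snippet below computes the point estimate for the three scenarios described above. The function name `mtbf` and the scenario labels are just illustrative choices, not part of any standard library.

```python
def mtbf(units, hours_per_unit, failures):
    """Point estimate of MTBF: total operating time / number of failures."""
    if failures <= 0:
        raise ValueError("need at least one failure for a point estimate")
    total_hours = units * hours_per_unit
    return total_hours / failures


# Three very different tests, each with 10,000 total hours and 3 failures
scenarios = {
    "10 motors x 1,000 h": (10, 1_000),
    "10,000 motors x 1 h": (10_000, 1),
    "1 motor x 10,000 h": (1, 10_000),
}

results = {name: mtbf(u, h, failures=3) for name, (u, h) in scenarios.items()}

for name, value in results.items():
    print(f"{name}: MTBF = {value:,.0f} h")
```

All three lines print the same 3,333 h, which is the point: the calculation only sees total hours and failure count, not how the hours were accumulated.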

Now that works perfectly well when there is a constant failure rate, meaning there is an equal chance of failure in each hour of operation. Old motors would have the same chance of failure as brand new motors.

Of course, you know why I chose motors for the example: to reinforce the idea that the chance of failure is not always constant. Be sure to think about the failure mechanisms before using MTBF (or MTTF). If the failure rate is time dependent, then this simple calculation is not useful.
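To illustrate why the failure mechanism matters, here is a rough sketch comparing two models that share the same 3,333-hour mean: a constant failure rate (exponential) and a wear-out pattern (Weibull). The shape value of 3 is an assumption chosen only for illustration; it does not come from any motor data.

```python
import math

MTBF_HOURS = 10_000 / 3  # from the motor example
BETA = 3.0               # ASSUMED Weibull shape (>1 means wear-out); illustrative only

# Pick the Weibull scale so its mean equals the same MTBF:
# mean = eta * Gamma(1 + 1/beta)
eta = MTBF_HOURS / math.gamma(1 + 1 / BETA)


def reliability_exponential(t, mean):
    """Probability of surviving to time t with a constant failure rate."""
    return math.exp(-t / mean)


def reliability_weibull(t, shape, scale):
    """Probability of surviving to time t under a Weibull model."""
    return math.exp(-((t / scale) ** shape))


for t in (500, 1_000, 3_333, 5_000):
    r_exp = reliability_exponential(t, MTBF_HOURS)
    r_wbl = reliability_weibull(t, BETA, eta)
    print(f"t = {t:>5} h  exponential R = {r_exp:.3f}  Weibull R = {r_wbl:.3f}")
```

With these assumed values, the two models give about 74% versus about 98% survival at 1,000 hours, even though both have the same MTBF. The single number hides when the failures actually occur.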

I used this example during a class last week and it seemed to spark a good discussion. How have you explained MTBF to others? Any suggestions on how to best describe what the MTBF value really means, or doesn’t mean?

In my case, trying to calculate reliability and/or MTBF for a subsystem is very frustrating. The advertised reliability of many components is based on the OEM’s projection or engineering analysis because no one wants to spend the dollars and time required to truly test the component thoroughly. Trying to validate their projection or engineering analysis at the system level usually means that my only data is based on one or two failures for a series of tests that accumulate a total of 30 to 40 hours of operation. For simplicity, I am usually forced to ignore the conditions of testing (e.g., temperature, altitude, load). Components that require a high level of reliability (R) and confidence (C) need several hundred hours of operation to fully demonstrate R&C.
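To put rough numbers on how much test time a given R and C demand, the sketch below uses the zero-failure demonstration formula, T = t * ln(1 - C) / ln(R). It assumes a constant failure rate and a test with no failures, both of which are strong assumptions. The function name and the example targets (a 10-hour mission, 90% confidence) are hypothetical.

```python
import math


def zero_failure_test_hours(reliability, confidence, mission_hours):
    """Failure-free test hours needed to demonstrate `reliability` over
    `mission_hours` at `confidence`. ASSUMES a constant failure rate
    (exponential model) and zero failures during the test."""
    if not (0 < reliability < 1 and 0 < confidence < 1):
        raise ValueError("reliability and confidence must be between 0 and 1")
    return mission_hours * math.log(1 - confidence) / math.log(reliability)


# Hypothetical targets: a 10-hour mission at 90% confidence
for r in (0.90, 0.95, 0.99):
    hours = zero_failure_test_hours(r, confidence=0.90, mission_hours=10)
    print(f"R = {r:.2f}, C = 0.90: about {hours:,.0f} failure-free test hours")
```

Under these assumptions, 30 to 40 accumulated hours with one or two failures falls well short of demonstrating high reliability at high confidence.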

Hi Bill, I feel your pain. Vendors have to deal with many operating conditions and use cases. They tend to do what is requested by the majority. Unfortunately, so many seem happy with very poor information that those who need and request better information are thwarted. I suggest we continue to ask for meaningful information, educate our peers to do likewise, and when all else fails, do the testing ourselves.

Cheers,

Fred

Hi Fred,

At the moment I am doing an internship at a company concerning MTBF.

My research keeps forcing the same question into my mind: is it even wise to help them calculate their MTBF? The failure rate (FR) is not constant at all, as they mainly produce flowmeters.

For me, MTBF is not an estimate of how long an asset will last at all; it says more about the improvement or decline in the reliability of an asset or system.

Your topic intrigued me, as I am starting to agree that it may not be wise to use MTBF at all.

I would be very excited if you could tell me about your experience with other, similar, and preferably more representative metrics.

Greetings,

Stefan

Hi Stefan, you should be concerned, as MTBF is most likely misleading or does not represent the actual failure rate at any particular time of interest.

Instead, use reliability: the probability of success over a specific duration. For example, 98% reliable over 1 year.

Use multiple points in time, or better, if you have the time-to-failure information, fit a Weibull distribution (or another appropriate distribution) and have the entire picture of the probability of failure over time in a CDF plot.

Cheers,

Fred

Hi Fred,

Thank you a lot! I will have a look into that. This is very useful and fun information for me to work with! Thanks again.

Greetings,

Stefan

Hi Fred, I appreciate your site a lot. One can understand how important its mission is by searching the internet for examples of MTTF and FIT calculations.

I’ve started reading about hazard rate, failure rate, MTTF, etc., but I can’t find any advice on how to interpret test data to obtain a failure rate. Let me give an example:

I’m testing 10 devices (nonrepairable system) over e.g. 400 hrs.

Recorded failure times in hrs: {30, 45, 60, 90, 120, 180, 240, 300}, 2 devices survived.

Can I say my failure rate is 8/(30+45+60+90+120+180+240+300+2*400), with MTTF as the reciprocal of the failure rate? Or maybe I shouldn’t even try to calculate a failure rate from this data? Does this method cause any problems with the reliability calculation? I agree that the mean value for a particular distribution yields a different reliability, but please advise how to process this data in the correct way.

I appreciate your feedback in advance.
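The calculation proposed in the question above can be written out as follows. This is only the arithmetic as described (failures divided by total time on test, with the two survivors counted at 400 hours each); it implicitly assumes a constant failure rate. Variable names are illustrative.

```python
failure_times = [30, 45, 60, 90, 120, 180, 240, 300]  # hours, from the question
survivors = 2                                          # units still running
test_end = 400                                         # hours, end of test

# Total time on test: failed units contribute their failure time,
# survivors contribute the full test duration
total_time = sum(failure_times) + survivors * test_end

failure_rate = len(failure_times) / total_time  # failures per hour
mttf = 1 / failure_rate                         # hours

print(f"total time on test: {total_time} h")
print(f"failure rate: {failure_rate:.5f} failures/h")
print(f"MTTF: {mttf:.1f} h")
```

This gives 1,865 hours of total test time and an MTTF of about 233 hours.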

Hi Rafal, thanks for the note and example problem. While you can estimate the failure rate and MTTF as described, it is not all that useful in most cases.

Instead, use a Weibull analysis (with so few data points, Weibull is often a great starting point as it is versatile). This will provide the probability of failure at various points in time.

I’m traveling at the moment and have limited access, so will follow up later when I can either work out the problem for you, or point to a better reference and example.

Cheers,

Fred
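As one possible way to carry out the Weibull analysis suggested above on the ten-device data, here is a minimal sketch using median rank regression with only the standard library. It is one common fitting approach among several (maximum likelihood is another) and is not necessarily the method the reply had in mind. It relies on the fact that both survivors outlast every failure, so the failure ranks are simply 1 through 8 out of 10.

```python
import math

failure_times = [30, 45, 60, 90, 120, 180, 240, 300]  # hours, from the comment
n_units = 10  # 8 failures plus 2 survivors censored at 400 h

# Benard's approximation for median ranks: (i - 0.3) / (n + 0.4)
xs, ys = [], []
for rank, t in enumerate(sorted(failure_times), start=1):
    median_rank = (rank - 0.3) / (n_units + 0.4)
    xs.append(math.log(t))
    ys.append(math.log(-math.log(1 - median_rank)))

# Least-squares line y = beta * x + intercept (Weibull plot linearization)
x_bar = sum(xs) / len(xs)
y_bar = sum(ys) / len(ys)
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
s_xx = sum((x - x_bar) ** 2 for x in xs)
beta = s_xy / s_xx                 # shape estimate
intercept = y_bar - beta * x_bar
eta = math.exp(-intercept / beta)  # scale estimate (characteristic life)


def prob_failure(t):
    """Fitted Weibull CDF: probability a unit has failed by time t."""
    return 1 - math.exp(-((t / eta) ** beta))


print(f"shape beta = {beta:.2f}, scale eta = {eta:.0f} h")
for t in (50, 100, 200, 400):
    print(f"P(failure by {t} h) = {prob_failure(t):.2f}")
```

With only eight failures the estimates carry wide uncertainty, so the fitted curve is best read as a rough picture rather than a precise result.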

Hi Fred,

Thank you for your interest. Meanwhile, I have been struggling to understand the meaning of failure rate and how to interpret it. As a supplement to the previous question: if, for example, the failure rate equals 0.004 failures/hr, or in other words 4 failures per 1,000 hrs, does that mean 4 units will fail every 1,000 hrs, assuming an exponential distribution and constant hazard rate? Does it also mean that if I had 4 components, I could expect none of them to be functioning after 1,000 hrs, but if I had 1,000 components, 4 of them would fail within 1,000 hrs and 996 would remain healthy for the next 1,000 hrs? I read a failure rate example somewhere: FR = 0.1 means 10% of the population will fail every time step, which in fact plots an exponential curve. But in this case, with a specific number of devices, e.g. 1,000 components, the population drops linearly to 0 after 250,000 hrs (1,000 * 250), where 250 is the mean time to failure. Maybe it shouldn’t be understood so straightforwardly? Maybe, if it is an average value and follows the assumed exponential pdf, we can say that on average 4 per 1,000 hrs fail, but in our case 37% will fail in the first 250 hrs and the remaining 63% within the next 250,000 - 250 = 249,750 hrs. If this is correct I’m home; if not I’m lost…
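For what it is worth, the exponential model behind this question can be evaluated directly. The sketch below assumes a constant failure rate of 0.004 per hour and non-repairable units; the function name is illustrative. Under that model, the fraction failed by time t is 1 - exp(-rate * t), so about 63% have failed by the 250-hour MTTF (37% survive), and "4 per 1,000 hours" is better read as 4 failures per 1,000 unit-hours of operation.

```python
import math

failure_rate = 0.004     # failures per hour, from the question
mttf = 1 / failure_rate  # 250 hours


def fraction_failed(t):
    """Expected fraction of units failed by time t (constant failure rate)."""
    return 1 - math.exp(-failure_rate * t)


for t in (250, 500, 1_000):
    failed = fraction_failed(t)
    print(f"by {t:>5} h: {failed:.1%} failed, {1 - failed:.1%} surviving")

# 1,000 units each run for 1 hour accumulate 1,000 unit-hours
population = 1_000
expected_failures_first_hour = population * fraction_failed(1)
print(f"expected failures among {population} units in 1 h: "
      f"{expected_failures_first_hour:.2f}")
```

So with 1,000 components running for 1,000 hours each, nearly all would be expected to fail under this model, not just 4.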