“What’s the MTBF of a Human?” That’s a bit of a strange question.

Guest post by Adam Bahret

I ask this question in my Reliability 101 course. Why ask such a weird question? I’ll tell you why. Because MTBF is the worst, most confusing, crappy metric used in the reliability discipline. Ok maybe that is a smidge harsh, it does have good intentions. But the amount of damage that has been done by the misunderstanding it has caused is horrendous.

MTBF stands for “Mean Time Between Failures.” It is the inverse of the failure rate. An MTBF of 100,000 hrs/failure is a failure rate of 1/100,000 fails/hr = 0.00001 fails/hr. Those are numbers; what does that look like in operation? Continue reading MTBF of a Human→

The MTBF calculation is widely used to evaluate the reliability of parts and equipment, and in industry it is usually defined as one of the key performance indicators. This short article is intended to demonstrate in practice how we can fool ourselves by evaluating this indicator in isolation. Continue reading MTBF Paradox: Case Study→

Thanks to a reader who noticed my question on why MTBF came into existence, we have a new (new to me at least) rationale for using MTBF. Basically, MTBF provides clarity on the magnitude of a number, because a number in scientific notation is potentially confusing.

What is doubly concerning is the use of MTTF failure rate values in ISO standards dealing with system safety.

Within maintenance management, MTBF (Mean Time Between Failures) is the most important KPI after Physical Availability. Unlike MTTF (Mean Time To Failure), which relates directly to available equipment time, MTBF also includes the time spent under repair. That is, it starts counting at a given failure and stops only once that fault has been remedied, the equipment has been restarted, and the failure has occurred again. According to ISO 12849:2013, this indicator can only be used for repairable equipment, and MTTF is its equivalent for non-repairable equipment.

The graphic below illustrates these occurrences:

To calculate the MTBF in Figure 01, we add the times T1 and T2 and divide by two. That is, we take the average of all the times from one failure, through repair and return to service, to the next failure. It is, therefore, a simple arithmetic calculation. But what does it mean?
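As a minimal sketch of that arithmetic, the snippet below averages a list of times between failures. The values given to T1 and T2 are invented for illustration, since the figure's actual values are not stated, and the function name is hypothetical.

```python
def mtbf(intervals_hours):
    """Arithmetic mean of the times between consecutive failures."""
    if not intervals_hours:
        raise ValueError("need at least one interval")
    return sum(intervals_hours) / len(intervals_hours)


# Hypothetical values for T1 and T2 in hours (not taken from the figure).
t1, t2 = 150.0, 250.0
result = mtbf([t1, t2])
print(result)  # 200.0
```

Note that two very different histories, such as 10 h and 390 h, give the same 200 h average, which is part of why the number alone says so little.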

Generally speaking, this indicator is associated with the reliability of assets or asset systems, and may even be applied to an individual repairable item, although data at that level of detail is rarely available. Maintenance managers set benchmark numbers and track performance on a chart over time. In general, the higher the MTBF the better, meaning fewer breakdowns and repairs over the analyzed period.

With the concepts established, some specific questions need to be answered:

1. Can we establish periodicity of a maintenance plan based on MTBF time?

2. Can I calculate my failure rates based on my MTBF?

3. Can I calculate my probability of failure based on my MTBF?

4. If the MTBF of my asset or system is 200 hours, after that time will it fail?

It is interesting to answer these questions separately:

1. Can we establish periodicity of a maintenance plan based on MTBF time?

The MTBF is an average calculated from a set of values. These values can be grouped into a histogram to produce a distribution whose mean is the MTBF. Imagine that this distribution is Gaussian, so that a Normal curve has been fitted to the failure data. The chart below shows that the MTBF is positioned in the middle of the curve.

In a fitted Normal PDF (probability density function) curve, the mean value, or MTBF, is reached after 50% of the failures have occurred. If we schedule preventive maintenance at an interval equal to the MTBF, the item will already have had a 50% probability of failing. Therefore, the MTBF is not a number that indicates the optimal time for a scheduled intervention.
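The 50% figure can be checked with a short sketch using Python's standard `statistics.NormalDist`. The mean of 200 hours and standard deviation of 40 hours are assumed values chosen only for illustration, and the 10% risk target is likewise hypothetical.

```python
from statistics import NormalDist

# Assumed Normal time-to-failure model (illustrative parameters only).
mtbf_hours = 200.0
std_hours = 40.0
time_to_failure = NormalDist(mu=mtbf_hours, sigma=std_hours)

# Cumulative probability of failure by the time the MTBF is reached.
p_fail_at_mtbf = time_to_failure.cdf(mtbf_hours)
print(f"P(failure by MTBF) = {p_fail_at_mtbf:.2f}")

# Interval at which only 10% of items are expected to have failed
# (a hypothetical risk target, shown for comparison with the MTBF).
t_10_percent = time_to_failure.inv_cdf(0.10)
print(f"Time at 10% cumulative failures = {t_10_percent:.1f} h")
```

Under these assumptions, an interval that accepts only a 10% chance of failure falls well before the MTBF.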

2. Can I calculate my failure rates based on my MTBF?

When failure data are modeled to calculate the MTBF, only the exponential distribution yields a fixed value for the failure rate, equal to the inverse of the MTBF:

MTBF = 1 / λ

In this distribution, the MTBF time already corresponds to 63.2% probability of failure.
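The 63.2% figure follows from the exponential cumulative failure probability, 1 - exp(-t / MTBF), evaluated at t = MTBF. A minimal sketch, with a hypothetical function name and an MTBF of 200 hours chosen only as an example:

```python
import math


def exponential_failure_probability(t_hours, mtbf_hours):
    """Cumulative probability of failure by time t under a constant failure rate."""
    failure_rate = 1.0 / mtbf_hours  # valid only for the exponential model
    return 1.0 - math.exp(-failure_rate * t_hours)


mtbf_hours = 200.0  # example value
p = exponential_failure_probability(mtbf_hours, mtbf_hours)
print(round(p, 3))  # 0.632
```

The result does not depend on the MTBF value chosen: at t = MTBF the expression is always 1 - exp(-1).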

For any model other than the exponential, the failure rate is variable and time dependent, so its calculation also depends on the probability density function f(t) and the reliability function R(t).

λ(t) = h(t) = f(t) / R(t)
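To illustrate the time dependence, the sketch below evaluates h(t) = f(t) / R(t) for a Weibull model. The Weibull distribution is an assumption introduced here as a common example; the article does not name a specific alternative. The shape (beta) and scale (eta) values are illustrative only.

```python
import math


def weibull_pdf(t, beta, eta):
    """Probability density f(t) of a two-parameter Weibull distribution."""
    return (beta / eta) * (t / eta) ** (beta - 1) * math.exp(-((t / eta) ** beta))


def weibull_reliability(t, beta, eta):
    """Reliability (survival) function R(t) of the same distribution."""
    return math.exp(-((t / eta) ** beta))


def hazard(t, beta, eta):
    """Failure rate h(t) = f(t) / R(t)."""
    return weibull_pdf(t, beta, eta) / weibull_reliability(t, beta, eta)


eta = 200.0  # assumed scale parameter, hours
for beta in (1.0, 2.0):  # beta = 1 reduces to the exponential case
    rates = [hazard(t, beta, eta) for t in (50.0, 100.0, 200.0)]
    print(beta, [round(r, 5) for r in rates])
```

With beta = 1 the failure rate stays at 1 / eta for every t, while with beta = 2 it grows as the item ages, so a single inverse-of-MTBF value no longer describes it.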

Although the exponential distribution, which implies a constant failure rate over time, is the most widely adopted in reliability projects, most assets show a failure rate that varies along their “bathtub curve”, as exemplified by Moubray:

This means that the exponential distribution is not the best suited to reflect the behavior of most assets in an industrial plant.

3. Can I calculate my probability of failure based on my MTBF?

As seen above, only the exponential distribution has a constant failure rate that can be calculated as the inverse of the MTBF. In this case, yes, we can calculate the probability of failure of an asset. The probability density is:

f(t) = λ·exp(-λt)

and integrating it from 0 to t gives the cumulative probability of failure by time t:

F(t) = 1 - exp(-λt)

For other models, where the failure rate depends on time, the probability of failure can only be calculated by modeling the data and fitting a parametric statistical curve.

4. If the MTBF of my asset or system is 200 hours, after that time will it fail?

The question is, what exactly does that number mean? It was shown that the MTBF should not be used as a maintenance plan frequency. According to the items explained above, this time means nothing by itself; it is useful only when compared with its own history over the months. If the parametric model governing the behavior of the assets in a reliability study has not been determined, the time of 200 hours has no meaning as a probability of failure. The MTBF provided by equipment manufacturers is a different case: through life tests they fit exponential curves and thus calculate the time by which 63.2% of the sample will have failed.
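A short sketch shows why 200 hours carries no fixed probability without a model. It compares three Weibull models that all have a mean life of 200 hours but different shape parameters. The Weibull family and the shape values are assumptions chosen for illustration; a shape of 1 is the exponential case.

```python
import math


def failure_probability_at(t, beta, mean_life):
    """P(failure by time t) for a Weibull model whose mean equals mean_life."""
    # Scale parameter chosen so that the distribution mean is mean_life.
    eta = mean_life / math.gamma(1.0 + 1.0 / beta)
    return 1.0 - math.exp(-((t / eta) ** beta))


mtbf_hours = 200.0
for beta in (0.5, 1.0, 3.0):  # illustrative shape parameters
    p = failure_probability_at(mtbf_hours, beta, mtbf_hours)
    print(f"shape {beta}: P(failure by {mtbf_hours:.0f} h) = {p:.3f}")
```

Under these assumptions the probability of having failed by 200 hours is roughly 76%, 63%, and 51% respectively, even though the MTBF is identical in all three cases.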

I hope the article has helped us reflect on the definition of an indicator that is so widely used yet so misunderstood within industrial maintenance management.

Reliability activities serve one purpose: to support better decision making.

That is all they do. Reliability work may reveal design weaknesses, which we can decide to address. Reliability work may estimate the longevity of a device, allowing decisions when compared to objectives for reliability.

Creating a report that no one reads is not the purpose of reliability. Running a test or analysis to simply ‘do reliability’ is not helpful to anyone. Anything with MTBF involved … well, you know how I feel about that. Continue reading Consider the Decision Making First→

Despite standing for the time between failures, MTBF does not represent a duration. Despite having units of hours (months, cycles, etc.), it is not a duration-related metric.

MTBF is a symptom of a bigger problem. It is possibly a lack of interest in reliability. Which I doubt is the case. Or it is a bit of fear of reliability.

Many shy away from the statistics involved. Some simply do not want to know the currently unknown. It could be the fear of potential bad news that the design isn’t reliable enough. Some do not care to know about problems that will require solving.

Whatever the source of the uneasiness, you may know one or more coworkers who would rather not deal with reliability in any direct manner. Continue reading The Fear of Reliability→

What Does Being In The Flat Part of the Curve Mean?

To me it means very little, as it rarely occurs. Products fail for a wide range of reasons and each failure follows its own path to failure.

As you may understand, some failures tend to occur early, some later. Some we call early life failures, out-of-box failures, etc. Some we deem end of life or wear out failures. There are a few that are truly random in nature, such as a drop or accident causing an overstress fracture, for example. Continue reading Being In The Flat Part of the Curve→

The calculation of MTBF results in a larger number if we make a series of MTBF assumptions. We just need more time in the operating hours and fewer failures in the count of failures.

While we really want to understand the reliability performance of field units, we often make a series of small assumptions that impact the accuracy of MTBF estimates.

Here are just a few of these MTBF assumptions that I’ve seen, in some cases nearly all of them with one team. Reliability data has useful information if we gather and treat it well. Continue reading A Series of Unfortunate MTBF Assumptions→

It is Time to Update the Reliability Metric Book with Your Help

Let’s think of this as a crowdsourced project. The first version of this book is a compilation of NoMTBF.com articles. It lays out why we do not want to use MTBF and what to do instead (to some extent).

With your success stories, advice on making progress using better metrics, examples, and case studies, the next version of the book will be much better and much more practical. Continue reading Time to Update the Reliability Metric Book→

Just back from the Reliability and Maintainability Symposium and not happy. While there are signs, a proudly worn button, regular mentions of progress and support, we still talk about reliability using MTBF too often. We need to avoid MTBF actively, no, I mean aggressively.

Let’s get the message out there concerning the folly of using MTBF as a surrogate to discuss reliability. We need to work relentlessly to avoid MTBF on all occasions.

Teaching reliability statistics does not require the teaching of MTBF.

Describing product reliability performance does not benefit from using MTBF.

MTBF use and thinking is still rampant. It affects how our peers and colleagues approach solving problems.

There is a full range of problems that come from using MTBF, yet how do you spot the signs of MTBF thinking even when MTBF is not mentioned? Let’s explore three approaches that you can use to ferret out MTBF thinking and move your organization toward making informed decisions concerning reliability. Continue reading 3 Ways to Expose MTBF Problems→

Let’s say we want to characterize the reliability performance of a vendor’s device. We’re considering including the device within our system if, and only if, it will survive 5 years reasonably well.

The vendor’s data sheet lists an MTBF value of 200,000 hours. A call to the vendor and search of their site doesn’t reveal any additional reliability information. MTBF is all we have.

We don’t trust it. Which is wise.
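For reference, here is what the data sheet value would imply if taken at face value. This sketch assumes a constant failure rate (the exponential model) and continuous operation at 8,760 hours per year; neither assumption comes from the vendor, and both are exactly the kind of thing the MTBF value alone cannot confirm.

```python
import math

HOURS_PER_YEAR = 8760.0  # assumes continuous, 24/7 operation


def exponential_survival(t_hours, mtbf_hours):
    """Probability of surviving to t_hours under a constant failure rate."""
    return math.exp(-t_hours / mtbf_hours)


vendor_mtbf_hours = 200_000.0  # value from the vendor's data sheet
mission_hours = 5 * HOURS_PER_YEAR
survival_5_years = exponential_survival(mission_hours, vendor_mtbf_hours)
print(f"Implied 5-year survival: {survival_5_years:.3f}")
```

Under those assumptions roughly 80% of units would survive 5 years, so about one in five would not, despite an MTBF that looks like nearly 23 years.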

Now we want to run an ALT to estimate a time to failure distribution for the device. The intent is to use an acceleration model to accelerate the testing and a time to failure model to adjust to our various expected use conditions.

Have you ever wondered why we use the assumption of a constant failure rate? Or considered why we assume our system is ‘in the flat part of the curve [bathtub curve]’?

Where did this silliness first arise?

In part, I lay blame on Mil Hdbk 217 and parts count prediction practices. Yet, there is theoretical support for the notion that for large, complex systems the overall system time to failure will approach an exponential distribution.

Thanks go to Wally Tubell Jr., a professor of systems engineering and test. He recently sent me his analysis of Drenick’s theorem and its connection to the notion of a flat section of a bathtub curve.