At first MTBF seems like a commonly used and useful measure of reliability. Trained as a statistician and understanding the use of the expected value that MTBF represented, I thought, ‘cool, this is useful’.

Then the discussions with engineers, technical sales folks and other professionals about reliability using MTBF started. With them came the awareness that not everyone, and at times it seems very few, truly understood MTBF and how to use the measure properly.

The MTBF calculation is widely used to evaluate the reliability of parts and equipment, and in industry it is usually defined as one of the key performance indicators. This short article is intended to demonstrate in practice how we can fool ourselves by evaluating this indicator in isolation. Continue reading MTBF Paradox: Case Study→

Thanks to a reader that noticed my question on why MTBF came into existence, we have a new (new to me at least) rationale for using MTBF. Basically, MTBF provides clarity on the magnitude of a number, because a number in scientific notation is potentially confusing.

What is doubly concerning is the use of MTTF failure rate values in ISO standards dealing with system safety.

Within maintenance management, MTBF (Mean Time Between Failures) is regarded as the most important KPI after Physical Availability. Unlike MTTF (Mean Time To Failure), which relates directly to available equipment time, MTBF also includes the time spent in repair. That is, the count starts at a given failure and stops only after the fault has been remedied, the equipment restarted, and the failure has occurred again. According to ISO 12849: 2013, this indicator can only be used for repairable equipment, and MTTF is the equivalent for non-repairable equipment.

The graphic below illustrates these occurrences:

To calculate the MTBF in Figure 01, we add the times T1 and T2 and divide by two. That is, we take the average of all the times between one failure and the next, repair time included. It is, therefore, a simple arithmetic calculation. But what does it mean?
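As a minimal sketch of the calculation just described, the Python snippet below averages a list of intervals. The function name `mtbf` and the interval values (in hours) are illustrative assumptions, not data from Figure 01. It also hints at why the number needs care: two very different failure histories can give the same MTBF.

```python
def mtbf(intervals):
    """Arithmetic mean of the observed times between failures."""
    if not intervals:
        raise ValueError("at least one interval is required")
    return sum(intervals) / len(intervals)


# Hypothetical intervals in hours (T1, T2), not taken from the figure.
steady = [190.0, 210.0]
erratic = [20.0, 380.0]

print(mtbf(steady))   # 200.0
print(mtbf(erratic))  # 200.0
```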

Generally speaking, this indicator is taken as a measure of the reliability of assets or asset systems. It can even be tracked for an individual repairable item, although data at that level of detail is rarely available. Maintenance managers set benchmark values and track performance on a chart over time. In general, the higher the MTBF the better, since it means fewer breakdowns and repairs over the period analyzed.

Once we have fixed the concepts, some questions need to be answered:

1. Can we establish periodicity of a maintenance plan based on MTBF time?

2. Can I calculate my failure rates based on my MTBF?

3. Can I calculate my probability of failure based on my MTBF?

4. If the MTBF of my asset or system is 200 hours, after that time will it fail?

It is interesting to answer these questions separately:

1. Can we establish periodicity of a maintenance plan based on MTBF time?

The MTBF is an average number calculated from a set of values. That is, these values can be grouped into a histogram to generate a data distribution where the average value is its MTBF, or the average of the data. Imagine that this distribution follows the Gaussian law and we have a Normal curve that was modeled based on the failure data. The chart below shows that the MTBF is positioned in the middle of the chart.

In a modeled PDF curve (Probability Density Function of failure), the mean value, or MTBF, is reached after 50% of the failures have occurred, because the Normal curve is symmetric about its mean. If we implement the preventive plan with an interval equal to the MTBF, the asset will already have a 50% probability of having failed before each intervention. Therefore, the MTBF is not a number that indicates the optimal time for a scheduled intervention.
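The point can be checked numerically with a short sketch. The Normal model below, with a mean of 200 hours and a standard deviation of 40 hours, is purely an assumption for illustration, as is the 10% target used to pick an earlier interval.

```python
from statistics import NormalDist

# Hypothetical Normal failure model: mean (MTBF) 200 h, standard deviation 40 h.
failure_model = NormalDist(mu=200.0, sigma=40.0)

# Probability that the asset has already failed when t equals the MTBF.
p_fail_at_mtbf = failure_model.cdf(200.0)

# Time by which only 10% of failures are expected (an arbitrary target).
t_10_percent = failure_model.inv_cdf(0.10)

print(p_fail_at_mtbf)          # 0.5
print(round(t_10_percent, 1))  # about 148.7
```

Under these assumed parameters, an interval chosen for a 10% chance of failure falls well before the MTBF.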

2. Can I calculate my failure rates based on my MTBF?

When failure data are modeled to calculate the MTBF, only the exponential distribution gives a fixed value in which the failure rate is the inverse of the MTBF:

MTBF = 1 / λ

In this distribution, the MTBF already corresponds to a 63.2% cumulative probability of failure, since 1 - exp(-1) ≈ 0.632.
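A small sketch of that 63.2% figure, assuming an exponential model and an illustrative MTBF of 200 hours (the function name is hypothetical):

```python
import math


def prob_failure_exponential(t, mtbf):
    """Cumulative probability of failure by time t: F(t) = 1 - exp(-t / MTBF)."""
    failure_rate = 1.0 / mtbf  # constant rate, valid only for the exponential model
    return 1.0 - math.exp(-failure_rate * t)


print(round(prob_failure_exponential(200.0, 200.0), 3))  # 0.632
```

The result at t = MTBF does not depend on the MTBF value chosen; it is always 1 - exp(-1).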

For any model other than the exponential, the failure rate is variable and time dependent, so its calculation also depends on the probability density function f(t) and the reliability function R(t):

λ(t) = h(t) = f(t) / R(t)
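To make the ratio f(t) / R(t) concrete, the sketch below uses a Weibull model. The Weibull is an assumption chosen for illustration (the article does not prescribe it), and the shape `beta` and scale `eta` values are hypothetical. With `beta = 1` the Weibull reduces to the exponential and the hazard is constant; with `beta = 2` it grows with time.

```python
import math


def weibull_pdf(t, beta, eta):
    """Probability density f(t) for a Weibull with shape beta and scale eta."""
    return (beta / eta) * (t / eta) ** (beta - 1) * math.exp(-((t / eta) ** beta))


def weibull_reliability(t, beta, eta):
    """Reliability (survival) function R(t)."""
    return math.exp(-((t / eta) ** beta))


def hazard(t, beta, eta):
    """Failure rate h(t) = f(t) / R(t)."""
    return weibull_pdf(t, beta, eta) / weibull_reliability(t, beta, eta)


for t in (50.0, 100.0, 200.0):
    constant = hazard(t, beta=1.0, eta=200.0)    # exponential case
    increasing = hazard(t, beta=2.0, eta=200.0)  # wear-out behavior
    print(t, round(constant, 4), round(increasing, 4))
```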

Although the exponential distribution, which implies a constant failure rate over time, is the most widely adopted in reliability projects, most assets show varying failure rates along their “bathtub curve”, as exemplified by Moubray:

This means that the exponential expression is not best suited to reflect the behavior of most assets in an industrial plant.

3. Can I calculate my probability of failure based on my MTBF?

As seen above, only the exponential distribution has a constant failure rate that can be calculated as the inverse of the MTBF. In this case, yes, we can calculate the probability of failure of an asset using the formula below:

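f(t) = λ·exp(-λt)

Here f(t) is the probability density; the probability of failure by time t is its integral, F(t) = 1 - exp(-λt).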

For other models, where the failure rate depends on time, the probability of failure can only be calculated by modeling the data and fitting a parametric statistical curve.

4. If the MTBF of my asset or system is 200 hours, after that time will it fail?

The question is, what exactly does that number mean? It was shown above that the MTBF should not be used as a maintenance plan frequency. According to the items explained above, this time means little on its own unless it is compared with its own history over the months. If the parametric model governing the behavior of the assets has not been determined in a reliability study, a time of 200 hours has no meaning as a probability of failure. The MTBF provided by equipment manufacturers is a different case: through life tests they fit exponential curves and thus calculate the time by which 63.2% of the sample will have failed.
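A short sketch shows why 200 hours alone says little. It assumes, for illustration only, three Weibull models with different shape values that all share the same mean of 200 hours, and computes the probability of failure by 200 hours for each. The function names and shape values are hypothetical.

```python
import math


def weibull_scale_for_mean(mean, beta):
    """Scale eta that gives the requested mean: mean = eta * Gamma(1 + 1/beta)."""
    return mean / math.gamma(1.0 + 1.0 / beta)


def prob_failure_by(t, beta, eta):
    """Cumulative probability of failure F(t) = 1 - exp(-(t / eta) ** beta)."""
    return 1.0 - math.exp(-((t / eta) ** beta))


MTBF = 200.0  # hours, the same for every model below
for beta in (0.5, 1.0, 3.0):
    eta = weibull_scale_for_mean(MTBF, beta)
    print(beta, round(prob_failure_by(MTBF, beta, eta), 3))
```

Under these assumptions the probability of having failed by 200 hours comes out near 0.76, 0.63 and 0.51 respectively, even though the MTBF is identical in all three cases.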

I hope the article has helped us to reflect on the definition of an indicator that is so widely used and yet so misunderstood within industrial maintenance management.

Reliability activities serve one purpose, to support better decision making.

That is all it does. Reliability work may reveal design weaknesses, which we can decide to address. Reliability work may estimate the longevity of a device, allowing decisions when compared to objectives for reliability.

Creating a report that no one reads is not the purpose of reliability. Running a test or analysis to simply ‘do reliability’ is not helpful to anyone. Anything with MTBF involved … well, you know how I feel about that. Continue reading Consider the Decision Making First→

Three prototypes survive the gauntlet of stresses and none fail. That is great news, or is it? No failure testing is what I call success testing.

We often want to create a design that is successful, and therefore to enjoy successful test results, i.e., no failures means we are successful, right?

Another aspect of success testing is that, in pass/fail type testing, we can minimize the sample size by planning for all prototypes to pass the test. If we plan on running the test till we have a failure or two, we need more samples. While it improves the statistics of the results, we have to spend more to achieve the results. We nearly always have limited resources for testing.
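The sample size trade-off can be sketched with the common zero-failure (success-run) binomial relation, R^n = 1 - C. Using this relation here is an assumption about the kind of planning meant; the function names and the 90% targets are hypothetical.

```python
import math


def samples_for_zero_failure_test(reliability, confidence):
    """Smallest n for which n passes and no failures demonstrate `reliability`."""
    return math.ceil(math.log(1.0 - confidence) / math.log(reliability))


def demonstrated_reliability(n_passed, confidence):
    """Lower-bound reliability shown by n_passed units with zero failures."""
    return (1.0 - confidence) ** (1.0 / n_passed)


print(samples_for_zero_failure_test(0.90, 0.90))     # 22
print(round(demonstrated_reliability(3, 0.90), 2))   # 0.46
```

Under this relation, three prototypes passing with no failures demonstrate only about 46% reliability at 90% confidence, which is why a clean result on a handful of units is weaker news than it first appears.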

The following note and question appeared in my email the other day. I have given the definition of reliability quite a bit of thought, yet have not really thought much about a definition of ‘product life time’.

So after answering Najib’s question I thought it may make a good conversation starter here. Give it a quick read, and add how you would answer the questions Najib poses. Continue reading Defining a Product Life Time→

Despite standing for the time between failures, MTBF does not represent a duration. Despite having units of hours (months, cycles, etc.), it is not a duration-related metric.

MTBF is a symptom of a bigger problem. It is possibly a lack of interest in reliability. Which I doubt is the case. Or it is a bit of fear of reliability.

Many shy away from the statistics involved. Some simply do not want to know the currently unknown. It could be the fear of potential bad news that the design isn’t reliable enough. Some do not care to know about problems that will require solving.

Whatever the source of the uneasiness, you may know one or more coworkers that would rather not deal with reliability in any direct manner. Continue reading The Fear of Reliability→

What Does Being In The Flat Part of the Curve Mean?

To me it means very little, as it rarely occurs. Products fail for a wide range of reasons and each failure follows its own path to failure.

As you may understand, some failures tend to occur early, some later. Some we call early life failures, out-of-box failures, etc. Some we deem end of life or wear out failures. There are a few that are truly random in nature, such as a drop or accident causing an overstress fracture, for example. Continue reading Being In The Flat Part of the Curve→

The calculation of MTBF results in a larger number if we make a series of MTBF assumptions. We just need more time in the operating hours and fewer failures in the count of failures.

While we really want to understand the reliability performance of field units, we often make a series of small assumptions that impact the accuracy of MTBF estimates.
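As a rough sketch of how such assumptions move the number, the snippet below computes a field MTBF as total operating hours divided by the failure count. Every figure in it (fleet size, hours per day, days in service, failure counts) is hypothetical, as is the function name.

```python
def field_mtbf(units, hours_per_day, days, failures):
    """Total operating hours divided by the number of counted failures."""
    return units * hours_per_day * days / failures


# Assumes every unit runs around the clock and only reported failures count.
optimistic = field_mtbf(units=1000, hours_per_day=24, days=90, failures=10)

# Assumes one working shift per day and includes unreported failures.
realistic = field_mtbf(units=1000, hours_per_day=8, days=90, failures=15)

print(optimistic)  # 216000.0
print(realistic)   # 48000.0
```

Two small assumptions, one about usage and one about the failure count, change the estimate by a factor of 4.5 in this made-up case.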

Here are just a few of these MTBF assumptions that I’ve seen, in some cases nearly all of them with one team. Reliability data has useful information if we gather and treat it well. Continue reading A Series of Unfortunate MTBF Assumptions→

It is Time to Update the Reliability Metric Book with Your Help

Let’s think of this as a crowdsourced project. The first version of this book is a compilation of NoMTBF.com articles. It lays out why we do not want to use MTBF and what to do instead (to some extent).

With your input of success stories, how to make progress using better metrics, and input of examples, stories, case studies, etc. the next version of the book will be much better and much more practical. Continue reading Time to Update the Reliability Metric Book→

Just back from the Reliability and Maintainability Symposium and not happy. While there are signs, a proudly worn button, regular mentions of progress and support, we still talk about reliability using MTBF too often. We need to avoid MTBF actively, no, I mean aggressively.

Let’s get the message out there concerning the folly of using MTBF as a surrogate to discuss reliability. We need to work relentlessly to avoid MTBF on all occasions.

Teaching reliability statistics does not require the teaching of MTBF.

Describing product reliability performance does not benefit by using MTBF.

MTBF use and thinking is still rampant. It affects how our peers and colleagues approach solving problems.

There is a full range of problems that come from using MTBF, yet how do you spot the signs of MTBF thinking even when MTBF is not mentioned? Let’s explore three approaches that you can use to ferret out MTBF thinking and move your organization toward making informed decisions concerning reliability. Continue reading 3 Ways to Expose MTBF Problems→

Over 20 years ago the Assistant Secretary of the Army directed the Army to not use MIL HDBK 217 in a request for proposals, even for guidance. Exceptions, by waiver only.

217 is still around and routinely called out. That is a lot of waivers.

Why are 217 and other parts count database prediction packages still in use? Let’s explore the memo a bit more, plus ponder what is maintaining the popularity of 217 and its ilk. Continue reading The Army Memo to Stop Using Mil HDBK 217→