# Category Archives: MTBF

Mean Time Between Failures or MTBF is a common metric for reliability and is often misused or misunderstood.

# A Quick MTBF Search Reveals Distressing Results

I was preparing to write this article and wondered how many search hits would appear for MTBF. So I opened Google and did an MTBF search. It is a common, if misunderstood, acronym.

Beyond the 5,200,000 Google search results, it was the first page results that got me thinking. Keep in mind that Google often serves up a combination of what it thinks you are seeking and which sites have been useful for others.

Let’s break down what you find when you do an MTBF search. Continue reading MTBF Search Result Sadness

# 4 Questions to Ask When Confronted with MTBF

MTBF comes up a bit too often. When it does I have found rolling my eyes and arguing against using MTBF is not very effective.

So, what should a knowing reliability professional do instead?

Let’s explore four questions that you can ask that may help others find the value in no longer talking about MTBF. Continue reading 4 Questions to Ask When Confronted with MTBF

# Replace After MTTF Time To Avoid Failures – Right?

If we replace an item after a duration equal to the MTTF value, we would avoid failures, right?

Well, no, most likely not, was my response. What is your response? How would you answer this question? Continue reading Replace After MTTF Time To Avoid Failures – Right?
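One way to see why the answer is no, as a sketch that assumes an exponential time to failure distribution with a constant failure rate $\lambda = 1/MTTF$ (an assumption made here for illustration, not a claim about any particular item): the chance that an item fails before reaching an age equal to its MTTF is

$\displaystyle P(T \le MTTF) = 1 - e^{-\lambda \cdot MTTF} = 1 - e^{-1} \approx 0.632$

Under that assumption roughly 63% of items fail before the planned replacement time. For a symmetric time to failure distribution about half of the items fail before the mean, so replacing at the MTTF still does not avoid failures.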

# Yet Another Confused MTBF Definition

Just when I thought we had experienced every possible MTBF definition confusion, here’s another.

This one is courtesy the thread concerning the impact to reliability when adding redundancy to a system. Continue reading Yet Another Confused MTBF Definition

# Yet Another Way to Misunderstand MTBF

In a Q&A forum, the response to a question concerning failure rate and repair times for a redundant system demonstrated yet another person confusing MTBF with something it is not.

The responder mentioned that the reference to repair time implied the need for MTBF as a metric. They then went on to describe MTBF as the duration of repair time, which should not change for a redundant system compared with a non-redundant system. Continue reading Another Way to Spot Someone Confusing MTBF

# A Few Simple Ideas to Improve Your Reliability Program

Spending too much on reliability and not getting the results you expect? Just getting started and not sure where to focus your reliability program? Or, just looking for ways to improve your program?

There is not one way to build an effective reliability program. The variations in industries, expectations, technology, and the many constraints, shape each program. Here are three suggestions you can apply to any program at any time. These are not quick fix solutions, nor will you see immediate results, yet each will significantly improve your reliability program and help you achieve the results you and your customers expect. Continue reading 3 Ways to Improve your Reliability Program

# 5 Clues Using MTBF is Not Helping

Have you ever heard the claim that “We use MTBF, as it’s working just fine”?

They may be profitable and successful in the marketplace. Is MTBF serving them well?

Probably not. Continue reading 5 Clues Using MTBF is Not Helping

# The Magic Math of Meeting MTBF Requirements

Recently heard from a reader of NoMTBF. She wondered about a supplier’s argument that they meet the reliability or MTBF requirements. She was right to wonder.

Estimating the reliability performance of a new design is difficult.

There are good and better practices to justify claims about future reliability performance. Likewise, there are just plain poor approaches, too. Plus there are approaches that should never be used.

## The Vendor Calculation to Support Claim They Meet Reliability Objective

Let’s say we contract with a vendor to create a navigation system for our vehicle. The specification includes functional requirements. Also it includes form factor and a long list of other requirements. It also clearly states the reliability specification. Let’s say the unit should last with 95% probability over 10 years of use within our vehicle. We provide environmental and function requirements in detail.

The vendor first converts the 95% probability of success over 10 years into MTBF, claiming they are ‘more familiar’ with MTBF. They ignore the requirement for the probability of success over the first month of operation. Likewise they ignore the 5 year targeted reliability or, as they would convert it, MTBF requirement.

[Note: if you were tempted to calculate the equivalent MTBF, please don’t. It’s not useful, nor relevant, and a poor practice. Suffice it to say it would be a large and meaningless number]

RED FLAG Converting the requirement into MTBF suggests they may be making simplifying assumptions. This may permit easier use of estimation, modeling, and testing approaches.

## The Vendor’s Approach to ‘Prove’ They Meet the MTBF Requirement

The vendor reported they met the reliability requirement using the following logic:

Of the 1,000 (more actually) components we selected 6 at random for accelerated life testing. We estimated the lower 60% confidence of the probability of surviving 10 years given the ALT results. Then converted the ALT results to MTBF for the part.

We then added the Mil Hdbk 217 failure rate estimate to the ALT result for each of the 6 parts.

RED FLAG This one has me wondering about the rationale for adding the failure rates from an ALT and a parts count prediction. It would make the failure rate higher. Maybe it was a means to add a bit of margin to cover the uncertainty? I’m not sure. Do you have any idea why someone would do this? Are they assuming the ALT did not actually measure anything relevant or any specific failure mechanisms, or that they used a benign stress? ALT details were not provided.

## The Approach Gets Weird Here

Then we use a 217 parts count prediction along with the modified 6 component failure rates to estimate the system failure rate, and with a simple inversion estimated the MTBF. They then claimed the system design will meet the field reliability performance requirements.
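As a sketch of what that “simple inversion” amounts to, assuming a series system in which every part has a constant failure rate $\lambda_i$ (the assumptions a parts count prediction relies on; the symbols here are illustrative and not taken from the vendor’s report):

$\displaystyle \lambda_{system} = \sum_{i=1}^{n} \lambda_i, \qquad MTBF = \frac{1}{\lambda_{system}}$

The arithmetic is trivial. The result is only as good as the individual failure rates and the constant failure rate assumption behind them.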

RED FLAG Mil HDBK 217 F in section 3.3 states

Hence, a reliability prediction should never be assumed to represent the expected field reliability …

If you are going to use a standard, any standard, read it. Read it to understand when and why it is useful or not useful.

## What Should the Vendor Have Done Instead?

There are a lot of ways to create a new design and meet reliability requirements.

• The build, test, fix approach or reliability growth approach works well in many circumstances.
• Using failure data from similar, actually fielded systems. It may provide a reasonable bound for an estimate of a new system. It may also limit the focus of the accelerated testing to only the novel or high risk areas of the new design, given much of the design is (or may be) similar to past products.
• Using a simple reliability block diagram or fault tree analysis model to assemble the estimates, test results, and engineering stress/strength analysis (all better estimation tools than parts count, in my opinion) and calculate a system reliability estimate.
• Using a risk of failure approach with FMEA and HALT to identify the likely failure mechanisms then characterize those mechanisms to determine their time to failure distributions. If there is one or a few dominant failure mechanisms, that work would provide a reasonable estimate of the system reliability.

In all cases focus on failure mechanisms and how the time to failure distribution changes given changes in stress / environment / use conditions. Monte Carlo may provide a suitable means to analyze a great mixture of data to determine an estimate. Use reliability, the probability of success over a duration, as the measure.
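A minimal Monte Carlo sketch of that idea follows. It assumes a series system in which the first failure mechanism to occur fails the unit, and that each mechanism follows a Weibull time to failure distribution. The mechanism names and every parameter value are hypothetical, chosen only to illustrate the calculation. In practice they would come from ALT results, field data, or stress/strength analysis.

```python
import math
import random

# Hypothetical failure mechanisms: (name, Weibull shape beta, Weibull scale eta in years).
# These values are illustrative assumptions, not data from any real product.
MECHANISMS = [
    ("solder joint fatigue", 2.5, 60.0),
    ("capacitor wear out", 3.0, 45.0),
    ("connector corrosion", 1.2, 400.0),
]


def simulate_reliability(mission_years, trials, seed=1):
    """Estimate the probability that a unit survives mission_years."""
    rng = random.Random(seed)
    survivors = 0
    for _ in range(trials):
        # The unit fails when its earliest failure mechanism occurs.
        # random.weibullvariate takes (scale, shape) in that order.
        time_to_failure = min(
            rng.weibullvariate(eta, beta) for _, beta, eta in MECHANISMS
        )
        if time_to_failure > mission_years:
            survivors += 1
    return survivors / trials


def analytic_reliability(mission_years):
    """Closed-form check for independent Weibull mechanisms in series."""
    exponent = sum((mission_years / eta) ** beta for _, beta, eta in MECHANISMS)
    return math.exp(-exponent)


if __name__ == "__main__":
    estimate = simulate_reliability(mission_years=10.0, trials=20_000)
    print(f"Simulated 10 year reliability: {estimate:.3f}")
    print(f"Analytic 10 year reliability:  {analytic_reliability(10.0):.3f}")
```

The closed-form check exists only because this toy model is simple. The simulation earns its keep when the inputs are a mixture of distributions, use profiles, and stress conditions with no tidy formula. Either way the output is a reliability, a probability of success over a duration, which compares directly with a requirement such as 95% over 10 years.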

In short, do the work to understand the design, its weaknesses, the time to failure behavior under different use/condition scenarios, and make justifiable assumptions only when necessary.

## Summary

We engage vendors to supply custom subsystems given their expertise and ability to deliver the units we need for our vehicle. We expect them to justify that they meet reliability requirements in a rational and defensible manner. While we do not want to dictate the approach to the design or the estimate of reliability performance, we certainly have to judge the acceptability of the claims that they meet the requirements.

What do you report when a customer asks if your product will meet the reliability requirements? Add to the list of possible approaches in the comments section below.


# How to Properly Calculate System Availability

Recently received a request for my opinion concerning the calculation of system availability using the classic formula

$\displaystyle A=\frac{MTBF}{MTBF+MTTR}$

The work is to create a set of goals for various suppliers and contractors to achieve. The calculation values derive from vendor data sheets and available information concerning MTBF and MTTR. The project is in the design phase, thus they do not have working systems available to measure actual availability.
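For reference, the classic formula is a one-line calculation. The sketch below uses hypothetical data sheet values for illustration only. The function name and the numbers are assumptions, not figures from the request.

```python
def availability(mtbf_hours, mttr_hours):
    """Steady-state availability A = MTBF / (MTBF + MTTR)."""
    if mtbf_hours <= 0 or mttr_hours < 0:
        raise ValueError("MTBF must be positive and MTTR must be non-negative")
    return mtbf_hours / (mtbf_hours + mttr_hours)


# Hypothetical data sheet values, for illustration only.
a = availability(mtbf_hours=10_000.0, mttr_hours=8.0)
print(f"A = {a:.5f}")
```

The result inherits every weakness of the MTBF and MTTR values fed into it, which is the concern with building supplier goals on data sheet numbers alone.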

How would you go about improving on this approach? Continue reading Calculating System Availability

# Considerations When You Calculate MTBF

It is deceptively easy to calculate MTBF given a count of failures and an estimate of operating hours. Just tally up the total hours the various systems operate and divide by the number of failures. Easy.

This simple calculation is the unbiased estimator of theta (MTBF), which is the inverse of the parameter lambda of the exponential distribution. We use theta to represent 1 / lambda.
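The calculation itself can be sketched in a few lines. The operating hours and failure count below are made-up numbers for illustration, and the function name is an assumption.

```python
def mtbf_point_estimate(operating_hours, failures):
    """Total operating hours divided by the number of failures (theta hat)."""
    if failures <= 0:
        raise ValueError("need at least one failure for this estimate")
    return sum(operating_hours) / failures


# Hypothetical operating hours for four systems, with 3 failures among them.
hours = [1200.0, 950.0, 2000.0, 1850.0]
theta_hat = mtbf_point_estimate(hours, failures=3)
lambda_hat = 1.0 / theta_hat  # estimated failure rate per hour

print(f"theta (MTBF) estimate: {theta_hat:.0f} hours")
print(f"lambda estimate: {lambda_hat:.6f} failures per hour")
```

Note that the estimate discards when each failure occurred. Only the totals survive.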

What could go wrong with such a simple calculation?

## What is a failure?

Let’s start with what we count or do not count as a failure. This directly changes the resulting MTBF value. If we only count confirmed hardware failures, and do not count intermittent or unreproducible or software failures, are we under counting what the customer experiences as a failure?

Over what duration do we count the failures? Should we focus only on the first month of operation, the first year, the warranty or service contract period or the entire operating life of the system? How do you calculate MTBF?

Some organizations only count failures they expect to occur. The unexpected ones are ‘special’ causes and require further study before counting as failure officially.

Another organization only counted failures that completely shut down the system. A partial loss of functionality, a degradation of capability, or the failure of a redundant element did not count as a system failure.

In my opinion if the customer calls it a failure, it’s a failure. If a failure, by any definition, costs your organization time and money to address, acknowledge, resolve or repair, it’s a failure.

## What is operating time?

This one is tricky. If the system does include the appropriate sensors and tracking mechanisms (hour meter) and a way to gather that operating time of units both failed and still operating, then we have a pretty good way to track total operating hours. Some situations and systems make this easy.

Most do not.

Let’s say we ship 100 systems a month for 10 months. At the end of ten months the first shipments have accumulated 10 months of operating time. IF….

… They are all placed into service immediately

… They are all operated full time for the full 10 months

… They each have every failure reported, including down time

In general, we do have to make a few assumptions to determine the operating time for shipped systems. We tend to be conservative and err on the side that would make the MTBF value a little smaller than if we had the full set of carefully tracked data. Or do we?
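To see how much those assumptions matter, here is a rough sketch of the shipment example. It assumes 730 hours per month and that each month’s units ship at the start of the month. The installation delay and duty cycle in the second call are hypothetical values chosen only to show the sensitivity.

```python
HOURS_PER_MONTH = 730.0  # assumption: roughly 8760 hours per year / 12


def fleet_operating_hours(units_per_month, months,
                          install_delay_months=0.0, duty_cycle=1.0):
    """Total fleet operating hours at the end of the shipping period."""
    total_unit_months = 0.0
    for shipped_month in range(months):
        # Months elapsed since this batch shipped, measured at the end of the period.
        age_months = months - shipped_month
        in_service_months = max(age_months - install_delay_months, 0.0)
        total_unit_months += units_per_month * in_service_months
    return total_unit_months * HOURS_PER_MONTH * duty_cycle


# Every unit placed in service immediately and operated full time.
optimistic = fleet_operating_hours(100, 10)

# Hypothetical: one month to install, then operated half of the time.
adjusted = fleet_operating_hours(100, 10, install_delay_months=1.0, duty_cycle=0.5)

print(f"Optimistic operating hours: {optimistic:,.0f}")
print(f"Adjusted operating hours:   {adjusted:,.0f}")
```

With the same number of failures, the optimistic hour count produces an MTBF more than twice as large as the adjusted one. The convenient assumptions inflate the result rather than err on the conservative side.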

• Some organizations count from the date/time of shipment, ignoring shipping and installation time.
• Some organizations assume all systems are installed and operated 24/7.
• Some organizations assume no news is good news and the systems with no information are still operating.

And a few organizations assume systems run indefinitely. Even for systems 20 years old, unless notified that a system is decommissioned, they assume it is still running full tilt, i.e., there is no retirement or replacement policy.

## How about when you calculate MTBF?

By convention when there are no failures we assume in the next instant there will be one failure. This avoids dividing by zero, which causes fits for calculators and spreadsheets and mathematicians.
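A sketch of that convention, with a hypothetical function name and made-up hours:

```python
def mtbf_with_zero_failure_convention(total_hours, failures):
    """MTBF estimate that treats zero observed failures as one failure."""
    if failures < 0:
        raise ValueError("failures cannot be negative")
    # Convention described above: with no failures, assume one occurs in the next instant.
    effective_failures = failures if failures > 0 else 1
    return total_hours / effective_failures


print(mtbf_with_zero_failure_convention(5000.0, 0))  # no failures observed
print(mtbf_with_zero_failure_convention(5000.0, 2))  # two failures observed
```

With no failures the reported MTBF is simply the accumulated operating time, so it says more about how long we have watched than about the product.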

Another issue is how often are the calculations made? Do we gather data hourly, daily, weekly, monthly, annually? Some use a rolling set of data, for example only units shipped in the last year count for both operating time and failures. This result will ignore or discount the longer term wear out failures as the bulk of the units are young.

Some organizations do the calculations weekly in order to detect trends. If there are trends you probably should not be using MTBF. If it’s changing, if there are early life or wear out failure mechanisms, you should not be using MTBF.

Even though you can calculate MTBF easily, and even if you work through the complexities of getting it right, the result is still not a useful metric. Instead focus on getting better data, including time to failure information, so you can explore and report the data with other tools and methods. Treat the data appropriately and make better decisions.

Sure, better data will also improve your ability to calculate the MTBF value. If you’d like to be like some organizations, that is fine.

How have you seen MTBF calculated poorly? Share your thoughts and stories in the comments below.


# Fred talks about the NoMTBF blog and movement

Did you know I was interviewed for the Dare to Know podcast? The interview was fun, check it out here.

## The Dare to Know podcast Interview

Tim Rodgers interviews Fred Schenkelberg concerning his blog, No MTBF and his mission to eradicate the common mis-use of MTBF.

Fred Schenkelberg is a reliability engineering and management consultant at FMS Reliability. He’s a lecturer at the University of Maryland, and he’s been an active contributor to both the IEEE and the ASQ Reliability Divisions.

Fred re-established Hewlett Packard’s corporate reliability program in the late 1990s, and also worked as a reliability consultant at Microsoft and a manufacturing engineer at Raychem.

In this episode, Fred Schenkelberg discusses:

• Why MTBF is a poor reliability metric
• Common objections to eliminating MTBF
• Alternatives to MTBF

# MTBF: According to a Component Supplier

A reader sent me an excerpt of a document found on Vicor’s site.

“Reliability is quantified as MTBF (Mean Time Between Failures) for repairable product and MTTF (Mean Time To Failure) for non-repairable product. A correct understanding of MTBF is important. A power supply with an MTBF of 40,000 hours does not mean that the power supply should last for an average of 40,000 hours. According to the theory behind the statistics of confidence intervals, the statistical average becomes the true average as the number of samples increase. An MTBF of 40,000 hours, or 1 year for 1 module, becomes 40,000/2 for two modules and 40,000/4 for four modules…”

source: http://www.vicorpower.com/documents/quality/Rel_MTBF.pdf

The excerpt came with the following note and question:

“In my opinion this is completely wrong but as I’m fledgling in this subject I’m sensitive to any statements like this.

Could you be so kind and help me a bit on it?”

# WIIFT and Reliability Measures

WIIFT is “what’s in it for them”. It is similar to “what’s in it for me”, yet the focus is on the value you are providing your audience.

As a reliability engineer you collect, analyze and report reliability measures. You report reliability estimates or results. Do you know how your audience is going to use this information?

Consider WIIFT when reporting reliability. Continue reading Considering WIIFT When Reporting Reliability

# What if all failures occurred truly randomly?

The math would be easier.

The exponential distribution would be the only time to failure distribution. We wouldn’t need Weibull or other complex multi parameter models. Knowing the failure rate for an hour would be all we would need to know, over any time frame.

Sample size and test planning would be simpler. Just run the samples at hand long enough to accumulate enough hours to provide a reasonable estimate for the failure rate.

## Would the Design Process Change?

Yes, I suppose it would. The effects of early life and wear out would not exist. Once a product is placed into service the chance to fail in the first hour would be the same as in any other hour of its operation. It would fail eventually, and the chance of failing before a year would solely depend on the chance of failure per hour.

A higher failure rate would suggest a lower chance of surviving very long. Yet a product could still fail in the first hour of use, and if it had survived for one million hours, its chance to fail in the next hour would still be the same.
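This is the memoryless property of the exponential distribution. As a short derivation, with $T$ the time to failure, $\lambda$ the constant failure rate, $t$ the age already survived, and $s$ the additional time of interest:

$\displaystyle P(T > t + s \mid T > t) = \frac{e^{-\lambda (t+s)}}{e^{-\lambda t}} = e^{-\lambda s}$

The age $t$ cancels out. The chance of surviving the next $s$ hours is the same for a brand new unit and for one that has already run a million hours.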

## Would Warranty Make Sense?

Since by design we cannot create a product with a low initial failure rate we would only focus on the overall failure rate. Or the chance of failing over any hour, the first hour being convenient and easy to test, yet still meaningful. Any single failure in a customer’s hands could occur at any time and would not alone suggest the failure rate has changed.

Maybe a warranty would make sense based on customer satisfaction. We could estimate the number of failures over a time period and set aside funds for warranty expenses. I suppose it would place a burden on the design team to create products with a lower failure rate per hour. Maybe warranty would still make sense.
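In this make believe world the warranty estimate is a short calculation. The sketch below assumes a constant failure rate and counts each unit at most once (the first failure within the warranty period). The fleet size, failure rate, and cost per claim are hypothetical values for illustration.

```python
import math


def expected_warranty_failures(units, failure_rate_per_hour, warranty_hours):
    """Expected number of units that fail at least once during the warranty."""
    # Probability a single unit fails before the warranty ends: 1 - exp(-lambda * t).
    p_fail = 1.0 - math.exp(-failure_rate_per_hour * warranty_hours)
    return units * p_fail


# Hypothetical values: 1,000 units, 1e-5 failures per hour, one year of continuous use.
failures = expected_warranty_failures(1000, 1e-5, 8760.0)
reserve = failures * 250.0  # hypothetical cost per warranty claim

print(f"Expected warranty failures: {failures:.1f}")
print(f"Warranty reserve: {reserve:,.0f}")
```

In our world the failure rate changes with age, so the same estimate would need the actual time to failure distribution over the warranty period.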

If there are no wear out mechanisms (this is a make believe world) changing the oil in your car would not make any economic sense. The existing oil has the same chance of engine seize failure as any new oil. The lubricant doesn’t breakdown. Seals do not leak. Metal on metal movement doesn’t cause damaging heat or abrasion.

You may have to replace a car tire due to a nail puncture, yet an accident due to worn tire tread would not occur any more often than with new tires. We wouldn’t need to monitor tire tread or brake pad wear. Those wouldn’t occur.

If a motor is running now, if we know the failure rate we can calculate the chance of running for the rest of the shift, even when the motor is as old as the building.

The concepts of reliability centered maintenance or predictive maintenance or even preventative maintenance would not make sense. There would be no advantage to swapping a part for a new one, as the chance to fail would remain the same.

## Physics of Failure and Prognostic Health Management – Would They Make Sense?

Understanding failure mechanisms so we could reduce the chance of failure would remain important. Yet when failure mechanisms do not

• Accumulate damage
• Drift
• Wear
• Diffuse
• Etc.

then much of the predictive power of PoF and PHM would not be relevant. We wouldn’t need sensors to monitor conditions that lead to failure, as no specific failure would show a sign or indication before it occurred. Nothing would indicate an item was about to fail, as that would imply its chance of failure had changed.

No more tune-ups or inspections, we would pursue repairs when a failure occurs, not before.

A world of random failures, or a world of failures each of which occurs at a constant rate would be quite different than our world. So, why do we so often make this assumption?