Category Archives: MTBF

Mean Time Between Failures or MTBF is a common metric for reliability and is often misused or misunderstood.

Sample Size and Duration and MTBF


If you have been a reliability engineer for a week or more, or worked with a reliability engineer for a day or more, someone has asked you about test planning. The conversation may have started with “how many samples and how long will the test take?”

You have heard the sample size question.


Learn to Notice MTBF Everyday


Did you notice the speed limit signs in your neighborhood today?

If you are like me, you went about your commute or regular travels relatively blind. You watched for the neighbor’s dog that jumped into the road last week, yet didn’t register seeing the speed limit sign.

It’s a cognitive burden to notice the mundane or known.

The 3 Best Reasons to Use MTBF


This may seem an odd article for the NoMTBF site. Stay with me for a moment longer.

Over the years of speaking out on the perils of MTBF, there has been some pushback. A few defend using MTBF. Here are three of the most common (maybe not exactly the best, per se) reasons to use MTBF.

Illuminating MTBF’s Lack of Information


Here’s a simple illustration of how MTBF oversimplifies data, concealing essential information.

By convention, we tend to use MTBF for repairable data. That is fine.

You may also be aware of my dislike for the use of MTBF, for many different reasons. If you find yourself urging your organization, customer, industry, or whomever to stop using MTBF, you may want to use this simple example to illustrate the ‘value’ of MTBF.

MTBF Search Result Sadness

A Quick MTBF Search Reveals Distressing Results

I was preparing to write this article and wondered how many search hits would appear for MTBF. So I opened Google and did an MTBF search. It is a common, if misunderstood, acronym.

Beyond the 5,200,000 Google search results, it was the first page results that got me thinking. Keep in mind that Google often serves up a combination of what it thinks you are seeking and which sites have been useful for others.

Let’s break down what you find when you do an MTBF search.

4 Questions to Ask When Confronted with MTBF


MTBF comes up a bit too often. When it does, I have found that rolling my eyes and arguing against using MTBF is not very effective.

So, what should a knowing reliability professional do instead?

Let’s explore four questions that you can ask that may help others find the value in no longer talking about MTBF.

Replace After MTTF Time To Avoid Failures – Right?


Received a short question last week. The person writing seems to already know the answer, yet asked:

If we replace an item after a duration equal to the MTTF value, we would avoid failures, right?

Well, no, most likely not, was my response. What is your response? How would you answer this question?
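One way to check the intuition is with a minimal sketch. It assumes times to failure follow a Weibull distribution (an assumption; the question did not name a distribution) and computes the fraction of items expected to fail before a replacement made at the MTTF. The function name and the shape values are illustrative only.

```python
import math

def fraction_failed_by_mttf(beta):
    """Fraction of items failing before age = MTTF for Weibull shape beta.

    Assumes Weibull times to failure. The mean is eta * gamma(1 + 1/beta),
    so at t = MTTF the ratio t / eta is gamma(1 + 1/beta) and eta cancels.
    """
    t_over_eta = math.gamma(1.0 + 1.0 / beta)
    return 1.0 - math.exp(-(t_over_eta ** beta))

# Illustrative shapes: decreasing, constant, and increasing failure rate.
for beta in (0.5, 1.0, 3.0):
    print(f"beta={beta}: {fraction_failed_by_mttf(beta):.1%} fail before the MTTF")
```

Under the constant failure rate assumption (beta = 1), about 63% of items fail before reaching the MTTF, so replacing at the MTTF does not avoid failures.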

Another Way to Spot Someone Confusing MTBF

Yet Another Way to Misunderstand MTBF

In a Q&A forum, the response to a question concerning failure rate and repair times for a redundant system demonstrated yet another person confusing MTBF with something it is not.

The responder mentioned that the reference to repair time implied the need for MTBF as a metric, then went on to describe MTBF as the duration of repair time, which should not change for a redundant system compared with a non-redundant system.

3 Ways to Improve your Reliability Program

A Few Simple Ideas to Improve Your Reliability Program

Spending too much on reliability and not getting the results you expect? Just getting started and not sure where to focus your reliability program? Or just looking for ways to improve your program?

There is not one way to build an effective reliability program. The variations in industries, expectations, technology, and the many constraints shape each program. Here are three suggestions you can apply to any program at any time. These are not quick fix solutions, nor will you see immediate results, yet each will significantly improve your reliability program and help you achieve the results you and your customers expect.

The Magic Math of Meeting MTBF Requirements


Recently heard from a reader of NoMTBF. She wondered about a supplier’s argument that they meet the reliability or MTBF requirements. She was right to wonder.

Estimating the reliability performance of a new design is difficult.

There are good and better practices to justify claims about future reliability performance. Likewise, there are just plain poor approaches, too. Plus, there are approaches that should never be used.

The Vendor Calculation to Support Claim They Meet Reliability Objective

Let’s say we contract with a vendor to create a navigation system for our vehicle. The specification includes functional requirements. Also it includes form factor and a long list of other requirements. It also clearly states the reliability specification. Let’s say the unit should last with 95% probability over 10 years of use within our vehicle. We provide environmental and function requirements in detail.

The vendor first converts the 95% probability of success over 10 years into MTBF, claiming they are ‘more familiar’ with MTBF. They ignore the requirements for probability of first month of operation success. Likewise, they ignore the 5 year targeted reliability, or as they would convert, MTBF requirements.

[Note: if you were tempted to calculate the equivalent MTBF, please don’t. It’s not useful, nor relevant, and a poor practice. Suffice it to say it would be a large and meaningless number]

RED FLAG Converting the requirement into MTBF suggests they may be making simplifying assumptions. This may permit easier use of estimation, modeling, and testing approaches.

The Vendor’s Approach to ‘Prove’ They Meet the MTBF Requirement

The vendor reported they met the reliability requirement using the following logic:

Of the 1,000 (more actually) components we selected 6 at random for accelerated life testing. We estimated the lower 60% confidence of the probability of surviving 10 years given the ALT results. Then converted the ALT results to MTBF for the part.

We then added the MIL-HDBK-217 failure rate estimate to the ALT result for each of the 6 parts.

RED FLAG This one has me wondering about the rationale for adding the failure rates from an ALT and a parts count prediction. It would make the failure rate higher. Maybe it was a means to add a bit of margin to cover the uncertainty? I’m not sure. Do you have any idea why someone would do this? Are they assuming the ALT did not actually measure anything relevant or any specific failure mechanisms, or that they used a benign stress? ALT details were not provided.

The Approach Gets Weird Here

Then we use a 217 parts count prediction along with the modified 6 component failure rates to estimate the system failure rate, and with a simple inversion estimated the MTBF. They then claimed the system design will meet the field reliability performance requirements.
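The arithmetic the vendor describes can be sketched as follows. This is a minimal illustration under the constant failure rate assumption that a parts count prediction relies on. The failure rate values and the function name are hypothetical, not the vendor's data. Reproducing the arithmetic does not turn the result into an estimate of field reliability.

```python
def parts_count_mtbf(failure_rates_per_hour):
    """Sum part failure rates for a series system and invert to get MTBF.

    Only meaningful if every part has a constant failure rate
    (exponential times to failure) and any part failure fails the system.
    """
    system_rate = sum(failure_rates_per_hour)
    return 1.0 / system_rate

# Hypothetical failure rates (failures per hour), for illustration only.
rates = [2e-7, 5e-7, 3e-7]
print(f"System MTBF: {parts_count_mtbf(rates):,.0f} hours")
```

Every input here is a single number per part, so no information about early life or wear out behavior survives the sum and the inversion.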

RED FLAG MIL-HDBK-217F, in section 3.3, states:

Hence, a reliability prediction should never be assumed to represent the expected field reliability …

If you are going to use a standard, any standard, you should read it. Read it to understand when and why it is useful or not useful.

What Should the Vendor Have Done Instead?

There are a lot of ways to create a new design and meet reliability requirements.

  • The build, test, fix approach or reliability growth approach works well in many circumstances.
  • Using failure data from similar, actually fielded systems. It may provide a reasonable bound for an estimate of a new system. It may also limit the focus of the accelerated testing to only the novel or new or high risk areas of the new design, given much of the design is (or may be) similar to past products.
  • Using a simple reliability block diagram or fault tree analysis model to assemble the estimates, test results, and engineering stress/strength analysis (all better estimation tools than parts count, in my opinion) and calculate a system reliability estimate.
  • Using a risk of failure approach with FMEA and HALT to identify the likely failure mechanisms then characterize those mechanisms to determine their time to failure distributions. If there is one or a few dominant failure mechanisms, that work would provide a reasonable estimate of the system reliability.

In all cases, focus on failure mechanisms and how the time to failure distribution changes given changes in stress / environment / use conditions. Monte Carlo may provide a suitable means to analyze a great mixture of data to determine an estimate. Use reliability: the probability of success over a duration.
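As a minimal sketch of the Monte Carlo idea, the example below simulates a unit with a few independent failure mechanisms, each with its own Weibull time to failure distribution, and estimates the probability of surviving a duration. The mechanisms, their parameters, and the function names are hypothetical, and independence between mechanisms is an assumption. A closed-form check is included because this simple case has one; the simulation earns its keep when the mix of data does not.

```python
import math
import random

def simulate_reliability(mechanisms, duration, trials=20000, seed=1):
    """Estimate P(no failure before `duration`) for competing mechanisms.

    `mechanisms` is a list of (eta, beta) Weibull scale/shape pairs, one per
    failure mechanism. A unit fails at the earliest mechanism failure time.
    """
    rng = random.Random(seed)
    survived = 0
    for _ in range(trials):
        first_failure = min(rng.weibullvariate(eta, beta) for eta, beta in mechanisms)
        if first_failure > duration:
            survived += 1
    return survived / trials

def analytic_reliability(mechanisms, duration):
    """Closed-form check: independent mechanisms in series multiply."""
    return math.exp(-sum((duration / eta) ** beta for eta, beta in mechanisms))

# Hypothetical mechanisms: (characteristic life in years, Weibull shape).
mechanisms = [(60.0, 2.5), (400.0, 1.0), (35.0, 4.0)]
print(simulate_reliability(mechanisms, duration=10.0))
print(analytic_reliability(mechanisms, duration=10.0))
```

The result is a reliability, a probability of success over 10 years, which can be compared directly with a requirement stated the same way.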

In short, do the work to understand the design, its weaknesses, the time to failure behavior under different use/condition scenarios, and make justifiable assumptions only when necessary.

Summary

We engage vendors to supply custom subsystems given their expertise and ability to deliver the units we need for our vehicle. We expect them to justify that they meet reliability requirements in a rational and defensible manner. While we do not want to dictate the approach to the design or the estimate of reliability performance, we certainly have to judge the acceptability of the claims that they meet the requirements.

What do you report when a customer asks if your product will meet the reliability requirements? Add to the list of possible approaches in the comments section below.

Related

How to Calculate MTBF

Questions to ask a vendor

MTBF: According to a Component Supplier

Calculating System Availability

How to Properly Calculate System Availability

Recently received a request for my opinion concerning the calculation of system availability using the classic formula

\displaystyle A=\frac{MTBF}{MTBF+MTTR}

The work is to create a set of goals for various suppliers and contractors to achieve. The calculation values derive from vendor data sheets and available information concerning MTBF and MTTR. The project is in the design phase, thus they do not have working systems available to measure actual availability.
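For reference, the classic formula can be sketched as below. The subsystem values are hypothetical stand-ins for data sheet numbers, the function names are illustrative, and treating the system as a series of independent subsystems is an assumption, not something the request stated.

```python
def availability(mtbf_hours, mttr_hours):
    """Steady-state availability from A = MTBF / (MTBF + MTTR)."""
    return mtbf_hours / (mtbf_hours + mttr_hours)

def series_availability(elements):
    """Availability when every element must be up (independence assumed)."""
    result = 1.0
    for mtbf_hours, mttr_hours in elements:
        result *= availability(mtbf_hours, mttr_hours)
    return result

# Hypothetical data sheet values: (MTBF hours, MTTR hours) per subsystem.
subsystems = [(5000.0, 8.0), (12000.0, 4.0), (800.0, 2.0)]
print(f"{series_availability(subsystems):.4f}")
```

The output is only as good as the MTBF and MTTR inputs, which is where the approach deserves scrutiny.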

How would you go about improving on this approach?

How to Calculate MTBF

Considerations When You Calculate MTBF

It is deceptively easy to calculate MTBF given a count of failures and an estimate of operating hours. Just tally up the total hours the various systems operate and divide by the number of failures. Easy.

This simple calculation is the unbiased estimator of theta (MTBF), the inverse of the parameter lambda of the exponential distribution. We use theta to represent 1 / lambda.
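The tally can be sketched in a few lines. The operating hours and failure count below are hypothetical, and the function name is illustrative.

```python
def mtbf_point_estimate(operating_hours, failure_count):
    """Total operating hours across all systems divided by the failure count."""
    if failure_count <= 0:
        raise ValueError("failure_count must be positive")
    return sum(operating_hours) / failure_count

# Hypothetical operating hours for four systems, with three failures in total.
hours = [1200.0, 950.0, 2000.0, 1850.0]
theta = mtbf_point_estimate(hours, failure_count=3)
print(theta)        # estimate of theta (MTBF), in hours
print(1.0 / theta)  # corresponding estimate of lambda, failures per hour
```

Note that the individual times to failure never enter the calculation, only the totals.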

What could go wrong with such a simple calculation?

What is a failure?

Let’s start with what we count or do not count as a failure. This directly changes the resulting MTBF value. If we only count confirmed hardware failures, and do not count intermittent or unreproducible or software failures, are we undercounting what the customer experiences as a failure?

Over what duration do we count the failures? Should we focus only on the first month of operation, the first year, the warranty or service contract period or the entire operating life of the system? How do you calculate MTBF?

Some organizations only count failures they expect to occur. The unexpected ones are ‘special’ causes and require further study before officially counting as failures.

Another organization only counted failures that completely shut down the system. A partial loss of functionality, a degradation of capability, or the failure of a redundant element did not count as a system failure.

In my opinion, if the customer calls it a failure, it’s a failure. If a failure, by any definition, costs your organization time and money to address, acknowledge, resolve or repair, it’s a failure.

What is operating time?

This one is tricky. If the system does include the appropriate sensors and tracking mechanisms (hour meter) and a way to gather that operating time of units both failed and still operating, then we have a pretty good way to track total operating hours. Some situations and systems make this easy.

Most do not.

Let’s say we ship 100 systems a month for 10 months. At the end of ten months the first shipments have accumulated 10 months of operating time. IF….

… They are all placed into service immediately

… They are all operated full time for the full 10 months

… They each have every failure reported, including down time

In general, we do have to make a few assumptions to determine the operating time for shipped systems. We tend to be conservative and err on the side that would make the MTBF value a little smaller than if we had the full set of carefully tracked data. Or do we?

  • Some organizations count from the date/time of shipment, ignoring shipping and installation time.
  • Some organizations assume all systems are installed and operated 24/7.
  • Some organizations assume no news is good news and the systems with no information are still operating.

And a few organizations assume systems run indefinitely: unless notified that a system is decommissioned, they assume it is still running full tilt, even if it is 20 years old, i.e., there is no retirement or replacement policy.
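To see how much these assumptions matter, here is a minimal sketch using the shipment example of 100 systems a month for 10 months. The one month installation delay, the 50% duty cycle, and the failure count are hypothetical values chosen only for illustration, as is the function name.

```python
def total_unit_months(units_per_month, months, install_delay_months=0.0, duty_cycle=1.0):
    """Accumulated operating time, in unit-months, at the end of `months`.

    Batch k (k = 1..months) ships at the start of month k, so it has been
    in the field for months - k + 1 months at the end of the period.
    """
    total = 0.0
    for k in range(1, months + 1):
        months_in_field = months - k + 1
        months_operating = max(0.0, months_in_field - install_delay_months)
        total += units_per_month * months_operating * duty_cycle
    return total

# Immediate installation and 24/7 operation.
optimistic = total_unit_months(100, 10)
# Hypothetical: one month to install, operated half the time.
adjusted = total_unit_months(100, 10, install_delay_months=1.0, duty_cycle=0.5)

failures = 25  # hypothetical failure count
print(optimistic / failures, adjusted / failures)
```

With the same 25 failures, the optimistic assumptions give 220 unit-months per failure and the adjusted ones give 90. Assuming immediate, full-time operation inflates the MTBF value, which is the opposite of conservative.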

How about when you calculate MTBF?

By convention, when there are no failures we assume in the next instant there will be one failure. This avoids dividing by zero, which causes fits for calculators and spreadsheets and mathematicians.

Another issue is how often are the calculations made? Do we gather data hourly, daily, weekly, monthly, annually? Some use a rolling set of data, for example only units shipped in the last year count for both operating time and failures. This result will ignore or discount the longer term wear out failures as the bulk of the units are young.

Some organizations do the calculations weekly in order to detect trends. If there are trends, you probably should not be using MTBF. If it’s changing, if there are early life or wear out failure mechanisms, you should not be using MTBF.

Even though you can calculate MTBF easily, working through the complexities of getting it right still does not provide a useful metric. Instead, focus on getting better data, including time to failure information, so you can explore and report the data with other tools and methods. Treat the data appropriately and make better decisions.

Sure, better data will improve the ability to calculate the MTBF value. If you’d like to be like some organizations, that is fine.

How have you seen MTBF calculated poorly? Share your thoughts and stories in the comments below.

Related:

How to calculate MTTF

Perils of using MTBF

Dare to Know podcast interview with Fred Schenkelberg

Fred talks about the NoMTBF blog and movement

Did you know I was interviewed for the Dare to Know podcast? The interview was fun; check it out.

The Dare to Know podcast Interview

Tim Rodgers interviews Fred Schenkelberg concerning his blog, No MTBF, and his mission to eradicate the common misuse of MTBF.

Fred Schenkelberg

Fred Schenkelberg is a reliability engineering and management consultant at FMS Reliability. He’s a lecturer at the University of Maryland, and he’s been an active contributor to both the IEEE and the ASQ Reliability Divisions.

Fred re-established Hewlett Packard’s corporate reliability program in the late 1990s, and also worked as a reliability consultant at Microsoft and a manufacturing engineer at Raychem.

In this episode, Fred Schenkelberg discusses:

  • Why MTBF is a poor reliability metric
  • Common objections to eliminating MTBF
  • Alternatives to MTBF