MTBF is a Statistic, Not the Only One

We often face just a sample of life data with the request to estimate the reliability of the system. Or, we have a touch of test results and want to know if the product is reliable enough, yet. Or, we gather repair times to grapple with spares stocking.

We need to know the reliability. We need to know the number.

MTBF (or close cousin MTTF) is just that number. It is easy to calculate. A higher number means the system is more reliable. And, the metrics are in the units of time, often hours, which is easy to understand (and misunderstand).

In early chapters of reliability engineering books, or in an introduction to reliability, we learn about the exponential distribution and the population parameter, theta. We also learn about the sample statistic that provides an unbiased estimate of the population parameter. In both cases, MTBF, or the mean time between failures, is the one value we have to master.

Other Statistics

Reliability is pretty easy using just one statistic. One calculation, one number, and we’re done.

Then a couple of things start to happen.

First, we notice that the actual time to failure behavior neither follows nor is predicted by the pattern we expect when using just MTBF and the exponential distribution. The average time to fail changes as the system ages. We find that we run out of spares based on calculations using MTBF as the parts fail more and more often.

Second, we learn just a little more. We turn the page in the book or attend another webinar. We hear about another distribution commonly used in reliability engineering. The Weibull distribution. But, wait, hold on there. The Weibull has two and sometimes three parameters. I’ll need to learn about plotting, censored data, regression analysis, goodness of fit, confidence intervals, and a bunch of statistical methods.

Life was good with just one statistic.

We didn’t sign up to be reliability statisticians.

Well, too bad.

Actually, when using even just the one statistic, MTBF, we also should have been

  • Checking assumptions
  • Fitting the data to the exponential distribution function
  • Evaluating the goodness of fit
  • Calculating confidence bounds
  • And, using those other statistical methods

In order to understand and use our sparse and expensive datasets, we need to use the tools found in the statistics textbooks.

Yes, the Weibull distribution has two or three parameters, thus we need to evaluate how well our statistics describe the data in a more rigorous way. And, we learn so much more. For example, we can model and predict a system with decreasing or increasing failure rates over time. We can estimate the number of required spares next year with a bit more accuracy than using just MTBF.

There are more benefits. Have you advanced past the basic introduction and embraced the use of reliability statistics? How’s it going and what challenges are you facing?

Plot the Data

Just, please, plot the data.

Say you have gathered some time to failure data. You have the breakdown dates for a piece of equipment. You review your car maintenance records and note the dates of repairs. You may have some data from field returns. You have a group of numbers and you need to make some sense of it.

Take the average

That seems like a great first step. Let’s just summarize the data in some fashion. So, let’s say I have the number of hours each fan motor ran before failure. I can tally up the hours, TT, and divide by the number of failures, r. This is the mean time to failure.

\displaystyle \theta =\frac{TT}{r}

Or, if the data was on my car and I have the days between failures, I can also tally up the time, TT, and divide by the number of repairs, r. Same formula, and we call the result the mean time between failures.

And I have a number. Say it’s 34,860 hours MTBF. What does that mean (no pun intended), other than that on average my car operated for about 34k hours between failures? Sometimes more, sometimes less.
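As a minimal sketch of the tally-and-divide calculation, here is the same arithmetic starting from repair dates. The dates and function names are made up for illustration; they are not from any real maintenance record.

```python
from datetime import date

# Hypothetical repair dates for one car (illustrative only).
repair_dates = [date(2015, 1, 10), date(2015, 3, 1),
                date(2015, 8, 15), date(2015, 9, 1)]

def days_between_failures(dates):
    """Days elapsed between each repair and the next one."""
    ordered = sorted(dates)
    return [(later - earlier).days for earlier, later in zip(ordered, ordered[1:])]

def mean_time_between_failures(gaps):
    """Total time, TT, divided by the number of gaps, r."""
    return sum(gaps) / len(gaps)

gaps = days_between_failures(repair_dates)
mtbf_days = mean_time_between_failures(gaps)
print("days between repairs:", gaps)
print("MTBF in days:", mtbf_days)
```

The single number hides the individual gaps, which in this made-up record differ from each other by roughly a factor of ten.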

Any pattern? Is my car getting better with age, or worse?

A Histogram

In school we used to use histograms to display the data. Let’s try that. Here’s an example plot.

 

In this case the plot is of service and repair times (most likely similar to the times the garage has my car for an oil change and tune up). Right away we see more than just a number. The values range from about 50 up to about 350 with most of the data on the lower side. Just a couple of service times take over 250 minutes.

Using just an average doesn’t provide very much information compared to a histogram.

Mean Cumulative Function Plot

Over time count the number of failures. If the repair time is short compared to operating time, then this simple plot may reveal interesting patterns that a histogram cannot.

Here is a piece of equipment and each dot represents a call for service. The x-axis is time and the vertical axis is the count of service calls. While it’s not clear what happened shortly after about 3,000 hours, it may be worth learning more about what was going on then.

M90-P4 MCF

 

Even the first three or four points after 3,000 hours would have signaled that something different was happening here.

MCF plots show when something is getting worse (more frequent repairs) by curving upward, or getting better (longer spans between repairs) by flattening out. Again, a lot more information than with just a number.
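For a single piece of equipment, the points on such a plot are just a running count of service calls against operating time. A rough sketch, using invented service-call hours (not the data behind the plot above) and a crude text plot:

```python
# Hypothetical operating hours at each service call for one machine.
service_call_hours = [400, 1100, 1900, 2600, 3050, 3150, 3230, 3300, 3380]

def mcf_points(event_hours):
    """(time, cumulative count) pairs for one repairable system."""
    return [(t, count) for count, t in enumerate(sorted(event_hours), start=1)]

def spans_between_calls(event_hours):
    """Hours between consecutive calls; shrinking spans mean the curve bends upward."""
    ordered = sorted(event_hours)
    return [b - a for a, b in zip(ordered, ordered[1:])]

points = mcf_points(service_call_hours)
spans = spans_between_calls(service_call_hours)

for t, count in points:
    print(f"{t:>6} h  {'*' * count}")   # crude text version of the plot
print("spans between calls:", spans)
```

In this invented record the spans drop from several hundred hours to under a hundred after 3,000 hours, which is the upward curve the plot would show.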

Plot the Fitted Distribution

Let’s say we really want to assume the data is from an exponential distribution. We can happily calculate the MTBF value and continue with the day. Or, we can plot the data and the fitted exponential distribution.

Let’s say we have about five failure times based on customer returns out of the 100 units placed into service. We can calculate the MTBF value including the time the remaining 95 units operated, which is about 172,572 hours MTBF. And, we can plot the data, too.

Here’s an example. What do you notice, even with a fuzzy plot image?

Exp assumed plot

 

The line intersects the point where the F(t) is 0.63 or about the 63rd percentile of the distribution, and the time is at the point we calculated as the MTBF value (off to the right of the plot area).
Like me, you may notice the line doesn’t seem to describe the data very well. The data seem to follow a different pattern than the one described by the exponential distribution. Let’s add a Weibull distribution fitted to the same data, including the units that have not failed.

 

W v E plot

The Weibull fit at least appears to represent the pattern of the failures. The slope is much steeper than the exponential fit. The Weibull tells a different story. A story that represents the story within the data.
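To make the comparison concrete, here is a sketch that fits both models to the same censored data set. The five failure times and the 4,000 hour suspension time are invented for illustration (they are not the data behind the plots), and the Weibull fit uses a plain maximum likelihood calculation solved by bisection, which is one common approach and not necessarily the method used for the plots.

```python
import math

# Hypothetical data: 5 returns out of 100 units, the rest still running.
failure_hours = [2000.0, 2600.0, 3100.0, 3400.0, 3900.0]
suspension_hours = [4000.0] * 95

def exponential_mtbf(failures, suspensions):
    """Total time on all units divided by the number of failures."""
    return (sum(failures) + sum(suspensions)) / len(failures)

def weibull_mle(failures, suspensions, lo=0.01, hi=50.0, tol=1e-9):
    """Two-parameter Weibull maximum likelihood fit with right-censored units.

    Returns (beta, eta). Beta solves the usual likelihood equation, found
    here by bisection; eta then follows in closed form.
    """
    all_times = list(failures) + list(suspensions)
    scale = max(all_times)                    # rescale so powers stay <= 1
    x_all = [t / scale for t in all_times]
    x_fail = [t / scale for t in failures]
    r = len(x_fail)
    mean_log_fail = sum(math.log(x) for x in x_fail) / r

    def score(beta):
        powers = [x ** beta for x in x_all]
        weighted = sum(p * math.log(x) for p, x in zip(powers, x_all))
        return weighted / sum(powers) - 1.0 / beta - mean_log_fail

    while hi - lo > tol:                      # score() increases with beta
        mid = (lo + hi) / 2.0
        if score(mid) < 0:
            lo = mid
        else:
            hi = mid
    beta = (lo + hi) / 2.0
    eta = scale * (sum(x ** beta for x in x_all) / r) ** (1.0 / beta)
    return beta, eta

mtbf = exponential_mtbf(failure_hours, suspension_hours)
beta, eta = weibull_mle(failure_hours, suspension_hours)
print(f"exponential MTBF: {mtbf:,.0f} hours")
print(f"Weibull beta: {beta:.2f}, eta: {eta:,.0f} hours")
```

With these made-up numbers the Weibull slope comes out well above one and the characteristic life lands far below the exponential MTBF, the same kind of disagreement the two fitted lines show.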

Again, just plot the data. Let the data show you what it has to say. What does your data say today?

Speaking of Reliability — a new podcast series

Coming Soon

Speaking of Reliability

A new podcast show featuring discussions with reliability experts about a wide range of reliability engineering topics is in development. We’ve recorded a few episodes and are editing them now.

I expect to launch the podcast in the next week or two.

The show is in large part based on the questions received over the past few years from you: reliability minded folks who would like to solve problems, improve reliability performance and advance your careers.

The Fear of Reliability Statistics

Eva the Weaver Soon deniable

When reading a report and there is a large complex formula, maybe a derivation, do you just skip over it? Does a phrase, 95% confidence of 98% reliability over 2 years, not help your understanding of the result?

Hypothesis testing, confidence intervals, point estimates, parameters, independent identically distributed, random sample, orthogonal array, …

Did you just shiver a bit?

5 Ways Reliability Was Important to Me Today

Andrew J. Cosgriff, we could live in hope

I suspect the reliability of the products and services in your world plays an important role in your day to day existence. For me, maybe I just pay attention to reliability, yet today in particular I tried to notice when things were just working as expected.

Rather than consider everything that touched my life today, I’ll narrow this down to just five.

The Bad Reputation of Statistics

Statistics and the Bad Reputation

K W Reinsch Eddy "Reliable" Trustman (Windows XP Professional)

In a recent reliability seminar I learned that the younger engineers did not have to take a statistics course, nor was it part of other courses, in their undergraduate engineering education. They didn’t dislike their stats class as so many before them have, they just didn’t have the pleasure.

Generally I ask how many ‘enjoyed’ their stats class. That generally gets a chuckle and opens an introduction to the statistics that we need to use for reliability engineering. I’ll have to change my line as more engineers just do not have any background with statistics.

I suspect this is good news for Las Vegas and other gambling based economies.

Statistics are hard

On average there are a few folks that get statistics. Not me. There are those that intuitively understand probability and statistics, and demonstrate a mastery of the theory and application. Not me.

I, like many others who successfully use statistical tools, think carefully, consider the options, check assumptions, recheck the approach, ask for help and still check and recheck the work. Statistics is a tool and allows us to make better decisions. With practice you can get better at selecting the right tool and master the application of a range of tools.

Sure, it’s not easy, yet as many have found, mastering the use of statistics allows them to move forward faster.

Statistics are abused

Politicians, marketers, and others have a message to support, and citing an interesting statistic helps. It doesn’t matter that the information is out of context or unclear. When someone claims 89% of those polled like brand x, what does that mean? Did they ask a random sample? Did they stop asking when they got the result they wanted? What was the poll selection process and what were the specific questions? What was the context?

The number may have been a simple count of positive responses vs all those questioned. The math results in a statistic, a percentage. It implies the sample represented the entire population. It may or may not, that is not clear.

We hear and read this type of statistic all too often. We discount even the well crafted and supported statistic. We associate distrust with statistics in general given the widespread poor or misleading use.

To me that means, we just need to be sure we are clear, honest and complete with our use of statistics. State the relevant information so others can fully understand. Statistics isn’t just the resulting percentage, it’s the context, too.

Statistics can be wrong

Even when we work to apply a statistical tool appropriately, there is a finite chance that random selection will provide a faulty result. If we test 10 items, there is a chance that our conclusion will show a 50% failure rate even though the actual population failure rate is less than 1%. Not likely to happen, yet it could.
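To put a rough number on "not likely", here is a small sketch using the binomial distribution. It assumes each of the 10 items fails independently with probability 0.01, which is the upper end of the "less than 1%" stated above; the function name is only for illustration.

```python
import math

def prob_at_least(k, n, p):
    """Binomial probability of k or more failures among n independent items."""
    return sum(math.comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

# Assumed true failure probability per item: 1% (an upper bound for the example).
p_half_or_more_fail = prob_at_least(5, 10, 0.01)
p_any_failure = prob_at_least(1, 10, 0.01)

print(f"P(5 or more of 10 fail): {p_half_or_more_fail:.2e}")
print(f"P(at least 1 of 10 fails): {p_any_failure:.3f}")
```

Seeing half the sample fail is extraordinarily unlikely under this assumption, while seeing at least one failure (an observed 10% or worse) happens in nearly one test out of ten. Small samples can mislead in less dramatic ways far more often.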

We often do not have the luxury of the law of large numbers with our observations.

So, given the reality that we need to make a decision and that using a sample has risk, does that justify not using the sample’s results? No. The alternative of using no data doesn’t seem appealing to me, nor should it to you.

So, what can we do? We:

  • Do the best we can with the data we have.
  • Do exercise due care to minimize and quantify measurement error.
  • Strive to select samples randomly.
  • Apply the best analysis available, and,
  • Extract as much information from the experiment and analysis as possible.

As with wood working there are many ways to cut a board, with statistics there are many tools. Learn the ones that help you characterize and understand the data you have before you. Master the tools one at a time and use them safely and with confidence.

How Many Assumptions Are Too Many Concerning Reliability?

photolibrarian Riceville, Iowa, Riceville Hatchery, Ames Reliable Products

When I buy a product, say a laptop, I am making an educated guess that Apple has done the due diligence to create a laptop that will work as long as I expect it to last. The trouble is I don’t know how long I want it to last thus creating some uncertainty for the folks at Apple. How long should a product last to meet customer expectations when customers are not sure themselves?

When to Use MTBF as a Metric?

Sean Bonner Old Reliable Coffee

I will not say ‘never’, which is probably what you expect. There is a rare set of circumstances that may benefit from the use of MTBF as a metric. Of course, this does not include being deceitful or misleading with marketing materials. There may actually be an occasion where the MTBF metric works well.

As you know, MTBF is often estimated by tallying up the total hours of operation of a set of devices or systems and dividing by the number of failures. If no failures occur we assume one failure to avoid dividing by zero (messy business dividing by zero and to be avoided). MTBF is essentially the average time to failure.

Expected Value as Metric

The metric we select should be measurable and should measure something we have an interest in. We would like to detect changes, measure progress, and possibly make business decisions with our metrics. If we are interested in the expected value of the time to failure for our devices, then MTBF might just be useful.

When making a device we often hear executives, engineers and customers talk about how long they expect the product to last. An office device may have an expected life of 5 years, a solar power system 30 years, and so on. If by duration we all agree that we expect 5 years of service on average, then using the average as the metric makes sense.

Before starting to use MTBF, just make sure everyone understands that a 5 year life stated this way implies half to two thirds of the devices will fail by the stated duration of 5 years. Yes, if the time to failure distribution is actually described by the exponential distribution (and a few other distributions), it means that about two thirds of the units are expected to fail by the MTBF value. Thus if we set the goal to 5 years MTBF we imply half or more of the units will fail by 5 years.
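The two thirds figure follows directly from the exponential distribution, where the fraction failed by time t is 1 − exp(−t/MTBF). A short sketch, using the 5 year MTBF goal from the text (function and variable names are only illustrative):

```python
import math

def exp_fraction_failed(t, mtbf):
    """Exponential CDF: expected fraction of units failed by time t."""
    return 1.0 - math.exp(-t / mtbf)

mtbf_years = 5.0
failed_by_mtbf = exp_fraction_failed(mtbf_years, mtbf_years)
failed_in_first_year = exp_fraction_failed(1.0, mtbf_years)
median_life_years = mtbf_years * math.log(2)   # time by which half have failed

print(f"failed by 5 years: {failed_by_mtbf:.1%}")
print(f"failed in the first year: {failed_in_first_year:.1%}")
print(f"half have failed by: {median_life_years:.2f} years")
```

Under the exponential assumption about 63% of units have failed by the MTBF, and half of them have already failed well before it.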

Product Testing Advantages

Having a goal helps the design and development team make decisions and eventually conduct testing to prove the design meets the reliability objectives. Setting the goal at the expected value allows the fewest number of samples for testing. Testing for 99% reliability over 5 years is much tougher. We may require many samples to determine a meaningful estimate of the leading tail (i.e. first 1% or 5% of failures) of the time to failure distribution.
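One way to see the difference in sample size is the standard zero-failure (success run) demonstration formula, n = ln(1 − C) / ln(R). This formula is brought in here for illustration and is not from the text above; it assumes every unit is tested for the full duration (or an equivalent) with no failures allowed, and the 90% confidence level is an arbitrary choice.

```python
import math

def success_run_sample_size(reliability, confidence):
    """Units needed, all surviving the full test, to demonstrate the given
    reliability at the given confidence (zero-failure binomial test)."""
    return math.ceil(math.log(1.0 - confidence) / math.log(reliability))

confidence = 0.90                      # assumed for the example
survive_to_mtbf = math.exp(-1.0)       # exponential reliability at t = MTBF

n_for_mtbf_goal = success_run_sample_size(survive_to_mtbf, confidence)
n_for_99_percent = success_run_sample_size(0.99, confidence)

print("units to demonstrate the MTBF goal:", n_for_mtbf_goal)
print("units to demonstrate 99% reliability:", n_for_99_percent)
```

Under these assumptions a handful of units covers the MTBF goal, while 99% reliability takes a couple of hundred, which is why the leading tail is so much more expensive to demonstrate.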

If the time to failure pattern fits an exponential distribution, then testing becomes simplified. We can test one unit for a long time, or many units a short time, and arrive at the same answer. The test planning can maximize our resources to efficiently prove our design meets the objective. When the chance of failure each hour is the same, every device-hour of testing provides an equal amount of information.

Unlike products that wear out or degrade with time, when the design and device exhibit an exponential distribution we do not need any aging studies. We can just apply use or accelerated stress and measure the hours of operation and count the failures. Also any early failures are obviously quality issues and most likely do not count toward failures that represent actual field failures. Or do they?

Metrics Should Have a Common Understanding

When the industry, organization, vendors, and engineering staff already use MTBF to discuss reliability, then management would be wise to establish a metric using MTBF. Makes sense, right? The formula to calculate MTBF is very simple. Even the name implies the meaning (no pun intended). MTBF is the mean time between (or before) failure. It’s an average, which calculators, spreadsheets, smart phones, and possibly even your watch can calculate.

While the spread of the data is often of importance when making comparisons, estimating a sample set of data’s confidence bounds, or estimating the number of failures over the warranty period, if we assume the data actually fits an exponential distribution, we find the mean equals the standard deviation. Great! One less calculation. We have what we need to move forward.

Nearly every reliability or quality textbook or guideline includes extensive discussions about MTBF, the exponential distribution, and a wide range of reliability related calculations. Our common understanding generally is supported by the plentiful references.

Ask a few folks around you when considering using MTBF. What do they define MTBF as representing? If you receive a consistent answer, you may just have a common understanding. If the understanding is also aligned with the underlying math and assumptions, even better.

When to Use MTBF Checklist

In summary all you need is:

  • A business interest in the time till half or more of the products fail
  • A design with a fixed chance of failure each hour of operation
  • A well educated team that understands the proper use of an inverse failure rate measure

I submit we are rarely interested in the time till the bulk of devices fail; rather, we are interested in the time to the first failures or until some small percentage fail.

I suggest that very few devices or systems actually fail with a constant hazard rate. If your product does, prove it without grand waves of assumptions.

I have found that engineers, scientists, vendors, customers, and managers regularly misunderstand MTBF and how to properly use an MTBF value.

So back to the opening statement: it is possible, though not likely, that you will find an occasion to effectively use MTBF as a metric. Instead use reliability: the probability of successful operation over a stated period with stated conditions and definition of success. 98% of office printers will function for 5 years without failure in an office. Pretty clear. Sure, we can fully define the function(s) and environment, and we need to do that anyway.

Reliability and Availability

Brent Moore The Old Reliable Bull Durham https://www.flickr.com/photos/brent_nashville/2163154869/in/gallery-fms95032-72157649635411636/

In English there is a lot of confusion on what reliability, availability and other ‘ilities mean in a technical way. Reliability as used in advertising and common discussions often means dependable or trustworthy. If talking about a product or system it may mean it will work as expected.

Looking Forward to the MTBF Report

photolibrarian Matchbook, Jack Knarr, Reliable Cleaners, West Union, Iowa https://www.flickr.com/photos/photolibrarian/8127780278/in/gallery-fms95032-72157649635411636/

On social media the other day I ran across a comment from someone that took my breath away. They were looking forward to starting a new reliability, no, MTBF report. They were tasked with creating a measure of reliability for use across the company and they chose MTBF.

Sigh.

Where have we gone wrong?

I certainly do not blame the person. They have read about MTBF in many textbooks. Studied reliability using MTBF and related measures, plus found technical papers using the same. They may have seen industry reports and standards also.

MTBF is prevalent, and it is no wonder someone tasked with setting a metric would select MTBF. It’s easy to calculate. Just one number and bigger is better.

On the other hand

MTBF is roundly criticized across any reliability related forum or discussion group. There is progress in books, papers and standards. Yet it’s not reaching those new to reliability engineering.

This note will be short and have one request. Please tell those just getting started in reliability engineering not to consider using MTBF, not to request MTBF from vendors, and to actually do some thinking before selecting MTBF as their organization’s metric.

Better yet, challenge those using MTBF to explain in a coherent and rational manner why they are doing so. Ask them to validate their assumed constant failure rate or similar assumptions. Working together we can start a ripple that may help build the wave of knowledge to improve the state of reliability engineering.

Why Doesn’t Product Testing Catch Everything?

photolibrarian West Union, Iowa, The Reliable Agency, B. Kamm, Jr., Matchbook, Farmers Casualty Company https://www.flickr.com/photos/photolibrarian/8244857538/in/gallery-fms95032-72157649635411636/

In an ideal world the design of a product or system will have perfect knowledge of all the risks and failure mechanisms. The design then is built perfectly without any errors or unexpected variation and will simply function as expected for the customer.

Wouldn’t that be nice.

The assumption that we have perfect knowledge is the kicker though, along with perfect manufacturing and materials. We often do not know enough about:

  • Customer requirements
  • Operating environment
  • Frequency of use
  • Impact of design tradeoffs
  • Material variability
  • Process variability

We do know that we do not know everything we need to create a perfect product, thus we conduct experiments.

We test.

Fixing Early Life Failures Can Make Your MTBF Worse

Let’s say we have 6 months of life data on 100 units. We’re charged with looking at the data and determining the impact of fixing the problems that caused the earliest failures.

The initial look at the data includes 9 failures and 91 suspensions. Other than the nine, all units operated for 180 days. The MTBF is about 24k days. Having heard about Weibull plotting and using the beta value as a guide, we initially find the blue line in the plot. The beta value is less than one, so we start looking for supply chain, manufacturing or installation caused failures, as we suspect early life failures dominate the time to failure pattern.

Initial Steps to Improve the Product

Given clues and evidence that some of the products failed early, we investigate and find evidence of damage to units during installation. In fact it appears the first four failures were due to installation damage. The fix will cost some money, so the director of engineering asks for an estimate of the effect of the change on the reliability of the system.

The organization uses MTBF, as does the customer. The existing MTBF of 24k days exceeds the customer’s requirement of 10k days, yet avoiding early problems may be worth the customer goodwill. The motivation is driven by continuous improvement and not out of necessity or customer complaints.

Calculation of Impact of Change on Reliability

One way to estimate the effect of a removal of a failure mechanism is to examine the data without counting the removed failure mechanism. So, if the change to the installation practice in the best case completely prevents the initial four failures observed we are left with just the 5 other failures that occurred over the 6 months.

Removing the four initial failures and calculating MTBF we estimate MTBF will change to about 300 days.

Hum?

We removed failures and the MTBF got worse?

What Could Cause this Kind of Change?

The classic calculation for MTBF is the total time divided by the number of failures. Taking a closer look at time to failure behavior of the two different failure mechanisms may reveal what is happening. The early failures have a decreasing failure rate (Weibull beta parameter less than 1) over the first two months of operation. Later, in the last couple of months of operation, 5 failures occur and they appear to have an increasing rate of failure (Weibull beta parameter greater than 1).

By removing the four early failures the Weibull distribution fit changes from the blue line to the black line (steeper slope).

Recall that the MTBF value represents the point in time when about 63% of units have failed. With only 9 total failures out of 100 units, only about 10% of units have failed, so the MTBF calculation is a projection into the future when most units have failed. It does not directly provide information about failures at 6 months or less.

In this case, when the four early failures are removed the slope changes from about 0.7 to about 5; the fitted line rotates counterclockwise on the CDF plot.
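A rough sketch of why a steeper line pulls the projected mean in so sharply, using the Weibull mean, η·Γ(1 + 1/β). It assumes, purely for illustration, that each fitted line passes through the observed fraction failed at 180 days (about 9% before the fix, about 5% after). These are stand-in values, not the fitted parameters behind the plot, so the output will not reproduce the article’s numbers exactly.

```python
import math

def eta_through_point(beta, t, fraction_failed):
    """Characteristic life of a Weibull line with slope beta that passes
    through F(t) = fraction_failed."""
    return t / (-math.log(1.0 - fraction_failed)) ** (1.0 / beta)

def weibull_mean(beta, eta):
    """Mean of a two-parameter Weibull distribution."""
    return eta * math.gamma(1.0 + 1.0 / beta)

# Assumed anchor points at 180 days (illustrative, not fitted values).
cases = [("all 9 failures", 0.7, 0.09),
         ("early failures removed", 5.0, 0.05)]

means = {}
for label, beta, fraction in cases:
    eta = eta_through_point(beta, 180.0, fraction)
    means[label] = weibull_mean(beta, eta)
    print(f"{label}: beta={beta}, eta={eta:,.0f} days, mean={means[label]:,.0f} days")
```

Under these assumptions the shallow line projects a mean of several thousand days, while the steep line projects a mean of about 300 days, even though it was fitted to fewer failures.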

If we used only MTBF, removing four failures from the data would have made the measured MTBF much worse and would have prevented us from improving the product. By fitting the data to a Weibull distribution we learned to investigate early life failures, and once that failure mechanism was removed the analysis revealed a potentially serious wear out failure mechanism.

This is an artificial example, of course, yet it illustrates the degree to which an organization is blind to what is actually occurring by using only MTBF. Treat the data well and use multiple methods to understand the time to failure pattern.

The Reliability Metric Book Announcement

The Reliability Metric

A Quick and Valuable Improvement Over MTBF

Finished it. 130 pages long and packed with advice on why and how to switch from MTBF to reliability.

Based in large part on comments, feedback, discussions and input from you, my peers in the NoMTBF tribe. Thanks for the encouragement and support.

The Reliability Metric book is available here.

Determine MTBF Given a Weibull Distribution

Gary A. K. Reliable & regal 1000-block Nelson.

First off, I’m not sure why anyone would want to do this, yet one of the issues I’ve heard concerning abandoning the use of MTBF is that clients ask for MTBF. If they will not accept reliability probabilities at specific durations, and insist on using MTBF, you probably should provide a value to them.

Let’s say you have a Weibull distribution model that describes the time to failure distribution of your product. You’ve done the testing, modeling, and many field data analyses and know that for the requestor’s application this is the best estimate of reliability performance. You can quite easily calculate the MTBF value.

As you know, if the β parameter is equal to one then the characteristic life, η, is equal to MTBF. If β is less than or greater than one, then use the following formula to determine the mean value, MTBF, for the distribution.

\displaystyle \mu =\eta \Gamma \left( 1+\frac{1}{\beta } \right)

You’ll need the Gamma function and the Weibull parameters. The further β is from one, the bigger the difference between η and MTBF.
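As a minimal sketch, the formula maps directly onto the Gamma function in Python’s standard library. The η of 1,000 hours and the β values are arbitrary examples, not parameters from any real product.

```python
import math

def weibull_mtbf(beta, eta):
    """Mean of a two-parameter Weibull: eta * Gamma(1 + 1/beta)."""
    return eta * math.gamma(1.0 + 1.0 / beta)

eta_hours = 1000.0                     # arbitrary characteristic life
for beta in (0.5, 1.0, 2.0, 5.0):      # arbitrary shape values
    print(f"beta={beta}: MTBF = {weibull_mtbf(beta, eta_hours):,.0f} hours")
```

With β equal to one the result matches η. With β of 0.5 the mean is twice η, and with β of 2 it is roughly 11% below η.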

You can find a little more information and background at the article Calculate the Mean and Variance on the accendoreliability.com site under the CRE Preparation article series.

Spotted a Current Reference to Mil Hdbk 217

Spotted a Current Reference to Mil Hdbk 217 Recently

Ben Bashford LOL Reliable https://www.flickr.com/photos/bashford/2659100054/in/gallery-fms95032-72157649635411636/

After a short convulsion of disbelief I became shocked. This was a guide for a government agency advising design teams and supporting reliability professionals. It suggested using 217 to create the estimate of field reliability performance during the development phase.

Have we made no progress in 30 years?

What would you do?

Let’s say you are reviewing a purchase contract and find a request for a reliability estimate based on Mil Hdbk 217F (the latest revision, which has also been obsolete for many years). What would you do? Would you contact the author and request an update to the document? Would you pull out 217 and create the estimate? Would you work to estimate the reliability of the product using the best available current methods, then convert that work to an MTBF and adjust the 217 inputs to create a similar result? Or would you ignore the 217 requirement and provide a reliability case instead?

Requirements are the requirements

When a customer demands a parts count prediction as a condition of the purchase, is that useful for either the development team or the customer?

No.

So, given the contract is signed and we are in the execution phase, what are your options?

  1. Do the prediction and send over the report while moving on with other work.

  2. Ask the customer to adjust the agreement to include a meaningful estimate.

  3. Ignore the 217 requirement and provide a complete reliability case detailing the reliability performance of the product.

  4. Find a new position that will not include MTBF parts count prediction.

The choice is yours.

I hope you would call out the misstep in the contract and help all parties get the information concerning reliability that they can actually use to make meaningful decisions.