Reliability Growth without MTBF

Reliability Growth and MTBF

Peter Lee Reliable https://www.flickr.com/photos/oldpatterns/5858406571/in/gallery-fms95032-72157649635411636/

Really? Is MTBF the only way to work with reliability growth?

I received this question via LinkedIn (feel free to connect with me there) and hadn’t given it much thought before. I am familiar with a few growth models and have regularly seen MTBF used with them. Thus I had discounted the modeling as an approach of little interest to me or my clients.

MTBF measures the inverse of the average failure rate, when in many cases we really want to know about the first or tenth percentile of time to failure. Measuring and tracking the average time to failure provides little information about the onset of the first few failures.
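To make the point concrete, here is a minimal Python sketch comparing the time to the first 1% and 10% of failures under two models that share the same mean life. The 50,000 hour mean and the Weibull shape of 2 are hypothetical values chosen only for illustration, and the function names are mine.

```python
import math

def exponential_time_to_fraction(mtbf_hours, fraction_failed):
    """Time by which `fraction_failed` of units have failed, constant failure rate."""
    return -mtbf_hours * math.log(1.0 - fraction_failed)

def weibull_time_to_fraction(eta_hours, beta, fraction_failed):
    """Same percentile for a Weibull with characteristic life eta and shape beta."""
    return eta_hours * (-math.log(1.0 - fraction_failed)) ** (1.0 / beta)

# Hypothetical values for illustration only.
mean_life = 50_000.0   # hours; the single "MTBF" number on a data sheet
beta = 2.0             # assumed wear-out shape parameter
# Choose eta so the Weibull has the same mean: mean = eta * Gamma(1 + 1/beta).
eta = mean_life / math.gamma(1.0 + 1.0 / beta)

for fraction in (0.01, 0.10):
    t_exp = exponential_time_to_fraction(mean_life, fraction)
    t_wbl = weibull_time_to_fraction(eta, beta, fraction)
    print(f"{fraction:.0%} failed: exponential {t_exp:,.0f} h, Weibull {t_wbl:,.0f} h")
```

Two populations with the same average can have very different early-failure behavior, which is exactly the information the average hides.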

Reliability Growth Models

I did a quick check of common reliability growth models and found a few in the NIST Engineering Statistics Handbook (http://www.itl.nist.gov/div898/handbook/apr/section1/apr19.htm).

The Homogeneous Poisson Process (HPP) applies when the failure rate is constant over the time period of interest. This relies on the exponential distribution and the assumption of a stable and random arrival of failures, which is almost never true (in my experience). It’s a convenient assumption as it makes the math a lot simpler, yet it provides only a crude model and poor results.

The Non-Homogeneous Poisson Process (NHPP) Power Law and Exponential Law models provide information based on the cumulative number of failures over time. These models rely on the notion that any system has a finite number of design errors that, once resolved, leave a system with HPP behavior.

The Duane plot provides a graphical means to show cumulative failures over time. When the arrival of failures slows, the curve decreases in slope, effectively bending over. This provides a means to estimate the final failure rate (an average, unfortunately).
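As a rough sketch of the graphical idea, the Python below fits a straight line through cumulative failure count versus cumulative test time on log-log axes, which is what the eye does on a Duane-style plot. The failure times are hypothetical, and the ordinary least-squares fit is a simple stand-in for the plot, not the maximum likelihood estimate used in formal NHPP Power Law analysis.

```python
import math

def fit_power_law(failure_times):
    """Least-squares line through (log t_i, log i); returns (a, b) for N(t) ~ a * t**b."""
    xs = [math.log(t) for t in failure_times]
    ys = [math.log(i) for i in range(1, len(failure_times) + 1)]
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b = sxy / sxx                      # slope on the log-log plot
    a = math.exp(y_bar - b * x_bar)    # intercept, back on the original scale
    return a, b

# Hypothetical cumulative test hours at which each failure was found.
failure_times = [35, 110, 250, 480, 800, 1300, 2000, 2900]

a, b = fit_power_law(failure_times)
print(f"N(t) ~ {a:.3f} * t^{b:.3f}")
print("failures are arriving more slowly" if b < 1 else "no growth visible yet")
```

A slope below 1 means the cumulative failure curve is bending over, i.e. the team is finding and fixing faults faster than new ones appear.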

What I use instead

Given my dislike of all things MTBF, I’ve not used these models to estimate MTBF. Instead I stay with the Duane plot and graphically track whether the team is finding and fixing enough faults in the design.

I also tend to use reliability block diagrams (RBD) with each block modeled with the appropriate reliability distribution. For a series model, all we need to do is multiply the reliability values of the blocks at time t (say the warranty period, mission time, etc.) to estimate the system reliability at time t.
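A minimal Python sketch of the series calculation follows. It assumes every block is described by a Weibull distribution (a shape of 1 reduces to the exponential), and the three blocks and their parameters are hypothetical.

```python
import math

def weibull_reliability(t, eta, beta):
    """Probability a block survives to time t; beta = 1 is the exponential case."""
    return math.exp(-((t / eta) ** beta))

def series_reliability(t, blocks):
    """Series RBD: the system works only if every block works, so multiply."""
    reliability = 1.0
    for eta, beta in blocks:
        reliability *= weibull_reliability(t, eta, beta)
    return reliability

# Hypothetical blocks as (eta in hours, beta).
blocks = [
    (90_000.0, 2.0),    # wear-out dominated element
    (200_000.0, 1.0),   # roughly random failures
    (50_000.0, 3.0),    # steeper wear-out
]

warranty_hours = 8760.0
print(f"R({warranty_hours:.0f} h) = {series_reliability(warranty_hours, blocks):.4f}")
```

The same few lines translate directly to a spreadsheet: one row per block, one column for R(t), and a product at the bottom.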

For complex systems with some amount of redundancy the RBD does get a bit more complicated. For very complex systems with degraded modes of operation or significant repair times, use Petri nets or Markov models to properly model the system.
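For simple active redundancy the extra complication is small. The sketch below assumes independent failures and that any one of the redundant blocks is enough for the function to work; the pump and controller values are hypothetical.

```python
def parallel_reliability(block_reliabilities):
    """Active redundancy: the group fails only if every block in it fails."""
    chance_all_fail = 1.0
    for r in block_reliabilities:
        chance_all_fail *= (1.0 - r)
    return 1.0 - chance_all_fail

def series_reliability(block_reliabilities):
    """Series: every block must work."""
    result = 1.0
    for r in block_reliabilities:
        result *= r
    return result

# Hypothetical reliabilities at the mission time of interest.
pump = 0.90
controller = 0.98

redundant_pumps = parallel_reliability([pump, pump])
system = series_reliability([redundant_pumps, controller])
print(f"two pumps in parallel: {redundant_pumps:.4f}, system: {system:.4f}")
```

Standby redundancy, shared loads, or common cause failures break the independence assumption and are where Markov models or Petri nets earn their keep.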

In the vast majority of cases a simple RBD is sufficient to capture and understand the reliability of a system. This allows the team to focus on improving weak areas and reduce uncertainty through improving reliability estimates. An RBD does not require nor assume an exponential distribution and the math is easy enough to manage, often even in your favorite spreadsheet.

Summary

Reliability growth starts with a model of the estimated number of failures over a time period. Testing then provides a value for that estimate. This does not require the use of MTBF, so instead of assuming a constant failure rate, focus on the failure mechanisms and use a simple RBD to build a system model. The reliability growth is the result of identifying areas for improvement and doing the improvement. An RBD, in my experience, provides a great way to communicate with the team where to focus improvements.

Recent Questions on MTTF

Questions about MTTF

raymondclarkeimages Reliable https://www.flickr.com/photos/rclarkeimages/8212393147/in/gallery-fms95032-72157649635411636/

Over the past week I’ve seen or received a couple of questions about MTTF. One was on how to use failure data to calculate MTTF, another on how to estimate Weibull parameters after assuming a constant rate of failure.

It is good to see such questions, as it means the person is curious enough to take the time to ask.

How to Justify Using the Exponential Distribution

Grant Hutchinson Reliable

Do you check assumptions? Not all assumptions are equal as some may lead you to a costly decision.

We regularly make assumptions about the uniformity of material, the consistency of part to part performance, and many other engineering elements of a design or process. We have to simplify the problems we face in order to work out solutions and make decisions.

Maintenance and MTBF

Does MTBF have any role in Maintenance?

Reliable, ADM in afternoon light by Seth Anderson

No. You should not use MTBF when designing or scheduling maintenance programs or tasks. Furthermore, it is a very poor metric to monitor equipment performance.

The basic calculation of MTBF (or MTTF), together with the assumption that the equipment time-to-failure distribution is exponential, implies the equipment downing event occurs randomly. In other words, the equipment doesn’t break in and lower its chance of failure over time, nor does it wear out and show an increasing failure rate over time.

The chance of failure is constant over time and does not change given the time the system or component has been in service.

MTBF does provide the average time between failures, yet it does not provide any information about when the failures may occur if the actual failures do not occur randomly. Furthermore, the exponential distribution has a memoryless feature, meaning a motor that is brand new and a similar motor with 1,000,000 hours of service each have the same chance to fail in the next hour.

The MTBF calculation or vendor supplied value does not include information about how the failure rate may change over time.

Wear Out and Maintenance Planning

Let’s use a motor as an example for a simple maintenance planning exercise. Let’s say the motor has an MTBF of 100,000 hours provided by the vendor. There isn’t any maintenance on the motor, such as lubrication or alignment checks, yet we are planning to use 100 motors in the plant and need to plan for spares.

How many spares will we need over the next year to replace faulty motors?

Using just MTBF, we can use the probability of successful operation over the year, 8760 hours, and quickly estimate how many of the 100 motors will require replacement.

\displaystyle R(t)={{e}^{-t/\theta }}

t is 8760 hours

θ is the MTBF or 100,000 hours

Thus, we find 91.6% of units should survive one year of operation. That means out of 100 installed motors, we expect about 8.4% to fail, or 8 or 9 units. Of course we could add a confidence bound to this calculation, plus include the time the replacement units operate, for a bit more accuracy. For this example we’ll keep it simple.
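The arithmetic is short enough to check in a few lines of Python. This is only a sketch of the calculation described above; the function name is mine, and replacement units and confidence bounds are ignored, as in the text.

```python
import math

def exponential_reliability(t, mtbf):
    """R(t) = exp(-t / theta) under the constant failure rate assumption."""
    return math.exp(-t / mtbf)

mtbf_hours = 100_000.0   # vendor supplied MTBF
one_year = 8760.0        # hours of operation
installed = 100          # motors in the plant

survive = exponential_reliability(one_year, mtbf_hours)
expected_failures = installed * (1.0 - survive)
print(f"R(1 year) = {survive:.3f}, expected failures = {expected_failures:.1f}")
```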

Yet, we know based on experience with other similar motors that they rarely fail during the first year. With a little work we find the motors do actually wear out, primarily due to bearing wear. With another call to the vendor, we find they recommend using the Weibull distribution with β of 2 and η of 90,000 hours.

The reliability function for the Weibull distribution is

\displaystyle R(t)={{e}^{-{{\left( t/\eta \right)}^{\beta }}}}

Where η is the characteristic life, in this case 90,000 hours

And, β is 2.

Thus over one year we would expect 99% of the motors to survive, meaning only 1 is expected to fail.
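The same check for the Weibull case, again as a minimal Python sketch with a function name of my own choosing and the vendor's parameters from the text:

```python
import math

def weibull_reliability(t, eta, beta):
    """R(t) = exp(-(t / eta) ** beta)."""
    return math.exp(-((t / eta) ** beta))

eta_hours = 90_000.0   # characteristic life from the vendor
beta = 2.0             # shape parameter, wear-out behavior
one_year = 8760.0
installed = 100

survive = weibull_reliability(one_year, eta_hours, beta)
expected_failures = installed * (1.0 - survive)
print(f"R(1 year) = {survive:.4f}, expected failures = {expected_failures:.1f}")
```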

Using MTBF would have us buy 7 or 8 extra spares unnecessarily.

Maintenance Scheduling

We know that motors wear out. Given only MTBF and the exponential distribution assumption we do not have sufficient information to schedule motor replacements.

If the motors actually failed randomly, as assumed, then our only strategy is to replace motors as they fail. Since the chance to fail each hour remains constant, arbitrarily replacing motors at any point in time will not avert or change the chance of failure in the next hour.

When we model the wear out behavior, i.e. a Weibull distribution with β of 2, we can calculate the time at which the chance of failure becomes economically unacceptable. For example, if we typically operate in one-week shifts of 168 hours and then have time for maintenance tasks, we can calculate the chance of failure over a one-week period after one year, two years, etc., and determine when the chance of failure becomes unacceptable.
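One way to sketch that calculation in Python is with the conditional probability of failing during the next week given survival to the current age, 1 - R(t + 168) / R(t). The Weibull parameters are the vendor values from the motor example; the 0.002 threshold is a hypothetical stand-in for whatever your economics say is unacceptable.

```python
import math

def weibull_reliability(t, eta, beta):
    return math.exp(-((t / eta) ** beta))

def chance_of_failure_next_interval(age, interval, eta, beta):
    """P(fail within `interval` | survived to `age`) = 1 - R(age + interval) / R(age)."""
    return 1.0 - (weibull_reliability(age + interval, eta, beta)
                  / weibull_reliability(age, eta, beta))

ETA, BETA = 90_000.0, 2.0   # vendor Weibull parameters from the motor example
WEEK = 168.0                # hours in one operating shift
HOURS_PER_YEAR = 8760.0
THRESHOLD = 0.002           # hypothetical acceptable weekly chance of failure

for year in range(1, 11):
    p = chance_of_failure_next_interval(year * HOURS_PER_YEAR, WEEK, ETA, BETA)
    flag = "  <-- exceeds threshold" if p > THRESHOLD else ""
    print(f"after {year:2d} years: {p:.5f}{flag}")
```

With β of 1 the same function returns the same value at every age, which is the memoryless behavior that makes scheduled replacement pointless under the exponential assumption.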

Knowing how the failure rate changes over time we can schedule replacements and maintain a relatively lower overall failure rate.

Summary

Find or estimate the information concerning the changing rate of failure over time. Ignoring wear out or early failures by using MTBF only will cost you and your plant money.

Understanding and modeling the wear out patterns allows you to secure spares as needed. You can avoid costly downtime by doing replacements before the chance of failure is too high.

PS: I’m working on examples and an update to the draft book on MTBF to include more maintenance reliability specifics.

How to Ask for Component Reliability Information

Supplier Reliability Information Requests

Image of Reliable Drugs Liquors Sign
CC BY-NC Luke Gattuso http://fmsrel.com/1DaHcnO

Every now and then we need to ask a supplier for a reliability estimate for a component they produce. Our team may be considering adding the part to a system and would like to know if it is reliable enough to meet our needs.

5 Steps to Establish a Meaningful Reliability Goal

 

Establishing a Reliability Goal

The basic question of ‘How long should it last?’ may be the first question you consider related to the reliability of your product or production equipment. Ideally we would like to create a product that will never fail for our customers, or a set of equipment that just keeps running.

An Interview with Fred about MTBF

The NoMTBF Interview

Tim Rodgers interviews Fred Schenkelberg, consultant and blogger of NoMTBF, concerning Fred’s work and writing around the perils of MTBF.

We range from what started the site and the common issues caused by using MTBF. Then we discuss using reliability instead.


My Favorite Reasons to Avoid Using MTBF

How to Explain the Perils of MTBF Use

With a little practice and being aware of the many perils when using MTBF, you can become adept at clear and concise lines of reason to help others at least try a better way.

A trivial objection is ‘our product is not repairable so we’re using MTTF’. The math to estimate MTBF and MTTF from data is the same, total hours divided by total failures, thus both are an estimate of the average. Therefore, most of the arguments to switch away from MTBF equally apply to MTTF.

Misunderstanding 1

When someone suggests MTBF is a failure free period, try not to snort or laugh; that doesn’t help. Instead point out that the MTBF calculation results in the inverse of the failure rate. So if using hours, it provides the average chance of failure each hour. Then, using the exponential distribution reliability function, you can quickly show how many are expected to survive to the end of the so called failure free period: only about 37%, with roughly two thirds of the items having failed.
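A few lines of Python make the demonstration quick. The operating hours and failure count below are hypothetical; the point is that the fraction surviving to t = MTBF is the same whatever the MTBF value happens to be.

```python
import math

def mtbf_estimate(total_hours, total_failures):
    """The usual point estimate: total operating hours divided by total failures."""
    return total_hours / total_failures

# Hypothetical field data for illustration.
theta = mtbf_estimate(total_hours=120_000.0, total_failures=4)

# Exponential reliability evaluated at t = MTBF is exp(-1), regardless of theta.
surviving_at_mtbf = math.exp(-theta / theta)
print(f"MTBF = {theta:,.0f} h, fraction surviving to MTBF = {surviving_at_mtbf:.3f}")
```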

Misunderstanding 2

When someone suggests we only use MTBF because ‘we have always used MTBF’, I first ask if the metric and its meaning are well understood. There seem to be enough people with misunderstanding 1 that it may be the only reason needed to persuade someone to try another measure. If not, I ask ‘so, how many are expected to survive over the first year?’ This question usually surfaces one or more other misunderstandings.

Misunderstanding 3

‘I use MTBF to set the warranty period or our maintenance strategy’. MTBF is the inverse of the average failure rate and devoid of any changing rate of failure information. Are we dealing with a decreasing or increasing failure rate? How do we know the failure rate is actually the same for each hour? The only strategy for maintenance planning, given only an MTBF value and the assumption of constant hazard rate, is to replace or repair upon failure. If the item actually does have an increasing chance of failure with time, then MTBF is not able to describe that increasing rate. Use Weibull or some other model.

Misunderstanding 4

‘All our competitors and customers use MTBF in our industry’. This may be my favorite. Generally it is possible to quickly show that using reliability directly (probability of success over a duration for a function in an environment) provides a clear metric, plus using Weibull or other suitable model to describe the changing rate of failure over time provides a competitive advantage. By using a clear reliability statement and an appropriate model even with your customers, you avoid other misunderstandings, plus help everyone make better decisions. Better decisions concerning reliability mean meeting your customer’s reliability performance expectations.

Misunderstanding 5

Once in a while someone objects to using anything other than MTBF as it is a very easy metric to calculate from time to failure data. It is also easy to conduct test planning, etc., as many guides and books include examples showing the math required. The line of reasoning around ‘why limit yourself to the computing tools of the 50’s’ generally doesn’t work. It may be that the hesitation is actually related to doing the math involved with Weibull analysis or other approaches. Sure, the formulas and algorithms for anything beyond the exponential distribution and chi-square table may seem daunting, yet the benefits far outweigh the need to study and practice just a little. The math will come back quickly (you are most likely talking to college-degreed folks in an engineering or science field). Reinforce the need to avoid the other misunderstandings, plus the benefits around accurate models and decisions.

There are other misunderstandings and effective lines of reason to help someone move beyond using MTBF. What have you found useful for particular misunderstandings?

No excuse to use parts count to estimate field reliability

How to Estimate Reliability Early in a Program

In a few discussions about the perils of MTBF, individuals have asked about estimating MTBF (reliability) early in a program. They quickly referred to various parts count prediction methods as the only viable means to estimate MTBF.

One motivation to create reliability estimates is to provide feedback to the team. The reliability goal exists and the early design work is progressing, so estimating the performance of the product’s functions is natural. The mechanical engineers may use finite element analysis to estimate responses of the structure to various loads. Electrical engineers may use SPICE models for circuit analysis.

Customers expect a reliable product. If they are investing in the development of the product (military vehicle, custom production equipment, or solar power plant, for examples) they may also want an early estimate of reliability performance.

Engineers and scientists estimate reliability during the concept phase as they determine the architecture, materials, and major components. The emphasis is often on creating a concept that will deliver the features in the expected environment. The primary method for reliability estimation is engineering judgement.

With the first set of designs, there is more information available on specific material, structures, and components, thus it should be possible to create an improved reliability estimate.

Is testing the true way to estimate MTBF?

Early in a program means there are no prototypes available for testing, just bill of materials and drawings. So, what is a reliability engineer to do?

One could argue that without prototypes or production units available for testing (exercising or aging the system to simulate use conditions) we do not really know how the system will respond to use conditions. While it is true it is difficult to know what we do not know, we often do know quite a bit about the system and the major elements and how they individually will respond to use conditions.

Even with testing, we often use engineering judgement to focus the stresses employed to age a system. We apply prior knowledge of failure mechanism models to design accelerated tests. And, we use FMEA tools to define the areas most likely to fail, thus guiding our test development.

Creating a reliability estimate without a prototype

Engineering judgement is the starting point. Include the information from FMEA and other risk assessment methods to identify the elements of a product that are most likely to fail and thus limit the system reliability. Then there are a few options available to estimate reliability, even without a prototype.

First, it is rare to create a new product using all new materials, assembly methods, and components.

Often a new product is approximately 80% the same as previous or similar products. The new design may be a new form factor, thus mostly a structural change. It may include new electronic elements, often just one or two components, where the remaining components in the circuit are regularly used. Or, it may involve a new material while reusing known structures and circuits.

Use the field history of similar products or subsystems and engineering judgement for the new elements to create an estimate. A simple reliability block diagram may be helpful to organize the information from various sources.

For the new elements of a design, base the engineering judgement on analysis of the potential failure mechanisms, employ any existing reliability models, or use simulations to compare known similar solutions to the new solution.

Second, for the elements without existing similar solutions and without existing failure mechanism models, we would have to rely on engineering judgement or component or test coupon level testing. Rather than wait for the system prototypes, early in a program it is often possible to obtain samples of the materials, structures, or components for evaluation.

The idea is to use our engineering judgement and risk analysis tools to define the most likely failure mechanisms for the elements with unknown reliability performance. Let’s say we are exploring a new surface finish technique. We estimate that exposure to solar radiation may degrade the finish. Therefore, obtain some small swatches of material, apply the surface finish and expose to UV radiation. While not the full product using fully developed production processes, it is a way to evaluate the concept.

Another example is a new solder joint attachment technique. Again, use your judgement and risk analysis tools to estimate the primary failure mechanisms, say thermal cycling and power cycling, then obtain test packages with the same physical structures (the IC or active elements do not have to be functional) and design appropriate tests for the suspected failure mechanisms.

Estimate: combine the available knowledge

With a little creativity we can provide a range of estimates for elements of a design that have little or no field history. We do not need to rely on a tabulated list of failure rates for dissimilar products created by a wide range of teams for diverse solutions. We can draw on the actual field performance of our team’s prior designs for the bulk of the estimate. Then fill in the remaining elements of the estimate with engineering judgement, comparative analysis, published reliability models, or coupon or test structure failure mechanism evaluations.

In general, we will understand the bulk of the reliability performance and have rational estimates for the rest. It’s an estimate and the exercise will help us and the team focus on which areas may require extensive testing.
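One simple way to combine such a mix of sources is a series model where each element carries a low and a high reliability estimate for the period of interest. The Python sketch below assumes a series structure and independent elements; the element names and values are hypothetical placeholders, not data from any real product.

```python
def series_bounds(element_ranges):
    """Multiply the low ends and the high ends of each element's reliability range."""
    low, high = 1.0, 1.0
    for element_low, element_high in element_ranges.values():
        low *= element_low
        high *= element_high
    return low, high

# Hypothetical one-year reliability ranges, labeled by where the estimate came from.
estimates = {
    "power supply (field history)":       (0.985, 0.995),
    "main board (field history)":         (0.980, 0.990),
    "new enclosure (FEA and judgement)":  (0.950, 0.990),
    "new surface finish (coupon UV test)": (0.900, 0.980),
}

low, high = series_bounds(estimates)
print(f"estimated system reliability at one year: {low:.3f} to {high:.3f}")

# The widest ranges show where more testing would reduce uncertainty the most.
widest = max(estimates, key=lambda name: estimates[name][1] - estimates[name][0])
print(f"largest uncertainty: {widest}")
```

The elements with the widest ranges are the natural candidates for the extensive testing mentioned above.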

Variance and MTBF

When the data sheet or only available information is MTBF, how much do you know about the variability of the expected time to failure distribution? Not much really.

Do you need to know when to expect the first one percent of failures, or the first 10 percent? Sure, that information is useful when estimating warranty or service costs, and also for estimating readiness to go to market. We often are not interested in when the bulk will fail, but rather in the early small percentages.

Book and Course projects

Traffic

Over the past two weeks this site has received over 150 visitors each weekday. From what I can see in the analytics and from a few conversations with folks, the site provides insights and information around the use of MTBF, plus basic information concerning reliability engineering.

Google tends to like the site as they agree that visitors like the site, too.

Book project in search of feedback

Given the interest and plenty of encouragement (and helpful suggestions) I’m putting together a book based on the NoMTBF material. Not just bashing MTBF, although there is plenty of that, but also the steps to use reliability or other measures that provide better information.

I have the basic outline and draft completed and am now ready for some feedback. If you’d like to review the work, conditional on you providing your feedback, suggestions, ideas and comments, let me know and I’ll send you a draft copy.

The draft needs work on formatting, layout, adding clean graphics, etc. Yet the outline and basic text is there.

Can you follow the argument, is the writing clear, is there anything missing, how about the order or emphasis?

It’s not a long work, right now about 22,000 words or, depending on book page size, font size, margins, etc., about 100 to 120 pages. In Word it has 73 pages right now without any attention to formatting.

If you have the time and interest let me know and I’ll send you a copy, but you have to comment, critique, and make suggestions. I really would like this work to be useful for you and of use in encouraging others to avoid using MTBF.

Course project in search of ideas and direction

This period of reflection concerning the NoMTBF project has reinforced the idea that we need to provide something concrete and positive to do instead of just not doing MTBF. Part of the issue is our education system, standards, and textbooks as they often include MTBF in examples and at length in the discussion.

So, the idea is to create a course for experienced reliability professionals and interested engineers and managers with an interest in reliability, that focuses on reliability metrics from goal setting to tracking performance.

I’ve the technology to put together an online course that could be self paced or provided on a fixed schedule (say weekly). It could include short lectures, discussions, reading material and quizzes or examples to work.

Here’s a draft outline – what do you think?

  1. Reliability definition and how it is used in engineering decision making

  2. Common reliability measures: pros and cons

  3. Reliability and Availability Goal setting – connecting the goal to your business objectives

  4. Estimating reliability for comparison to the goals

  5. Tracking reliability and reporting performance

  6. Reliability testing with results that compare to goals

  7. Reliability modeling that leads to meaningful discussions and decisions

  8. Common mistakes and remedies concerning reliability measures

  9. How to get useful reliability information from vendors

(plenty of opportunity for bashing MTBF, yet if done in contrast to much better methods and measures, may provide really practical and useful information.)

So, thoughts? What would you want added or emphasized, and what would you want to be the main takeaways for each topic? What would you like to see in the course for yourself or for those you’d recommend take the course?

If you’d like to participate in the course project, I’m very open to your ideas and suggestions. Maybe help create and present a topic, provide examples, or sample problems or discussion questions.

Anyway, looking for feedback and ideas to make the NoMTBF site much more positive and useful for the reliability engineering community and for anyone interested in reliability.

Well thought out feedback

A note from Scott – providing feedback on the NoMTBF site.

Hi Fred,

Your website has generated quite a bit of valid conversation about MTBF. I applaud you for that. Honestly though I have mixed feelings about some of what you present and thought I’d write this lengthy e-mail to provide some feedback. I hope you take this in the right light as constructive criticism from someone who, overall, appreciates your efforts.

Clarifications

Let me start with a point I disagree with. In your opening slide show “Thinking about MTBF” I think the “Common Confusion” slide could be better presented. Many viewers would interpret that slide to say that the MTBF is not the mean. Of course MTBF is the mean. Your point is that, while it is the mean, the distribution is not Gaussian. Fair enough. Funny thing is I’ve actually had quality engineers try and tell me the MTBF is not the mean of the distribution and I’m afraid your slide may perpetuate that misunderstanding.

In the same vein, later in the talk, and in the other sections on your site, you seem to indicate that the MTBF is not the expected value (See Perils “I heard one design team manager explain MTBF as the time to expect from one failure to the next.”). Of course the MTBF is the expected value. That is from a pure mathematical sense (as you discuss earlier in this section). So I’m confused on your point here. I guess you are commenting on the laymen’s feeling for “expected” value. Which leads me to my next section.

Lack Of Understanding of Statistics

It almost appears that one of the premises of NoMTBF is that many people do not understand statistics and therefore we should not confuse them by using MTBF. I disagree with this. For example, many people don’t understand the difference between median and mean but no one is suggesting we remove those terms. Similarly, the fact that many people incorrectly assume a Gaussian distribution when they hear the term mean is hardly justification for removing the term MTBF. The problem is education not the definition. Same point for expectation. Because the average is some value does not imply all samples will be equal to that value. Anyone who thinks that, in my opinion, needs more education in statistics and we shouldn’t try and “simplify” to account for lack of education.

Constant Failure Rate

I don’t really accept your implication that using MTBF implies constant failure rate. The proper definition is the integral form you present in a number of spots but I agree that many tie these two together. I think one of the themes of your website is that the constant failure rate assumption is not valid. In that, I’m in 100% agreement and applaud your efforts. (I guess the site name would not have the same panache if it was called NoConstantFailureRate). Clearly the constant failure rate model often does not apply and reducing all of reliability to one number is a gross simplification.

Leadership

So where should people go instead? Just bashing something is not a solution. Your website really has had an impact but in a strange way sometimes it has had the opposite impact than what I think we would both like. I’ve had quality managers who did not want to gather the data on field failure with, in part, the justification that MTBF is bogus statistics. OK MTBF is not perfect but I’m sure we agree that the way to improve reliability is to gather data as a first step.

You have quite a following and, personally, I’d like to see you lead more. Yes MTBF is a simplification but I also don’t expect to pick up a data sheet and see a physics of failure paper stapled to the back of it or a chart of reliability over time. Fact is many complex things get reduced to a few key numbers (e.g. horsepower, MPG, 0 to 60 time for a car). I think your Actions/Alternative Metric is addressing this. Stating a reliability percentage over a time interval is an intriguing alternative. I like it. If that is your alternative then, personally I’d like to see it more clearly emphasized across the site. I’d also like to see you develop it more. How does one determine reliability % and duration from the Weibull parameters? How would one put together a reliability block diagram and estimate overall reliability if subcomponents were specified in this manner? I don’t know the answer to these questions and I’d be interested in reading more.

As I stated in the beginning, I hope you take this in the right light. While obviously I don’t agree with everything on your site you have many extremely valid points and you are doing a great job stimulating discussion. Thanks for your efforts.

Scott Diamond
Vice President of Quality and Customer Excellence
Surveillance Group
FLIR Systems Inc.

 

— Ed note:

Thanks Scott for the insightful and meaningful feedback. I will be making some adjustments and improvements. Thanks for the careful reading and for taking the time to provide your suggestions and comments. Very much appreciated. Fred

Questions to ask your Supplier about Reliability

Questions to ask a Supplier

Especially if they list MTBF on their data sheets.

My first question, which I generally keep to myself, is ‘MTBF, yeah, right. Do they know better or not?’ This is generally not a good way to start a conversation with a vendor about the reliability information you need to make appropriate decisions.