The post Futility of Using MTBF to Design an ALT appeared first on No MTBF.
Let’s say we want to characterize the reliability performance of a vendor’s device. We’re considering including the device within our system if, and only if, it will survive 5 years reasonably well.
The vendor’s data sheet lists an MTBF value of 200,000 hours. A call to the vendor and search of their site doesn’t reveal any additional reliability information. MTBF is all we have.
We don’t trust it. Which is wise.
Now we want to run an ALT to estimate a time to failure distribution for the device. The intent is to use an acceleration model to accelerate the testing and a time to failure model to adjust to our various expected use conditions.
Given the device, a small interface module with a few buttons, electronics, a display and enclosure, and the data sheet with MTBF, how can we design a meaningful ALT?
The data sheet and our system’s functionality relying on this device define a range of possible elements to measure. We could measure display brightness, button functionality, response times, life of the electronics, etc.
Before selecting what to measure in the ALT, we need to stop and ask what will limit the life of the device in our application? The provided reliability information doesn’t say. It just says the device has a suspiciously round number MTBF value of 200k hours.
An FMEA, risk analysis, or discussion with the development engineers may narrow down the possible elements of the device that will likely fail first. If time and resources permit, maybe running HALT to find weaknesses (identify failure mechanisms) is in order. Again, just having MTBF doesn’t help.
Knowing the likely failure mechanism to cause the device to fail is an essential first step to select the appropriate stress (temperature, vibration, power cycling, etc.) to accelerate that failure mechanism.
Not every failure mechanism responds to an increase in temperature. Applying the wrong stress will lead to poor results.
The data sheet might have some environmental or operating limits (power, voltage, temperature, etc.). Those may be clues to which stresses are worth exploring for how they lead to failures.
As when determining what to measure, we need to sort out which stress, or stresses, provide a means to accelerate the failure mechanism of interest.
Let’s say we estimate a rubber seal around the display is likely to fail and could be accelerated using higher temperatures.
Instead of the normal operating temperature of 25°C, let’s double it to 50°C. Ok, so? How much of an acceleration does that change in temperature cause? That is why we need an acceleration model.
The temperature increase might speed up the chemical reaction between the material and oxygen, and we can use the Arrhenius model, if we know or can estimate the activation energy.
Or, the temperature increase may increase the compression of the seal creating a mechanical deformation and damage over time. Here I’m not sure what model to use, yet the Arrhenius model would likely not be useful.
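For the oxidation case, here is a rough sketch of how the Arrhenius model turns the 25°C to 50°C change into an acceleration factor. The activation energies in the loop are placeholders, not values for any real seal material, and the function name is just for illustration. The spread in the results shows why the activation energy has to be known or estimated before the test plan means anything.

```python
import math

BOLTZMANN_EV_PER_K = 8.617333262e-5  # Boltzmann constant in eV/K


def arrhenius_acceleration_factor(use_temp_c, stress_temp_c, activation_energy_ev):
    """Ratio of the reaction rate at the stress temperature to the rate at the use temperature."""
    use_k = use_temp_c + 273.15
    stress_k = stress_temp_c + 273.15
    exponent = (activation_energy_ev / BOLTZMANN_EV_PER_K) * (1.0 / use_k - 1.0 / stress_k)
    return math.exp(exponent)


# Hypothetical activation energies; the real value depends on the failure mechanism.
for ea in (0.3, 0.7, 1.0):
    af = arrhenius_acceleration_factor(25.0, 50.0, ea)
    print(f"Ea = {ea:.1f} eV -> acceleration factor of about {af:.1f}")
```

With an assumed activation energy of 0.7 eV the factor comes out a little over 8, so "doubling" the Celsius temperature does not double anything about the failure rate.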
Of course, knowing MTBF provides no information on failure mechanisms other than to suggest the failures are repairable to keep the system running.
Given MTBF we may assume the system has a constant failure rate, or not. Remember all life distributions have a mean value. Knowing the MTBF value doesn’t automatically imply a constant failure rate.
Therefore, if we assume an exponential distribution describes the time to failure pattern, we may be wrong, and most likely would be wrong.
Is the failure arrival pattern decreasing, increasing? We don’t know just knowing MTBF.
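To see how little the mean tells us, here is a small sketch that holds the mean at the data sheet's 200,000 hours and changes only the Weibull shape parameter. The shape values are assumptions chosen for illustration, and five years is taken as continuous operation (43,800 hours).

```python
import math

MTBF_HOURS = 200_000.0
MISSION_HOURS = 5 * 8760.0  # five years of continuous operation (an assumption)


def weibull_reliability(t, beta, eta):
    """Probability of surviving to time t for a Weibull life distribution."""
    return math.exp(-((t / eta) ** beta))


def eta_for_mean(mean, beta):
    """Weibull scale that gives the requested mean: mean = eta * Gamma(1 + 1/beta)."""
    return mean / math.gamma(1.0 + 1.0 / beta)


def reliability_with_same_mean(beta):
    eta = eta_for_mean(MTBF_HOURS, beta)
    return weibull_reliability(MISSION_HOURS, beta, eta)


# beta < 1: decreasing failure rate, beta = 1: exponential, beta > 1: wear out
for beta in (0.5, 1.0, 3.0):
    print(f"beta = {beta}: R(5 years) = {reliability_with_same_mean(beta):.3f}")
```

All three cases share the same 200,000 hour mean, yet the five year reliability is roughly 52%, 80%, and 99% respectively.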
Knowing the failure mechanism and how an appropriate stress changes the failure rate is a great start. The design of the ALT includes sample sizes and how and when to make measurements. Knowing the expected pattern of failures given our samples allows us to monitor for failures at appropriate times.
Knowing the inverse of the average failure rate doesn’t really help us know when to expect failures to occur. This hampers our ability to design an efficient ALT.
An astute reader would probably wonder why we’re not using either time or failure truncated test planning and analysis. We have MTBF and that is all we need to design such life tests.
Well, the MTBF value is given and defines the testing. It doesn’t allow us to estimate the time to failure distribution. It may reveal if a system has poorer reliability than expected, yet not if it is better. Nor does such testing permit evaluation or understanding of the pattern of failures.
The MTBF based testing also assumes a constant failure rate. This means running 1,000 units for 20 hours or 20 units for 1,000 hours gives the same result. If the failure mechanism is wear out or a chemical degradation, then we are more likely to have failures in the units that run longer, and no or few failures in the group that runs for a few hours.
This approach is only appropriate if you know, without doubt, the dominant failure mechanism is best described by an exponential distribution and has an equal chance of failure every single hour of operation. If this is not a certainty, then running 20 or 1,000 units till you have sufficient failures to estimate the time to failure distribution is prudent.
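The 1,000 units for 20 hours versus 20 units for 1,000 hours comparison can be sketched numerically. The wear out parameters below are invented for illustration; only the 200,000 hour value comes from the data sheet in this example.

```python
import math


def expected_failures(units, hours, beta, eta):
    """Expected number of units that fail by 'hours' under a Weibull(beta, eta) life."""
    prob_fail = 1.0 - math.exp(-((hours / eta) ** beta))
    return units * prob_fail


plans = {"1,000 units x 20 h": (1000, 20.0), "20 units x 1,000 h": (20, 1000.0)}

# Hypothetical life distributions: (beta, eta in hours)
models = {"exponential, beta = 1": (1.0, 200_000.0), "wear out, beta = 3": (3.0, 5_000.0)}

for model_name, (beta, eta) in models.items():
    for plan_name, (units, hours) in plans.items():
        count = expected_failures(units, hours, beta, eta)
        print(f"{model_name:22s} {plan_name:20s} expected failures = {count:.6f}")
```

Under the exponential assumption the two plans are nearly interchangeable. Under the assumed wear out model the short test sees essentially nothing while the long test has a real chance of a failure, so the two plans are not the same experiment at all.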
Running an ALT is expensive. Let’s get the design of the ALT right. That starts by ignoring MTBF claims by vendors, and getting to know the failure mechanisms.
The post Two Ways to Think and Talk about Reliability appeared first on No MTBF.
Neither includes using MTBF, btw.
And, I’m not thinking about the common language definition either.
Plus, I may have this all wrong. Here is the way I think about the reliability of something. More than ‘it should just work’ and different than ‘one can count on it to start’. When I ask someone how reliable a product is, this is what I mean.
By explaining my basic understanding we can compare notes. It is possible, quite possible, that I will learn something. As you may as well. Let’s see.
First, consider the definition of reliability as used by reliability engineers and others in the know.
The probability that an item will perform a required function without failure under stated conditions for a stated period of time. Practical Reliability Engineering, Fifth Edition. Patrick D. T. O’Connor and Andre Kleyner, John Wiley & Sons, Ltd. 2012.
The product will probably work for some duration. The reliability function for life distributions is the probability of success over a duration.
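We define reliability as a probability over a duration. To me, there are two ways to understand this concept: as the proportion of a group of items that survives the duration, and as the chance that a single item survives it.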
Let’s explore these two a bit more.
If a product that is about to launch offers a warranty, a common question may be, “how many will fail during the warranty period?” The finance team and others want to know so they can plan accordingly.
The complement of how many will fail is how many will survive (it does sound a bit more positive than acknowledging that failures occur). Hence, the ‘survive’ idea.
So, if we create and sell 1,000 widgets we are interested in the reliability of said widget. If the probability the widgets will perform over the warranty period without failure is 90%, then we would expect 900 widgets to have worked without failure over the warranty period.
This understanding works when we have more than one item under consideration. Say 10 million cell phones, 50 thousand electric vehicles, 768 electric toothbrushes. Given some number put into service, what is the tally of successful widgets, meaning the ones that haven’t failed, still working at the end of some duration?
So, if our goal is 90% reliable over 2 years, we expect, if we achieve our goal, to have 9 out of 10 widgets function without failure for 2 years starting when manufactured, sold, or placed into service (whenever we and our customer define time zero for an individual widget).
It’s not calendar time or time since first launch and first sale. Time is relative to the individual widgets. It is the duration over which the item is expected to function that we track.
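As a quick sketch of the fleet view, the expected count of survivors is just the number of units times the reliability. If we also assume each widget fails independently of the others (my assumption, not something we can take for granted), the count is binomial and we get a feel for how much the tally might wander around 900.

```python
import math


def fleet_survival_summary(units, reliability):
    """Expected survivors and binomial standard deviation, assuming independent units."""
    expected = units * reliability
    std_dev = math.sqrt(units * reliability * (1.0 - reliability))
    return expected, std_dev


expected, std_dev = fleet_survival_summary(1000, 0.90)
print(f"Expected survivors: {expected:.0f}, give or take about {std_dev:.1f}")
```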
As a customer that just buys one widget, not an entire fleet, I’m interested in the chance my widget will survive some duration.
When I bought my current cell phone, I entered that purchase with the expectation it would last at least 3 years, and that it would be great if it met all my expectations functionally over 4 years. A consideration is the probability that the specific cell phone I purchase will survive over my expected duration. The sales folks and even reliability professionals could not provide me with the probability that my individual phone, serial number xyz, will survive 3 years or any other duration.
What we might know, or hopefully the product development team and supporting reliability professional know, is the probability an individual item will survive a duration, in general, or on average (average not being the right word, I think).
If the design and production process create a phone that is 90% reliable over 3 years for all phones they produce, then we can estimate that the phone in my hand has a 90% chance of surviving 3 years.
There are a lot of factors that contribute to the time each phone actually fails. Design changes, manufacturing variability, environmental and use differences, etc. In some circumstances, we may estimate the confidence bounds around the probability of surviving three years. Or we may find a range within which there is a 90% chance the individual items will survive. Either way, our best guess for an individual item is the reliability over the duration, R(t).
I once heard a woman buying an inkjet printer ask the clerk to select the box which has a printer most likely to last 3 years. The boxes are not labeled to indicate which has the most robust components, so the clerk selected the box with the fewest blemishes on the box. They laughed and she bought a printer hoping her printer would last 3 years.
I use probability as a function of time. The chance of survival over a short time is better than over a longer time, so I use a reliability function to keep track. The Weibull distribution is my go-to, yet there are other options, both parametric and non-parametric, to describe the probability as a function of time.
Do you use the reliability definition listed in so many textbooks as what you mean when you ask, ‘how reliable is this component?’ A probability over a duration? Shouldn’t we ask for the probability of success over some duration, or set of durations?
What do you think of my two ways of considering ‘reliability’? What does ‘reliability’ mean to you?
The post The Damage Done by Drenick’s Theorem appeared first on No MTBF.
Have you ever wondered why we use the assumption of a constant failure rate? Or considered why we assume our system is ‘in the flat part of the curve [bathtub curve]’?
Where did this silliness first arise?
In part, I lay blame on Mil Hdbk 217 and parts count prediction practices. Yet, there is theoretical support for the notion that for large, complex systems the overall system time to failure will approach an exponential distribution.
Thanks go to Wally Tubell Jr., a professor of systems engineering and test. He recently sent me his analysis of Drenick’s theorem and its connection to the notion of a flat section of a bathtub curve.
Wally did a little research and found the theorem lacking for practical use. I agree and will explain below.
Dr. Kececioglu in Reliability Engineering Handbook vol. 2 chapter 13 describes Drenick’s theorem as a ‘limit law of the time-to-failure distribution of a complex system’. He devoted chapter 13 of the handbook to the theorem.
Here is Dr. Kececioglu’s definition of the theorem:
Consider a complex system with n units connected in series reliability wise and each unit has its own pattern of malfunction and replacement. Further assume that (1) the components are independent, (2) every unit failure causes system failure, and that (3) a failed unit is replaced immediately with a new one of the same kind. Then, under some reasonable general conditions, the distribution of the time between failures of the whole system tends to the exponential as the complexity and the time of operation increase.
R. F. Drenick in his 1960 paper, “The Failure Law of Complex Equipment”, opens the paper with a brief explanation of the ‘law’.
In theoretical studies of equipment reliability, one is often concerned with systems consisting of many components, each subject to an individual pattern of malfunction and replacement, and all parts together making up the failure pattern of the equipment as a whole. The present note is concerned with that overall pattern and more particularly with the fact that it grows the more simple, statistically speaking, the more complex the equipment. Under some reasonably general conditions, the distribution of the time between failures tends to the exponential as the complexity and the time of operation increases; and somewhat less generally, so does the time up to the first failure of the equipment.
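To get a feel for what the theorem claims, here is a simulation sketch, not a proof. It pools the failures of many independent components, each with its own Weibull wear out pattern and each replaced immediately with a new identical unit, and then looks at the coefficient of variation (CV) of the time between system failures. An exponential distribution has a CV of 1. The parameter ranges, the burn-in period, and the function name are all my assumptions for illustration.

```python
import random
import statistics


def interarrival_cv(n_components, horizon, burn_in, seed=1):
    """Coefficient of variation of the pooled times between system failures."""
    rng = random.Random(seed)
    failure_times = []
    for _ in range(n_components):
        # Hypothetical parameter ranges: every component wears out (beta > 1).
        eta = rng.uniform(500.0, 1500.0)  # Weibull scale, hours
        beta = rng.uniform(1.5, 3.0)      # Weibull shape
        t = 0.0
        while True:
            t += rng.weibullvariate(eta, beta)  # failed unit replaced immediately
            if t > horizon:
                break
            if t >= burn_in:  # ignore the start-up period when every unit is new
                failure_times.append(t)
    failure_times.sort()
    gaps = [later - earlier for earlier, later in zip(failure_times, failure_times[1:])]
    return statistics.pstdev(gaps) / statistics.mean(gaps)


single = interarrival_cv(1, horizon=200_000.0, burn_in=5_000.0)
system = interarrival_cv(200, horizon=20_000.0, burn_in=5_000.0)
print(f"CV of time between failures, 1 component:    {single:.2f}")
print(f"CV of time between failures, 200 components: {system:.2f}")
```

Note what the sketch does and does not show. The pooled system pattern drifts toward exponential-looking behavior only because every idealized assumption is built into the loop, and not one of the individual components has a constant failure rate.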
In section 6 of Drenick’s paper, he provides a few comments. These comments highlight the limitations for the use of the work, plus outline potential extensions.
On the assumption that components within the system are statistically independent, Drenick mentions “this assumption is debatable.” I agree as it is rare that the failure rate of an individual component is not influenced by the behavior of other components or the immediate environment/use conditions.
On the assumption that it “makes good sense to lump the failures of many, presumably dissimilar, devices into one collection pattern.” The theorem focuses on the overall system failure pattern, which is made up of many different failure mechanisms within the many components. Drenick states, “The fact is that this may sometimes be quite inappropriate.” He goes on to explain that the failure of some components may be rather inconsequential while other component failures may be catastrophic. Not all failures have the same impact on the system.
Drenick cautions that this “simple and comprehensive statement of a complex state of affairs…. can also be interpreted.” He cautions against assuming that one only needs the means (MTBF) of the components. I agree with Drenick that we need more information to make meaningful decisions concerning the design, sourcing, and maintenance of a system.
From the email exchange with Wally, here are a few more ‘issues’ with the practical use of the theorem.
Assuming failed components are replaced immediately. This is in part due to the theorem’s reliance on renewal process theory. In practice, it may take time to identify failed components and execute a replacement. Some replacements are quick if the diagnostics work well and there is a readily available and appropriate set of spares.
Assuming replacement parts are identical (albeit new) components. In practice, we use refurbished, upgraded, or otherwise similar parts, not just identical replacements.
Assuming just the failed component is replaced, leaving all other components in place. We replace subsystems which may have dozens to thousands of components. Thus, we replace aged, yet not yet failed components, robbing the system of other potential imminent failures.
Assumes there are no components that fail significantly more or less often than other components. There is an allowance for different failure patterns, yet the theorem doesn’t work if there are one or two components that contribute the majority of system failures. Recall Pareto and the notion that a few components will cause the most failures.
Assumes the failures occur independently of use or environmental conditions. Sure, some components fail more often due to storage stress, while others fail due to use. Yet the assumption of statistical independence includes the notion that the chance of failure does not change for a set of components when in storage or in use.
Assumes the system level failure pattern is useful. To me, Fred, this is purely an academic exercise and not useful for any practical decision making or modeling. If I need to estimate the cost of operation, cost of spares, availability, etc. I need more than a system MTBF value to make a meaningful decision.
Assuming it is true for your system and further assuming your system is in the ‘flat part of the curve’. This line of reasoning leads to the baseless assumption that every component follows the exponential distribution as well.
Most of you, kind readers of the NoMTBF blog, know this to be untrue, yet not everyone is as enlightened.
All too often we hear, along with a wave of a hand, that we’re in the flat part of the curve, or that this is a large complex system so we can assume exponential… cringe.
Drenick’s Theorem does not justify the use of MTBF. Drenick mentions that we need more than just the mean life to do our work as reliability engineers.
How have you seen this theorem misused? Add your thoughts to the comments section below.
Drenick, R. F. 1960. The failure law of complex equipment. Journal of the Society for Industrial and Applied Mathematics 8 (4): 680-690.
Kececioglu, Dimitri. 1991. Reliability Engineering Handbook, Vol. 2. Englewood Cliffs, NJ: Prentice-Hall.
Email exchange with Wallace Tubell, October 30, 2017
The post 3 MTBF Stories appeared first on No MTBF.
Everyone loves a great story. Storytelling has been a long tradition to pass along knowledge and wisdom.
There are good stories, tales of inspiration. There are sad stories, tales of caution.
There are fables, ghost stories, legends, epic poems, and more. When considering the reliability performance of your product or equipment, you probably have a few stories that you can tell. “That time … “
Simply join colleagues for lunch and ask about the ‘major disasters’ of the past. The stories help us to remember and hopefully avoid repeating mistakes.
Here are three stories with MTBF as a central figure. It is a site and blog that does talk about MTBF, so it fits. To start, let me introduce you to Martin, a new reliability engineer reporting to his first day of work at a bicycle design and manufacturing company. Two sad stories and a good one. Enjoy.
As a former manufacturing engineer and with significant design for reliability accomplishments, Martin joined the ABC Bike Co as their reliability engineer. Martin knows about many of the tools within reliability engineering such as FMEA, HALT, ALT, modeling, failure analysis, etc.
Martin asked the director of manufacturing, Jas, for access to the production line’s reliability data. Jas says, “Sure, we have a lot of reliability data. We track every incident of downtime across the various product lines. Here are MTBF values by machine or station. Here are MTBF values per shift. It’s all here in the MTBF spreadsheet. Enjoy.”
With a little study and investigation, Martin learns that the data collection is simply a count of the downtime incidents by piece of equipment and by shift. The shift data is for the 8 hours of the shift across all the production lines and equipment. The equipment specific MTBF values do not indicate what duration the data covers.
Back to Jas, Martin asks, “Why is the reliability data only MTBF summaries?”
Jas responds, “It’s what we use. We only use MTBF, it is for repairable systems. We don’t use MTTF as that is strictly for non-repairable items. Why, what else do we need to know?”
At this point, Martin launches into a quick description of time to failure data, how failure rates change over time, and how MTBF is not helpful. He realizes the data collection practices will need improvement, plus some education.
Martin knows the bicycle lines enjoy a poor reliability reputation in the market. He heads over to see the director of engineering, Kyle. Kyle quickly explains the current design practice that focuses on creating durable, rugged bicycles, yet suspects the manufacturing and supply chain cost-cutting erode the ability to keep warranty expenses low.
This all seems good and the discussion turns to design for reliability. Martin asks, “What is the reliability goal for product xyz?”
Kyle says, “It is a difficult goal of 50,000 hours MTBF, a slight tick up in reliability from the previous model.”
Martin then asks, “Ok, over what duration? And, how many should survive that duration?”
Kyle looks a little confused as he responds, “50k hours, that is how long the bike should work without failure for a customer. Yet we seem to get plenty of failures prior to 5 years.”
“Why do you set the goal using MTBF?” (Martin)
“Because that is what our distributors ask for and expect from us. And, it’s the most common reliability metric, right? Besides, looking at our field data we are just about meeting the 50k goal now.”
Martin then scratches his head and wonders how much time they have left to discuss reliability goals, MTBF, and how successful they have been achieving their stated goal. Seems another round of education is in order.
Martin saved exploring product reliability testing practices for another day.
Let’s fast forward about 2 years. Martin is still working at ABC Bike Co and enjoying going to work each day. He’s accomplished quite a bit in the two years. He recently received a bonus for his contributions which cut warranty expenses to half what they were before he started working there.
His recognition by management for his achievements is great, yet the general greeting he receives walking through the halls is wonderful praise, “We don’t use MTBF here.”
Martin changed the way the organization talks about reliability. It avoids confusion and misunderstanding. Using reliability, R(t), allows them to get the data that provides a clear understanding of changing failure rates over time. The entire organization is making better decisions and not using MTBF at all. The end.
The post Different Data Same Decision appeared first on No MTBF.
Let’s say you have some time to failure data on your equipment. A common action is to calculate the MTBF. All well and good until you expect to make a meaningful decision based on the calculation.
Using just the mean of the data, the MTBF value is likely to provide you with a less than useful bit of information. Thus your decision will be rather random or worthless.
Let’s explore just how this simple calculation of perfectly good data can mislead your decision making.
In your factory or within your system you have data on five failures. In this case, the perfect case, each of the five failures occurred after 1,752 hours of operation since start-up or the last repair. The system was installed and started one year ago, and today at noon, the fifth failure occurred.
What is the MTBF if the equipment has operated for a total of 8,760 hours?
Easy right, it is 8,760 / 5 or 1,752 hours MTBF.
This even makes sense since each failure occurs like clockwork at 1,752 hours since start-up or the last repair. The five failures occurred equally spaced over the year.
If the failure is always the same part, we could just replace that part at a convenient time prior to 1,752 hours of operation and avoid the unwanted downtime. This is a ‘perfect’ case and if it occurs with your system, let me know, as I’ve never heard of or seen this type of pattern of failures outside of textbooks or contrived examples.
If you routinely just calculate MTBF, how close to this perfect case is your data? Are you assuming this failure pattern when making decisions concerning spares, scheduling, etc?
Let’s say another system generates some failure data where four of the failures occur in the first 100 hours of operation and the last failure occurs at the end of the year. A total of 5 failures again and there has been a total of 8,760 hours of operation.
The MTBF calculation is simple. 8,760 / 5 for 1,752 hour MTBF, right?
Would planning to replace the part after we operate for another 1,752 hours make sense? You could make the same decision as in the ‘perfect case’ and would likely not see downtime due to the part.
More likely after four failures inside of a week of operation, the team would investigate and solve the issue leading to the ‘early’ failures. Yet, if the end of the year calculation is simply the MTBF value, what have we learned? How could this data distort our understanding of the system’s operation and our best course of action going forward?
“Bearings and MTBF” is one of my favorite examples. As you know, if a bearing is designed, applied, and installed correctly it will work well. It will function well till it wears out. As the lubrication breaks down it permits metal wear which leads to eventual bearing failure.
In the spirit of the trio of examples, let’s say we have installed five bearings on our system and all five failed after exactly 8,760 hours of use. The MTBF value is once again 8,760 / 5 or 1,752 hours, right?
We know bearings wear out, we know the failure was due to wear, yet we calculate a value which ignores the wear out time to failure pattern anyway. Why?
If we wanted to design an appropriate maintenance plan for bearings, we would not use MTBF as it would have us replacing them every 1,752 hours (unless those with practical thinking skills and experience intervened).
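The three cases above are easy to put side by side. In this sketch the four early failure times in the second case are made up (the text only says they fall within the first 100 hours), and the MTBF calculation follows the same total hours divided by number of failures arithmetic used above.

```python
def mtbf(total_operating_hours, failure_count):
    """The usual calculation: total operating hours divided by number of failures."""
    return total_operating_hours / failure_count


TOTAL_HOURS = 8760.0

cases = {
    "like clockwork": [1752.0, 3504.0, 5256.0, 7008.0, 8760.0],
    "early failures": [20.0, 45.0, 70.0, 95.0, 8760.0],  # first four times are made up
    "bearing wear out": [8760.0] * 5,
}

for name, failure_times in cases.items():
    value = mtbf(TOTAL_HOURS, len(failure_times))
    print(f"{name:17s} failures at {failure_times} -> MTBF = {value:.0f} hours")
```

Three very different failure patterns, one identical number.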
These three cases are contrived. They serve to illustrate the point that calculating MTBF is rather a waste of time at best. It is going to mislead you and your team by ignoring whether the failure rate is decreasing or increasing over time. It is likely you will make poor decisions about your system, its reliability, and its maintenance.
Poor decisions cost money.
Treat your data better and make better decisions. Save money.
Instead of calculating MTBF, plot the data using a mean cumulative function. If the data is for a non-repairable element of your system, use the Weibull distribution or another appropriate life data distribution to describe the failure pattern over time.
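For repairable equipment, a nonparametric mean cumulative function is simple enough to sketch by hand. At each failure time the curve steps up by one divided by the number of systems still under observation. The data format and the example systems here are hypothetical.

```python
def mean_cumulative_function(systems):
    """Nonparametric MCF estimate.

    systems: list of (failure_times, end_of_observation) tuples, one per system.
    Returns a list of (time, mean cumulative number of failures per system).
    """
    events = sorted(t for failures, _ in systems for t in failures)
    mcf = []
    total = 0.0
    for t in events:
        at_risk = sum(1 for _, end in systems if end >= t)  # systems still observed at t
        total += 1.0 / at_risk
        mcf.append((t, total))
    return mcf


# Hypothetical data: system A observed for 100 hours, system B for only 40 hours.
systems = [([10.0, 50.0, 90.0], 100.0), ([30.0], 40.0)]
for time, value in mean_cumulative_function(systems):
    print(f"t = {time:5.1f} h  MCF = {value:.2f}")
```

Plot the resulting points. A curve that bends upward suggests an increasing rate of failures, one that flattens suggests a decreasing rate, and only a straight line is consistent with the constant rate that MTBF quietly assumes.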
This short article only reveals a small set of the decisions routinely made using MTBF. How have you seen the use of MTBF lead to poor (and costly) decisions?
The post 5 Reasons Rate of Change is Important appeared first on No MTBF.
A simplifying assumption associated with using MTTF or MTBF implies a constant hazard rate. Some assume we’re in the useful life section of the bathtub curve. Others do not understand what assumptions they are making.
Using MTTF or MTBF has many problems and, as regular readers here know, we should avoid using these metrics.
By using MTTF or MTBF we also lose information. We are unable to measure or track the rate of change of our equipment or system’s failure rates (hazard rate). The simple average is just an average and does not contain the essential information we need to make decisions.
Let’s explore five different reasons the rate of change of a failure rate is important to measure and track.
A decreasing failure rate tends to indicate manufacturing, assembly, transportation or installation type problems. An increasing failure rate tends to indicate wear out type problems.
Both patterns can occur at any time, thus good failure analysis of units that have failed is still important.
A decreasing failure rate would indicate the reduced need for spares in the future. Likewise an increasing failure rate would indicate an increased need for spares in the future.
Simply maintaining stocking levels based on previous counts of failures may over or under stock expensive or critical spares.
If there are two elements of a product both showing an increasing failure rate, which do you address first? The one increasing at a faster rate, right? Another consideration is the severity of consequences of a failure and magnitude of failures. Yet when all else is even, the one with a higher rate of increase should receive your attention first.
When tracking field failures a common outcome of the analysis is to spot a potentially epidemic or higher than expected number of failures. By monitoring the rate of failure occurrences one can spot those increasing failure rates. An increasing failure rate, even with just a few actual failures, permits your team to investigate and understand the failure mechanism.
Finding major problems well before they become major is a good thing.
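One common way to check whether the rate of failure occurrences is trending, though not one named in this article, is the Laplace trend test for a repairable system observed over a fixed period. The sketch below assumes a single system watched from time zero to a known end time. Values well above zero (roughly beyond 2) suggest failures are arriving faster over time, values well below zero suggest they are arriving slower. The example times are hypothetical.

```python
import math


def laplace_trend_statistic(failure_times, observation_end):
    """Laplace trend statistic for one system observed from 0 to observation_end."""
    n = len(failure_times)
    mean_time = sum(failure_times) / n
    return (mean_time - observation_end / 2.0) / (observation_end * math.sqrt(1.0 / (12.0 * n)))


# Hypothetical failure times in hours over a 1,000 hour observation period.
examples = {
    "early failures": [10.0, 20.0, 30.0, 40.0],
    "evenly spread": [200.0, 400.0, 600.0, 800.0],
    "late failures": [900.0, 950.0, 980.0, 990.0],
}
for name, times in examples.items():
    print(f"{name:15s} U = {laplace_trend_statistic(times, 1000.0):+.2f}")
```

With only four failures the test has little power, so treat the result as a prompt to go look at the failed units, not as a verdict.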
Using just an average failure rate provides little insight on what will actually occur tomorrow. If the goal is to repair or replace items just before they have an unacceptable chance of failing (or just before failure occurs), we need to understand that rate of change for the probability of failure.
We can use the combination of our understanding of the failure mechanisms and existing time to failure data (or degradation measurements) to forecast the best time for repair/replacement.
Let your data tell the story around the changing chance of failure over time. It happens. It is very rare that an item or system has a constant failure rate. Thus you need to monitor the changing failure rates to make better decisions, save money and time, plus improve the reliability of your system.
The post How About Weibull Instead of MTBF? appeared first on No MTBF.
This was a follow up question in a recent discussion with Alaa concerning using a metric other than MTBF.
The term ‘Weibull’ in some ways has become a synonym for reliability. Weibull analysis = life data (or reliability) analysis. The Weibull distribution has the capability to describe a changing failure rate, which is lacking when using just MTBF. Yet, is it suitable to use ‘Weibull’ as a metric?
Use reliability, the probability of successful operation over a defined duration. This typically includes a defined environment as well.
It is the definition of reliability, as we use it in reliability engineering.
Instead of saying we want a 50,000 hour MTBF for the new system, say instead, we want 98% to survive 2 years of use without failure.
Be specific and include as many couplets of probability and duration as is necessary and useful for your situation. For example, you may want to add 99.5% survive the first month of use. And, 95% survive 5 years of use.
Weibull, Lognormal, normal, exponential and many others are names of statistical distributions. They are formulas that describe the pattern formed by time to failure data (repair times, and many other groups or types of data).
Instead of Weibull Analysis you could easily also say we’re going to conduct a Normal analysis. In reliability work, I often first explore a set of life data by fitting a Weibull distribution to the data and plotting the probability density function (PDF) and cumulative distribution function (CDF). It’s a first look and not the end of the analysis.
Each distribution has four functions that are useful for reliability engineering work: the probability density function (PDF), the cumulative distribution function (CDF), the reliability function, and the hazard function.
Since I tend to like being positive about a product, I often use the reliability function (calculated at specific points in time, t) instead of the CDF which is the probability of failure over time, t.
The reliability function is a function of time, hence my suggestion to always include a probability and duration when specifying or reporting reliability values.
The Weibull distribution, as other distributions, is a curve or equation. It is not a metric on its own.
Define the time intervals of interest, run out the calculations (I recommend using the reliability function for the appropriately fitted distribution) and then you have a metric.
Goals are not metrics, yet a goal should be something we can measure that helps us make better decisions. For example, setting reliability goals for 1 month, the warranty period, and over the expected use life.
Then use vendor or testing data, and/or field data to estimate the distribution of the life data. Then again for specific time intervals of interest calculate reliability. Now you can compare your data to your goals and make informed decisions.
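The steps above can be sketched as follows: fit a distribution to life data, calculate reliability at the durations of interest, and compare with the goals. This is a minimal sketch using median rank regression on a complete (uncensored) sample. The failure times are made up, the goals reuse the example couplets from earlier, and real data would usually include censored units that need a more careful method.

```python
import math


def fit_weibull_mrr(failure_times):
    """Median rank regression (y on x) for a complete sample. Returns (beta, eta)."""
    times = sorted(failure_times)
    n = len(times)
    xs, ys = [], []
    for i, t in enumerate(times, start=1):
        median_rank = (i - 0.3) / (n + 0.4)  # Benard's approximation
        xs.append(math.log(t))
        ys.append(math.log(-math.log(1.0 - median_rank)))
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    slope = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
             / sum((x - x_bar) ** 2 for x in xs))
    intercept = y_bar - slope * x_bar
    beta = slope                         # the slope of the line is the shape parameter
    eta = math.exp(-intercept / beta)    # intercept = -beta * ln(eta)
    return beta, eta


def weibull_reliability(t, beta, eta):
    """Probability of surviving to time t."""
    return math.exp(-((t / eta) ** beta))


# Hypothetical hours to failure (not data from this article).
hours_to_failure = [8200, 14500, 19800, 24100, 29500, 35200, 41000, 52300]
beta, eta = fit_weibull_mrr(hours_to_failure)

# Goals as (label, duration in hours, required reliability).
goals = [("1 month", 730, 0.995), ("2 years", 17520, 0.98), ("5 years", 43800, 0.95)]

print(f"fitted beta = {beta:.2f}, eta = {eta:.0f} hours")
for label, hours, target in goals:
    r = weibull_reliability(hours, beta, eta)
    status = "meets" if r >= target else "misses"
    print(f"{label}: R = {r:.3f} {status} the goal of {target:.3f}")
```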
Just doing ‘Weibull’ is not a metric.
In many circumstances it is clear that when someone says they are going to do a Weibull Analysis, it is really a life data or reliability analysis not limited to only fitting a Weibull distribution. At least I hope so. The result of the analysis may be an estimate of reliability over a time period of interest.
How do you use the term ‘Weibull’ or how have you heard it misused? Add your thoughts or observations in the comments below.
]]>The post Are We Teaching Reliability All Wrong? appeared first on No MTBF.
]]>Teaching reliability occurs through textbooks, technical papers, peers, mentors, and courses. The many sources available tend to use MTBF as a primary vehicle to describe system reliability.
What has gone wrong with our education process?
From tutorials by college professors at RAMS to numerous ‘reliability engineering’ textbooks, the discussion equates reliability with MTBF. The way to measure or describe the reliability of something is MTTF for non-repairable or MTBF for repairable systems (like that bit of semantics matters).
Just use an average, it’s good enough.
A good textbook does not mention MTBF; a great lecture avoids the use of MTBF. IMHO
When I confronted a professor on why a major portion of a tutorial on reliability statistics focused on MTBF, she said it was a great way to teach the concepts of distribution properties without worrying too much about the math. It was merely a mechanism to teach other concepts.
The lecture did not spend much time on applying those concepts to real problems. It did not explain that the use of MTBF (and the exponential distribution) was a never-to-be-used-in-the-real-world set of examples meant to focus on key concepts. It left me and others in the room with the idea that MTBF was the way to describe reliability.
I have had the same conversations with book authors.
The ASQ CRE exam is rife with problems using MTBF (or MTTF). Why? – Because it is easy and quick to test calculations based on MTBF and the exponential distribution.
Sure, it’s easy, and it’s not something we should be good at. Aren’t we supposed to be good at solving real problems, which are not easy? How about creating a certification exam that actually evaluates what we should know and do at work, not what is considered easy?
In more than one setting, as I rant on about the topic of abolishing MTBF, I encounter the rebuttal that students/engineers need to know about MTBF. If for no other reason than it is out there.
Sure, MTBF is on data sheets, test reports, parts count software outputs, etc. It is everywhere.
I agree students and engineers need to understand the folly that MTBF represents, the lack of information it contains, the inability to use it for any meaningful decision making. Instead, students and engineers should learn to automatically insist on more information, data, evidence, all leading to an understanding of reliability (probability of success over time…).
When I hear someone ask for MTBF, I ask them what they really want to know. What they seek, when asking for MTBF, is never served or supported by knowing MTBF. We need to learn, and teach, reliability engineering to help each other ask better questions and solve real problems.
If I use existing literature and teachings on reliability engineering to prepare a new book on the topic, I would likely feel compelled to include MTBF. I won’t, other than to warn the reader not to use it at all.
You should do the same.
If in a class or tutorial where the instructor mentions MTBF, ask them when they will begin talking about something useful concerning reliability. Go ahead, you can say I urged you to ask.
If reading a book that drones on about MTBF testing and confidence intervals for time or failure truncated testing, send the author a note asking when and under what conditions this technique would ever be useful. Ask them to provide case studies and evidence that the underlying failure mechanisms involved are actually best fit by the exponential distribution. Ask them to justify spending more than one sentence of this expensive book on such drivel.
Go ahead, ask for rationale and justification. Ask for a better education.
If enough of us stand up and say, ‘hold on – when in the real world is the use of MTBF ever useful?’, we just may get some professors and authors to provide meaningful content.
How have you challenged the use of MTBF? If you haven’t, why not? What is holding you back?
]]>The post Life Data Analysis with only 2 Failures appeared first on No MTBF.
]]>Here’s a common problem. You have been tasked to peer into the future to predict when the next failure will occur.
Predictions are tough.
One way to approach this problem is to do a little analysis of the history of failures of the component or system. The problem looms larger when you have only two observed failures from the population of systems in question.
While you can fit a straight line to two failures and account for all the systems that operated without failure, it is not very satisfactory. It is at best a crude estimate.
Let’s not consider calculating MTBF. That would not provide useful information, as regular readers already know. So what can you do given just two failures to create a meaningful estimate of future failures? Let’s explore a couple of options.
Well, two failures is a start. Of course, there are a number of questions about those two failures whose answers may prove helpful.
When did they fail? How long did they operate? This provides just a sketch of time to failure information.
How did they fail? What is the failure mechanism(s)? Maybe there is a time to failure model that describes these failures.
The more we know about the two failures the better we are able to estimate other failures in the population. Speaking of the population, how many elements are in the population? Anything unique about the two failures vs remaining items? How about operating time for all items?
If we know the failure mechanisms and time to failure information we may be able to use existing models or historical knowledge of similar failures to create an estimate of reliability performance. Some may call this a Bayesian approach. Use what you know both statistically and technically to your advantage.
Knowing the failure mechanism may permit finding a published model that describes the time to failure pattern. Knowing the time to failure information for the two failures allows using that information to adjust the model to fit the known information.
If the failure mechanism suggests a particular pattern of failures over time, say a wear out mechanism, we may be able to assume a beta value (for a Weibull distribution). Using the two known failures, construct a rough estimate using a point and slope approach.
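One way to make this concrete is the approach sometimes called Weibayes: assume beta, then estimate the Weibull scale (eta) from the operating time of every unit, failed or not. It is a related technique, not necessarily the exact point and slope construction described above. The failure times, population size, and assumed beta below are all hypothetical.

```python
import math


def weibayes_eta(unit_hours, n_failures, beta):
    """Estimate eta with beta assumed known.

    unit_hours: operating hours of every unit, failed or still running.
    n_failures: number of those units that have failed (must be at least 1).
    """
    if n_failures < 1:
        raise ValueError("need at least one failure for this form of the estimate")
    return (sum(t ** beta for t in unit_hours) / n_failures) ** (1.0 / beta)


def weibull_reliability(t, beta, eta):
    """Probability of surviving to time t."""
    return math.exp(-((t / eta) ** beta))


# Hypothetical fleet: 2 failures, plus 48 units still running at 5,000 hours.
failure_hours = [3100.0, 4400.0]
running_hours = [5000.0] * 48
assumed_beta = 3.0  # assumed from knowledge of a wear out mechanism

eta = weibayes_eta(failure_hours + running_hours, len(failure_hours), assumed_beta)
print(f"estimated eta = {eta:.0f} hours")
for t in (5000, 8000, 12000):
    print(f"R({t} h) = {weibull_reliability(t, assumed_beta, eta):.3f}")
```

The estimate is only as good as the assumed beta, so it is worth repeating the calculation over a plausible range of beta values to see how much the answer moves.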
Another option, again given an understanding of the nature of the failure mechanisms along with access to existing units, is to map the progress toward failure in some fashion for other items. If we have two failed meters, for example, due to excessive brush wear, we could measure brush wear on a sample of remaining units to create a degradation model and estimate remaining operating life for the population.
Lots of ifs here, yet it is an option if the situation fits.
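As a sketch of the brush wear idea, the simplest degradation model is a straight line: assume wear starts at zero and accumulates at a constant rate for each unit, then extrapolate to a wear limit. The linear assumption, the wear limit, and the measurements below are all hypothetical. A real study would check the shape of the wear curve first.

```python
def remaining_life_hours(hours_run, wear_mm, wear_limit_mm):
    """Linear extrapolation of remaining life for one unit.

    Assumes wear began at zero and grows at a constant rate.
    Returns 0.0 if the unit is already at or past the limit.
    """
    if hours_run <= 0 or wear_mm <= 0:
        raise ValueError("need positive operating hours and measured wear")
    rate_mm_per_hour = wear_mm / hours_run
    return max(0.0, (wear_limit_mm - wear_mm) / rate_mm_per_hour)


WEAR_LIMIT_MM = 12.0  # hypothetical wear at which the brush no longer works

# Hypothetical sample of running units: (operating hours, measured wear in mm).
sample = [(4000, 5.1), (4200, 6.3), (3900, 4.0), (4100, 7.8)]

for hours, wear in sample:
    remaining = remaining_life_hours(hours, wear, WEAR_LIMIT_MM)
    print(f"{hours} h run, {wear} mm worn: about {remaining:.0f} h remaining")
```

The spread of the projected lives across the sample gives a rough picture of the time to failure distribution for the population.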
Finally, one could (I’m not sure why) estimate the total time of operation for the population, including the two items that have failed, and calculate MTBF. You would calculate a number, which may be satisfying, yet, as you know, not very useful for any practical purpose.
The more you know about the two failures the better. Ask the questions before fielding your units, before failures occur. After failures occur, you may not have the same range of options available to estimate system reliability.
What have you learned from a couple of failures? How did you treat the information?
]]>The post Discussions and MTBF Questions appeared first on No MTBF.
]]>The best way to help others understand and stop using MTBF is to engage them in a discussion. I get questions concerning MTBF or reliability a few times a week. I attempt to answer each and every one, plus adding a follow up question or two.
In person or online, ask and answer MTBF questions. You not only improve your understanding of MTBF and reliability, you improve your skill at telling stories to help effect change across your industry.
We all have learned reliability engineering concepts and methods from others. We add our contributions and refinements along the way. We should share our knowledge with others as a small part of improving the practice in our field.
Likewise, if you have a question, there is often someone with a little more or different experience that has an answer. Ask your question. Give others a chance to share their knowledge.
When we talk about MTBF we encourage the understanding of reliability. When we ask questions about how others use MTBF values to make decisions, we help them frame better questions, improving their decisions. When we answer questions concerning the perils of using MTBF we illuminate the underlying problems with its use.
With practice, anytime you see or hear MTBF in use, you will reflexively ask questions.
At first you may not even notice how often you see and hear MTBF. Plan to pay attention for a day. Just count the times MTBF is invoked.
Set a reminder in your calendar system so you receive a nudge to count MTBF use again, then again, and so on. Each time you notice you will also notice when it is being improperly used.
Then practice doing something to improve the discussion. Ask questions. Get to a better understanding of the goal, the estimate, or the analysis. Get a clear understanding for yourself and for those around you.
The logic here is simple. If we, within an organization, understand a clear set of terms and goals concerning reliability performance, we avoid mistakes, confusion, and surprises. If we collectively with our customers talk about and understand reliability performance expectations and performance we are more likely to deliver what they want.
When you hear a customer or supplier talking about MTBF, ask them what they really want concerning reliability. Simple, quick, clears up any confusion.
Our ability to deliver what a customer expects is in part based on knowing what they expect. Cleaning up the language we use is a great start to make our product successful in delivering what customers expect.
One of the best ways to master a topic is to teach that topic. Explaining the perils of MTBF to others allows you to improve your ability to convey the meaning well. You will learn how to teach by fitting the examples and specific problems with MTBF use to the specific situation.
Practice every chance you get. It’s not bad being known as the NoMTBF dude.
What is an example of how you answer (or ask) questions concerning MTBF? Use the comments box below and let’s share how to work to engage and improve our collective discussions concerning reliability.
]]>The post 3 Types of MTBF Stories appeared first on No MTBF.
]]>Stories communicate well. We have been telling stories long before the invention of writing, or the internet. The MTBF stories we tell communicate our ideas, suggestions, and recommendations.
There are differences between good and poor stories. How you tell a story matters as well as the subject of the story. Now, MTBF stories may not be the most thrilling or entertaining, yet there are stories on MTBF topics that matter.
Let’s explore using the power of story to cause those around us to better understand and avoid the use of MTBF.
For someone that asks about MTBF, wants to talk about MTBF, or only knows and uses MTBF, this story may provide the needed ‘slap in the face’ (‘wake up call’ or insight) to look at MTBF as something other than reliability.
Simply telling someone that MTBF is bad doesn’t work. You need to provide support, evidence, and examples. In general, select a story that fits the current situation.
If someone asks you for your product’s MTBF, you may respond with another question, “What do you really want concerning reliability performance?”. This may lead to discussion about the actual reliability requirements they have in mind, or it may result in a puzzled look and a pause. You can then tell them a story about meeting customer expectations and how product reliability is an important element of customer satisfaction. MTBF by comparison doesn’t provide the information they need, thus would not meet their needs well.
This one is an easy story to tell as there are many examples available. It is also a story you should have ready as most of the time when someone is talking or asking about MTBF they are either going to make a decision or gathering information for others to make a decision.
Now let’s assume decision makers would prefer to make the right decision for the given set of options available to them. If they need to select a component from one of two vendors, the wrong decision involves not enjoying the benefits of the better choice. Reliability of a component in your application is often considered when selecting a vendor. If only MTBF is used for the comparison, it certainly increases the chance of making the wrong decision. Comparing two data sheet MTBF values is little more than comparing two random numbers, and it provides no information about the nature of expected failures over time (increasing or decreasing). Having and comparing time to failure distributions is much more informative than MTBF, thus improving the chance of selecting the right vendor.
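To illustrate how much an MTBF comparison can hide, here is a minimal sketch of two hypothetical vendors whose parts have exactly the same mean time to failure but different Weibull shape parameters. The vendor labels, the 200,000 hour value, and the beta values are assumptions for illustration.

```python
import math


def eta_for_mean(mean_hours, beta):
    """Weibull scale that gives the requested mean: mean = eta * Gamma(1 + 1/beta)."""
    return mean_hours / math.gamma(1.0 + 1.0 / beta)


def weibull_reliability(t, beta, eta):
    """Probability of surviving to time t."""
    return math.exp(-((t / eta) ** beta))


MTBF = 200_000.0  # identical data sheet value for both hypothetical vendors

vendors = {
    "Vendor A (beta = 0.5, decreasing hazard rate)": 0.5,
    "Vendor B (beta = 3.0, increasing hazard rate)": 3.0,
}

for name, beta in vendors.items():
    eta = eta_for_mean(MTBF, beta)
    print(name)
    for years in (1, 5):
        t = years * 8760  # hours, assuming continuous operation
        print(f"  {years} yr: R = {weibull_reliability(t, beta, eta):.4f}")
```

With these assumed parameters, roughly a quarter of Vendor A's parts fail in the first year while almost none of Vendor B's do, yet the data sheets would list the same MTBF.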
Comparison is not the only decision adversely impacted by MTBF. If the decision to start production and shipments of a product is based on MTBF, it is likely to shield, hide, or obscure reliability issues due to the use of parts count or reliability testing based on the assumption of a constant hazard rate. Ignoring the expected reliability performance information does not change the actual field performance, therefore get the best available information for your decision making.
Other decisions may include readiness of a new design concept, effectiveness of a design or process improvement, or the purchase decision of your customers. Illustrating how MTBF provides inadequate information, which leads to poor decisions more often, will help you and your team ask for better information and make informed reliability-based decisions.
There are more ways to mis-understand MTBF than there are MTxx acronyms. A common one is that MTBF represents a minimum failure free time. A quick story about the math and probability of failure over any given hour, then over a year, is generally enough to dispel this mis-understanding.
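The math behind that quick story is short. A minimal sketch, assuming a constant hazard rate and a hypothetical 50,000 hour MTBF:

```python
import math


def prob_failure_by(t_hours, mtbf_hours):
    """Probability of failing by t_hours with a constant hazard rate of 1/MTBF."""
    # -expm1(-x) equals 1 - exp(-x), and stays accurate for very small x.
    return -math.expm1(-t_hours / mtbf_hours)


mtbf = 50_000.0  # hypothetical data sheet value

p_hour = prob_failure_by(1.0, mtbf)      # any given hour
p_year = prob_failure_by(8760.0, mtbf)   # one year of continuous use
p_at_mtbf = prob_failure_by(mtbf, mtbf)  # by the time the MTBF is reached

print(f"chance of failure in one hour: {p_hour:.6f}")
print(f"chance of failure in one year: {p_year:.3f}")
print(f"chance of failure by {mtbf:.0f} hours: {p_at_mtbf:.3f}")
```

Under this assumption about 16% of units fail within the first year and about 63% fail before reaching the MTBF, so the value is nothing like a failure free period.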
If someone uses MTBF as a synonym for reliability, a quick story on the definitions of the two terms may be handy. Maybe a story of how you once confused the two terms to your detriment.
If someone is simply assuming a constant hazard rate (or says we’re in the ‘flat part of the curve’, thus MTBF is alright), tell a story about how making that assumption without checking whether it is true leads to very poor decisions.
In each case, using a story instead of a tutorial will help them actually hear your advice. They grasp the message and begin to avoid using MTBF.
The best stories that I hear about MTBF are the ones from you about moving others off using MTBF. About company policies that basically avoid using MTBF. About educational or reminder pieces that shape an organization’s culture toward avoiding the use of MTBF.
The best stories are of successful improvements to reliability programs by moving the reliability discussion away from the use of MTBF.
What is your favorite story? What is your success story? Add your story to the comments section below or send it to me directly at fms@nomtbf.com
]]>The post 3 Recent Questions and Comments Concerning MTBF appeared first on No MTBF.
]]>Over the past couple of days, like most days, I have received questions and comments concerning MTBF. I do try to respond to all questions and acknowledge the comments.
Glad to help in any way I can, so please feel free to send me your questions. I certainly do appreciate the supporting comments, or any comments for that matter.
Let’s take a look at a few such discussions that occurred over the past two days.
The question:
Hi
I have a basic question and request you to clarify. Which method should i follow ( MIL 217 F or Telcordia) for MTBF and Failure rate calculations for Automotive application electronics part.
Thanks in advance !!
Regards,
My response:
Neither – 217 is little more than a random number generator and is over 20 years old, based on data and technology that is even older.
Telcordia is likewise little more than a random number generator concerning the expected future reliability performance, and based on telecommunication system failure data. Not suitable for automotive applications at all.
If you are simply comparing design options and not attempting to estimate future reliability field performance then use your own field data supplemented with vendor or internal testing data.
If trying to estimate field performance, parts count prediction is not the right tool.
Cheers, Fred
So, was that an appropriate response or was I a bit too harsh?
Hello Fred, How are you and hope this email meets you well. Please I have a question concerning MTBF. Our international reliability department just sent us an updated memo for calculating MTBF for both operational units and standby units. I do not agree with their calculation because they divide the number of hours by the total number of failures. Please what do you think? Thanks for your help.
Kind regards,
HI … ,
Again, was that the appropriate response? Did I miss something?
Hi, I came across your “No MTBF” website on the internet, and would like to connect to your group, to learn more from experts and professionals in the field of Reliability Engineering. Thank You & Best Regards,
Thanks Fred! Thank you for all the great information you always give us!
]]>The post Exposing a Reliability Conflict of Interest appeared first on No MTBF.
]]>Kirk Gray wrote the article titled Exposing a Reliability Conflict of Interest on Accendo Reliability. He talked about a recent article discussing the maintenance costs for the F-35 fighter jet program and how the companies designing the system make a significant profit selling spare parts or maintenance services.
If you count on the profit from the system you design failing, you have an inherent conflict of interest concerning creating a reliable system. If you create a reliable product you lose money.
The problem Kirk described is not just with military projects. The reliability conflict of interest arises anytime the team designing an item profits from spares or maintenance services.
With extended warranty contracts, the better the reliability performance of a product the more profitable the extended warranty contracts become. In contrast if your organization generates profit when an item fails by providing spares or maintenance services, there is little incentive to improve reliability.
For military projects there are typically cost plus type arrangements, which effectively cap the profit the organization can generate performing design and development services. On the other hand, once the system is fielded, if the same organization is the sole provider of spares or maintenance services, they really can and do charge to create significant profit.
For example, if you have a product idea, one common approach is to engage a firm that can design and build your product. If they are also the organization that provides spares, repairs, and warranty management, they create a profit from those actions only when the product they designed is unreliable.
Another example is a repair service. If they make money when doing a repair regardless of whether it actually works, there is little incentive to actually fix the problem. The more times they see the same item, the more often they invoice for ‘repair’ services.
Check the contracts and systems you have concerning your products. These reliability conflicts of interest may be hampering your ability to create reliable products for your customers.
In the discussions and comments around Kirk’s article on Accendo Reliability and LinkedIn, one idea from Andy caught my attention. Andy suggests:
Contract should be for the useful operational hours of the equipment, purchase them on a lease basis with the supplier responsible for the OEE, that would concentrate their minds! — Andy Gailey
If you want a reliable product, contract for a reliable product. There are real costs to unreliability that far outweigh the costs associated with repairs or servicing. Building in suitable terms to balance the incentives toward creating a durable solution is possible.
When a design team has a bonus based on time to market or a procurement team has a bonus based on cost reductions, the emphasis on reliability generally suffers. Including a factor based on field reliability performance within the bonuses changes the focus to include the impact on reliability performance.
For state of the art military aircraft the functional performance is critical. Yet, if it will take 3 aircraft to ensure mission success because two are in the shop for repairs, the functionality focus is out of balance with the aircraft’s mission.
Creating reliable systems is economically sound for your customers and your business. Identify and eliminate reliability conflicts of interest (or at least balance the incentives) in service to the warfighter and your customers.
What has worked for you to expose and reduce reliability conflicts of interest? Use the comments section below and share your success stories so others can make progress as well.
]]>The post How We Think About Reliability appeared first on No MTBF.
]]>Getting on an airplane we think about the very low probability of failure during the flight duration. This is how we think about reliability.
When buying a car we think about whether the vehicle will leave us stranded along a deserted stretch of highway. When buying light bulbs for the hard to reach fixtures we consider paying a bit more to avoid having to drag out the ladder as often.
When we consider reliability as a customer does, we think about the possibility of failure over some duration.
And, we really don’t like it when something fails sooner than expected (or upon installation).
When someone advertises a low failure rate (or high reliability), we ask for how long. When we see an item has a 2,000 hour lifetime, it implies it also has a low failure rate (high reliability) over 2,000 hours of use.
When both the probability and duration are not explicit we fill in the blanks with our expectations.
If we, as consumers, think about reliability as a couplet of probability and duration, then shouldn’t we as designers, manufacturers, or producers do the same?
We should provide internal specifications that include both probability and duration. BTW: MTBF, as an inverse failure rate, is really a form of probability and has nothing to do with duration.
We should provide external reliability claims that are specific and include both probabilities and durations. Of course, if the function and environment are not clear, we should include that as well.
If we think about reliability in terms of the chance of failure over an expected duration, let’s make the conversation clear on those points. MTBF is not up to the task.
We have the ability to ask for reliability information that is useful, meaningful, and aligned with the way we think about reliability. Ask for the probability of failure and the corresponding duration. Ask for various probabilities at different durations. Do not accept MTBF as it is insufficient when compared to how we think about reliability.
If someone offers MTBF as reliability, ask over what duration it is valid. (Also ask why they would want to use MTBF, yet that is a topic of another post.)
If someone offers an item will last for 2 years, or that the warranty is 2 years… that doesn’t mean there is a low probability of failure over two years, it’s just a duration. Ask for the probability element to get a better understanding of the offered reliability. Warranty durations just mean they will (or are supposed to) fix or replace an item, not how often they expect to have to do so.
If I had enough trust and understanding that an item would last with a very low chance of failure for 2 years, I wouldn’t need the warranty. If unsure, I’d expect a full warranty and it should cover the entire period of time I expect to use the item.
As a producer provide better information concerning reliability. Warranty terms do not describe reliability. A duration or vague claims of ‘high reliability’ are likewise insufficient.
Provide both the duration and associated probability of successful operation over the entire duration. Better is to provide various couplets of probability and duration. Even better is provide the expected life distribution including the environment and use profile parameters.
If someone asks you for reliability information, they are thinking of probability and duration. Give an answer that provides probability and duration. It is the way we think about reliability.
If someone is asking for availability or maintainability, that is not a probability of successful operation over some duration. Yet, providing just MTBF doesn’t work either. Provide suitable information so your customers can compare your product’s expected reliability performance with what they think it should be. Help your customers make informed decisions concerning reliability.
Extend this improved communication around reliability to your vendors – ask for and provide complete information concerning reliability. Again, MTBF is not sufficient, nor useful. If you want to improve the reliability performance of your product you need to get reliable parts from your vendors. This does not just mean higher MTBF values. It means a lower probability of failure over the duration of desired operation.
You can improve the communication around reliability by insisting on using probability and duration every time you talk about reliability. Set the standard. Set an example.
Doing so, you will find others embrace reliability discussions as it makes sense. If we discuss reliability as we already are thinking about it, good things happen.
What’s your observation here? How do you find yourself thinking about reliability? Add your comments below.
]]>The post MTBF Use May Reduce Product Reliability appeared first on No MTBF.
]]>MTBF is not reliability. Attaining a specific MTBF does not mean your product is reliable. MTBF use may be the culprit.
Therefore, working to achieve an MTBF value may actually be preventing you from creating a product that meets your customer’s reliability performance expectations.
Actively working to achieve MTBF using the common tools around MTBF may be taking you and your team down the wrong rabbit hole. You may be working to reduce the reliability of your products rather than improving them.
Let’s take a look at a couple of ways the pursuit of MTBF is harmful to your product’s reliability potential and contrary to your customer’s expectations.
Some customers may request 50,000 hours MTBF when they really want a very low probability of failure over 50,000 hours of product use. They meant a duration of 50k hours, not a chance of failing every hour of 1:50k. They didn’t know how to ask for what they wanted. Any time someone asks for MTBF, you should ask what they really want.
What if a customer really wants high availability over a short duration, or they want to reduce repair times, or they cannot tolerate any failures over a 12 hour mission duration? If they only ask for some MTBF value, is that sufficient for you to create a product that will meet their needs? Probably not.
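To put a number on that gap, here is a small sketch of the MTBF a constant hazard rate product would need in order to deliver a low probability of failure over 50,000 hours. The 98% target is a hypothetical stand-in for ‘very low probability of failure’.

```python
import math


def required_mtbf(duration_hours, target_reliability):
    """MTBF needed so that exp(-t / MTBF) equals the target, constant hazard assumed."""
    return -duration_hours / math.log(target_reliability)


def exponential_reliability(t_hours, mtbf_hours):
    """Probability of surviving t_hours with a constant hazard rate of 1/MTBF."""
    return math.exp(-t_hours / mtbf_hours)


duration = 50_000.0
target = 0.98  # hypothetical: what the customer may have meant

print(f"R at 50,000 h with a 50,000 h MTBF: "
      f"{exponential_reliability(duration, 50_000.0):.3f}")
print(f"MTBF needed for {target:.0%} at 50,000 h: "
      f"{required_mtbf(duration, target):,.0f} hours")
print(f"R over a 12 h mission with a 50,000 h MTBF: "
      f"{exponential_reliability(12.0, 50_000.0):.5f}")
```

Under these assumptions, meeting the customer's likely intent takes an MTBF of roughly 2.5 million hours, about fifty times the number they asked for.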
When we assume a constant hazard rate, which is common when using MTBF, we can use the memoryless feature of the exponential distribution. Therefore we can test 50 units for 1,000 hours each and count the number of failures that occur… if only one or none, we have ‘proved’ the 50,000 hour MTBF. All good.
The test may catch early life failures, complicating the test analysis and results, yet it certainly would not reflect the product’s actual reliability performance over a duration of 50k hours. We actually learn very little about how the product performs after 1,000 hours.
The sad part is products with known wear out failure mechanisms are tested using these methods, thus avoiding the messy business of wear out failures clouding reliability testing results.
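A short sketch shows how a product with a wear out mechanism can sail through such a test. It assumes a hypothetical Weibull life distribution (beta of 3, eta of 20,000 hours) and the 50 units for 1,000 hours plan described above.

```python
import math


def weibull_reliability(t, beta, eta):
    """Probability of surviving to time t."""
    return math.exp(-((t / eta) ** beta))


# Hypothetical wear out behavior, chosen only for illustration.
beta, eta = 3.0, 20_000.0

units, test_hours = 50, 1_000.0

# Chance one unit survives the test, then chance all 50 do (independent units).
p_unit_survives = weibull_reliability(test_hours, beta, eta)
p_zero_failures = p_unit_survives ** units

# What customers actually experience over 50,000 hours of use.
r_50k = weibull_reliability(50_000.0, beta, eta)

print(f"chance of zero failures in the test: {p_zero_failures:.3f}")
print(f"reliability at 50,000 hours: {r_50k:.2e}")
```

With these assumed parameters the test almost always ‘proves’ a 50,000 hour MTBF, while essentially no units survive 50,000 hours of use.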
If only customers would just use the product during the ‘useful life’ portion of the bathtub curve. It is drawn as a low failure rate over an extended duration, with a few early life failures for a short duration, plus eventually something wears out long after the product has been retired.
Customers do not control the ‘useful life’. The product design does, with a dash of manufacturing, too. Design and build a reliable product, and it may have a low failure rate over the duration a customer may want to use the item.
If we assume away the early life and wear out portions in order to focus on the useful life, we have a couple of problems:
First, we’re delusional in thinking there is a flat part of the curve that we can assume will naturally occur.
Second, our assumptions do not change what actually causes the product to fail. We still have design and manufacturing issues that cause failures. Some occur early, some later, rarely at a low and random rate.
Third, customers do not care about your assumptions, it is the actual performance that matters.
One way out of this nest of problems is to avoid using MTBF. Instead use reliability, or the probability of successful operation, over a specified duration. Include the details on the environment, use profile, and what we consider a failure, and you are making progress.
Using MTBF makes everything easier. From apportionment, test planning, and design we can simply assume away many problems that will cause the product to fail. The problem is products that are not reliable fail.
Does your team use MTBF (or MTTF) and do you regularly have ‘surprise’ field failures? If you use reliability directly could you have avoided the issue? I suspect so. What is your story?
]]>