By its cover, no doubt. The title and cover are important, this is true. When we judge a reliability book, we often first see and evaluate the cover.
The author? Do you buy the book based on who wrote or edited it?
Do you have a quick scan or check for key features before you add the book to your library? I’m curious how you select a book to use as a reference for your work. The books we read and use for work shape our work, thus it’s important to have the right works at our disposal. Continue reading How to Judge a Reliability Book→
Here’s a simple illustration of how MTBF oversimplifies data, concealing essential information.
By convention, we tend to use MTBF for repairable data. That is fine.
You may also be aware of my dislike for the use of MTBF, for many different reasons. If you find yourself asking your organization, customer, industry or whomever to stop using MTBF, you may want to use this simple example to illustrate the ‘value’ of MTBF. Continue reading Illuminating MTBF’s Lack of Information→
I was preparing to write this article and wondered how many search hits would appear for MTBF. So I opened Google and did an MTBF search. It is a common, if misunderstood, acronym.
Beyond the 5,200,000 Google search results, it was the first page results that got me thinking. Keep in mind that Google often serves up a combination of what it thinks you are seeking and which sites have been useful for others.
It is deceptively easy to calculate MTBF given a count of failures and an estimate of operating hours. Just tally up the total hours the various systems operate and divide by the number of failures. Easy.
This simple calculation is the unbiased estimator of theta (MTBF), the inverse of the parameter lambda of the exponential distribution. We use theta to represent 1 / lambda.
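As a minimal sketch of that arithmetic, the snippet below divides total operating hours by the failure count. The function name, the fleet hours and the failure count are all made up for illustration; they are not data from any real product.

```python
def mtbf_point_estimate(total_operating_hours: float, failure_count: int) -> float:
    """Return total operating hours divided by the number of failures."""
    if failure_count <= 0:
        raise ValueError("this estimate needs at least one counted failure")
    return total_operating_hours / failure_count


# Hypothetical fleet: operating hours logged per system, and failures observed.
hours_per_system = [1200.0, 950.0, 1430.0, 800.0]
failures = 3

theta_hat = mtbf_point_estimate(sum(hours_per_system), failures)  # MTBF estimate
lambda_hat = 1.0 / theta_hat  # failure rate under the exponential assumption

print(f"MTBF (theta): {theta_hat:.1f} hours")
print(f"Failure rate (lambda): {lambda_hat:.6f} per hour")
```

Note that the result depends entirely on the two inputs, which is where the trouble starts.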
What could go wrong with such a simple calculation?
What is a failure?
Let’s start with what we count or do not count as a failure. This directly changes the resulting MTBF value. If we only count confirmed hardware failures, and do not count intermittent or unreproducible or software failures, are we undercounting what the customer experiences as a failure?
Over what duration do we count the failures? Should we focus only on the first month of operation, the first year, the warranty or service contract period or the entire operating life of the system? How do you calculate MTBF?
Some organizations only count failures they expect to occur. The unexpected ones are ‘special’ causes and require further study before they officially count as failures.
Another organization only counted failures that completely shut down the system. A partial loss of functionality, a degradation of capability or the failure of a redundant element did not count as a system failure.
In my opinion, if the customer calls it a failure, it’s a failure. If a failure, by any definition, costs your organization time and money to address, acknowledge, resolve or repair, it’s a failure.
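To see how much the definition matters, here is a small sketch that applies three counting rules to the same set of field events and the same operating hours. The event labels, the hours and the rule names are hypothetical, chosen only to mirror the situations described above.

```python
# Hypothetical field records: one entry per event a customer reported.
events = [
    "hardware", "software", "intermittent", "hardware",
    "partial_loss", "no_fault_found", "hardware",
]
total_hours = 21000.0  # assumed fleet operating hours over the same period

# Each rule lists the event types that an organization agrees to count.
definitions = {
    "confirmed hardware only": {"hardware"},
    "hardware and software": {"hardware", "software"},
    "anything the customer reported": set(events),
}


def mtbf_by_definition(events, total_hours, counted_types):
    """Divide hours by the number of events that the rule counts as failures."""
    counted = sum(1 for event in events if event in counted_types)
    return total_hours / counted


results = {
    name: mtbf_by_definition(events, total_hours, counted_types)
    for name, counted_types in definitions.items()
}

for name, value in results.items():
    print(f"{name}: MTBF = {value:.0f} hours")
```

Same systems, same hours, same customer experience, yet the reported value changes by more than a factor of two depending on who decides what counts.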
What is operating time?
This one is tricky. If the system includes the appropriate sensors and tracking mechanisms (hour meter) and a way to gather the operating time of units both failed and still operating, then we have a pretty good way to track total operating hours. Some situations and systems make this easy.
Most do not.
Let’s say we ship 100 systems a month for 10 months. At the end of ten months the first shipments have accumulated 10 months of operating time. IF….
… They are all placed into service immediately
… They are all operated full time for the full 10 months
… They have each failure reported, including down time
In general, we do have to make a few assumptions to determine the operating time for shipped systems. We tend to be conservative and err on the side that would make the MTBF value a little smaller than if we had the full set of carefully tracked data. Or do we?
Some organizations count from the date/time of shipment, ignoring shipping and installation time.
Some organizations assume all systems are installed and operated 24/7.
Some organizations assume no news is good news and the systems with no information are still operating.
And a few organizations assume systems run indefinitely: unless notified that a system has been decommissioned, they assume it is still running full tilt, even if it is 20 years old, i.e. there is no retirement or replacement policy.
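The shipment example above can be sketched in a few lines to show how far the operating-hour total moves with the assumptions. The installation delay, the duty cycle, the 730 hours per month and the failure count are all assumptions picked for illustration, not values from any real fleet.

```python
HOURS_PER_MONTH = 730.0  # assumption: 8760 hours per year / 12


def fleet_operating_hours(units_per_month, months, install_delay_months=0, duty_cycle=1.0):
    """Estimate fleet hours at the end of `months`, shipping one batch per month.

    A batch shipped at the start of month 1 has `months` months of age at the end.
    """
    total_unit_months = 0
    for ship_month in range(1, months + 1):
        months_since_shipment = months - ship_month + 1
        months_in_service = max(months_since_shipment - install_delay_months, 0)
        total_unit_months += units_per_month * months_in_service
    return total_unit_months * HOURS_PER_MONTH * duty_cycle


# Assumption set A: in service immediately and running 24/7.
hours_immediate_full_time = fleet_operating_hours(100, 10)

# Assumption set B: one month lost to shipping and installation, 25% duty cycle.
hours_delayed_part_time = fleet_operating_hours(
    100, 10, install_delay_months=1, duty_cycle=0.25
)

failures = 20  # hypothetical count of reported failures

print(f"A: {hours_immediate_full_time:,.0f} h, MTBF = {hours_immediate_full_time / failures:,.1f} h")
print(f"B: {hours_delayed_part_time:,.0f} h, MTBF = {hours_delayed_part_time / failures:,.1f} h")
```

With identical failure counts, the optimistic assumptions yield an MTBF nearly five times larger than the more cautious set.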
How about when you calculate MTBF?
By convention, when there are no failures we assume that in the next instant there will be one failure. This avoids dividing by zero, which causes fits for calculators and spreadsheets and mathematicians.
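A minimal sketch of that convention, assuming a hypothetical helper name and made-up hours:

```python
def mtbf_with_zero_failure_convention(total_hours: float, failures: int) -> float:
    """Divide hours by failures, treating zero failures as one imminent failure."""
    effective_failures = failures if failures > 0 else 1
    return total_hours / effective_failures


# Hypothetical test results with the same accumulated hours.
no_failures = mtbf_with_zero_failure_convention(5000.0, 0)
one_failure = mtbf_with_zero_failure_convention(5000.0, 1)
two_failures = mtbf_with_zero_failure_convention(5000.0, 2)

print(no_failures, one_failure, two_failures)
```

Under this convention a run with no failures and a run with one failure report the same value, so the number alone cannot tell you which one happened.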
Another issue is how often the calculations are made. Do we gather data hourly, daily, weekly, monthly, annually? Some use a rolling set of data, for example only units shipped in the last year count for both operating time and failures. This result will ignore or discount the longer term wear out failures, as the bulk of the units are young.
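Here is a rough sketch of the rolling-window effect. The unit records are invented so that the older units carry most of the failures, as a wear out mechanism might produce; the 12 month cutoff stands in for "shipped in the last year".

```python
# Hypothetical unit records: (months since shipment, operating hours, failures).
units = [
    (3, 2000.0, 0),
    (8, 5500.0, 1),
    (11, 7500.0, 0),
    (30, 20000.0, 3),
    (42, 28000.0, 4),
]


def mtbf(records):
    """Total hours over total failures for the given records."""
    hours = sum(unit_hours for _, unit_hours, _ in records)
    failures = sum(unit_failures for _, _, unit_failures in records)
    if failures == 0:
        raise ValueError("no failures in this set of records")
    return hours / failures


# Rolling window: keep only units shipped within the last 12 months.
rolling_window = [unit for unit in units if unit[0] <= 12]

mtbf_all_units = mtbf(units)
mtbf_rolling = mtbf(rolling_window)

print(f"All units: {mtbf_all_units:.0f} hours")
print(f"Shipped in last 12 months: {mtbf_rolling:.0f} hours")
```

In this made-up data the rolling window reports almost double the MTBF of the full fleet, simply because the older, failing units dropped out of the calculation.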
Some organizations do the calculations weekly in order to detect trends. If there are trends you probably should not be using MTBF…. If it’s changing, if there are early life or wear out failure mechanisms, you should not be using MTBF.
Even though you can calculate MTBF easily, working through all the complexities of getting it right still does not provide a useful metric. Instead, focus on getting better data, including time to failure information, so you can explore and report the data with other tools and methods. Treat the data appropriately and make better decisions.
Sure, better data will improve the ability to calculate the MTBF value. If you’d like to be like some organizations, that is fine.
How have you seen MTBF calculated poorly? Share your thoughts and stories in the comments below.
We talk about reliability because it matters. The ability to estimate reliability allows us to make design and development decisions. The ability to monitor reliability allows us to adjust the design, suppliers or expectations about a product. Continue reading Why do we talk about reliability?→
A few months ago at an IEC Dependability standards meeting, I met Thomas Young Olesen of Grundfos and we talked a little about NoMTBF. He said their company has a policy to not use MTBF. YES! So I asked for permission to post some information about the policy.
The reason so many use MTBF is because so many use MTBF. ‘Our data sheet has to include MTBF since all the other data sheets have MTBF’. This seems to be the primary reason MTBF is so common. It’s because it is so common.
Against this logic is the desire I have to use a measure of reliability that actually is understood. Using reliability (probability of success over a specified duration) as a measure seems somehow odd or novel. It is easy to understand and it doesn’t obscure the reliability. Continue reading MTBF Logic→
Normally, we life test a sample of products in order to make sure the products will last as long as expected. We assume that the sample we select will represent the total population of products that we eventually ship. It is not a perfect system, and there is some risk involved. Continue reading Sample size and MTBF→
Thanks to Kirk and the folks at TED for sharing another interesting presentation. Seth Godin: The Tribes we lead is a look at how we each can lead a movement to make real change in this world. Tribes and the organization of tribes have been around a long time; in recent years, though, those that controlled the mass media tried a different way to influence. Continue reading No MTBF Tribe→
A few weeks ago I was working on a report summary that included a reliability value at 5 years for a product. The document was intended for use with customers, so it was reviewed by Marketing.
Not too many changes, which was nice as I do not consider myself a writer and certainly not a marketeer. They did ask for one change that prompts this note. They wanted the 96% reliable at 5 years to be replaced with the (roughly) 2,100,000 hour MTBF value instead. Continue reading Marketing and MTBF→
This site is part of a long string of attempts to eradicate the improper use of MTBF. This week two people have sent me references to work previously done and Chris sent me another podcast also highlighting issues with MTBF. Jim McLinn wrote about the possible transition away from constant failure rate Continue reading The MTBF Battle Continues→
Chris records a podcast almost every day and many are enjoyable, fun, and provide something to think about as you go about your day. If you like the podcast above, check out her growing list of available podcasts.