When Do Failures Count?
One technique for calculating a product’s MTBF is to count the number of failures and divide that count into the total operating time.
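For illustration only, here is that arithmetic as a minimal sketch (the fleet size, hours, and failure count are invented):

```python
def mtbf(total_operating_hours: float, failure_count: int) -> float:
    """Classic point estimate: total operating time divided by the number of counted failures."""
    if failure_count == 0:
        raise ValueError("No failures counted; this simple MTBF estimate is undefined.")
    return total_operating_hours / failure_count

# Hypothetical fleet: 50 units ran 2,000 hours each, with 8 counted failures.
print(mtbf(50 * 2_000, 8))  # 12500.0 hours
```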
You already know, kind reader, that using MTBF has its own perils, yet it is done. We do not have to look very far to see someone estimating or calculating MTBF as if it were a useful representation of reliability… alas, I digress.
Counting failures would appear to be an easy task. It apparently is not.
What is a Failure that Should be Counted?
This may not seem like a fair question.
Keep in mind that not all failures have the same consequences. Some cause serious problems for all involved, while other failures may never be noticed.
A product has many levels of specification and requirements. There may be layers of tolerances. Not every product is identical.
Not every failure is of interest. Not every failure is of interest to each customer. A failure that causes a product return or complaint by one person may not cause any response from others. Is a paint blemish a failure? Do you count it as a failure?
You might. Or, might not.
For every product and every situation in which you assess failures, take the time to clearly define what constitutes a failure. Draw a sharp dividing line between what you call a failure and what you do not. I would err on the side of calling more things failures, so that anything a customer would consider a failure is counted as one.
When Do We Start Counting Failures?
We’re talking reliability here, so do your out-of-box or first-start-up failures count as quality failures or as reliability failures?
If the product fails in the factory, is that a reliability failure? Often we track yield and do not count these as reliability issues, yet we do know there is a link.
If an early prototype fails, which is actually quite common, is that a reliability failure? It’s too easy to dismiss these failures as just part of the development process.
When counting failures, do you include prototypes, manufacturing, early life, and beyond? Or just some subset? When not counting all failures, is there a clear and well-understood reason and definition of what is and is not countable?
I suggest counting all failures right from the first prototype. Track all failures. Monitor, measure, analyze, and then select where to make improvements. Dismissing or avoiding some failures limits your ability to understand where to focus your reliability improvements.
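As one way to act on that, here is a minimal sketch (the failure log below is invented) of tallying every recorded failure by cause, so the improvement effort goes where the counts are highest:

```python
from collections import Counter

# Hypothetical failure log spanning prototypes through field returns (illustrative only).
failure_log = [
    "connector crack", "firmware hang", "connector crack", "paint blemish",
    "firmware hang", "connector crack", "power supply", "firmware hang",
]

# A simple Pareto-style tally suggests where an improvement buys the most.
for cause, count in Counter(failure_log).most_common():
    print(f"{cause:15s} {count}")
```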
Which Failures Do We Include in the Count?
I recently heard a client explain that they were not going to track (count) software failures. Instead, they were going to focus on hardware failures only. This is troubling.
By not tracking all failures, you skew the information you can glean from your failure tracking. Avoiding one class of failure limits your ability to determine if you are actually focusing your efforts on solving the right problems.
If you do not count a failure reported by a customer because you are unable to replicate the issue in house, you limit your visibility into what’s happening from your customer’s point of view.
If you do not count a failure because you’ve seen it before, you limit your ability to prioritize based on the relative frequency of failures.
If you do not count a failure because you are unable (or unwilling) to determine the root cause, I suspect you are not willing to learn from product failures.
My advice is to count all failures.
Better still is to track the time to failure for every failure, plus conduct a detailed root cause analysis of each one.
When do failures count? Always. Every failure counts, as each failure contains information that permits you to make reliability improvements for your customers. If you do not count some failures, you will always overestimate your MTBF.
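To see that overestimate concretely, here is a made-up example of the same fleet counted two ways: everything, versus only the “convenient” failures.

```python
# Hypothetical data for one fleet over a review period (values are illustrative).
total_operating_hours = 100_000

failures = [
    {"id": 1, "type": "hardware", "replicated_in_house": True},
    {"id": 2, "type": "software", "replicated_in_house": True},
    {"id": 3, "type": "software", "replicated_in_house": False},
    {"id": 4, "type": "hardware", "replicated_in_house": False},
    {"id": 5, "type": "hardware", "replicated_in_house": True},
]

# Count every failure.
mtbf_all = total_operating_hours / len(failures)              # 20,000 hours

# Count only hardware failures replicated in house, a common (and misleading) filter.
kept = [f for f in failures if f["type"] == "hardware" and f["replicated_in_house"]]
mtbf_filtered = total_operating_hours / len(kept)             # 50,000 hours

print(f"All failures counted: {mtbf_all:,.0f} hours")
print(f"Filtered count only:  {mtbf_filtered:,.0f} hours")
```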
What’s your take on what counts? Leave a comment and let me know how you justify limiting what is considered a ‘countable’ failure.
I work primarily in aerospace and defense, so every failure counts. Development failures, such as those from HALT and qualification testing, usually point to specification and design errors. Production HASS test failures reveal build quality and some design errors. Field failures often point to specification and operator-induced failures, but frequently also to some design weakness that escaped the previous torture testing. If the customer wants me to track the field MTBF, I will require them to supply documented operating time, storage time, operating conditions such as temperature, temperature cycling, humidity, shock, and vibration, plus fault symptoms, error codes, and other information with each returned unit. They usually back down.
Great comment Dave – I like the listing of typical suspects to examine when a failure occurs. cheers, Fred
I almost forgot: they also need to supply the vehicle chassis/tail number, what other equipment was replaced, what O-Level, I-Level, and D-Level testing and fault verification was performed, whether the replacement cleared the fault, and the maintenance procedures. There’s probably more, but those are the big ones. Usually they’re clueless.