Predicting Failure vs. Reacting to Failure
One of the Twitter notes I sent out a few weeks ago read, in part, “Celebrate failures.” A comment came back that it was a wonderful approach that she had not thought of before. Failure will occur, and when it does, it is our chance to learn.
And we need to learn. As reliability professionals, we continue to learn throughout our careers. New materials fail in novel manners. New assemblies fail in an assortment of ways. New designs fail due to unknown sources of variation. We will see failures. So rather than simply focusing on the next try and hoping to find success, let’s learn from each failure as we move toward success.
Do you work in a fire department?
It’s an expression that I use, as have others, to describe an organization that quickly responds to failures. A customer calls with a problem, and the team jumps into action to solve the issue. A phone call on Friday afternoon brings a chance to work all weekend. The line is down, all hands on deck.
The better fire departments actually do a good job responding and solving problems. They may even work to prevent other similar problems from occurring. They are not very good at determining where the next failure will occur, so they remain diligent and ready to respond.
Heroes are born in a fire department organization. The one who saves the big customer account gets a prime parking spot. The engineer who pulls an all-nighter to get the line running gets noticed for promotion. The message is: get good at solving crises. The problems you solve as part of your day job don’t really count. The spoils go to the solution found at the last minute, under duress, and often after hours.
What have you done of value lately?
In some fire-department-like organizations, unless you personally abated a major crisis, you’re not noticed. Let’s say you do your work well. You craft durable products, work with teams to create reliable solutions, and meet your cost, time-to-market, performance, and quality targets.
Not one bit of recognition or notice. You did your job. You did it well, even, yet that is expected, isn’t it?
Let’s say the same brilliant folks who stamp out raging rates of failure find the time to actually design a product that doesn’t fail unexpectedly. Let’s say your organization does a full root cause analysis, including where the set of life cycle systems failed to avert the chain of events leading to the field escalation.
Once we understand that failures will occur, we might take steps to anticipate and resolve those failures before a customer has the luxury of a failure experience. During the design and development stages, we start to balance the final design decisions based on an acceptable risk of failure and a minimum risk of unknown failures. We begin to build certainty in knowing what will fail, when it will fail, and how often, and we work to create a low enough failure rate.
The idea is to predict, anticipate, forecast, estimate, and celebrate failures. Running a test that has no failures is a lost opportunity to learn something that allows us to improve the design. Finding a design flaw early permits a routine amount of work to fix it. (If all-night and all-weekend pushes to meet shipping deadlines are your normal, you need to see what’s possible.)
Predicting Failures
Everything will fail. It’s a matter of when and how. We regularly think about limitations, constraints, and failure modes. Now I’m asking you to consider the failure mechanisms. What chemistry or physics or elemental shift in the design would lead to failure?
It’s not someone else’s job to add reliability to a system. It’s the job of anyone making decisions about part or vendor selection, anyone sizing a bolt, anyone thinking through or performing maintenance. It pretty much is anyone who touches the product in some way. It’s your job.
What could fail? Have you discovered new failure mechanisms today? Let’s reward those that discover the first failure, the most failures, or the biggest cost avoidance failure. Let’s give the prime parking spot to someone that finds and fixes a critical flaw before it’s an emergency.
Physics of failure, HALT, and risk analysis are just a few of the tools available to predict what will fail. Coupled with someone willing to consider and prove what will fail, and when (we call these folks reliability engineers), the team can shift out of firefighting and into fire prevention. Then we can celebrate the lack of failures seen by customers, and rarely be surprised by what does fail.
Give it a try. Step off alert status and think about what could fail. Sort out how you can find out before starting the line or shipping. Put in the hours now to carefully find what will fail, so you and your team can work to avert those many pending Friday afternoon phone calls from another irate customer.
Great summary of an often-seen situation. I also get the impression that some even think they can earn money with an unreliable product, simply because the customer needs a replacement!
Thanks for the comment, Nol. And yes, sometimes the business model creates more profit with repairs than with the original sale. Cheers, Fred
DFMEA (Design Failure Mode and Effects Analysis) with action plan testing is a must before releasing any new product to market.
FMEA is a good tool. It has reactive elements, given what we already know, plus proactive ones through the elements we discover. Cheers, Fred
Fred
Good insight on failures. My focus is an operating environment, discrete or continuous, where we should predict the remaining useful life of the equipment, and this science is still in its infancy. But I am sure we will catch up in the future. What are your thoughts regarding RUL in an operating environment?
Hi Mohan,
Thanks for the kind words and comment. I’m not sure what you mean by ‘RUL’. I’m glad to comment if I know the meaning of the abbreviation.
You are correct that we are just getting started with the science of reliability engineering. Being able to monitor and measure failure mechanisms as they are in progress is relatively new. The entire field of prognostic health management has really accelerated given the improvements in capability and cost of a wide range of sensors and communication networks.
More to come along, I’m sure.
Cheers,
Fred
Hi Fred
In the world of prognostics, RUL (Remaining Useful Life) is defined as the time to imminent failure rather than the time to actual failure. The difference is that imminent failure is the time when the health indicator exceeds a set threshold, while actual failure is the time when the system failure occurs.
Ok, got it. There are two elements and I’ve seen this working in a few operating environments. First, track a measure that shows the degradation toward a defined failure. Second, model that degradation.
It is not a new concept, as there are models and information going back many decades. It’s our ability to make the measurements that has been changing. Also, keep in mind that not all failure mechanisms follow a defined and smooth path to failure. Oil breaks down in a smooth fashion, yet a flange experiencing a load above its shear strength is not predictable.
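As a rough illustration of those two elements (track a measure, then model its degradation), here is a minimal Python sketch. It assumes a health indicator that rises in a roughly straight line toward a failure threshold. The function names, the straight-line model, and the sample readings are all hypothetical, chosen only for illustration. A real mechanism may need a different model, and an abrupt mechanism like the overloaded flange will not fit this approach at all.

```python
from typing import Optional, Sequence, Tuple


def fit_line(times: Sequence[float], values: Sequence[float]) -> Tuple[float, float]:
    """Ordinary least-squares fit of value = slope * time + intercept."""
    n = len(times)
    if n < 2 or n != len(values):
        raise ValueError("need at least two paired readings")
    mean_t = sum(times) / n
    mean_v = sum(values) / n
    sxx = sum((t - mean_t) ** 2 for t in times)
    if sxx == 0:
        raise ValueError("readings must span more than one time point")
    sxy = sum((t - mean_t) * (v - mean_v) for t, v in zip(times, values))
    slope = sxy / sxx
    intercept = mean_v - slope * mean_t
    return slope, intercept


def remaining_useful_life(
    times: Sequence[float],
    values: Sequence[float],
    threshold: float,
) -> Optional[float]:
    """Estimate time from the last reading until the fitted trend reaches
    the threshold (the "imminent failure" point).

    Assumes the indicator increases as the item degrades. Returns None
    when the fitted trend is flat or falling, since a straight line then
    never reaches the threshold.
    """
    slope, intercept = fit_line(times, values)
    if slope <= 0:
        return None
    crossing_time = (threshold - intercept) / slope
    # If the trend has already crossed the threshold, report zero life left.
    return max(0.0, crossing_time - times[-1])


if __name__ == "__main__":
    # Hypothetical readings: operating hours and a wear indicator.
    hours = [0.0, 100.0, 200.0, 300.0]
    wear = [1.0, 2.0, 3.0, 4.0]
    print(remaining_useful_life(hours, wear, threshold=6.0))
```

The estimate is only as good as the chosen measure and model, so it pays to confirm the fit against units that have actually failed before trusting the number.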
I generally start with an understanding of the failure mechanisms, then look for what I can measure to track the progress of the mechanisms.
Not always possible, yet often we can find something that is useful.
Cheers,
Fred