All posts by Fred Schenkelberg

About Fred Schenkelberg

I am an experienced reliability engineering and management consultant with my firm FMS Reliability. My passion is working with teams to create cost-effective reliability programs that solve problems, create durable and reliable products, increase customer satisfaction, and reduce warranty costs.

The Technical Skills of a Good Reliability Engineer

The fundamental technical skills, as I see it, have to include statistics and root cause analysis skills. This skill set is one of three broad areas introduced in the article, What Makes the Best Reliability Engineer?

I would say these are the minimum technical skills for a good reliability engineer: being able to calculate sample size requirements, understand a dataset, and correctly determine the root causes of a failure.

There are other skills that would be great to include, such as electrical, mechanical and software engineering, plus materials science, physics, and chemistry. Yet, what separates a good reliability engineer from other types of engineers is our ability to plan and analyze life tests and to truly understand how and why failures occur.

Statistics

With respect to skill level, this is often considered the same as leaping tall buildings in a single bound.

Few enjoyed their undergraduate statistics class and recently fewer campuses require a stats course. Statistics is the language of variation and is essential for our understanding of the world our products experience.

If every product met the exact specifications of the design and only operated in one set of environmental and use conditions, we would have fewer field failures. If every failure mechanism led to failure exactly the same way within each and every product, we would have far fewer field failures.

Variability may lead to elements of a product being out of spec, or drifting/wearing to an out-of-spec condition, thus failing. Variability may also lead to changes in the stress/strength relationships, again increasing the number of failures over time.

The ability of a good reliability engineer to use available data and statistical techniques to:

  • Estimate sample size requirements for environmental testing
  • Analyze vendor life testing results
  • Summarize field failure and warranty datasets

is just the start of our expected statistical prowess. We also need statistical skills to:

  • Monitor and control processes
  • Design and analyze screening and optimization experiments (design of experiments)
  • Review and identify field failure trends and unique failure mechanisms

Your ability to use the right tool to quickly solve a problem may span statistical process control, hypothesis testing, regression analysis, and life data analysis, all before noon. That may well be a stopping-a-speeding-bullet level of skill.

You may need to master all these elements of statistics if you’re working as a lone reliability engineer, or rely on a trusted colleague if so fortunate. Either way you need to understand enough statistics to know when and how to apply this set of technical skills.
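As one small instance of the sample size question, here is a minimal sketch of the zero-failure (success-run) calculation based on the binomial distribution. It assumes pass/fail results, independent units, and a test that represents the conditions of interest. The function name and the example values are mine, for illustration only.

```python
import math


def success_run_sample_size(reliability: float, confidence: float) -> int:
    """Units that must all pass to demonstrate `reliability` at `confidence`.

    Assumes a zero-failure binomial test: R**n <= 1 - C,
    so n >= ln(1 - C) / ln(R).
    """
    if not (0.0 < reliability < 1.0 and 0.0 < confidence < 1.0):
        raise ValueError("reliability and confidence must be between 0 and 1")
    return math.ceil(math.log(1.0 - confidence) / math.log(reliability))


# Hypothetical goals: 90% and 95% reliability, each at 90% confidence.
for goal in (0.90, 0.95):
    n = success_run_sample_size(reliability=goal, confidence=0.90)
    print(f"R = {goal:.2f}: test {n} units with no failures")
```

Even modest goals call for more units than many test plans allow, which is the sort of thing worth knowing before the test starts.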

Root Cause Analysis

Failure mechanisms are hard science, even for human-factors-related failures. Failures occur because something happens at an atomic, molecular, code or interaction level that causes an error or fault to manifest.

Your technical skill includes understanding the range of possible errors and faults that may occur with your product and how to avoid, minimize or mitigate each one. It may not be possible to anticipate and fully understand every possible failure mechanism, thus we focus on the most likely and common, plus continue to learn about those new (or interesting) failure mechanisms that appear.

A second element to this set of skills is the ability to deduce the root cause of a failure. Given a failure, you should be able to conduct the root cause analysis to determine the underlying failure mechanism and initiating circumstance. This permits the team to take corrective action that actually works.

The skill set includes

  • Gathering evidence and understanding the relationships and contributing factors
  • Delving into the unseen elements (microscopes, cross sections, chemical analysis, etc.)
  • Replicating the failure at will

The root cause analysis skill may rely on tools like x-rays and thermal imaging tools, some operated by specialists, yet you need to know which tools to employ and how to interpret their results. It may be fun to explore failures in a well furnished failure analysis lab, yet you need to focus on solving the mystery of what caused the failure.

You also need to be well versed in how to proceed from the “crime scene” (or instance of failure location), through symptoms, to non-destructive and destructive testing. You need to build your “case” based on evidence and logic, plus a healthy dose of engineering knowledge of the fundamental elements involved.

If working as the lone reliability engineer, you certainly need to establish an ongoing relationship with a failure analysis lab. In other words, do not rely on your vendors; do the failure analysis work under your organization’s control with your own lab or contracted facility.

Get the information your team needs to solve problems or to avoid future problems by exercising your technical root cause analysis skills.

Good Reliability Work

To be good, I’m suggesting you have to have robust skills in statistics and root cause analysis. Do you agree? What else would you argue is essential to be a good reliability engineer?

Considering WIIFT When Reporting Reliability

WIIFT and Reliability Measures

WIIFT is “what’s in it for them”. It is similar to what’s in it for me, yet the focus is on the value you are providing your audience.

As a reliability engineer you collect, analyze, and report reliability measures. You report reliability estimates or results. Do you know how your audience is going to use this information?

Consider WIIFT when reporting reliability.

What makes the best Reliability Engineer?

Formal education (master’s or Ph.D.) or design/manufacturing engineering experience?

Where do you look when hiring a new reliability engineer? Do you head to the University of Maryland or another university reliability program to recruit the top talent? Or, do you promote/assign from within? Where do you find the best reliability people?

A World of Constant Failure Rates

What if all failures occurred truly randomly?

The math would be easier.

The exponential distribution would be the only time to failure distribution. We wouldn’t need Weibull or other complex multi-parameter models. Knowing the failure rate for an hour would be all we would need to know, over any time frame.
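To make that concrete, here is a minimal sketch of how a single hourly failure rate would give the chance of survival over any duration in this make-believe world, using the exponential reliability function R(t) = exp(-λt). The rate and durations are hypothetical values chosen only for illustration.

```python
import math


def reliability(failure_rate_per_hour: float, hours: float) -> float:
    """Chance of surviving `hours` given a constant failure rate."""
    return math.exp(-failure_rate_per_hour * hours)


HOURLY_RATE = 1e-5  # hypothetical failures per hour

for label, hours in [("1 day", 24), ("1 year", 8760), ("5 years", 5 * 8760)]:
    print(f"{label:>7}: R = {reliability(HOURLY_RATE, hours):.4f}")
```

One number and one formula cover every time frame, which is exactly why the assumption is so tempting.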

Sample size and test planning would be simpler. Just run the samples at hand long enough to accumulate enough hours to provide a reasonable estimate for the failure rate.
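A rough sketch of that test planning logic follows, assuming a constant failure rate so that only the total accumulated unit-hours matter (the sum of the hours each unit ran, until it failed or the test ended). The function names and numbers are hypothetical. The zero-failure bound is the standard one-sided exponential result.

```python
import math


def failure_rate_estimate(failures: int, total_unit_hours: float) -> float:
    """Point estimate of a constant failure rate (failures per hour)."""
    return failures / total_unit_hours


def zero_failure_upper_bound(total_unit_hours: float, confidence: float) -> float:
    """One-sided upper bound on the failure rate when no failures occurred."""
    return -math.log(1.0 - confidence) / total_unit_hours


# Hypothetical test: 10 units run 1,000 hours each. Under this assumption,
# 100 units run 100 hours each would give the same answers.
unit_hours = 10 * 1_000

print(failure_rate_estimate(failures=2, total_unit_hours=unit_hours))
print(zero_failure_upper_bound(unit_hours, confidence=0.90))
```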

Would the Design Process Change?

Yes, I suppose it would. The effects of early life and wear out would not exist. Once a product is placed into service the chance to fail the first hour would be the same as any hour of its operation. It would fail eventually and the chance of failing before a year would solely depend on the chance of failure per hour.

A higher failure rate would suggest it would have a lower chance of surviving very long. Yet it could still fail in the first hour of use, and if it had survived for one million hours, its chance to fail in the next hour would still be the same.
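This is the memoryless property of the exponential distribution. A small sketch, with a hypothetical failure rate, computes the chance of failing in the next hour given survival to a given age:

```python
import math


def chance_to_fail_next_hour(failure_rate_per_hour: float, age_hours: float) -> float:
    """P(fail within the next hour | survived to age_hours), constant rate."""
    survive_to_age = math.exp(-failure_rate_per_hour * age_hours)
    survive_one_more = math.exp(-failure_rate_per_hour * (age_hours + 1.0))
    return 1.0 - survive_one_more / survive_to_age


RATE = 1e-5  # hypothetical failures per hour

for age in (0.0, 1_000.0, 1_000_000.0):
    print(f"age {age:>11,.0f} h: {chance_to_fail_next_hour(RATE, age):.3e}")
```

All three ages should print essentially the same chance, apart from floating-point rounding.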

Would Warranty Make Sense?

Since by design we cannot create a product with a low initial failure rate we would only focus on the overall failure rate. Or the chance of failing over any hour, the first hour being convenient and easy to test, yet still meaningful. Any single failure in a customer’s hands could occur at any time and would not alone suggest the failure rate has changed.

Maybe a warranty would make sense based on customer satisfaction. We could estimate the number of failures over a time period and set aside funds for warranty expenses. I suppose it would place a burden on the design team to create products with a lower failure rate per hour. Maybe warranty would still make sense.
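A minimal sketch of that warranty estimate, assuming a constant failure rate, continuous operation, and counting only the first failure of each unit (no repeat claims on replacements). The shipment size, rate, and cost per claim are hypothetical.

```python
import math


def expected_warranty_failures(units_shipped: int,
                               failure_rate_per_hour: float,
                               warranty_hours: float) -> float:
    """Expected number of units that fail at least once during warranty."""
    chance_to_fail = 1.0 - math.exp(-failure_rate_per_hour * warranty_hours)
    return units_shipped * chance_to_fail


def warranty_reserve(expected_failures: float, cost_per_claim: float) -> float:
    """Funds to set aside, given an average cost per claim."""
    return expected_failures * cost_per_claim


# Hypothetical: 10,000 units, 2e-6 failures/hour, one-year warranty, $150 per claim.
failures = expected_warranty_failures(10_000, 2e-6, 8760)
print(f"expected failures: {failures:.1f}")
print(f"reserve: ${warranty_reserve(failures, 150.0):,.0f}")
```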

How About Maintenance?

If there are no wear out mechanisms (this is a make-believe world) changing the oil in your car would not make any economic sense. The existing oil has the same chance of engine seize failure as any new oil. The lubricant doesn’t break down. Seals do not leak. Metal on metal movement doesn’t cause damaging heat or abrasion.

You may have to replace a car tire due to a nail puncture, yet an accident due to worn tire tread would be no more likely than with new tires. We wouldn’t need to monitor tire tread or brake pad wear. Those wouldn’t occur.

If a motor is running now and we know the failure rate, we can calculate the chance of running for the rest of the shift, even when the motor is as old as the building.

The concepts of reliability centered maintenance or predictive maintenance or even preventative maintenance would not make sense. There would be no advantage to swapping a part for a new one, as the chance to fail would remain the same.
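To see why replacement buys nothing here, the sketch below compares a new part and an aged part over the same 8-hour shift. It uses the Weibull reliability function, where a shape of 1 is the constant failure rate case; the shape of 3 is included only as a contrast to show what wear out would look like. All parameter values are hypothetical.

```python
import math


def weibull_reliability(hours: float, shape: float, scale: float) -> float:
    """Chance of surviving `hours`; shape == 1 is the exponential case."""
    return math.exp(-((hours / scale) ** shape))


def chance_to_finish_shift(age_hours: float, shift_hours: float,
                           shape: float, scale: float) -> float:
    """P(survive the shift | already survived to age_hours)."""
    return (weibull_reliability(age_hours + shift_hours, shape, scale)
            / weibull_reliability(age_hours, shape, scale))


SCALE = 10_000.0  # hypothetical characteristic life, hours

for shape in (1.0, 3.0):
    new_part = chance_to_finish_shift(0.0, 8.0, shape, SCALE)
    old_part = chance_to_finish_shift(5_000.0, 8.0, shape, SCALE)
    print(f"shape {shape}: new {new_part:.6f}, aged {old_part:.6f}")
```

With a shape of 1 the new and aged parts have the same chance of finishing the shift. With a shape above 1 the aged part does worse, which is the only situation where swapping it would help.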

Physics of Failure and Prognostic Health Management – would they make sense?

Understanding failure mechanisms so we could reduce the chance of failure would remain important. Yet when the failures do not

  • Accumulate damage
  • Drift
  • Wear
  • Abrade
  • Diffuse
  • Degrade
  • Etc.

Then much of the predictive power of PoF and PHM would not be relevant. We wouldn’t need sensors to monitor conditions that lead to failure, as no specific failure would show a sign or indication of failure before it occurred. Nothing would indicate it was about to fail as that would imply its chance of failure has changed.

No more tune-ups or inspections; we would pursue repairs when a failure occurs, not before.

A world of random failures, or a world of failures each of which occurs at a constant rate, would be quite different than our world. So, why do we so often make this assumption?

What Does ‘Lifetime’ as a Metric Mean

We talk about lifetimes of plants and animals. Also, you may talk about the lifetime of a product or system.

I expect to have safe and trouble free use of my car over its lifetime. Once in a while I find a warranty that says it is guaranteed over my lifetime: for as long as I own the blender, for example.

Time to Update Our Standards

Not our personal or moral standards, rather the set of documents we rely upon as a foundation for reliability engineering tools and techniques.

We have a wide array of standards, from reporting reliability test data to calculating confidence intervals on field returns. We have standards that describe various environmental conditions and appropriate testing levels suitable to evaluate your product. We define terms, concepts, processes, and techniques.

A Missing Element

Despite the many documents and impressive titles of numbers and abbreviations or acronyms, most of the standards related to reliability engineering fail to include sufficient context and rationale concerning when and why to use or modify the standard. If a specific test is to determine the expected lifetime of solder joints, well, for which type of solder joints (shape, size, configuration, material, and process) is the standard appropriate, and when does it not apply? Make the boundaries of applicability clear.

No single test works for all situations.

For example, a wrist watch standard defining how to test for specific water resistance claims does not evaluate the effects of corrosion. The standard has the watch or similar device exposed to a set of water conditions, then evaluates whether the system is operating nearly immediately after the water exposure.

We know that water encourages corrosion, yet corrosion takes time to occur. Water alone on a circuit board is no big deal (much of the time); it’s when the water facilitates the creation of additional and unwanted current paths that there is a problem. Metal migration and rusting take time to occur.

If the standard for water resistance doesn’t evaluate corrosion, and it’s one of the ways your product fails, too bad. You can ‘pass’ the test, meet the standard, add it to your data sheet, and the customer will still experience a failure.

The same goes for many environmental testing, FMEA, life testing, field data analysis, and a range of other standards. They do not include the critical information necessary for appropriate application of the standard to your particular situation.

Connection to Value

Many, not all, standards provide a recipe to accomplish a task or evaluation. One of the values of the standard is that different teams may replicate the results of one team by repeating the steps outlined in the standard.

One of the issues with standards is they do not include how and why to actually accomplish the set of tasks and what to do with the results. In part, we need to clearly connect each task to its value: run, say, a test of a product across a range of temperature and humidity conditions only if it will provide meaningful information.

Don’t run the test if the information is not needed or is meaningless.

For example, suppose we expect that exposure to high temperature and humid conditions may increase the chance of product failure. We may want to know

  • how many failures will occur;
  • how the product will actually fail;
  • how the failure will initiate and progress;
  • when the failures occur under use conditions;

Or any number of other reasons to use the results of the testing. Often we run a standard test with very few samples, experience no failures, and erroneously conclude all is good. Then we are surprised that failures occur anyway when the product is in use.

The standard let us down.

The standard provided only a recipe or outline for a procedure, and not the guidance and rationale on how it may or may not help us and our team resolve very real questions. Testing 3 units that all pass does not mean your solar panel will survive hot and humid conditions for 20 years with no failures. It doesn’t.
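To put a number on how little three passing units say, here is a minimal sketch of the zero-failure binomial lower bound on reliability. It assumes independent units and pass/fail results, and it speaks only to the test conditions and duration actually run, not to 20 years in the field. The function name and confidence levels are mine for illustration.

```python
def demonstrated_reliability(units_passed: int, confidence: float) -> float:
    """Lower bound on reliability when all units pass: R = (1 - C) ** (1 / n)."""
    if units_passed < 1 or not (0.0 < confidence < 1.0):
        raise ValueError("need at least one unit and 0 < confidence < 1")
    return (1.0 - confidence) ** (1.0 / units_passed)


for confidence in (0.50, 0.90):
    bound = demonstrated_reliability(units_passed=3, confidence=confidence)
    print(f"3 of 3 pass, {confidence:.0%} confidence: R >= {bound:.2f}")
```

At 90% confidence, three passing units support a claim of only about 46% reliability under the test conditions.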

Run the test or work through a process only if it is tied to answering a question. Focus on business decisions and the questions we have to resolve in order to make better decisions (i.e., be wrong less often).

Summary

Let’s change the way we read and use standards. You may need to add the how and why, the boundaries, and the connection to value for your situation. It’s not always easy. The people writing the standard often have sufficient experience to include guidelines to help you. When possible, contact them and ask what their thinking was and what the limitations are.

If enough of us avoid simply meeting the requirements of the standard, we will

  • Enjoy reliable product performance
  • Create value for our organization with each test or task
  • And, eventually change how standards are written

Finding the Hidden Field Data in Your Organization

Hiding From Your Field Data Reality?

One of the major dilemmas of reliability engineering is one we really need to solve. Too many times we are trapped by our organization’s competing priorities and working with inadequate information.

We generally understand that field failure data provides the best possible representation of our product’s reliability performance. It’s data from our population of products with our customers while they apply all the stresses customers will apply to our product. Customers report the failures they care about, and not failures of little significance.

When Your Supplier Converts Reliability to MTBF

Oh, the trouble that will occur. The mistakes, mishaps and errors and most certainly the inability of the supplier to provide a reliability solution.

If you provide the supplier with a straightforward and complete reliability goal, and they convert it to a single number as an MTBF value, what really could go wrong? Also, why would the supplier degrade the requirement to an MTBF value?

What is MTBF?

The acronym MTBF is commonly known in our field as Mean Time Between Failure.

It is also associated with repairable systems in most textbooks.

It is also denoted as the theta parameter for an exponential distribution.

It is referenced as a metric for reliability, too. Oh, and it is the inverse of the failure rate.

And, it is misunderstood and misused by many. I digress, as there is plenty already written on the perils of MTBF.

What is MTBF? And where and how should it be used, if at all?

Holiday Break and a Few Notes

Thank you

First off I want to say thanks to you, the readers of the NoMTBF blog. The notes of thanks, of encouragement, and support all propel me to write to you each week.

I especially like the stories of success helping someone ‘get it’ concerning the common misunderstandings of MTBF. I have to think your work and actions are making a difference across the field of reliability engineering. We’re making progress.

Predicting Failure vs. Reacting to Failure

One of the Twitter notes I sent out a few weeks ago in part read, “Celebrate failures”. And a comment came back that it was a wonderful approach that she had not thought of before. Failure will occur and when it does it is our chance to learn.

And, we need to learn. As reliability professionals, we continue to learn our entire career. New materials fail in novel manners. New assemblies fail in an assortment of ways. New designs fail due to unknown sources of variation. We will see failures. So rather than simply focus on the next try and hope to find success, let’s learn from each failure as we move toward success.

Thoughts on Testing One Sample and No Failures

Reliability Testing with Constraints

In some cases we have to conduct testing and are asked to not break the product. Now, that isn’t all that fun as a reliability engineer. We want to find what fails and understand it. Or, we want to confirm that what we expect will fail actually does.

So, what do we do when confronted with a very small sample size (that is one issue) and are expected to conduct failure free testing (second issue)? Let’s explore each issue separately and come up with a few suggestions on how to proceed.

Thanks to Олег (@OlegV_Ivanov) via Twitter for the article suggestion. Thanks for the idea, Олег.

My Thoughts on the Internet of Things and Reliability

The Impact of IoT on Reliability Engineering

Article inspired by @JillNewberg. Thanks for the suggestion, Jill.

There are two elements to this subject. First there is the reliability of the elements collecting and connecting to the internet. Second is the potential value of the connection and information.

Are You Doing Your Professional Reading?

Professional reading

As reliability engineers we are the local experts. We know the arcane arts of product life and equipment uptime design and maintenance. We are sought after to estimate useful life and time to first failure, and we are consulted when failures occur.

5 Things You Can Do Today to Avoid Using MTBF

Take Action Today to Improve How Your Organization Talks About Reliability

You know the perils of MTBF use. The widespread misunderstanding and misuse. You know about how MTBF treats your data poorly.

You also know everyone around you uses MTBF. Your industry uses MTBF. And, no one likes change, least of all about metrics concerning reliability.

As I said to a friend this morning, “The madness has to stop.”

And, you feel the same way. So, what are you going to do about it? Here are five things you can do today.

  1. Use the data to calculate reliability (probability of success) over a duration of interest along with calculating MTBF, then share the results.

  2. Encourage five of your colleagues to check out and subscribe to this site, www.nomtbf.com.

  3. Ask a vendor how they determined the MTBF value they are presenting on the data sheet. What evidence supports that claim and what assumptions are included (often unstated)?

  4. The next time you hear someone mention MTBF, ask them what they mean. And then ask what percentage of items should survive a year. If they are not consistent, you found a learning opportunity.

  5. Write a blog post for the www.nomtbf.com site. What have you done to encourage better understanding of reliability concepts in your world? Share your hints, tips, stories, and advice here.
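For items 1 and 4, a quick consistency check might look like the sketch below. It assumes the constant failure rate model that an MTBF value usually implies, continuous operation (8,760 hours per year), and a hypothetical data sheet MTBF of 50,000 hours.

```python
import math

HOURS_PER_YEAR = 8760.0


def survival_implied_by_mtbf(mtbf_hours: float, duration_hours: float) -> float:
    """Chance of surviving the duration if the failure rate is 1 / MTBF."""
    return math.exp(-duration_hours / mtbf_hours)


def mtbf_needed_for(reliability_target: float, duration_hours: float) -> float:
    """MTBF required to meet a reliability target over the duration."""
    return -duration_hours / math.log(reliability_target)


claimed_mtbf = 50_000.0  # hypothetical data sheet value, hours

print(f"one-year survival: {survival_implied_by_mtbf(claimed_mtbf, HOURS_PER_YEAR):.1%}")
print(f"survival to the MTBF itself: {survival_implied_by_mtbf(claimed_mtbf, claimed_mtbf):.1%}")
print(f"MTBF needed for 99% at one year: {mtbf_needed_for(0.99, HOURS_PER_YEAR):,.0f} hours")
```

If the person quoting 50,000 hours also expects 99% of units to survive the first year, the two statements disagree: under these assumptions roughly 84% survive a year, and only about 37% survive to the MTBF itself.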

Pick one for today and do as many as you can. What would you add to this list? What kind of responses are you receiving when you speak out about the perils of MTBF?

Keep up the effort. Together we are making progress. Thanks for the support.