“What’s the MTBF of a Human?” That’s a bit of a strange question.
Guest post by Adam Bahret
I ask this question in my Reliability 101 course. Why ask such a weird question? I’ll tell you why: because MTBF is the worst, most confusing, crappy metric used in the reliability discipline. OK, maybe that is a smidge harsh; it does have good intentions. But the amount of damage done by the misunderstanding it has caused is horrendous.
MTBF stands for “Mean Time Between Failures.” It is the inverse of failure rate. An MTBF of 100,000 hrs/failure is a failure rate of 1/100,000 failures/hr = 0.00001 failures/hr. Those are just numbers. What do they look like in operation?
Does it mean…
The product lasts 100,000 hrs before failing?
Half the population fails by 100,000 hours?
Wait a minute! Our product is only supposed to last three years with a 50% duty cycle. That’s 13,140 hrs of use (3 years × 8,760 hrs/year × 0.5). Why would we have an MTBF goal of 100,000 hours? It can’t even run that long if everything goes perfectly.
Because of all this confusion, everyone makes up their own definition of what MTBF is. It’s usually one of the first two I listed: how long it runs, or when half have failed. No one wants to sound stupid and ask what it means, so we all just pretend. This is the moral of “The Emperor’s New Clothes.” Well, I’m here to tell you that that dude is buck naked.
That is why I ask the “MTBF of a Human” question. It forces a harsh realization when I give the correct answer. The answer is “At least 800 years.” The beauty is that the shared confusion in the room brings a sigh of relief as everyone realizes they weren’t alone in not knowing. “Dude, go put some clothes on, you’re kinda freaking us out, and there are kids here!”
OK, so I gave the answer. The MTBF of a human is 800 years. That’s actually very conservative. In your current lifestyle it is probably more like 2,000 years. An 800-year MTBF is more indicative of living in some very harsh old-world conditions. Maybe a coal mining town in the 1700s.
The group’s surprise that a human can have an 800 year MTBF brings about a new interest in hearing some MTBF 101.
Ok here we go…..
When MTBF is used as a metric to describe a product’s reliability during its use life, there are three assumptions.
- The first is that no “infant mortality” failures (i.e. quality failures) are included in this metric.
- The second is that no wear-out failures (i.e. end-of-life failures) are included in this metric.
- The third is that failures during use life occur randomly. So at any given moment during use life a failure is just as likely to occur as at any other moment. For a product with a ten-year life this means that a random failure is just as likely to occur at three months of age as it is at seven years of age.
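The third assumption can be checked with a few lines of arithmetic. The sketch below assumes the standard exponential survival model that goes with a constant failure rate, R(t) = exp(-t / MTBF). The function names and the 50-year MTBF are hypothetical, chosen only for illustration. It compares the chance of failing within the next year for a unit that has survived three months against one that has survived seven years.

```python
import math


def reliability(t, mtbf):
    """Probability of surviving to time t under a constant failure rate."""
    return math.exp(-t / mtbf)


def conditional_failure_probability(age, window, mtbf):
    """Probability of failing within `window`, given survival to `age`."""
    survive_to_age = reliability(age, mtbf)
    survive_to_end = reliability(age + window, mtbf)
    return 1.0 - survive_to_end / survive_to_age


MTBF_YEARS = 50.0  # hypothetical value, not from the article

# Chance of failing in the next year at two very different ages.
at_three_months = conditional_failure_probability(0.25, 1.0, MTBF_YEARS)
at_seven_years = conditional_failure_probability(7.0, 1.0, MTBF_YEARS)

print(f"{at_three_months:.6f} {at_seven_years:.6f}")
```

The two printed values agree, which is what "failures occur randomly" means here: under a constant failure rate, the unit's age tells you nothing about its chance of failing next.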
What does that mean for our human example? Let’s frame the question in a meaningful way. The individual asking “What is the MTBF of a human?” is the owner of a coal mine in an isolated 1700s Midwest town. Every person in that town works in the coal mine. The owner wants to know how often he can expect people to not show up to work because they are sick or injured. Oh yeah, and the sickness or injury “had better a killed them dead, if theys not showing up to work.” He doesn’t believe in sick days. He doesn’t care about children or retired people either. They have nothing to do with the coal mine’s productivity. The coal mine reliability engineer does a quick calculation and tells the owner that the miners can be expected to have an MTBF of 800 years.
Of course the owner wants to know what that means in practical terms. The reliability engineer explains that over an 800-year period he can expect 63.2% of the work population to not show up to work due to death from a random illness or injury. The owner still isn’t sure what that means day to day for his operation. The engineer puts that MTBF into a reliability percentage equation to find the reliability for a one-year period, which is a more useful number for the owner.
“How many employees can we expect to not show up due to death in a one-year period?”
The answer is we can expect a reliability of 99.875% for the work population over a one-year period. This translates to about 13 deaths per year for a work population of 10,000 employees. See what I mean about the 800-year MTBF for a human being very conservative? Could you imagine a modern-day university with 10,000 students having 13 students die each year from accident or illness? I’m pretty sure someone would be looking into what is going on at that campus.
Let’s break down how we got from an 800-year MTBF to an expectation that in any given year 13 employees may die. The equation we will use is R(t) = e^(-t/MTBF), where R(t) is the reliability (the fraction of the population surviving) over a period t. It is a special case of the Weibull equation, with the assumption that we are dealing with a constant failure rate and no offsets.
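For readers who want to see where the constant-failure-rate form comes from, here is a sketch of the reduction from the three-parameter Weibull reliability function. The symbols β (shape), η (scale), γ (location offset), and λ (failure rate) are standard textbook notation and are not taken from the original post.

```latex
\begin{align}
R(t) &= \exp\!\left[-\left(\frac{t-\gamma}{\eta}\right)^{\beta}\right]
  && \text{three-parameter Weibull} \\
R(t) &= \exp\!\left(-\frac{t}{\eta}\right)
  && \text{constant failure rate } (\beta = 1),\ \text{no offset } (\gamma = 0) \\
R(t) &= \exp\!\left(-\frac{t}{\mathrm{MTBF}}\right) = e^{-\lambda t}
  && \eta = \mathrm{MTBF} = 1/\lambda
\end{align}
```

With MTBF = 800 years and t = 1 year, the last line gives R(1) = e^(-1/800) ≈ 0.99875, the 99.875% quoted above.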
The calculation also rests on these assumptions:
- We do not include children who die (<13 years of age). These are infant mortality failures. In the production world we consider these to be quality defects and not a characteristic of the design’s reliability.
- We don’t include retirees. In production these are items that are to be removed from service (retired). The manufacturer has predicted that wear-out failure modes are going to become dominant at some point and that the promised use reliability will no longer be upheld.
- We are not repairing systems.
- Units that fail are immediately replaced with new units that are past the infant mortality stage, so the population stays at a consistent number.
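The numbers in the coal mine example can be reproduced in a few lines. This is a minimal sketch that assumes the constant-failure-rate reliability equation R(t) = exp(-t / MTBF). The variable and function names are mine, not the author’s.

```python
import math

MTBF_YEARS = 800.0   # the miners' MTBF from the example
WORKFORCE = 10_000   # work population from the example


def reliability(t_years, mtbf_years):
    """Fraction of the population expected to survive t_years."""
    return math.exp(-t_years / mtbf_years)


# Reliability over a single year, and the deaths that implies.
one_year_reliability = reliability(1.0, MTBF_YEARS)
expected_deaths = WORKFORCE * (1.0 - one_year_reliability)

# Fraction of an original group expected to have failed when t equals the MTBF.
failed_by_mtbf = 1.0 - reliability(MTBF_YEARS, MTBF_YEARS)

print(f"One-year reliability: {one_year_reliability:.3%}")
print(f"Expected deaths per year: {expected_deaths:.1f}")
print(f"Fraction failed by t = MTBF: {failed_by_mtbf:.1%}")
```

The expected count comes out at roughly 12.5, which rounds up to the 13 deaths quoted earlier. The last line shows why neither of the popular definitions holds: by the time t reaches the MTBF, about 63% of an original group has failed, not 0% and not 50%.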
I believe the big shift in understanding what MTBF means is realizing that “wear out” failures do not contribute to it. When an item approaches wear out it is effectively removed from the MTBF population and replaced with a fresh unit. It never failed. A graphical way to look at this is the “bathtub curve,” below. It is commonly used to show failure rate over life for a population. You can clearly see the three life phases for the population.
There is an infant mortality phase driven by quality defects. This failure rate quickly falls as the defective units fall out of the population and are replaced with new units. We then have the useful life, where there is a constant failure rate. The height of this line (the failure rate) is dictated by the MTBF. The higher the MTBF, the lower the line. Remember, failure rate is the inverse of MTBF. The third phase is “wear out,” where predictable failures driven by accumulated stress begin to dominate the population’s behavior.
For the user it is recommended to remove the units at this point and replace them with new ones. Effectively we are creating a scenario where the customer’s experience is a flat line with a small wave in it (quality defects introduced by the replacement units) that goes on forever. The wave can be reduced to almost nothing if quality practices are improved and products are quality screened so that defective product never leaves the factory.
Why do I dislike MTBF so much? It’s just too confusing unless you keep in mind all of those assumptions. It’s valuable for statistics because it is a population characteristic that is easily transferable between equations. But for general discussion with individuals not using it for statistics (designers, marketing, project managers), it creates more confusion than clarity. It is just better to discuss product performance in % reliability, failure rate, or availability.
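The three bathtub phases described above can be sketched numerically with the Weibull failure-rate (hazard) function, h(t) = (shape/scale) × (t/scale)^(shape − 1). The shape values below are illustrative assumptions, not figures from the article. A shape below 1 gives a falling failure rate, a shape of exactly 1 gives the flat useful-life line, and a shape above 1 gives a rising wear-out rate.

```python
def weibull_hazard(t, shape, scale):
    """Instantaneous failure rate of a Weibull distribution at time t."""
    return (shape / scale) * (t / scale) ** (shape - 1)


SCALE_YEARS = 800.0  # with shape = 1, the scale equals the MTBF

# Hypothetical shape parameters, one per bathtub phase.
phases = {"infant mortality": 0.5, "useful life": 1.0, "wear out": 3.0}

trends = {}
for name, shape in phases.items():
    early = weibull_hazard(1.0, shape, SCALE_YEARS)
    late = weibull_hazard(100.0, shape, SCALE_YEARS)
    if late < early:
        trends[name] = "falling"
    elif late > early:
        trends[name] = "rising"
    else:
        trends[name] = "flat"
    print(f"{name:>16}: {trends[name]}")
```

In the useful-life case the hazard is simply 1/scale at every age, which is the "failure rate is the inverse of MTBF" relationship. A real bathtub curve blends the three behaviors over a product’s life; this sketch only shows each phase in isolation.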