To Analyze or Not to Analyze, That is the Question

Can we successfully improve reliability without analyzing reliability data?

To Analyze or  Not to Analyze, That is the  Question

In my personal experience, I have worked over many years with a lot of people who have worked hard to improve reliability. In the majority of cases, their starting point is a high level of reactive maintenance. They aim to utilize the principles of RCM, RCFA, condition monitoring, and improved maintenance practices to break out of the reactive maintenance cycle, and help the organization reduce costs and improve performance. The question is: how useful is reliability data analysis to the average reliability engineer?
You have probably all heard the quotes by W. Edwards Deming; “Without data you’re just another person with an opinion”, and “In God we trust, all else must bring data”. That is great, but what data do you need, where do you get it, and what do you do with the data once you have got it?
In many cases (the majority of cases), the message I receive is that there is so much to improve, so much low-hanging fruit, that we do not need data to tell us what to do.
Let’s explore that though a little.

Asset Criticality Ranking (ACR)

Although it is not data in the true sense, it is a numerical ranking that every reliability engineer needs to make informed decisions. I have written about how to develop the asset criticality ranking in previous articles, including the previous edition of MaintWorld, but by generating a score based on the reliability of an asset, the consequences of failure of the asset, and the likelihood of detecting the onset of failure of the asset, enables us to assess where our risks and opportunities for improvement lie.
The asset criticality ranking helps us to prioritize tasks so that we can make the best use of our precious time. The prioritization is based on addressing the assets that pose the greatest risk. We can prioritize Reliability Centered Maintenance (RCM) or Failure Modes Effects Analysis (FMEA) analysis work, condition monitoring, work orders, and much more

Pareto analysis; the “bad actor” list

Pareto analysis is also a tremendously valuable tool for identifying where you should focus your attention to have the greatest impact. The basic idea of Pareto analysis is that just 20% or fewer of your assets generate 80% or greater of your costs (waste, downtime, etc.). 

If you go through your CMMS records you should hopefully be able to extract which assets have received maintenance work over the last few years. With that information you can create a special type of histogram, called a Pareto chart, which lists assets in the order of their impact; the assets that have failed most frequently on the left, ranked across the graph with the assets that have failed the least on the right.


You can also display it as a list; the “bad actor” list

Ideally you would also be able to create Pareto charts based on the nature of failure, the type of component, and if you were really, really fortunate, the root cause of the failure; but that would be hoping for too much…

Combining Criticality, Pareto, and condition

If you also have an effective condition monitoring program, then you have the perfect trio: you will know where you face the greatest risk (ACR), where you have experienced the most problems (Pareto), and which asset(s) are likely to require repair or replacement in the near future (CM).
So if you look out of your office window at 10,000 assets and you are wondering where to start, start with the assets that are at the top of the criticality ranking and bad actor list and condition-monitoring warning list.

Key Performance Indicators (KPIs)

KPIs are another type of data that reliability engineers should use. KPIs can tell you how your program is performing
(lagging KPIs), the performance you can expect in the future (leading KPIs), and where you have the greatest opportunity for improvement. This is a huge topic, but here are a few thoughts:
Rule #1 is that you have KPIs that are aligned with the goals of the organization. Operating Equipment Effectiveness (OEE) is an excellent KPI for a manufacturer.
Rule #2 is that you measure what you want to improve. People tend to focus on the metrics that are being recorded, especially if targets are established.
Rule #3 is that you use KPIs as described in the opening paragraph, and not to financially reward a person, otherwise you may find that there are some shenanigans when it comes to how data is recorded…

Mean Time Between Failure (MTBF)

One of the most common measures of reliability, and the effectiveness of the improvements being made by the program,
is the MTBF. But sadly, it is misused for a variety of reasons.
Just to ensure we are on the same page, if there were ten failures over a three-year period (36 months) then the MTBF would be 3.6 months (assuming the repair times are consistent in length). While it is a simple metric, unless there is a constant failure rate (i.e. the asset is not experiencing infant mortality or age related failures, and you are not changing the reliability within the analysis window), and you are focused on single failure modes, and you correctly treat voluntary removal from service (compared to actual failures), you must be very careful how you use MTBF.
There is more that could be said, and there is a great web site dedicated to this topic at where you can learn more.

Using Weibull Analysis and Monte Carlo Simulations

OK, now we are getting serious. This is way beyond simple MTBF analysis.
This is a HUGE topic.
Let’s just say this; utilizing failure data, performing Weibull analysis, and performing Monte Carlo “simulations” allows a reliability engineer to understand the nature of failure (infant mortality, random, age-related) and the time-to-failure (the P-F interval) to create a more informed maintenance strategy. It also allows the engineer to build reliability block diagrams to estimate the future availability of the plant (or parts of it). And it enables the reliability engineer to answer all kinds of questions related to the optimal time to perform maintenance, estimating maintenance costs, assessing risk, and MUCH more.
You need to go beyond basic MTBF analysis to answer these questions, but it takes good data and an understanding of statistics to be successful.

Do You Have Good Data?

But the bulk of the preceding commentary is based on having good data. Are you sure the information in the CMMS has been recorded accurately. When there is a failure recorded against “Air dryer #1”, are you sure it was not “Air dryer #2”? If you utilize failure data, are you sure about the component/part/failure mode/root cause recorded? If you find yourself saying “the data is not accurate, but it is the only data we’ve got”, then you must proceed with great care!

Collecting Good Data

The information we would like to have includes:
Which asset failed (the maintainable item)
The component type (motor, gearbox, etc.)
The part (coupling, drive-end bearing, etc.)
The failure mode (bearing seized, gear tooth broken, etc.)
The consequence of failure (production loss, labour cost, parts costs, etc.)
The root cause (lack of lubrication, incorrect operation, etc.)
When it failed

It is wishful thinking that you will be able to collect all of that information unless you go to considerable lengths, and in many cases you may not know the root cause. But for sure, the more accurate information you have the better position you will be in to make informed decisions.
One of the ways to collect this data is via failures codes. But if a technician is confronted with 82 options they will choose the first option or “other”. A smart system will only provide valid options. If the system knows the motor failed, it should only offer failure codes relevant to motors, and you should further limit them to the most common failure modes.

What If You Don’t Have Any Data?

There are three things you can do:

1 Use anecdotal information to construct the asset criticality ranking, and talk to the maintenance technicians and operators about the equipment that has failed the most and the most common repairs & installations. You may have records of the components and parts purchased most frequently. It should be possible to get a pretty good “bad actor” list.

2 Use “common knowledge” to set the priority of the improvement projects. It is well known that precision lubrication, alignment, balancing, fastening, and cleaning will improve the reliability of rotating machinery. You need standard operating procedures, work management, spares management, and written work procedures and standards. The 5S principles will improve efficiency and morale. Training will improve skills and morale. And the list goes on.

3 Start collecting good data. Today.
While the majority of reliability engineers may not feel they need data, or that they do not have good data to work with, it is hoped that this article will explain why you can and should use what you’ve got, and begin a plan to start collecting data that will help you make more informed decisions (and prove the benefits of your program).

Jason Tranter, CMRP,
Mobius Institute, jason@mobiusinstitute

Asset Management | 23.10.2018

Latest articles

Wärtsilä software will enable a new Simulation Centre to meet needs of cruise, ferry and superyacht clients

The technology group Wärtsilä is to supply and configure the software solutions for a new, state-of-the-art, purpose built simulation centre.

R&D | 20.5.2019

Safe chemical storage is essential for safety, health and business prosperity

Many industries use and store dangerous chemicals, and the chemical waste produced. Storing these substances safely is essential to protect workers’ health, the environment, and the public from the effects of potential chemical spills and gas emissions. In this article the European Union information agency for occupational safety and health, EU-OSHA, gives tips on how to keep workers safe and healthy.

HSE | 16.5.2019

Safe and sustainable airport maintenance with tractors

Robotic vacuum cleaners and lawnmowers are already a reality, but can the same idea be applied to a 7-ton machine clearing an airstrip from snow? Valtra and Nokian Tyres partnered up with three other Finnish companies for a joint proof-of-concept project at the European Union’s northernmost airport. 

Applications | 15.5.2019

Maintenance Performance and its Measurements

Maintenance performance can be difficult to measure.  But, measurements drive behavior and it is therefore important to think through what to measure.  The measurement should be designed in order to achieve the following:

Asset Management | 25.4.2019

Implementing RCM? 5 Mistakes You Need to Avoid

If you are considering to employ  RCM analysis  at your facility, it means you have recognized the need for a change in your maintenance strategies. Reliability-centered maintenance is an excellent way to keep your plant or machinery up and running by helping you choose the optimal maintenance strategy for all of your important assets. 

Cmms | 19.3.2019

Save Energy in Steam and Condensate Systems

The oldest paper mill still operating in Germany detects defective steam traps with digital ultrasonic testing technology

Partner Articles | 15.3.2019

Time to Invest in Europe’s Water Infrastructure

THE WATER AND SANITATION sector is an important part of the European economy. It represents more than 500,000 people directly employed in water and sewerage companies, operating thousands of facilities and several million kilometres of networks.

Editorial | 14.3.2019

Maintenance: A Necessary and Important Function in the Future

Euromaintenance 2016 will take place in Athens at the end of May. It is the ideal moment to reflect on maintenance in a European context. Euromaintenance is known as the summit for all involved in maintenance across Europe, it’s the place to be. The conference, with the support of the EFNMS, is the only commercially independent conference covering the topics we deal with in the maintenance world.

EFNMS | 20.5.2016