To Analyze or Not to Analyze, That is the Question

Can we successfully improve reliability without analyzing reliability data?

To Analyze or  Not to Analyze, That is the  Question

In my personal experience, I have worked over many years with a lot of people who have worked hard to improve reliability. In the majority of cases, their starting point is a high level of reactive maintenance. They aim to utilize the principles of RCM, RCFA, condition monitoring, and improved maintenance practices to break out of the reactive maintenance cycle, and help the organization reduce costs and improve performance. The question is: how useful is reliability data analysis to the average reliability engineer?
You have probably all heard the quotes by W. Edwards Deming; “Without data you’re just another person with an opinion”, and “In God we trust, all else must bring data”. That is great, but what data do you need, where do you get it, and what do you do with the data once you have got it?
In many cases (the majority of cases), the message I receive is that there is so much to improve, so much low-hanging fruit, that we do not need data to tell us what to do.
Let’s explore that though a little.

Asset Criticality Ranking (ACR)

Although it is not data in the true sense, it is a numerical ranking that every reliability engineer needs to make informed decisions. I have written about how to develop the asset criticality ranking in previous articles, including the previous edition of MaintWorld, but by generating a score based on the reliability of an asset, the consequences of failure of the asset, and the likelihood of detecting the onset of failure of the asset, enables us to assess where our risks and opportunities for improvement lie.
The asset criticality ranking helps us to prioritize tasks so that we can make the best use of our precious time. The prioritization is based on addressing the assets that pose the greatest risk. We can prioritize Reliability Centered Maintenance (RCM) or Failure Modes Effects Analysis (FMEA) analysis work, condition monitoring, work orders, and much more

Pareto analysis; the “bad actor” list

Pareto analysis is also a tremendously valuable tool for identifying where you should focus your attention to have the greatest impact. The basic idea of Pareto analysis is that just 20% or fewer of your assets generate 80% or greater of your costs (waste, downtime, etc.). 

If you go through your CMMS records you should hopefully be able to extract which assets have received maintenance work over the last few years. With that information you can create a special type of histogram, called a Pareto chart, which lists assets in the order of their impact; the assets that have failed most frequently on the left, ranked across the graph with the assets that have failed the least on the right.


You can also display it as a list; the “bad actor” list

Ideally you would also be able to create Pareto charts based on the nature of failure, the type of component, and if you were really, really fortunate, the root cause of the failure; but that would be hoping for too much…

Combining Criticality, Pareto, and condition

If you also have an effective condition monitoring program, then you have the perfect trio: you will know where you face the greatest risk (ACR), where you have experienced the most problems (Pareto), and which asset(s) are likely to require repair or replacement in the near future (CM).
So if you look out of your office window at 10,000 assets and you are wondering where to start, start with the assets that are at the top of the criticality ranking and bad actor list and condition-monitoring warning list.

Key Performance Indicators (KPIs)

KPIs are another type of data that reliability engineers should use. KPIs can tell you how your program is performing
(lagging KPIs), the performance you can expect in the future (leading KPIs), and where you have the greatest opportunity for improvement. This is a huge topic, but here are a few thoughts:
Rule #1 is that you have KPIs that are aligned with the goals of the organization. Operating Equipment Effectiveness (OEE) is an excellent KPI for a manufacturer.
Rule #2 is that you measure what you want to improve. People tend to focus on the metrics that are being recorded, especially if targets are established.
Rule #3 is that you use KPIs as described in the opening paragraph, and not to financially reward a person, otherwise you may find that there are some shenanigans when it comes to how data is recorded…

Mean Time Between Failure (MTBF)

One of the most common measures of reliability, and the effectiveness of the improvements being made by the program,
is the MTBF. But sadly, it is misused for a variety of reasons.
Just to ensure we are on the same page, if there were ten failures over a three-year period (36 months) then the MTBF would be 3.6 months (assuming the repair times are consistent in length). While it is a simple metric, unless there is a constant failure rate (i.e. the asset is not experiencing infant mortality or age related failures, and you are not changing the reliability within the analysis window), and you are focused on single failure modes, and you correctly treat voluntary removal from service (compared to actual failures), you must be very careful how you use MTBF.
There is more that could be said, and there is a great web site dedicated to this topic at where you can learn more.

Using Weibull Analysis and Monte Carlo Simulations

OK, now we are getting serious. This is way beyond simple MTBF analysis.
This is a HUGE topic.
Let’s just say this; utilizing failure data, performing Weibull analysis, and performing Monte Carlo “simulations” allows a reliability engineer to understand the nature of failure (infant mortality, random, age-related) and the time-to-failure (the P-F interval) to create a more informed maintenance strategy. It also allows the engineer to build reliability block diagrams to estimate the future availability of the plant (or parts of it). And it enables the reliability engineer to answer all kinds of questions related to the optimal time to perform maintenance, estimating maintenance costs, assessing risk, and MUCH more.
You need to go beyond basic MTBF analysis to answer these questions, but it takes good data and an understanding of statistics to be successful.

Do You Have Good Data?

But the bulk of the preceding commentary is based on having good data. Are you sure the information in the CMMS has been recorded accurately. When there is a failure recorded against “Air dryer #1”, are you sure it was not “Air dryer #2”? If you utilize failure data, are you sure about the component/part/failure mode/root cause recorded? If you find yourself saying “the data is not accurate, but it is the only data we’ve got”, then you must proceed with great care!

Collecting Good Data

The information we would like to have includes:
Which asset failed (the maintainable item)
The component type (motor, gearbox, etc.)
The part (coupling, drive-end bearing, etc.)
The failure mode (bearing seized, gear tooth broken, etc.)
The consequence of failure (production loss, labour cost, parts costs, etc.)
The root cause (lack of lubrication, incorrect operation, etc.)
When it failed

It is wishful thinking that you will be able to collect all of that information unless you go to considerable lengths, and in many cases you may not know the root cause. But for sure, the more accurate information you have the better position you will be in to make informed decisions.
One of the ways to collect this data is via failures codes. But if a technician is confronted with 82 options they will choose the first option or “other”. A smart system will only provide valid options. If the system knows the motor failed, it should only offer failure codes relevant to motors, and you should further limit them to the most common failure modes.

What If You Don’t Have Any Data?

There are three things you can do:

1 Use anecdotal information to construct the asset criticality ranking, and talk to the maintenance technicians and operators about the equipment that has failed the most and the most common repairs & installations. You may have records of the components and parts purchased most frequently. It should be possible to get a pretty good “bad actor” list.

2 Use “common knowledge” to set the priority of the improvement projects. It is well known that precision lubrication, alignment, balancing, fastening, and cleaning will improve the reliability of rotating machinery. You need standard operating procedures, work management, spares management, and written work procedures and standards. The 5S principles will improve efficiency and morale. Training will improve skills and morale. And the list goes on.

3 Start collecting good data. Today.
While the majority of reliability engineers may not feel they need data, or that they do not have good data to work with, it is hoped that this article will explain why you can and should use what you’ve got, and begin a plan to start collecting data that will help you make more informed decisions (and prove the benefits of your program).

Jason Tranter, CMRP,
Mobius Institute, jason@mobiusinstitute

Asset Management | 23.10.2018

Latest articles

The Ultimate Guide to ITSM Best Practices

ITSM or IT service management is a collective term used to describe the processes followed by organizations to design, plan, improve and deliver IT services they offer. It has become a standard procedure for organizations around the world to come up with a business-specific ITSM framework that aligns IT services and processes with organizational goals.

Cmms | 15.11.2018

Leak detection in the age of digitalization with SONOCHEK

Together with the update of the SONOCHEK apps SONOLEVEL and SONOLEAK, PRUFTECHNIK launches the SONOLEVEL DATAVIEWER, a new visualization PC-software for leak analysis. SONOCHEK is an innovative digital leak inspection device used to localize leaks in compressed air, gas and vacuum systems. The device includes automatic leak classification and its broad bandwidth of 20 to 100 kHz allows detecting leaks even at an early stage. SONOCHEK is also used effectively for electrical inspection and monitoring of bearing lubrication.

Asset Management | 24.10.2018

Cost-Effective, High Performance Flare Monitoring in the Petrochemical Industry

Safe flare operation and environmental protection require reliable and accurate flare pilot monitoring.

Partner Articles | 24.10.2018

Standards Increase Economic Growth and Help Companies Access Foreign Markets

Standardisation is a key to a well-functioning, economically prospering and sustainable society. It is also a way to increase economic growth. These conclusions can be made from the results of the new study “The Influence of Standards on the Nordic Economies”.

R&D | 24.10.2018

Future of Work: Gig Economy and Field Maintenance

In today's dynamic and inter-dependent world, traditional jobs are losing ground due to the rise of the gig economy. Modern blue and white collar workers now have the opportunity to choose to whom they will give their time, where will they work and under which conditions. 

HSE | 24.10.2018

How ultrasonic sensors and artificial intelligence improve condition monitoring

In an industrial setting, assets are everything. A breakdown can cause hours  of downtime, and thus hours of lost productivity and lower financial gains. Preventative maintenance is one way facility managers counteract system entropy, but it is not perfect strategy. In some cases, a preventive approach to maintenance can actually lower the overall effectiveness of an asset. In the case of a valve for example, constant tightening can cause premature wear and tear.

Applications | 23.10.2018

Maintenance: A Necessary and Important Function in the Future

Euromaintenance 2016 will take place in Athens at the end of May. It is the ideal moment to reflect on maintenance in a European context. Euromaintenance is known as the summit for all involved in maintenance across Europe, it’s the place to be. The conference, with the support of the EFNMS, is the only commercially independent conference covering the topics we deal with in the maintenance world.

EFNMS | 20.5.2016