Predictive maintenance techniques are designed to anticipate equipment failures so that corrective maintenance can be scheduled in advance, thereby preventing unexpected equipment downtime, improving service quality for customers, and reducing the additional cost of over-maintenance inherent in preventive maintenance policies. Recent analysis suggests that the market for predictive maintenance applications is poised to grow from $2.2B in 2017 to $10.9B by 2022, a 39% annual growth rate. The major industries where these techniques can be used include oil and gas, mining, manufacturing, and food and beverage. Many types of equipment (e.g., manufacturing equipment, information technology equipment, and medical devices) track run-time status by generating system messages, error events, and log files, which can be used to predict impending failures. In the current study we apply ML and AI techniques to derive insights for machine maintenance and failure prevention, focusing specifically on data from a turbofan engine.
The problem statement attempted in this study may be divided into the following specific objectives: (1) estimating the remaining useful life (RUL) of the equipment, (2) assessing its state of health, i.e., whether it is in the last 'n' cycles of its life, (3) detecting outliers/anomalies in the sensor data, and (4) identifying the most informative features through Pareto analysis.
The data set used here was provided by the Prognostics CoE at NASA Ames and can be found in the Prognostics Data Repository [20].
We followed this thought process in shortlisting the dataset of choice: the dataset should be a time series and should capture machine degradation. Since the Turbofan Engine Degradation Simulation Datasets show both characteristics, we consider this group of datasets for our experiment. We have considered five such similar datasets.
The datasets differ in the conditions under which the engines are run and in their fault modes. They consist of multiple multivariate time series. The data can be considered to come from a fleet of engines of the same type. Each engine starts with a different degree of initial wear and manufacturing variation, which is unknown to the user. This wear and variation is considered normal, i.e., it is not considered a fault condition. There are three operational settings that have a substantial effect on engine performance; these settings are also included in the data. The data is further contaminated with sensor noise.
Each dataset is described below:
This dataset was used for the prognostics challenge competition at the International Conference on Prognostics and Health Management (PHM08) [20]. The engine is operating normally at the start of each time series and starts to degrade at some point during the series. In the training set, the degradation grows in magnitude until a predefined threshold is reached, beyond which it is not preferable to operate the engine. In the test set, the time series ends some time prior to complete degradation.
In this section we briefly describe the turbofan engine and how data is collected from it. The turbofan consists of the components shown in Fig. 1. Each component is equipped with different sensors, whose outputs act as the features in our dataset. Fig. 2 lists all the features. The engine is operating normally at the start of each time series and develops a fault at some point during the series. In the training set, the fault grows in magnitude until system failure. In the test set, the time series ends some time prior to system failure.
We plotted histograms and box plots for all the features to analyze their central tendencies and outliers. Features that are relatively constant over time are dropped. The histograms and box plots for two of the features are shown below.
We can see the distribution of data in 'col7' and 'col8' of Dataset 1. We observed that some of the columns had constant or near-constant values; these could be removed from our analysis, as they bear no relation to the equipment lifecycle.
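As an illustration, near-constant columns can be filtered by their standard deviation. The sketch below is a minimal example, not the study's exact implementation; the tolerance value and DataFrame layout are assumptions.

```python
import pandas as pd

def drop_constant_columns(df: pd.DataFrame, tol: float = 1e-6) -> pd.DataFrame:
    """Drop columns whose standard deviation is (near) zero, i.e. features
    that stay flat over the equipment's life and carry no degradation signal."""
    stds = df.std(numeric_only=True)
    constant_cols = stds[stds <= tol].index.tolist()
    return df.drop(columns=constant_cols)

# df = drop_constant_columns(df)  # e.g. 'col7' and 'col8' vary, so they survive
```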
In this section we discuss the core solution steps, focusing on four primary areas: remaining useful life estimation, assessment of state of health, outlier analysis, and Pareto analysis.
Remaining useful life estimation is central to the prognostics and health management of systems, particularly for safety-critical and very expensive systems. We present a non-linear model to estimate the remaining useful life of a system based on monitored degradation data.
Label Creation
The RUL label is created by reversing the cycle feature, which records the number of cycles the equipment has run before it completely degrades; the label therefore counts down to failure. The problem then reduces to a simple regression and can be solved using different regression techniques. We have used the Gradient Boosting regressor for this problem.
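A minimal sketch of this label construction is shown below. The 'unit' and 'cycle' column names are assumptions for illustration; in the NASA files they identify the engine and its operating cycle.

```python
import pandas as pd

def add_rul_label(df: pd.DataFrame, id_col: str = "unit",
                  cycle_col: str = "cycle") -> pd.DataFrame:
    """RUL at a given cycle = (engine's last observed cycle) - (current cycle),
    i.e. the cycle counter reversed so the label counts down to failure."""
    out = df.copy()
    max_cycle = out.groupby(id_col)[cycle_col].transform("max")
    out["RUL"] = max_cycle - out[cycle_col]
    return out
```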
Preprocessing
As part of preprocessing, the features that are relatively constant over time are dropped. We then split the dataset into train and test sets and train the model with the Gradient Boosting Regressor. A schematic of the entire pipeline is depicted in Fig. 4.
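A hedged sketch of this pipeline is given below, building on the RUL label from the previous step. The hyperparameters and the plain random split are illustrative; the study's exact configuration may differ.

```python
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split

# X: retained sensor features; y: the RUL label constructed earlier.
# Column names follow the assumptions in the label-creation sketch above.
X = df.drop(columns=["unit", "cycle", "RUL"])
y = df["RUL"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

model = GradientBoostingRegressor(n_estimators=200, max_depth=3,
                                  learning_rate=0.1, random_state=42)
model.fit(X_train, y_train)
rul_pred = model.predict(X_test)
```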
Data points towards the end of equipment life (with low RUL values) are more critical because they affect the decision-making process of equipment maintenance. Therefore, we want to make sure we select the model that performs better when RUL values are low. To ensure errors at such points are penalized more, metrics such as the R2 score, Mean Absolute Error (MAE), and Root Mean Square Error (RMSE) were customized when comparing model performance across datasets. The customized metrics are shown below:
Customized Mean Absolute Error (MAE):
Customized Root Mean Square Error (RMSE):
Customized R2 score:
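The exact weighting used in the study is not reproduced above, so the sketch below shows one plausible construction: each error is weighted by a factor that decays exponentially in the true RUL (the decay scale tau is an assumption), so points near end-of-life dominate the metric.

```python
import numpy as np

def _weights(y_true: np.ndarray, tau: float = 30.0) -> np.ndarray:
    # Illustrative weighting: low-RUL points get weights close to 1,
    # high-RUL points are down-weighted.
    return np.exp(-y_true / tau)

def customized_mae(y_true, y_pred, tau=30.0):
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    w = _weights(y_true, tau)
    return np.sum(w * np.abs(y_true - y_pred)) / np.sum(w)

def customized_rmse(y_true, y_pred, tau=30.0):
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    w = _weights(y_true, tau)
    return np.sqrt(np.sum(w * (y_true - y_pred) ** 2) / np.sum(w))

def customized_r2(y_true, y_pred, tau=30.0):
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    w = _weights(y_true, tau)
    y_bar = np.sum(w * y_true) / np.sum(w)
    ss_res = np.sum(w * (y_true - y_pred) ** 2)
    ss_tot = np.sum(w * (y_true - y_bar) ** 2)
    return 1.0 - ss_res / ss_tot
```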
When assessing the health of the equipment, an important task is to know whether the equipment is in the last 'n' cycles of its life. If it is, maintenance must be arranged beforehand and spare parts procured within the stipulated time, in order to reduce the impact of breakdowns on production.
Solution Approach
This is a binary classification problem: the positive class is the equipment being in the last 'n' cycles of its remaining useful life, and the negative class is everything else. Here 'n' is usually determined by industry requirements and the desired prediction lead time, i.e., how many cycles before failure the user wants to initiate maintenance procedures. The problem can be solved using different classification models; we proceeded with the LSTM approach discussed below.
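Given the RUL label constructed earlier, the binary label follows directly; in this sketch n = 30 is purely illustrative.

```python
# Positive class: equipment is within its last n cycles of useful life.
n = 30  # set per industry requirements / desired maintenance lead time
df["label"] = (df["RUL"] <= n).astype(int)
```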
Classification methodology used
Long short-term memory (LSTM) is an artificial recurrent neural network (RNN) architecture used in deep learning. Unlike standard feedforward neural networks, LSTM has feedback connections, so it can process not only single data points but entire sequences of data; hence we use it for our classification problem. Since we have class imbalance, with far more negative than positive points, accuracy is not a reliable metric, so we use precision, recall, and F1 score to compare models. Fig. 5 shows a schematic of the LSTM architecture used for classification.
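A minimal sketch of such a classifier in Keras is shown below. The sequence length, feature count, and layer sizes are assumptions; the study's exact architecture (Fig. 5) may differ.

```python
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense, Dropout

SEQ_LEN, N_FEATURES = 30, 14  # illustrative: 30-cycle windows, 14 retained sensors

model = Sequential([
    LSTM(64, input_shape=(SEQ_LEN, N_FEATURES)),
    Dropout(0.2),
    Dense(1, activation="sigmoid"),  # P(engine is in its last n cycles)
])
model.compile(optimizer="adam", loss="binary_crossentropy",
              metrics=[tf.keras.metrics.Precision(), tf.keras.metrics.Recall()])

# X_seq: array of shape (num_windows, SEQ_LEN, N_FEATURES), built by sliding a
# window over each engine's sensor history; y_seq: one binary label per window.
# model.fit(X_seq, y_seq, epochs=20, batch_size=64, validation_split=0.1)
```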
An outlier (anomaly) is a data point that differs significantly from the other data points. In this case, outliers may occur for three reasons: equipment failure, i.e., an untimed breakdown; a new normal condition, e.g., when a failed unit is replaced by new equipment; or sensor faults.
Solution Approach for Outlier Analysis
Anomalies can be detected using different multivariate anomaly detection techniques such as K-Means clustering, Isolation Forest, and One-Class SVM. In K-Means clustering, clusters are formed using the usual clustering technique and a threshold distance is calculated from the outlier fraction; if a data point's distance from its cluster centroid exceeds this threshold, the point is considered an anomaly. In Isolation Forest, the data is randomly partitioned along different features, and points that require the fewest partitions to be isolated from the rest of the data are classified as outliers. One-Class SVM is similar to the classic SVM, but is trained on a single class, in our case the normal data points; the algorithm learns a boundary around these points and classifies points outside it as outliers.
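The sketch below shows how the three detectors can be applied with scikit-learn; the outlier fraction and cluster count are assumptions for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.ensemble import IsolationForest
from sklearn.svm import OneClassSVM

OUTLIER_FRACTION = 0.01  # illustrative contamination level
# X: scaled sensor feature matrix of shape (n_samples, n_features)

# Isolation Forest: points isolated in few random partitions are outliers.
iso_labels = IsolationForest(contamination=OUTLIER_FRACTION,
                             random_state=42).fit_predict(X)   # -1 = outlier

# One-Class SVM: learns a boundary around the normal points.
svm_labels = OneClassSVM(nu=OUTLIER_FRACTION, kernel="rbf",
                         gamma="scale").fit_predict(X)          # -1 = outlier

# K-Means: flag points whose distance to their centroid exceeds a threshold
# derived from the outlier fraction.
km = KMeans(n_clusters=5, n_init=10, random_state=42).fit(X)
dist = np.linalg.norm(X - km.cluster_centers_[km.labels_], axis=1)
km_outliers = dist > np.quantile(dist, 1 - OUTLIER_FRACTION)
```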
Pareto analysis is a statistical decision-making technique for selecting the limited number of tasks that produce a significant overall effect. It uses the Pareto principle (also known as the 80/20 rule): roughly 20% of causes usually generate 80% of the benefit. In our case, we obtain feature (sensor) importances from the feature importance attribute of sklearn's Gradient Boosting Classifier and select the top features explaining 80% of the variance in the equipment's state of health. One can then focus more closely on these top features to get a preliminary idea of the equipment's state of health.
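A sketch of this selection with scikit-learn is shown below; the fitted classifier and feature names are assumed to come from the state-of-health step.

```python
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier

# clf: GradientBoostingClassifier fitted on the state-of-health labels;
# X_train is assumed to be a DataFrame of the retained sensor features.
clf = GradientBoostingClassifier(random_state=42).fit(X_train, y_train)

importance = pd.Series(clf.feature_importances_, index=X_train.columns)
cumulative = importance.sort_values(ascending=False).cumsum() * 100

# 'Vital few': the top features up to 80% cumulative importance.
vital_few = cumulative[cumulative <= 80].index.tolist()
print("Vital few features:", vital_few)
```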
I. Remaining Useful Life
Fig. 6 depicts the performance metrics obtained on the different datasets. We find that the customized metrics show better model performance, indicating that the model does well close to the end of equipment life, where accuracy is critical.
II. Assessment of State of Health
We considered precision, recall, and the F1 score as metrics to compare the results obtained on different datasets (see Fig. 7). The F1 score is the harmonic mean of precision and recall. It is important not to mislabel the positive class, as that would delay maintenance, so our primary focus was improving recall. The model works better on dataset 1 and dataset 3 than on the other datasets. We observe that performance improves as the sequence length increases up to a certain threshold (30-35) and then becomes insensitive to further increases. There is a tradeoff between precision and recall; we try to improve recall while making sure precision does not drop too low.
III. Outlier Analysis
Anomaly detection helps identify unusual occurrences that might not be evident from manual observation. Different methods identify anomalies in different contexts; we are most concerned with instances corresponding to equipment degradation. Fig. 8 depicts charts for all three methods on dataset 5: RUL values for different equipment are plotted, with red points corresponding to anomalous instances.
K-Means clustering seems to have correctly identified some of the anomalies we are looking for towards the end of the degradation cycle, but also some between the middle and end of the cycle. One-Class SVM identified most of the anomalies in a single degradation cycle. Isolation Forest gives the best result for our case, as it identifies almost all of the anomalies towards the end of the degradation cycle; hence it is used for anomaly detection in the further analysis. We also visualized the detected anomalies in 2D feature space for dataset 5, as shown in Fig. 9.
For the 2D representation we use t-SNE to reduce the features to two components, represented by x and y. In every 2D representation of anomalies in each dataset, distinct clusters of points are observed, with the anomalies lying on the edges of these clusters.
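A minimal sketch of this visualization is given below; the boolean mask is_outlier is assumed to come from the Isolation Forest step.

```python
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

# X: sensor feature matrix; is_outlier: boolean mask of detected anomalies.
xy = TSNE(n_components=2, random_state=42).fit_transform(X)

plt.scatter(xy[~is_outlier, 0], xy[~is_outlier, 1], s=5, label="normal")
plt.scatter(xy[is_outlier, 0], xy[is_outlier, 1], s=15, c="red", label="anomaly")
plt.xlabel("x"); plt.ylabel("y"); plt.legend(); plt.show()
```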
IV. Pareto Analysis
The Pareto chart plots cumulative feature importance (in percent) against the features, with a horizontal line at 80% cumulative importance; features below this line are the 'vital few' identified by Pareto analysis. For example, the top features identified on dataset 1 are "Static Pressure at HPC outlet", "Ratio of fuel flow to Ps30", "Physical core speed", "Pressure at HPC outlet" and "Temp at LPT outlet". These features explain 80% of the variance in the equipment's state of health and should be monitored closely. The Pareto chart for dataset 1 is shown in Fig. 10.
We explored employing machine learning and analytical techniques on IoT sensor data to predict whether in-service equipment is close to failure. This can be used across industries to enable real-time monitoring of a machine's health and to time maintenance correctly, reducing downtime costs without over-maintenance.
Here are a few additional directions that can be explored:
The authors wish to express their gratitude to Paulami Das, Head of Data Science CoE @ Brillio, and Anish Roychowdhury, Senior Analytics Leader @ Brillio, for their mentoring and guidance in shaping this study.
Gradient boosting is a machine learning technique for regression and classification problems, which produces a prediction model in the form of an ensemble of weak prediction models, typically decision trees. It builds the model in a stage-wise fashion like other boosting methods do, and it generalizes them by allowing optimization of an arbitrary differentiable loss function.
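The stage-wise behaviour can be observed directly via scikit-learn's staged_predict, which evaluates the ensemble after each added tree; the toy data below is purely illustrative.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error

X, y = make_regression(n_samples=500, n_features=10, noise=10.0, random_state=0)
gbr = GradientBoostingRegressor(n_estimators=100, learning_rate=0.1,
                                random_state=0).fit(X, y)

# Each stage adds one shallow tree fitted to the current loss gradient,
# so the training error falls as stages accumulate.
for i, y_pred in enumerate(gbr.staged_predict(X)):
    if (i + 1) % 25 == 0:
        print(f"after {i + 1} trees: train MSE = {mean_squared_error(y, y_pred):.1f}")
```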
Authors: Shashank Gupta, Abinav Sirohi, Vikram Nande, Brillio Technologies, Indian Institute of Technology, Kharagpur