Predictive maintenance techniques are designed to anticipate equipment failures so that corrective maintenance can be scheduled in advance, thereby preventing unexpected equipment downtime, improving service quality for customers, and reducing the additional cost of over-maintenance inherent in preventive maintenance policies. Recent analysis suggests that the market for predictive maintenance applications is poised to grow from $2.2B in 2017 to $10.9B by 2022, a 39% annual growth rate. The major industries where these techniques can be used include oil and gas, mining, manufacturing, and food and beverage. Many types of equipment (e.g., manufacturing equipment, information technology equipment, and medical devices) track run-time status by generating system messages, error events, and log files, which can be used to predict impending failures. In the current study we apply ML and AI techniques to derive insights for machine maintenance and failure prevention, focusing specifically on data from a turbofan engine.
The problem statement attempted in this study may be divided into the following specific objectives: (1) estimating the remaining useful life (RUL) of the equipment, (2) assessing its state of health, i.e., whether it is in the last 'n' cycles of its life, (3) detecting outliers/anomalies in the sensor data, and (4) identifying the most informative features through Pareto analysis.
The data set used here was provided by the Prognostics CoE at NASA Ames and can be found in the Prognostics Data Repository [20].
We followed this thought process in shortlisting the dataset of choice: the dataset should be a time series and should capture machine degradation. Since the Turbofan Engine Degradation Simulation Datasets show both characteristics, we consider this group of datasets for our experiment. We have considered five such similar datasets.
The datasets differ in the conditions under which the engines are run and in their fault modes. They consist of multiple multivariate time series. The data can be considered to come from a fleet of engines of the same type. Each engine starts with a different degree of initial wear and manufacturing variation, which is unknown to the user. This wear and variation is considered normal, i.e., it is not considered a fault condition. There are three operational settings that have a substantial effect on engine performance; these settings are also included in the data. The data is further contaminated with sensor noise.
Each dataset is described below:
This dataset was used for the prognostics challenge competition at the International Conference on Prognostics and Health Management (PHM08) [20]. The engine is operating normally at the start of each time series and starts to degrade at some point during the series. In the training set, the degradation grows in magnitude until a predefined threshold is reached, beyond which it is not preferable to operate the engine. In the test set, the time series ends some time prior to complete degradation.
In this section we briefly describe the turbofan engine and how data is collected from it. The turbofan consists of the components shown in Fig. 1. Each component is equipped with different sensors, whose outputs act as the features in our dataset. Fig. 2 lists all the features. The engine is operating normally at the start of each time series and develops a fault at some point during the series. In the training set, the fault grows in magnitude until system failure. In the test set, the time series ends some time prior to system failure.
We plotted histograms and box plots for all the features to analyze their central tendencies and outliers. Features that are relatively constant over time are dropped. The histograms and box plots for two of the features are shown below.
We can see the distribution of data in 'col7' and 'col8' of Dataset 1. We observed that some of the columns had constant or near-constant values; these could be removed from our analysis, as they bear no relation to the equipment lifecycle.
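As an illustration, near-constant columns can be filtered by their standard deviation. The sketch below is a minimal example, not the study's exact implementation; the tolerance value and DataFrame layout are assumptions.

```python
import pandas as pd

def drop_constant_columns(df: pd.DataFrame, tol: float = 1e-6) -> pd.DataFrame:
    """Drop columns whose standard deviation is (near) zero, i.e. features
    that stay flat over the equipment's life and carry no degradation signal."""
    stds = df.std(numeric_only=True)
    constant_cols = stds[stds <= tol].index.tolist()
    return df.drop(columns=constant_cols)

# df = drop_constant_columns(df)  # e.g. 'col7' and 'col8' vary, so they survive
```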
In this section we discuss the core solution steps, focusing on four primary areas: remaining useful life estimation, assessment of state of health, outlier analysis, and Pareto analysis.
Remaining useful life estimation is central to the prognostics and health management of systems, particularly for safety-critical and very expensive systems. We present a non-linear model to estimate the remaining useful life of a system based on monitored degradation data.
Label Creation
The RUL label is created by reversing the cycle feature, which records the number of cycles the equipment has run before it completely degrades; the label therefore counts down to failure. The problem then reduces to a simple regression and can be solved using different regression techniques. We have used the Gradient Boosting regressor for this problem.
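A minimal sketch of this label construction is shown below. The 'unit' and 'cycle' column names are assumptions for illustration; in the NASA files they identify the engine and its operating cycle.

```python
import pandas as pd

def add_rul_label(df: pd.DataFrame, id_col: str = "unit",
                  cycle_col: str = "cycle") -> pd.DataFrame:
    """RUL at a given cycle = (engine's last observed cycle) - (current cycle),
    i.e. the cycle counter reversed so the label counts down to failure."""
    out = df.copy()
    max_cycle = out.groupby(id_col)[cycle_col].transform("max")
    out["RUL"] = max_cycle - out[cycle_col]
    return out
```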
Preprocessing
As part of preprocessing, the features that are relatively constant over time are dropped. We then split the dataset into train and test sets and train the model with the Gradient Boosting Regressor. A schematic of the entire pipeline is depicted in Fig. 4.
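A hedged sketch of this pipeline is given below, building on the RUL label from the previous step. The hyperparameters and the plain random split are illustrative; the study's exact configuration may differ.

```python
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split

# X: retained sensor features; y: the RUL label constructed earlier.
# Column names follow the assumptions in the label-creation sketch above.
X = df.drop(columns=["unit", "cycle", "RUL"])
y = df["RUL"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

model = GradientBoostingRegressor(n_estimators=200, max_depth=3,
                                  learning_rate=0.1, random_state=42)
model.fit(X_train, y_train)
rul_pred = model.predict(X_test)
```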
Data points towards the end of equipment life (with low RUL values) are more critical because they affect the decision-making process of equipment maintenance. Therefore, we want to make sure we select the model that performs better when RUL values are low. To ensure errors at such points are penalized more, metrics such as the R2 score, Mean Absolute Error (MAE), and Root Mean Square Error (RMSE) were customized when comparing model performance across datasets. The customized metrics are shown below:
Customized Mean Absolute Error (MAE):
Customized Root Mean Square Error (RMSE):
Customized R2 score:
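The exact weighting used in the study is not reproduced above, so the sketch below shows one plausible construction: each error is weighted by a factor that decays exponentially in the true RUL (the decay scale tau is an assumption), so points near end-of-life dominate the metric.

```python
import numpy as np

def _weights(y_true: np.ndarray, tau: float = 30.0) -> np.ndarray:
    # Illustrative weighting: low-RUL points get weights close to 1,
    # high-RUL points are down-weighted.
    return np.exp(-y_true / tau)

def customized_mae(y_true, y_pred, tau=30.0):
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    w = _weights(y_true, tau)
    return np.sum(w * np.abs(y_true - y_pred)) / np.sum(w)

def customized_rmse(y_true, y_pred, tau=30.0):
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    w = _weights(y_true, tau)
    return np.sqrt(np.sum(w * (y_true - y_pred) ** 2) / np.sum(w))

def customized_r2(y_true, y_pred, tau=30.0):
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    w = _weights(y_true, tau)
    y_bar = np.sum(w * y_true) / np.sum(w)
    ss_res = np.sum(w * (y_true - y_pred) ** 2)
    ss_tot = np.sum(w * (y_true - y_bar) ** 2)
    return 1.0 - ss_res / ss_tot
```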
When assessing the health of the equipment, an important task is to know whether the equipment is in the last 'n' cycles of its life. If it is, maintenance must be arranged beforehand and spare parts procured within the stipulated time, in order to reduce the impact of breakdowns on production.
Solution Approach
This is a binary classification problem: the positive class is the equipment being in the last 'n' cycles of its remaining useful life, and the negative class is everything else. Here 'n' is usually determined by industry requirements and the desired prediction lead time, i.e., how many cycles before failure the user wants to initiate maintenance procedures. The problem can be solved using different classification models; we proceeded with the LSTM approach discussed below.
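Given the RUL label constructed earlier, the binary label follows directly; in this sketch n = 30 is purely illustrative.

```python
# Positive class: equipment is within its last n cycles of useful life.
n = 30  # set per industry requirements / desired maintenance lead time
df["label"] = (df["RUL"] <= n).astype(int)
```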
Classification methodology used
Long short-term memory (LSTM) is an artificial recurrent neural network (RNN) architecture used in deep learning. Unlike standard feedforward neural networks, LSTM has feedback connections, so it can process not only single data points but entire sequences of data; hence we use it for our classification problem. Since we have class imbalance, with far more negative than positive points, accuracy is not a reliable metric, so we use precision, recall, and F1 score to compare models. Fig. 5 shows a schematic of the LSTM architecture used for classification.
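A minimal sketch of such a classifier in Keras is shown below. The sequence length, feature count, and layer sizes are assumptions; the study's exact architecture (Fig. 5) may differ.

```python
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense, Dropout

SEQ_LEN, N_FEATURES = 30, 14  # illustrative: 30-cycle windows, 14 retained sensors

model = Sequential([
    LSTM(64, input_shape=(SEQ_LEN, N_FEATURES)),
    Dropout(0.2),
    Dense(1, activation="sigmoid"),  # P(engine is in its last n cycles)
])
model.compile(optimizer="adam", loss="binary_crossentropy",
              metrics=[tf.keras.metrics.Precision(), tf.keras.metrics.Recall()])

# X_seq: array of shape (num_windows, SEQ_LEN, N_FEATURES), built by sliding a
# window over each engine's sensor history; y_seq: one binary label per window.
# model.fit(X_seq, y_seq, epochs=20, batch_size=64, validation_split=0.1)
```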
An outlier (anomaly) is a data point that differs significantly from the other data points. In this case, outliers may occur for three reasons: equipment failure, i.e., an untimed breakdown; a new normal condition, e.g., when a failed unit is replaced by new equipment; or sensor faults.
Solution Approach for Outlier Analysis
Anomalies can be detected using different multivariate anomaly detection techniques such as K-Means clustering, Isolation Forest, and One-Class SVM. In K-Means clustering, clusters are formed using the usual clustering technique and a threshold distance is calculated from the outlier fraction; if a data point's distance from its cluster centroid exceeds this threshold, the point is considered an anomaly. In Isolation Forest, the data is randomly partitioned along different features, and points that require the fewest partitions to be isolated from the rest of the data are classified as outliers. One-Class SVM is similar to the classic SVM, but is trained on a single class, in our case the normal data points; the algorithm learns a boundary around these points and classifies points outside it as outliers.
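The sketch below shows how the three detectors can be applied with scikit-learn; the outlier fraction and cluster count are assumptions for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.ensemble import IsolationForest
from sklearn.svm import OneClassSVM

OUTLIER_FRACTION = 0.01  # illustrative contamination level
# X: scaled sensor feature matrix of shape (n_samples, n_features)

# Isolation Forest: points isolated in few random partitions are outliers.
iso_labels = IsolationForest(contamination=OUTLIER_FRACTION,
                             random_state=42).fit_predict(X)   # -1 = outlier

# One-Class SVM: learns a boundary around the normal points.
svm_labels = OneClassSVM(nu=OUTLIER_FRACTION, kernel="rbf",
                         gamma="scale").fit_predict(X)          # -1 = outlier

# K-Means: flag points whose distance to their centroid exceeds a threshold
# derived from the outlier fraction.
km = KMeans(n_clusters=5, n_init=10, random_state=42).fit(X)
dist = np.linalg.norm(X - km.cluster_centers_[km.labels_], axis=1)
km_outliers = dist > np.quantile(dist, 1 - OUTLIER_FRACTION)
```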
Pareto analysis is a statistical decision-making technique for selecting the limited number of tasks that produce a significant overall effect. It uses the Pareto principle (also known as the 80/20 rule): roughly 20% of causes usually generate 80% of the benefit. In our case, we obtain feature (sensor) importances from the feature importance attribute of sklearn's Gradient Boosting Classifier and select the top features explaining 80% of the variance in the equipment's state of health. One can then focus more closely on these top features to get a preliminary idea of the equipment's state of health.
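A sketch of this selection with scikit-learn is shown below; the fitted classifier and feature names are assumed to come from the state-of-health step.

```python
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier

# clf: GradientBoostingClassifier fitted on the state-of-health labels;
# X_train is assumed to be a DataFrame of the retained sensor features.
clf = GradientBoostingClassifier(random_state=42).fit(X_train, y_train)

importance = pd.Series(clf.feature_importances_, index=X_train.columns)
cumulative = importance.sort_values(ascending=False).cumsum() * 100

# 'Vital few': the top features up to 80% cumulative importance.
vital_few = cumulative[cumulative <= 80].index.tolist()
print("Vital few features:", vital_few)
```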
I. Remaining Useful Life
Fig. 6 depicts the performance metrics obtained on the different datasets. We find that the customized metrics show better model performance, indicating that the model does well close to the end of equipment life, where accuracy is critical.
II. Assessment of State of Health
We considered precision, recall, and the F1 score as metrics to compare the results obtained on different datasets (see Fig. 7). The F1 score is the harmonic mean of precision and recall. It is important not to mislabel the positive class, as that would delay maintenance, so our primary focus was improving recall. The model works better on dataset 1 and dataset 3 than on the other datasets. We observe that performance improves as the sequence length increases up to a certain threshold (30-35) and then becomes insensitive to further increases. There is a tradeoff between precision and recall; we try to improve recall while making sure precision does not drop too low.
III. Outlier Analysis
Anomaly detection helps identify unusual occurrences that might not be evident from manual observation. Different methods identify anomalies in different contexts; we are most concerned with instances corresponding to equipment degradation. Fig. 8 depicts charts for all three methods on dataset 5: RUL values for different equipment are plotted, with red points corresponding to anomalous instances.
K-Means clustering seems to have correctly identified some of the anomalies we are looking for towards the end of the degradation cycle, but also some between the middle and end of the cycle. One-Class SVM identified most of the anomalies in a single degradation cycle. Isolation Forest gives the best result for our case, as it identifies almost all of the anomalies towards the end of the degradation cycle; hence it is used for anomaly detection in the further analysis. We also visualized the detected anomalies in 2D feature space for dataset 5, as shown in Fig. 9.
For the 2D representation we use t-SNE to reduce the features to two components, represented by x and y. In every 2D representation of anomalies in each dataset, distinct clusters of points are observed, with the anomalies lying on the edges of these clusters.
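A minimal sketch of this visualization is given below; the boolean mask is_outlier is assumed to come from the Isolation Forest step.

```python
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

# X: sensor feature matrix; is_outlier: boolean mask of detected anomalies.
xy = TSNE(n_components=2, random_state=42).fit_transform(X)

plt.scatter(xy[~is_outlier, 0], xy[~is_outlier, 1], s=5, label="normal")
plt.scatter(xy[is_outlier, 0], xy[is_outlier, 1], s=15, c="red", label="anomaly")
plt.xlabel("x"); plt.ylabel("y"); plt.legend(); plt.show()
```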
IV. Pareto Analysis
The Pareto chart plots cumulative feature importance (in percent) against the features, with a horizontal line at 80% cumulative importance; features below this line are the 'vital few' identified by Pareto analysis. For example, the top features identified on dataset 1 are "Static Pressure at HPC outlet", "Ratio of fuel flow to Ps30", "Physical core speed", "Pressure at HPC outlet" and "Temp at LPT outlet". These features explain 80% of the variance in the equipment's state of health and should be monitored closely. The Pareto chart for dataset 1 is shown in Fig. 10.
We explored employing machine learning and analytical techniques on IoT sensor data to predict whether in-service equipment is close to failure. This can be used across industries to enable real-time monitoring of a machine's health and to time maintenance correctly, reducing downtime costs without over-maintenance.
Here are a few additional directions that can be explored:
The authors wish to express their gratitude to Paulami Das, Head of Data Science CoE @ Brillio, and Anish Roychowdhury, Senior Analytics Leader @ Brillio, for their mentoring and guidance in shaping this study.
Gradient boosting is a machine learning technique for regression and classification problems, which produces a prediction model in the form of an ensemble of weak prediction models, typically decision trees. It builds the model in a stage-wise fashion like other boosting methods do, and it generalizes them by allowing optimization of an arbitrary differentiable loss function.
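The stage-wise behaviour can be observed directly via scikit-learn's staged_predict, which evaluates the ensemble after each added tree; the toy data below is purely illustrative.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error

X, y = make_regression(n_samples=500, n_features=10, noise=10.0, random_state=0)
gbr = GradientBoostingRegressor(n_estimators=100, learning_rate=0.1,
                                random_state=0).fit(X, y)

# Each stage adds one shallow tree fitted to the current loss gradient,
# so the training error falls as stages accumulate.
for i, y_pred in enumerate(gbr.staged_predict(X)):
    if (i + 1) % 25 == 0:
        print(f"after {i + 1} trees: train MSE = {mean_squared_error(y, y_pred):.1f}")
```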
Authors: Shashank Gupta, Abinav Sirohi, Vikram Nande, Brillio Technologies, Indian Institute of Technology, Kharagpur