Classification of Imbalance Data using Tomek Link (T-Link) Combined with Random Under-sampling (RUS) as a Data Reduction Method

Elhassan AT1, Aljourf M1, Al-Mohanna F2,3, Shoukri M2,3*

Abstract

The problem of classifying subjects into disease categories is common in medical research. Machine learning tools such as Artificial Neural Networks (ANN), Support Vector Machines (SVM), Logistic Regression (LR), and Fisher’s Linear Discriminant Analysis (LDA) are widely used for prediction and classification. The main objective of these competing classification strategies is to predict a dichotomous outcome (e.g. disease/healthy) based on several features. Like any of the well-known statistical inferential models, machine learning tools face a problem known as “class imbalance”. A data set is imbalanced if the classification categories are not approximately equally represented. When learning from highly imbalanced data, most classifiers are affected by the majority class, which leads to an increase in the false negative rate. Interest in applying machine learning techniques to "real-world" problems, whose data are characterized by severe imbalance, has grown, as can be seen in numerous publications in medicine and biology. Predictive accuracy, a popular choice for evaluating the performance of a classifier, might not be appropriate when the data are imbalanced and/or when the costs of different errors vary markedly. In this paper, we use the T-Link algorithm in the preprocessing phase as a data cleaning method to remove noise. We combine T-Link with other sampling methods such as RUS, random over-sampling (ROS), and the Synthetic Minority Over-sampling Technique (SMOTE) in order to maintain a balanced class distribution. Classification was then performed using several ML algorithms such as ANN, Random Forest (RF), and LR. Classifier performance was evaluated using several performance measures deemed more appropriate for classifying data with severe imbalance. These methods are applied to arterial blood pressure data and the Ecoli2 data set.
Using T-Link in combination with RUS and SMOTE demonstrated superior performance compared with other resampling techniques across different classification algorithms such as SVM, ANN, RF and LR.
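
The T-Link cleaning step followed by RUS can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the toy points, and the choice to drop only the majority-class member of each Tomek link (rather than both members) are assumptions made for illustration. A Tomek link is taken here to be a pair of points from opposite classes that are each other's nearest neighbor.

```python
import math
import random


def nearest_neighbor(i, X):
    """Index of the point closest to X[i] (Euclidean), excluding i itself."""
    return min((j for j in range(len(X)) if j != i),
               key=lambda j: math.dist(X[i], X[j]))


def remove_tomek_links(X, y, majority_label):
    """Drop the majority-class member of every Tomek link (assumed variant)."""
    nn = [nearest_neighbor(i, X) for i in range(len(X))]
    drop = set()
    for i, j in enumerate(nn):
        # Mutual nearest neighbors with different labels form a Tomek link.
        if nn[j] == i and y[i] != y[j]:
            drop.add(i if y[i] == majority_label else j)
    keep = [i for i in range(len(X)) if i not in drop]
    return [X[i] for i in keep], [y[i] for i in keep]


def random_under_sample(X, y, majority_label, seed=0):
    """Randomly discard majority points until the classes are the same size."""
    rng = random.Random(seed)
    maj = [i for i, lab in enumerate(y) if lab == majority_label]
    mino = [i for i, lab in enumerate(y) if lab != majority_label]
    kept_maj = rng.sample(maj, min(len(maj), len(mino)))
    keep = sorted(kept_maj + mino)
    return [X[i] for i in keep], [y[i] for i in keep]


# Hypothetical toy data: label 0 is the majority class, label 1 the minority.
X = [(0.0, 0.0), (0.2, 0.1), (0.1, 0.3), (0.4, 0.2), (0.3, 0.5),
     (1.0, 1.0), (1.05, 1.0), (2.0, 2.0), (2.1, 2.1)]
y = [0, 0, 0, 0, 0, 0, 1, 1, 1]

# Step 1: T-Link cleaning removes the majority point at (1.0, 1.0),
# which sits next to the minority point at (1.05, 1.0).
X_clean, y_clean = remove_tomek_links(X, y, majority_label=0)

# Step 2: RUS balances the remaining classes.
X_bal, y_bal = random_under_sample(X_clean, y_clean, majority_label=0)
```

In this sketch the cleaning step runs first so that borderline majority points are removed deliberately, and the random step only handles the remaining size difference between the classes.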

Global Journal of Technology and Optimization