Improvement Performance of the Random Forest Method on Unbalanced Diabetes Data Classification Using Smote-Tomek Link

Hairani Hairani; Anthony Anggrawan; Dadang Priyanto

doi:10.30630/joiv.7.1.1069

Hairani Hairani - Universitas Bumigora, Mataram, 83127, Indonesia
Anthony Anggrawan - Universitas Bumigora, Mataram, 83127, Indonesia
Dadang Priyanto - Universitas Bumigora, Mataram, 83127, Indonesia

Abstract

Health data are often imbalanced, which degrades classifier performance: a classifier trained on imbalanced data tends to favor the majority class and neglect the minority class. The Pima Indian Diabetes dataset is one such case. Diabetes is a deadly disease caused by the body's inability to produce enough insulin, and its complications can lead to heart attacks and strokes, so early diagnosis is needed to reduce the risk of more severe complications. In the dataset used here, the diabetes-negative class (500 records) outnumbers the diabetes-positive class (268 records), which can affect classification performance. This study therefore applies SMOTE-Tomek Link together with Random Forest to diabetes classification. The methodology has four stages: collecting the diabetes data from Kaggle (768 records with eight input attributes and one output attribute as the class label), pre-processing to balance the dataset with SMOTE-Tomek Link, classification with Random Forest, and performance evaluation by accuracy, sensitivity, precision, and F1-score. In tests using 10-fold cross-validation, Random Forest with SMOTE-Tomek Link achieved higher accuracy, sensitivity, precision, and F1-score than Random Forest with SMOTE alone: 86.4% accuracy, 88.2% sensitivity, 82.3% precision, and 85.1% F1-score. Thus, SMOTE-Tomek Link can improve the performance of Random Forest on all four measures.
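
The evaluation measures named in the abstract can be illustrated with a minimal sketch. It is not the authors' code: the function names are hypothetical, and it uses the standard textbook definitions of accuracy, sensitivity (recall), precision, and F1-score, which the abstract does not spell out. The only numbers taken from the abstract are the class counts (500 negative, 268 positive) and the reported sensitivity and precision; the sketch checks that the reported F1-score follows from them.

```python
def f1_score(precision, sensitivity):
    """Harmonic mean of precision and sensitivity (recall)."""
    return 2 * precision * sensitivity / (precision + sensitivity)


def metrics_from_confusion(tp, fp, fn, tn):
    """Compute the four measures from binary confusion-matrix counts.

    tp/fp/fn/tn are counts of true positives, false positives,
    false negatives and true negatives (hypothetical helper, not
    taken from the paper).
    """
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    sensitivity = tp / (tp + fn)
    precision = tp / (tp + fp)
    return {
        "accuracy": accuracy,
        "sensitivity": sensitivity,
        "precision": precision,
        "f1": f1_score(precision, sensitivity),
    }


# Class counts stated in the abstract.
negative, positive = 500, 268
# Share of the majority class, i.e. the accuracy of always predicting "negative".
majority_share = negative / (negative + positive)

# Sensitivity and precision reported for Random Forest with SMOTE-Tomek Link.
reported_f1 = f1_score(precision=0.823, sensitivity=0.882)

print(f"Majority-class share: {majority_share:.1%}")   # 65.1%
print(f"F1 from reported values: {reported_f1:.1%}")   # 85.1%
```

The majority-class share shows why accuracy alone is a weak measure on this dataset: a classifier that always predicts "negative" would already reach about 65% accuracy while detecting no positive cases, which is why sensitivity, precision, and F1-score are reported alongside it.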

Keywords

Class Imbalance; SMOTE-Tomek Link; Random Forest Method; Diabetes Disease

