Improvement Performance of the Random Forest Method on Unbalanced Diabetes Data Classification Using Smote-Tomek Link

Hairani Hairani - Universitas Bumigora, Mataram, 83127, Indonesia
Anthony Anggrawan - Universitas Bumigora, Mataram, 83127, Indonesia
Dadang Priyanto - Universitas Bumigora, Mataram, 83127, Indonesia

Health data are often class-imbalanced, which degrades the performance of classification methods: a classifier trained on imbalanced data tends to favor the majority class and neglect the minority class. One such dataset is the Pima Indian Diabetes dataset. Diabetes is a deadly disease caused by the body's inability to produce enough insulin, and its complications can cause heart attacks and strokes. Early diagnosis is therefore needed to reduce the risk of more severe complications. In the diabetes dataset used here, the negative class (500 records) outnumbers the positive class (268 records), which can affect classifier performance. This study therefore applies SMOTE-Tomek Link together with the Random Forest method to diabetes classification. The methodology consists of four stages: collecting the diabetes data from Kaggle (768 records with eight input attributes and one output attribute as the class label); pre-processing the data with SMOTE-Tomek Link to balance the classes; classifying with Random Forest; and evaluating performance by accuracy, sensitivity, precision, and F1-score. In tests using 10-fold cross-validation, Random Forest with SMOTE-Tomek Link achieved higher accuracy, sensitivity, precision, and F1-score than Random Forest with SMOTE alone: 86.4% accuracy, 88.2% sensitivity, 82.3% precision, and 85.1% F1-score. Thus, SMOTE-Tomek Link can improve the performance of the Random Forest method on all four metrics.
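
To illustrate the resampling step described in the abstract, the following is a minimal pure-Python sketch of SMOTE oversampling followed by Tomek-link removal on hypothetical toy data. It is not the authors' implementation: the function names, the toy data, k = 5, the random seeds, and the choice to remove both members of each Tomek link are all assumptions made for illustration. The Random Forest classification and 10-fold cross-validation stages are omitted.

```python
import math
import random


def smote(minority, n_new, k, rng):
    """Create n_new synthetic points by interpolating between a minority
    point and one of its k nearest minority-class neighbours."""
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        # Candidate neighbours come from the minority class only.
        others = [p for j, p in enumerate(minority) if j != i]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        partner = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> partner
        synthetic.append(
            tuple(b + gap * (p - b) for b, p in zip(base, partner))
        )
    return synthetic


def tomek_links(X, y):
    """Return index pairs (i, j) that are mutual nearest neighbours
    and carry different class labels."""
    def nearest(i):
        return min((j for j in range(len(X)) if j != i),
                   key=lambda j: math.dist(X[i], X[j]))

    nn = [nearest(i) for i in range(len(X))]
    return [(i, j) for i, j in enumerate(nn)
            if i < j and nn[j] == i and y[i] != y[j]]


def smote_tomek(X, y, minority_label=1, k=5, seed=0):
    """Oversample the minority class up to the majority count, then drop
    both members of every Tomek link (an assumed variant)."""
    rng = random.Random(seed)
    minority = [x for x, label in zip(X, y) if label == minority_label]
    n_new = (len(X) - len(minority)) - len(minority)
    synthetic = smote(minority, n_new, k, rng)
    X_bal = list(X) + synthetic
    y_bal = list(y) + [minority_label] * len(synthetic)
    drop = {idx for pair in tomek_links(X_bal, y_bal) for idx in pair}
    keep = [i for i in range(len(X_bal)) if i not in drop]
    return [X_bal[i] for i in keep], [y_bal[i] for i in keep]


# Hypothetical toy data: 25 negative and 13 positive points in 2-D.
toy_rng = random.Random(42)
X_neg = [(toy_rng.gauss(0.0, 1.0), toy_rng.gauss(0.0, 1.0)) for _ in range(25)]
X_pos = [(toy_rng.gauss(1.5, 1.0), toy_rng.gauss(1.5, 1.0)) for _ in range(13)]
X = X_neg + X_pos
y = [0] * 25 + [1] * 13

X_res, y_res = smote_tomek(X, y)
print("before:", y.count(0), y.count(1))
print("after: ", y_res.count(0), y_res.count(1))
```

Because each Tomek link pairs one point from each class and this sketch removes both, the two classes stay equal in size after cleaning. A library implementation may instead remove only the majority-class member, which would leave slightly unequal counts.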


Class Imbalance; SMOTE-Tomek Link; Random Forest Method; Diabetes Disease
