k-Means Cluster-based Random Undersampling and Meta-Learning Approach for Village Development Status Classification

Ahmad Ilham - Universitas Muhammadiyah Semarang, Semarang, Indonesia
Luqman Assaffat - Universitas Muhammadiyah Semarang, Semarang, Indonesia
Laelatul Khikmah - Institut Teknologi Statistika dan Bisnis Muhammadiyah Semarang, Semarang, Indonesia
Safuan Safuan - Universitas Muhammadiyah Semarang, Semarang, Indonesia
Suprapedi Suprapedi - National Research and Innovation Agency, South Jakarta, Indonesia


DOI: http://dx.doi.org/10.30630/joiv.7.2.989

Abstract


The village development index (IDM, Indeks Desa Membangun) dataset is significantly class-imbalanced: the self-supporting class contains considerably more cases than the disadvantaged class. Traditional classifiers can achieve high accuracy (ACC) by fitting the majority class while neglecting the minority class, which can bias the classification results. This study combines random undersampling based on k-means clustering (KMC) with a meta-learning approach to improve the ACC of the village status classification model. AdaBoost was used as the meta technique and Random Forest as the base learner. The proposed model was evaluated using the area under the curve (AUC), and the experimental results showed that it performed well compared with prior studies, with AUC, ACC, precision (PR), recall (RC), and g-mean (Gm) values of 95.50%, 95.52%, 95.5%, 95.5%, and 92.95%, respectively. The t-test results likewise showed that the proposed model performed well compared with previous studies. It can be concluded that the AdaBoost algorithm reduced misclassification and changed the distribution of the data loss function in Random Forest. This indicates that the proposed model deals effectively with imbalanced classes in the village development status classification model.
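The abstract does not spell out the sampling procedure, so the following is only a minimal sketch of one common way to do k-means cluster-based random undersampling, together with the g-mean metric. It assumes that the majority class is clustered with plain Lloyd's k-means and that rows are then drawn at random from each cluster in proportion to its size. The function names (`kmeans`, `kmc_undersample`, `g_mean`), the parameter defaults, and the toy labels are hypothetical and not taken from the paper. The AdaBoost and Random Forest stage is not shown.

```python
import math
import random


def kmeans(points, k, iters=20, seed=0):
    """Plain Lloyd's k-means; returns a cluster index for every point."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # assumption: random initial centers
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: nearest center by squared Euclidean distance.
        labels = [
            min(range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])))
            for p in points
        ]
        # Update step: move each center to the mean of its members.
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:  # keep the old center if a cluster is empty
                centers[c] = tuple(sum(col) / len(members)
                                   for col in zip(*members))
    return labels


def kmc_undersample(majority, n_keep, k=3, seed=0):
    """Cluster the majority class, then sample each cluster at random,
    in proportion to its size, so that roughly n_keep rows remain."""
    rng = random.Random(seed)
    labels = kmeans(majority, k, seed=seed)
    kept = []
    for c in range(k):
        members = [p for p, l in zip(majority, labels) if l == c]
        quota = round(n_keep * len(members) / len(majority))
        kept.extend(rng.sample(members, min(quota, len(members))))
    return kept


def g_mean(y_true, y_pred, positive=1):
    """Geometric mean of sensitivity and specificity (binary labels)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    return math.sqrt((tp / (tp + fn)) * (tn / (tn + fp)))


# A classifier that always predicts the majority class (0) is 75% accurate
# on these toy labels, yet its g-mean is 0 because it misses every positive.
always_majority = g_mean([1, 0, 0, 0], [0, 0, 0, 0])
```

Sampling within clusters, rather than from the majority class as a whole, is meant to keep the retained rows spread across the different regions of the majority class. The g-mean example illustrates why ACC alone can hide a model that ignores the minority class.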

Keywords


village development index; village development status classification; imbalanced class; meta-learning; random forest

