k-Means Cluster-based Random Undersampling and Meta-Learning Approach for Village Development Status Classification

Ahmad Ilham; Luqman Assaffat; Laelatul Khikmah; Safuan Safuan; Suprapedi Suprapedi

doi:10.30630/joiv.7.2.989

Ahmad Ilham - Universitas Muhammadiyah Semarang, Semarang, Indonesia
Luqman Assaffat - Universitas Muhammadiyah Semarang, Semarang, Indonesia
Laelatul Khikmah - Institut Teknologi Statistika dan Bisnis Muhammadiyah Semarang, Semarang, Indonesia
Safuan Safuan - Universitas Muhammadiyah Semarang, Semarang, Indonesia
Suprapedi Suprapedi - National Research and Innovation Agency, South Jakarta, Indonesia

DOI: http://dx.doi.org/10.30630/joiv.7.2.989

Abstract

The village development index (Indeks Desa Membangun, IDM) dataset is significantly class-imbalanced: the self-supporting class outnumbers the disadvantaged class. Traditional classifiers can achieve high accuracy (ACC) by fitting the majority class while neglecting the minority class, which can bias the classification results. This study combines random undersampling based on k-means clustering (KMC) with a meta-learning approach to improve the ACC of the village status classification model. AdaBoost was used as the meta technique and Random Forest as the base learner. The proposed model was evaluated using the area under the curve (AUC), and the experimental results showed that it yielded excellent performance compared to prior studies, with AUC, ACC, precision (PR), recall (RC), and g-mean (Gm) values of 95.50%, 95.52%, 95.5%, 95.5%, and 92.95%, respectively. The result of a t-test likewise showed that the proposed model yielded excellent performance compared to previous studies. It can be concluded that the AdaBoost algorithm reduced misclassification and changed the distribution of the data loss function in the random forest. This indicates that the proposed model deals effectively with imbalanced classes in village development status classification.
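As a rough illustration of the undersampling idea named in the abstract, the sketch below clusters the majority class with a plain k-means and then randomly samples from each cluster in proportion to its size, so that the reduced majority class still covers every region of the original. It is a minimal sketch under stated assumptions, not the authors' implementation: the function names (`kmeans`, `kmc_undersample`), the proportional allocation rule, the choice of k = 3, and the synthetic 2-D data are all hypothetical, and the AdaBoost and Random Forest stage is left out.

```python
import math
import random
from collections import defaultdict


def kmeans(points, k, rng, iters=50):
    """Plain Lloyd's k-means; returns one cluster index per point."""
    centroids = rng.sample(points, k)  # initialise from k distinct points
    labels = None
    for _ in range(iters):
        # Assign every point to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:  # assignments stable: converged
            break
        labels = new_labels
        # Move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return labels


def kmc_undersample(majority, n_keep, k=3, seed=0):
    """Keep exactly n_keep majority points, sampled per cluster.

    Assumption for this sketch: each cluster contributes in proportion
    to its size (largest-remainder rounding). Requires n_keep <= len(majority).
    """
    rng = random.Random(seed)
    labels = kmeans(majority, k, rng)
    clusters = defaultdict(list)
    for p, lab in zip(majority, labels):
        clusters[lab].append(p)

    total = len(majority)
    # Integer quotas first, then hand out the leftover slots to the
    # clusters with the largest remainders so the quotas sum to n_keep.
    quota = {c: n_keep * len(m) // total for c, m in clusters.items()}
    remainder = {c: n_keep * len(m) % total for c, m in clusters.items()}
    leftover = n_keep - sum(quota.values())
    for c in sorted(remainder, key=remainder.get, reverse=True)[:leftover]:
        quota[c] += 1

    kept = []
    for c, members in clusters.items():
        kept.extend(rng.sample(members, quota[c]))  # random, without replacement
    return kept


# Synthetic stand-in data (hypothetical): 120 majority vs. 20 minority points.
data_rng = random.Random(42)
majority = [
    (data_rng.gauss(cx, 0.5), data_rng.gauss(cy, 0.5))
    for cx, cy in [(0, 0), (4, 4), (0, 4)]
    for _ in range(40)
]
minority = [(data_rng.gauss(2, 0.5), data_rng.gauss(2, 0.5)) for _ in range(20)]

# Shrink the majority class to the minority class size.
kept = kmc_undersample(majority, n_keep=len(minority), k=3, seed=1)
print(len(majority), "->", len(kept))
```

The balanced training set would then be `kept + minority`, which could be passed to any classifier. Sampling per cluster rather than from the whole majority class is the design choice that distinguishes this from plain random undersampling: it lowers the chance that a sparse region of the majority class disappears entirely.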

Keywords

village development index; village development status classification; imbalanced class; meta-learning; random forest
