Extreme Gradient Boosting Algorithm to Improve Machine Learning Model Performance on Multiclass Imbalanced Dataset

Yoga Pristyanto - Universitas Amikom Yogyakarta, Yogyakarta, 55281, Indonesia
Zulfikar Mukarabiman - Universitas Amikom Yogyakarta, Yogyakarta, 55281, Indonesia
Anggit Ferdita Nugraha - Universitas Amikom Yogyakarta, Yogyakarta, 55281, Indonesia


DOI: http://dx.doi.org/10.30630/joiv.7.3.1102

Abstract


Class imbalance is a common real-world problem in machine learning: the number of samples in the minority classes is much smaller than in the majority class, so models tend to learn the patterns of the majority class far better than those of the minority classes. This problem is one of the most critical challenges in machine learning research, and several methods have been developed to overcome it. However, most of these methods focus only on binary datasets; few address the multiclass case. Handling multiclass imbalance is more complex than handling binary imbalance because more classes are involved. These difficulties call for an algorithm whose features can be tuned to the characteristics of multiclass imbalanced datasets. One such algorithm is the ensemble algorithm Extreme Gradient Boosting (XGBoost). In our experiments, the proposed XGBoost-based method outperformed other classification and ensemble algorithms on eight datasets across five evaluation metrics: balanced accuracy, geometric mean, multiclass area under the curve, true positive rate, and true negative rate. Given this performance improvement, XGBoost can serve as a solution and reference for handling multiclass imbalanced problems. For future research, we suggest combining data-level methods with XGBoost and testing on datasets containing both categorical and continuous data.
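The evaluation metrics named in the abstract can be made concrete. Below is a minimal, illustrative sketch (not the paper's code) of how balanced accuracy, the geometric mean (G-mean), and per-class true positive / true negative rates are typically computed for a multiclass problem, treating each class one-vs-rest; the toy labels are hypothetical.

```python
# Illustrative computation of multiclass imbalance metrics (one-vs-rest per
# class): balanced accuracy, G-mean, and per-class TPR/TNR. This is a sketch
# based on standard definitions, not the authors' implementation.

def per_class_rates(y_true, y_pred, classes):
    """Return {class: (TPR, TNR)} with each class treated one-vs-rest."""
    rates = {}
    for c in classes:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        tn = sum(1 for t, p in zip(y_true, y_pred) if t != c and p != c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        tpr = tp / (tp + fn) if (tp + fn) else 0.0  # recall for class c
        tnr = tn / (tn + fp) if (tn + fp) else 0.0  # specificity for class c
        rates[c] = (tpr, tnr)
    return rates

def balanced_accuracy(y_true, y_pred):
    """Mean of the per-class recalls (TPRs)."""
    classes = sorted(set(y_true))
    rates = per_class_rates(y_true, y_pred, classes)
    return sum(tpr for tpr, _ in rates.values()) / len(classes)

def g_mean(y_true, y_pred):
    """Geometric mean of the per-class recalls (TPRs)."""
    classes = sorted(set(y_true))
    rates = per_class_rates(y_true, y_pred, classes)
    prod = 1.0
    for tpr, _ in rates.values():
        prod *= tpr
    return prod ** (1.0 / len(classes))

# Toy imbalanced 3-class example: class 2 is the minority (hypothetical data).
y_true = [0] * 6 + [1] * 3 + [2] * 1
y_pred = [0] * 6 + [1, 1, 0] + [2]
print(balanced_accuracy(y_true, y_pred))  # (1 + 2/3 + 1) / 3
print(g_mean(y_true, y_pred))             # (1 * 2/3 * 1) ** (1/3)
```

Unlike plain accuracy, both metrics weight every class equally, which is why they are preferred for imbalanced multiclass evaluation: a classifier that ignores the minority class is penalized even if it scores well on the majority class.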

Keywords


Class Imbalance; Ensemble Algorithm; XGBoost; Classification; Multiclass

