Extreme Gradient Boosting Algorithm to Improve Machine Learning Model Performance on Multiclass Imbalanced Dataset

Yoga Pristyanto - Universitas Amikom Yogyakarta, Yogyakarta, 55281, Indonesia
Zulfikar Mukarabiman - Universitas Amikom Yogyakarta, Yogyakarta, 55281, Indonesia
Anggit Ferdita Nugraha - Universitas Amikom Yogyakarta, Yogyakarta, 55281, Indonesia


DOI: http://dx.doi.org/10.30630/joiv.7.3.1102

Abstract


Class imbalance is a common real-world problem in machine learning: the number of samples in the minority classes is much smaller than in the majority class, so models tend to learn the patterns of the majority class far better than those of the minority classes. This problem is one of the most critical challenges in machine learning research, and several methods have been developed to overcome it. However, most of these methods focus only on binary datasets; few address the multiclass case. Handling multiclass imbalance is more complex than handling binary imbalance because more classes are involved. These difficulties call for an algorithm whose features can be tuned to the characteristics of multiclass imbalanced datasets. One such algorithm is the ensemble algorithm Extreme Gradient Boosting (XGBoost). In our experiments, the proposed XGBoost-based method outperformed other classification and ensemble algorithms on eight datasets across five evaluation metrics: balanced accuracy, geometric mean, multiclass area under the curve, true positive rate, and true negative rate. Given this performance improvement, XGBoost can serve as a solution and reference for handling multiclass imbalanced problems. For future research, we suggest combining data-level methods with XGBoost and testing on datasets containing both categorical and continuous data.
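The evaluation metrics named in the abstract can be made concrete. Below is a minimal, illustrative sketch (not the paper's code) of how balanced accuracy, the geometric mean (G-mean), and per-class true positive / true negative rates are typically computed for a multiclass problem, treating each class one-vs-rest; the toy labels are hypothetical.

```python
# Illustrative computation of multiclass imbalance metrics (one-vs-rest per
# class): balanced accuracy, G-mean, and per-class TPR/TNR. This is a sketch
# based on standard definitions, not the authors' implementation.

def per_class_rates(y_true, y_pred, classes):
    """Return {class: (TPR, TNR)} with each class treated one-vs-rest."""
    rates = {}
    for c in classes:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        tn = sum(1 for t, p in zip(y_true, y_pred) if t != c and p != c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        tpr = tp / (tp + fn) if (tp + fn) else 0.0  # recall for class c
        tnr = tn / (tn + fp) if (tn + fp) else 0.0  # specificity for class c
        rates[c] = (tpr, tnr)
    return rates

def balanced_accuracy(y_true, y_pred):
    """Mean of the per-class recalls (TPRs)."""
    classes = sorted(set(y_true))
    rates = per_class_rates(y_true, y_pred, classes)
    return sum(tpr for tpr, _ in rates.values()) / len(classes)

def g_mean(y_true, y_pred):
    """Geometric mean of the per-class recalls (TPRs)."""
    classes = sorted(set(y_true))
    rates = per_class_rates(y_true, y_pred, classes)
    prod = 1.0
    for tpr, _ in rates.values():
        prod *= tpr
    return prod ** (1.0 / len(classes))

# Toy imbalanced 3-class example: class 2 is the minority (hypothetical data).
y_true = [0] * 6 + [1] * 3 + [2] * 1
y_pred = [0] * 6 + [1, 1, 0] + [2]
print(balanced_accuracy(y_true, y_pred))  # (1 + 2/3 + 1) / 3
print(g_mean(y_true, y_pred))             # (1 * 2/3 * 1) ** (1/3)
```

Unlike plain accuracy, both metrics weight every class equally, which is why they are preferred for imbalanced multiclass evaluation: a classifier that ignores the minority class is penalized even if it scores well on the majority class.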

Keywords


Class Imbalance; Ensemble Algorithm; XGBoost; Classification; Multiclass

