Combining Hybrid Approach Redefinition-Multiclass Imbalance (HAR-MI) and Hybrid Sampling in Handling Multi-Class Imbalance and Overlapping

Hartono Hartono - Department of Computer Science, Universitas IBBI, Medan, Indonesia
Erianto Ongko - Department of Informatics, Akademi Teknologi Industri Immanuel, Medan, Indonesia


DOI: http://dx.doi.org/10.30630/joiv.5.1.420

Abstract


The class imbalance problem in multi-class datasets is more challenging to handle than in two-class datasets, and it becomes even more complicated when accompanied by class overlapping. One method that has proven reliable in dealing with this problem is the Hybrid Approach Redefinition-Multiclass Imbalance (HAR-MI) method, a hybrid approach that combines sampling and classifier ensembles; such a combination also gives better results in terms of diversity among classifiers. HAR-MI provides excellent results in handling multi-class imbalance and uses SMOTE to increase the number of samples in the minority class. However, SMOTE has a weakness: on extremely imbalanced datasets with a large number of attributes, it tends to cause over-fitting. To overcome this over-fitting problem, the Hybrid Sampling method is proposed. Combining HAR-MI with Hybrid Sampling increases the number of samples in the minority class and, at the same time, reduces the number of noisy samples in the majority class. The preprocessing stage of HAR-MI uses the Minimizing Overlapping Selection under Hybrid Sampling (MOSHS) method, and the processing stage uses Different Contribution Sampling. The results obtained are compared with those of Neighbourhood-based under-sampling. Overlapping and classifier performance are measured using the Augmented R-Value, the Matthews Correlation Coefficient (MCC), Precision, Recall, and F-Value. The results show that HAR-MI with Hybrid Sampling gives better results in terms of Augmented R-Value, Precision, Recall, and F-Value.
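
For illustration, the sketch below shows the general hybrid-sampling idea described above (oversampling the minority classes with SMOTE and then removing noisy samples with a neighbourhood-based cleaning step) together with the classifier performance measures MCC, Precision, Recall, and F-Value, using the imbalanced-learn and scikit-learn libraries. This is a minimal sketch under assumed substitutes: the synthetic dataset, the Edited Nearest Neighbours cleaning step, and the random-forest ensemble are illustrative stand-ins, not the paper's MOSHS or Different Contribution Sampling implementation, and the Augmented R-Value is not computed here.

# A minimal sketch, assuming scikit-learn and imbalanced-learn are available;
# it is not the authors' HAR-MI / MOSHS / Different Contribution Sampling code.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import matthews_corrcoef, precision_score, recall_score, f1_score
from sklearn.model_selection import train_test_split
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import EditedNearestNeighbours

# Synthetic multi-class imbalanced data with some class overlap.
X, y = make_classification(n_samples=2000, n_features=10, n_informative=6,
                           n_classes=3, weights=[0.80, 0.15, 0.05],
                           class_sep=0.8, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=42)

# Hybrid sampling: SMOTE enlarges the minority classes, then Edited Nearest
# Neighbours discards samples whose neighbourhood disagrees with their label
# (a stand-in for the noise removal applied to the majority classes).
X_res, y_res = SMOTE(random_state=42).fit_resample(X_train, y_train)
X_res, y_res = EditedNearestNeighbours().fit_resample(X_res, y_res)

# Any classifier ensemble can stand in for the HAR-MI processing stage here.
clf = RandomForestClassifier(random_state=42).fit(X_res, y_res)
y_pred = clf.predict(X_test)

# Classifier performance measures mentioned in the abstract
# (macro-averaged for the multi-class case).
print("MCC      :", matthews_corrcoef(y_test, y_pred))
print("Precision:", precision_score(y_test, y_pred, average="macro"))
print("Recall   :", recall_score(y_test, y_pred, average="macro"))
print("F-Value  :", f1_score(y_test, y_pred, average="macro"))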

Keywords


Class imbalance; multi-class dataset; multi-class imbalance; hybrid approach; HAR-MI.

