Avoiding Overfitting and Overlapping in Handling Class Imbalance Using a Hybrid Approach with Smoothed Bootstrap Resampling and Feature Selection

Hartono Hartono - Universitas Potensi Utama, Medan, Indonesia
Erianto Ongko - Akademi Teknologi Industri Immanuel, Medan, Indonesia



DOI: http://dx.doi.org/10.30630/joiv.6.2.985

Abstract


A dataset is prone to class imbalance, in which one class (the majority) has far more instances than the others (the minority). Under this condition, a classifier may fail to recognize the minority class even when its overall accuracy is high. Handling class imbalance requires attention to both diversity and classifier performance, and the Hybrid Approach, which combines sampling methods with classifier ensembles, has shown satisfactory results. The Hybrid Approach generally relies on oversampling, which is prone to overfitting: accuracy is high on the training data but can drop noticeably on the testing data. Therefore, this study uses Smoothed Bootstrap Resampling as the oversampling method in the Hybrid Approach to prevent overfitting. Class imbalance, however, is not the only cause of declining classifier performance; class overlapping must also be considered. Feature Selection can address overlapping by minimizing the overlap degree. This research combines Feature Selection with Hybrid Approach Redefinition, which modifies the use of Smoothed Bootstrap Resampling, to handle class imbalance in medical datasets. In the proposed method, the preprocessing stage applies Smoothed Bootstrap Resampling and Feature Selection, with Feature Assessment by Sliding Thresholds (FAST) as the Feature Selection method, while the processing stage applies Random Under Sampling and SMOTE. Overlapping is measured with the Augmented R-Value, and classifier performance is measured with the Balanced Error Rate, Precision, Recall, and F-Value. The Balanced Error Rate expresses the combined error of the majority and minority classes under 10-Fold Cross-Validation, in which every subset serves in turn as training data. The results show that the proposed method performs better than the comparison methods.
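
Below is a minimal Python sketch of the pipeline outlined in the abstract, not the authors' implementation: the Gaussian smoothing kernel, the per-feature ROC-AUC filter standing in for FAST, the RandomForest base classifier, and all parameter values are illustrative assumptions.

import numpy as np
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import balanced_accuracy_score, f1_score, roc_auc_score
from sklearn.model_selection import StratifiedKFold


def smoothed_bootstrap(X_min, n_new, h=0.1, seed=None):
    # Smoothed bootstrap: draw bootstrap copies of minority samples and add
    # Gaussian noise so that no synthetic point is an exact duplicate.
    rng = np.random.default_rng(seed)
    idx = rng.integers(0, len(X_min), size=n_new)
    noise = rng.normal(0.0, h * X_min.std(axis=0), size=(n_new, X_min.shape[1]))
    return X_min[idx] + noise


def auc_feature_filter(X, y, k):
    # Crude stand-in for FAST: rank features by per-feature ROC AUC and keep
    # the k most discriminative ones (binary 0/1 labels assumed).
    aucs = np.array([roc_auc_score(y, X[:, j]) for j in range(X.shape[1])])
    scores = np.maximum(aucs, 1.0 - aucs)  # direction-independent ranking
    return np.argsort(scores)[::-1][:k]


def evaluate(X, y, k_features=10, seed=42):
    # 10-fold validation: preprocessing (smoothed bootstrap + feature selection)
    # and processing (SMOTE + random undersampling) are fit on training folds only.
    skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=seed)
    bers, f1s = [], []
    for train, test in skf.split(X, y):
        X_tr, y_tr, X_te, y_te = X[train], y[train], X[test], y[test]

        # Preprocessing: at most double the minority class with smoothed copies.
        counts = np.bincount(y_tr)  # integer 0/1 class labels assumed
        mino, maj = counts.argmin(), counts.argmax()
        X_min = X_tr[y_tr == mino]
        n_new = min(len(X_min), counts[maj] - len(X_min))
        X_tr = np.vstack([X_tr, smoothed_bootstrap(X_min, n_new, seed=seed)])
        y_tr = np.concatenate([y_tr, np.full(n_new, mino)])
        cols = auc_feature_filter(X_tr, y_tr, k_features)

        # Processing: SMOTE lifts the minority toward 80% of the majority size,
        # then random undersampling shrinks the majority to close the gap.
        counts = np.bincount(y_tr)
        target = max(counts[mino], int(0.8 * counts[maj]))
        X_res, y_res = SMOTE(sampling_strategy={mino: target},
                             random_state=seed).fit_resample(X_tr[:, cols], y_tr)
        X_res, y_res = RandomUnderSampler(random_state=seed).fit_resample(X_res, y_res)

        clf = RandomForestClassifier(random_state=seed).fit(X_res, y_res)
        y_pred = clf.predict(X_te[:, cols])
        bers.append(1.0 - balanced_accuracy_score(y_te, y_pred))  # BER = 1 - balanced accuracy
        f1s.append(f1_score(y_te, y_pred, pos_label=mino))
    return np.mean(bers), np.mean(f1s)

With X as a float feature matrix and y as integer 0/1 labels, evaluate(X, y) returns the mean Balanced Error Rate and minority-class F-Value over the ten folds, mirroring the evaluation protocol stated in the abstract.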

Keywords


Class Imbalance; Overfitting; Hybrid Approach Redefinition; Overlapping; Feature Selection
