Avoiding Overfitting and Overlapping in Handling Class Imbalanced Using Hybrid Approach with Smoothed Bootstrap Resampling and Feature Selection

Abstract: Datasets are often imbalanced: one class (the majority) contains far more instances than the other classes (the minority). Under this condition, a classifier can fail to recognize the minority class even though the accuracy obtained is high. In handling class imbalance, both diversity and classifier performance must be considered. Hence, the Hybrid Approach, which combines sampling methods with classifier ensembles, presents satisfactory results. The Hybrid Approach generally uses oversampling, which is prone to overfitting. Overfitting is indicated by high accuracy on the training data that is not reproduced on the testing data. Therefore, in this study, Smoothed Bootstrap Resampling is the oversampling method used in the Hybrid Approach because it can prevent overfitting. However, class imbalance is not the only factor that degrades classifier performance; overlapping also needs to be considered. Overlapping can be addressed with Feature Selection, which reduces the overlap degree. This research combined Feature Selection with a Hybrid Approach Redefinition, which modifies the use of Smoothed Bootstrap Resampling, to handle class imbalance in medical datasets. The preprocessing stage of the proposed method was carried out using Smoothed Bootstrap Resampling and Feature Selection. The Feature Selection method used is Feature Assessment by Sliding Thresholds (FAST). The processing stage was carried out using Random Under Sampling and SMOTE. Overlapping was measured using the Augmented R-Value, and classifier performance was measured using the Balanced Error Rate, Precision, Recall, and F-Value.
The Balanced Error Rate states the combined error of the majority and minority classes in the 10-Fold Validation test, in which each subset serves once as testing data. The results showed that the proposed method performs better than the comparison method.


I. INTRODUCTION
Dataset imbalance is a common problem for classification algorithms because real-world datasets are rarely perfectly balanced [1]. Classification algorithms give optimum results when the samples are distributed evenly across the classes, and they require special handling of sample imbalance to achieve optimum performance [2]. Classes with fewer instances (minority classes) are often ignored by the classification algorithm or misclassified into another class, even though the minority class is a class of high value because it is the center of observation [3]. Class imbalance is often unavoidable; for example, medical datasets are obtained from patient medical data, where the number of patients suffering from a disease is much smaller than the number of patients without it [4].
There are two approaches for dealing with class imbalance problems: data-level techniques and algorithm-level methods [5]. Data-level techniques use sampling to reduce imbalance, either by increasing the number of samples in the minority class (oversampling) or by reducing the number of samples in the majority class (undersampling) [6]. Data-level techniques are criticized mainly for overfitting when oversampling is applied and for omitting important data from a class when undersampling is applied [7]. Algorithm-level methods work by generating many classifiers through modifications to the classification algorithm. Algorithm-level accuracy tends to decrease on high-dimensional datasets [8]. Many researchers have proposed a Hybrid Approach that combines the advantages of data-level and algorithm-level methods in handling class imbalance [9], [10]. The Hybrid Approach lets the two levels compensate for each other's weaknesses and thereby provides better performance [11]. Akbani et al. [12] show that combining data-level and algorithm-level methods, with SVM and SMOTE, gives better results than using only data-level methods such as RUS and SMOTE or only algorithm-level methods such as SVM.
The Hybrid Approach tends to use oversampling rather than undersampling because many studies have found that oversampling gives better results than undersampling on severely imbalanced datasets, although the differences are not significant [13], [14]. However, overfitting in oversampling deserves attention because it can produce good accuracy on the training data that does not carry over to the testing data [15]. Therefore, a number of oversampling methods designed to handle overfitting have been proposed, one of which is Smoothed Bootstrap Resampling. The Smoothed Bootstrap Resampling method has shown good performance on both training data and testing data [16].
Class imbalance is not the only issue that needs attention to obtain good classification results. The problem of overlapping often goes unnoticed, even though overlap can also affect the prediction results [17]. One way to handle overlapping is to minimize the overlap degree by using Feature Selection [18]. One method that combines feature selection with oversampling is the Wrapper Approach-SMOTE [19]. Besides being effective in dealing with overlapping, the combination of Feature Selection and oversampling has also been shown to provide accurate results and fast detection on class imbalance problems [20]. The advantage of feature selection with the wrapper approach is that it can find the appropriate classifier region for the sampling process, which makes sampling more effective [21]. Research conducted by Ghazikhani et al. [22] shows that the Wrapper Approach is the most suitable feature selection method to combine with SMOTE in dealing with overlapping and class imbalance.
Given the importance of dealing with overfitting and overlapping in handling class imbalance, this research combined Feature Selection with a Hybrid Approach Redefinition, which modifies the use of Smoothed Bootstrap Resampling in handling class imbalance. The results of this study were compared with those of the Wrapper Approach-SMOTE.

A. Hybrid Approach
The pseudocode of the Hybrid Approach is as follows [23].
Based on the pseudocode above, it can be seen that in the Hybrid Approach, data-level and algorithm-level are used, which are applied to the preprocessing and processing stages. The preprocessing stage is carried out to ensure that the dataset or samples are ready to undergo the processing stage.

B. Smoothed Bootstrap Resampling (SBR)
The pseudocode of the SBR is as follows [16].
Based on the pseudocode above, several parameters need to be considered, namely: $\hat{\sigma}_q^{(j)}$ is a sample estimate of the standard deviation of the $q$-th dimension of the samples belonging to class $j$; $h_q^{(j)}$ is the corresponding entry of the smoothing matrix; $\mu$ is the mean, $\sigma$ is the standard deviation, and $\sigma^2$ is the variance.
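The resampling step described by these parameters can be illustrated with a minimal sketch. It is not the pseudocode of [16]: the function name `smoothed_bootstrap` is hypothetical, the smoothing matrix is assumed to be diagonal, and the bandwidth rule in the comment is the one commonly used in smoothed-bootstrap oversampling rather than a detail confirmed by this section.

```python
import random
import statistics


def smoothed_bootstrap(samples, n_new, rng):
    """Draw n_new synthetic points around the rows of `samples` (one class).

    `samples` is a list of equal-length lists of floats with at least two rows.
    """
    n, d = len(samples), len(samples[0])
    # Sample standard deviation of each dimension q within this class.
    sigma = [statistics.stdev([row[q] for row in samples]) for q in range(d)]
    # Assumed bandwidth rule (not taken from the source):
    # h_q = (4 / ((d + 2) * n)) ** (1 / (d + 4)) * sigma_q
    scale = (4.0 / ((d + 2) * n)) ** (1.0 / (d + 4))
    h = [scale * s for s in sigma]
    synthetic = []
    for _ in range(n_new):
        seed = rng.choice(samples)  # bootstrap step: pick an observed sample
        # Smoothing step: perturb the seed with zero-mean Gaussian noise.
        synthetic.append([rng.gauss(seed[q], h[q]) for q in range(d)])
    return synthetic
```

Because every synthetic point is drawn from a neighborhood of an observed sample instead of being an exact copy, the classifier does not see duplicated minority instances, which is the property the text relies on to limit overfitting.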

C. Feature Selection
The Feature Selection method used in this study is Feature Assessment by Sliding Thresholds (FAST) [24]. The pseudocode of FAST is as follows. In the pseudocode above, Feature Selection with FAST starts by determining the number of attributes or features in the dataset. The loop is executed once per feature, and each iteration uses one feature to determine the values of TPR, FPR, and Area Under ROC.
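The per-feature loop can be sketched as follows. This is a simplified illustration, not the pseudocode of [24]: the function name `fast_scores`, the equal-frequency binning with `n_bins`, and the use of `max(auc, 1 - auc)` to ignore the direction of a feature are assumptions made for the example.

```python
def fast_scores(X, y, n_bins=10):
    """Score each feature by the area under a ROC curve built from sliding thresholds.

    X: list of rows (lists of floats); y: list of 0/1 labels (1 = minority).
    Both classes must be present in y.
    """
    n_pos = sum(y)
    n_neg = len(y) - n_pos
    scores = []
    for q in range(len(X[0])):  # one iteration per feature
        values = sorted(row[q] for row in X)
        # Equal-frequency bins; the mean of each bin serves as one threshold.
        size = max(1, len(values) // n_bins)
        bins = [values[i:i + size] for i in range(0, len(values), size)]
        thresholds = [sum(b) / len(b) for b in bins]
        points = {(0.0, 0.0), (1.0, 1.0)}  # end points of the ROC curve
        for t in thresholds:
            tp = sum(1 for row, label in zip(X, y) if label == 1 and row[q] > t)
            fp = sum(1 for row, label in zip(X, y) if label == 0 and row[q] > t)
            points.add((fp / n_neg, tp / n_pos))  # (fpr, tpr) at this threshold
        roc = sorted(points)
        # Trapezoidal area under the (fpr, tpr) points.
        auc = sum((x2 - x1) * (y1 + y2) / 2
                  for (x1, y1), (x2, y2) in zip(roc, roc[1:]))
        scores.append(max(auc, 1.0 - auc))  # direction-agnostic score (assumption)
    return scores
```

Features would then be ranked by score and the highest-ranked ones kept; how many are kept is a choice this section does not specify.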

D. Augmented R-Value
Augmented R-Value states how much overlapping occurs. The greater the Augmented R-Value, the greater the overlapping [25].
Here, $C_0, C_1, \dots, C_{k-1}$ are the $k$ class labels with $|C_0| \geq |C_1| \geq \dots \geq |C_{k-1}|$, and $D[V]$ is the dataset $D$ containing the predictors in set $V$. A larger $R_{aug}$ indicates a higher overlap degree of the dataset.
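To make the measure concrete, the sketch below computes an augmented R-value under a commonly used definition, which is an assumption here because the equation itself is not reproduced in this section: a sample counts as overlapped when more than `theta` of its `n_neighbors` nearest neighbors belong to other classes, the R-value of a class is its share of overlapped samples, and the R-value of the i-th largest class is weighted by the size of the i-th smallest class. The function name and parameters are hypothetical.

```python
def augmented_r_value(X, y, n_neighbors=3, theta=1):
    """Overlap measure in [0, 1]; larger values mean more overlap."""
    def dist2(a, b):
        return sum((u - v) ** 2 for u, v in zip(a, b))

    labels = sorted(set(y), key=lambda c: -y.count(c))  # largest class first
    sizes = [y.count(c) for c in labels]
    r_values = []
    for c in labels:
        overlapped = 0
        for i, (x, label) in enumerate(zip(X, y)):
            if label != c:
                continue
            # Nearest neighbors of x among all other samples.
            nearest = sorted((j for j in range(len(X)) if j != i),
                             key=lambda j: dist2(x, X[j]))[:n_neighbors]
            foreign = sum(1 for j in nearest if y[j] != c)
            if foreign > theta:
                overlapped += 1
        r_values.append(overlapped / y.count(c))
    # Assumed weighting: small classes contribute most to the final value.
    weighted = sum(size * r for size, r in zip(reversed(sizes), r_values))
    return weighted / sum(sizes)
```

Under this weighting, overlap inside the minority class dominates the result, which matches the intent of measuring overlap on imbalanced data.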

E. Classifier Performance
Classifier Performance was measured using Accuracy, Precision, Recall, MicroF1, and MacroF1. This measurement is based on the confusion matrix. The Balanced Error Rate, Precision, Recall, MicroF1, and MacroF1 calculations can be seen in the following equations [27], [5].
In Equation 4, the Balanced Error Rate states the average of the error rates of the minority class and the majority class, which makes it more informative than accuracy on an imbalanced dataset. Equation 5 states that precision is the number of correctly classified minority-class instances (positive samples) divided by the number of all instances that the classifier declares as minority class. Meanwhile, Equation 6 states that recall is the number of correctly classified minority-class instances divided by the size of the entire minority class, including those incorrectly classified as majority class. Equation 7 states that the F-Value measures accuracy as the balance between precision and recall.
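The four measures described above can be computed from the confusion matrix as in the following sketch. The standard binary definitions are assumed, with the minority class as the positive class; the function name `imbalance_metrics` and the counts in the usage note are illustrative and are not results from this study.

```python
def imbalance_metrics(tp, fn, fp, tn):
    """Binary metrics with the minority class treated as positive."""
    # Balanced Error Rate: mean of the minority and majority error rates.
    ber = 0.5 * (fn / (tp + fn) + fp / (fp + tn))
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    # F-Value: harmonic mean of precision and recall.
    f_value = 2 * precision * recall / (precision + recall)
    return {"ber": ber, "precision": precision,
            "recall": recall, "f_value": f_value}
```

For example, with 8 true positives, 2 false negatives, 4 false positives, and 86 true negatives, plain accuracy is 0.94, while the recall is only 0.8 and the Balanced Error Rate is about 0.122, which shows why accuracy alone can hide errors on the minority class.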

F. Proposed Method / Algorithm
The research stages can be seen in Figure 1. The research process consists of two major stages: preprocessing and processing. The preprocessing stage begins with resampling using Smoothed Bootstrap Resampling, a resampling process that calculates a Gaussian distribution value for each sample. This process is important for preventing overfitting during oversampling. After that, Feature Selection is carried out using FAST. The feature selection stage is intended to reduce the overlap degree. The result of the Smoothed Bootstrap Resampling and FAST processes is a preprocessed dataset. The preprocessed dataset then enters the processing stage using Different Contribution Sampling.

1) Preprocessing Using Smoothed Bootstrap Resampling and FAST:
The pseudocode of the preprocessing stage is as follows. Based on the pseudocode, the first step is to form a smoothing matrix from the existing dataset. The smoothing matrix is determined from the standard deviation, which in turn determines the Gaussian distribution value. The Gaussian distribution value is determined in order to anticipate overfitting in the oversampling process. Next, the number of features in the dataset is determined, and an iteration is carried out for each feature or attribute to determine its TPR, FPR, and Area Under ROC values. This iteration is the feature selection process, which is the last stage of preprocessing. The result is a preprocessed dataset, which is passed on to the processing stage.

2) Processing Using SMOTE and Random Under Sampling:
The pseudocode of the processing stage is as follows. In the processing stage, the majority and minority classes are handled differently: the majority class is undersampled using Random Under Sampling, while the minority class is oversampled using SMOTE.
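The two class-specific operations can be sketched as below. This is a minimal illustration rather than the processing pseudocode of the proposed method: the function names are hypothetical, and the target sizes `n_keep` and `n_new` are left to the caller because this section does not state how much each class is resampled.

```python
import random


def random_undersample(majority, n_keep, rng):
    """Keep n_keep majority samples chosen uniformly without replacement."""
    return rng.sample(majority, n_keep)


def smote(minority, n_new, n_neighbors, rng):
    """Create n_new points on segments between minority samples and their neighbors.

    `minority` must contain at least two samples.
    """
    def dist2(a, b):
        return sum((u - v) ** 2 for u, v in zip(a, b))

    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        x = minority[i]
        # Nearest minority neighbors of x, excluding x itself.
        neighbours = sorted((p for j, p in enumerate(minority) if j != i),
                            key=lambda p: dist2(x, p))[:n_neighbors]
        nb = rng.choice(neighbours)
        gap = rng.random()  # position along the segment from x to nb
        synthetic.append([a + gap * (b - a) for a, b in zip(x, nb)])
    return synthetic
```

Undersampling shrinks the majority class while SMOTE grows the minority class by interpolation, so the two classes move toward each other in size from both directions.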

A. Dataset Description
The KEEL Repository provides access to the datasets used in this study [28]. The datasets can be seen in Table II. The selected datasets vary in the number of samples, the number of attributes, and the imbalance ratio, so the training and testing results obtained on them can be expected to describe how well class imbalance is handled.

B. Experimental Setup
The performance of the proposed method was tested on the datasets described in the previous section. The evaluation used the following metrics: Augmented R-Value, Balanced Error Rate, Precision, Recall, and F-Value. The evaluation was carried out using stratified k-fold validation (k=10). In stratified k-fold validation, the data is divided into 10 subsets of the same size while preserving the distribution of each class in order to maintain the imbalance ratio. In each iteration, one subset acts as testing data, and the remaining k-1 subsets act as training data. The process is repeated for k iterations, so that each of the k subsets is used exactly once as testing data. The reported results combine the results of all iterations.
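The splitting procedure can be sketched as follows. The function names `stratified_folds` and `train_test_splits` are hypothetical, and dealing the shuffled indices of each class round-robin into the folds is one simple way to preserve the class distribution, not necessarily the one used in the experiments.

```python
import random


def stratified_folds(y, n_folds, rng):
    """Split sample indices into n_folds folds that preserve class proportions."""
    folds = [[] for _ in range(n_folds)]
    by_class = {}
    for index, label in enumerate(y):
        by_class.setdefault(label, []).append(index)
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal the indices of this class round-robin across the folds.
        for position, index in enumerate(indices):
            folds[position % n_folds].append(index)
    return folds


def train_test_splits(folds):
    """Yield (train, test) index lists; each fold is the test set exactly once."""
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test
```

With 90 majority and 10 minority samples and k=10, every fold receives 9 majority samples and 1 minority sample, so each test subset keeps the 9:1 imbalance ratio of the full dataset.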

C. Testing Result
The first test was conducted to obtain Augmented R-Value and Balanced Error Rate (BER). The test results can be seen in Table III.

D. Statistical Tests
The Wilcoxon Signed-Rank Test was conducted to test whether there were significant differences between the methods on each of the measured parameters [29]. A difference is considered significant if the P-Value is < 0.05. The statistical test results can be seen in Table V. The Hybrid Approach with Smoothed Bootstrap Resampling and Feature Selection gives significantly better results on Augmented R-Value, which indicates that it handles overlapping better than the Wrapper Approach-SMOTE. However, this does not mean that the results of the Wrapper Approach-SMOTE are poor; both methods handle overlapping well. This is indicated by the very small Augmented R-Values of both methods, meaning that very little overlap remains. Overlapping problems tend to need more attention in datasets with large imbalance ratios. The Balanced Error Rate (BER), which states the error of both the majority and minority classes, is very low. Since it was obtained with 10-Fold Validation, where each subset becomes testing data, this shows that both the Hybrid Approach with Smoothed Bootstrap Resampling and Feature Selection and the Wrapper Approach-SMOTE have handled overfitting well. On BER, there is no significant difference between the two methods.
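The paired comparison can be illustrated with the sketch below, which takes the per-dataset scores of two methods and returns the signed-rank statistic with a two-sided P-Value. The function name is hypothetical, and the P-Value uses the normal approximation without tie or continuity correction, which is an assumption of this example; for the small number of datasets typical of such studies, exact tables or a statistics library (for example `scipy.stats.wilcoxon`) are preferable.

```python
import math


def wilcoxon_signed_rank(a, b):
    """Return (W, p) for paired samples a and b.

    Requires at least one pair with a nonzero difference.
    """
    diffs = [x - y for x, y in zip(a, b) if x != y]  # zero differences are dropped
    n = len(diffs)
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        # Extend j over a run of tied absolute differences.
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        average = (i + j) / 2 + 1  # mean of the 1-based ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = average
        i = j + 1
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    w_minus = sum(r for r, d in zip(ranks, diffs) if d < 0)
    w = min(w_plus, w_minus)
    mean = n * (n + 1) / 4
    sd = math.sqrt(n * (n + 1) * (2 * n + 1) / 24)
    z = (w - mean) / sd
    p = math.erfc(abs(z) / math.sqrt(2))  # two-sided, normal approximation
    return w, p
```

A returned P-Value below 0.05 would correspond to the significance criterion stated above.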
On the precision, recall, and F-Value tests, the Hybrid Approach with Smoothed Bootstrap Resampling and Feature Selection gives significantly better results than the Wrapper Approach-SMOTE. Nevertheless, both methods handle class imbalance well.

IV. CONCLUSION
Based on the results in Tables III, IV, and V, the Hybrid Approach with Smoothed Bootstrap Resampling and Feature Selection handles overfitting and overlapping on imbalanced datasets well. The main objective of this study is to treat class imbalance without neglecting overfitting and overlapping. For handling class imbalance, the results obtained are good, as indicated by good Precision, Recall, and F-Value scores. Compared with the Wrapper Approach-SMOTE method, the differences are significant.
As for handling overlapping, the Hybrid Approach with Smoothed Bootstrap Resampling and Feature Selection gives very good results that differ significantly from those of the Wrapper Approach-SMOTE method. As for BER, the results depend not only on the imbalance ratio but also on the number of instances in each dataset.