Handling Imbalanced Data for Acute Coronary Syndrome Classification Based on Ensemble and K-Means SMOTE Method

Abstract: Acute Coronary Syndrome (ACS) has a high mortality rate, with 40% of patients dying within five years of diagnosis. Despite this high mortality rate, excessive use of conventional ACS detection procedures can itself be life-threatening. For this reason, several prediagnosis alternatives have been investigated to reduce the need for intensive ACS detection, one of which is a machine learning approach. The machine learning-based prediagnosis approach uses patient medical record data as input for building detection models. This approach can produce an optimal model when a large amount of data is available and the labels are fairly balanced. However, in machine learning-based ACS detection studies, the positive and negative labels are often imbalanced, which can cause overfitting. The problem persists because obtaining additional data with specific labels is difficult. To address the imbalance problem in ACS detection, we generated synthetic ACS data using the K-Means SMOTE method. The synthetic data is used as training data to build an ensemble-based machine learning model. In this study, we obtain an increase in the F1 score of more than 10% compared to machine learning models that do not use K-Means SMOTE for oversampling. In addition to the higher F1 score, the resulting models are relatively more resistant to overfitting because the training set contains more diverse data variations.


I. INTRODUCTION
Acute Coronary Syndrome (ACS) occurs when part of the heart muscle does not function properly or dies due to a decrease in blood flow in the coronary arteries [1]. This is triggered by cholesterol plaques forming on the inner walls of the coronary arteries (atherosclerosis) [2]. Individuals suffering from ACS have a 40% chance of dying within five years [3], so this health problem is a major concern for most countries. Although ACS is very dangerous [4], detecting it is very difficult, and excessive detection can be life-threatening for the patient [5]. For these reasons, several studies have been carried out to support the initial diagnosis of ACS, one of which uses a machine learning approach.
Machine learning (ML) is a data processing technique that can be used to perform classification based on existing data [6]. The study in [7] uses an artificial neural network (ANN) and reports an F1 score of 0.849. Other researchers [8], [9] use a Decision Tree with an F1 score of 0.979 and the Random Forest algorithm with an accuracy of 83.45%. However, the results of previous studies were still not optimal because the dataset used to build the classification model was extremely imbalanced [10]. One effective method for solving the imbalance problem is oversampling [11].
Oversampling overcomes the imbalance problem without losing any information from the original data, unlike the undersampling approach [12]. Researchers have formulated many oversampling algorithms. One of the most stable oversampling algorithms for tabular data is K-Means SMOTE [13].
K-Means SMOTE is an advanced oversampling method that extends SMOTE with clustering and filtering processes to minimize noise and improve the quality of the resulting synthetic data. This algorithm has been used in several similar studies and produced optimal results [14], [15]. In this study, we use the K-Means SMOTE method to overcome the imbalance problem in ACS classification so that the dataset used to build the prediction model is balanced. To evaluate the results, we used the Random Forest classification algorithm and measured accuracy, F1 score, and ROC AUC. We also used the Mann-Whitney U statistical test to determine whether using K-Means SMOTE had a significant impact on the ML results.

A. Dataset Description
The ACS dataset used in this study was taken from Indonesia General Hospital, and its use was approved by the ethics committees of Institut Teknologi Bandung (August 26, 2022). All procedures were performed in accordance with the Belmont Report and the International Ethical Guideline. Due to the privacy policy, we cannot publicly post the dataset. The dataset contains 480 instances, with 138 (28.75%) identified cases of ACS and 342 (71.25%) unidentified cases. It consists of 14 features, which are listed in Table 1. To quantify the imbalance of the data, we used the imbalance ratio (IR). The IR value ranges from 0 to 1, where 1 means balanced and 0 means imbalanced. In this study, we carry out an oversampling process to bring the IR value close to or equal to 1. We also map the continuous-type data to see the distribution of our data. Complete data can be seen in Table 2. The ranges of the data used are 3-88 years for age, 40-200 for resting blood pressure, 71-564 for cholesterol, 40-202 for maximum heart rate, 0-6.2 for ST depression ECG, and 0-4 for the number of major vessels.
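To make the imbalance ratio concrete, the following minimal sketch applies it to the class counts reported above. The exact IR formula is not reproduced in this text, so the sketch assumes the common definition that matches the stated 0-to-1 range: the minority class count divided by the majority class count. The function names are hypothetical.

```python
import math


def imbalance_ratio(n_minority: int, n_majority: int) -> float:
    """Assumed definition: minority count divided by majority count."""
    if n_majority <= 0:
        raise ValueError("the majority class must be non-empty")
    return n_minority / n_majority


def synthetic_needed(n_minority: int, n_majority: int, target_ir: float = 1.0) -> int:
    """Minority samples to add so that the assumed IR reaches target_ir."""
    return max(0, math.ceil(target_ir * n_majority) - n_minority)


# Class counts reported for the full dataset: 138 ACS, 342 non-ACS.
n_acs, n_non_acs = 138, 342
ir = imbalance_ratio(n_acs, n_non_acs)
extra = synthetic_needed(n_acs, n_non_acs)
print(f"IR = {ir:.4f}")                               # IR = 0.4035
print(f"synthetic ACS samples for IR = 1: {extra}")   # 204
```

These figures refer to the full dataset. Because oversampling is applied to the training split only, the number of synthetic samples actually generated depends on the class counts of that split.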

B. K-Means SMOTE
K-Means SMOTE [13] is an improvement of the SMOTE algorithm [16], which still produces much noise during the data generation process. The algorithm has three main steps: clustering, filtering, and oversampling.
The clustering process is carried out to separate the majority and minority classes. If a cluster has an IR value of less than 1, the oversampling process is carried out in that cluster using the SMOTE algorithm. Details of the algorithm are presented in Algorithm 1.
In this study, we used the K-Means SMOTE algorithm to produce synthetic data for the minority class, namely the positive ACS class, so that the data used can be balanced with IR = 1.
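The oversampling step can be illustrated with a minimal sketch of SMOTE interpolation inside a single cluster. It is not the full K-Means SMOTE procedure: the clustering and filtering steps are omitted, and the sketch assumes that the minority points of one selected cluster are already given. The function names, the toy points, and the neighbour count `k` are hypothetical.

```python
import random


def nearest_minority_neighbors(point, minority, k):
    """Return the k nearest other minority points (squared Euclidean distance)."""
    others = [p for p in minority if p is not point]
    others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(point, p)))
    return others[:k]


def smote_in_cluster(minority, n_new, k=2, seed=0):
    """Create n_new synthetic points, each on the line segment between a
    minority point and one of its k nearest minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        neighbor = rng.choice(nearest_minority_neighbors(base, minority, k))
        lam = rng.random()  # interpolation weight in [0, 1)
        synthetic.append(tuple(b + lam * (n - b) for b, n in zip(base, neighbor)))
    return synthetic


# Hypothetical minority (ACS-positive) points that fell into one selected cluster.
cluster_minority = [(1.0, 2.0), (1.5, 2.5), (2.0, 2.0), (1.2, 1.8)]
new_points = smote_in_cluster(cluster_minority, n_new=3)
print(new_points)
```

Because every synthetic point is interpolated between two minority points of the same cluster, it stays inside the region that the cluster already occupies, which is the intuition behind the reduced noise compared with plain SMOTE.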

C. Random Forest
Random Forest [17] is a machine learning algorithm built from multiple decision trees (the bagging concept). Compared to conventional tree algorithms, it has better noise resistance, is less prone to overfitting, and has better accuracy [18], [19]. The Random Forest algorithm builds the model in several stages [18], which are presented in Algorithm 2, namely:
• Conduct random sampling of data and features for each input tree.
• Build a tree model.
• Combine the tree outputs by voting (average, majority, etc.).
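The three stages listed above can be sketched in a few lines of plain Python. This is a simplified illustration, not the implementation used in the study: each "tree" is a one-split decision stump trained on a single randomly chosen feature, whereas a real Random Forest grows deeper trees and samples features at every split. All names and the toy data are hypothetical.

```python
import random
from collections import Counter


def fit_stump(rows, labels, feature):
    """One-split tree on one feature: choose the threshold with the fewest
    training errors, predicting the majority label on each side."""
    best = None
    for threshold in sorted({r[feature] for r in rows}):
        left = [y for r, y in zip(rows, labels) if r[feature] <= threshold]
        right = [y for r, y in zip(rows, labels) if r[feature] > threshold]
        left_label = Counter(left).most_common(1)[0][0]
        right_label = Counter(right).most_common(1)[0][0] if right else left_label
        errors = sum(y != left_label for y in left) + sum(y != right_label for y in right)
        if best is None or errors < best[0]:
            best = (errors, feature, threshold, left_label, right_label)
    return best[1:]


def fit_forest(rows, labels, n_trees=25, seed=0):
    rng = random.Random(seed)
    n, n_features = len(rows), len(rows[0])
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]  # stage 1: bootstrap sample of rows
        feature = rng.randrange(n_features)         # stage 1: random feature for this tree
        sample_rows = [rows[i] for i in idx]
        sample_labels = [labels[i] for i in idx]
        forest.append(fit_stump(sample_rows, sample_labels, feature))  # stage 2
    return forest


def predict(forest, row):
    """Stage 3: majority vote over the predictions of all trees."""
    votes = [left if row[f] <= t else right for f, t, left, right in forest]
    return Counter(votes).most_common(1)[0][0]


# Hypothetical toy data: two features, label 1 = ACS, label 0 = non-ACS.
X = [(0.1, 1.0), (0.2, 1.2), (0.3, 0.9), (0.8, 3.0), (0.9, 3.2), (1.0, 2.9)]
y = [0, 0, 0, 1, 1, 1]
forest = fit_forest(X, y)
print(predict(forest, (0.05, 0.5)), predict(forest, (0.95, 3.1)))
```

Individual stumps trained on an unlucky bootstrap sample can be wrong, but the majority vote averages such errors out, which is the source of the noise resistance attributed to the method.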

D. Autoencoder
An autoencoder is an algorithm based on the artificial neural network [20] that encodes unlabeled data [21]. The purpose of this algorithm is to reconstruct the output so that it is close to the input. Autoencoders are generally used for data transformation and feature selection and have three main layers: encoder, bottleneck, and decoder. The architecture is shown in Fig 1. The encoder receives the input and transforms the data into lower dimensions, the bottleneck layer carries out the encoding process, and the decoder layer reconstructs the data from the encoding results. In this study, we use an autoencoder for preprocessing and feature selection to make the data used as model input more optimal.

E. Model Evaluation
Accuracy, F1 score, and ROC AUC are used in this study to measure the performance of each experimental scenario. The F1 score is obtained from the precision and recall values as follows:

$$F_1 = \frac{2 \times \text{precision} \times \text{recall}}{\text{precision} + \text{recall}}$$

with the precision and recall formulas as follows:

$$\text{precision} = \frac{TP}{TP + FP}, \qquad \text{recall} = \frac{TP}{TP + FN}$$

Here TP and FP are the numbers of true positive and false positive predictions. The false negative (FN) value is obtained from each positive class instance that is predicted to be a negative class. Meanwhile, ROC AUC is the area under the ROC curve. The ROC curve is obtained by plotting the True Positive Rate (TPR) against the False Positive Rate (FPR) on a two-dimensional graph over all classification thresholds. The TPR and FPR formulas are as follows:

$$TPR = \frac{TP}{TP + FN}, \qquad FPR = \frac{FP}{FP + TN}$$

A true negative (TN) value is obtained from each negative class instance that is predicted to be a negative class.
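As a worked example of these metrics, the sketch below computes them from the four confusion-matrix counts using the standard definitions. The counts are hypothetical and chosen only to make the arithmetic easy to follow; they are not results from this study.

```python
def classification_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Standard metrics from confusion-matrix counts (ACS = positive class)."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)  # identical to the true positive rate (TPR)
    f1 = 2 * precision * recall / (precision + recall)
    fpr = fp / (fp + tn)     # false positive rate
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return {
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "fpr": fpr,
        "accuracy": accuracy,
    }


# Hypothetical counts for illustration only.
metrics = classification_metrics(tp=30, fp=10, fn=5, tn=55)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

With these counts, precision is 0.75, recall is about 0.857, and the F1 score is 0.80, while accuracy is 0.85. The sketch does not guard against zero denominators, which occur when a model predicts no positives or when a class is absent from the test set.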

F. Scenario
In this study, we took several steps to obtain the research results, which can be seen in Fig 2. The first step is to collect the dataset and preprocess the data. The preprocessing includes removing unreasonable data, such as records with an age of 0 years, filling in empty columns using averages, and encoding the non-continuous data as categories. After the dataset is obtained, we transform the data using an autoencoder and divide the dataset into train and test sets before the oversampling process. After splitting the dataset, we oversampled the training data using the K-Means SMOTE algorithm. Next, we carry out the learning process with the following fold configurations: 3, 5, 7, 9, 10, and 30. The final step is to evaluate the model using the F1 score, accuracy, and ROC AUC obtained from the machine learning model testing process. The evaluation also includes nonparametric statistical tests.
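The fold configurations can be illustrated with a small index generator. The text does not specify how the folds are constructed, so this sketch assumes plain (non-stratified) k-fold cross-validation over the training indices; the training-set size `N_TRAIN` and the function name are hypothetical.

```python
import random


def k_fold_indices(n_samples: int, k: int, seed: int = 0):
    """Shuffle the indices, split them into k near-equal validation folds,
    and yield one (train_indices, validation_indices) pair per fold."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i, val in enumerate(folds):
        train = [j for m, fold in enumerate(folds) if m != i for j in fold]
        yield train, val


FOLD_CONFIGS = (3, 5, 7, 9, 10, 30)  # fold counts listed in the text
N_TRAIN = 60                         # hypothetical training-set size

for k in FOLD_CONFIGS:
    sizes = [len(val) for _, val in k_fold_indices(N_TRAIN, k)]
    print(f"k={k}: validation fold sizes between {min(sizes)} and {max(sizes)}")
```

Whatever the fold construction, splitting before oversampling keeps synthetic samples out of the test set, so the reported scores are measured on original records only.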

A. Feature Analysis
In this study, we ranked the features by their significance for the ACS label using Gini importance [22]. Gini importance, also known as impurity importance, is obtained from the impurity reduction achieved when a feature is used to split a tree node. The values obtained from each tree are averaged over the number of trees in the Random Forest so that the variables can be compared. The higher the Gini importance, the more significant the feature is for the target. The results of the ranking can be seen in Fig 3.
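The impurity reduction behind Gini importance can be made concrete with a single split. The sketch below uses the standard Gini impurity, one minus the sum of squared class proportions, and the weighted impurity decrease of one node split; the labels are hypothetical, and a full importance score would sum such decreases per feature and average them over all trees.

```python
from collections import Counter


def gini(labels) -> float:
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def impurity_decrease(parent, left, right) -> float:
    """Parent impurity minus the size-weighted impurity of the two children."""
    n = len(parent)
    weighted_children = (len(left) / n) * gini(left) + (len(right) / n) * gini(right)
    return gini(parent) - weighted_children


# Hypothetical node with 3 ACS (1) and 5 non-ACS (0) samples, split by one feature.
parent = [1, 1, 1, 0, 0, 0, 0, 0]
left, right = [1, 1, 1, 0], [0, 0, 0, 0]
print(f"parent impurity: {gini(parent):.5f}")                              # 0.46875
print(f"impurity decrease: {impurity_decrease(parent, left, right):.5f}")  # 0.28125
```

A feature whose splits repeatedly produce large decreases like this one across many trees ends up near the top of the ranking.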

Fig. 3 Ranking of Features Using Relative Importance
In Fig 3, we can see that the most significant features for the ACS label are ST depression ECG and thalassemia, which account for more than 17% of the total importance. Meanwhile, the resting blood pressure, num major vessels, maximum heart rate, angina type, and cholesterol features contribute between 7.8% and 10.5%. The other features have an impact of less than 7.5% each.

B. Autoencoder Result
In this research, we use two encoder layers, followed by one additional bottleneck layer, and end with two decoder layers. The output of this process is the transformed data. The autoencoder layers and parameters can be seen in

C. K-Means SMOTE Impact on Machine Learning Models
In this study, all machine learning models built using the K-Means SMOTE training data had better F1, accuracy, and ROC AUC scores than models built using the original training data. The full distribution of results can be seen in Table 3. The best F1 score obtained is 0.8515 with the 30-fold configuration. In comparison, the lowest F1 score is 0.7087, obtained with the 3-fold configuration using the original training data. We also compared the results of K-Means SMOTE with several other oversampling algorithms: SMOTE [16], ADASYN [23], Gaussian SMOTE [24], Cure SMOTE [25], SMOTE PSO [26], and Borderline SMOTE [27]. The results of each algorithm can be seen in Table 4. In the experimental results, K-Means SMOTE gives the highest results in all scenarios compared to the other oversampling algorithms.

D. Comparison with Previous Studies
In Table 5, we present a comparison of our research with previous studies. In the studies of [7]-[9] and in this study, ACS cases made up less than 32% of the total data, whereas in [28] ACS cases made up 97% of the total data. The composition of the ACS data in these studies is therefore imbalanced. However, in [29], ACS cases made up 50% of all data, which means the data in that study were balanced. Although several previous studies had a better F1 score than ours, this cannot be taken as a measure of the quality of the predictive model because of the bias arising from the type and quality of data used in each study.

E. Statistical Evaluation
To obtain more accurate and unbiased comparison results, we use nonparametric statistical tests [30] to compare the model built with the original data against the model built with the K-Means SMOTE data, based on Table 3. This method can provide an exact description of the distribution of values. In this study, we used the Mann-Whitney U test [31], defined as follows:

$$U_i = n_1 n_2 + \frac{n_i (n_i + 1)}{2} - R_i, \quad i \in \{1, 2\} \qquad (8)$$

where $n_1$ and $n_2$ are the numbers of F1 score values obtained with the K-Means SMOTE data and the original data, and $R_1$ and $R_2$ are the rank sums of the two groups. To test the null hypothesis, we use a significance threshold of α = 0.05. Based on the calculation, we obtained a p-value of 0.00216, so the null hypothesis $H_0$ is rejected. This result means that the use of K-Means SMOTE has a significant effect on increasing the F1 score.

IV. CONCLUSION
In this study, we addressed the problem of data imbalance in ACS classification by using the K-Means SMOTE algorithm to oversample the training data. Our simulations showed that all models built using K-Means SMOTE oversampled data had higher F1 scores in all scenarios, with an average increase of 10.07%. We also compared the performance of other oversampling algorithms and found that K-Means SMOTE gave the largest increase in F1 scores.
Our study's findings suggest that oversampling algorithms can improve the output of machine learning models on imbalanced ACS datasets. However, we acknowledge that our research has some limitations, such as using only one dataset and an oversampling algorithm. Therefore, future research could explore other oversampling algorithms, feature engineering processes, and advanced machine learning algorithms to improve the output of these models further.
In conclusion, our research provides insight into the use of oversampling algorithms to address data imbalance in ACS classification. Our findings can be used as a foundation for future research to improve the output of machine learning models on imbalanced ACS datasets.

Fig. 1 Autoencoder
A true positive (TP) value is obtained from each positive class instance predicted to be a positive class, while a false positive (FP) value is obtained from every negative class instance predicted to be a positive class.

Fig. 2 Scenario
Fig 4. The autoencoder was run for 200 epochs and produced a lowest training loss of 0.0181 and a lowest validation loss of 0.0149. The loss value of each epoch can be seen in Fig 5.