
The village development index (Indeks Desa Membangun, IDM) dataset has a significant class imbalance: the developing and disadvantaged classes far outnumber the self-supporting, advanced, and very disadvantaged classes. Traditional classifiers can achieve high accuracy (ACC) by learning the majority class while neglecting the minority class, so their classification results may be biased. In this study, a random under-sampling technique based on k-means cluster (KMC) was combined with a meta-learning approach to improve the ACC of the village status classification model. AdaBoost and random forest were used as the meta technique and base learner, respectively. The proposed model was evaluated using the area under the curve (AUC), and the experimental results showed excellent performance compared with prior studies, with AUC, ACC, precision (PR), recall (RC), and g-mean (Gm) values of 95.50%, 95.52%, 95.5%, 95.5%, and 92.95%, respectively. The t-test also confirmed that the proposed model outperformed previous studies. It can be concluded that the AdaBoost algorithm reduced misclassification and changed the distribution of the training data seen by the random forest. This indicates that the proposed model deals effectively with imbalanced classes in the village development status classification model.


I. INTRODUCTION
The Ministry of Village, Development of Disadvantaged Regions, and Transmigration of the Republic of Indonesia (KEMENDESA), the Ministry of National Development Planning of the Republic of Indonesia (BAPPENAS), and the Central Bureau of Statistics Indonesia (BPS) developed a system that provides information on village development status. This information is compiled as a unit of analysis based on the village development index in Indonesia, in line with Law No. 6 of 2014. The information is then used to formulate and summarize village development policies and oversight plans.
In recent decades, classification models have been used to develop policies for different purposes, based on classification analysis in different fields [1]-[7]. The process generally begins with pre-processing, which identifies potential problems in the data. According to Han and Kamber [8], missing data, outliers, and imbalanced classes often introduce bias into classification results.
A class imbalance has been identified in the IDM dataset, and it affects the model's performance. An imbalanced class distribution causes severe difficulties for most base classifiers because they assume that all data have a balanced class distribution [3], [9], [10]. In an imbalanced dataset, one class is represented by a large sample and the other by a relatively small one. According to Sun et al. [11], base classifiers perform poorly on imbalanced datasets because they are designed to generalize from the training data and to output the simplest hypothesis that best fits the data.
Several studies have sought the best machine learning model for classifying village development status in Indonesia, including k-prototype [12], support vector machine (SVM) [13], bootstrap sampling k-nearest neighbors (BS-KNN) [14], and decision tree (DT) [15]. The classification performance of village development status models remains a focus of further study, since no model has achieved the best performance on all evaluation measures and none of these studies has addressed the class imbalance in the data. The distribution of the IDM data can be seen in Fig. 1.

Fig. 1 Class distribution of the IDM dataset

As shown in Fig. 1, the self-supporting, advanced, and very disadvantaged classes are smaller than the developing and disadvantaged classes, which identifies an imbalanced class problem. Therefore, a combination of random under-sampling based on k-means cluster (RUS-KMC) and random forest based on meta-learning (RFM) was proposed to improve the accuracy (ACC) of the village development status classification model. RUS-KMC was employed to handle the imbalanced classes, while RFM was used to enhance the classifier's performance. RUS was selected because many previous studies reported that it is often used to tackle imbalanced classes. KMC was selected as the clustering model because it can handle mixed-type attributes and big datasets, automatically determine the ideal number of clusters, and handle attributes that are not normally distributed [16].

A. Dataset
The IDM dataset was obtained from the Ministry of Village, Development of Disadvantaged Regions, and Transmigration of Indonesia in 2016. The dataset includes the village potential information (PODES, Potensi Desa) from 16 provinces, organized in three main dimensions: the Social Resilience Index (SRI), the Economic Resilience Index (EcRI), and the Village Ecological Resilience Index (EnRI). The dataset covers 3863 villages, 62 attributes, and 5 classes: very disadvantaged, disadvantaged, developing, advanced, and self-supporting. Each attribute has a score of one to five, where a score of one indicates very disadvantaged and a score of five indicates self-supporting. The attributes are numeric or categorical, and the class is categorical, as shown in Table I.

Table I
ID     Attribute                                  Type
-      Citizen's Environmental Security System    Categorical
a27    Conflict score                             Categorical
a28    Mineral water                              Categorical
a29    Latrine access                             Categorical
a30    Garbage                                    Categorical
a31    Washing bath                               Categorical
a32    Electricity score                          Numeric
a33    Signal score                               Numeric
a34    Internet score                             Numeric
-      Citizen's internet access                  Categorical
Economy (EcRI)
b1     Production diversity score                 Numeric
b2     Economy score                              Numeric
b3     Grocery store score                        Numeric
b4     Shop & lodging score                       Numeric
b5     Shop score                                 Numeric
b6     Market score                               Numeric
b7     Road quality score                         Numeric
b8     Region openness score                      Numeric
b9     Mode general trans score                   Numeric
b10    Postal and logistics services score        Numeric
b11    Credit fast score                          Numeric
b12    Bank BPR score                             Numeric
-      Water pollution                            Categorical
c2     Soil pollution                             Categorical
c3     Air pollution                              Categorical
c4     River waste pollution                      Categorical
c5     Pollution score                            Numeric
c6     Avalanche                                  Categorical
c7     Flood                                      Categorical
c8     Forest fires                               Categorical
c9     Disaster score                             Categorical
c10    Early warning                              Categorical
c11    Tsunami early warning                      Numeric
c12    Safety equipment                           Categorical
c13    Evacuation route                           Categorical
c14    Disaster response score                    Categorical
Class  Very disadvantaged, Disadvantaged, Developing, Advanced, Self-supporting    Categorical
The total number of attributes is 62.

Several government regulations determine the existence of the IDM dataset: 1) Presidential Decree (PERPRES) No.

B. General Step
The experiments were conducted on a computer with the following specifications: Intel HD Graphics 4000 1536 MB, 2.5 GHz dual-core Intel Core i5, 8 GB RAM, and the 64-bit macOS Catalina version 10.15.7 operating system, with the data analytics program Weka version 3.8.5. Weka produces an AUC and a confusion matrix as computation outputs, and IBM SPSS Statistics produces a t-test for the statistical comparison between the proposed model and prior studies.
We propose a model called RUS-KMC+RFM, which combines random under-sampling (RUS) based on the k-means cluster (KMC) with a hybrid meta-learning technique (RFM) to tackle imbalanced class problems and achieve high accuracy in the village development status classification model, as shown in Fig. 2. KMC is a clustering technique, created by Kumar et al. [16], that produces clusters of relatively uniform size and is designed to handle very large datasets. AdaBoost was used as the meta technique, while random forest (RF) was used as the base learner. The AdaBoost meta technique was employed to tackle imbalanced classes and improve RF classification performance. It assigns higher weights to misclassified samples and reduces the weights of correctly classified samples, effectively changing the distribution of the training data [7]. The proposed model was evaluated using the IDM dataset.
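The reweighting idea described above can be illustrated with a minimal sketch of one boosting round. It assumes the textbook binary AdaBoost update; the function name `adaboost_reweight` and the toy inputs are hypothetical, and Weka's own implementation may differ in detail.

```python
import math

def adaboost_reweight(weights, misclassified):
    """One AdaBoost round (textbook binary form, an assumption here).

    weights: current sample weights, assumed to sum to 1.
    misclassified: booleans, True where the base learner was wrong.
    Returns the normalized new weights and the learner's vote weight alpha.
    """
    # Weighted error of the current base learner.
    eps = sum(w for w, wrong in zip(weights, misclassified) if wrong)
    if not 0.0 < eps < 0.5:
        raise ValueError("weighted error must lie in (0, 0.5)")
    alpha = 0.5 * math.log((1.0 - eps) / eps)
    # Raise the weights of misclassified samples, lower the others.
    updated = [
        w * math.exp(alpha if wrong else -alpha)
        for w, wrong in zip(weights, misclassified)
    ]
    total = sum(updated)
    return [w / total for w in updated], alpha

# Toy example: four equally weighted samples, the first one misclassified.
new_weights, alpha = adaboost_reweight([0.25] * 4, [True, False, False, False])
```

After the update, the single misclassified sample carries half of the total weight, so the next base learner is pushed to classify it correctly.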
As shown in Fig. 2, the dataset was fed into a training phase and a testing phase. The training phase was used to build the model, while the testing phase deals with testing and performance evaluation. In the pre-processing step of the training phase, the KMC technique was employed to group the dataset, and the number of clusters was set to 5 to create five bins. Within each cluster, instances are then randomly under-sampled until the majority and minority classes have an equal number of members. The clusters, each now with the same proportion for each class, are combined to create a new dataset.
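The pre-processing step can be sketched as follows. This is an illustrative reading of the description above, not the procedure as run in Weka: it assumes purely numeric features with Euclidean distance, uses a plain Lloyd's k-means, and keeps a cluster unchanged when it contains a single class. The names `kmeans`, `balance_within_clusters`, and `rus_kmc` are hypothetical.

```python
import random
from collections import defaultdict

def kmeans(points, k, iters=20, seed=0):
    """Plain Lloyd's k-means on numeric tuples; returns a cluster id per point."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    assign = [0] * len(points)
    for _ in range(iters):
        # Assignment step: nearest center by squared Euclidean distance.
        assign = [
            min(range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])))
            for p in points
        ]
        # Update step: move each center to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assign) if a == c]
            if members:  # an empty cluster keeps its old center
                centers[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return assign

def balance_within_clusters(assign, labels, seed=0):
    """Randomly under-sample each class to the smallest class size per cluster."""
    rng = random.Random(seed)
    groups = defaultdict(lambda: defaultdict(list))
    for i, (c, y) in enumerate(zip(assign, labels)):
        groups[c][y].append(i)
    keep = []
    for by_class in groups.values():
        smallest = min(len(idx) for idx in by_class.values())
        for idx in by_class.values():
            keep.extend(rng.sample(idx, smallest))
    return sorted(keep)

def rus_kmc(points, labels, k=5, seed=0):
    """Cluster, balance classes inside each cluster, and merge the clusters."""
    keep = balance_within_clusters(kmeans(points, k, seed=seed), labels, seed)
    return [points[i] for i in keep], [labels[i] for i in keep]

# Toy data (hypothetical): two blobs with imbalanced labels, k=2 for brevity.
X = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1),
     (5.0, 5.0), (5.1, 5.0), (5.0, 5.1)]
y = ["maj", "maj", "maj", "min", "maj", "min", "min"]
X_bal, y_bal = rus_kmc(X, y, k=2)
```

Balancing inside each cluster, rather than over the whole dataset, keeps the retained majority samples spread across the regions of feature space that the clusters represent.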
The new dataset is then fed into the hybrid technique with a 10-fold cross-validation approach: it is divided into ten parts, of which nine serve as the training dataset and the remaining one is used for testing. AdaBoost and RF were employed in the hybrid strategy as the meta and base learners, respectively. After the learning process is complete, the model is fed with test data in the testing phase, and the assessment results are recorded.
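The 10-fold split can be sketched with a small index generator. This is a simplified, unstratified illustration; the function name `k_fold_indices` is hypothetical, and the sketch does not reproduce how Weka builds its folds.

```python
import random

def k_fold_indices(n, k=10, seed=0):
    """Yield (train, test) index lists for k-fold cross-validation."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    # Deal the shuffled indices into k folds of near-equal size.
    folds = [idx[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        # The other k-1 folds together form the training set.
        train = [j for fold in folds[:i] + folds[i + 1:] for j in fold]
        yield train, test

# Each of the 25 toy rows is used for testing exactly once.
splits = list(k_fold_indices(25, k=10))
```

Every instance is tested once and used for training in the other nine rounds, so the recorded results average over ten train/test partitions.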
This study used the area under the curve (AUC) to evaluate the proposed model. AUC is a numerical measure of the model's performance and of its effectiveness in distinguishing between positive and negative observations. According to Xue and Hall [17], AUC greatly improves convergence across empirical studies of imbalanced class problems, and it is a single-measure summary of classifier performance that is useful for determining whether a model performs better. The general rule for categorizing the accuracy of a diagnostic test based on AUC, also reported by Gautheron et al. [18], uses five categories: excellent, good, fair, poor, and failure, with respective ranges of 90% to 100%, 80% to 90%, 70% to 80%, 60% to 70%, and 50% to 60%. The AUC is calculated using (1) to (8); see Table II. A true positive (TP) occurs when both the actual and the predicted class are positive. A false positive (FP) occurs when the predicted class is positive but the actual class is not. A true negative (TN) is the counterpart of a TP for the negative class, and a false negative (FN) occurs when a positive case is predicted as negative. The calculation was based on the confusion matrix generated by the model.
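As an illustration of how the reported measures follow from a confusion matrix, the sketch below uses the standard binary definitions of ACC, PR, RC, Gm, and EC, together with the AUC categories quoted above. The standard formulas are assumed here rather than taken from the paper's own equations; the function names and the toy counts are hypothetical, and treating each lower bound as inclusive is also an assumption.

```python
import math

def binary_metrics(tp, fp, fn, tn):
    """Standard confusion-matrix measures for one positive class."""
    total = tp + fp + fn + tn
    acc = (tp + tn) / total            # accuracy
    pr = tp / (tp + fp)                # precision
    rc = tp / (tp + fn)                # recall (true positive rate)
    specificity = tn / (tn + fp)       # true negative rate
    gm = math.sqrt(rc * specificity)   # g-mean
    return {"ACC": acc, "PR": pr, "RC": rc, "Gm": gm, "EC": 1.0 - acc}

def auc_category(auc):
    """Map an AUC value to the five categories quoted in the text."""
    if auc >= 0.9:
        return "excellent"
    if auc >= 0.8:
        return "good"
    if auc >= 0.7:
        return "fair"
    if auc >= 0.6:
        return "poor"
    return "failure"

# Toy counts, not taken from the paper's tables.
m = binary_metrics(tp=40, fp=5, fn=10, tn=45)
label = auc_category(0.955)
```

For a five-class problem such as IDM, these measures would be computed per class in a one-vs-rest manner and then averaged; the averaging scheme is not specified in the text.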

III. RESULTS AND DISCUSSION
The experiments were run on the platform described in the General Step subsection, with Weka version 3.8.5 producing the AUC and confusion matrix and IBM SPSS Statistics generating the t-test for the statistical comparison between the proposed model and prior studies. First, a comparison experiment was performed between RF and the meta-learning strategy (RFM) on the IDM dataset without the KMC undersampling-based filter method. To evaluate the model, Weka directly generated AUC, ACC, PR, RC, error classification (EC), and Gm, as shown in Table III, and the performance comparison can be seen in Fig. 4. The RFM misclassification results are smaller than those of the conventional RF approach, at 9.51% EC. This means the meta-learning strategy is promising on all performance measures but does not fully handle imbalanced class data, as stated by Wang and Sun [21]. The AdaBoost algorithm is an effective solution for classification, but it still needs improvement on the imbalanced class problem.
In the second experiment, the RUS-KMC+RFM technique was implemented; the results are shown in Table IV, and the performance comparison can be seen in Fig. 5. ACC, PR, RC, Gm, and EC were taken directly from Weka before AUC was calculated. The proposed model outperformed both models of the first experiment on all evaluation measures. Its misclassification rate was also the best: its EC is smaller than those of the two first-experiment models (RF and RFM), at 4.48% versus 12.71% and 9.51%, respectively. Based on this result, the second experiment was better overall than the first. The proposed strategy also addresses the limitation stated by Wang and Sun [21].
The current and previous studies used the same private dataset; therefore, their results can be compared directly. Table V shows the comparison, for which the k-Means+BS-KNN model proposed by Siswanto, Suprapedi, and Purwanto [14] was selected. AUC was used in this comparison because it is the primary evaluation measure in imbalanced class classification. In the table, a bold font marks the best AUC value, and an underlined font marks the second best.
The proposed model outperforms prior studies, producing excellent AUC results. It was also compared statistically with Siswanto, Suprapedi, and Purwanto [14] using the t-test, as can be seen in Table VI. According to these findings, the proposed model not only achieves the best accuracy but is also promising, since it differs statistically from the best prior result.
In this research, the t-test was employed to determine whether the proposed approach differs from other studies and to discover which model performs better. The pair of the proposed model vs. Siswanto, Suprapedi, and Purwanto [14] has a p-value of 0.00001, which indicates a significant difference in favor of the proposed model, which has the higher AUC value. Therefore, according to the t-test results, the proposed model showed an outstanding result and is competitive with the findings of the most recent study.
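For readers unfamiliar with the test, the statistic behind such a comparison can be sketched as below. The text does not state which t-test variant was run in SPSS; this sketch assumes a paired test over matched scores (for example, per-fold AUC values of two models). The function name `paired_t_statistic` and the toy numbers are hypothetical and are not results from this study.

```python
import math
from statistics import mean, stdev

def paired_t_statistic(scores_a, scores_b):
    """Return (t, degrees of freedom) for a paired t-test on matched scores."""
    if len(scores_a) != len(scores_b) or len(scores_a) < 2:
        raise ValueError("need two equally long samples with at least 2 pairs")
    # Work on the pairwise differences between the two models.
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    sd = stdev(diffs)  # sample standard deviation (n - 1 in the denominator)
    if sd == 0:
        raise ValueError("all differences are identical; t is undefined")
    t = mean(diffs) / (sd / math.sqrt(len(diffs)))
    return t, len(diffs) - 1

# Toy matched scores, purely illustrative.
t_value, dof = paired_t_statistic([3.0, 5.0, 7.0], [1.0, 2.0, 3.0])
```

The p-value is then read from the t distribution with the returned degrees of freedom, which is the step a statistics package such as SPSS performs.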

IV. CONCLUSION
In summary, traditional classification models can achieve high accuracy on imbalanced class problems, but only because almost all of them learn the majority class and exclude the minority class, so the results are biased. In the pre-processing stage, the random under-sampling technique based on k-means cluster (RUS-KMC) was successfully used in the classification process. RUS was selected because many previous studies used this approach when the data contained imbalanced classes in classification problems. KMC was selected as the clustering method because it promises to solve at least some of these problems, for example, (1) the ability to handle mixed-type variables and large datasets, (2) the automatic determination of the optimal number of clusters, and (3) variables that may not be normally distributed. The evaluation results showed that the combination of RUS and KMC was very effective. Its effectiveness lies in selecting variables that might not be normally distributed and in improving the performance of random forest classification based on meta-learning (RFM) for village development status classification beyond previous studies, with AUC, ACC, PR, RC, and Gm values of 95.50%, 95.52%, 95.5%, 95.5%, and 92.95%, respectively. In addition, the t-test results also reported a very good performance compared to previous studies. It can be concluded that the proposed model is effective in handling imbalanced classes in the IDM dataset for the village development status classification model in Indonesia.
The structure of the IDM dataset makes it difficult to study feature discretization for handling noisy attributes based on clustering techniques. Therefore, future studies should consider comparing the suggested approach with other clustering models, such as DBSCAN and Fuzzy C-means, as well as other meta-learning methods, including bagging and boosting.