ON INFORMATICS VISUALIZATION

Abstract: Imbalanced datasets are a common problem in supervised machine learning. Imbalance causes a model to learn the majority classes more thoroughly than the minority classes. Therefore, the machine learning model is more effective at recognizing the majority classes than the minority classes. Imbalanced data arises naturally in real life, for example in disease data and network data. DDOS is one of the network intrusions found to happen more often than R2L. There is an imbalance in the composition of network attacks in public Intrusion Detection System (IDS) datasets such as NSL-KDD and UNSW-NB15. Researchers have proposed many techniques to transform such data into balanced data by duplicating the minority class and producing synthetic data. The Synthetic Minority Oversampling Technique (SMOTE) and Adaptive Synthetic (ADASYN) algorithms duplicate the data and construct synthetic data for the minority classes. Meanwhile, machine learning algorithms capture the pattern of the labeled data by considering the input features. Unfortunately, not all input features have an equal impact on the output (predicted class or value). Some features are interrelated and misleading. Therefore, the important features should be selected to produce a good model. In this research, we implement the recursive feature elimination (RFE) technique to select important features from the available dataset. According to the experiment, SMOTE provides a better synthetic dataset than ADASYN for the UNSW-NB15 dataset, which has a high level of imbalance. RFE feature selection slightly reduces the model's accuracy but improves the training speed. The Decision Tree classifier consistently achieves a better recognition rate than Random Forest and KNN.


I. INTRODUCTION
With the current high level of internet usage, network attacks pose a serious threat. The attacks are evolving in line with advances in computing capacity. To ensure the safety of data communication, defensive action must be taken. Therefore, researchers in network defense are working continuously to counter new types of attacks.
Based on publicly available datasets, many researchers have developed methods and tools to recognize network intrusions, using algorithms such as random forest, decision tree, logistic regression, KNN, and ANN. A common problem identified in many academic papers is that certain classes of attacks rarely happen. Therefore, the available data for them is limited, while other popular attacks, such as distributed denial of service (DDOS), dominate network attacks. Natural data imbalances are observed in most IDS datasets.
The quality of the dataset is very important in the classification process, yet certain classes dominate an imbalanced dataset. According to Johnson and Khoshgoftaar [11], existing classification models have a higher capability of recognizing the majority class and tend to fail to recognize the minority classes. The class imbalance problem has been identified as a cause of low classification performance [12]. Therefore, a pre-processing step is needed to give the training data an equal number of samples in each class [13].
The second problem concerns the features of the datasets. Not all features are relevant to the class label. Some features are intercorrelated, so redundant information appears in the input. Redundant features make training take longer without making the model better. Researchers have addressed this issue in various ways. Some use statistical measures such as the intercorrelation between features or Information Gain (IG) [8], [20]. Principal component analysis (PCA) and linear discriminant analysis (LDA) were proposed by Ibrahimi and Ouaddane [21]. Recursive feature elimination (RFE) was studied in several previous works [22], [23], [24]. The most commonly used feature selection method is information gain, a filter-based method [25], [26]. Information gain starts with a basic attribute ranking, removes the background noise caused by unimportant features, and finds the features that carry the most information about a certain class [20]. Calculating a feature's entropy is one way to evaluate which feature is superior to others. Entropy is a measure of uncertainty that gives a quick idea of how a system's characteristics are spread out [27]. PCA is popular because of the adaptability of its computation and the reversibility of its approach. PCA is useful for solving dimensionality reduction problems [28]. It works by removing the less significant attributes of the higher-dimensional space [29]. According to Gupta and Agrawal [24], using RFE in the training process can remove useless and redundant features, yielding higher accuracy and shorter training time.
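The entropy-based ranking behind information gain can be illustrated with a minimal sketch. The feature names and toy values below are hypothetical and are not taken from UNSW-NB15; the sketch assumes categorical features and only shows how a feature that separates the classes well receives a higher score.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(feature_values, labels):
    """Label entropy minus the weighted label entropy after splitting on the feature."""
    total = len(labels)
    groups = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)
    remainder = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - remainder


# Hypothetical toy data: "protocol" separates the classes, "flag" does not.
labels = ["normal", "normal", "attack", "attack"]
protocol = ["tcp", "tcp", "udp", "udp"]
flag = ["a", "b", "a", "b"]

ig_protocol = information_gain(protocol, labels)
ig_flag = information_gain(flag, labels)
print(ig_protocol, ig_flag)
```

A filter-based selector would rank the features by this score and keep the highest-ranked ones.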
IDS performance is improved by using modern and up-to-date datasets. Evaluating IDS efficiently and accurately against modern normal and attack network operations therefore requires new, cutting-edge datasets. This research builds a network attack detection framework using the UNSW-NB15 dataset [30], [31]. This dataset includes recent attacks. IDS benchmarks previously used the KDD99 and NSL-KDD datasets. Aging datasets are less useful for understanding today's network traffic [32], [33]. Although the body of work is still limited, several researchers have used the new UNSW-NB15 dataset to detect attacks.
This paper aims to determine the impact of balancing techniques (SMOTE and ADASYN) and of feature selection using RFE on the classification results of machine learning-based IDS. We also evaluate decision tree (DT), random forest (RF), logistic regression (LR), and K-nearest neighbor (KNN) classifiers on multiclass network attack classification. The experiment is carried out in two scenarios defined in the research framework. In the first, feature selection is applied after the data has been balanced; in the second, feature selection comes first and data balancing follows.

II. MATERIALS AND METHODS
This research aims to observe the impact of data balancing and feature selection on classification performance. Fig. 1 shows the first research framework, in which dataset balancing is carried out before feature selection. Fig. 2 shows the second research framework, in which feature selection is carried out before imbalanced dataset handling.

A. Pre-processing
Pre-processing prepares the dataset for the subsequent processes. It includes data standardization and normalization. We used data standardization to transform the data from a normal distribution to a standard normal distribution because the dataset contains features with a wide range of possible values [8]. The standard score, also called a z-score, is computed as follows:

$$z = \frac{x - \mu}{\sigma} \qquad (1)$$

where x is the data sample, μ is the mean, and σ is the standard deviation [34]. During data normalization, the value of each continuous attribute is scaled so that the attributes share a common range [35]. This is done by mapping the value of each continuous attribute to a value between 0 and 1. We used the normalizer class available in Python for this step.
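The two pre-processing steps can be sketched in plain Python. This is an illustrative sketch, not the code used in the experiment: the z-score follows the definition in the text, while the scaling to the range 0 to 1 is assumed here to be min-max scaling. The helper names and the toy column are hypothetical, and the sketch assumes the column is not constant.

```python
from statistics import mean, pstdev


def standardize(values):
    """Return z-scores: (x - mean) / standard deviation for each value."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation; assumed non-zero
    return [(x - mu) / sigma for x in values]


def min_max_scale(values):
    """Map values linearly onto [0, 1] (assumed form of the normalization)."""
    lo, hi = min(values), max(values)
    return [(x - lo) / (hi - lo) for x in values]


# Hypothetical continuous attribute with mean 5 and standard deviation 2.
durations = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
z = standardize(durations)
scaled = min_max_scale(durations)
print(z)
print(scaled)
```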

B. Dataset UNSW-NB15
The Australian Centre for Cyber Security put together the UNSW-NB15 dataset [36]. The details of the dataset are presented in Table I. It covers nine attack types and has a total of 49 attributes. The following categories of attacks are included in this dataset: worms, shellcode, reconnaissance, port scans, generic, backdoor, DoS, exploits, and fuzzers [37]. The distribution of the training and testing datasets is shown in Table II. The distribution shows that the data is highly imbalanced.

C. Imbalance Dataset Treatment
This research examines SMOTE and ADASYN, two popular synthetic oversampling techniques. We aim to identify which of them produces the better synthetic data for classifying the highly imbalanced UNSW-NB15 dataset.
Chawla et al. [38] proposed SMOTE as an oversampling method. It creates new synthetic data for the underrepresented class. SMOTE does not simply duplicate the few existing samples to enlarge the class; rather, it creates synthetic data [10], [39], [40]. It generates new data by randomly picking instances from a minority class that are close to each other in the feature space. It then produces a new minority data point as a linear combination of two similar minority samples. The new point is interpolated between the minority instance and one of its nearest neighbors. SMOTE ignores the position of the new data point relative to the majority class. Because of this, the classes may begin to overlap or become noisy [39].
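The interpolation step described above can be sketched as follows. This is a simplified illustration under stated assumptions, not a full SMOTE implementation: the function name, the toy minority points, and the choice of k are hypothetical, and the sketch assumes numeric features only.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def smote_sample(minority, k=2, rng=random):
    """Create one synthetic point between a minority sample and a near minority neighbor."""
    base = rng.choice(minority)
    # k nearest minority neighbors of the chosen sample (excluding itself)
    neighbors = sorted(
        (p for p in minority if p is not base),
        key=lambda p: squared_distance(base, p),
    )[:k]
    neighbor = rng.choice(neighbors)
    gap = rng.random()  # random position on the segment, in [0, 1)
    return [b + gap * (n - b) for b, n in zip(base, neighbor)]


# Hypothetical minority-class points in a two-dimensional feature space.
minority = [[1.0, 1.0], [1.5, 2.0], [2.0, 1.0], [3.0, 3.0]]
rng = random.Random(0)
synthetic = [smote_sample(minority, k=2, rng=rng) for _ in range(5)]
print(synthetic)
```

Because every synthetic point lies on a segment between two minority samples, it never leaves the region spanned by the minority class, but it may still fall inside a region occupied by the majority class, which is the overlap risk noted above.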
He et al. [41] proposed ADASYN as an oversampling method for underrepresented classes. New synthetic data are constructed using pseudo-probabilistic oversampling [40]. ADASYN calculates a weight distribution over the points in each minority class based on how difficult each point is to learn [40], [42]. The higher the difficulty, the more synthetic data is generated for that point.
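The weighting idea can be sketched as follows, assuming that difficulty is measured as the share of majority-class points among the k nearest neighbors of each minority point. The function name, the toy points, and the total number of synthetic samples are hypothetical; the sketch returns unrounded counts and omits the generation step itself.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def adasyn_allocation(minority, majority, k=3, total_synthetic=10):
    """Return how many synthetic samples to create around each minority point."""
    labeled = [(p, 0) for p in minority] + [(p, 1) for p in majority]
    ratios = []
    for point in minority:
        neighbors = sorted(
            (item for item in labeled if item[0] is not point),
            key=lambda item: squared_distance(point, item[0]),
        )[:k]
        # difficulty: fraction of the k nearest neighbors that are majority points
        ratios.append(sum(label for _, label in neighbors) / k)
    total = sum(ratios)
    if total == 0:
        # no minority point has majority neighbors: spread the samples evenly
        return [total_synthetic / len(minority)] * len(minority)
    return [total_synthetic * r / total for r in ratios]


# Hypothetical data: the third minority point sits inside a majority cluster.
minority = [[0.0, 0.0], [0.2, 0.0], [5.0, 5.0]]
majority = [[5.0, 5.5], [5.5, 5.0], [4.5, 5.0], [9.0, 9.0]]
allocation = adasyn_allocation(minority, majority, k=3, total_synthetic=10)
print(allocation)
```

In this toy case the minority point surrounded by majority samples receives the largest share of the synthetic data.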

D. Recursive Feature Elimination (RFE)
Recursive feature elimination is a simple method for selecting only the important features in the input space [24]. Unlike principal component analysis (PCA), RFE deletes columns that are likely unnecessary. This may reduce the noise in the input, but at the risk of losing important information. PCA, in contrast, reduces the input dimension through a transformation; it therefore keeps as much information as possible, at the risk of retaining the noise in the input space. In this research, we use RFE as the feature selection technique. RFE selects features based on how they affect a particular model's performance. It works iteratively, removing the least important features until the optimal number of features remains.
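The elimination loop can be sketched as follows. In RFE the importance of each feature comes from a fitted model; to keep the sketch self-contained, the absolute correlation with the label is used here as a stand-in importance score, which is an assumption and not the method used in this research. The function names and the toy columns are hypothetical.

```python
def abs_correlation(column, target):
    """Stand-in importance score: absolute Pearson correlation with the label."""
    n = len(column)
    mx, my = sum(column) / n, sum(target) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(column, target))
    vx = sum((x - mx) ** 2 for x in column)
    vy = sum((y - my) ** 2 for y in target)
    if vx == 0 or vy == 0:
        return 0.0  # a constant column carries no information
    return abs(cov) / (vx * vy) ** 0.5


def recursive_feature_elimination(columns, target, n_keep, importance=abs_correlation):
    """Drop the weakest feature repeatedly until n_keep features remain."""
    remaining = dict(columns)
    while len(remaining) > n_keep:
        scores = {name: importance(values, target) for name, values in remaining.items()}
        weakest = min(scores, key=scores.get)
        del remaining[weakest]
    return list(remaining)


# Hypothetical toy dataset with a binary label.
target = [0, 0, 1, 1, 1, 0]
columns = {
    "rate": [0.1, 0.2, 0.9, 0.8, 0.7, 0.3],
    "constant": [1, 1, 1, 1, 1, 1],
    "noise": [0.5, 0.1, 0.4, 0.2, 0.6, 0.3],
}
selected = recursive_feature_elimination(columns, target, n_keep=1)
print(selected)
```

With a model-based score, the importance values are recomputed after each removal because the model is refitted on the remaining features, which is what makes the procedure recursive.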

E. Classification Algorithm
Machine learning algorithms build a model that recognizes the patterns in the training data and predicts the class of new, unseen data. For this, we chose four algorithms: KNN, RF, LR, and DT. Random Forest (RF) is a supervised machine learning algorithm that can be used for classification and regression problems [43]. It is an ensemble classifier used to produce more accurate classification results [42]. It is simple to use and solves problems by generating a forest of randomly constructed decision trees. During training, many decision trees are fitted so that together they produce the most accurate classification. In most cases, it gives acceptable results even without hyperparameter tuning. It is one of the most widely used techniques because it delivers accurate, rapid answers even for mixed, incomplete, and noisy datasets. RF has been shown to produce fewer classification errors than other classifiers. When building the different trees in RF, the nodes for splitting are selected with randomization to maximize efficiency [44].
A decision tree (DT) is a supervised learning method used to classify numerical and categorical data. In the DT algorithm, the class labels are stored in the leaf nodes, and attributes are evaluated at the interior nodes of the tree. The branches represent the outcomes of the attribute evaluations [45]. Attribute selection methods are used to determine the nodes. The tree has a predefined target variable. In addition, its leaf nodes are reached through decision-making steps that follow the top-down structure of the algorithm [46]. Its straightforward architecture allows it to process enormous volumes of data quickly. Some datasets require more complicated trees for classification. In such cases, the decision tree becomes more complex, and reaching a good result becomes more challenging. Another issue that may arise with decision tree algorithms is overfitting. To address this issue, certain leaf nodes of the DT need to be removed (pruned). Building a decision tree requires calculating entropy and information gain [47].
The K-Nearest Neighbor (KNN) algorithm is another supervised learning algorithm. It is unique among supervised learning algorithms in that it does not include a model training stage [48]. KNN compares each new data point with the existing data points in the training set and assigns it a value based on how similar they are. The prediction is made based on comparable features. In KNN, the Euclidean, Manhattan, or Hamming distance can be used to measure the separation between a test sample and each record of the training data [49]. The rows are then sorted in ascending order of distance, and the first K rows are selected. The test point is assigned the class that occurs most frequently among those K rows.
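The KNN procedure can be sketched in a few lines, assuming the Euclidean distance and a simple majority vote. The function name and the toy training points are hypothetical.

```python
from collections import Counter
from math import dist


def knn_predict(train_points, train_labels, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    # sort training records from nearest to farthest
    ranked = sorted(
        zip(train_points, train_labels),
        key=lambda pair: dist(pair[0], query),
    )
    top_labels = [label for _, label in ranked[:k]]
    return Counter(top_labels).most_common(1)[0][0]


# Hypothetical two-dimensional training data with two classes.
train_points = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
train_labels = ["normal", "normal", "normal", "dos", "dos", "dos"]

print(knn_predict(train_points, train_labels, (0.5, 0.5)))
print(knn_predict(train_points, train_labels, (5.5, 5.5)))
```

Every prediction requires computing the distance to all training records, since there is no separate training stage.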
Logistic Regression (LR) is another classification model, which takes a linear approach to classifying data [50]. The LR model is based on the likelihood that an instance belongs to a class. This probability is calculated by applying a logistic function to the data. The logistic function builds on linear regression: a linear function of the features is mapped to the probability that a specific data point belongs to the class. The logistic regression model can also be represented by the logit function [50]. These are commonly known as logit functions, and the resulting classification is known as log-linear classification [50].
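The relationship between the linear function, the logistic function, and the logit can be sketched as follows. The weights, bias, and feature values are hypothetical and are not fitted to any data.

```python
from math import exp, log


def logistic(z):
    """Map a linear score z to a probability in (0, 1)."""
    return 1 / (1 + exp(-z))


def logit(p):
    """Inverse of the logistic function: the log-odds of probability p."""
    return log(p / (1 - p))


def predict_probability(weights, bias, features):
    """Probability of the positive class for one sample under a linear model."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return logistic(z)


# Hypothetical model parameters and one sample.
weights, bias = [0.8, -0.4], 0.1
sample = [1.0, 2.0]
p = predict_probability(weights, bias, sample)
print(p, logit(p))
```

Applying the logit to the predicted probability recovers the linear score, which is why the model is linear in the log-odds.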

F. Classification Evaluation
To evaluate the effectiveness of machine learning algorithms, a confusion matrix is typically used. The counts of True Positives (TP), True Negatives (TN), False Positives (FP), and False Negatives (FN) in this matrix are combined to produce a variety of metrics [51]. Accuracy is the degree to which a model's predictions agree with the true values; it is measured as the proportion of all samples that are correctly classified [52]. Accuracy is calculated using eq. 2:

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \qquad (2)$$

Precision indicates what proportion of the instances predicted as positive are truly positive [53]. Precision is calculated using eq. 3:

$$\text{Precision} = \frac{TP}{TP + FP} \qquad (3)$$

Recall, also known as the True Positive Rate (TPR), is the proportion of actual positives that are correctly detected [54]. Recall is calculated using eq. 4:

$$\text{Recall} = \frac{TP}{TP + FN} \qquad (4)$$

The F1-score is the harmonic mean of precision and recall. It is calculated using eq. 5:

$$\text{F1} = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} \qquad (5)$$
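The four metrics described above can be computed directly from the confusion matrix counts. The sketch below uses hypothetical counts for a single class; in a multiclass setting, such counts would typically be obtained per class and then averaged, which is an assumption not stated in the text.

```python
def classification_metrics(tp, tn, fp, fn):
    """Compute accuracy, precision, recall, and F1-score from confusion counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Hypothetical counts for one class.
metrics = classification_metrics(tp=80, tn=90, fp=10, fn=20)
print(metrics)
```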

III. RESULT AND DISCUSSION
The flow of the experimental design is shown in Fig. 1 and Fig. 2. The training and testing data are taken from the original UNSW-NB15 dataset. Before being used to train and test the models, the data first passes through the standardization and normalization pre-processing stages. After pre-processing, we carried out two experiment scenarios that differ in the order of the balancing and feature selection tasks. We evaluated the classification result for each modification of the dataset resulting from balancing and feature selection. The dataset was then tested using the RF, DT, LR, and KNN models.

A. Balancing Prior to Feature Selection
In the first scenario, the imbalanced dataset was handled using SMOTE and ADASYN. The resulting balanced dataset with full features (columns) consists of 56,000 rows and 44 columns of training data. After the dataset with synthetic data is created, recursive feature elimination (RFE) reduces the number of features while maintaining the information held by the input side of the training dataset. Fig. 3 shows the accuracy of multiclass classification.
As can be seen in Table III, the decision tree achieves the best accuracy, followed by Random Forest, K-Nearest Neighbors, and Logistic Regression. In terms of training time, DT was the fastest at 7 seconds, followed by RF and LR, while KNN had the longest training time at 521 seconds.
The classification performance on the balanced dataset with complete features is better than on the imbalanced dataset. However, this comes at the cost of a heavier computing load, as indicated by the slower training time. As can be seen in Table IV and Table V, the computation time on the balanced dataset, for both SMOTE and ADASYN, is much longer than on the imbalanced dataset in Table III. This is because the balanced dataset has many more rows than the original data. A larger number of balanced samples usually leads to a better model but requires a longer training duration due to the iterative process. Tables VI and VII show the classification results with selected features on the balanced dataset. RFE simplifies the dataset by eliminating potential noise in the synthetic dataset. The number of columns decreased from 44 to 13, which greatly reduced the training duration, as can be seen in Table VI compared to Table VII. The accuracy, recall, precision, and F1-score decreased slightly. On the other hand, training is much faster due to the data reduction.

B. Feature Selection Prior to Balancing
In the second scenario, feature selection was carried out before the balancing task. Recursive feature elimination was applied to the original dataset, leaving 13 features. Fig. 4 shows the accuracy of multiclass classification. Comparing the classifiers' performance, logistic regression is clearly the least able to capture the pattern in the training dataset. Among the four compared algorithms, KNN and LR show low accuracy. Decision Tree is generally better than its competitors and was the quickest to execute. Decision Tree achieves slightly better accuracy, recall, precision, and F1-score than Random Forest, except on the dataset balanced with ADASYN. Tables IX and X show the classification results on the balanced dataset after synthetic data were created by SMOTE and ADASYN. Although the accuracy, precision, and recall did not reach the values achieved on the complete dataset, the gap narrowed. This shows that selecting a subset of the features slightly decreases classifier performance, but the result is still acceptable because the gap is small.
Balancing the dataset has a positive impact on the classifier's performance. However, it leads to slower training. This is acceptable because model training is not constrained by hardware: the model can be trained on high-performance computing infrastructure. Once trained, it can be deployed on various devices with lower computing resources without sacrificing speed.
In [55], it is observed that SMOTE outperforms ADASYN in helping the classifier improve its accuracy when the classes are highly imbalanced. The UNSW-NB15 dataset also has a high level of imbalance, as shown in Table III. Therefore, it is reasonable to see a better classification result on the SMOTE-balanced dataset. The authors of [37] compared ADASYN and SMOTE on five binary classification tasks. On the five datasets they investigated, they found that the ADASYN-balanced dataset leads to better classification performance. Their datasets are imbalanced, but the degree of imbalance is not as high as in UNSW-NB15. Our finding is in line with [55], where the SMOTE-balanced dataset achieved better classification performance as the degree of imbalance increased.
In the experiment, we found that SMOTE provides better synthetic data than ADASYN for the highly imbalanced UNSW-NB15 dataset. According to Tables VI and VII, when data balancing is followed by feature selection, DT has the highest accuracy, at 84%, whereas KNN has the lowest, at 31%. Another interesting finding was that, even though KNN took almost the same time, it produced poor results. The best training time, between 6 and 8 seconds, was achieved by DT. According to Tables IX and X, where feature selection was followed by data balancing, DT obtained the highest accuracy, 84.6%, while LR obtained the lowest, 21.09%. Tables IX and X also show that DT achieves better results than KNN, but KNN has a shorter training time, 8 seconds compared to 18 to 20 seconds for DT.
The order of pre-processing plays an important role in reducing computing complexity, as reflected in the training time. When dataset balancing is carried out before the feature selection task, it produces a balanced dataset with many more rows and the complete set of features. More columns greatly increase the data size, which directly slows down training.
The findings of this study have to be seen in the light of some limitations. We only conducted experiments on imbalanced data using UNSW-NB15. It is therefore difficult to determine whether the results are specific to this single dataset. In other words, it remains to be shown whether they also apply to imbalanced data in other datasets.

IV. CONCLUSION
The types of network intrusion (attacks) are naturally imbalanced. Popular attacks like DDOS dominate attack incidents, and this is reflected in IDS datasets. A machine learning model cannot work well on an imbalanced training dataset, which leads to failure to recognize the minority classes. Imbalanced dataset handling eases the imbalance problem and improves classifier performance. We observed the performance of four classic machine learning algorithms and found that Decision Tree consistently achieved the best accuracy compared to RF, KNN, and LR. The result was obtained using the UNSW-NB15 IDS dataset. SMOTE and ADASYN handle the imbalanced dataset by creating synthetic data to increase the number of minority samples. On UNSW-NB15, a highly imbalanced dataset, SMOTE provides better synthetic data for the classification task. In this research, we also observed the impact of the processing order and found that carrying out feature selection before creating synthetic data leads to more efficient computation without sacrificing the recognition rate. Currently, we cover one publicly available dataset. In future research, it would be interesting to observe the impact of imbalance handling and feature selection on the recognition rate across various datasets. Experimenting with more advanced synthetic data creation algorithms, such as GAN and CGAN, is also a future direction. Another is to use neural networks and deep learning algorithms, such as autoencoders, to reduce the data dimension without losing too much information.

Fig. 2 Research Framework for RFE before imbalance handling.

Fig. 3 Accuracy of Classification on First Experiment Scenario

Fig. 4 Accuracy of classification in the second scenario

Table VIII shows the classification results on the simplified imbalanced dataset after RFE was applied to the original training dataset. The recognition rate decreased slightly compared to the original imbalanced dataset. However, this loss is offset by a lighter computing load. As can be seen in Table VIII, the computation time halved when classifying with the selected features compared to the imbalanced dataset with full features in table IX.