Impact of Data Balancing and Feature Selection on Machine Learning-based Network Intrusion Detection

Azhari Barkah - Universitas Amikom Purwokerto, Purwokerto Utara, Banyumas, 55127, Indonesia
Siti Selamat - Universiti Teknikal Malaysia Melaka, Melaka, Malaysia
Zaheera Abidin - Universiti Teknikal Malaysia Melaka, Melaka, Malaysia
Rizki Wahyudi - Universitas Amikom Purwokerto, Purwokerto Utara, Banyumas, 55127, Indonesia

Imbalanced datasets are a common problem in supervised machine learning. When one class dominates the training data, the model learns the majority classes far better than the minority classes and therefore recognizes them more reliably. Imbalanced data arises naturally in real-world domains such as disease records and network traffic. In network intrusion, for example, DDoS attacks occur more often than R2L (remote-to-local) attacks, and the composition of attack types is correspondingly imbalanced in public Intrusion Detection System (IDS) datasets such as NSL-KDD and UNSW-NB15. Researchers have proposed many techniques for balancing such data, either by duplicating minority-class samples or by generating synthetic ones. The Synthetic Minority Oversampling Technique (SMOTE) and Adaptive Synthetic (ADASYN) algorithms construct synthetic samples for the minority classes. Meanwhile, machine learning algorithms capture the patterns in labeled data from the input features, but not all input features have an equal impact on the output (the predicted class or value). Some features are interrelated, and some are misleading, so the important features should be selected to produce a good model. In this research, we implement the recursive feature elimination (RFE) technique to select important features from the available dataset. In our experiments, SMOTE provides a better synthetic dataset than ADASYN for the highly imbalanced UNSW-NB15 dataset. RFE feature selection slightly reduces the model's accuracy but improves the training speed. The Decision Tree classifier consistently achieves a better recognition rate than Random Forest and KNN.
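To illustrate the interpolation idea behind SMOTE, the following is a minimal sketch in plain Python. It is not the experimental code used in this study. The function name `smote_sketch`, its parameters `k` and `seed`, and the toy points are assumptions made for illustration only. Each synthetic sample is placed at a random position on the line segment between a minority sample and one of its k nearest minority-class neighbors. ADASYN uses the same interpolation but generates more samples around minority points that have many majority-class neighbors; that weighting is not shown here.

```python
import math
import random


def smote_sketch(minority, n_new, k=5, seed=0):
    """Return n_new synthetic points interpolated between minority samples.

    minority: list of equal-length numeric tuples (minority-class samples).
    k: number of nearest minority neighbours considered per base sample.
    """
    if len(minority) < 2:
        raise ValueError("need at least two minority samples to interpolate")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority-class neighbours of the base sample (itself excluded)
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(base, p),
        )[:k]
        other = rng.choice(neighbours)
        gap = rng.random()  # position along the segment from base to other
        synthetic.append(tuple(b + gap * (o - b) for b, o in zip(base, other)))
    return synthetic


# Toy minority class with two features (hypothetical values)
minority = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (1.1, 1.3)]
new_points = smote_sketch(minority, n_new=6, k=2)
print(new_points)
```

Because every synthetic point is a convex combination of two real minority samples, the new points stay inside the region already covered by the minority class instead of repeating existing rows exactly.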


Intrusion Detection; Feature Selection; Imbalance; SMOTE; ADASYN
