Classification of Student Graduation using Naïve Bayes by Comparing between Random Oversampling and Feature Selections of Information Gain and Forward Selection

Dony Fahrudy - Universitas Islam Negeri Sunan Kalijaga
Shofwatul 'Uyun - Universitas Islam Negeri Sunan Kalijaga


DOI: http://dx.doi.org/10.30630/joiv.6.4.982

Abstract


Class-imbalanced data with high attribute dimensions frequently cause problems in classification: the uneven number of samples per class and the irrelevant attributes that must be processed both degrade an algorithm's performance during computation. Techniques are therefore needed to handle the class imbalance and to select features, reducing data complexity and removing irrelevant attributes. This study applied the random oversampling (ROs) method to handle the class-imbalanced data and compared two feature selections (information gain and forward selection) to determine which is superior, more effective, and more appropriate to apply. The selected features were then used to classify student graduation with a classification model built on the Naïve Bayes algorithm. The results show an increase in the average accuracy of the Naïve Bayes method: 81.83% without ROs preprocessing or feature selection, 83.84% with ROs, 86.03% with information gain (3 selected features), and 86.42% with forward selection (2 selected features); that is, accuracy rose by 4.2 percentage points from no preprocessing to information gain and by 4.59 percentage points from no preprocessing to forward selection. The best feature selection was therefore forward selection with 2 selected features (the 8th-semester GPA and the overall GPA), and both ROs and the two feature selections were shown to improve the performance of the Naïve Bayes method.
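To make the pipeline concrete, the following is a minimal Python sketch of the workflow the abstract describes (oversampling, then the two feature selections, then Naïve Bayes), assuming scikit-learn and imbalanced-learn. The file name students.csv, the label column on_time, and the choice of GaussianNB are illustrative assumptions, not details taken from the paper; mutual information is used here as the information-gain criterion.

    # Sketch of the abstract's pipeline: ROs, two feature selections, Naive Bayes.
    import pandas as pd
    from imblearn.over_sampling import RandomOverSampler
    from sklearn.feature_selection import mutual_info_classif, SequentialFeatureSelector
    from sklearn.naive_bayes import GaussianNB
    from sklearn.model_selection import cross_val_score

    # Hypothetical student dataset: semester GPAs plus a graduation label.
    df = pd.read_csv("students.csv")                     # assumed file name
    X, y = df.drop(columns=["on_time"]), df["on_time"]   # assumed label column

    # Step 1: random oversampling (ROs) to balance the graduation classes.
    X_bal, y_bal = RandomOverSampler(random_state=0).fit_resample(X, y)

    # Step 2a: information-gain (mutual information) ranking; keep the top 3 features.
    gain = pd.Series(mutual_info_classif(X_bal, y_bal), index=X.columns)
    ig_features = gain.nlargest(3).index.tolist()

    # Step 2b: wrapper-style forward selection around Naive Bayes; keep 2 features.
    fs = SequentialFeatureSelector(GaussianNB(), n_features_to_select=2,
                                   direction="forward")
    fs.fit(X_bal, y_bal)
    fs_features = X.columns[fs.get_support()].tolist()

    # Step 3: compare cross-validated Naive Bayes accuracy on each feature subset.
    for name, cols in [("information gain", ig_features),
                       ("forward selection", fs_features)]:
        acc = cross_val_score(GaussianNB(), X_bal[cols], y_bal, cv=10).mean()
        print(f"{name}: {cols} -> mean accuracy {acc:.4f}")

Note that oversampling the full dataset before cross-validation, as sketched here, mirrors the preprocessing order implied by the abstract, but it can let duplicated minority samples appear in both training and test folds; oversampling inside each training fold gives a more conservative accuracy estimate.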


Keywords


Forward Selection; Information Gain; Student Graduation; Naïve Bayes; ROs.
