Classification of Student Graduation using Naïve Bayes by Comparing between Random Oversampling and Feature Selections of Information Gain and Forward Selection

Dony Fahrudy - Universitas Islam Negeri Sunan Kalijaga
Shofwatul 'Uyun - Universitas Islam Negeri Sunan Kalijaga


DOI: http://dx.doi.org/10.30630/joiv.6.4.982

Abstract


Class-imbalanced data with high attribute dimensions frequently cause problems in classification: the uneven number of samples per class and the irrelevant attributes that must be processed both degrade an algorithm's performance during computation. Techniques are therefore needed to handle the class imbalance and to select features, reducing data complexity and removing irrelevant attributes. This study applied the random oversampling (ROs) method to handle the class-imbalanced data and compared two feature selections (information gain and forward selection) to determine which is superior, more effective, and more appropriate to apply. The selected features were then used to classify student graduation with a classification model built on the Naïve Bayes algorithm. The results show an increase in the average accuracy of the Naïve Bayes method: 81.83% without ROs preprocessing or feature selection, 83.84% with ROs, 86.03% with information gain (3 selected features), and 86.42% with forward selection (2 selected features); that is, accuracy rose by 4.2 percentage points from no preprocessing to information gain and by 4.59 percentage points from no preprocessing to forward selection. The best feature selection was therefore forward selection with 2 selected features (the 8th-semester GPA and the overall GPA), and both ROs and the two feature selections were shown to improve the performance of the Naïve Bayes method.
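To make the pipeline concrete, the following is a minimal Python sketch of the workflow the abstract describes (oversampling, then the two feature selections, then Naïve Bayes), assuming scikit-learn and imbalanced-learn. The file name students.csv, the label column on_time, and the choice of GaussianNB are illustrative assumptions, not details taken from the paper; mutual information is used here as the information-gain criterion.

    # Sketch of the abstract's pipeline: ROs, two feature selections, Naive Bayes.
    import pandas as pd
    from imblearn.over_sampling import RandomOverSampler
    from sklearn.feature_selection import mutual_info_classif, SequentialFeatureSelector
    from sklearn.naive_bayes import GaussianNB
    from sklearn.model_selection import cross_val_score

    # Hypothetical student dataset: semester GPAs plus a graduation label.
    df = pd.read_csv("students.csv")                     # assumed file name
    X, y = df.drop(columns=["on_time"]), df["on_time"]   # assumed label column

    # Step 1: random oversampling (ROs) to balance the graduation classes.
    X_bal, y_bal = RandomOverSampler(random_state=0).fit_resample(X, y)

    # Step 2a: information-gain (mutual information) ranking; keep the top 3 features.
    gain = pd.Series(mutual_info_classif(X_bal, y_bal), index=X.columns)
    ig_features = gain.nlargest(3).index.tolist()

    # Step 2b: wrapper-style forward selection around Naive Bayes; keep 2 features.
    fs = SequentialFeatureSelector(GaussianNB(), n_features_to_select=2,
                                   direction="forward")
    fs.fit(X_bal, y_bal)
    fs_features = X.columns[fs.get_support()].tolist()

    # Step 3: compare cross-validated Naive Bayes accuracy on each feature subset.
    for name, cols in [("information gain", ig_features),
                       ("forward selection", fs_features)]:
        acc = cross_val_score(GaussianNB(), X_bal[cols], y_bal, cv=10).mean()
        print(f"{name}: {cols} -> mean accuracy {acc:.4f}")

Note that oversampling the full dataset before cross-validation, as sketched here, mirrors the preprocessing order implied by the abstract, but it can let duplicated minority samples appear in both training and test folds; oversampling inside each training fold gives a more conservative accuracy estimate.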


Keywords


Forward Selection; Information Gain; Student Graduation; Naïve Bayes; ROs.
