The Effects of Imbalanced Datasets on Machine Learning Algorithms in Predicting Student Performance

Khaled Mahmud Sujon - Software Engineering Research Group, Faculty of Computing, Universiti Teknologi Malaysia (UTM), Johor, Malaysia
Rohayanti Hassan - Software Engineering Research Group, Faculty of Computing, Universiti Teknologi Malaysia (UTM), Johor, Malaysia
Alif Ridzuan Khairudin - Software Engineering Research Group, Faculty of Computing, Universiti Teknologi Malaysia (UTM), Johor, Malaysia
Sim Hiew Moi - Software Engineering Research Group, Faculty of Computing, Universiti Teknologi Malaysia (UTM), Johor, Malaysia
Muhammad Luqman Mohd Shafie - Software Engineering Research Group, Faculty of Computing, Universiti Teknologi Malaysia (UTM), Johor, Malaysia
Zainuri Saringat - Faculty of Computer Sciences and Information Technology, Universiti Tun Hussein Onn Malaysia (UTHM), Parit Raja, Malaysia
Aldo Erianda - Department of Information Technology, Politeknik Negeri Padang, Padang, Indonesia


DOI: http://dx.doi.org/10.62527/joiv.8.3-2.2449

Abstract


Predictive analytics technologies are becoming increasingly popular in higher education institutions. Student grades are among the most important performance indicators that educators can use to predict academic achievement. Over the last several decades, researchers have developed numerous techniques and machine learning approaches for predicting student grades. Despite this work, a practical model is still lacking, particularly for imbalanced datasets. This study examines the impact of imbalanced datasets on the accuracy and reliability of machine learning models in predicting student performance. First, it compares the performance of two popular machine learning algorithms, Logistic Regression (LR) and Random Forest (RF), in predicting student grades. Second, it examines how imbalanced datasets affect these algorithms' performance metrics and generalization capabilities. Results indicate that RF, with an accuracy of 98%, outperforms LR, which achieved 91% accuracy. Furthermore, the performance of both models is significantly affected by imbalanced datasets. In particular, LR struggles to predict minority classes accurately, while RF faces similar difficulties to a lesser extent. Addressing class imbalance is therefore crucial, because it notably affects model bias and prediction accuracy. This is especially important for higher education institutions aiming to improve the accuracy of student grade predictions, and it underscores the need for balanced datasets to achieve robust predictive models.
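
The point that overall accuracy can hide poor minority-class prediction can be illustrated with a minimal sketch. The grade labels, class counts, and the always-predict-majority classifier below are hypothetical and are not taken from the study's data or models; the sketch only shows why per-class recall is worth reporting alongside accuracy when classes are imbalanced.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label matches the true label."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def per_class_recall(y_true, y_pred):
    """Recall for each class: correct predictions divided by class size."""
    support = Counter(y_true)
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return {label: hits[label] / support[label] for label in support}


def macro_recall(y_true, y_pred):
    """Unweighted mean of per-class recall, so every class counts equally."""
    recalls = per_class_recall(y_true, y_pred)
    return sum(recalls.values()) / len(recalls)


# Hypothetical imbalanced grade distribution (assumption, not study data).
y_true = ["pass"] * 90 + ["distinction"] * 8 + ["fail"] * 2

# Degenerate classifier that always predicts the majority class.
y_pred = ["pass"] * len(y_true)

print("accuracy:", accuracy(y_true, y_pred))
print("per-class recall:", per_class_recall(y_true, y_pred))
print("macro recall:", macro_recall(y_true, y_pred))
```

In this toy setting the classifier scores high on accuracy while never identifying a single minority-class student, which is the kind of bias the abstract attributes to imbalanced data.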

Keywords


Imbalanced dataset; machine learning; higher education institute; multi-class prediction
