Classification of Arabic Documents with Five Classifier Models Using Machine Learning

Esraa Najjar - General Directorate of Education in Najaf Governorate, Najaf, Iraq
Nibras. A. Alkhaykanee - Open Educational College Al-Qadisiya Center, Al-Qadisiya, Iraq
Aqeel Majeed Breesam - Institute of Medical Technology/ Baghdad, Middle Technical University, Baghdad, Iraq


DOI: http://dx.doi.org/10.62527/joiv.9.1.2539

Abstract


Automated document classification is increasingly important for many applications, particularly given the rapid growth of Arabic-language documents on the internet. Text classification is a big-data problem, and classifying the content of typical Arabic text remains a significant and difficult challenge. Classifying a document means assigning it to the appropriate class or category. The main goal of this work is to evaluate the effectiveness of machine learning (ML) algorithms on Arabic documents after applying preprocessing techniques, which are a vital part of the methodology. The study compares the performance of five well-known classifiers in categorizing the documents: Support Vector Machines (SVM), Naive Bayes, Logistic Regression, K-Nearest Neighbors, and Random Forest. The findings indicate that SVM achieved the best performance, with an accuracy of 98%, surpassing all other algorithms evaluated in this study.


Keywords


Arabic text classification; machine learning; data preprocessing; classifier; TF-IDF
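
As an illustration of the kind of pipeline the abstract and keywords describe, the following minimal sketch builds TF-IDF vectors and classifies a document with K-Nearest Neighbors, one of the five classifiers named above. It is not the authors' implementation: the toy corpus, the whitespace tokenizer (a stand-in for real Arabic preprocessing), the smoothed IDF formula, the cosine similarity measure, and all function names are assumptions made for this example.

```python
import math
from collections import Counter

# Toy labelled corpus (hypothetical; not the dataset used in the study).
TRAIN = [
    ("مباراة فريق كرة هدف", "sports"),
    ("لاعب فريق مباراة", "sports"),
    ("بنك سوق أسعار اقتصاد", "economy"),
    ("استثمار سوق بنك", "economy"),
]


def tokenize(text):
    # Assumption: whitespace splitting stands in for real preprocessing
    # (normalization, stop-word removal, stemming).
    return text.split()


def fit_idf(docs):
    # Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1.
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    return {term: math.log((1 + n) / (1 + count)) + 1 for term, count in df.items()}


def tfidf(tokens, idf):
    # Raw term frequency times IDF, scaled to unit length.
    # Terms unseen during fitting are ignored.
    tf = Counter(t for t in tokens if t in idf)
    vec = {t: c * idf[t] for t, c in tf.items()}
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {t: w / norm for t, w in vec.items()} if norm else {}


def cosine(a, b):
    # Both vectors are unit length, so the dot product is the cosine.
    return sum(w * b.get(t, 0.0) for t, w in a.items())


def knn_predict(text, train_vecs, idf, k=3):
    # Majority vote among the k most similar training documents.
    query = tfidf(tokenize(text), idf)
    ranked = sorted(train_vecs, key=lambda pair: cosine(query, pair[0]), reverse=True)
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


train_tokens = [tokenize(text) for text, _ in TRAIN]
idf = fit_idf(train_tokens)
train_vecs = [
    (tfidf(tokens, idf), label)
    for tokens, (_, label) in zip(train_tokens, TRAIN)
]

print(knn_predict("مباراة كرة فريق", train_vecs, idf))
print(knn_predict("سوق بنك", train_vecs, idf))
```

In practice, the other four classifiers would consume the same TF-IDF vectors, so the feature extraction step can be shared and only the final prediction function swapped.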
