Text Classification Using Genetic Programming with Implementation of Map Reduce and Scraping

Wirarama Wedashwara; Budi Irmawati; Heri Wijayanto; I Wayan Agus Arimbawa; Vandha Pradwiyasma Widartha

doi:10.30630/joiv.7.2.1813

Wirarama Wedashwara - University of Mataram, Mataram, Indonesia
Budi Irmawati - University of Mataram, Mataram, Indonesia
Heri Wijayanto - University of Mataram, Mataram, Indonesia
I Wayan Agus Arimbawa - Seoul National University, Seoul, Republic of Korea
Vandha Pradwiyasma Widartha - Telkom University, Bandung, Indonesia

Abstract

Classifying text documents in online media is a big data problem that requires automation. Classification accuracy can decrease when many terms are ambiguous between classes. Hadoop MapReduce is a parallel processing framework that has been widely used for text processing on big data. This study presents text classification using genetic programming, with text pre-processing using Hadoop map-reduce and data collection using web scraping. Genetic programming is used to perform association rule mining (ARM) before text classification in order to analyze big data patterns. The data are articles from ScienceDirect retrieved with three keywords. The study aims to perform text classification with ARM-based data pattern analysis, combining a data collection system based on web scraping, pre-processing using map-reduce, and text classification using genetic programming. Web scraping collected the data and reduced duplicates by as much as 17718. Map-reduce performed tokenization and stop-word removal, yielding 36639 terms: 5189 unique terms and 31450 common terms. ARM evaluation with different amounts of data shows that the multi-tree can produce more and longer rules with better support. The multi-tree also produces more specific rules and better ARM performance than a single tree. Text classification evaluation shows that the single tree achieves better accuracy (0.7042) than the decision tree (0.6892), while the multi-tree is the lowest (0.6754). The evaluation also shows that the ARM results are not in line with the classification results: for ARM, the multi-tree shows the best result (0.3904), ahead of the decision tree (0.3588), and the single tree is the lowest (0.356).
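
The tokenization and stop-word removal step summarized in the abstract can be illustrated with a minimal map-reduce style sketch in plain Python. This is not the authors' Hadoop job: the stop-word list, the function names (map_tokens, reduce_counts, term_counts), and the two sample documents are assumptions made only for illustration.

```python
import re
from collections import defaultdict

# Hypothetical stop-word list; the list used in the study is not given.
STOP_WORDS = {"a", "an", "and", "for", "in", "is", "of", "on", "the", "to",
              "using", "with"}


def map_tokens(document):
    """Map step: emit (term, 1) for every token that is not a stop word."""
    for token in re.findall(r"[a-z]+", document.lower()):
        if token not in STOP_WORDS:
            yield token, 1


def reduce_counts(pairs):
    """Reduce step: sum the counts emitted for each term."""
    totals = defaultdict(int)
    for term, count in pairs:
        totals[term] += count
    return dict(totals)


def term_counts(documents):
    """Run the map step over every document, then reduce the combined output."""
    pairs = (pair for doc in documents for pair in map_tokens(doc))
    return reduce_counts(pairs)


# Hypothetical sample documents standing in for scraped article text.
documents = [
    "Text classification using genetic programming",
    "Genetic programming for association rule mining",
]
counts = term_counts(documents)
```

In a real Hadoop deployment the map and reduce functions would run as separate tasks over distributed input splits; here they run in a single process so that the data flow is easy to follow.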

Keywords

Text Classification; Genetic Programming; Web Scraping; Map-reduce
