Automatic Topic-Based Web Page Classification Using Deep Learning

Siti Hawa Apandi - Universiti Malaysia Pahang, 26600 Pekan, Pahang, Malaysia
Jamaludin Sallim - Universiti Malaysia Pahang, 26600 Pekan, Pahang, Malaysia
Rozlina Mohamed - Universiti Malaysia Pahang, 26600 Pekan, Pahang, Malaysia
Norkhairi Ahmad - Universiti Kuala Lumpur, Bandar Baru Bangi, 43650 Selangor, Malaysia


DOI: http://dx.doi.org/10.30630/joiv.7.3-2.1616

Abstract


People frequently browse the internet on smartphones, laptops, or desktop computers to search for information on the web. As the amount of information on the web increases, the number of web pages grows day by day. Automatic topic-based web page classification is used to manage this excessive number of web pages by classifying them into different categories based on their content. Different machine learning algorithms have been employed as web page classifiers to categorise web pages. However, there is a lack of studies that review the classification of web pages using deep learning. This study reviews the automatic topic-based classification of web pages using deep learning as proposed by key researchers. The relevant research papers were selected from reputable research databases. The review examined the dataset, features, algorithm, and pre-processing used in the classification of web pages, as well as the document representation technique and the performance of the web page classification model. The document representation technique used to represent the web page features is an important aspect of web page classification because it affects the performance of the classification model. The integral web page feature is the textual content. The review found that image-based web page classification showed higher performance than text-based web page classification. Because there is a lack of matrix representations that can effectively handle long web page text content, a new document representation technique, the word cloud image, can be used to visualize the words extracted from the text content of a web page.
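
As a rough illustration of the word cloud idea mentioned above, the following minimal sketch extracts the visible text of a web page and counts word frequencies, which is the kind of input a word cloud image is typically drawn from. It is not the method of any reviewed paper: the names `TextExtractor`, `word_frequencies`, and `STOP_WORDS`, the tiny stop-word list, and the tokenisation rule are all assumptions made for this example, and rendering the frequencies as an image is left to a separate imaging step.

```python
import re
from collections import Counter
from html.parser import HTMLParser

# Hypothetical, deliberately tiny stop-word list (an assumption for this sketch).
STOP_WORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "for", "on"}


class TextExtractor(HTMLParser):
    """Collects visible text, skipping the content of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.parts.append(data)


def word_frequencies(html, top_n=50):
    """Return the top_n most frequent words in the page's visible text."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    text = " ".join(parser.parts).lower()
    # Assumed tokenisation: runs of ASCII letters only.
    words = re.findall(r"[a-z]+", text)
    counts = Counter(w for w in words if w not in STOP_WORDS and len(w) > 2)
    return dict(counts.most_common(top_n))


if __name__ == "__main__":
    page = (
        "<html><head><style>p{color:red}</style>"
        "<script>var football = 1;</script></head>"
        "<body><h1>Football news</h1>"
        "<p>The football match and the football league.</p></body></html>"
    )
    print(word_frequencies(page))
```

In a pipeline of the kind the abstract describes, the resulting frequency table would be rendered as an image and passed to an image classifier; that step is outside this sketch.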

Keywords


Deep learning; document representation technique; machine learning; topic classification; web page classification; web page classifier

