Evaluating Web Scraping Performance Using XPath, CSS Selector, Regular Expression, and HTML DOM With Multiprocessing Technical Applications

Irfan Darmawan - Telkom University, Bandung, Indonesia
Muhamad Maulana - Siliwangi University, Tasikmalaya, Indonesia
Rohmat Gunawan - Siliwangi University, Tasikmalaya, Indonesia
Nur Widiyasono - Siliwangi University, Tasikmalaya, Indonesia


DOI: http://dx.doi.org/10.30630/joiv.6.4.1525

Abstract


Data collection has become a necessity, especially since many sources of data on the internet can be used for a variety of needs. The main activity in data collection is gathering quality information that can be analyzed and used to support decisions or provide evidence. Retrieving data from the internet is known as web scraping, and several scraping methods are in common use. Because of the volume of data scattered across the internet, large-scale web scraping is time-consuming. Multiprocessing, which applies the concept of parallelism, can help complete such a job. This study aimed to determine the performance of web scraping methods when multiprocessing is applied. Testing was done by scraping data from a predetermined target website. Four web scraping methods, CSS Selector, HTML DOM, Regex, and XPath, were selected for the experiment and measured on four parameters: CPU usage, memory usage, execution time, and bandwidth usage. Based on the experimental data, the Regex method had the lowest CPU and memory usage, XPath required the least execution time, and CSS Selector used the least bandwidth. Applying multiprocessing to each web scraping method reduced memory usage, execution time, and bandwidth usage compared with single processing.
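
The kind of comparison described above can be illustrated with a minimal Python sketch. It is not the authors' test harness: the in-memory `PAGES` list, the function names, and the pool size of 4 are assumptions made for illustration. Only two of the four methods (Regex and XPath) are shown, using the standard library alone, and only execution time is measured. The XPath variant relies on the limited XPath subset in `xml.etree.ElementTree` and assumes well-formed markup.

```python
import re
import time
import xml.etree.ElementTree as ET
from multiprocessing import Pool

# Hypothetical in-memory pages standing in for a real target website.
PAGES = [
    f"<html><body><h1 class='title'>Article {i}</h1><p>Body {i}</p></body></html>"
    for i in range(200)
]

TITLE_RE = re.compile(r"<h1 class='title'>(.*?)</h1>")


def scrape_regex(page):
    """Extract the title with a regular expression."""
    match = TITLE_RE.search(page)
    return match.group(1) if match else None


def scrape_xpath(page):
    """Extract the title with an XPath query (ElementTree's XPath subset)."""
    root = ET.fromstring(page)
    node = root.find(".//h1[@class='title']")
    return node.text if node is not None else None


def timed(func, pages, processes=None):
    """Run func over pages, serially or in a process pool, and time it."""
    start = time.perf_counter()
    if processes is None:
        results = [func(page) for page in pages]
    else:
        with Pool(processes) as pool:
            results = pool.map(func, pages)
    return results, time.perf_counter() - start


if __name__ == "__main__":
    for func in (scrape_regex, scrape_xpath):
        _, single = timed(func, PAGES)
        _, multi = timed(func, PAGES, processes=4)  # pool size is an assumption
        print(f"{func.__name__}: single={single:.4f}s multi={multi:.4f}s")
```

With pages this small and no network access, process start-up cost may outweigh any gain from parallelism. The sketch shows only the shape of the measurement and does not reproduce the reported results.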

Keywords


Multiprocessing; scraping; website; HTML DOM.
