Comparison of Apache SparkSQL and Oracle Performance: Case Study of Data Cleansing Process

Ilma Nur Hidayati - School of Industrial and System Engineering, Telkom University, Bandung, Indonesia
Tien Fabrianti Kusumasari - School of Industrial and System Engineering, Telkom University, Bandung, Indonesia
Faqih Hamami - School of Industrial and System Engineering, Telkom University, Bandung, Indonesia


DOI: http://dx.doi.org/10.30630/joiv.6.1-2.928

Abstract


A high-quality dataset is a valuable asset for a company, because the data can be processed into information that improves decision-making. However, as the volume of data grows over time, data quality tends to decline. Good data management is therefore important to keep data quality in line with company standards. One such effort is data cleansing, which removes errors, inaccuracies, duplicates, format discrepancies, and similar problems from the data. Apache Spark is an engine for analyzing large amounts of data, and Oracle Database is a database management system. Each has its own strengths, and both can be used to analyze structured data with SQL. This study compared the performance of Spark and Oracle based on query processing time. Both tools were tested on data cleansing queries run against a dataset of millions of rows, and their performance was assessed through quantitative analysis. The results showed that query processing times differed between the two tools. Apache Spark is rated better because its query processing time was relatively shorter than that of Oracle Database. It can be concluded that Oracle is more reliable for storing complex data models than for analyzing large data. Future research could add other comparison aspects, such as memory and CPU usage, and could apply query optimization techniques to enrich the query experiments.

Keywords


Spark; Oracle; cleansing; processing time; comparison.

