Improving Data Reliability Assessment in ETL Processes through Quality Scoring Technique in Data Analytics

Nor Famiera Atika Razali - Faculty of Computer Science and Information Technology, Universiti Putra Malaysia, Serdang, Selangor, Malaysia
Salmi Baharom - Faculty of Computer Science and Information Technology, Universiti Putra Malaysia, Serdang, Selangor, Malaysia
Salfarina Abdullah - Faculty of Computer Science and Information Technology, Universiti Putra Malaysia, Serdang, Selangor, Malaysia
Novia Indriaty Admodisastro - Faculty of Computer Science and Information Technology, Universiti Putra Malaysia, Serdang, Selangor, Malaysia


DOI: http://dx.doi.org/10.62527/joiv.8.4.3632

Abstract


Reliable data is the foundation of relevant and accurate data analysis. A technique and a measurement are essential for evaluating the reliability of current data and for establishing a baseline for ongoing improvement initiatives. Without tools or visualizations, data engineers may find it challenging to monitor and maintain the reliability of the large volumes of data produced by the extraction, transformation, and loading (ETL) process. Data reliability assessment helps analyze how reliable the data is and provides information on its present state before any analytics begins. The proposed technique hinges on two components: the metrics and measurements that define data reliability, and a dashboard platform on which the user sets the weight of the data and on which the final output, the data reliability score, is displayed. The score indicates whether the data needs improvement or whether an organization can proceed with data analytics. The technique considers the ETL procedures used to gather the datasets. Data significance, or weight, was determined according to analytics needs and preferences, indicating an acceptable score for generating insights. Ultimately, the data reliability assessment metrics technique gives an overall picture of the data’s reliability in a single view tailored to the intended analysis. This approach boosts confidence among data practitioners and stakeholders, especially those who rely on findings generated from data analysis. Furthermore, the overview helps improve the current state of the data, as the derived score helps identify possible areas of improvement in the ETL process. An assessment of the accuracy and efficiency of the proposed technique in measuring data reliability also drew positive feedback.
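The abstract does not give the scoring formula, so the following is only a minimal sketch of how a user-weighted reliability score could be computed and compared against an acceptable score. It assumes the score is a weighted mean of per-metric values in the range 0 to 1; the function names, the metric names (one per ETL stage), the weights, and the thresholds are all hypothetical and are not taken from the paper.

```python
from typing import Mapping


def reliability_score(metrics: Mapping[str, float],
                      weights: Mapping[str, float]) -> float:
    """Return a weighted mean of per-metric scores as a percentage.

    Assumption: each metric value lies in [0, 1] and each weight is a
    non-negative number chosen by the user. This is an illustrative
    formula, not the one defined in the paper.
    """
    total_weight = sum(weights[name] for name in metrics)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    weighted_sum = sum(metrics[name] * weights[name] for name in metrics)
    return 100.0 * weighted_sum / total_weight


def ready_for_analytics(score: float, acceptable: float) -> bool:
    """True when the score meets the user-defined acceptable score."""
    return score >= acceptable


# Hypothetical per-stage metric values and user-assigned weights.
metrics = {"extraction": 1.0, "transformation": 0.5, "loading": 0.75}
weights = {"extraction": 1, "transformation": 2, "loading": 1}

score = reliability_score(metrics, weights)
print(score)                             # 68.75
print(ready_for_analytics(score, 60.0))  # True
print(ready_for_analytics(score, 80.0))  # False
```

In this sketch, raising the weight of a weak metric lowers the overall score, which is one way a single number could point to the ETL stage most in need of improvement.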

Keywords


Data reliability; extraction; transformation; loading; data reliability metrics; data weight
