Data Quality Evaluation and Improvement for Machine Learning

Zhou Yuhan
Oct 5, 2024

Updated: Jul 17

Figure 1. Quality requirements mapping to applications

Machine learning (ML) has drawn great attention from both academia and industry over the past decades and continues to achieve impressive human-level performance on nontrivial tasks such as image classification, voice recognition, natural language processing, and autopiloting. Both data and algorithms are critical to ensuring the performance, fairness, robustness, reliability, and scalability of ML systems. However, artificial intelligence (AI) researchers and practitioners overwhelmingly concentrate on algorithms while undervaluing the impact of data quality. Because algorithmic solutions alone have limits, scholars have proposed data-centric AI, an initiative to design datasets carefully and to evaluate and improve data quality in order to enhance ML systems.

Figure 2. Workflow of DQ evaluation and improvement

This project focuses on data quality in ML, particularly on how to apply state-of-the-art techniques to the assessment, assurance, and improvement of big data in order to build high-quality ML systems.
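As a rough illustration of what data quality assessment can look like in practice, the sketch below computes three simple indicators over a small tabular dataset: completeness, duplicate rate, and class imbalance. It is not the project's method. The function names, the field names, and the toy records are all assumptions made up for this example, and real evaluations would cover many more quality dimensions.

```python
from collections import Counter

# Hypothetical helpers for illustration only; not taken from the project.

def completeness(rows, fields):
    """Fraction of cells in the given fields that are not missing (None)."""
    total = len(rows) * len(fields)
    if total == 0:
        return 1.0
    present = sum(1 for r in rows for f in fields if r.get(f) is not None)
    return present / total

def duplicate_rate(rows, fields):
    """Fraction of records that repeat an earlier record on the given fields."""
    keys = [tuple(r.get(f) for f in fields) for r in rows]
    if not keys:
        return 0.0
    return 1 - len(set(keys)) / len(keys)

def imbalance_ratio(rows, label):
    """Size of the largest class divided by the size of the smallest class."""
    counts = Counter(r[label] for r in rows if r.get(label) is not None)
    if not counts:
        return 1.0
    return max(counts.values()) / min(counts.values())

# Toy records (made up) loosely shaped like network traffic samples.
rows = [
    {"duration": 1.2, "bytes": 300, "label": "benign"},
    {"duration": 0.4, "bytes": None, "label": "benign"},
    {"duration": 1.2, "bytes": 300, "label": "benign"},
    {"duration": 7.9, "bytes": 9100, "label": "attack"},
]
fields = ["duration", "bytes", "label"]

report = {
    "completeness": completeness(rows, fields),
    "duplicate_rate": duplicate_rate(rows, fields),
    "imbalance_ratio": imbalance_ratio(rows, "label"),
}
print(report)
```

Scores like these only flag potential problems. Whether a missing value, a repeated record, or a skewed class distribution actually harms a model depends on the application, which is why quality requirements need to be mapped to the task at hand.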

Related papers

Zhou, Y., Tu, F., Sha, K., Ding, J., & Chen, H. (2024, July). A survey on data quality dimensions and tools for machine learning (invited paper). In 2024 IEEE International Conference on Artificial Intelligence Testing (AITest) (pp. 120-131). IEEE.

Nguyen, H., Chen, H., Chen, J., Kargozari, K., & Ding, J. (2023). Construction and evaluation of a domain-specific knowledge graph for knowledge discovery. Information Discovery and Delivery.

Tran, N., Chen, H., Bhuyan, J., & Ding, J. (2022). Data Curation and Quality Evaluation for Machine Learning-Based Cyber Intrusion Detection. IEEE Access, 10, 121900-121923.

Chen, H., Chen, J., & Ding, J. (2021). Data evaluation and enhancement for quality improvement of machine learning. IEEE Transactions on Reliability, 70(2), 831-847.

Tran, N., Chen, H., Jiang, J., Bhuyan, J., & Ding, J. (2021). Effect of Class Imbalance on the Performance of Machine Learning-based Network Intrusion Detection. International Journal of Performability Engineering, 17(9).

Tang, M., Su, C., Chen, H., Qu, J., & Ding, J. (2020, December). SALKG: a semantic annotation system for building a high-quality legal knowledge graph. In 2020 IEEE International Conference on Big Data (Big Data) (pp. 2153-2159). IEEE.

Chen, H., Cao, G., Chen, J., & Ding, J. (2019). A practical framework for evaluating the quality of knowledge graph. In Knowledge Graph and Semantic Computing: Knowledge Computing and Language Understanding: 4th China Conference, CCKS 2019, Hangzhou, China, August 24–27, 2019, Revised Selected Papers 4 (pp. 111-122). Springer Singapore.
