Data Quality Evaluation and Improvement for Machine Learning
- Zhou Yuhan
- Oct 5, 2024
Updated: Jul 17

Machine learning (ML) has drawn great attention from both academia and industry over the past decades and continues to achieve impressive, human-level performance on nontrivial tasks such as image classification, voice recognition, natural language processing, and autopiloting. Both data and algorithms are critical to the performance, fairness, robustness, reliability, and scalability of ML systems. However, artificial intelligence (AI) researchers and practitioners overwhelmingly concentrate on algorithms while undervaluing the impact of data quality. Because algorithmic solutions alone have limits in delivering AI success, scholars have proposed data-centric AI, an initiative to carefully design datasets and to evaluate and improve data quality in order to enhance ML systems.

This project focuses on data quality in ML, particularly on how to apply state-of-the-art techniques to the assessment, assurance, and improvement of big data in order to build high-quality ML systems.
Related papers
Zhou, Y., Tu, F., Sha, K., Ding, J., & Chen, H. (2024, July). A survey on data quality dimensions and tools for machine learning (invited paper). In 2024 IEEE International Conference on Artificial Intelligence Testing (AITest) (pp. 120-131). IEEE.
Nguyen, H., Chen, H., Chen, J., Kargozari, K., & Ding, J. (2023). Construction and evaluation of a domain-specific knowledge graph for knowledge discovery. Information Discovery and Delivery.
Tran, N., Chen, H., Bhuyan, J., & Ding, J. (2022). Data Curation and Quality Evaluation for Machine Learning-Based Cyber Intrusion Detection. IEEE Access, 10, 121900-121923.
Chen, H., Chen, J., & Ding, J. (2021). Data evaluation and enhancement for quality improvement of machine learning. IEEE Transactions on Reliability, 70(2), 831-847.
Tran, N., Chen, H., Jiang, J., Bhuyan, J., & Ding, J. (2021). Effect of Class Imbalance on the Performance of Machine Learning-based Network Intrusion Detection. International Journal of Performability Engineering, 17(9).
Tang, M., Su, C., Chen, H., Qu, J., & Ding, J. (2020, December). SALKG: a semantic annotation system for building a high-quality legal knowledge graph. In 2020 IEEE International Conference on Big Data (Big Data) (pp. 2153-2159). IEEE.
Chen, H., Cao, G., Chen, J., & Ding, J. (2019). A practical framework for evaluating the quality of knowledge graph. In Knowledge Graph and Semantic Computing: Knowledge Computing and Language Understanding: 4th China Conference, CCKS 2019, Hangzhou, China, August 24–27, 2019, Revised Selected Papers 4 (pp. 111-122). Springer Singapore.