
Data Quality Evaluation and Improvement for Machine Learning

Updated: Jul 17

Figure 1. Quality requirements mapping to applications

Machine learning (ML) has drawn great attention from both academia and industry over the past decades and continues to achieve impressive human-level performance on nontrivial tasks such as image classification, voice recognition, natural language processing, and autopiloting. Both data and algorithms are critical to the performance, fairness, robustness, reliability, and scalability of ML systems. However, artificial intelligence (AI) researchers and practitioners overwhelmingly concentrate on algorithms while undervaluing the impact of data quality. Because algorithmic solutions alone have limits in making AI succeed, scholars have proposed data-centric AI, an initiative to carefully design datasets and to evaluate and improve data quality in order to enhance ML systems.

Figure 2. Workflow of DQ evaluation and improvement

This project focuses on data quality in ML, particularly on how to use state-of-the-art technology for the assessment, assurance, and improvement of big data in order to build high-quality ML systems.


Related papers


Zhou, Y., Tu, F., Sha, K., Ding, J., & Chen, H. (2024, July). A survey on data quality dimensions and tools for machine learning (invited paper). In 2024 IEEE International Conference on Artificial Intelligence Testing (AITest) (pp. 120-131). IEEE.


Nguyen, H., Chen, H., Chen, J., Kargozari, K., & Ding, J. (2023). Construction and evaluation of a domain-specific knowledge graph for knowledge discovery. Information Discovery and Delivery.


Tran, N., Chen, H., Bhuyan, J., & Ding, J. (2022). Data Curation and Quality Evaluation for Machine Learning-Based Cyber Intrusion Detection. IEEE Access, 10, 121900-121923.


Chen, H., Chen, J., & Ding, J. (2021). Data evaluation and enhancement for quality improvement of machine learning. IEEE Transactions on Reliability, 70(2), 831-847.


Tran, N., Chen, H., Jiang, J., Bhuyan, J., & Ding, J. (2021). Effect of Class Imbalance on the Performance of Machine Learning-based Network Intrusion Detection. International Journal of Performability Engineering, 17(9).


Tang, M., Su, C., Chen, H., Qu, J., & Ding, J. (2020, December). SALKG: a semantic annotation system for building a high-quality legal knowledge graph. In 2020 IEEE International Conference on Big Data (Big Data) (pp. 2153-2159). IEEE.


Chen, H., Cao, G., Chen, J., & Ding, J. (2019). A practical framework for evaluating the quality of knowledge graph. In Knowledge Graph and Semantic Computing: Knowledge Computing and Language Understanding: 4th China Conference, CCKS 2019, Hangzhou, China, August 24–27, 2019, Revised Selected Papers 4 (pp. 111-122). Springer Singapore.

 
 

Contact

Department of Data Science

3940 N. Elm St.

Denton, TX 76207

© 2024 University of North Texas
