Quality-aware Multi-source Data Fusion and Enhancement for Medical Concept Normalization using Large Language Models
- Haihua Chen
- Jul 17
- 2 min read
Medical concept normalization (MCN), also termed medical entity normalization (MEN), biomedical named entity normalization (BNEN), or biomedical entity linking (BEL), is the task of mapping informal medical terms or phrases from social media, biomedical documents, and other online platforms to formal medical concepts in a controlled vocabulary, ontology, or standardized medical database.
MCN is a fundamental natural language processing (NLP) task in the medical domain. It is critical for improving the efficiency, accuracy, and effectiveness of various healthcare applications, including electronic health record (EHR) management, clinical decision support (CDS) systems, health information exchange (HIE), precision medicine (PM), drug discovery and development, clinical trial retrieval, and automated patient messaging systems. Ultimately, MCN has a direct impact on both patient care and healthcare administration.

Existing studies primarily rely on single-source datasets and overlook data quality issues such as coverage, class imbalance, and comprehensiveness. This project proposes a quality-aware, multi-source data fusion framework that systematically integrates heterogeneous MCN corpora while explicitly assessing their quality. Specifically, we will enhance MCN task performance by evaluating and improving data quality across multiple MCN datasets through data fusion using Large Language Models (LLMs).

This project will collect six MCN datasets built from social media phrases and grounded in the same medical ontology (SNOMED-CT), then evaluate them across four data quality dimensions: correctness, class imbalance, comprehensiveness, and coverage. We will apply ChatGPT-based zero-shot and few-shot prompting to improve data quality through data correction and augmentation. Using the derived quality scores, we propose an adaptive weighting mechanism to reconcile candidate phrase-concept pairs from different data sources.
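To make the prompting step concrete, here is a minimal sketch of how a few-shot prompt for MCN might be assembled. The function name, the instruction wording, and the example pairs are all illustrative assumptions, not the project's actual prompts; with an empty example list the same function yields a zero-shot prompt.

```python
def build_fewshot_prompt(examples, phrase):
    """Assemble a few-shot MCN prompt (hypothetical format).

    examples: list of (informal phrase, SNOMED-CT concept) pairs
              used as in-context demonstrations; an empty list
              degenerates to a zero-shot prompt.
    phrase:   the informal phrase to normalize.
    """
    lines = ["Map each informal medical phrase to its SNOMED-CT concept."]
    for p, c in examples:
        lines.append(f"Phrase: {p}\nConcept: {c}")
    # Leave the final concept blank for the LLM to complete.
    lines.append(f"Phrase: {phrase}\nConcept:")
    return "\n\n".join(lines)
```

The resulting string would then be sent to the LLM; the model's completion is taken as the candidate concept.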
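One simple way to realize the adaptive weighting mechanism is a quality-weighted vote: each source's candidate concept for a phrase is weighted by that source's scores on the four quality dimensions, and the highest-weighted concept wins. The sketch below is an assumption about how such a mechanism could work, not the project's final design; all names and the uniform dimension weights are illustrative.

```python
def fuse_candidates(candidates, quality_scores, dim_weights=None):
    """Quality-weighted vote over candidate phrase-concept pairs.

    candidates:     {source_name: concept_id} -- each source's proposed
                    SNOMED-CT concept for the same input phrase.
    quality_scores: {source_name: {dimension: score in [0, 1]}} -- per-source
                    scores on the four quality dimensions.
    dim_weights:    optional per-dimension weights; defaults to uniform.
    """
    dims = ("correctness", "class_balance", "comprehensiveness", "coverage")
    if dim_weights is None:
        dim_weights = {d: 1.0 for d in dims}
    total = sum(dim_weights[d] for d in dims)

    votes = {}
    for source, concept in candidates.items():
        q = quality_scores[source]
        # A source's vote weight is its weighted-average quality score.
        weight = sum(dim_weights[d] * q.get(d, 0.0) for d in dims) / total
        votes[concept] = votes.get(concept, 0.0) + weight
    # Return the concept with the highest accumulated weight.
    return max(votes, key=votes.get)
```

Here two mediocre sources agreeing on a concept can outvote one high-quality source, which is the intended behavior of pooling evidence across corpora.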
Our approach will enhance synonym disambiguation, mitigate data imbalance, and reduce MCN dataset development costs. The proposed framework and LLM-based enhancement strategies provide valuable insights for improving data quality in data fusion and deep learning applications.
Related papers:
Chen, H., Li, R., Cleveland, A., & Ding, J. (2025). Enhancing data quality in medical concept normalization through large language models. Journal of Biomedical Informatics, 165, 104812.