
Quality-aware Multi-source Data Fusion and Enhancement for Medical Concept Normalization using Large Language Models

Medical concept normalization (MCN), also known as medical entity normalization (MEN), biomedical named entity normalization (BNEN), or biomedical entity linking (BEL), is the task of mapping informal medical terms or phrases from social media, biomedical documents, and other online platforms to formal medical concepts in a controlled vocabulary, ontology, or standardized medical database. For example, the colloquial phrase "heart attack" maps to the SNOMED CT concept "Myocardial infarction."


MCN is a fundamental natural language processing (NLP) task in the medical domain. It is critical for improving the efficiency, accuracy, and effectiveness of various healthcare applications, including electronic health record (EHR) management, clinical decision support (CDS) systems, health information exchange (HIE), precision medicine (PM), drug discovery and development, clinical trial retrieval, and automated patient messaging systems. Ultimately, MCN has a direct impact on both patient care and healthcare administration.

Figure 1. The workflow and main components of Medical Concept Normalization.

Existing studies primarily rely on single-source datasets and overlook data quality issues such as coverage, class imbalance, and comprehensiveness. This project proposes a quality-aware, multi-source data fusion framework that systematically integrates heterogeneous MCN corpora while explicitly assessing their quality. Specifically, we will enhance MCN performance by evaluating and improving data quality across multiple MCN datasets through data fusion with large language models (LLMs).

Figure 2. Quality-aware multi-source data fusion and enhancement framework.

This project will collect six MCN datasets built from social media phrases and grounded in the same medical ontology (SNOMED CT), then evaluate them along four data quality dimensions: correctness, class imbalance, comprehensiveness, and coverage. We will apply ChatGPT-based zero-shot and few-shot prompting to improve data quality through data correction and augmentation. Using the derived quality scores, we propose an adaptive weighting mechanism that reconciles candidate phrase-concept pairs from the different data sources, as sketched below.
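To make these two steps concrete, the sketch below first constructs a zero-shot correction prompt of the kind that could be sent to an LLM, then applies quality-derived source weights to reconcile competing phrase-concept candidates. The source names, quality scores, prompt wording, and mean-based weighting are hypothetical illustrations, not the project's actual implementation.

```python
from collections import defaultdict

# Step 1: LLM-based data correction (zero-shot prompt sketch).
# The prompt wording here is hypothetical; the project's actual prompts may differ.
def correction_prompt(phrase: str, concept: str) -> str:
    return (
        "You are a medical terminology expert.\n"
        f'Does the informal phrase "{phrase}" map to the SNOMED CT concept '
        f'"{concept}"? Answer yes or no; if no, suggest the correct concept.'
    )

print(correction_prompt("heart attack", "Myocardial infarction"))

# Step 2: quality-aware adaptive weighting.
# Hypothetical per-source scores on the four quality dimensions, each in [0, 1].
quality_scores = {
    "source_a": {"correctness": 0.92, "imbalance": 0.70,
                 "comprehensiveness": 0.60, "coverage": 0.55},
    "source_b": {"correctness": 0.85, "imbalance": 0.80,
                 "comprehensiveness": 0.75, "coverage": 0.65},
}

def source_weight(scores: dict) -> float:
    # Collapse the four dimensions into one weight; a plain mean is used here,
    # though the framework could weight dimensions unequally.
    return sum(scores.values()) / len(scores)

# Candidate (phrase, concept) pairs proposed by each source, with a confidence.
candidates = {
    "source_a": [("heart attack", "Myocardial infarction", 0.90)],
    "source_b": [("heart attack", "Myocardial infarction", 0.80),
                 ("heart attack", "Angina pectoris", 0.40)],
}

# Accumulate quality-weighted evidence for each (phrase, concept) pair,
# then keep the best-supported concept.
fused = defaultdict(float)
for source, pairs in candidates.items():
    w = source_weight(quality_scores[source])
    for phrase, concept, confidence in pairs:
        fused[(phrase, concept)] += w * confidence

best_pair, evidence = max(fused.items(), key=lambda kv: kv[1])
print(best_pair, round(evidence, 3))
```

In this toy run both sources agree on "Myocardial infarction", so its quality-weighted evidence (about 1.23) comfortably outscores the lower-confidence alternative (about 0.31).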


Our approach will enhance synonym disambiguation, mitigate data imbalance, and reduce MCN dataset development costs. The proposed framework and LLM-based enhancement strategies provide valuable insights for improving data quality in data fusion and deep learning applications.


Related papers:


Chen, H., Li, R., Cleveland, A., & Ding, J. (2025). Enhancing data quality in medical concept normalization through large language models. Journal of Biomedical Informatics, 165, 104812.
