Automatic Review Generation and Quality Improvement for Scientific Papers using Large Language Models
- Haihua Chen
- Jul 17
- 2 min read
The explosive growth of scientific publications, particularly in fields like artificial intelligence and machine learning, has placed unprecedented pressure on traditional peer review systems. Large language models (LLMs) offer promising capabilities for automating aspects of peer review, yet their true effectiveness—especially in tasks requiring critical reasoning and contextual judgment—remains insufficiently understood.

This project proposes to systematically evaluate, validate, and enhance the use of LLMs in peer review generation across disciplinary contexts. We aim to develop and expand a structured, benchmark-driven evaluation framework that goes beyond surface-level textual comparisons to measure deep semantic and conceptual alignment between LLM-generated reviews and human-written critiques. Building on recent work that revealed LLMs’ strengths in summarization but weaknesses in evaluative depth, we will extend this research in two significant directions: (1) domain generalizability and (2) quality-aware LLM augmentation for human reviewers.
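For concreteness, the snippet below sketches one way the "beyond surface-level textual comparison" idea could be operationalized: scoring semantic alignment between an LLM-generated review and a human-written critique with sentence embeddings rather than word overlap. The embedding model and the example texts are illustrative assumptions, not the project's actual evaluation framework.

```python
# Illustrative sketch: embedding-based semantic alignment between an
# LLM-generated review and a human-written critique (assumed setup,
# not the project's actual framework).
from sentence_transformers import SentenceTransformer, util

# Any sentence-embedding model would do; this checkpoint is an assumption.
model = SentenceTransformer("all-MiniLM-L6-v2")

llm_review = "The paper proposes a useful benchmark but lacks ablation studies."
human_review = "Strong benchmark contribution; however, the ablations are missing."

# Compare the two reviews in embedding space instead of relying on
# surface n-gram overlap (e.g., ROUGE).
embeddings = model.encode([llm_review, human_review], convert_to_tensor=True)
alignment = util.cos_sim(embeddings[0], embeddings[1]).item()

print(f"Semantic alignment (cosine similarity): {alignment:.3f}")
```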
The objectives of this project include (but are not limited to):
Benchmark Expansion and Cross-Domain Validation
- Adapt the existing evaluation framework to new domains, including biomedical science, digital humanities, and the social sciences.
- Curate multi-domain review datasets with varying review formats and criteria.
- Analyze how domain-specific knowledge and review conventions influence LLM performance in summarization, critique, and quality sensitivity.
Development of Review Quality Enhancement Tools
- Design and test LLM-based agents that support human reviewers by suggesting constructive feedback, refining clarity, and enhancing conceptual depth.
- Construct interpretable knowledge graph (KG)-based indicators (e.g., node diversity, contextual grounding, label entropy) for automatic detection of shallow or biased reviews; a sketch of such indicators follows this list.
- Prototype a diagnostic dashboard that integrates semantic and structural metrics to aid area chairs and editors in assessing review quality.
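As a rough illustration of the KG-based indicators mentioned above, the sketch below computes node-label diversity and label entropy over a toy review-derived graph. The graph construction and labels are hypothetical placeholders; the project's actual indicators (including contextual grounding) would be defined and validated within the framework itself.

```python
# Illustrative sketch: simple KG-derived indicators for review quality
# (node diversity and label entropy). The toy graph and its labels are
# hypothetical placeholders, not the project's actual indicators.
import math
from collections import Counter

import networkx as nx

# Toy knowledge graph extracted from a review; each node carries an entity-type label.
G = nx.Graph()
G.add_node("benchmark", label="artifact")
G.add_node("ablation study", label="method")
G.add_node("novelty", label="evaluation_criterion")
G.add_node("clarity", label="evaluation_criterion")
G.add_edge("benchmark", "novelty")
G.add_edge("ablation study", "clarity")

labels = [data["label"] for _, data in G.nodes(data=True)]
counts = Counter(labels)
total = len(labels)

# Node diversity: fraction of distinct labels among all nodes.
node_diversity = len(counts) / total

# Label entropy: Shannon entropy of the label distribution; low entropy may
# signal a review that dwells on a single aspect (a possible shallowness cue).
label_entropy = -sum((c / total) * math.log2(c / total) for c in counts.values())

print(f"node diversity: {node_diversity:.2f}, label entropy: {label_entropy:.2f} bits")
```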
Evaluation and Impact Assessment
- Conduct comparative studies across multiple LLM architectures and prompting strategies to determine the most effective configurations for review augmentation.
- Collaborate with academic conferences and journals to pilot-test our framework and tools within real-world peer review workflows.
This project addresses a critical need in the scholarly publishing ecosystem: maintaining the quality, fairness, and scalability of peer review in the era of AI-driven research.
Related papers:
Li, R., Zhang, H., Gehringer, E., Xiao, T., Ding, J., & Chen, H. (2025). Unveiling the Merits and Defects of LLMs in Automatic Review Generation for Scientific Papers (long paper). The 25th IEEE International Conference on Data Mining (ICDM). Under review.