About me

I am an incoming PhD student at Northeastern University, where I will be supervised by Prof. Weiyan Shi. I am currently a third-year Master's student at the Wangxuan Institute of Computer Technology, Peking University, supervised by Prof. Xiaojun Wan. Previously, I obtained my Bachelor's degree from the School of Electronics Engineering and Computer Science at Peking University. I was also a visiting student at the Yale NLP Lab, supervised by Prof. Arman Cohan.

I am interested in the evaluation of NLP systems and LLMs. I believe evaluation is inherently interdisciplinary, drawing on human factors, machine learning, and statistics, among other fields. My work has focused on evaluating summarization, text generation, and LLMs, covering automatic evaluation, human evaluation, and meta-evaluation.

I believe evaluation is crucial to current research. Without more reliable evaluation mechanisms, it is difficult to determine whether an innovation is a genuine advance or merely an illusion, especially given the large volume of incremental research.

Selected Publications

(* indicates equal contribution)

  • LLM-based NLG Evaluation: Current Status and Challenges
    Mingqi Gao*, Xinyu Hu*, Xunjian Yin, Jie Ruan, Xiao Pu, Xiaojun Wan
    Computational Linguistics [pdf]

  • Re-evaluating Automatic LLM System Ranking for Alignment with Human Preference
    Mingqi Gao*, Yixin Liu*, Xinyu Hu, Xiaojun Wan, Jonathan Bragg, Arman Cohan
    Findings of NAACL 2025 (To appear) [pdf]

  • Analyzing and Evaluating Correlation Measures in NLG Meta-Evaluation
    Mingqi Gao*, Xinyu Hu*, Li Lin, Xiaojun Wan
    NAACL 2025 (To appear) [pdf]

  • Themis: A Reference-free NLG Evaluation Language Model with Flexibility and Interpretability
    Xinyu Hu, Li Lin, Mingqi Gao, Xunjian Yin, Xiaojun Wan
    EMNLP 2024 [pdf] [code]

  • Are LLM-based Evaluators Confusing NLG Quality Criteria?
    Xinyu Hu*, Mingqi Gao*, Sen Hu, Yang Zhang, Yicheng Chen, Teng Xu, Xiaojun Wan
    ACL 2024 [pdf] [code]

  • Is Summary Useful or Not? An Extrinsic Human Evaluation of Text Summaries on Downstream Tasks
    Xiao Pu, Mingqi Gao, Xiaojun Wan
    LREC-COLING 2024 [pdf]

  • Better than Random: Reliable NLG Human Evaluation with Constrained Active Sampling
    Jie Ruan, Xiao Pu, Mingqi Gao, Xiaojun Wan, Yuesheng Zhu
    AAAI 2024 [pdf]

  • Missing Information, Unresponsive Authors, Experimental Flaws: The Impossibility of Assessing the Reproducibility of Previous Human Evaluations in NLP
    Anya Belz, Craig Thomson, Ehud Reiter, and 36 more authors
    Fourth Workshop on Insights from Negative Results in NLP, 2023 [pdf]

  • Reference Matters: Benchmarking Factual Error Correction for Dialogue Summarization with Fine-grained Evaluation Framework
    Mingqi Gao, Xiaojun Wan, Jia Su, Zhefeng Wang, Baoxing Huai
    ACL 2023 [pdf] [code]

  • Evaluating Factuality in Cross-lingual Summarization
    Mingqi Gao*, Wenqing Wang*, Xiaojun Wan, Yuemei Xu
    Findings of ACL 2023 [pdf] [code]

  • DialSummEval: Revisiting Summarization Evaluation for Dialogues
    Mingqi Gao, Xiaojun Wan
    NAACL 2022 [pdf] [code]

Academic Services

I have served as a reviewer for:

  • Conferences: AAAI 2023, EMNLP 2023, ACL Rolling Review 2023–2024, ICLR 2025.
  • Workshops: HumEval @ RANLP 2023, LLMAgents @ ICLR 2024.