How to Select Datapoints for Efficient Human Evaluation of NLG Models?
OPEN ACCESS
Date
2025
Publication Type
Journal Article
ETH Bibliography
yes
Abstract
Human evaluation is the gold standard for evaluating text generation models, but it is expensive. To fit budgetary constraints, a random subset of the test data is therefore often chosen for human evaluation in practice. Randomly selected data, however, may not accurately represent test performance, making this approach economically inefficient for model comparison. In this work, we develop and analyze a suite of selectors that identify the most informative datapoints for human evaluation while taking evaluation costs into account. We show that selectors based on variance in automated metric scores, diversity in model outputs, or Item Response Theory outperform random selection. We further develop an approach to distill these selectors to the setting where model outputs are not yet available: source-based estimators predict an item's usefulness for human evaluation from the source text alone. We demonstrate the efficacy of our selectors on two common NLG tasks, machine translation and summarization, and show that only ∼70% of the test data is needed to produce the same evaluation result as the entire dataset.
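As a concrete illustration of the variance-based selector described in the abstract, the following Python sketch ranks test items by the variance of their automated metric scores across candidate systems and keeps the highest-variance items for human annotation. The function name, the synthetic scores, and the simple top-k budget rule are illustrative assumptions for this sketch, not the paper's actual implementation, which additionally accounts for evaluation costs.

```python
import numpy as np

def select_by_metric_variance(metric_scores: np.ndarray, budget: int) -> np.ndarray:
    """Return indices of the `budget` test items whose automated metric
    scores vary most across systems.

    metric_scores: shape (n_items, n_systems), e.g. a COMET or ROUGE
    score for each test item under each candidate system.
    """
    # Items on which the systems disagree most (highest score variance)
    # are the most informative for ranking systems, so annotate them first.
    variance = metric_scores.var(axis=1)
    return np.argsort(-variance)[:budget]

# Hypothetical usage: 1,000 test items scored by 4 systems; keep ~70%
# of the data, matching the savings reported in the abstract.
rng = np.random.default_rng(0)
scores = rng.uniform(0.0, 1.0, size=(1000, 4))
chosen = select_by_metric_variance(scores, budget=700)
print(chosen[:10])  # indices of the ten highest-variance items
```

A source-based estimator, as described in the abstract, would replace `metric_scores` with usefulness scores predicted from the source texts alone, so that selection can happen before any model outputs exist.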
Publication status
published
Volume
13
Pages / Article No.
1789 - 1811
Publisher
MIT Press
Organisational unit
09684 - Sachan, Mrinmaya
Funding
201009 - Representation Learning for Arbitrarily Long Richly Formatted Multimedia Documents (SNF)