Dynamic Human Evaluation for Relative Model Comparisons
OPEN ACCESS
Author / Producer
Date
2022
Publication Type
Conference Paper
ETH Bibliography
yes
Abstract
Collecting human judgements is currently the most reliable evaluation method for natural language generation systems. Automatic metrics have been shown to be flawed when measuring quality aspects of generated text and to correlate poorly with human judgements. However, human evaluation is time- and cost-intensive, and there is no consensus on how to design and conduct human evaluation experiments. There is thus a need for streamlined approaches to collect human judgements efficiently when evaluating natural language generation systems. We therefore present a dynamic approach for determining the required number of human annotations when evaluating generated outputs in relative comparison settings. We propose an agent-based framework of human evaluation to assess multiple labelling strategies and methods for deciding the better model, in both a simulation and a crowdsourcing case study. The main results indicate that a decision about the superior model can be made with high probability across the different labelling strategies, and that assigning a single random worker per task requires the least overall labelling effort and thus the least cost.
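The abstract leaves the stopping criterion unspecified, but the general idea of dynamically deciding when enough pairwise judgements have been collected can be illustrated with a short simulation. The sketch below is illustrative only and is not the paper's algorithm: the Bernoulli worker model, the Beta-posterior stopping rule, the 0.95 threshold, and all names (simulate_dynamic_comparison, p_true, worker_acc) are assumptions made for this example.

```python
import random

def simulate_dynamic_comparison(p_true=0.6, worker_acc=0.8,
                                threshold=0.95, max_tasks=2000, seed=0):
    """Toy agent-based evaluation: one random worker per comparison task.

    p_true     -- assumed probability that model A's output truly beats model B's
    worker_acc -- assumed probability that a simulated worker labels a task correctly
    threshold  -- stop once the posterior P(A better) or P(B better) exceeds it
    """
    rng = random.Random(seed)
    wins_a = wins_b = 0
    for n in range(1, max_tasks + 1):
        a_truly_better = rng.random() < p_true   # latent outcome of this task
        correct = rng.random() < worker_acc      # does the worker label it right?
        vote_a = a_truly_better if correct else not a_truly_better
        wins_a += vote_a
        wins_b += not vote_a
        # Posterior over the preference rate under a uniform Beta(1, 1) prior;
        # P(rate > 0.5) is estimated by Monte Carlo sampling from Beta(wins_a+1, wins_b+1).
        p_a_better = sum(
            rng.betavariate(wins_a + 1, wins_b + 1) > 0.5 for _ in range(2000)
        ) / 2000
        if p_a_better > threshold or p_a_better < 1 - threshold:
            return ("A" if p_a_better > 0.5 else "B"), n
    return "undecided", max_tasks

winner, n_labels = simulate_dynamic_comparison()
print(f"decided for model {winner} after {n_labels} labels")
```

In this toy setup each task receives exactly one label from a random simulated worker, mirroring the single-random-worker strategy that the abstract reports as the cheapest; stricter thresholds or noisier workers increase the number of labels needed before a decision is reached.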
Publication status
published
Book title
Proceedings of the Thirteenth Language Resources and Evaluation Conference
Pages / Article No.
5946 - 5955
Publisher
European Language Resources Association
Event
13th Conference on Language Resources and Evaluation (LREC 2022)
Subject
Human Evaluation; Crowdsourcing; Natural Language Generation; Relative Model Comparison
Organisational unit
09588 - Zhang, Ce (former)
Funding
184628 - EASEML: Toward a More Accessible and Usable Machine Learning Platform for Non-expert Users (SNF)
197485 - Governance and legal framework for managing artificial intelligence (AI) (SNF)
187132 - Machine-based Scoring of a Neuropsychological Test: The Rey-Osterrieth Complex Figure (SNF)
957407 - Integrated Data Analysis Pipelines for Large-Scale Data Management, HPC, and Machine Learning (EC)