Performance of the pre-trained large language model GPT-4 on automated short answer grading


Author / Producer

Date

2024-07-05

Publication Type

Journal Article

ETH Bibliography

yes

Abstract

Automated Short Answer Grading (ASAG) has been an active area of machine-learning research for over a decade. It promises to let educators grade and give feedback on free-form responses in large-enrollment courses despite the limited availability of human graders. Over the years, carefully trained models have achieved steadily higher performance. More recently, pre-trained Large Language Models (LLMs) have emerged as a commodity, and an intriguing question is how a general-purpose tool without additional training compares to specialized models. We studied the performance of GPT-4 on the standard 2-way and 3-way benchmark datasets SciEntsBank and Beetle, where, in addition to the standard task of grading the alignment of the student answer with a reference answer, we also investigated withholding the reference answer. We found that, overall, the performance of the pre-trained general-purpose GPT-4 LLM is comparable to that of hand-engineered models, but worse than that of pre-trained LLMs that had specialized training.
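As a rough illustration of the task described above, the following Python sketch grades a student answer in the 2-way (correct/incorrect) setting and optionally withholds the reference answer. It assumes the OpenAI chat API; the prompt wording, the grade_answer helper, and the one-word label parsing are illustrative assumptions, not the prompt or protocol used in the study.

    # Minimal 2-way ASAG sketch using the OpenAI chat API.
    # Prompt wording and label parsing are illustrative assumptions.
    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    def grade_answer(question, student_answer, reference_answer=None):
        """Return 'correct' or 'incorrect' for a short free-form answer."""
        prompt = f"Question: {question}\n"
        if reference_answer is not None:  # the study also tests withholding this
            prompt += f"Reference answer: {reference_answer}\n"
        prompt += (
            f"Student answer: {student_answer}\n"
            "Reply with exactly one word, 'correct' or 'incorrect'."
        )
        response = client.chat.completions.create(
            model="gpt-4",
            messages=[{"role": "user", "content": prompt}],
            temperature=0,  # keep the grading as deterministic as possible
        )
        return response.choices[0].message.content.strip().lower()

    # Example use on a SciEntsBank-style item (hypothetical):
    # grade_answer("Why does a metal spoon feel colder than a wooden one?",
    #              "Metal conducts heat away from the hand faster.",
    #              "Metal is a better thermal conductor than wood.")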

Publication status

published

Volume

4 (1)

Pages / Article No.

47

Publisher

Springer

Subject

Automated short answer grading; Large language model; SciEntsBank; Beetle; GPT

Organisational unit

02219 - ETH AI Center / ETH AI Center
