Language Models as Science Tutors
METADATA ONLY
Loading...
Author / Producer
Date
2024
Publication Type
Conference Paper
ETH Bibliography
yes
Citations
Altmetric
METADATA ONLY
Data
Rights / License
Abstract
NLP has recently made exciting progress toward training language models (LMs) with strong scientific problem-solving skills. However, model development has not focused on real-life use-cases of LMs for science, including applications in education that require processing long scientific documents. To address this, we introduce TutorEval and TutorChat. TutorEval is a diverse question-answering benchmark consisting of questions about long chapters from STEM textbooks, written by experts. TutorEval helps measure real-life usability of LMs as scientific assistants, and it is the first benchmark combining long contexts, free-form generation, and multi-disciplinary scientific knowledge. Moreover, we show that fine-tuning base models with existing dialogue datasets leads to poor performance on TutorEval. Therefore, we create TutorChat, a dataset of 80,000 long synthetic dialogues about textbooks. We use TutorChat to fine-tune Llemma models with 7B and 34B parameters. These LM tutors specialized in math have a 32K-token context window, and they excel at TutorEval while performing strongly on GSM8K and MATH. Our datasets build on open-source materials, and we release our models, data, and evaluations publicly.
Permanent link
Publication status
published
Book title
Proceedings of the 41st International Conference on Machine Learning
Journal / series
Volume
235
Pages / Article No.
8310 - 8335
Publisher
PMLR
Event
41st International Conference on Machine Learning (ICML 2024)
Edition / version
Methods
Software
Geographic location
Date collected
Date created
Subject
Organisational unit
Notes
Funding
Related publications and datasets
Is variant form of: https://openreview.net/forum?id=WFyolnFZOR