Analyzing Human Questioning Behavior and Causal Curiosity through Natural Queries
OPEN ACCESS
Date
2024-10-24
Publication Type
Other Publication
ETH Bibliography
yes
Data
Abstract
The recent development of Large Language Models (LLMs) has changed our role in interacting with them. Instead of primarily testing these models with questions we already know the answers to, we now use them to explore questions whose answers are unknown to us. This shift, which has not been fully addressed in existing datasets, highlights the growing need to understand naturally occurring human questions, which are more complex, open-ended, and reflective of real-world needs. To this end, we present NatQuest, a collection of 13,500 naturally occurring questions from three diverse sources: human-to-search-engine queries, human-to-human interactions, and human-to-LLM conversations. Our comprehensive collection enables a rich understanding of human curiosity across various domains and contexts. Our analysis reveals a significant presence of causal questions (up to 42%) within the dataset. We develop an iterative prompt improvement framework to identify all causal queries, and we examine their unique linguistic properties, cognitive complexity, and source distribution. We also lay the groundwork to explore LLM performance on these questions and provide six efficient classification models to identify causal questions at scale for future work.
Publication status
published
Pages / Article No.
2405.20318
Publisher
Cornell University
Edition / version
v2
Subject
Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Machine Learning (stat.ML); FOS: Computer and information sciences
Organisational unit
09684 - Sachan, Mrinmaya / Sachan, Mrinmaya
09664 - Schölkopf, Bernhard / Schölkopf, Bernhard
Funding
201009 - Representation Learning for Arbitrarily Long Richly Formatted Multimedia Documents (SNF)
Related publications and datasets
Is supplemented by: https://github.com/roberto-ceraolo/natquest