Causal Estimation of Tokenisation Bias
OPEN ACCESS
Loading...
Author / Producer
Date
2025-07
Publication Type
Conference Paper
ETH Bibliography
yes
Citations
Altmetric
OPEN ACCESS
Data
Rights / License
Abstract
Modern language models are typically trained over subword sequences, but ultimately define probabilities over character-strings. Ideally, the choice of the tokeniser—which maps character-strings to subwords—should not affect the probability assigned to the underlying character-string; in practice, it does. We define this mismatch as **tokenisation bias**. In this work, we quantify one particular type of tokenisation bias: the effect of including or not a subword (e.g., ⟨ hello ⟩) in a tokeniser’s vocabulary on the probability a trained model assigns to the corresponding characters (i.e., “hello”). Estimating this effect is challenging because each model is trained with only one tokeniser. We address this by framing tokenisation bias as a causal effect and estimating it using the regression discontinuity design. Specifically, we exploit the fact that tokenisation algorithms rank subwords and add the first K to a tokeniser’s vocabulary, where K is an arbitrary cutoff point. As such, we can estimate a causal effect by comparing similar subwords around this cutoff. Experimentally, we find that tokenisation consistently affects models’ outputs across scales, vocabularies, and tokenisers. Notably, a subword’s presence in a small model’s vocabulary may increase its characters’ probability by up to 17 times, highlighting tokenisation as a key design choice in language modelling.
Permanent link
Publication status
published
Book title
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Journal / series
Volume
Pages / Article No.
28325 - 28340
Publisher
Association for Computational Linguistics
Event
63rd Annual Meeting of the Association for Computational Linguistics (ACL 2025)
Edition / version
Methods
Software
Geographic location
Date collected
Date created
Subject
Organisational unit
09462 - Hofmann, Thomas / Hofmann, Thomas