The Foundations of Tokenization: Statistical and Computational Concerns
OPEN ACCESS
Loading...
Author / Producer
Date
2024-11-04
Publication Type
Working Paper
ETH Bibliography
yes
Citations
Altmetric
OPEN ACCESS
Data
Rights / License
Abstract
Tokenization - the practice of converting strings of characters from an alphabet into sequences of tokens over a vocabulary - is a critical step in the NLP pipeline. The use of token representations is widely credited with increased model performance but is also the source of many undesirable behaviors, such as spurious ambiguity or inconsistency. Despite its recognized importance as a standard representation method in NLP, the theoretical underpinnings of tokenization are not yet fully understood. In particular, the impact of tokenization on statistical estimation has been investigated mostly through empirical means. The present paper contributes to addressing this theoretical gap by proposing a unified formal framework for representing and analyzing tokenizer models. Based on the category of stochastic maps, this framework enables us to establish general conditions for a principled use of tokenizers, and most importantly, the necessary and sufficient conditions for a tokenizer model to preserve the consistency of statistical estimators. Additionally, we discuss statistical and computational concerns crucial for designing and implementing tokenizer models, such as inconsistency, ambiguity, tractability, and boundedness. The framework and results advanced in this paper contribute to building robust theoretical foundations for representations in neural language modeling that can inform future empirical research.
Permanent link
Publication status
published
External links
Editor
Book title
Journal / series
Volume
Pages / Article No.
Publisher
Cornell University
Event
Edition / version
v3
Methods
Software
Geographic location
Date collected
Date created
Subject
Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); FOS: Computer and information sciences
Organisational unit
09682 - Cotterell, Ryan / Cotterell, Ryan