From Language Models over Tokens to Language Models over Characters


Date

2025

Publication Type

Conference Paper

ETH Bibliography

yes

Abstract

Modern language models are internally—and mathematically—distributions over token strings rather than character strings, posing numerous challenges for programmers building user applications on top of them. For example, if a prompt is specified as a character string, it must be tokenized before passing it to the token-level language model. Thus, the tokenizer and consequent processing are very sensitive to the specification of the prompt (e.g., whether the prompt ends with a space or not). This paper presents algorithms for converting token-level language models to character-level ones. We present both exact and approximate algorithms. In the empirical portion of the paper, we benchmark the practical runtime and approximation quality. Across four publicly available language models, we find that—even with a small computation budget—our method is able to accurately approximate the character-level distribution at reasonably fast speeds, and that a significant improvement in the language model's compression rate (bits/byte) is achieved.
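The gap between token strings and character strings described in the abstract can be illustrated with a toy example. The sketch below is not the paper's algorithm: it assumes a hypothetical token-level model over the vocabulary `a`, `b`, `ab` in which every token is drawn independently of its history, with made-up probabilities. Under that assumption, the probability of a character string is the sum over all token strings that spell it, and the probability of a character prefix is the sum over all token strings whose spelling starts with it. The names `TOKEN_PROBS`, `EOS_PROB`, `forward_weights`, `string_prob`, and `prefix_prob` are introduced here for illustration only.

```python
# Toy token-level model (an assumption for illustration, not from the paper):
# each token is drawn independently of the history; generation stops at EOS.
TOKEN_PROBS = {"a": 0.3, "b": 0.3, "ab": 0.2}
EOS_PROB = 0.2  # token probabilities plus EOS_PROB sum to 1


def forward_weights(chars: str) -> list[float]:
    """alpha[i] = total probability of all token sequences spelling chars[:i]."""
    alpha = [0.0] * (len(chars) + 1)
    alpha[0] = 1.0  # the empty token sequence spells the empty string
    for i in range(1, len(chars) + 1):
        for tok, p in TOKEN_PROBS.items():
            start = i - len(tok)
            if start >= 0 and chars[start:i] == tok:
                alpha[i] += alpha[start] * p
    return alpha


def string_prob(chars: str) -> float:
    """Probability that the model emits exactly `chars` and then stops."""
    return forward_weights(chars)[-1] * EOS_PROB


def prefix_prob(chars: str) -> float:
    """Probability that the emitted character string starts with `chars`."""
    if not chars:
        return 1.0
    alpha = forward_weights(chars)
    total = 0.0
    for j in range(len(chars)):
        # The tokens so far spell chars[:j]; the next token must cover the rest,
        # possibly extending past the end of the prefix.
        rest = chars[j:]
        cover = sum(p for tok, p in TOKEN_PROBS.items() if tok.startswith(rest))
        total += alpha[j] * cover
    return total


if __name__ == "__main__":
    print(string_prob("ab"))                     # sums over ["ab"] and ["a", "b"]
    print(prefix_prob("ab") / prefix_prob("a"))  # next-character probability of "b" after "a"
```

In this toy model, the single tokenization `["ab"]` contributes 0.2 × 0.2 = 0.04 to the string `ab`, while summing over both tokenizations gives about 0.058, so scoring only one tokenization underestimates the character-level probability. The position-indexed dynamic program works here only because the toy tokens are independent of their history; a neural language model conditions each token on the full preceding token string, so this shortcut does not carry over directly.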

Publication status

published

Book title

Proceedings of the 42nd International Conference on Machine Learning

Volume

267

Pages / Article No.

61391 - 61412

Publisher

PMLR

Event

42nd International Conference on Machine Learning (ICML 2025)

Organisational unit

09682 - Cotterell, Ryan / Cotterell, Ryan
