Tokenisation is NP-Complete
OPEN ACCESS
Loading...
Author / Producer
Date
2024-12-19
Publication Type
Working Paper
ETH Bibliography
yes
Citations
Altmetric
OPEN ACCESS
Data
Rights / License
Abstract
In this work, we prove the NP-completeness of two variants of tokenisation, defined as the problem of compressing a dataset to at most $δ$ symbols by either finding a vocabulary directly (direct tokenisation), or selecting a sequence of merge operations (bottom-up tokenisation).
Permanent link
Publication status
published
Editor
Book title
Journal / series
Volume
Pages / Article No.
2412.1521
Publisher
Cornell University
Event
Edition / version
Methods
Software
Geographic location
Date collected
Date created
Subject
Data Structures and Algorithms (cs.DS); Computation and Language (cs.CL); Formal Languages and Automata Theory (cs.FL); FOS: Computer and information sciences
Organisational unit
09779 - Komm, Dennis / Komm, Dennis