Tokenisation is NP-Complete

Date

2024-12-19

Publication Type

Working Paper

ETH Bibliography

yes

Abstract

In this work, we prove the NP-completeness of two variants of tokenisation, defined as the problem of compressing a dataset to at most δ symbols by either finding a vocabulary directly (direct tokenisation), or selecting a sequence of merge operations (bottom-up tokenisation).
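The bottom-up variant described above can be illustrated with a minimal sketch (function names are hypothetical, not from the paper): each merge operation replaces every occurrence of a symbol pair with a single new symbol, and the decision problem asks whether some merge sequence compresses the dataset to at most δ symbols.

```python
# Illustrative sketch of bottom-up tokenisation (not the paper's construction):
# a merge operation rewrites a sequence by fusing occurrences of a symbol pair.
from typing import List, Tuple

def apply_merge(seq: List[str], pair: Tuple[str, str]) -> List[str]:
    """Replace each non-overlapping left-to-right occurrence of `pair`
    with the concatenated symbol."""
    merged = pair[0] + pair[1]
    out: List[str] = []
    i = 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            out.append(merged)
            i += 2
        else:
            out.append(seq[i])
            i += 1
    return out

def compress(seq: List[str], merges: List[Tuple[str, str]]) -> List[str]:
    """Apply a sequence of merge operations in order."""
    for pair in merges:
        seq = apply_merge(seq, pair)
    return seq

# Example: "abab" has 4 symbols; two merges reduce it to 1 symbol.
result = compress(list("abab"), [("a", "b"), ("ab", "ab")])
print(result, len(result))  # ['abab'] 1
```

The hardness result concerns choosing such a merge sequence (or, in the direct variant, a vocabulary) optimally, not applying a given one, which is straightforward as shown.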

Publication status

published

Pages / Article No.

2412.15210

Publisher

Cornell University

Subject

Data Structures and Algorithms (cs.DS); Computation and Language (cs.CL); Formal Languages and Automata Theory (cs.FL); FOS: Computer and information sciences

Organisational unit

09779 - Komm, Dennis
