Juan Luis Gastaldi



Last Name

Gastaldi

First Name

Juan Luis

Organisational unit

09682 - Cotterell, Ryan

Search Results

Publications 1 - 10 of 21
  • Boole’s Untruth Tables
    Item type: Book Chapter
    Gastaldi, Juan Luis (2022)
    Studies in Universal Logic ~ Logic in Question: Talks from the Annual Sorbonne Logic Workshop (2011–2019)
    This paper looks into what can reasonably be regarded as truth-table devices in one of Boole’s late manuscripts, as a way of addressing Boole’s relation to modern propositional logic. A careful investigation of the divergences between those table devices and our current conception of truth tables offers an opportunity to reassess the singularity of Boole’s logical system, especially concerning the relation between its linguistic and mathematical aspects. The paper explores Boole’s conception of the compositional structure of symbolic expressions, the genesis of table devices from his method of development into normal forms, and the non-logical origin of the constants 0 and 1 as dual terms. Boole’s system of logic is in this way shown to be chiefly concerned with the problem of the formal interpretability conditions of symbolic expressions, rather than with the truth conditions of logical propositions.
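    The “method of development” mentioned in the abstract is Boole’s expansion of a symbolic expression into a normal form over the constants 0 and 1. As a point of reference (the standard statement, not quoted from the chapter itself), it reads in the one- and two-variable cases:

```latex
% Boole's development of f into its normal form over the dual constants 0 and 1:
\[
  f(x) = f(1)\,x + f(0)\,(1 - x)
\]
\[
  f(x,y) = f(1,1)\,xy + f(1,0)\,x(1-y) + f(0,1)\,(1-x)y + f(0,0)\,(1-x)(1-y)
\]
```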
  • Gastaldi, Juan Luis (2019)
    L'épistémologie historique: Histoire et méthodes
  • Gastaldi, Juan Luis (2024)
    Minds and Machines
    These pages call attention to the images that Western culture has projected onto computing machinery to address the latter's capabilities and shortcomings, as well as its promises and dangers. The paper focuses on a fundamental transformation between the mid-1930s and the mid-1950s, in which the sociocultural model prevailing in the 19th century was abandoned in favor of a solipsistic mental one. Against an evolutionary understanding, it proposes to see this transformation as a shift in perspective, in which new aspects of computing could not be revealed without concealing others. Exposing the incapacity of an individualistic understanding of computers to address current global challenges, it advocates for bringing culture back into our understanding of what it is that machines do when they compute. These pages introduce the journal's Special Issue of the same name, which follows the 6th biennial HaPoC Conference, held in 2021 in Zurich.
  • Gastaldi, Juan Luis; Pellissier, Luc (2021)
    Interdisciplinary Science Reviews
    The recent success of deep neural network techniques in natural language processing relies heavily on the so-called distributional hypothesis. We suggest that the latter can be understood as a simplified version of the classic structuralist hypothesis, at the core of a programme aiming at reconstructing grammatical structures from first principles and corpus analysis. We then propose to reinterpret the structuralist programme with insights from proof theory, in particular by associating paradigmatic relations and units with formal types defined through an appropriate notion of interaction. In this way, we intend to build original conceptual bridges between computational logic and classic structuralism, which can contribute to understanding the recent advances in NLP.
  • Gastaldi, Juan Luis; Terilla, John; Malagutti, Luca; et al. (2024)
    arXiv
    Tokenization - the practice of converting strings of characters from an alphabet into sequences of tokens over a vocabulary - is a critical step in the NLP pipeline. The use of token representations is widely credited with increased model performance but is also the source of many undesirable behaviors, such as spurious ambiguity or inconsistency. Despite its recognized importance as a standard representation method in NLP, the theoretical underpinnings of tokenization are not yet fully understood. In particular, the impact of tokenization on statistical estimation has been investigated mostly through empirical means. The present paper contributes to addressing this theoretical gap by proposing a unified formal framework for representing and analyzing tokenizer models. Based on the category of stochastic maps, this framework enables us to establish general conditions for a principled use of tokenizers, and most importantly, the necessary and sufficient conditions for a tokenizer model to preserve the consistency of statistical estimators. Additionally, we discuss statistical and computational concerns crucial for designing and implementing tokenizer models, such as inconsistency, ambiguity, tractability, and boundedness. The framework and results advanced in this paper contribute to building robust theoretical foundations for representations in neural language modeling that can inform future empirical research.
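    A minimal sketch of the consistency problem at stake, assuming a toy unigram token model (the vocabulary, probabilities, and brute-force search are all invented for illustration; the paper works with the category of stochastic maps in full generality): when several token sequences decode to the same string (“spurious ambiguity”), scoring only one tokenization under-counts the string’s probability, whereas summing over all of them preserves it.

```python
from itertools import product

# Toy vocabulary with illustrative unigram token probabilities plus an
# end-of-sequence mass, so that sequence probabilities are coherent.
P_TOK = {"a": 0.2, "b": 0.2, "ab": 0.4}
P_EOS = 0.2

def decode(tokens):
    """The decoding map from token sequences back to character strings."""
    return "".join(tokens)

def seq_prob(tokens):
    """Probability of one token sequence under the toy model."""
    p = P_EOS
    for t in tokens:
        p *= P_TOK[t]
    return p

def string_prob(text, max_len=4):
    """Pushforward along `decode`: sum over *all* token sequences that
    spell out `text` (brute force, fine for a toy example)."""
    return sum(
        seq_prob(seq)
        for n in range(1, max_len + 1)
        for seq in product(P_TOK, repeat=n)
        if decode(seq) == text
    )

# "ab" is spuriously ambiguous: ("ab",) and ("a", "b") both decode to it.
print(seq_prob(("ab",)))   # 0.08  <- canonical tokenization only
print(string_prob("ab"))   # 0.088 <- consistent, marginalized probability
```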
  • Gastaldi, Juan Luis (2024)
    Handbook of the History and Philosophy of Mathematical Practice
    The philosophy and history of mathematical practices have brought the study of mathematical language and signs to the forefront of contemporary mathematical thought. However, despite the fruitfulness of this research trend, a comprehensive and unified account of its various aspects and the diverse approaches taken to explore it remains elusive. Recognizing this gap, we have undertaken the task of editing the present section of the Handbook of the History and Philosophy of Mathematical Practice as a much-needed remedy. Before providing an overview of the various contributions to the section, this introduction offers some context for the subject matter and a few conceptual clarifications.
  • Tokenization and the Noiseless Channel
    Item type: Conference Paper
    Zouhar, Vilém; Meister, Clara Isabel; Gastaldi, Juan Luis; et al. (2023)
    Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics, Volume 1: Long Papers
    Subword tokenization is a key part of most NLP pipelines. However, little is known about why some tokenizer and hyperparameter combinations lead to improved downstream model performance over others. We propose that good tokenizers lead to efficient channel usage, where the channel is the means by which some input is conveyed to the model and efficiency can be quantified in information-theoretic terms as the ratio of the Shannon entropy to the maximum entropy of the subword distribution. Nevertheless, an optimal encoding according to Shannon entropy assigns extremely long codes to low-frequency subwords and very short codes to high-frequency subwords. Defining efficiency in terms of Rényi entropy, on the other hand, penalizes distributions with either very high- or very low-frequency subwords. We posit that (1) extremely high-frequency subwords are problematic because their meaning is not distinct and (2) low-frequency subwords may not appear frequently enough for their meaning to be learned properly; encodings that induce unigram distributions with either can harm model performance. In machine translation, we find that across multiple tokenizers, the Rényi entropy has a very strong correlation with BLEU: 0.82, in comparison to just -0.30 for compressed length.
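    A minimal sketch of the two efficiency measures described above, assuming a plain table of subword counts (the toy frequency tables and the Rényi order alpha = 2.5 are placeholders; the paper selects the order empirically):

```python
import math

def shannon_efficiency(counts):
    """Shannon entropy of the subword unigram distribution, divided by
    the maximum entropy (log of the number of subwords actually used)."""
    total = sum(counts.values())
    probs = [c / total for c in counts.values()]
    h = -sum(p * math.log2(p) for p in probs)
    return h / math.log2(len(probs))

def renyi_efficiency(counts, alpha=2.5):
    """Rényi entropy of order alpha (alpha != 1), normalized the same way.
    Orders > 1 penalize mass piled onto very frequent subwords."""
    total = sum(counts.values())
    probs = [c / total for c in counts.values()]
    h = math.log2(sum(p ** alpha for p in probs)) / (1 - alpha)
    return h / math.log2(len(probs))

# Two hypothetical subword frequency tables from different tokenizers:
balanced = {"un": 40, "believ": 35, "able": 25}
skewed = {"the": 90, "zxq": 5, "qzv": 5}
print(shannon_efficiency(balanced), renyi_efficiency(balanced))  # both high
print(shannon_efficiency(skewed), renyi_efficiency(skewed))      # Rényi drops sharply
```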
  • Gastaldi, Juan Luis; Moot, Richard; Rétoré, Christian (2024)
    Les concepts fondateurs de la philosophie du langage ~ Le contexte en question
  • Giulianelli, Mario; Malagutti, Luca; Gastaldi, Juan Luis; et al. (2024)
    Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing
    Language models are widely used in computational psycholinguistics to test theories that relate the negative log probability (the surprisal) of a region of interest (a substring of characters) under a language model to the cognitive cost experienced by readers, as operationalized, for example, by gaze duration on the region. However, the application of modern language models to psycholinguistic studies is complicated by the practice of using tokenization as an intermediate step in training a model. Doing so results in a language model over *token* strings rather than one over character strings. Vexingly, regions of interest are generally misaligned with these token strings. The paper argues that token-level language models should be (approximately) marginalized into character-level language models before they are used in psycholinguistic studies to compute the surprisal of a region of interest; the marginalized character-level language model can then be used to compute the surprisal of an arbitrary character substring, which we term a focal area, that the experimenter may wish to use as a predictor. Our proposal of marginalizing a token-level model into a character-level one solves this misalignment issue independently of the tokenization scheme. Empirically, we discover various focal areas whose surprisal is a better psychometric predictor than the surprisal of the region of interest itself.
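    A minimal sketch of the marginalization step the abstract proposes, assuming a toy unigram token model with an explicit end-of-sequence term (vocabulary, probabilities, and function names are invented for illustration; the paper's construction handles full autoregressive models and arbitrary focal areas):

```python
import math

# Illustrative token-level "language model": unigram token probabilities
# plus an end-of-sequence mass, so sequence probabilities sum coherently.
P_TOK = {"c": 0.15, "a": 0.15, "t": 0.15, "ca": 0.2, "at": 0.15}
P_EOS = 0.2

def char_level_prob(text):
    """Marginal probability of the *character* string `text`: sum the
    token-level probabilities of every token sequence that spells it
    out, via dynamic programming over character prefixes."""
    n = len(text)
    dp = [0.0] * (n + 1)  # dp[i] = mass of token sequences spelling text[:i]
    dp[0] = 1.0
    for i in range(1, n + 1):
        for j in range(i):
            tok = text[j:i]
            if tok in P_TOK:
                dp[i] += dp[j] * P_TOK[tok]
    return dp[n] * P_EOS

def surprisal(text):
    """Character-level surprisal, -log2 p, of the whole string."""
    return -math.log2(char_level_prob(text))

# "cat" can be tokenized as c|a|t, ca|t, or c|at; all three are summed:
print(char_level_prob("cat"))  # (0.15**3 + 0.2*0.15 + 0.15*0.15) * 0.2
print(surprisal("cat"))
```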