Clara Isabel Meister



Last Name: Meister
First Name: Clara Isabel
Organisational unit: 01259 - Lehre Informatik

Search Results

Publications 1 - 10 of 38
  • Pimentel, Tiago; Meister, Clara Isabel; Salesky, Elizabeth; et al. (2021)
    Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing
    While there exist scores of natural languages, each with its unique features and idiosyncrasies, they all share a unifying theme: enabling human communication. We may thus reasonably predict that human cognition shapes how these languages evolve and are used. Assuming that the capacity to process information is roughly constant across human populations, we expect a surprisal–duration trade-off to arise both across and within languages. We analyse this trade-off using a corpus of 600 languages and, after controlling for several potential confounds, we find strong supporting evidence in both settings. Specifically, we find that, on average, phones are produced faster in languages where they are less surprising, and vice versa. Further, we confirm that more surprising phones are longer, on average, in 319 languages out of the 600. We thus conclude that there is strong evidence of a surprisal–duration trade-off in operation, both across and within the world’s languages.
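The within-language half of this trade-off can be illustrated with a toy calculation: estimate each phone's surprisal from unigram frequencies, then correlate surprisal with duration. A minimal Python sketch, with all counts and durations invented purely for illustration (they are chosen to follow the predicted pattern, not taken from the paper's corpus):

```python
import numpy as np

# Hypothetical phone counts in a corpus (all names and values illustrative).
counts = {"a": 500, "t": 300, "s": 150, "z": 50}
total = sum(counts.values())

# Surprisal in bits: -log2 p(phone). Rarer phones are more surprising.
surprisal = {ph: -np.log2(c / total) for ph, c in counts.items()}

# Invented mean durations (ms) following the predicted trade-off:
# more surprising phones take longer to produce.
duration = {"a": 60.0, "t": 70.0, "s": 85.0, "z": 110.0}

phones = sorted(counts)
s = np.array([surprisal[p] for p in phones])
d = np.array([duration[p] for p in phones])

# Pearson correlation between surprisal and duration.
r = np.corrcoef(s, d)[0, 1]
print(f"surprisal-duration correlation: {r:.2f}")
```

The paper's actual analysis controls for confounds across 600 languages; this sketch only shows the quantity being correlated.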
  • Meister, Clara Isabel; Stokowiec, Wojciech; Pimentel, Tiago; et al. (2023)
    Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)
    After just a few hundred training updates, a standard probabilistic model for language generation has likely not yet learnt many semantic or syntactic rules of natural language, making it difficult to estimate the probability distribution over next tokens. Yet around this point, these models have identified a simple, loss-minimising behaviour: to output the unigram distribution of the target training corpus. The use of such a heuristic raises the question: Can we initialise our models with this behaviour and save precious compute resources and model capacity? Here we show that we can effectively endow standard neural language generation models with a separate module that reflects unigram frequency statistics as prior knowledge, simply by initialising the bias term in a model's final linear layer with the log-unigram distribution. We use neural machine translation as a test bed for this simple technique and observe that it: (i) improves learning efficiency; (ii) achieves better overall performance; and perhaps most importantly (iii) appears to disentangle strong frequency effects by encouraging the model to specialise in non-frequency-related aspects of language.
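The initialisation trick described above is simple enough to sketch directly. In the snippet below (a toy vocabulary, with numpy standing in for a real model's final linear layer), setting the bias to the log-unigram distribution makes the untrained layer output exactly the unigram distribution, before any gradient updates:

```python
import numpy as np

# Sketch of the bias-initialisation trick, assuming the model's final layer
# computes logits = W h + b over the vocabulary (all sizes illustrative).
rng = np.random.default_rng(0)
vocab_size, hidden = 6, 4

# Unigram token counts from a (toy) training corpus.
counts = np.array([400, 250, 150, 120, 60, 20], dtype=float)
unigram = counts / counts.sum()

W = np.zeros((vocab_size, hidden))  # weights start at zero
b = np.log(unigram)                 # bias = log-unigram distribution

def next_token_dist(h):
    logits = W @ h + b
    e = np.exp(logits - logits.max())
    return e / e.sum()

# Before any training signal reaches W, the layer already outputs the
# unigram distribution -- the loss-minimising heuristic described above.
h = rng.normal(size=hidden)
print(next_token_dist(h))
```

Since softmax(log p) = p, the output matches the unigram distribution exactly while W remains zero; training then only has to learn the non-frequency-related structure.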
  • Best-First Beam Search
    Item type: Journal Article
    Vieira, Tim; Cotterell, Ryan; Meister, Clara Isabel (2020)
    Transactions of the Association for Computational Linguistics
  • Is Sparse Attention more Interpretable?
    Item type: Conference Paper
    Meister, Clara Isabel; Lazov, Stefan; Augenstein, Isabelle; et al. (2021)
    Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing
    Sparse attention has been claimed to increase model interpretability under the assumption that it highlights influential inputs. Yet the attention distribution is typically over representations internal to the model rather than the inputs themselves, suggesting this assumption may not have merit. We build on the recent work exploring the interpretability of attention; we design a set of experiments to help us understand how sparsity affects our ability to use attention as an explainability tool. On three text classification tasks, we verify that only a weak relationship between inputs and co-indexed intermediate representations exists, under sparse attention and otherwise. Further, we do not find any plausible mappings from sparse attention distributions to a sparse set of influential inputs through other avenues. Rather, we observe in this setting that inducing sparsity may make it less plausible that attention can be used as a tool for understanding model behavior.
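Sparse attention distributions of the kind studied here are typically produced by replacing softmax with a sparsity-inducing transform such as sparsemax (Martins and Astudillo, 2016). A minimal numpy sketch of sparsemax, included for illustration rather than as code from the paper:

```python
import numpy as np

def sparsemax(z):
    """Euclidean projection of z onto the probability simplex
    (Martins and Astudillo, 2016); yields exact zeros, unlike softmax."""
    z = np.asarray(z, dtype=float)
    z_sorted = np.sort(z)[::-1]
    cssv = np.cumsum(z_sorted)
    k = np.arange(1, len(z) + 1)
    support = 1 + k * z_sorted > cssv        # coordinates that stay nonzero
    k_z = k[support][-1]
    tau = (cssv[support][-1] - 1) / k_z      # threshold subtracted from z
    return np.maximum(z - tau, 0.0)

print(sparsemax([1.0, 0.0, -1.0]))  # puts all mass on the largest score
```

Unlike softmax, which assigns every input some attention, sparsemax can zero out inputs entirely; the question raised by the paper is whether those zeros actually pick out influential inputs.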
  • Pimentel, Tiago; Meister, Clara Isabel; Cotterell, Ryan (2022)
    arXiv
    A good automatic evaluation metric for language generation ideally correlates highly with human judgements of text quality. Yet, there is a dearth of such metrics, which inhibits the rapid and efficient progress of language generators. One exception is the recently proposed Mauve. In theory, Mauve measures an information-theoretic divergence between two probability distributions over strings: one representing the language generator under evaluation; the other representing the true natural language distribution. Mauve's authors argue that its success comes from the qualitative properties of their proposed divergence. Yet in practice, as this divergence is uncomputable, Mauve approximates it by measuring the divergence between multinomial distributions over clusters instead, where cluster assignments are attained by grouping strings based on a pre-trained language model's embeddings. As we show, however, this is not a tight approximation -- in either theory or practice. This begs the question: why does Mauve work so well? In this work, we show that Mauve was right for the wrong reasons, and that its newly proposed divergence is not necessary for its high performance. In fact, classical divergences paired with its proposed cluster-based approximation may actually serve as better evaluation metrics. We finish the paper with a probing analysis; this analysis leads us to conclude that -- by encoding syntactic- and coherence-level features of text, while ignoring surface-level features -- such cluster-based substitutes to string distributions may simply be better for evaluating state-of-the-art language generators.
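The cluster-based approximation at issue can be illustrated with a short sketch: given cluster assignments for human-written and model-generated text (invented here, standing in for k-means labels over language-model embeddings), compare the two multinomial cluster distributions with a classical divergence such as KL:

```python
import numpy as np

K = 4  # number of clusters (illustrative)

# Cluster IDs per text sample, as if assigned by k-means over embeddings.
human_clusters = np.array([0, 0, 1, 2, 2, 2, 3, 1, 0, 2])
model_clusters = np.array([0, 1, 1, 1, 2, 3, 3, 3, 1, 1])

def cluster_dist(assignments, K, alpha=1.0):
    # Add-alpha smoothing keeps the divergence finite for empty clusters.
    counts = np.bincount(assignments, minlength=K).astype(float) + alpha
    return counts / counts.sum()

p = cluster_dist(human_clusters, K)
q = cluster_dist(model_clusters, K)

# A classical divergence (KL) over the cluster multinomials -- the kind of
# substitute the paper argues can evaluate generators as well as Mauve.
kl = float(np.sum(p * np.log(p / q)))
print(f"KL(p || q) = {kl:.4f}")
```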
  • Meister, Clara Isabel (2024)
    Natural language generation (NLG) systems overwhelmingly rely on probabilistic models to approximate task-specific distributions over natural language strings. The majority of these models are auto-regressive and locally normalized, producing probability distributions over the next token given prior context. At inference time, the user must decide how to use such a distribution to generate natural language strings. Beam search is a widely used approximation algorithm for finding the highest probability string according to such distributions. It has been the go-to tool for decoding probabilistic models in numerous generation tasks, e.g., machine translation, abstractive summarization and constrained decoding. Yet at times, it exhibits notable variability in output quality, computational inefficiency, and lack of diversity. This thesis first aims to better understand beam search's success. We identify an inductive bias inherent in beam search, leading us to propose that its success is due to its implicit enforcement of uniform information density---a property linked to psycholinguistic theories---in generated text. We then address three limitations of standard beam search: its inefficiency, its tendency to produce sets with low diversity, and its deterministic nature. To address the first limitation, we introduce a more efficient variant of beam search, which frames the algorithm as an agenda-based process and employs best-first prioritization; this approach reduces computational cost by eliminating unnecessary path exploration. We next show how each generation step in beam search can be formulated as a subdeterminant maximization problem, and how this framing allows us to optimize for set-level characteristics (such as diversity) in a principled fashion. 
We further develop a stochastic generalization of beam search, which facilitates the generation of diverse samples and enables the construction of statistically consistent estimators for expectations under the model. We provide empirical evidence for the effectiveness of these new techniques in improving the efficiency, diversity, and adaptability of beam search as a decoding algorithm for NLG tasks. In the last part of this thesis, we use our insights about the properties of effective decoding strategies to propose a new decoding algorithm---one that is designed to produce text that mimics information content patterns in human communication. We observe that this algorithm leads to high-quality text, consistently reducing the degenerate repetitions that probabilistic language generators are known to occasionally produce under other decoding strategies. The methods proposed herein offer valuable tools for researchers and practitioners in their effort to create better probabilistic language generators.
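The decoding setting the thesis studies can be made concrete with a toy example: standard beam search over a locally normalised model that supplies p(next token | prefix). Everything below (the bigram-style table and its tokens) is invented for illustration:

```python
import math

EOS = "</s>"

def toy_model(prefix):
    # A fixed next-token distribution keyed on the last token (illustrative).
    table = {
        "<s>": {"a": 0.6, "b": 0.4},
        "a":   {"b": 0.5, EOS: 0.5},
        "b":   {"a": 0.2, EOS: 0.8},
    }
    return table[prefix[-1]]

def beam_search(beam_width=2, max_len=5):
    beams = [(["<s>"], 0.0)]            # (prefix, log-probability)
    finished = []
    for _ in range(max_len):
        candidates = []
        for prefix, score in beams:
            for tok, p in toy_model(prefix).items():
                cand = (prefix + [tok], score + math.log(p))
                (finished if tok == EOS else candidates).append(cand)
        # Keep only the beam_width highest-scoring unfinished prefixes.
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_width]
        if not beams:
            break
    return max(finished, key=lambda c: c[1])

best, logp = beam_search()
print(best, logp)
```

Here the greedy continuation "a" loses to "b": p("b" then EOS) = 0.32 beats p("a" then EOS) = 0.30, which is exactly the kind of search error a width-1 beam would make.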
  • Causal Estimation of Tokenisation Bias
    Item type: Conference Paper
    Lesci, Pietro; Meister, Clara Isabel; Hofmann, Thomas; et al. (2025)
    Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
    Modern language models are typically trained over subword sequences, but ultimately define probabilities over character-strings. Ideally, the choice of the tokeniser—which maps character-strings to subwords—should not affect the probability assigned to the underlying character-string; in practice, it does. We define this mismatch as tokenisation bias. In this work, we quantify one particular type of tokenisation bias: the effect of including a subword (e.g., ⟨hello⟩) in a tokeniser’s vocabulary, or excluding it, on the probability a trained model assigns to the corresponding characters (i.e., “hello”). Estimating this effect is challenging because each model is trained with only one tokeniser. We address this by framing tokenisation bias as a causal effect and estimating it using the regression discontinuity design. Specifically, we exploit the fact that tokenisation algorithms rank subwords and add the first K to a tokeniser’s vocabulary, where K is an arbitrary cutoff point. As such, we can estimate a causal effect by comparing similar subwords around this cutoff. Experimentally, we find that tokenisation consistently affects models’ outputs across scales, vocabularies, and tokenisers. Notably, a subword’s presence in a small model’s vocabulary may increase its characters’ probability by up to 17 times, highlighting tokenisation as a key design choice in language modelling.
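The regression discontinuity idea can be sketched with synthetic numbers: give subwords a smooth outcome trend in rank, add a jump at the vocabulary cutoff K, and recover the jump by comparing outcomes just below and just above the cutoff. A toy numpy version (a naive difference-in-means; real RDD analyses fit local regressions on each side):

```python
import numpy as np

rng = np.random.default_rng(0)

K = 100                                  # vocabulary cutoff (arbitrary)
rank = np.arange(1, 201)                 # subwords ranked by the tokeniser
in_vocab = rank <= K                     # treatment: subword enters the vocab

# Synthetic outcome: a stand-in for the log-probability a model assigns to
# the subword's characters -- a smooth trend in rank plus a 0.5 jump at K.
outcome = -0.01 * rank + 0.5 * in_vocab + rng.normal(0, 0.05, size=200)

# Naive local RDD estimate: difference in means within a small bandwidth
# around the cutoff.
bw = 10
left = outcome[(rank > K - bw) & (rank <= K)].mean()
right = outcome[(rank > K) & (rank <= K + bw)].mean()
effect = left - right
print(f"estimated effect at the cutoff: {effect:.2f}")
```

Because subwords just either side of an arbitrary cutoff are otherwise similar, the jump isolates the causal effect of vocabulary membership, which is the paper's identification strategy.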
  • On Homophony and Rényi Entropy
    Item type: Conference Paper
    Pimentel, Tiago; Meister, Clara Isabel; Teufel, Simone; et al. (2021)
    Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing
    Homophony’s widespread presence in natural languages is a controversial topic. Recent theories of language optimality have tried to justify its prevalence, despite its negative effects on cognitive processing time, e.g., Piantadosi et al. (2012) argued homophony enables the reuse of efficient wordforms and is thus beneficial for languages. This hypothesis has recently been challenged by Trott and Bergen (2020), who posit that good wordforms are more often homophonous simply because they are more phonotactically probable. In this paper, we join in on the debate. We first propose a new information-theoretic quantification of a language’s homophony: the sample Rényi entropy. Then, we use this quantification to revisit Trott and Bergen’s claims. While their point is theoretically sound, a specific methodological issue in their experiments raises doubts about their results. After addressing this issue, we find no clear pressure either towards or against homophony—a much more nuanced result than either Piantadosi et al.’s or Trott and Bergen’s findings.
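The sample Rényi entropy proposed above, H_α(p) = log(Σ_i p_i^α) / (1 − α), is easy to compute from observed counts. A small illustrative implementation (the example frequencies are invented; the paper estimates this quantity from lexicon data):

```python
import numpy as np

def renyi_entropy(counts, alpha=2.0):
    """Sample Renyi entropy H_alpha(p) = log(sum_i p_i^alpha) / (1 - alpha),
    computed from observed counts; alpha -> 1 recovers Shannon entropy."""
    p = np.asarray(counts, dtype=float)
    p = p / p.sum()
    if np.isclose(alpha, 1.0):
        return float(-np.sum(p * np.log(p)))
    return float(np.log(np.sum(p ** alpha)) / (1.0 - alpha))

# Invented wordform frequencies: a homophonous form pools the counts of its
# meanings, concentrating the distribution and lowering H_alpha.
print(renyi_entropy([10, 10, 10, 10]))   # uniform: maximal entropy
print(renyi_entropy([25, 5, 5, 5]))      # concentrated: lower entropy
```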
  • Wilcox, Ethan Gotlieb; Pimentel, Tiago; Meister, Clara Isabel; et al. (2024)
    Cognition
    Regressions, or backward saccades, are common during reading, accounting for between 5% and 20% of all saccades. And yet, relatively little is known about what causes them. We provide an information-theoretic operationalization for two previous qualitative hypotheses about regressions, which we dub reactivation and reanalysis. We argue that these hypotheses make different predictions about the pointwise mutual information or PMI between a regression's source and target. Intuitively, the PMI between two words measures how much more (or less) likely one word is to be present given the other. On one hand, the reactivation hypothesis predicts that regressions occur between words that are associated, implying high positive values of PMI. On the other hand, the reanalysis hypothesis predicts that regressions should occur between words that are not associated with each other, implying negative, low values of PMI. As a second theoretical contribution, we expand on previous theories by considering not only PMI but also expected values of PMI, E[PMI], where the expectation is taken over all possible realizations of the regression's target. The rationale for this is that language processing involves making inferences under uncertainty, and readers may be uncertain about what they have read, especially if a previous word was skipped. To test both theories, we use contemporary language models to estimate PMI-based statistics over word pairs in three corpora of eye tracking data in English, as well as in six languages across three language families (Indo-European, Uralic, and Turkic). Our results are consistent across languages and models tested: Positive values of PMI and E[PMI] consistently help to predict the patterns of regressions during reading, whereas negative values of PMI and E[PMI] do not. 
Our information-theoretic interpretation increases the predictive scope of both theories and our studies present the first systematic crosslinguistic analysis of regressions in the literature. Our results support the reactivation hypothesis and, more broadly, they expand the number of language processing behaviors that can be linked to information-theoretic principles.
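PMI itself is straightforward to compute from co-occurrence counts. A toy sketch (the word pairs and counts are invented; the studies above estimate these probabilities with language models over eye-tracking corpora):

```python
import math
from collections import Counter

# Invented source-target word pairs, as if collected from regressions.
pairs = ["the dog", "the cat", "dog barks", "cat sleeps", "the dog"]

pair_counts = Counter(pairs)
word_counts = Counter(w for pair in pairs for w in pair.split())
n_pairs = len(pairs)
n_words = sum(word_counts.values())

def pmi(x, y):
    # PMI(x, y) = log[ p(x, y) / (p(x) p(y)) ]: positive when x and y
    # co-occur more often than chance (the reactivation regime), negative
    # when they co-occur less often (the reanalysis regime).
    p_xy = pair_counts[f"{x} {y}"] / n_pairs
    p_x = word_counts[x] / n_words
    p_y = word_counts[y] / n_words
    return math.log(p_xy / (p_x * p_y))

print(f"PMI(the, dog) = {pmi('the', 'dog'):.2f}")
```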
  • Tokenization and the Noiseless Channel
    Item type: Conference Paper
    Zouhar, Vilém; Meister, Clara Isabel; Gastaldi, Juan Luis; et al. (2023)
    Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics, Volume 1: Long Papers
    Subword tokenization is a key part of most NLP pipelines. However, little is known about why some tokenizer and hyperparameter combinations lead to improved downstream model performance over others. We propose that good tokenizers lead to efficient channel usage, where the channel is the means by which some input is conveyed to the model and efficiency can be quantified in information-theoretic terms as the ratio of the Shannon entropy to the maximum entropy of the subword distribution. Nevertheless, an optimal encoding according to Shannon entropy assigns extremely long codes to low-frequency subwords and very short codes to high-frequency subwords. Defining efficiency in terms of Rényi entropy, on the other hand, penalizes distributions with either very high- or very low-frequency subwords. We posit that (1) extremely high-frequency subwords are problematic because their meaning is not distinct and (2) that low-frequency subwords may not appear frequently enough for their meaning to be learned properly; encodings that induce unigram distributions with either can harm model performance. In machine translation, we find that across multiple tokenizers, the Rényi entropy has a very strong correlation with BLEU: 0.82 in comparison to just -0.30 for compressed length.
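The efficiency notion contrasted above can be computed directly. The sketch below compares the Rényi efficiency (entropy divided by the maximum entropy, log |V|) of two invented subword unigram distributions; the choice α = 2.5 here is an arbitrary illustrative value, not necessarily the paper's setting:

```python
import math

def shannon_entropy(probs):
    return -sum(p * math.log(p) for p in probs if p > 0)

def renyi_entropy(probs, alpha):
    # Renyi entropy: log(sum_i p_i^alpha) / (1 - alpha); alpha -> 1 gives
    # Shannon entropy.
    if math.isclose(alpha, 1.0):
        return shannon_entropy(probs)
    return math.log(sum(p ** alpha for p in probs)) / (1.0 - alpha)

# Illustrative subword unigram distributions for two hypothetical tokenisers
# over a 4-type vocabulary. Efficiency = entropy / max entropy (log |V|).
balanced = [0.3, 0.3, 0.2, 0.2]
skewed = [0.85, 0.05, 0.05, 0.05]   # one extremely high-frequency subword

max_h = math.log(len(balanced))
for name, dist in [("balanced", balanced), ("skewed", skewed)]:
    eff = renyi_entropy(dist, alpha=2.5) / max_h
    print(f"{name}: Renyi-2.5 efficiency = {eff:.2f}")
```

The skewed tokeniser, dominated by one extremely frequent subword, scores much lower, matching the abstract's intuition that such distributions harm downstream performance.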