Mario Giulianelli
Publications (1 - 10 of 12)
- On the Proper Treatment of Tokenization in Psycholinguistics
  Item type: Conference Paper
  Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing
  Giulianelli, Mario; Malagutti, Luca; Gastaldi, Juan Luis; et al. (2024)
  Language models are widely used in computational psycholinguistics to test theories that relate the negative log probability (the surprisal) of a region of interest (a substring of characters) under a language model to the cognitive cost readers experience, as operationalized, for example, by gaze duration on the region. However, the application of modern language models to psycholinguistic studies is complicated by the practice of using tokenization as an intermediate step in training a model. Doing so results in a language model over *token* strings rather than one over character strings. Vexingly, regions of interest are generally misaligned with these token strings. The paper argues that token-level language models should be (approximately) marginalized into character-level language models before they are used in psycholinguistic studies to compute the surprisal of a region of interest; the marginalized character-level language model can then be used to compute the surprisal of an arbitrary character substring, which we term a focal area, that the experimenter may wish to use as a predictor. Our proposal of marginalizing a token-level model into a character-level one solves this misalignment issue independently of the tokenization scheme. Empirically, we discover various focal areas whose surprisal is a better psychometric predictor than the surprisal of the region of interest itself.

- Generalized Measures of Anticipation and Responsivity in Online Language Processing
  Item type: Conference Paper
  Findings of the Association for Computational Linguistics: EMNLP 2024
  Giulianelli, Mario; Opedal, Andreas; Cotterell, Ryan (2024)
  We introduce a generalization of classic information-theoretic measures of predictive uncertainty in online language processing, based on the simulation of expected continuations of incremental linguistic contexts. Our framework provides a formal definition of anticipatory and responsive measures, and it equips experimenters with the tools to define new, more expressive measures beyond standard next-symbol entropy and surprisal. While extracting these standard quantities from language models is convenient, we demonstrate that using Monte Carlo simulation to estimate alternative responsive and anticipatory measures pays off empirically: new special cases of our generalized formula exhibit enhanced predictive power compared to surprisal for human cloze completion probability as well as ELAN, LAN, and N400 amplitudes, and greater complementarity with surprisal in predicting reading times.

- A Spatio-Temporal Point Process for Fine-Grained Modeling of Reading Behavior
  Item type: Conference Paper
  Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
  Re, Francesco Ignazio; Opedal, Andreas; Manaiev, Glib; et al. (2025)
  Reading is a process that unfolds across space and time, alternating between fixations, where a reader focuses on a specific point in space, and saccades, where a reader rapidly shifts their focus to a new point. An ansatz of psycholinguistics is that modeling a reader's fixations and saccades yields insight into their online sentence processing. However, standard approaches to such modeling rely on aggregated eye-tracking measurements and models that impose strong assumptions, ignoring much of the spatio-temporal dynamics that occur during reading. In this paper, we propose a more general probabilistic model of reading behavior, based on a marked spatio-temporal point process, that captures not only how long fixations last, but also where they land in space and when they take place in time. The saccades are modeled using a Hawkes process, which captures how each fixation excites the probability of a new fixation occurring near it in time and space. The duration of fixation events is modeled as a function of fixation-specific predictors convolved across time, thus capturing spillover effects. Empirically, our Hawkes process model exhibits a better fit to human saccades than baselines. With respect to fixation durations, we observe that incorporating contextual surprisal as a predictor results in only a marginal improvement in the model's predictive accuracy. This finding suggests that surprisal theory struggles to explain fine-grained eye movements.

- A taxonomy and review of generalization research in NLP
  Item type: Journal Article
  Nature Machine Intelligence
  Hupkes, Dieuwke; Giulianelli, Mario; Dankers, Verna; et al. (2023)
  The ability to generalize well is one of the primary desiderata for models of natural language processing (NLP), but what ‘good generalization’ entails and how it should be evaluated is not well understood. In this Analysis we present a taxonomy for characterizing and understanding generalization research in NLP. The proposed taxonomy is based on an extensive literature review and contains five axes along which generalization studies can differ: their main motivation, the type of generalization they aim to solve, the type of data shift they consider, the source by which this data shift originated, and the locus of the shift within the NLP modelling pipeline. We use our taxonomy to classify over 700 experiments, and we use the results to present an in-depth analysis that maps out the current state of generalization research in NLP and makes recommendations for which areas deserve attention in the future.

- Information Locality as an Inductive Bias for Neural Language Models
  Item type: Conference Paper
  Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
  Someya, Taiga; Svete, Anej; DuSell, Brian; et al. (2025)
  Inductive biases are inherent in every machine learning system, shaping how models generalize from finite data. In the case of neural language models (LMs), debates persist as to whether these biases align with or diverge from human processing constraints. To address this issue, we propose a quantitative framework that allows for controlled investigations into the nature of these biases. Within our framework, we introduce m-local entropy—an information-theoretic measure derived from average lossy-context surprisal—that captures the local uncertainty of a language by quantifying how effectively the preceding symbols disambiguate the next symbol. In experiments on both perturbed natural language corpora and languages defined by probabilistic finite-state automata (PFSA), we show that languages with higher m-local entropy are more difficult for Transformer and LSTM LMs to learn. These results suggest that neural LMs, much like humans, are highly sensitive to the local statistical structure of a language.

- From Language Models over Tokens to Language Models over Characters
  Item type: Conference Paper
  Proceedings of the 42nd International Conference on Machine Learning (Proceedings of Machine Learning Research)
  Vieira, Tim; LeBrun, Benjamin; Giulianelli, Mario; et al. (2025)
  Modern language models are internally—and mathematically—distributions over token strings rather than character strings, posing numerous challenges for programmers building user applications on top of them. For example, if a prompt is specified as a character string, it must be tokenized before passing it to the token-level language model. Thus, the tokenizer and consequent processing are very sensitive to the specification of the prompt (e.g., whether the prompt ends with a space or not). This paper presents algorithms for converting token-level language models to character-level ones. We present both exact and approximate algorithms. In the empirical portion of the paper, we benchmark the practical runtime and approximation quality. Across four publicly available language models, we find that—even with a small computation budget—our method is able to accurately approximate the character-level distribution at reasonably fast speeds, and that a significant improvement in the language model's compression rate (bits/byte) is achieved.

- Towards a Similarity-adjusted Surprisal Theory
  Item type: Conference Paper
  Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing
  Meister, Clara Isabel; Giulianelli, Mario; Pimentel, Tiago (2024)
  Surprisal theory posits that the cognitive effort required to comprehend a word is determined by its contextual predictability, quantified as surprisal. Traditionally, surprisal theory treats words as distinct entities, overlooking any potential similarity between them. Giulianelli et al. (2023) address this limitation by introducing information value, a measure of predictability designed to account for similarities between communicative units. Our work leverages Ricotta and Szeidl’s (2006) diversity index to extend surprisal into a metric that we term similarity-adjusted surprisal, exposing a mathematical relationship between surprisal and information value. Similarity-adjusted surprisal aligns with information value when considering graded similarities and reduces to standard surprisal when words are treated as distinct. Experimental results with reading time data indicate that similarity-adjusted surprisal adds predictive power beyond standard surprisal for certain datasets, suggesting it serves as a complementary measure of comprehension effort.

- Beyond Perplexity: Examining Temporal Generalization in Large Language Models via Definition Generation
  Item type: Journal Article
  Computational Linguistics in the Netherlands Journal
  Luden, Iris; Giulianelli, Mario; Fernández, Raquel (2024)
  The advent of large language models (LLMs) has significantly improved performance across various Natural Language Processing tasks. However, the performance of LLMs has been shown to deteriorate over time, indicating a lack of temporal generalization. To date, performance deterioration of LLMs is primarily attributed to factual changes in the real world over time. However, not only the facts of the world but also the language we use to describe it constantly changes. Recent studies have indicated a relationship between performance deterioration and semantic change, typically measured using perplexity scores and relative performance on downstream tasks. Yet, perplexity and accuracy do not explain the effects of temporally shifted data on LLMs in practice. In this work, we propose to assess the lexico-semantic temporal generalization of a language model by exploiting the task of contextualized word definition generation. This in-depth semantic assessment enables interpretable insights into the possible mistakes a model may make due to meaning shift, and can be used to complement more coarse-grained measures like perplexity scores. To assess how semantic change impacts performance, we design the task by differentiating between semantically stable, changing, and emerging target words, and experiment with T5-base, fine-tuned for contextualized definition generation. Our results indicate that (i) the model’s performance deteriorates for the task of contextualized word definition generation, (ii) the performance deteriorates more for semantically changing words compared to semantically stable words, (iii) the model exhibits significantly lower performance and potential bias for emerging words, and (iv) the performance does not correlate with cross-entropy or (pseudo-)perplexity scores. Overall, our results show that definition generation can be a promising task to assess a model’s capacity for temporal generalization with respect to semantic change.

- Incremental Alternative Sampling as a Lens into the Temporal and Representational Resolution of Linguistic Prediction
  Item type: Working Paper
  PsyArXiv
  Giulianelli, Mario; Wallbridge, Sarenne; Cotterell, Ryan; et al. (2024)
  This study presents a new model of processing difficulty rooted in resource allocation theory, Incremental Alternative Sampling (IAS). Differential difficulty for a linguistic unit is estimated with respect to a set of plausible alternatives. Compared to a surprisal-based model, it prescribes a more efficient use of a comprehender's predicted continuations of partial linguistic stimuli thanks to (i) an expressive representation function that captures different levels of linguistic processing and (ii) the bootstrapping of long-horizon prediction error. Our results show that IAS estimates of processing difficulty, computed with autoregressive language models via Monte Carlo estimation, have greater predictive power than surprisal extracted from the same language models for most neural and behavioural responses under analysis—including reading times, event-related brain potentials, cloze and predictability judgements. Perhaps more importantly, IAS estimates provide insight into the nature of the predictive mechanisms that generate those responses during language comprehension. Variability in neural and behavioural responses is well explained by different combinations of the representational and temporal resolution of prediction. Processing difficulty calculated at varying representational domains reflects known relations to lexical, constructional, and structural levels of linguistic processing, and forecast horizons are determined by a combination of experimental task setup and naturalness of the stimulus. Beyond enriching psycholinguistic models, IAS can also provide insights into the information processing mechanisms of computational language models. Our analysis of next-word surprisal through the lens of IAS reveals that, despite the metric's seemingly narrow focus on the upcoming word, language model surprisal implicitly captures anticipatory processing of multiple future lexical items.

- Surprise! Uniform Information Density Isn’t the Whole Story: Predicting Surprisal Contours in Long-form Discourse
  Item type: Conference Paper
  Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing
  Tsipidi, Eleftheria; Nowak, Franz; Cotterell, Ryan; et al. (2024)
  The Uniform Information Density (UID) hypothesis posits that speakers tend to distribute information evenly across linguistic units to achieve efficient communication. Of course, information rate in texts and discourses is not perfectly uniform. While these fluctuations can be viewed as theoretically uninteresting noise on top of a uniform target, another explanation is that UID is not the only functional pressure regulating information content in a language. Speakers may also seek to maintain interest, adhere to writing conventions, and build compelling arguments. In this paper, we propose one such functional pressure: namely, that speakers modulate information rate based on location within a hierarchically structured model of discourse. We term this the Structured Context Hypothesis and test it by predicting the surprisal contours of naturally occurring discourses extracted from large language models using predictors derived from discourse structure. We find that hierarchical predictors are significant predictors of a discourse’s information contour and that deeply nested hierarchical predictors are more predictive than shallow ones. This work takes an initial step beyond UID to propose testable hypotheses for why the information rate fluctuates in predictable ways.
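Surprisal, the quantity running through most of the abstracts above, is just the negative log probability of a unit given its context. A minimal self-contained sketch: the papers use neural language models, but a toy bigram model with add-one smoothing (an illustrative stand-in, not any paper's actual setup) makes the computation concrete.

```python
import math
from collections import defaultdict

def train_bigram(corpus):
    """Count bigram transitions in a toy corpus of pre-tokenized sentences."""
    counts = defaultdict(lambda: defaultdict(int))
    for sentence in corpus:
        tokens = ["<s>"] + sentence
        for prev, curr in zip(tokens, tokens[1:]):
            counts[prev][curr] += 1
    return counts

def surprisal(counts, prev, curr):
    """Surprisal in bits: -log2 p(curr | prev), with add-one smoothing."""
    vocab = {w for ctx in counts.values() for w in ctx} | set(counts)
    total = sum(counts[prev].values()) + len(vocab)
    p = (counts[prev][curr] + 1) / total
    return -math.log2(p)

# Illustrative corpus: "cat" follows "the" more often than "dog" does.
corpus = [["the", "cat", "sat"], ["the", "dog", "sat"], ["the", "cat", "ran"]]
counts = train_bigram(corpus)
# More predictable continuations receive lower surprisal.
assert surprisal(counts, "the", "cat") < surprisal(counts, "the", "dog")
```

The papers replace the bigram distribution with an autoregressive neural language model, but the definition of surprisal is unchanged.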
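The similarity-adjusted surprisal abstract above states that the metric reduces to standard surprisal when words are treated as distinct and credits graded similarities otherwise. One plausible reading of that description (an assumption on my part, not the paper's exact Ricotta-and-Szeidl-based definition) is a similarity-weighted probability mass; all words, probabilities, and similarity functions below are illustrative.

```python
import math

def similarity_adjusted_surprisal(word, p_next, sim):
    """-log2 of the similarity-weighted probability mass assigned to `word`.
    ASSUMED functional form: with a 0/1 identity similarity it reduces to
    ordinary surprisal, matching the reduction described in the abstract."""
    return -math.log2(sum(sim(word, w) * pw for w, pw in p_next.items()))

# Toy next-word distribution (illustrative numbers).
p_next = {"sofa": 0.5, "couch": 0.3, "table": 0.2}
distinct = lambda a, b: 1.0 if a == b else 0.0
synonyms = lambda a, b: 1.0 if a == b or {a, b} == {"sofa", "couch"} else 0.0

# Treating words as fully distinct recovers standard surprisal of "couch".
assert abs(similarity_adjusted_surprisal("couch", p_next, distinct)
           - (-math.log2(0.3))) < 1e-9
# Crediting the near-synonym "sofa" lowers the adjusted surprisal.
assert (similarity_adjusted_surprisal("couch", p_next, synonyms)
        < similarity_adjusted_surprisal("couch", p_next, distinct))
```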
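The information-locality abstract above describes m-local entropy as capturing how effectively the preceding symbols disambiguate the next one. The paper derives its measure from average lossy-context surprisal; the sketch below instead computes the closely related empirical conditional entropy of the next symbol given the previous m symbols, which is a simplifying proxy, not the paper's estimator.

```python
import math
from collections import Counter

def m_local_entropy(text, m):
    """Empirical conditional entropy (bits) of the next symbol given the
    previous m symbols. NOTE: an illustrative proxy for the paper's
    lossy-context-surprisal-based definition."""
    ctx_counts = Counter()
    pair_counts = Counter()
    for i in range(m, len(text)):
        ctx = text[i - m:i]
        ctx_counts[ctx] += 1
        pair_counts[(ctx, text[i])] += 1
    total = sum(ctx_counts.values())
    entropy = 0.0
    for (ctx, _), n in pair_counts.items():
        entropy -= (n / total) * math.log2(n / ctx_counts[ctx])
    return entropy

# One symbol of context fully determines the next in a periodic string.
assert m_local_entropy("abababababab", 1) == 0.0
# Mixed continuations after the same context leave residual uncertainty.
assert m_local_entropy("aabbaabbaabb", 1) > 0.0
```

Under the paper's finding, strings like the second one, with higher local uncertainty, would be harder for Transformer and LSTM LMs to learn.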
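The reading-behavior abstract above models saccades with a Hawkes process, in which each fixation excites the probability of another fixation nearby. The paper uses a marked spatio-temporal version; the sketch below shows only the classic one-dimensional temporal conditional intensity with an exponential kernel, and the parameter values are illustrative, not fitted ones from the paper.

```python
import math

def hawkes_intensity(t, history, mu=0.2, alpha=0.8, beta=1.5):
    """Conditional intensity of a 1-D Hawkes process:
    lambda(t) = mu + sum over past events t_i < t of alpha * exp(-beta * (t - t_i)).
    mu is the background rate; alpha and beta control excitation strength
    and decay (all values here are illustrative)."""
    return mu + sum(
        alpha * math.exp(-beta * (t - t_i)) for t_i in history if t_i < t
    )

history = [1.0, 1.2, 1.3]  # toy past fixation onset times
# Recent events excite the process: intensity is higher right after a
# burst of fixations than long after it, and decays back toward mu.
assert hawkes_intensity(1.35, history) > hawkes_intensity(5.0, history) > 0.2
```

With a mark for fixation location added, the same self-excitation mechanism captures proximity in space as well as in time.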