Nora Hollenstein
Publications 1 - 7 of 7
- Evaluating Bayes Error Estimators on Real-World Datasets with FeeBee
  Item type: Conference Paper
  Proceedings of the Neural Information Processing Systems Track on Datasets and Benchmarks 1 (NeurIPS Datasets and Benchmarks 2021)
  Renggli, Cedric; Rimanic, Luka; Hollenstein, Nora; et al. (2021)
  The Bayes error rate (BER) is a fundamental concept in machine learning that quantifies the best possible accuracy any classifier can achieve on a fixed probability distribution. Despite years of research on building estimators of lower and upper bounds for the BER, these were usually compared only on synthetic datasets with known probability distributions, leaving two key questions unanswered: (1) How well do they perform on realistic, non-synthetic datasets? (2) How practical are they? Answering these questions is not trivial. Apart from the obvious challenge of an unknown BER for real-world datasets, there are two main aspects any BER estimator needs to overcome in order to be applicable in real-world settings: (1) the computational and sample complexity, and (2) the sensitivity to and selection of hyper-parameters. In this work, we propose FeeBee, the first principled framework for analyzing and comparing BER estimators on modern real-world datasets with unknown probability distribution. We achieve this by injecting a controlled amount of label noise and performing multiple evaluations on a series of different noise levels, supported by a theoretical result which allows drawing conclusions about the evolution of the BER. By implementing and analyzing 7 multi-class BER estimators on 6 commonly used datasets from the computer vision and NLP domains, FeeBee allows a thorough study of these estimators, clearly identifying the strengths and weaknesses of each, while being easily deployable on any future BER estimator.
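The evaluation protocol sketched in the abstract, injecting a controlled amount of label noise and re-running an estimator at several noise levels, can be illustrated in a few lines of Python. The dataset, the uniform noise model, and the 1-nearest-neighbour held-out error used as a crude upper-bound proxy are assumptions for illustration only, not the FeeBee framework or its estimators.

```python
# Illustrative sketch of evaluating a BER-style estimator under controlled
# label noise (assumption: uniform label flipping and a 1-NN held-out error
# as a crude upper-bound proxy; this is not the actual FeeBee framework).
import numpy as np
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
X, y = load_digits(return_X_y=True)
n_classes = len(np.unique(y))

def flip_labels(labels, rho):
    """Replace a fraction rho of labels with uniformly random classes."""
    noisy = labels.copy()
    mask = rng.random(len(labels)) < rho
    noisy[mask] = rng.integers(0, n_classes, size=mask.sum())
    return noisy

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

for rho in [0.0, 0.1, 0.2, 0.4]:
    y_tr_noisy = flip_labels(y_tr, rho)
    y_te_noisy = flip_labels(y_te, rho)
    knn = KNeighborsClassifier(n_neighbors=1).fit(X_tr, y_tr_noisy)
    err = 1.0 - knn.score(X_te, y_te_noisy)  # upper-bound-style estimate
    print(f"noise rate {rho:.1f}: estimated error bound {err:.3f}")
```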
- Dynamic Human Evaluation for Relative Model Comparisons
  Item type: Conference Paper
  Proceedings of the Thirteenth Language Resources and Evaluation Conference
  Thorleiksdóttir, Thórhildur; Renggli, Cedric; Hollenstein, Nora; et al. (2022)
  Collecting human judgements is currently the most reliable evaluation method for natural language generation systems. Automatic metrics have shown flaws when used to measure quality aspects of generated text and have been shown to correlate poorly with human judgements. However, human evaluation is time- and cost-intensive, and there is no consensus on how to design and conduct human evaluation experiments, so streamlined approaches for efficiently collecting human judgements are needed. We therefore present a dynamic approach to measure the required number of human annotations when evaluating generated outputs in relative comparison settings. We propose an agent-based framework of human evaluation to assess multiple labelling strategies and methods to decide the better model in a simulation and a crowdsourcing case study. The main results indicate that a decision about the superior model can be made with high probability across different labelling strategies, where assigning a single random worker per task requires the least overall labelling effort and thus the least cost.
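A toy simulation conveys the core idea of deciding the better model with a bounded number of judgements: draw simulated worker votes one at a time and stop once a posterior probability crosses a threshold. The Bernoulli worker model, the Beta posterior, and the 0.95 threshold below are illustrative assumptions, not the agent-based framework evaluated in the paper.

```python
# Toy simulation: sequentially collect pairwise judgements (1 = model A wins)
# and stop once a Beta posterior says one model is better with high probability.
# The worker preference p_true and the 0.95 threshold are illustrative assumptions.
import numpy as np
from scipy.stats import beta

rng = np.random.default_rng(42)
p_true = 0.6        # assumed underlying preference for model A
threshold = 0.95    # required probability before declaring a winner

wins_a = 0
for n in range(1, 2001):
    wins_a += rng.random() < p_true
    wins_b = n - wins_a
    # P(p > 0.5) under a Beta(1 + wins_a, 1 + wins_b) posterior
    p_a_better = 1.0 - beta.cdf(0.5, 1 + wins_a, 1 + wins_b)
    if p_a_better > threshold or p_a_better < 1 - threshold:
        winner = "A" if p_a_better > 0.5 else "B"
        print(f"decided after {n} judgements: model {winner} "
              f"(P = {max(p_a_better, 1 - p_a_better):.3f})")
        break
```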
- Decoding EEG Brain Activity for Multi-Modal Natural Language Processing
  Item type: Journal Article
  Frontiers in Human Neuroscience
  Hollenstein, Nora; Renggli, Cedric; Glaus, Benjamin; et al. (2021)
  Until recently, human behavioral data from reading has mainly been of interest to researchers seeking to understand human cognition. However, these human language processing signals can also be beneficial in machine learning-based natural language processing tasks. Using EEG brain activity for this purpose remains largely unexplored. In this paper, we present the first large-scale study systematically analyzing the potential of EEG brain activity data for improving natural language processing tasks, with a special focus on which features of the signal are most beneficial. We present a multi-modal machine learning architecture that learns jointly from textual input as well as from EEG features. We find that filtering the EEG signals into frequency bands is more beneficial than using the broadband signal. Moreover, for a range of word embedding types, EEG data improves binary and ternary sentiment classification and outperforms multiple baselines. For more complex tasks such as relation detection, only the contextualized BERT embeddings outperform the baselines in our experiments, which calls for further research. Finally, EEG data proves particularly promising when limited training data is available.
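The multi-modal setup described above can be pictured as a small fusion model that concatenates a sentence representation with band-filtered EEG features before a classification head. The dimensions, fusion-by-concatenation, and layer sizes below are assumptions for illustration, not the architecture from the paper.

```python
# Minimal sketch of a multi-modal classifier over text embeddings and
# frequency-band EEG features (dimensions and fusion by concatenation
# are illustrative assumptions, not the published architecture).
import torch
import torch.nn as nn

class TextEEGClassifier(nn.Module):
    def __init__(self, text_dim=768, eeg_dim=105, hidden=128, n_classes=3):
        super().__init__()
        self.fuse = nn.Sequential(
            nn.Linear(text_dim + eeg_dim, hidden),
            nn.ReLU(),
            nn.Dropout(0.2),
            nn.Linear(hidden, n_classes),
        )

    def forward(self, text_emb, eeg_feats):
        # text_emb:  (batch, text_dim) sentence embedding
        # eeg_feats: (batch, eeg_dim) e.g. theta/alpha/beta/gamma band features
        return self.fuse(torch.cat([text_emb, eeg_feats], dim=-1))

model = TextEEGClassifier()
logits = model(torch.randn(4, 768), torch.randn(4, 105))
print(logits.shape)  # torch.Size([4, 3])
```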
- The ZuCo benchmark on cross-subject reading task classification with EEG and eye-tracking data
  Item type: Journal Article
  Frontiers in Psychology
  Hollenstein, Nora; Tröndle, Marius; Plomecka, Martyna; et al. (2023)
  We present a new machine learning benchmark for reading task classification with the goal of advancing EEG and eye-tracking research at the intersection of computational language processing and cognitive neuroscience. The benchmark task consists of a cross-subject classification to distinguish between two reading paradigms: normal reading and task-specific reading. The data for the benchmark is based on the Zurich Cognitive Language Processing Corpus (ZuCo 2.0), which provides simultaneous eye-tracking and EEG signals from natural reading of English sentences. The training dataset is publicly available, and we present a newly recorded hidden test set. We provide multiple solid baseline methods for this task and discuss future improvements. We release our code and provide an easy-to-use interface to evaluate new approaches with an accompanying public leaderboard: www.zuco-benchmark.com.
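Cross-subject evaluation of this kind is commonly approximated with grouped cross-validation that holds out whole subjects. The sketch below uses random placeholder features and a logistic-regression baseline; the actual data, features, and evaluation interface are provided at www.zuco-benchmark.com.

```python
# Sketch of a cross-subject baseline: leave whole subjects out via GroupKFold
# and classify normal vs. task-specific reading. Features and labels here are
# random placeholders standing in for per-trial EEG / eye-tracking features.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GroupKFold, cross_val_score

rng = np.random.default_rng(0)
n_subjects, trials_per_subject, n_features = 18, 40, 64

X = rng.normal(size=(n_subjects * trials_per_subject, n_features))
y = rng.integers(0, 2, size=len(X))                 # 0 = normal, 1 = task-specific
groups = np.repeat(np.arange(n_subjects), trials_per_subject)

clf = LogisticRegression(max_iter=1000)
scores = cross_val_score(clf, X, y, groups=groups, cv=GroupKFold(n_splits=6))
print("cross-subject accuracy:", scores.mean())
```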
- ZuCo 2.0: A Dataset of Physiological Recordings During Natural Reading and Annotation
  Item type: Conference Paper
  Proceedings of the 12th Conference on Language Resources and Evaluation (LREC 2020)
  Hollenstein, Nora; Troendle, Marius; Zhang, Ce; et al. (2020)
  We recorded and preprocessed ZuCo 2.0, a new dataset of simultaneous eye-tracking and electroencephalography during natural reading and during annotation. This corpus contains gaze and brain activity data for 739 English sentences: 349 in a normal reading paradigm and 390 in a task-specific paradigm, in which the 18 participants actively search for a semantic relation type in the given sentences as a linguistic annotation task. This new dataset complements ZuCo 1.0 by providing experiments designed to analyze the differences in cognitive processing between natural reading and annotation. The data is freely available here: https://osf.io/2urht/.
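One way to picture the per-sentence structure of such a corpus is a small record type holding the text, the reading paradigm, and the aligned gaze and EEG features. The field names and shapes below are hypothetical and do not reflect the released file format at the OSF link above.

```python
# Hypothetical record layout for one sentence of a combined gaze + EEG corpus.
# Field names and array shapes are illustrative assumptions, not the released format.
from dataclasses import dataclass
from typing import List
import numpy as np

@dataclass
class SentenceRecord:
    sentence: str                    # raw English sentence
    paradigm: str                    # "normal_reading" or "task_specific"
    words: List[str]                 # tokenised words
    fixation_durations: np.ndarray   # total fixation duration per word (ms)
    eeg_features: np.ndarray         # per-word EEG features, e.g. (n_words, n_channels)

record = SentenceRecord(
    sentence="The cat sat on the mat.",
    paradigm="normal_reading",
    words=["The", "cat", "sat", "on", "the", "mat", "."],
    fixation_durations=np.array([120.0, 210.0, 180.0, 95.0, 110.0, 230.0, 0.0]),
    eeg_features=np.zeros((7, 105)),
)
print(record.paradigm, len(record.words))
```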
- Ease.ML: A Lifecycle Management System for Machine Learning
  Item type: Conference Paper
  Proceedings of the Annual Conference on Innovative Data Systems Research (CIDR), 2021
  Aguilar Melgar, Leonel; Dao, David; Gan, Shaoduo; et al. (2021)
  We present Ease.ML, a lifecycle management system for machine learning (ML). Unlike many existing works, which focus on improving individual steps during the lifecycle of ML application development, Ease.ML focuses on managing and automating the entire lifecycle itself. We present the user scenarios that have motivated the development of Ease.ML; the eight-step Ease.ML process that covers the lifecycle of ML application development; the foundation of Ease.ML in terms of a probabilistic database model and its connection to information theory; and our lessons learned, which we hope can inspire future research.
- Multilingual language models predict human reading behavior
  Item type: Conference Paper
  Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics
  Hollenstein, Nora; Pirovano, Federico; Zhang, Ce; et al. (2021)
  We analyze whether large language models are able to predict patterns of human reading behavior. We compare the performance of language-specific and multilingual pretrained transformer models to predict reading time measures reflecting natural human sentence processing on Dutch, English, German, and Russian texts. This results in accurate models of human reading behavior, which indicates that transformer models implicitly encode relative importance in language in a way that is comparable to human processing mechanisms. We find that BERT and XLM models successfully predict a range of eye-tracking features. In a series of experiments, we analyze the cross-domain and cross-language abilities of these models and show how they reflect human sentence processing.
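A stripped-down variant of the prediction task: take contextual token representations from a pretrained multilingual model and regress a per-word reading-time measure on them. The checkpoint, the synthetic targets, and the frozen-feature ridge regression are assumptions for illustration; the paper's actual training and evaluation setup differs.

```python
# Illustrative sketch: regress a per-token reading-time measure on frozen
# contextual embeddings from a multilingual transformer. Checkpoint choice,
# synthetic targets, and the ridge regressor are assumptions for illustration.
import numpy as np
import torch
from sklearn.linear_model import Ridge
from transformers import AutoModel, AutoTokenizer

name = "bert-base-multilingual-cased"
tok = AutoTokenizer.from_pretrained(name)
model = AutoModel.from_pretrained(name)

sentences = ["The study analyses human reading behaviour.",
             "Les participants lisent des phrases naturelles."]
# Hypothetical per-token reading times (ms); real targets would come from eye tracking.
targets = [np.random.uniform(80, 300, size=len(tok.tokenize(s))) for s in sentences]

feats, y = [], []
with torch.no_grad():
    for s, t in zip(sentences, targets):
        enc = tok(s, return_tensors="pt")
        hidden = model(**enc).last_hidden_state[0, 1:-1]  # drop [CLS]/[SEP]
        feats.append(hidden.numpy())
        y.append(t)

reg = Ridge().fit(np.vstack(feats), np.concatenate(y))
print("fitted ridge on", len(np.concatenate(y)), "tokens")
```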