Show simple item record

dc.contributor.author
Amariucai, Theodor
dc.contributor.supervisor
Warstadt, Alexander
dc.contributor.supervisor
Cotterell, Ryan
dc.date.accessioned
2023-09-21T14:32:08Z
dc.date.available
2023-09-21T07:13:33Z
dc.date.available
2023-09-21T14:32:08Z
dc.date.issued
2023
dc.identifier.uri
http://hdl.handle.net/20.500.11850/632754
dc.identifier.doi
10.3929/ethz-b-000632754
dc.description.abstract
Language models are typically pretrained on text alone, which some argue limits an artificial learner's ability to grasp the complexities of human communication. In response, researchers have turned to multimodal input, which lets models capture the relationship between linguistic symbols (words) and the physical world. Guided by psycholinguistic research on symbol grounding and by recent developments in multimodal language model architectures, we investigate whether combining information from complementary modalities (text and vision) can improve purely textual capabilities such as grammaticality, natural language understanding, and data efficiency. So far, this hypothesis has only been tested in limited pretraining paradigms or in incidental ablation studies with restricted analyses. To this end, we build an open-source platform for multimodal input ablation that allows vision-and-language models to be pretrained in a controlled environment. Using this platform, we pretrain eight randomly initialized copies of a state-of-the-art model, FLAVA, under independently varied amounts of text and visual input subsampled from the Wikipedia-based Image Text (WIT) dataset. We show that during pretraining with up to 10M words and accompanying visual cues, FLAVA gains multimodal capabilities (as measured by an increase in MTR accuracy) but performs roughly the same on grammar-oriented tasks. At the larger text volume of 100M words with additional paired images, however, we observe a slight decrease in performance for the vision-and-language models, even as MTR accuracy keeps increasing. Given these results, and considering the small data scales at which we train our models, we deem it unlikely that cross-situational learning has had the opportunity to manifest itself. Although the vision-and-language models fail to outperform their text-only counterparts in this experimental setup, this conclusion may well change with more extensive hardware resources and with better architectures and techniques for integrating the two complementary modalities at training time. Further work is needed if multimodal pretraining is to be pursued as a means of explaining the data-efficiency gap between language models and humans, of helping artificial learners understand language in all the ways that people do (e.g., through embodied experiences or social cues), and of improving NLP in general.
en_US
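
As a rough illustration of the pretraining setup described in the abstract, the sketch below randomly initializes a FLAVA model and passes a single toy caption-image pair through it. It assumes the HuggingFace `transformers` implementation of FLAVA and a placeholder image; the thesis' actual ablation platform, WIT data pipeline, and pretraining objectives are not reproduced here.

```python
# Minimal sketch (not the thesis code): randomly initializing FLAVA with the
# HuggingFace `transformers` library. The checkpoint name "facebook/flava-full"
# is used only to fetch the tokenizer and image preprocessing; no pretrained
# model weights are loaded.
import torch
from PIL import Image
from transformers import FlavaConfig, FlavaModel, FlavaProcessor

config = FlavaConfig()          # default FLAVA architecture
model = FlavaModel(config)      # random initialization, as in the ablation study
model.eval()

processor = FlavaProcessor.from_pretrained("facebook/flava-full")

# One toy (caption, image) pair standing in for a WIT example.
caption = "A photograph of the Matterhorn at sunrise."
image = Image.new("RGB", (224, 224))  # placeholder image

inputs = processor(text=[caption], images=[image], return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# With both modalities present, FLAVA returns unimodal and fused embeddings.
print(outputs.text_embeddings.shape)        # (1, text_seq_len, hidden_size)
print(outputs.image_embeddings.shape)       # (1, num_patches + 1, hidden_size)
print(outputs.multimodal_embeddings.shape)  # (1, fused_seq_len, hidden_size)
```
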
dc.format
application/pdf
en_US
dc.language.iso
en
en_US
dc.publisher
ETH Zurich
en_US
dc.rights.uri
http://rightsstatements.org/page/InC-NC/1.0/
dc.title
Acquiring Linguistic Knowledge from Multimodal Input
en_US
dc.type
Master Thesis
dc.rights.license
In Copyright - Non-Commercial Use Permitted
dc.date.published
2023-09-21
ethz.size
93 p.
en_US
ethz.publication.place
Zurich
en_US
ethz.publication.status
published
en_US
ethz.leitzahl
ETH Zürich::00002 - ETH Zürich::00012 - Lehre und Forschung::00007 - Departemente::02150 - Dep. Informatik / Dep. of Computer Science::02661 - Institut für Maschinelles Lernen / Institute for Machine Learning::09682 - Cotterell, Ryan / Cotterell, Ryan
en_US
ethz.leitzahl.certified
ETH Zürich::00002 - ETH Zürich::00012 - Lehre und Forschung::00007 - Departemente::02150 - Dep. Informatik / Dep. of Computer Science::02661 - Institut für Maschinelles Lernen / Institute for Machine Learning::09682 - Cotterell, Ryan / Cotterell, Ryan
en_US
ethz.date.deposited
2023-09-21T07:13:33Z
ethz.source
FORM
ethz.eth
yes
en_US
ethz.availability
Open access
en_US
ethz.rosetta.installDate
2023-09-21T14:32:09Z
ethz.rosetta.lastUpdated
2024-02-03T03:58:05Z
ethz.rosetta.versionExported
true
ethz.COinS
ctx_ver=Z39.88-2004&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.atitle=Acquiring%20Linguistic%20Knowledge%20from%20Multimodal%20Input&rft.date=2023&rft.au=Amariucai,%20Theodor&rft.genre=unknown&rft.btitle=Acquiring%20Linguistic%20Knowledge%20from%20Multimodal%20Input