Abstractive Document Summarization in High and Low Resource Settings
OPEN ACCESS
Date
2020-05
Publication Type
Doctoral Thesis
ETH Bibliography
yes
Abstract
Automatic summarization aims to reduce an input document to a compressed version that captures only its salient parts. It is a topic with growing importance in today's age of information overflow.
There are two main types of automatic summarization. Extractive summarization only selects salient sentences from the input, while abstractive summarization generates a summary without explicitly reusing whole sentences, resulting in summaries that are often more fluent.
State-of-the-art approaches to abstractive summarization are data-driven, relying on the availability of large collections of articles paired with summaries. The pairs are typically constructed manually, a task that is costly and time-consuming. Furthermore, when targeting a slightly different domain or summary format, a new parallel dataset is often required. This heavy reliance on parallel resources limits the potential impact of abstractive summarization systems in society.
In this thesis, we consider the problem of abstractive summarization from two different perspectives: high-resource and low-resource summarization.
In the first part, we compare different methods for data-driven summarization, focusing specifically on the problem of generating long, abstractive summaries, such as an abstract for a scientific journal article. We discuss the difficulties that come with abstractive generation of long summaries and propose methods for alleviating them.
In the second part of this thesis, we develop low-resource methods for abstractive text rewriting, first focusing on individual sentences and then on whole summaries. Our methods do not rely on parallel data, but instead utilize raw non-parallel text collections.
Overall, this work takes a step towards data-driven abstractive summarization for the generation of long summaries, without having to rely on vast amounts of parallel, manually curated data.
Publication status
published
Contributors
Examiner: Hahnloser, Richard H.R.
Examiner: Volk, Martin
Examiner: Filippova, Katja
Publisher
ETH Zurich
Subject
Natural Language Processing; Artificial Intelligence; Machine Learning
Organisational unit
03774 - Hahnloser, Richard H.R. / Hahnloser, Richard H.R.