
Open access
Author
Date
2020-05
Type
- Doctoral Thesis
ETH Bibliography
yes
Abstract
Automatic summarization aims to reduce an input document to a compressed version that captures only its salient parts. It is a topic of growing importance in today's age of information overload.
There are two main types of automatic summarization. Extractive summarization only selects salient sentences from the input, while abstractive summarization generates a summary without explicitly re-using whole sentences, resulting in summaries that are often more fluent.
State-of-the-art approaches to abstractive summarization are data-driven, relying on the availability of large collections of articles paired with summaries. The pairs are typically constructed manually, a task that is costly and time-consuming. Furthermore, when targeting a slightly different domain or summary format, a new parallel dataset is often required. This heavy reliance on parallel resources limits the potential impact of abstractive summarization systems in society.
In this thesis, we consider the problem of abstractive summarization from two different perspectives: high-resource and low-resource summarization.
In the first part, we compare different methods for data-driven summarization, focusing specifically on the problem of generating long, abstractive summaries, such as an abstract for a scientific journal article. We discuss the difficulties that come with abstractive generation of long summaries and propose methods for alleviating them.
In the second part of this thesis, we develop low-resource methods for abstractive text rewriting, first focusing on individual sentences and then on whole summaries. Our methods do not rely on parallel data, but instead utilize raw non-parallel text collections.
Overall, this work makes a step towards data-driven abstractive summarization for the generation of long summaries, without having to rely on vast amounts of parallel, manually curated data.
Permanent link
https://doi.org/10.3929/ethz-b-000425533
Publication status
published
External links
Search print copy at ETH Library
Publisher
ETH Zurich
Subject
Natural Language Processing; Artificial Intelligence; Machine Learning
Organisational unit
03774 - Hahnloser, Richard H.R. / Hahnloser, Richard H.R.