Open access

Author

Date

2017Type

- Doctoral Thesis

ETH Bibliography

yes
Altmetrics

Abstract

Data summarization, a central challenge in machine learning, is the task of finding a representative subset of manageable size out of a large dataset. It has found numerous applications, including image summarization, document and corpus summarization, recommender systems, and non-parametric learning, to name a few. A general recipe to obtain a faithful summary is to turn the problem into selecting a subset of data elements optimizing a utility function that quantifies “representativeness” of the selected set.
Often times, the choice of utility functions used for summarization exhibit submodularity, a natural diminishing returns property. In words, submodularity implies that the added value of any element from the dataset decreases as we include more data points to the summary. Thus, the data summarization problem can be naturally reduced to that of a constrained submodular maximization, or a submodular cover problem.
Although, there are efficient centralized algorithms for the aforementioned problems, they are highly impractical for massive datasets, as sequentially selecting elements on a single machine is heavily constrained in terms of speed and memory. Hence, in order to solve the above submodular optimization problems at scale, we need to make use of MapReduce-style parallel computation models, or resort to streaming algorithms.
In this Thesis, we develop large scale algorithms for submodular summarization. In particular, we present a simple, parallel protocol, called GreeDi for distributed (not-necessarily monotone) submodular maximization subject to cardinality, and other general types of constraints, including matroid and knapsack constraints. In addition, we develop a distributed algorithm, DisCover, for the submodular cover problem, as well as a fast distributed algorithm, FastCover, that enables us to solve the more general problem of covering multiple submodular functions in one run of the algorithm.
We then consider the streaming setting, where at any point of time, the algorithm has access only to a small fraction of data stored in primary memory. We present a single pass streaming procedure, Streaming Local Search, for maximizing a (not-necessarily monotone) submodular function subject to a collection of independence systems and d knapsack constraints. Furthermore, we introduce the dynamic deletion- robust submodular maximization problem, and propose a resilient streaming algorithm, called Robust-Streaming, that is able to produce concise real-time summaries in the face of data deletion requested by users.
Lastly, as a natural complementary goal to the aforementioned works, we consider developing fast centralized algorithms for submodular maximization that can be integrated into distributed frameworks to provide even more scalable algorithms. In particular, we first develop a randomized linear-time algorithm, Stochastic-Greedy, for maximizing a monotone submodular function subject to a cardinality constraint. We then propose a practical and fast algorithm, Fantom, for maximizing a (not necessarily monotone) submodular function subject to the intersection of a p-system and d knapsacks constraints, and show how we can use Fantom to produce personalized summarization.
In addition to providing algorithms and theoretical analyses, we present extensive empirical evaluation of our approaches on several large-scale and real-world summarization problems. These include, but not limited to, summarizing 80 million Tiny Images, more than 45 million user visits from the Featured Tab of the Today Module on Yahoo! Front Page, finding dominating set in Friendster social network with more than 65 million nodes and 1.8 billion edges, and movie recommendation based on 20 million users’ ratings from 138,493 users of the MovieLens database. --> Data summarization, a central challenge in machine learning, is the task of finding a representative subset of manageable size out of a large dataset. It has found numerous applications, including image summarization, document and corpus summarization, recommender systems, and non-parametric learning, to name a few. A general recipe to obtain a faithful summary is to turn the problem into selecting a subset of data elements optimizing a utility function that quantifies “representativeness” of the selected set. Often times, the choice of utility functions used for summarization exhibit submodularity, a natural diminishing returns property. In words, submodularity implies that the added value of any element from the dataset decreases as we include more data points to the summary. Thus, the data summarization problem can be naturally reduced to that of a constrained submodular maximization, or a submodular cover problem. Although, there are efficient centralized algorithms for the aforementioned problems, they are highly impractical for massive datasets, as sequentially selecting elements on a single machine is heavily constrained in terms of speed and memory. Hence, in order to solve the above submodular optimization problems at scale, we need to make use of MapReduce-style parallel computation models, or resort to streaming algorithms. In this Thesis, we develop large scale algorithms for submodular summarization. In particular, we present a simple, parallel protocol, called GreeDi for distributed (not-necessarily monotone) submodular maximization subject to cardinality, and other general types of constraints, including matroid and knapsack constraints. In addition, we develop a distributed algorithm, DisCover, for the submodular cover problem, as well as a fast distributed algorithm, FastCover, that enables us to solve the more general problem of covering multiple submodular functions in one run of the algorithm. We then consider the streaming setting, where at any point of time, the algorithm has access only to a small fraction of data stored in primary memory. We present a single pass streaming procedure, Streaming Local Search, for maximizing a (not-necessarily monotone) submodular function subject to a collection of independence systems and d knapsack constraints. Furthermore, we introduce the dynamic deletion- robust submodular maximization problem, and propose a resilient streaming algorithm, called Robust-Streaming, that is able to produce concise real-time summaries in the face of data deletion requested by users. Lastly, as a natural complementary goal to the aforementioned works, we consider developing fast centralized algorithms for submodular maximization that can be integrated into distributed frameworks to provide even more scalable algorithms. In particular, we first develop a randomized linear-time algorithm, Stochastic-Greedy, for maximizing a monotone submodular function subject to a cardinality constraint. We then propose a practical and fast algorithm, Fantom, for maximizing a (not necessarily monotone) submodular function subject to the intersection of a p-system and d knapsacks constraints, and show how we can use Fantom to produce personalized summarization. In addition to providing algorithms and theoretical analyses, we present extensive empirical evaluation of our approaches on several large-scale and real-world summarization problems. These include, but not limited to, summarizing 80 million Tiny Images, more than 45 million user visits from the Featured Tab of the Today Module on Yahoo! Front Page, finding dominating set in Friendster social network with more than 65 million nodes and 1.8 billion edges, and movie recommendation based on 20 million users’ ratings from 138,493 users of the MovieLens database. Show more

Permanent link

https://doi.org/10.3929/ethz-b-000217967Publication status

publishedExternal links

Search print copy at ETH Library
Publisher

ETH ZurichOrganisational unit

03908 - Krause, Andreas / Krause, Andreas
More

Show all metadata
ETH Bibliography

yes
Altmetrics