
Open access
Author
Date
2019
Type
- Doctoral Thesis
ETH Bibliography
yes
Altmetrics
Abstract
Technological advances in storing and querying data have given rise to a society driven by digital information and communication technologies. The ever-growing interest in information and the advantages of cloud computing have motivated companies and individuals to move their data into cloud storage services. As a result, cloud providers now maintain a large share of the "big data" coming from various sources.
The fact that cloud providers use the same infrastructure for maintaining multiple tenants' data leads to advantages ranging from economies of scale to determining workload-specific optimizations. However, it also poses new challenges as applications' traditional assumptions now have to be adapted to cope with this new scenario. This dissertation analyzes the challenges of dealing with large shared data repositories deployed in the cloud. Moreover, we propose improvements regarding data quality and efficiency of query processing for these shared repositories.
The first part of this dissertation describes how to incorporate an important feature that traditional databases provide but that is currently missing from cloud-based data repositories, namely integrity constraints. These constraints matter because they help avoid data inconsistencies and provide guarantees about the quality of data. More specifically, we study how to support multiple integrity constraints in a single (logical) data repository that is shared among many different applications, e.g., a cloud-based data lake. Additionally, we describe the trade-offs of building a system that supports multiple integrity constraints and evaluate the proposed solution against the traditional approaches for supporting integrity constraints in data warehousing.
The second part of this dissertation studies how to optimize data movement when processing entire query workloads that analyze data in a cloud-based data lake. To accomplish this, we apply shared-workload techniques to group multiple queries accessing common relations, rewrite them, and execute them as a single query. This part of the dissertation also presents an evaluation of our proposed optimizations when applied to query-as-a-service systems.
Finally, the last part of this dissertation analyzes how to leverage the large parallelism available through serverless computing for querying cloud-based data repositories in a performant yet cost-effective manner. We show that serverless computing can be used for data analytics on cold data efficiently and cost-effectively while maintaining interactive response times.
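As a rough illustration of the shared-workload idea mentioned in the second part (grouping queries that access a common relation and executing them together), the following minimal Python sketch scans each relation once and evaluates every grouped query's filter on each row. It is an in-memory toy under stated assumptions, not the system built in the dissertation: the `Query` type, the `group_by_relation` and `run_shared` helpers, and the sample `orders` table are all hypothetical names introduced here.

```python
from collections import defaultdict
from typing import Callable, Dict, List, NamedTuple, Tuple

Row = Dict[str, object]


class Query(NamedTuple):
    """Hypothetical single-relation filter query (illustration only)."""
    name: str                          # identifier used to route results
    relation: str                      # relation the query scans
    predicate: Callable[[Row], bool]   # filter applied to each row


def group_by_relation(queries: List[Query]) -> Dict[str, List[Query]]:
    """Group queries that access the same relation."""
    groups: Dict[str, List[Query]] = defaultdict(list)
    for q in queries:
        groups[q.relation].append(q)
    return dict(groups)


def run_shared(
    queries: List[Query], tables: Dict[str, List[Row]]
) -> Tuple[Dict[str, List[Row]], int]:
    """Scan each relation once and evaluate all grouped predicates per row.

    Returns the per-query results and the number of relation scans performed.
    """
    results: Dict[str, List[Row]] = {q.name: [] for q in queries}
    scans = 0
    for relation, group in group_by_relation(queries).items():
        scans += 1  # one pass over the data, shared by the whole group
        for row in tables[relation]:
            for q in group:
                if q.predicate(row):
                    results[q.name].append(row)
    return results, scans


# Assumed sample data: two queries over the same relation share one scan.
tables = {
    "orders": [
        {"id": 1, "amount": 50},
        {"id": 2, "amount": 150},
        {"id": 3, "amount": 300},
    ]
}
queries = [
    Query("small", "orders", lambda r: r["amount"] < 100),
    Query("large", "orders", lambda r: r["amount"] >= 150),
]
results, scans = run_shared(queries, tables)
```

Run independently, the two queries would each read `orders` in full. Grouped, they read it once, which is the kind of data-movement saving the abstract refers to. A real query-as-a-service setting would express this as a query rewrite rather than a Python loop.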
Permanent link
https://doi.org/10.3929/ethz-b-000428429
Publication status
published
External links
Search print copy at ETH Library
Contributors
Examiner: Alonso, Gustavo
Examiner: Zhang, Ce
Examiner: Kossmann, Donald
Examiner: McSherry, Frank
Publisher
ETH Zurich
Subject
DISTRIBUTED DATABASES (COMPUTER SYSTEMS); DATABASES + DATABASE MANAGEMENT SYSTEMS (SOFTWARE PRODUCTS); Cloud Computing
Organisational unit
02150 - Dep. Informatik / Dep. of Computer Science
03506 - Alonso, Gustavo / Alonso, Gustavo