Producing building blocks for data analytics


Date

2019-09

Publication Type

Master Thesis

ETH Bibliography

yes

Abstract

The ever-increasing diversity of data analytics and AI applications has had a tremendous impact on the number of tools developed over the past few years. The developers of these tools rarely spend much time thinking about which building blocks lie at their core. As a result, they sometimes end up producing many slightly different versions of the same code fragments. They could reduce their implementation effort by designing reusable and recomposable building blocks and then simply orchestrating them in a different order across execution plans. In this thesis, we study the level of granularity of these building blocks. We start with a state-of-the-art, high-performance distributed hash join and split it into smaller operators that each have a single functionality. We explore different levels of granularity and study their impact on reusability and performance. Our proposed granularity level yields operators that are reusable and have almost no performance overhead. We present a variety of use cases in which these operators can be applied in modern ML and data analytics scenarios. When built from these operators, the original join algorithm has similar performance and is even faster in some cases.
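
As a rough illustration of the idea in the abstract, the sketch below splits a simple in-memory hash join into two single-purpose operators and reuses one of them in a second plan. It is not the thesis's implementation: the operator names (build_hash_table, probe, hash_join, group_count) are hypothetical, the chosen granularity is only one possibility, and the code is single-process rather than distributed.

```python
from collections import defaultdict
from typing import Callable, Dict, Hashable, Iterable, Iterator, List, Tuple

Row = Tuple  # assumption: a record is a plain tuple in this sketch
KeyFn = Callable[[Row], Hashable]


def build_hash_table(rows: Iterable[Row], key: KeyFn) -> Dict[Hashable, List[Row]]:
    """Single-purpose operator: bucket rows by their key."""
    table: Dict[Hashable, List[Row]] = defaultdict(list)
    for row in rows:
        table[key(row)].append(row)
    return table


def probe(table: Dict[Hashable, List[Row]], rows: Iterable[Row],
          key: KeyFn) -> Iterator[Tuple[Row, Row]]:
    """Single-purpose operator: look up each row and emit matching pairs."""
    for row in rows:
        # .get avoids inserting empty buckets into the defaultdict
        for match in table.get(key(row), []):
            yield match, row


def hash_join(left: Iterable[Row], right: Iterable[Row],
              left_key: KeyFn, right_key: KeyFn) -> List[Tuple[Row, Row]]:
    """Plan 1: a join expressed as build followed by probe."""
    return list(probe(build_hash_table(left, left_key), right, right_key))


def group_count(rows: Iterable[Row], key: KeyFn) -> Dict[Hashable, int]:
    """Plan 2: the same build operator, reused for an aggregation."""
    return {k: len(bucket) for k, bucket in build_hash_table(rows, key).items()}
```

Here the build step appears unchanged in both plans, which is the kind of reuse the abstract describes. Finer-grained splits (for example, separating partitioning from table construction) would follow the same pattern.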

Publication status

published

Contributors

Examiner: Alonso, Gustavo
Examiner: Müller, Ingo
Examiner: Marroquín, Renato

Publisher

ETH Zurich

Organisational unit

03506 - Alonso, Gustavo / Alonso, Gustavo
