Ingo Müller


Last Name

Müller

First Name

Ingo

Organisational unit

Publications 1 - 10 of 19
  • Müller, Ingo; Marroquín, Renato; Koutsoukos, Dimitrios; et al. (2020)
    Proceedings of the 16th International Workshop on Data Management on New Hardware, DaMoN '20
    Getting the best performance from the ever-increasing number of hardware platforms has been a recurring challenge for data processing systems. In recent years, the advent of data science with its increasingly numerous and complex types of analytics has made this challenge even more difficult. In practice, system designers are overwhelmed by the number of combinations and typically implement a single analytics type on one platform, leading to repeated implementation effort and a plethora of semi-compatible tools for data scientists. In this paper, we propose the "Collection Virtual Machine" (or CVM), an extensible compiler framework designed to keep the specialization process of data analytics systems tractable. It can capture at the same time the essence of a large span of low-level, hardware-specific implementation techniques as well as high-level operations of different types of analyses. At its core lies a language for defining nested, collection-oriented intermediate representations (IRs). Frontends produce programs in their IR flavors defined in that language, which get optimized through a series of rewritings (possibly changing the IR flavor multiple times) until the program is finally expressed in an IR of platform-specific operators. While reducing the overall implementation effort, this also improves the interoperability of both analyses and hardware platforms. We have used CVM successfully to build specialized backends for platforms as diverse as multi-core CPUs, RDMA clusters, and serverless computing infrastructure in the cloud and expect similar results for many more frontends and hardware platforms in the near future.
  • Müller, Ingo; Marroquín, Renato; Alonso, Gustavo (2020)
    The massive, instantaneous parallelism of serverless functions has created a lot of excitement for interactive batch applications. However, due to fundamental limitations of current offerings, there was no consensus yet as to whether or not this architecture is suitable for data processing. We present Lambada, a data analytics framework designed for serverless functions that overcomes the current limitations and simultaneously achieves better performance and lower cost than commercial alternatives.
  • Graur, Dan; Müller, Ingo; Proffitt, Mason; et al. (2021)
    Proceedings of the VLDB Endowment
    In the domain of high-energy physics (HEP), query languages in general and SQL in particular have found limited acceptance. This is surprising since HEP data analysis matches the SQL model well: the data is fully structured and queried using mostly standard operators. To gain insights on why this is the case, we perform a comprehensive analysis of six diverse, general-purpose data processing platforms using an HEP benchmark. The result of the evaluation is an interesting and rather complex picture of existing solutions: Their query languages vary greatly in how natural and concise HEP query patterns can be expressed. Furthermore, most of them are also between one and two orders of magnitude slower than the domain-specific system used by particle physicists today. These observations suggest that, while database systems and their query languages are in principle viable tools for HEP, significant work remains to make them relevant to HEP researchers.
  • Müller, Ingo; Marroquín, Renato; Alonso, Gustavo (2020)
    SIGMOD '20: Proceedings of the 2020 ACM SIGMOD International Conference on Management of Data
    Serverless computing has recently attracted a lot of attention from research and industry due to its promise of ultimate elasticity and operational simplicity. However, there is no consensus yet on whether or not the approach is suitable for data processing. In this paper, we present Lambada, a serverless distributed data processing framework designed to explore how to perform data analytics on serverless computing. In our analysis, supported with extensive experiments, we show in which scenarios serverless makes sense from an economic and performance perspective. We address several important technical questions that need to be solved to support data analytics and present examples from several domains where serverless offers a cost and performance advantage over existing solutions.
  • Graur, Dan; Müller, Ingo; Proffitt, Mason; et al. (2025)
    The VLDB Journal
    Nested data is valuable and ubiquitous. It is being generated in ever-increasing volumes across industrial and research environments and frequently contains valuable information that is extracted through analytical workloads. Despite its popularity and value, there is no clear-cut understanding of the status quo in analytical workloads for nested data in high-energy physics (HEP). In this paper, we seek to define the landscape of nested data processing in HEP by evaluating 10 systems and their query languages on the IRIS HEP ADL benchmark, a popular and representative HEP benchmark. We attempt not only to understand how well these systems perform from a query latency and scalability point of view but also from a query language usability perspective. The result of our evaluation paints an interesting and rather complex picture of existing solutions. Many of the evaluated systems are between one and two orders of magnitude slower than the domain-specific system used in HEP today, while a few of the commodity systems provide on-par performance at greater costs. Moreover, the evaluated query languages and dialects vary greatly in how naturally and concisely they can express nested query patterns. These observations suggest that while commodity data management systems and their query languages are viable tools for nested data processing, significant work remains to make them competitive with domain-specific solutions like those used by the HEP community.
  • Marroquín, Renato; Müller, Ingo; Makreshanski, Darko; et al. (2018)
    SoCC '18: Proceedings of the ACM Symposium on Cloud Computing
    Cloud-based data analysis is nowadays common practice because of the lower system management overhead as well as the pay-as-you-go pricing model. The pricing model, however, is not always suitable for query processing as heavy use results in high costs. For example, in query-as-a-service systems, where users are charged per processed byte, collections of queries accessing the same data frequently can become expensive. The problem is compounded by the limited options for the user to optimize query execution when using declarative interfaces such as SQL. In this paper, we show how, without modifying existing systems and without the involvement of the cloud provider, it is possible to significantly reduce the overhead, and hence the cost, of query-as-a-service systems. Our approach is based on query rewriting so that multiple concurrent queries are combined into a single query. Our experiments show the aggregated amount of work done by the shared execution is smaller than in a query-at-a-time approach. Since queries are charged per byte processed, the cost of executing a group of queries is often the same as executing a single one of them. As an example, we demonstrate how the shared execution of the TPC-H benchmark is up to 100x and 16x cheaper in Amazon Athena and BigQuery than using a query-at-a-time approach while achieving a higher throughput.
  • Barthels, Claude; Müller, Ingo; Schneider, Timo; et al. (2017)
    Proceedings of the VLDB Endowment
    Traditional database operators such as joins are relevant not only in the context of database engines but also as a building block in many computational and machine learning algorithms. With the advent of big data, there is an increasing demand for efficient join algorithms that can scale with the input data size and the available hardware resources. In this paper, we explore the implementation of distributed join algorithms in systems with several thousand cores connected by a low-latency network as used in high performance computing systems or data centers. We compare radix hash join to sort-merge join algorithms and discuss their implementation at this scale. In the paper, we explain how to use MPI to implement joins, show the impact and advantages of RDMA, discuss the importance of network scheduling, and study the relative performance of sorting vs. hashing. The experimental results show that the algorithms we present scale well with the number of cores, reaching a throughput of 48.7 billion input tuples per second on 4,096 cores.
  • Müller, Ingo; Arteaga, Andrea; Hoefler, Torsten; et al. (2018)
    2018 IEEE 34th International Conference on Data Engineering (ICDE)
  • Graur, Dan; Müller, Ingo; Proffitt, Mason; et al. (2023)
    Journal of Physics: Conference Series
    In the domain of high-energy physics (HEP), general-purpose query languages have found little adoption in analysis. This is surprising regarding SQL-based systems, as HEP data analysis matches SQL's processing model well: the data is fully structured and makes use of predominantly standard operators. To better understand the situation, we select six general-purpose query engines, from both the SQL and NoSQL domain, and analyze their performance, scalability, and usability in HEP analysis, employing standard HEP tools as baseline. We also identify a set of core language features needed to support HEP data analysis. Our results reveal an interesting and complex picture: several query languages provide a rich and natural query development experience, while others fall short. In terms of performance, our results reveal that many of the database systems are one or two orders of magnitude slower than the standard HEP analysis tools, while others manage to scale and perform well. These conclusions suggest that while the existing data processing systems are viable candidates for HEP analysis, there is still work to be done to improve their performance and ability to succinctly express HEP queries.
  • Koutsoukos, Dimitrios; Müller, Ingo; Marroquín, Renato; et al. (2021)
    Proceedings of the VLDB Endowment
    The enormous quantity of data produced every day together with advances in data analytics has led to a proliferation of data management and analysis systems. Typically, these systems are built around highly specialized monolithic operators optimized for the underlying hardware. While effective in the short term, such an approach makes the operators cumbersome to port and adapt, which is increasingly required due to the speed at which algorithms and hardware evolve. To address this limitation, we present Modularis, an execution layer for data analytics based on sub-operators, i.e., composable building blocks resembling traditional database operators but at a finer granularity. To demonstrate the feasibility and advantages of our approach, we use Modularis to build a distributed query processing system supporting relational queries running on an RDMA cluster, a serverless cloud platform, and a smart storage engine. Modularis requires minimal code changes to execute queries across these three diverse hardware platforms, showing that the sub-operator approach reduces the amount and complexity of the code to maintain. In fact, changes in the platform affect only those sub-operators that depend on the underlying hardware (in our use cases, mainly the sub-operators related to network communication). We show the end-to-end performance of Modularis by comparing it with a framework for SQL processing (Presto), a commercial cluster database (SingleStore), as well as Query-as-a-Service systems (Athena, BigQuery). Modularis outperforms all these systems, proving that the design and architectural advantages of a modular design can be achieved without degrading performance. We also compare Modularis with a hand-optimized implementation of a join for RDMA clusters. We show that Modularis has the advantage of being easily extensible to a wider range of join variants and group by queries, all of which are not supported in the hand-tuned join.