
Open access
Author
Date
2020
Type
Doctoral Thesis
ETH Bibliography
yes
Abstract
Industries ranging from healthcare and finance to government and e-commerce rely on datacenters to offer their revenue-generating services. However, efficiently operating a datacenter is a Sisyphean task: the sheer number and diversity of components and their complex, ever-evolving interactions turn correct and efficient operation into a challenge. Despite efforts to automate, many operational tasks still rely on operator expertise and human intervention to manually correlate events across fragmented traces and pinpoint faulty configurations in siloed datasets, each of which offers only a partial view of the infrastructure. Constant changes in offered load, upgrades and component failures mean that the scant telemetry signals available rapidly lose their ability to explain, measure or predict system behaviour.
We argue that a vital requirement for monitoring and diagnosing such systems is the ability to query up-to-date information about distinct events and connect these insights into coherent stories. In this dissertation, we present novel applications which fuse cross-layer traces with streaming dataflows in the context of three unexplored domains: real-time provenance tracking, sessionization and simulation. First, we present a general framework for interactively explaining the outputs of modern data-parallel computations, including iterative data analytics. Second, we present an online system that exploits the structural information in datacenter logs and maintains user sessions in real time at rates of several gigabits per second. Third, we propose using cross-layer information to build faithful in-memory representations of system state and illustrate the power of online "what-if" simulation as a technique to support management and planning decisions.
Recent developments mean that telemetry once envisaged for post-hoc debugging has been repurposed for resource attribution, system validation, workload modelling and more. These scenarios bolster the need for a principled foundation to continually store, manage and query diagnostic data. In this thesis we demonstrate how a programmable framework permits specialized debugging and profiling tools to be recast as reusable pipelines with simplified management and efficient execution.
Permanent link
https://doi.org/10.3929/ethz-b-000503615
Publication status
published
Publisher
ETH Zurich
Subject
data analytics; log analysis; data provenance; stream processing; Datacenter network; network simulation
Organisational unit
03757 - Roscoe, Timothy / Roscoe, Timothy