Beyond reproducibility: Knocking on sustainability's door



Date

2022

Publication Type

Doctoral Thesis

ETH Bibliography

yes

Abstract

This thesis presents three independent studies carried out as part of the author's doctoral studies in the Computational Biology Group at the Department of Biosystems Science and Engineering at ETH Zurich in Basel. The projects deal with the development of statistical methods for detecting pathway dysregulations and with the processing and analysis of next-generation sequencing data, with a particular focus on benchmarking the methods' performance in a sustainable way. The first two studies start from the observation that cancer is a heterogeneous disease in which the same phenotype can arise from different mutational patterns, and they propose novel methods for computing pathway enrichments. The first study takes a causal approach and computes edge-specific pathway dysregulations, while the second computes global pathway dysregulation scores and accounts for term-term relations. Both studies include an extensive benchmark workflow that evaluates performance on synthetic and real data sets and runs exploratory analyses. The third study describes the development of a pipeline for the analysis of viral high-throughput sequencing data and an extensive benchmark of global haplotype reconstruction methods.

The dissertation is organized as follows. The first chapter provides an overview of workflow management systems that can be used to create reproducible benchmarking workflows, comments on the distinction between reproducible and sustainable data science, and discusses their relevance in cancer genomics and virology. The second chapter presents dce, a computational method for the edge-specific detection of pathway dysregulations using a causal framework. The third chapter presents pareg, a regression-based method that addresses the issue of large and redundant pathway databases by incorporating term-term relations into the enrichment computation; it does so by adding regularization terms to the loss function of a generalized linear model. The fourth chapter presents a scalable, reproducible and transparent pipeline for the analysis of viral sequencing data as well as a benchmark of global haplotype reconstruction methods. The fifth chapter concludes the thesis by summarizing its findings and suggesting potential future directions.

Publication status

published

Contributors

Examiner : Bühlmann, Peter
Examiner : Uhler, Caroline

Publisher

ETH Zurich

Organisational unit

03790 - Beerenwinkel, Niko / Beerenwinkel, Niko
