Representation Learning for Dimensionality Reduction, Irregularly-Sampled Sequences and Graphs


Author / Producer

Date

2023

Publication Type

Doctoral Thesis

ETH Bibliography

yes

Abstract

Machine learning has the potential to revolutionize biology and healthcare by providing new tools that help scientists conduct research and help clinicians choose the right treatment for patients. However, while recent approaches in representation learning give the impression of being universal black-box solutions to all problems, research has shown that this is not generally true. Even though models can perform well in a black-box fashion, they often generalize poorly and are sensitive to distribution shifts. This highlights the need for approaches that are informed by their downstream application and tailored to incorporate symmetries of the problem into the model architecture. These inductive biases are essential for performance on new data and for models to remain robust when the data distribution changes. Nevertheless, constructing good models is only half of the solution. To ensure that models translate well into clinical applications, they also need to be evaluated appropriately with this goal in mind. In this thesis, I address the above points while taking a detailed look at structured data types present at the intersection of biology, medicine, and machine learning. In terms of algorithmic contributions, I first present a new non-linear dimensionality reduction algorithm that aims to preserve multi-scale relations. The falling cost of genome sequencing and the ability to sequence individual cells have led to exponential growth in high-dimensional data in the life sciences. Such data cannot be understood intuitively, which makes dimensionality reduction approaches that can capture the nested relationships present in biology essential. Second, I develop methods for clinical applications where irregularly-sampled data are present. Conventional machine learning models require either the conversion of such data into fixed-size representations or the imputation of missing values prior to their application.
I present two approaches tailored for irregularly-sampled data that do not require such preprocessing steps. The first is a new kernel for peaks derived from MALDI-TOF spectra, whereas the second is a deep learning model that can be applied to irregularly-sampled time series by phrasing them as sets of observations. Third, I present an extension to graph neural networks that allows the models to account for global information instead of requiring nodes to exchange information only with their neighbors. Graphs are an important data structure for pharmacology, as they are often used to represent small molecules. To address the appropriate evaluation of such models, I present a detailed study of medical time series models, with a focus on their capability to transfer to other datasets in the context of an early sepsis prediction task. Further, I show that the conventional approach for evaluating graph generative models is highly sensitive to the selection of hyperparameters, which can lead to biased performance estimates. In summary, my thesis addresses many problems at the intersection of machine learning, healthcare, and biology. It demonstrates how models can be improved by including more domain-specific knowledge and what to pay attention to when evaluating them.

Publication status

published

Contributors

Examiner : Borgwardt, Karsten M.
Examiner : Vogt, Julia
Examiner : Müller, Christian L.

Publisher

ETH Zurich

Subject

Machine Learning; Dimensionality reduction; Time Series; Graphs; Healthcare

Organisational unit

09486 - Borgwardt, Karsten M. (former)
