Embargoed until 2021-11-02
- Doctoral Thesis
Background. Biology is constantly being revolutionised by the rapid emergence of technologies that provide ever larger and more diverse datasets. The availability of these large datasets in turn enables the discovery of biological mechanisms and the development of new fields such as personalised medicine. Analysing these large datasets remains challenging, however, because of their size and diversity and because of the complexity of the underlying biological mechanisms. Unravelling these mechanisms requires the development of new data analysis methods, drawn from domains such as pattern mining or machine learning. Among the various challenges and questions that arise from biological data, a core problem is how to handle biological interactions. Biological interactions are extremely diverse and appear indispensable in studies of molecular or macroscopic phenotypes. The binding of transcription factors to DNA sequences is an example of a core physical interaction, while indirect interactions also exist, such as proteins operating in the same disease pathway. Due to the diversity of interaction types, a large number of models for interactions have been proposed over the years. In this thesis, we examine several ways to model such interactions in two types of datasets, together with problems closely related to these dataset types.
Contributions. We focused on two dataset types, genome-wide association studies and large sequence-function datasets, to explore the potential of modelling interactions for better understanding and prediction of biological mechanisms. In the first chapter of this thesis, we focus on applications to genome-wide association study (GWAS) data, namely finding groups of genetic variants whose interaction would be responsible for a phenotype of interest.
The relevance of this application lies in the fact that a group of genetic variants may be responsible for a phenotype while none of its subgroups alters the phenotype. Additionally, GWAS datasets are typically confounded, as their samples can have different origins or differ in covariates such as age or height. Performing association testing in confounded datasets without adequate correction is risky, as it can result in many spurious associations. Therefore, algorithms that account for interactions can be widely applicable to GWAS datasets only if they are able to correct for covariates. In the first chapter of the thesis, we present two algorithms that are able to find statistically significant interactions of genetic variants in the presence of a categorical covariate. Two types of interactions are studied: first, all higher-order interactions, whose number scales exponentially with the number of genetic variants and therefore generates computational and statistical challenges; and second, all contiguous genomic regions potentially at the origin of genetic heterogeneity. In the second chapter of this thesis, we focus on applications to functional genomics, in particular on function prediction of DNA regulatory sequences in bacteria. Being able to accurately predict the function of regulatory sequences is highly relevant in fields such as synthetic biology or bioengineering. To this end, we build a deep learning model, trained on a large-scale sequence-function dataset, to accurately predict the functions of the regulatory sequences of interest. We additionally provide reliable uncertainty estimates for the predicted values in order to understand which predictions the model is confident about, so that the corresponding sequences can be used in downstream biological tasks.
Finally, we compare several interpretability methods and show that the model is able to detect sequence determinants and to measure their position-dependent influence.
Conclusion. We show that the methods introduced in these two chapters are able to leverage non-linear interactions to improve feature selection or prediction performance, respectively. We also provide a software package and a web server in order to contribute openly to the community's efforts and advances. The concepts and models presented in this thesis could be extended further, either to weaken assumptions, to incorporate domain knowledge, or to tackle related but different problems of similarly crucial importance, such as data integration or molecular design. We believe that the recent advances in machine learning, bioinformatics and biology hold great promise for the years to come.
Subject: machine learning; data mining; statistics; computational biology; deep learning
Organisational unit: 09486 - Borgwardt, Karsten M. / Borgwardt, Karsten M.
155913 - Significant Pattern Mining (SNF)