Causality in Unsupervised Learning: Methods and Applications in Cancer Genomics
Open access
Author
Date
2024
Type
Doctoral Thesis
ETH Bibliography
yes
Abstract
Unsupervised learning, one of the main branches of machine learning, seeks to discover hidden patterns in large volumes of unlabelled data. One of its prevalent techniques is clustering, which groups data into distinct subsets with shared characteristics. In cancer research, clustering offers promising avenues to stratify patients based on their unique genomic and clinical characteristics, which is crucial for developing personalised treatment strategies and improving prognostic evaluations. However, the increasing complexity of the acquired data presents several challenges for unsupervised learning, ranging from the integration of different data types to the prevalence of incomplete datasets. Furthermore, as these computational tools increasingly affect many areas of life, it is essential that they are developed with care to prevent discrimination against any protected groups.
This thesis presents novel methods that enhance the efficiency and fairness of unsupervised learning by exploiting causal knowledge about the data. The main contributions are detailed in three separate studies, each with its own methodological approach.
The first study introduces a novel network-based clustering method that enables the stratification of cancer patients based on their individual genomic and clinical characteristics. This approach leverages the causal relationships inherent in the data to effectively integrate genomic and clinical information. When applied to myeloid malignancies -- a group of aggressive cancers with overlapping genomic and clinical characteristics -- this method identified novel cancer subgroups that are highly predictive of survival and reveal distinct genomic and clinical patterns. This novel clustering approach sheds light on the interconnected landscape of the genomic and clinical features across myeloid malignancies and paves the way for improved patient stratification.
The second study presents a novel method for inferring the marginal probability in high-dimensional Bayesian networks for categorical variables. This task is essential for analysing incomplete datasets, which is a common challenge in cancer genomics. Using the graphical structures of Bayesian networks, this study shows how to integrate exact and approximate inference in order to efficiently compute the marginal probability in Bayesian networks. This advancement is essential for making the results from the first study applicable to a wide range of new datasets.
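To make the notion of a marginal probability in a categorical Bayesian network concrete, the following is a minimal sketch and not the method developed in the thesis. It assumes a hypothetical three-node chain A -> B -> C with binary variables and made-up probability tables, and it contrasts exact inference (summing the joint over the unobserved variables) with approximate inference (forward sampling). All names and numbers are illustrative assumptions.

```python
import itertools
import random

# Hypothetical toy network A -> B -> C with binary categorical variables.
# Every probability below is a made-up illustrative value.
P_A = {0: 0.7, 1: 0.3}
P_B_GIVEN_A = {0: {0: 0.9, 1: 0.1}, 1: {0: 0.4, 1: 0.6}}
P_C_GIVEN_B = {0: {0: 0.8, 1: 0.2}, 1: {0: 0.25, 1: 0.75}}


def joint(a, b, c):
    """Joint probability, factorised along the edges of the network."""
    return P_A[a] * P_B_GIVEN_A[a][b] * P_C_GIVEN_B[b][c]


def exact_marginal_c(c):
    """Exact P(C = c): sum the joint over all states of A and B."""
    return sum(joint(a, b, c) for a, b in itertools.product((0, 1), repeat=2))


def sample_categorical(dist, rng):
    """Draw one value from a {value: probability} table."""
    values = list(dist)
    weights = [dist[v] for v in values]
    return rng.choices(values, weights=weights, k=1)[0]


def approx_marginal_c(c, n_samples=20000, seed=0):
    """Approximate P(C = c) by forward sampling in topological order."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_samples):
        a = sample_categorical(P_A, rng)
        b = sample_categorical(P_B_GIVEN_A[a], rng)
        if sample_categorical(P_C_GIVEN_B[b], rng) == c:
            hits += 1
    return hits / n_samples
```

Exact enumeration grows exponentially with the number of variables that are summed out, whereas sampling scales to larger networks at the cost of Monte Carlo error. That trade-off is one motivation for combining exact and approximate inference. In this toy network the exact value of P(C = 1) is 0.3375, and the sampled estimate should land close to it.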
The final study provides a causal perspective on algorithmic fairness in unsupervised learning. Algorithmic fairness aims to identify and mitigate discriminatory biases in machine learning. This study presents a novel clustering approach that optimises causal notions of fairness, offering a more precise approach to fairness in unsupervised learning. In contrast to existing methods for fair clustering, this method models the causal structures that may underlie discriminatory patterns in the data. It also provides the flexibility to specify which causal notions of fairness should be mitigated. This method allows for a more granular alignment of unsupervised learning with ethical standards, legal requirements, and societal obligations.
The studies presented in this thesis are unified by adopting a causal perspective on complex computational challenges that arise in unsupervised learning. Applied to cancer genomics, the methods presented offer new pathways for patient stratification that are not only accurate but also unbiased. This progress points towards a future in which all patients, regardless of their background, receive accurate diagnosis and treatment.
Permanent link
https://doi.org/10.3929/ethz-b-000658760
Publication status
published
Contributors
Examiner: Beerenwinkel, Niko
Examiner: Kuipers, Jack
Examiner: Spang, Rainer
Examiner: Balabanov, Stefan
Publisher
ETH Zurich
Subject
Unsupervised learning; Clustering; Causality; Bayesian networks; Genomics; Cancer
Organisational unit
03790 - Beerenwinkel, Niko / Beerenwinkel, Niko