Intra-tumor heterogeneity and clonal exclusivity in renal cell carcinoma

Tumorigenesis is an evolutionary process in which different clones evolve over time. Interactions between clones can affect tumor evolution and hence disease progression and treatment outcome. We analyzed 178 tumor samples in 89 clear cell renal cell carcinoma patients and found high intra-tumor heterogeneity with 62% of mutations detected in only one of two biopsies per patient. We developed a novel statistical test to identify gene pairs that are altered in co-occurring clones of the same tumor, including the pairs TP53 and MUC16, as well as BAP1 and TP53. The mutations in these gene pairs are clonally exclusive meaning that they occurred in different branches of the tumor phylogeny, suggesting a synergistic effect between the two clones carrying these mutations. Our analysis sheds new light on tumor development and implies that clonal interactions are common within tumors, which may eventually open up novel treatment strategies to improve cancer treatment.


Introduction
Tumors generally consist of genetically and phenotypically distinct cancer cell populations that evolve over time through a process that involves mutation and selection (Nowell, 1976). The different cancer cell populations, also called clones, accumulate mutations, and those with a selective advantage will expand and possibly outcompete the other cancer cell populations (Nowell, 1976). This process of mutation and selection can create a diverse set of clones within the tumor, a phenomenon called intra-tumor heterogeneity (Beerenwinkel et al., 2016, Dexter et al., 1978, Heppner, 1984. Intra-tumor heterogeneity poses a challenge to the treatment of cancer, because the number of different somatic mutations may be undervalued when taking just a single biopsy from a solid tumor or if there are different clones in the metastases (Gerlinger et al., 2014, Beerenwinkel et al., 2016. Identifying subclonal mutations is important because it has been shown that even low-frequency clones can carry markers of poor prognosis (Gerlinger et al., 2014), and they drive the process of metastasis. It has also been observed that the number of clones in a neoplasm can predict the probability of progression cooperation is an important evolutionary force and it is crucial to elucidate its underlying mechanisms. Uncovering the principles of how cancer cell populations interact, will shed light on how to disrupt this process and potentially open up novel treatment strategies to improve personal cancer treatment. In order to investigate how clones co-exist and possibly interact, a systematic screening of subclonal mutation compositions is necessary. If a combination of two or more clones is beneficial, these clones will likely co-exist stably over time and not outcompete each other (Axelrod et al., 2006, Cleary et al., 2014. The analysis of subclonal mutations in several independent tumors allows for systematically assessing whether specific subclonal mutation combinations exist that occur at a higher rate than expected. Here, we investigate the intra-tumor heterogeneity and subclonal mutation composition of clear cell renal cell carcinoma (ccRCC). Clear cell RCC is the most common type of kidney cancer and is known to be difficult to treat, as it is, for instance, mostly resistant to chemotherapy when metastasized (Choueiri and Motzer, 2017, Cancer Genome Atlas Research, 2013, Penticuff and Kyprianou, 2015. Several studies revealed that ccRCC is a genetically very heterogeneous disease (Gerlinger et al., 2012, Gerstung et al., 2012. The analysis of 25 single cells from one ccRCC patient revealed a large extent of genetic heterogeneity between the different cancer cells of the same tumor (Xu et al., 2012). A study of multiple tumor samples from eight ccRCC cases demonstrated that the diversity within tumors is in some cases as high as the diversity between patients (Martinez et al., 2013). Gerlinger and colleagues analyzed the intra-tumor diversity of 10 ccRCC patients, and also found that most driver mutations were subclonal (Gerlinger et al., 2014). This finding prompts a systematic analysis of the combinations of subclonal mutations in co-occurring clones in ccRCC. For this purpose, we investigated a cohort of 89 ccRCC patients (Fig. 1A). In the initial discovery phase, we analyzed two spatially separated biopsies and a matched normal sample from each of 16 ccRCC patients to provide an overview of the diversity and to inform the selection of genes for the second in-depth follow-up analysis. In the second phase, we used the constructed gene panel to sequence 826 genes at high coverage in 178 paired tumor samples and 89 matched normal samples from 89 ccRCC patients. The two paired biopsies per tumor and the high coverage allow a detailed investigation of the subclonal mutation composition. We developed a customized statistical framework, called GeneAccord, to examine the combinations of clones that co-exist in these tumors. We detected several pairs of mutated genes that are mutually exclusive within the clones of tumors and referred to as clonally exclusive (Fig. 1B). Such subclonal mutations are co-occurring in the same tumor, but occur in different clones.
Several methods have been proposed that detect co-occurrence of mutated genes or pathways in cancer patients (Jiang et al., 2016, Zhang et al., 2014, Dao et al., 2017. For instance, Dao et al. (Dao et al., 2017) proposed a framework that detects groups of genes that are united by a certain property such as co-occurrence, mutual exclusivity or functional relatedness, while exhibiting a different characteristic between the groups of genes. However, the aforementioned approaches do not refer to the individual clones within a tumor. They only distinguish between mutations being present or absent in a certain patient. Therefore, they do not allow for a distinction between epistatic effects from having two pathways mutated in the same cell, or having two different pathways mutated in two different clones that may complement each other's abilities and possibly cooperate. Jiang and colleagues performed clonal analysis only for one sample and found that the mutations in the two co-occurring pathways are actually in the same clone (Jiang et al., 2016). In this work, we specifically seek to find mutations that are assigned to different clones in order to uncover novel principles of the mechanisms of clonal co-existence and potential cooperation. We present an approach that systematically assesses subclonal mutation patterns in a large cohort of patients.

Figure 1. (A)
Experimental design. The first phase includes 16 clear cell renal cell carcinoma (ccRCC) patients of which two spatially separated biopsies from the primary tumor and a matched normal sample were collected. Whole-exome sequencing and transcriptome sequencing was performed and the detected mutations informed the selection of genes for the panel of the second phase. The second phase includes an extended cohort of patients and the selected genes were targeted with higher coverage. From a total of 89 patients, we analyzed two spatially separated tumor biopsies and a matched normal sample per patient. Fourteen of the patients in this panel data set were also among the 16 from the first phase. (B) Intra-tumor heterogeneity and co-occurring tumor clones. Schematic representation of tumor development, co-existing clones, and their shared and private mutations. Left: The tumor initiates from a single cell that acquires a driver alteration (orange). A subsequent driver event (purple) leads to the spread of a new clone that eventually outcompetes the first one. Finally, the yellow and light blue driver alterations lead to clones that stably co-exist and possibly benefit from each other. The biopsies obtained consist of a heterogeneous collection of clones. The mutations show different variant allele frequencies (VAF). Right: The inference of clones from bulk sequencing data and their phylogenetic relationship uncovers the mutational history of the tumor. The mutation ordering can be represented in a mutation tree as depicted here. In particular, each mutation is assigned to one or more clones, depending on where in the tree it occurred. The yellow and light blue mutations have a mutually exclusive clone assignment, and hence are referred to as being clonally exclusive.

Results
We first provide an overview of the detected intra-tumor heterogeneity in the ccRCC cohort, and compare the findings to previous studies. Next, GeneAccord, a novel statistical test to detect co-occurring clones in tumors, is introduced and applied to the panel data set obtained from 89 ccRCC patients and also to the whole-exome data set in order to detect clonally exclusive pairs of pathways in these tumors.

Genetic and transcriptomic diversity of ccRCC
The analysis of intra-tumor heterogeneity and subclonal mutation patterns was performed on a cohort of a total of 89 ccRCC patients and it comprised of two phases (Fig. 1A). In an initial discovery phase, whole-exome and transcriptome sequencing data from paired tumor biopsies from 16 ccRCC patients plus one matched normal sample per patient were analyzed. The coverage of the whole-exome sequencing (WES) data was on average 85x. The mutations were called from the WES data set (Methods section), and between 29 to 130 single-nucleotide variants (SNVs) and insertions and deletions (indels) were detected per patient ( Fig. 2A). The fraction of mutations that was only detected in one of the two biopsies from the same tumor was on average 40%, which indicates high levels of intra-tumor genetic diversity. These mutations are referred to as private, whereas mutations detected in both tumor samples of a patient are called shared.
From the RNA-sequencing (RNA-seq) data, the differentially expressed genes were called (Methods section). We found an average of 6,364 genes per patient to be upregulated and 6,598 genes downregulated (Fig. 2B). On the transcriptomic level, the intra-tumor heterogeneity was slightly reduced with an average of 31% of differentially expressed genes being detected only in one of the two biopsies. Pathway overrepresentation analysis was performed with the set of differentially expressed genes using the Reactome pathway database (Fabregat et al., 2018). Among the most overrepresented pathways are many pathways related to translation, signal transduction and growth factors (Fig. 2C). For instance, the signaling pathways involving the growth factors PDGF, VEGF, SCF, or the growth factor receptor EGFR are deregulated in all patients. In ccRCC, the vascular endothelial growth factor (VEGF) is often activated which is important for angiogenesis, cell growth, and survival (Choueiri and Motzer, 2017). In fact, VEGFA, is upregulated in all patients of this data set. The most overrepresented pathways related to translation are highly overrepresented in patients 4, 15, and 16. They are all enriched only privately in one tumor sample of patient 3 and patient 15, indicating that these deregulated processes are subclonal in these tumors. In the second phase, the cohort was extended to 89 ccRCC patients. From each, two spatially separated biopsies of the primary tumor and a matched normal sample were collected. This data set includes 14 of the patients from the WES data set, and 75 additional patients. A selection of 826 genes was compiled, which was informed by the mutated genes detected in the WES data set, as well as from the frequently mutated genes in a large ccRCC study from The Cancer Genome Atlas Research Network (TCGA) (Cancer Genome Atlas Research, 2013). Using this customized gene panel, we sequenced the 178 tumor and 89 normal samples at high depth. The deep coverage enables detection of low-frequency mutations, and the larger cohort provides increased statistical power such that rare subclonal mutation patterns can be detected. A large cohort is especially important for the analysis of co-occurring mutated genes, because most genes are mutated in only a small fraction of patients (Garraway and Lander, 2013), and therefore the chance of the same pair of mutated genes occurring in multiple patients is generally small. The cohort includes two spatially separated tumor biopsies from each patient in order to facilitate more robust inference of clones and their phylogenetic relationships. The sequenced reads contain unique molecular identifiers (UMI), which allows for the correction of potential sequencing errors (Methods section). The coverage of the panel sequencing (panel seq) data set was on average 933x, and after UMI consensus building and read filtering it was 93x.
The number of SNVs as well as indels in the panel seq data set was on average 22 per patient (Fig. 3, top panel). Pairwise comparison of the two biopsies from the same tumor revealed that on average 62% of the mutations in a patient were private to one of the two samples. This is in line with the observations made in another study, in which the intra-tumor heterogeneity in four renal cell carcinoma patients was analyzed, and where between 63 and 69% of mutations were not detected in all samples from the same tumor (Gerlinger et al., 2012). A previous study of TCGA including 499 ccRCC showed that the gene VHL is hit by a mutation in 51% of ccRCC cases (Cancer Genome Atlas Research, 2013). Furthermore, chromatin remodeling genes such as PBRM1, SETD2 and BAP1 are, with frequencies of 36%, 13%, and 10%, often mutated in ccRCC (Cancer Genome Atlas Research, 2013). These four most commonly mutated genes in ccRCC are also among the most frequently mutated genes in our data set (Fig. 3, bottom panel), and the frequencies of VHL, SETD2, and BAP1 are with 54%, 12%, and 13% comparable to the ones from the TCGA study. The gene PBRM1 occurs in 19% of the patients in our cohort, which is less than the frequency reported from TCGA, which is 36% (Cancer Genome Atlas Research, 2013). Another ccRCC study by Sato et al. including 106 samples reported PBRM1 to occur in 29% of cases (Sato et al., 2013), which is closer to the frequency in the present cohort. The mutations in VHL are known to occur early in tumor development (Pena-Llopis et al., 2013), which is in line with our observation that in 39 of 48 cases, the VHL mutations are shared between both tumor biopsies of a patient.
Among the most frequently mutated genes in this data set are also the mucins MUC6, MUC16, and MUC3A. In the TCGA study, the mucins MUC4, and MUC16 were among the seven most frequently mutated genes (Cancer Genome Atlas Research, 2013). In a study of 67 ccRCCs, MUC16 was among the five most recurrently mutated genes (Arai et al., 2014), and Sato et al. found MUC16 to be the fourth most frequently mutated gene (Sato et al., 2013). Also, MUC4 was among the significantly mutated genes in their study (Sato et al., 2013). MUC4 is not in this gene panel, but the mutation frequencies of the mucins MUC6, MUC16, and MUC3A are with 42%, 38%, and 18% higher than those reported in TCGA (Cancer Genome Atlas Research, 2013). The majority of mutations in these genes in our cohort has a low variant allele frequency (VAF). Specifically, more than three quarters of the mutations in MUC6, MUC16, and MUC3A have a VAF below 10%, and almost half below 5%. We conjecture that the low frequency may be the reason why many mutations in these genes are missed in studies with lower read coverage, or only one biopsy per tumor. The analysis of 25 single cells from one ccRCC patient also showed that most mutations were only found in a small fraction of the tumor cells, suggesting that rare mutations may be missed with low read coverage in a bulk sample (Xu et al., 2012). Mucins may actually play an important role in ccRCC, as, for instance, several studies in the past years highlighted that the expression level of certain mucins is predictive of clinical outcome in ccRCC patients (Fu et al., 2016, NguyenHoang et al., 2016, Bai et al., 2015. In the present cohort, the survival data and the expression status of the three mucins MUC6, MUC16, and MUC3A is available for eleven patients, and a log-rank test was performed to assess potential correlations between survival and expression status of these three genes. MUC6 downregulation showed a significant correlation with decreased survival (p = 0.01; Log-rank test).
Previous studies of intra-tumor heterogeneity in ccRCC reported patterns of convergent phenotypic evolution in several genes including BAP1, SETD2, PBRM1, PIK3CA, PTEN, and KDM5C (Gerlinger et al., 2012, Gerlinger et al., 2014, where the genes were affected by multiple distinct mutations across the clones in the tumor. In this cohort, such a pattern of convergent phenotypic evolution can also be found. In patient 88, the gene PBRM1 is hit by different mutations in the two tumor samples. One tumor sample, TU1, has a frameshift deletion, and a missense mutation, while the other tumor sample, TU2, has a missense mutation at a different loci in PBRM1. Both subclonal missense mutations of PBRM1 are predicted to be deleterious according to the SIFT annotation (Sim et al., 2012). The heatmap highlights the mutations that were detected in the most frequently mutated genes. The four most frequently altered genes in ccRCC are VHL, PBRM1, BAP1, and SETD2 (Cancer Genome Atlas Research, 2013) and are highlighted in bold. If a gene was hit by multiple mutations that all have the same status (Shared, Private TU1, or Private TU2), the following ordering is applied to prioritize which color is shown in the heatmap, starting with the highest priority: Stop gained, Start gained, Frameshift indel, Inframe indel, Missense, Splice site, Five prime UTR, Three prime UTR, Synonymous. That means, if a gene has, e.g., a missense and a synonymous mutation, the missense mutation will be displayed in the heatmap. The variants were annotated with SnpEff (Cingolani et al., 2012b) (Supplementary Table S1). Mutations in non-coding regions are omitted from the heatmap.

GeneAccord algorithm
In order to systematically analyze subclonal mutation combinations in co-occurring clones, we developed a statistical framework called GeneAccord to identify pairs of mutated genes or pathways that co-occur in the tumor but in different clones. That is, the pairs of mutated genes or pathways are mutated in the same patient, but their subclonal mutation profiles are mutually exclusive. The underlying rationale is that if two clones co-exist in a tumor and cooperate, for example, by sharing diffusible factors, they may have acquired a complimentary set of mutations to benefit each other, namely, one clone has mutations that the other one does not have, and vice versa. We refer to this situation as clonal exclusivity. More precisely, two mutated genes A and B, are called clonally exclusive, if one clone has a mutation in gene A, but no mutation in gene B, and another clone of the same tumor has a mutation in gene B, but no mutation in gene A (Fig. 1B).
Such a mutually exclusive pattern within the clone assignment of a tumor's set of mutated genes may occur at random, and therefore our statistical framework assesses whether a pattern of clonal exclusivity occurs more often than expected by chance alone across a cohort of patients. It makes use of a likelihood ratio test to compare the number of observed clonal exclusivities of a gene pair of interest across the cohort to the background distribution of the expected frequencies of clonal exclusivity ( Supplementary Fig. S1, Methods section). The null hypothesis is that the gene pair of interest exhibits the pattern of clonal exclusivity as often as expected by chance when assuming independent occurrence of mutations. The alternative hypothesis is that the pattern occurs at a rate greater than random chance. The magnitude and type of this difference in the rates is quantified with a parameter ∆ (Methods section). A positive parameter ∆ indicates that the pair tends to be mutated in different clones more often than expected, whereas if it is negative, the gene pair tends to co-occur more often together in the same clone. All pairs with a positive parameter ∆ are possible candidates for being significantly clonally exclusive. For those pairs the p-values of testing the alternative delta>0 versus the null delta=0 are computed. Multiple testing correction is done using the Benjamini-Hochberg procedure (Benjamini and Hochberg, 1995). If a pair is significantly clonally exclusive, it suggests that this specific clone configuration may confer a selective advantage, possibly through cooperation between the clones. GeneAccord is implemented as an R package available at https://github.com/cbg-ethz/GeneAccord and submitted to Bioconductor.
The input data of the algorithm are the mutated gene-to-clone assignments from a cohort of cancer patients, which were obtained by running phylogenetic tree inference methods. Reconstructing the evolutionary history of a tumor and detecting the clones is challenging (Beerenwinkel et al., 2015). For non-deterministic algorithms, repeated tree inference runs may lead to slightly different mutation-to-clone assignments. Therefore, our algorithm was designed to allow as input multiple gene-to-clone assignments per patient. They may have been generated by repeated tree reconstruction runs, or by sampling from the posterior distribution of trees. The tree inference methods designate the mutations to individual clones. The mutations can then be mapped to genes or pathways using existing pathway databases. Hence our statistical framework can be applied on the gene level, or on the pathway level to detect clonally exclusive pairs of pathways.

Clonal exclusivity in 89 ccRCC patients
The program Cloe (Marass et al., 2016) was applied to infer the phylogenetic history of each tumor and to obtain the mutated gene-to-clone assignments. The input to Cloe are mutations in copy number neutral regions. After filtering out mutations that are in potential copy number regions (Methods section), there were on average 17 mutations per patient. We performed 20 repeated runs of Cloe with different seeds in order to incorporate uncertainty in tree inference. The reconstruction of the phylogenetic history of each tumor maps each mutation to one or several clones. Mutations were mapped to genes and only those mutated genes were kept that were annotated to have a likely effect, that is, for instance, synonymous or intronic mutations were filtered out (Methods section). As a result, a total of 82 patients were left for which the mutated gene-to-clone assignments contained between two and 68 mutated genes per patient ( Supplementary Fig. S2A). The average rates of clonal exclusivity, reflecting how often the clonal exclusivity pattern occurs in general, were computed for each patient (Methods section). The rates were between 0.0 and 0.675 ( Supplementary Fig. S2B), and patients had between two to five clones. The distribution of the test statistic under the null hypothesis was computed ( Supplementary Fig. S3, Methods section). A total of 362 mutated genes were among the gene-to-clone assignments. Across all patients, there were 6,270 pairs of genes, most of which were mutated in just one patient. A total of 325 pairs of genes were mutated together in two or more patients. The parameter ∆ was computed for these 325 pairs, and 16 pairs had a positive parameter ∆, which were tested for significance of clonal exclusivity.
A total of five pairs were significant at the 5% level after correction for multiple testing (Supplementary Table S2), including the gene pairs {MUC16, TP53} (adjusted p-value = 0.01), and {BAP1, TP53} (adjusted p-value = 0.04). The most significant gene pair, MUC16 and TP53, is clonally exclusive in both patients in which it is mutated, providing strong evidence for clonal exclusivity. The other top gene pairs are also mutated in two patients, but clonally exclusive only in one of them, while co-occurring in the same clone in the other patient. Therefore, they should be interpreted more carefully: The patients in which they are clonally co-occurring all have an average rate of clonal exclusivity of zero, indicating that the inferred phylogeny did not have a single branching. Hence, also the two mutated genes from a gene pair are in the same branch, and therefore clonally co-occurring in these patients. In such cases, it is also insightful to examine the variant allele frequencies (VAF) as explained below.
The most striking gene pair consists of MUC16 and TP53, which is clonally exclusive in patients 5 and 81 (Fig. 4A). Both mutated genes were assigned to different branches of the phylogenetic tree. This is underlined by considering the VAF of the mutations in these two genes. The VAF of a mutation is computed as the fraction of reads that contain the mutation out of all reads covering the locus of the mutation. Assuming copy number neutrality, this quantity reflects the number of cells in the biopsy that contain the mutation. For instance, a VAF of around 50% indicates that the mutation may be heterozygous in all cells of the biopsy, or homozygous in half of the cells. Subclonal mutations generally have a lower VAF, since those mutations only occur in a subset of the cancer cells. The VAF of cancer mutations inform the clonal decomposition of the tumor. The mutations in MUC16 and TP53 were assigned to different clones, because their VAFs change into different directions between the two samples ( Fig. 4B). For instance, one tumor sample of patient 5 with VAFMUC16 = 3.5% and VAFTP53 = 0% has a higher VAF in MUC16, while the other tumor sample with VAFMUC16 = 0% and VAFTP53 = 14.7% shows a higher VAF in TP53.
The second most significant pair is BAP1 and TP53, which is clonally exclusive in one of two patients (Fig. 4C). More precisely, it is clonally exclusive in patient 84, and co-occurring in patient 30. The latter one has an average rate of clonal exclusivity of zero, which means that the inferred phylogenetic tree was always linear in all repeated runs of tree inference. In general, the tree inference favors parsimonious trees (Marass et al., 2016). If there is not enough evidence for a branching of the phylogenetic tree, the inferred tree will be linear. For patient 30, the change in the VAF between the two tumor samples is different for the two mutations in BAP1 and TP53 (Fig. 4D), supporting that they are actually in different clones. However, the coverage of the BAP1 mutation locus is only 25x in the second tumor sample. The average coverage in the two samples of patient 30 is with 75x and 67x lower than the average coverage across all samples in general (93x). Additionally, patient 30 had only six mutations for the tree inference, and hence there may not be enough evidence to infer a tree with a branching that separates the two genes. A larger number of biopsies of the tumor would allow to infer the clonal composition with more certainty. BAP1 and TP53 were found to be mutated in co-existing clones of a ccRCC case before (Gerlinger et al., 2014).
The two more speculative clonally exclusive gene pairs {MUC16, TP53} and {BAP1, TP53} are examined in more detail in the discussion section, including possible interpretations on how the clones with these mutations could be synergistic. The change in the VAF between the two samples is different for the two genes, which is an indication that they are in different clones. (C) BAP1 and TP53 are mutated in patients 30 and 84. They are clonally exclusive in patient 84 as indicated by the heatmap and phylogenetic tree. Patient 30 has an average rate of clonal exclusivity of 0, which reflects a linear tree, and hence no genes can be in different clones. (D) The VAF in the paired samples of patients 30 and 84. The change in the VAF between the two samples for the genes BAP1 and TP53 is different in both patients. The inferred tree is linear in patient 30, which places the two genes BAP1 and TP53 in the same clone. However, there is a difference in the VAF of BAP1 and TP53 also in the two samples of patient 30, which could reflect that they are in fact in different clones.

Pathway-level clonal exclusivity in 16 ccRCC patients
The WES data from paired tumor biopsies and matched normal samples of 16 ccRCC patients enables a clone pattern analysis on the pathway level. That is, the mutated genes can be mapped to pathways, and we can detect pathway pairs that are affected in several patients. The pathway-level analysis is feasible since, with the whole exome, all genes are covered and it is possible to determine whether a pathway is affected or not, unlike for the panel, where only a subset of genes is targeted. The mutations detected in the WES data set (Methods section) were assigned to clones with Cloe (Marass et al., 2016), mapped to genes, and subsequently, the genes were mapped to pathways using the Reactome pathway database (Fabregat et al., 2018). This procedure resulted in a total of 877 affected pathways. Across all patients, there were 71,422 pairs of pathways, most of which were affected in just one patient. A total of 5,433 pairs of pathways were simultaneously affected in two or more patients. Among these, when applying GeneAccord, 211 pairs had a parameter ∆ that was greater zero and were tested for significance.
The two most striking clonally exclusive pathway pairs are {"Major pathway of rRNA processing in the nucleolus and cytosol" (referred to as pathway 1), "O-glycosylation of TSR domaincontaining proteins" (pathway 2)}, as well as {pathway 1, "Defective B3GALTL causes Petersplus syndrome (PpS)" (pathway 3)}, which are clonally exclusive in both patients in which they are affected (Fig. 5A, 5B, Supplementary Table S3). They are clonally exclusive in patient 8 and patient 14. Both patients have a very low average rate of clonal exclusivity of 0.012 and 0.008, respectively, on the pathway level, and hence this clone pattern is highly significant (p < 10 -5 ). Pathway 1 belongs to the category "Metabolism of RNA", while pathway 2 falls into the class "Metabolism of proteins", and pathway 3 is a disease pathway related to diseases of glycosylation (Fabregat et al., 2018). Pathway 1 was also significantly enriched among the differentially expressed genes in 13 of 16 patients of this cohort. The altered metabolism of RNA appears indeed to be important in ccRCC: Recently, gene expression analysis revealed that deregulation of genes in the RNA metabolic process was a significant and distinctive feature in ccRCC when compared to other subtypes of RCC (Ricketts et al., 2018). Also, mRNA processing was among the significantly affected pathways in a study of 240 ccRCC (Sato et al., 2013). The pathway pairs {1, 2} and {1, 3} are affected in the two patients in a different subset of genes (Fig. 5A). Hence this clonal exclusivity pattern is only detectable on the pathway level. Potential synergies between the two clones with these affected pathways are discussed below. The data set also includes RNA-seq data from each sample. Among the differentially expressed genes in each sample, the pathways are significantly overrepresented in some of the samples.

Discussion
GeneAccord provides a statistical framework for detecting pairs of genes or pathways that are more often mutated in different clones of the same tumor than expected by chance. There are several possible biological explanations for such a mutational pattern. The two clones can have complementing phenotypes and cooperate or benefit from each other, for instance, by sharing of diffusible factors. Another possibility is that the two genes of a clonally exclusive pair are synthetically lethal and that both genes being mutated in the same cell would lead to a disadvantageous phenotype. Lastly, the two clones may also be the result of parallel convergent evolution where both clones exhibit the same phenotype by different mutations. In order to gain a better understanding of possible reasons for the clonally exclusive pattern, it is important to examine the biological functions of these genes and pathways in more detail.

Clonal exclusivity of MUC16 and TP53
TP53 is a well-known tumor suppressor and plays a role in growth arrest, apoptosis, and DNA repair (The UniProt, 2017). According to the COSMIC database, TP53 was found to be mutated in ccRCC in 6% of 1,483 ccRCC cases (Forbes et al., 2017). TP53 was among the significantly mutated genes in several ccRCC studies (Sato et al., 2013, Cancer Genome Atlas Research, 2013, Chen et al., 2016, and was associated with reduced survival (Ricketts et al., 2018, Girgin et al., 2001, which is confirmed in the cohort presented here (p = 0.01; Log-rank test). The p53 signaling pathway in general was also reported to be significantly affected in ccRCC (Sato et al., 2013). A previous study of the intra-tumor diversity in ten ccRCC cases revealed that TP53 was one of the most extreme examples of genes being detected more often when sequencing multiple biopsies per tumor instead of one biopsy (Gerlinger et al., 2014). This observation suggests that TP53 mutations may be a crucial subclonal event in ccRCC. Studies with RCC cell lines showed that p53 was repressed (Ku et al., 2013) and not functional, even when the gene TP53 was not mutated (Gurova et al., 2004).
MUC16 is also called CA 125 and encodes a transmembrane glycoprotein Swanson, 2004, Lucarelli et al., 2014). It is mutated in 8% and 14% of ccRCC cases in the TCGA study, and the 106 samples from Sato et al., respectively (Cancer Genome Atlas Research, 2013, Sato et al., 2013. MUC16 is upregulated in lung cancer (Lakshmanan et al., 2017), pancreatic cancer (Haridas et al., 2011) and ovarian cancer, where it is used as a biomarker (Felder et al., 2014) and plays a crucial role in promoting tumorigenesis and metastases (Theriault et al., 2011). Moreover, several RCC studies highlighted that increased levels of CA 125 were associated with poor prognosis and more advanced stage, suggesting the potential use of this serum as a biomarker also in RCC (Grankvist et al., 1997, Lucarelli et al., 2014, Bamias et al., 2003. Lakshmanan and colleagues recently demonstrated that the overexpression of MUC16 is associated with increased tumor cell growth, cancer cell migration, and resistance to cytotoxic drugs in lung cancer (Lakshmanan et al., 2017). Furthermore, they discovered that MUC16 activates the signaling proteins JAK2, STAT3, and glucocorticoid receptor (GR), which in turn upregulate TSPYL5 (Lakshmanan et al., 2017). TSPYL5 was previously reported to be a suppressor of TP53 (Epping et al., 2011). Hence Lakshmanan et al. suggested that MUC16 exerts its chemoresistant, anti-apoptotic, and metastatic function by downregulating TP53 and its target genes via the signaling cascade of JAK2, STAT3, GR, and TSPYL5 (Lakshmanan et al., 2017). In light of this fact, the tumors with MUC16 and TP53 mutated clones may have undergone convergent evolution where both clones developed their own mechanism of shutting down TP53, one with a mutation in TP53 directly, and the other one via a mutation in MUC16 that targets TP53 indirectly via other signaling genes. This explanation is in line with the fact that they were found to be mutually exclusive in breast cancer and endometrial carcinoma (Dao et al., 2017).
However, mutual exclusivity is not necessarily an indication that two genes are functionally redundant. This was exemplified by a study in which BAP1 and PBRM1 were found to be mutually exclusive in ccRCC (Pena-Llopis et al., 2013), and these genes seem to be involved in different processes: The tumors with either mutation were reported to have different gene expression signatures, and the patients showed different survival rates (Pena-Llopis et al., 2013). Therefore, alternative interpretations for the clonally exclusive pattern of MUC16 and TP53 are possible. MUC16 is a transmembrane protein, and might act as a cell-surface receptor that transfers external signals into the cell (Hollingsworth and Swanson, 2004). This could influence various processes such as cell-adhesion, which is important for metastasis and invasion (Hollingsworth and Swanson, 2004). Furthermore, MUC16 activates the signaling proteins JAK2 and STAT3 (Lakshmanan et al., 2017), which in turn are known to orchestrate angiogenesis by upregulating VEGF and basic fibroblast growth factor (bFGF) (Zhao et al., 2011). On the other hand, TP53 is involved in various signaling pathways such as Wnt signaling, MAPK signaling, and signaling by Interleukins (Kanehisa andGoto, 2000, Fabregat et al., 2018). Moreover, TP53 is known to interact with growth factors and in particular to be a regulator of several growth factors and growth factor receptors (Asschert et al., 1998). Hence, if the MUC16 clone is engaged in metastasis, invasion, and angiogenesis, while the TP53 clone promotes growth factors, the two clones could be complementary.

Clonal exclusivity of BAP1 and TP53
BAP1 is a tumor suppressor that is involved in DNA repair and among the most frequently mutated genes in ccRCC (Pena-Llopis et al., 2013, Cancer Genome Atlas Research, 2013. It is a histone deubiquitinase and associated with poor survival prognosis (Cancer Genome Atlas Research, 2013). In this cohort, BAP1 mutations were also found to be associated with highest tumor grade (p < 0.01; Fisher's exact test) and decreased survival (p = 0.06; Log-rank test). BAP1 and TP53 were found to be mutually exclusive in ccRCC (Bi et al., 2016). According to the gene ontology resource, TP53 and BAP1 both perform chromatin and protein binding, and both are associated to biological processes such as "negative regulation of cell proliferation", "regulation of cell cycle", and "protein localization to mitochondrion" (The Gene Ontology, 2017, Ashburner et al., 2000). Both proteins are members of the Reactome pathways "Deubiquitination" and "DNA double strand break response" (Fabregat et al., 2018). Following the same reasoning as before, the clonally exclusive pattern may also have other reasons than similar functions. TP53 and BAP1 actually share interacting partners, such as BRCA1 , Nishikawa et al., 2009) and BARD1 (Irminger-Finger et al., 2001, Nishikawa et al., 2009, which could imply an indirect signaling connection between the two genes and possibly between the clones. Gerlinger and colleagues also found TP53 and BAP1 to be mutated in coexisting clones in a ccRCC case and hypothesized that it represents parallel and divergent evolution within the tumor (Gerlinger et al., 2014).

Clonally exclusive pathway pairs
The two most striking pathway pairs include "Major pathway of rRNA processing in the nucleolus and cytosol" (pathway 1), which is clonally exclusive with "O-glycosylation of TSR domain-containing proteins" (pathway 2) and "Defective B3GALTL causes Peters-plus syndrome (PpS)" (pathway 3), respectively (Fig. 5A). The pathway pairs {1, 2} and {1, 3} are affected in the two patients in a different subset of genes (Fig. 5A), namely WDR18 and THSD4 in patient 8, and DDX49 and ADAMTS14 in patient 14. ADAMTS14 belongs to the ADAMTS protein family, which are secreted zinc metalloproteases that play a role in the extracellular matrix related to angiogenesis and cancer (Apte, 2009). THSD4 is also referred to as ADAMTS-like protein 6 and is also secreted to the extracellular matrix (The UniProt, 2017). Both, ADAMTS14 and THSD4 contain the thrombospondin type 1 repeat (TSR) domain (Matthews et al., 2009). The proteins with TSR domains can undergo O-fucosylation, a protein modification that plays a role in angiogenesis and Notch signaling (Adams and Tucker, 2000, Matthews et al., 2009, Moremen et al., 2012. The mutated genes in the pathways 2 and 3, ADAMTS14 and THSD4, are both secreted to the extracellular matrix, and therefore the clones with the affected pathways 2 and 3 may have an effect on other clones in the tumor. Whether there is a potential crosstalk between the clones with the pathway pairs {1, 2} and {1, 3} should be verified experimentally. To gain additional insight, it is useful to consider the RNA-seq data of these biopsies. Aside from inferring the mutation-to-clone assignment, the Cloe software also estimates the fractions of the clones in each sample (Fig. 5C). This is important in order to interpret possible changes on the transcriptomic level in the bulk samples. The RNA-seq data from the bulk samples allows to detect differential gene expression within all cells of the biopsy combined. Hence there is no clonal assignment of the transcriptomic changes. However, the clonal fractions estimated by Cloe hint towards the clones that are most abundant in the bulk sample. Among the differentially expressed genes in patients 8 and 14, the two most striking clonally exclusive pathways pairs are also enriched in some of the samples (Fig. 5D). The pathway 1 was mapped to clones 1 and 2 in patient 8 (Fig. 5B), which together have a clonal faction of 47.5% and 30.7% in samples TU1, and TU2, respectively (Fig. 5C). The pathway enrichment analysis shows that this pathway is also highly overrepresented in these two samples on the transcriptomic level (Fig. 5D). The pathways 2 and 3 were assigned to clone 3, which is, with 16.6%, most abundant in sample TU2 of patient 08 (Fig. 5C). Indeed, these pathways are also enriched among the differentially expressed genes in this sample (Fig. 5D). Pertaining to patient 14, no enrichment of pathway 1 can be found in either sample (Fig. 5D). Pathway 1 was assigned to clone 2, which was estimated to have a clonal fraction of only 8.6% and 0.2% in samples TU1 and TU2, respectively (Fig. 5C), which may explain why there is no signal detectable in the bulk transcriptome samples for this pathway. Pathways 2 and 3, however, were assigned to clone 3 (Fig. 5B), which has an estimated clonal fraction of 0.5% and 12.8% in samples TU1 and TU2, respectively. In both bulk RNA samples, pathways 2 and 3 are enriched. The fact that pathways 2 and 3 are also enriched in sample TU1 despite the small clonal fraction of 0.5% in this bulk sample shows that clone 3, which contains the affected pathways 2 and 3, may have an effect also on the other clones in this tumor.
In summary, we analyzed the intra-tumor heterogeneity in a cohort comprising a total of 178 tumor biopsies from 89 ccRCC patients. The investigation of the WES and RNA-seq data of 16 ccRCC patients revealed that the intra-tumor heterogeneity is very pronounced on the genetic level, while the transcriptomic level tends to show less heterogeneity. Yet still 31% of differentially expressed genes were detected in only one of the two biopsies from a tumor, and some pathways, especially related to translation, were only enriched subclonally in some patients. The extended cohort of 178 tumor biopsies from 89 ccRCC patients enabled a more detailed analysis. The high degree of intra-tumor heterogeneity commonly observed in ccRCC was also found in this cohort, where 62% of mutations were only detected in one of the two biopsies from a tumor. The four most frequently mutated genes in ccRCC, namely VHL, PBRM1, SETD2, and BAP1, were also among the most frequently mutated genes in this cohort. We found a pattern of convergent phenotypic evolution in the gene PBRM1, which was hit by two different deleterious missense mutations in each of the two biopsies of a patient. Furthermore, many mutations were identified in the mucins MUC6, MUC16, and MUC3A, which may indicate that these genes also play an important role in ccRCC development.
We have also introduced a novel approach to systematically analyze the subclonal mutation patterns in this large cohort of tumor patients. While many cancer-related mutually exclusive gene pairs have been previously identified across different tumors, finding such pairs within a single tumor has been elusive. Our statistical framework, GeneAccord, reveals pairs of mutated genes in co-existing clones. The co-existence in the same tumor suggests the possibility of a selective advantage of the combined clones possibly through cooperative interactions between them. A definite proof of such interactions will require future experimental studies in vitro or in vivo. Here, we have developed and applied a new computational approach to identify such gene (or pathway) pairs that are unlikely to be generated by random chance alone. As such, the GeneAccord method will also be useful in finding candidate cooperative tumor clones in other cancer patient cohorts.
Using our approach, the analysis of subclonal mutation patterns in the 89 ccRCC cases revealed several promising pairs of co-existing clones that harbor mutations in known cancer related genes. The most striking pairs of mutated genes in co-existing clones include {MUC16, TP53} and {BAP1, TP53}. Moreover, we applied the clonal exclusivity test to 16 ccRCC patients on the pathway level, and identified co-existing clones with affected pathways related to the metabolism of proteins and RNA. The identification of co-existing clones and their drivers may inform the design of combination treatments that target all clones within a tumor and will potentially improve treatment outcome.
Given the observation from a previous analysis of ten ccRCC that more biopsies lead to the detection of more subclonal driver alterations in almost all cases (Gerlinger et al., 2014), we emphasize that our analysis probably still underestimates the genetic diversity of ccRCC. We would obtain a clonal decomposition at an even higher resolution if more than two biopsies were included per patient. In this sense, the intra-tumor genetic heterogeneity and clonal exclusivity described in this study, most likely still represents only an improved lower bound of the true amount present in ccRCC.
A possible extension of GeneAccord would be to directly work with the posterior distribution of the inferred tree given the observed data, rather than taking as input a fixed collection of trees per patient. In that case, GeneAccord would need to be directly combined with the tree inference method. Another extension would be to allow also the detection of multiple genes rather than just pairs. That way, groups of genes could be identified that occur together in the same clone but are clonally exclusive with another group of genes. However, since most genes are only mutated in a small fraction of patients, this would require a very large cohort size.
Pertaining to the applicability of our framework, we note that the mutation-to-clone assignment, which is the input to our method, depends on the variant allele frequency, and therefore it is important to have sufficient depth of read coverage in the samples as well as ideally several samples per tumor. Also, a large cohort of patients is advantageous to identify rare clonal compositions. Such data sets are currently not abundant, but with decreasing sequencing costs (Muir et al., 2016), the cohorts of sequenced cancer patients will also increase in size, as well as breadth and depth of the sequenced regions, and number of biopsies per patient. This development will likely soon enable whole-exome or whole-genome sequencing also in large cohorts which will allow for simultaneous pathway and gene pair analyses on the same cohorts.
Additionally, the methods to infer the clonal genotypes of tumors improve constantly. For instance, with the advent of single-cell sequencing technologies (Wang and Navin, 2015), and methods to infer the phylogeny from single-cell data , the mutation-toclone assignment will be even more reliable. In fact, combined single-cell sequencing of the whole exome and the transcriptome will even allow to assign differentially expressed genes to specific clones. This would enable a GeneAccord analysis on the transcriptomic level. For our algorithm, it does not matter from which omics layer the alterations come, it is only crucial to have a clonal assignment of all altered players. Therefore, with single-cell data, one could then also perform a combined analysis using various omics layers including differentially expressed genes, as well as copy number and epigenetic changes. This would allow obtaining a more holistic portrait of the subclonal alteration profiles and potentially reveal more synergies between clones that could inform the design of future treatments.
Finally, we reported a systematic analysis of subclonal mutation patterns in a large cohort of cancer patients including 178 ccRCC biopsies and revealed promising pairs of co-existing clones. In general, our statistical framework opens up a novel avenue for systematically analyzing the clone constellations in large cohort of patients and to pinpoint pairs of genes that are altered in co-existing clones. The analyses this approach enables will contribute towards a better understanding of the evolutionary forces beyond mutation and selection that drive tumor evolution and will help to improve treatment strategies to eradicate the tumor as a whole with all its clones and key players.

Clonal exclusivity test for mutated genes in different clones
We developed a statistical framework to find clonally exclusively altered gene or pathway pairs in cancer clones across patients. A gene or pathway pair is referred to as clonally exclusive in a tumor, if both members are mutated, but not in the same clone, i.e. the respective mutations occurred in different branches of the tumor phylogenetic tree. In the example of Figure 1B, the yellow and light blue mutations are assigned to a disjoint set of clones. Let us assume the yellow mutation occurs in gene A, the light blue mutation in gene B, and the dark blue mutation in gene C. The gene pair { , } is clonally exclusive, and so is { , }, but not { , }. This pattern can of course also occur by chance. To decide whether two genes are clonally exclusive at a higher rate than expected, we developed the clonal exclusivity test, which is implemented as an R package called GeneAccord.
The required input data is a table with altered genes or pathways and their clone assignments in a binary fashion from a cohort of cancer patients (Supplementary Fig. S1A). For each patient, one may run the phylogenetic tree inference several times or take multiple samples from the posterior tree distribution, as this step involves uncertainty. Genes that were only assigned to clones in a subset of the collection of tree inferences of a patient may be excluded from further analysis to reduce the noise in the gene-to-clone assignments. We use the phylogenetic tree inference tool Cloe (Marass et al., 2016) and run it 20 different times with different seeds in order to account for uncertainty. Genes that are assigned to clones in less than 90% of the tree inference runs were removed for the GeneAccord analysis. After having computed one or several tree inferences for all patients, the following two steps are carried out for each patient in the cohort separate: First, the number of gene pairs that are mutated in different branches of the tree, i.e., that are clonally exclusive, is computed for each tree in the collection of tree inferences. Two genes being assigned to different branches of the phylogenetic tree is equivalent to them having a mutually exclusive clone mutation profile (Fig.  1B). We refer to = # of gene pairs on different branch # of all gene pairs as the rate of clonal exclusivity. This quantity depends on the tree structure and the number of clones in total, and hence may be different for each patient and for each tree. Let be the mean of the clonal exclusivity rates across all trees from the collection for a patient . Hence ,{ , } contains the average rate of the patient, which is weighted by the fraction of trees, in which the specific pair was clonally exclusive. Therefore this overall rate takes into account the gene pair's clone constellation across the collection of trees. After the rates and the values for all pairs and all patients have been generated, the entire cohort is considered to compute the null distribution of the test statistic by decoupling gene pair labels between patients. The test statistic is the likelihood ratio of the null hypothesis and the alternative hypothesis. The null hypothesis 0 assumes independence of mutations, and therefore states that the gene pair of interest occurs across the cohort in the clonal exclusivity pattern as often as expected according to the overall rates . The likelihood under the null hypothesis for a pair { , } is defined as Where the product runs over all patients in which the pair is mutated. The alternative hypothesis 1 is that the gene pair is clonally exclusive at a rate which is different from the rates . The alternative likelihood for the pair { , } can be written as where the logit transformation is used to ensure that the rates are always between zero and one.
The parameter ∆ quantifies the deviation from the expected rate for the pair of interest. It is estimated from the data using maximum likelihood. The test statistic is defined as .
In order to obtain the expected distribution of the test statistic under the null hypothesis, its empirical cumulative distribution function (ECDF) is built using the following procedure. For each number of patients that a pair can be mutated in, the test statistic under the null hypothesis is computed 100,000 times. To compute the test statistic under the null hypothesis, patients from the cohort are sampled. The sampling is random but proportional to the number of pairs in each patient. From each of these patients, a gene pair is randomly chosen from which the quantity ,{ , } ,{ , } as well as the average clonal exclusivity rate is taken. The rate is distorted by a beta distribution with mean , and = + = 1000, in order to obtain a smooth distribution function. Next, the overall rate of clonal exclusivity ,{ , } is computed. From the different rates from the patients, the test statistic is computed. The distribution of the test statistic under the null hypothesis is computed separately for each number of patients that a pair can be mutated in. This is owing to the fact that the null distribution of the test statistic may be different depending on in how many patients a gene pair is mutated (Supplementary Fig. S3).
Next, the observed test statistic of a gene pair of interest can be compared to the distribution of the test statistic under the null hypothesis to obtain a p-value. That is, the p-value of an observed test statistic for a gene pair of interest is defined as where is the ECDF of the test statistic under the null hypothesis which was previously computed. We only test if a gene pair is mutated in more than one patient, since we do not have enough power otherwise. When comparing an observed test statistic to the null distribution, the number of patients in which the pair is mutated determines which of the previously generated null distributions is used.
We assessed the p-value distribution under the null hypothesis in order to verify that the pvalues are uniformly distributed. To this end, we generated 5,000 simulated pairs by using the same strategy as described above for the generation of the ECDF. This was done for different numbers of patients that the pairs are mutated in ( = 2, 3, … , 7). The p-values are distributed uniformly (Supplementary Fig. S4).
The sign of the parameter ∆ in equation (1) signifies the type of the clone pattern for a pair. That is, if the parameter ∆ is greater zero, the pair tends to be clonally exclusive more often than the expected rate. Conversely, if the parameter ∆ is less than zero, the pair tends to occur more often together in the same clone. Here, we are interested in pairs with a clonally exclusive mutation pattern, and hence we only test pairs where the parameter ∆ is greater zero.

Data generation and analysis
We applied the clonal exclusivity test to analyze two data sets from a cohort of ccRCC patients. The tumors of these ccRCC patients have been classified according to the 2016 WHO classification (Moch et al., 2016).
The first data set encompasses two spatially separated primary tumor biopsies and one matched normal sample from each of sixteen clear cell renal cell carcinoma (ccRCC) patients. The whole exome was sequenced using the Illumina HiSeq 2000 system to obtain 101-base paired-end reads. The computational pipeline to analyze the data was a customized version of the NGS-Pipe framework  that included the following steps: Adapter clipping and trimming of low-quality bases with Trimmomatic (Bolger et al., 2014), alignment of the reads to the human reference genome version hg19 using bwa (Li and Durbin, 2009), as well as read processing with SAMtools (Li et al., 2009b), Picard tools (Picard), and bamreadcount. Reads were realigned locally around indels, and base qualities were recalibrated with the Genome Analysis Toolkit (GATK) (Van der Auwera et al., 2013). The single-nucleotide variants (SNVs) were called using the rank-combination  of deepSNV (Gerstung et al., 2012), JointSNVMix2 (Roth et al., 2012), MuTect (Cibulskis et al., 2013), SiNVICT (Kockan et al., 2017), Strelka (Saunders et al., 2012), and VarScan2 (Koboldt et al., 2012). The rank-combination is a method that combines the results of different variant callers by integrating the ranked lists of variants to generate a combined ranking . P-values of deepSNV were corrected for multiple testing with the R package IHW (Ignatiadis et al., 2016). Indels were called using SiNVICT (Kockan et al., 2017), Strelka (Saunders et al., 2012), VarDict (Lai et al., 2016), and VarScan2 (Koboldt et al., 2012), and combining them with the rank-combination . For copy number variant detection, the tool Sequenza (Favero et al., 2015) was employed. Mutations in copy number neutral regions were selected as input for Cloe (Marass et al., 2016), a tool to reconstruct the evolutionary history of a tumor and to assign the mutations to different clones. In order to account for the uncertainty in the phylogenetic tree inference, Cloe was run 20 times with different seeds.
Additionally, for the 48 samples from the initial 16 ccRCC, paired-end and single-end RNAsequencing was performed on the Illumina HiSeq 2000 system to generate 101-base pairedend reads, and 51-base single-end reads for each sample. For the computational analysis, the NGS-Pipe framework was adapted . Reads were clipped and trimmed using Trimmomatic (Bolger et al., 2014), and alignment was performed with STAR (Dobin et al., 2013). Read counts for the genes were obtained with the program featureCounts (Liao et al., 2014). Differential gene expression analysis was done using DESeq2 (Love et al., 2014). The R package WebGestaltR (Wang et al., 2017) was applied to perform enrichment analysis.
The second data set is a panel sequencing data set. It comprises an extended cohort of patients from which a selected set of genes was sequenced at higher depth. The selection of 826 genes was informed by the mutated genes detected in the WES data set, as well as from the frequently mutated ccRCC genes in TCGA (Cancer Genome Atlas Research, 2013). We generated panel seq data from 89 ccRCC patients, including two spatially separated primary tumor biopsies, and a matched normal sample per patient. This data set comprises 14 of the patients from the WES data set, and 75 additional patients. The data was sequenced using the Illumina HiSeq 2500 system. The sequenced reads contain unique molecular identifiers (UMI), and this allows for the correction of potential sequencing errors. Reads with identical UMIs, which are mapped to the same genomic position, come from the same DNA molecule, and therefore, the consensus sequence can be built, and the variants can be called with higher confidence.
The pipeline for analyzing the sequencing data was again a customized version of the NGS-Pipe framework  including the following steps: Raw reads were clipped and trimmed using the tool SeqPurge (Sturm et al., 2016). Reads were aligned to the human reference genome version hg19 with the aligner bwa (Li and Durbin, 2009). Reads were further processed using SAMtools (Li et al., 2009a), Picard tools (Picard) and bam-readcount. Local realignment around indels was done with GATK (Van der Auwera et al., 2013). We used the software UMI-tools (Smith et al., 2017) to group reads with identical UMI and identical mapping position together, and an in-house tool to build the consensus sequence and thereby correct sequencing errors. Our in-house tool takes the grouped reads with identical UMI and identical mapping position and attempts to generate the consensus sequence from these grouped reads. If the reads contain contradicting bases at a nucleotide position, it is masked with the base 'N'. The SNV and indel calling was similar as for the WES data set. Some of our samples are from formalin fixed paraffin embedded (FFPE) material. FFPE samples are known to harbor artificial C>T and G>A alterations (Wong et al., 2014, Oh et al., 2015. They occur mostly at lower frequencies in the range between 1-10% variant allele frequency (VAF), since the DNA damage occurs at different genomic positions in different cells (Wong et al., 2014, Yost et al., 2012. To remove potential FFPE artifacts, we filtered out C>T and G>A mutations that had a VAF < 10%. The tool Cloe (Marass et al., 2016) for the tree inference requires as input mutations in copy number neutral regions. In order to filter out mutations that are in potential copy number variant regions, mutations that are within 4000 bps of an imbalanced heterozygous germline SNP were filtered out. An imbalanced heterozygous germline SNP is a SNP that has VAF between 40-60% in the normal sample, but in the tumor sample the VAF is out of these bounds, indicating a potential copy number change.
In the WES and panel seq data sets, in order to perform quality control we used the programs Qualimap (Okonechnikov et al., 2016) and FastQC (Andrews, 2010). For annotation of the variants, SnpSift (Cingolani et al., 2012a) and SnpEff (Cingolani et al., 2012b) as well as the data bases COSMIC (Forbes et al., 2016) version 80, and dbSNP (Sherry et al., 2001) version 138 were used. For the subsequent GeneAccord analysis, only genes with a potential impact were kept. While mutations such as synonymous or intronic variants are informative for the tree inference, they were not of interest for the clonal exclusivity test. More precisely, the annotation program SnpEff classifies variants into four categories based on the potential impact of the mutation (Cingolani et al., 2012b). These are, in descending order of importance, the categories 'HIGH', 'MODERATE', 'LOW', and 'MODIFIER'. Examples of the category 'HIGH' would be frameshift indel. Missense mutations and inframe indels are classified as 'MODERATE'. The category 'LOW' includes synonymous and splice region mutations. Variants that are annotated as 'upstream', 'intronic', or 'UTR region' fall into the class 'MODIFIER'. For the GeneAccord analysis, we keep mutations that are in the category 'HIGH', 'MODERATE', and from the class 'LOW' we keep all variants with the exception of: synonymous variants, or mutations that are annotated as the case where a start codon mutates into another start codon, or analogous for stop codon. To sum up, we keep variants such as missense, frameshift or inframe indel or variants in splice regions, but filter out variants that are synonymous, intronic or in the UTR regions.

Availability of data and material
The generated read data from the clear cell renal cell carcinoma samples and the matched normal samples have been deposited in the European Nucleotide Archive under accession numbers ERP108328 and ERP108326. GeneAccord is implemented as an R package available at https://github.com/cbg-ethz/GeneAccord and is submitted to Bioconductor.

Ethics approval
This study was approved by the cantonal commission of ethics of Zurich (KEK-ZH-nos. 2013-0629 and 2014-0604). The retrospective use of sequencing data obtained from normal and tumor tissues of 91 ccRCC patients is in accordance with the Swiss Law ("Humanforschungsgesetz"), which, according to Article 34, allows the use of biomaterial and patient data for research purposes without informed consent under certain conditions that include the present cases. Law abidance of this study was reviewed and approved by the ethics commission of the Canton Zurich.

Competing Interests
The authors declare that they have no competing interests.

Supplementary Material
Supplementary Figure S1. GeneAccord algorithm. Schematic overview of the procedure of GeneAccord to detect clonally exclusive pairs of mutated genes. (A) The input data are the mutated gene-clone assignments from a cohort of cancer patients and from a collection of trees to take into account uncertainty in the phylogenetic tree inference. The first step is to compute the overall rates of clonal exclusivity for all gene pairs for each patient separately. These rates reflect the expected prevalence of the clonal exclusivity pattern in each patient and for each gene or pathway pair. From these, the distribution of the test statistic under the null hypothesis of the likelihood ratio test is computed. For each gene pair, the parameter ∆ is then estimated using maximum likelihood. (B) A positive parameter ∆ indicates that the pair tends to be mutated in different clones more often than expected. We filter for such pairs and test them for significance. Figure S2. (A) Numbers of mutated genes per patient that were used as input for GeneAccord. The mutations were assigned to clones through Cloe, and were mapped to genes afterwards. For GeneAccord, only mutations with a potential impact were considered, that is, for instance, synonymous or intronic mutations were filtered out (Methods section). Seven patients had only zero or one gene-to-clone assignment (patients 03, 34, 51, 53, 58, 65, 87), and therefore the GeneAccord analysis was done with the remaining 82 patients. (B) Average rates of clonal exclusivity per patient. The average rate and number of clones with at least one nonsynonymous mutation are computed from all 20 trees. The average number of clones ranges from 2.0 to 4.95. The rate of clonal exclusivity of a patient is the fraction of gene pairs that are in different branches of the phylogenetic tree. For instance, if the rate is zero, it means that the tree was linear without any branches, and that all genes assigned to the clones are in the same lineage, as is the case for patient 30. The variants were annotated to genes with the program SnpEff (Cingolani et al., 2012b). It concatenates annotations with '&' in case several effects are true for the same mutation and transcript. The same gene can be hit by several mutations, or the same mutation can have several annotations for different transcripts. Therefore, in order to visualize the mutations in a heatmap like in Fig. 3, an order of importance needed to be defined, which is the following starting with highest priority: 'Start lost', 'Stop lost', 'Stop gained', 'Start gained', 'Frameshift indel', 'Inframe indel', 'Missense', 'Splice site', 'Protein interaction loci', '5 prime UTR', '3 prime UTR', 'Synonymous', 'Sequence feature'. This means that, if a gene was hit by two mutations, where one is synonymous and the other one is a missense mutation, only the missense mutation will be shown in the heatmap. The 16 gene pairs from the panel sequencing data set whose estimate of the parameter delta is greater zero. The column 'Mutated in (rate)' lists the patients in which the pair was mutated in. The average rate of clonal exclusivity of the respective patient is shown in parenthesis. The column 'Clonally exclusive in' lists the patients in which the pair was also clonally exclusive in the majority of the trees. The ten most striking pathway pairs from the WES data set whose estimate of the parameter delta is greater zero. The column 'Affected in (rate)' lists the patients in which the pair was affected in. The average rate of clonal exclusivity of the respective patient is shown in parenthesis. The column 'Clonally exclusive in' lists the patients in which the pair was also clonally exclusive in the majority of the tree inference runs. In case a gene could not be mapped to a pathway, as is the case for PRDM12 and MUC2, the gene identifiers were retained.