Systematic detection of functional proteoform groups from bottom-up proteomic datasets

To a large extent functional diversity in cells is achieved by the expansion of molecular complexity beyond that of the coding genome. Various processes create multiple distinct but related proteins per coding gene – so-called proteoforms – that expand the functional capacity of a cell. Evaluating proteoforms from classical bottom-up proteomics datasets, where peptides instead of intact proteoforms are measured, has remained difficult. Here we present COPF, a tool for COrrelation-based functional ProteoForm assessment in bottom-up proteomics data. It leverages the concept of peptide correlation analysis to systematically assign peptides to co-varying proteoform groups. We show applications of COPF to protein complex co-fractionation data as well as to more typical protein abundance vs. sample data matrices, demonstrating the systematic detection of assembly- and tissue-specific proteoform groups, respectively, in either dataset. We envision that the presented approach lays the foundation for a systematic assessment of proteoforms and their functional implications directly from bottom-up proteomic datasets.


Introduction
Human cells are known to perform thousands of different biochemical functions and the central dogma of biology states that proteins that catalyze the vast majority of these functions arise from the transcription and translation of the information contained in the respective genome. The International Human Genome Sequencing Consortium reported ~ 20,000 protein-coding genes in the human genome 1 and surprisingly, the number of protein coding genes does not scale with the complexity of functions of eukaryotic organisms 2 . These findings have led to the notion that the protein coding information of the genome is substantially diversified structurally and functionally along the axis of gene expression 3 . Specific mechanisms that catalyze this diversification include alterative splicing of transcripts, post-translational processing and modification of proteins, and the variable association of proteins in functional protein complexes. Consequently, protein coding genes frequently give rise to multiple distinct protein species -proteoforms -that have a unique primary amino acid sequence and localized post-translational modifications (PTMs) 4,5 and that might, in turn, partition into different functional complexes or show functional differences. Currently it is estimated that the ~ 20,000 coding genes generate more than a million different proteoforms 6 that can differ between individual cells, tissues and disease phenotypes 4,[7][8][9] . This increase in complexity beyond the directly translated genomic sequence information hampers genotype-based phenotype inference and highlights the importance of capturing proteome diversity to increase the mechanistic understanding of biochemical processes in basic and translational research.
Over the last decades, mass spectrometry (MS) has emerged as the key technology for proteomic analyses 10,11 . The large array of mass spectrometric techniques can be grouped into two main approaches: top-down and bottom-up proteomics. In top-down workflows, samples containing intact proteins are chromatographically separated, ionized and analyzed in a mass spectrometer. Recorded spectra of both the intact and fragmented proteins determine the unique primary protein sequence and PTMs of individual proteoforms 12 . Recent top-down proteomic studies reported the identification of more than 3,000 unique proteoforms originating from up to ~ 1,000 individual genes 13,14 . Gaining deeper proteoform coverage by top-down proteomics is challenged by the limitations of current separation techniques, the MS and MS/MS analysis of large ions, and the interpretation of the resulting spectra by available analysis software 12,15 . However, bottom-up proteomic workflows suffer from the principle limitation that the connectivity between identified peptides and their proteins of origin is lost during the enzymatic digestion step. This necessitates an in silico inference step that maps measured peptide signals back to individual proteins. This is a challenging task in general 16 and is particularly hard for resolving different proteoforms 3 . Recent advances in instrumentation, data acquisition and data analysis, especially the development of data-independent acquisition (DIA/SWATH-MS) strategies, have enabled the measurement of large bottom-up proteomic datasets at high proteome coverage, combined with consistent and accurate quantification [17][18][19] . Based on these developments peptide-level bottomup proteomic data became more reliable both on the qualitative and quantitative level, as demonstrated in several of our previous studies 20,21 . Thus, useful information about the presence of individual modifications or sequence variants on the peptide level can be readily obtained. However, the possibilities to systematically assign and distinguish unique proteoforms from bottom-up proteomics datasets remains a mostly unexplored area to date.
Nevertheless, researchers in the early days of bottom-up proteomics already observed that peptides of the same protein might follow distinct quantitative patterns across a dataset and that peptide co-variation analysis can be leveraged to improve proteomic analyses on different levels. The predominant focus of previous work has been to use peptide correlation analysis for the purpose of filtering out dissimilarly behaving peptides in an effort to improve protein quantification 22,23 or protein inference 24 . It has also been recognized that some of the determined 'outlier' peptides could indeed contain valuable biological information e.g. by originating from different proteoforms and previous work explored the possibility to use peptide correlation patterns for proteoform assignment 23,[25][26][27] . In this manuscript we describe COPF, a novel strategy for COrrelation based functional ProteoForm assessment in bottom-up proteomics data that extends the concept of peptide correlation analysis towards establishing a generic workflow with the main purpose to systematically assign peptides to covarying proteoform groups (also see Glossary in Supplementary Table 1) in different types of bottom-up proteomics datasets.
Even more demanding than the identification of a proteoform as a distinct chemical entity is the assessment of its biological context. Although some proteoforms have successfully been annotated with molecular functions and implicated phenotypic traits 7 , the systematic assessment of proteoform-specific functions remains challenging. A shift of focus from the mere enumeration of various proteoforms detected from a cell towards establishing direct links between proteoform species and their functional significance would be a major advance in the field.
Alternative proteoforms from the same gene can functionally diverge by different mechanisms. These include i) proteoform-specific association with protein complexes, ii) different biochemical function of proteoforms and iii) different functional state of proteoforms. The COPF strategy presented here was designed to directly link proteoform group detection to specific functional contexts based on bottom-up proteomics data. Key properties of applicable datasets include both high sequence coverage and quantitative accuracy as well as a sufficient number of replicates with respect to the expected effect size of abundance differences between proteoforms across the investigated biological conditions. We recently presented a novel strategy for detecting and quantifying protein complexes from co-fractionation MS datasets 21,28,29 . The method consists of native cell lysis to extract protein complexes, followed by size-exclusion chromatography (SEC) and analysis by DIA/SWATH-MS. Because of the rich biological context, high sequence coverage and accurate quantitative peptide-level information inherent to SEC-SWATH-MS, we argued that it provides a unique opportunity to evaluate proteoform groups that are relevant for protein complex organization. Therefore, we first demonstrate the application of COPF to a SEC-SWATH-MS dataset where native complexes across two different cell cycle stages were resolved and analyzed 28 . In this dataset, COPF determined both assembly-and cell cycle-specific proteoform groups. We further argue that the general concept of peptide covariation based proteoform assignment should also be broadly applicable to large bottom-up proteomic studies with a high degree of biological variation. As a second application, we therefore applied COPF to assign functional proteoform groups in a typical bottom-up proteomic cohort study consisting of five tissue samples from the mouse BXD genetic reference panel 30 . In this dataset, COPF could determine several tissue-specific proteoform groups.
Overall, proteoform groups detected in either dataset could be linked to distinct molecular mechanisms including proteolytic cleavage, alternative splicing and phosphorylation. Because the strategy is agnostic to different types of proteoforms we expect that the range of applications will expand in the future and that this approach, therefore, lays the foundation for the systematic assessment of proteoform groups across large bottom-up proteomic datasets and for linking these groups to biological function.

Principle of the method and implementation
The assignment of peptides to unique proteoforms is a challenging task in bottomup proteomic workflows, because the majority of detected peptides are frequently shared between multiple proteoforms and multiple diverging peptides cannot be uniquely assigned during protein inference. Here we propose a novel, data-driven strategy to assign peptides to unique functional proteoform groups based on peptide correlation patterns across large bottom-up proteomic datasets (COrrelation based functional ProteoForm assessment, COPF). We define a functional proteoform group as a group of peptides, derived from the same gene, that co-vary across a large multi-condition dataset. A proteoform group can, but does not have to represent a unique, specific proteoform (also see Glossary in Supplementary Table 1). The COPF strategy is based on the following considerations: i) In case only one proteoform is expressed or all proteoforms of a protein have similar characteristics across a multi-condition dataset, all sibling peptides (i.e. peptides originating from the same parental gene/protein) should display a similar quantitative profile, as schematically illustrated in Figure 1, left panel; ii) if a gene generates multiple distinct proteoforms that differ between the analyzed samples across a multi-condition dataset, sibling peptides of that protein can be separated into groups of highly correlated peptides, which we accordingly assign to distinct proteoform groups, schematically illustrated in Figure 1, right panel.
Conceptually, the proteoform detection workflow in COPF can be divided into four steps ( Figure 1, computational data analysis). First, the intensities of peptides assigned to the same gene or protein identifier are determined from the corresponding MS signals across all measured samples. Second, all pairwise peptide correlations within a protein are calculated based on the determined intensity values across samples. Third, the peptides of a protein are subjected to hierarchical clustering, using one minus the previously calculated correlation as the distance metric. The tree is then cut into two clusters, minimally containing two peptides each (see Method section for details). Fourth, a proteoform score is calculated for each protein. The score is calculated as the mean peptide correlation across clusters minus the within-cluster correlation. A higher proteoform score thus indicates a higher within-cluster versus across-cluster correlation. The assumption of our COPF strategy is that proteins with multiple distinct proteoforms, that behave differentially across a dataset will have higher proteoform scores than proteins without differentially behaving proteoforms.
The COPF strategy is implemented and openly available as an extended version of our previously published CCprofiler R package 21,29 at: https://github.com/CCprofiler/CCprofiler.

Benchmark
A particular challenge for benchmarking the COPF strategy for assigning functional proteoform groups is the lack of available biological ground truth data.
To evaluate COPF performance, we performed the following analyses: First, we conducted a sensitivity analysis based on an in silico generated benchmarking dataset. Second, we simulated the impact of different dataset properties on correct proteoform detection and grouping. Finally, we determined a set of biologically motivated criteria to evaluate the credibility of assigned proteoform groups in any given dataset.

Benchmarking by in silico sensitivity analysis
As a first performance evaluation step we assessed the sensitivity of the proteoform detection workflow in COPF based on an in silico generated benchmarking dataset. The basis for this dataset is our previously published complex co-fractionation dataset of the HEK293 cell line. The dataset consists of native complexes isolated from a single biological condition that were separated by size-exclusion chromatography (SEC) and then analyzed by bottom-up proteomic analysis using data-independent acquisition mass spectrometry (DIA/SWATH-MS) (details on the data structure are provided in the reference and further down) 21 . We assumed that the 6299 proteins detected in this study exist as single forms and do not include any regulated proteoform groups (see below for further discussion of this assumption); we used these as a negative reference set. Exemplary peptide profiles for the NIBL1 protein are shown in Figure 2A, top panel. To obtain a positive protein reference set that COPF should classify as having multiple proteoforms, we generated an additional 3000 'mixed' proteins by assigning peptides of two different parental proteins to the same pseudo mixed protein. Exemplary peptide profiles for a mixed protein based on 50% peptides of each the NADAP and the RL14 protein are shown in Figure 2A, bottom panel. The resulting set of single and mixed proteins was used to evaluate the ability of our algorithm to correctly distinguish single and mixed proteins as well as to assign peptides correctly to their corresponding parental proteins. Figure 2B shows the score distribution for the single and mixed proteins. The receiver operating characteristic (ROC) curve ( Figure 2C) illustrates the dependence of true and false positive rates on the proteoform score threshold (marker color) and the percentage of mixed proteins for which all peptides were correctly assigned to their parental proteins (marker size). Overall the results of the in silico sensitivity analysis reveal that our hypothesis that mixture proteins can be differentiated from single proteins by means of the proteoform score holds true. At a falsepositive rate of 5.6%, 61.7% of the mixed proteins were correctly detected. For 76.8% of these mixed proteins, all peptides were completely and correctly assigned to their parental proteins. It is important to keep in mind that these are conservative estimates for the method performance for the following reasons: first, there will likely be true proteoform groups in the negative background dataset of the HEK293 sample, which are here treated as false positives. Second, some of the proteins that were grouped into one mixed protein are likely to be involved in protein interactions or having a similar elution profile by chance 29 .
Peptides of such mixed proteins would therefore not be detectable as separate proteoform groups based on the correlation based COPF approach, making some of the 'positive' mixture proteins impossible to detect. Despite these limitations, the underlying real proteomics data used for this in silico benchmark has the benefit of having realistic general properties related to the number of peptides per protein as well as the measurement noise between samples. However, while we set the ratio of peptides in the mixed proteins to 50% in this analysis, this ratio might vary for actual proteoforms.

Benchmarking simulation study
In addition to the in silico sensitivity analysis, we further performed simulation studies to evaluate the impact of different dataset properties on correct proteoform detection and grouping. The simulation was performed purely based on random sampling from normal distributions without any underlying experimental data. In contrast to the in silico benchmark described above, this allowed us to control all parameters and to evaluate their impact on COPF performance. First, we simulated intensity profiles for 1000 'single' proteins. For each protein, intensities across 100 samples were generated by random sampling from a normal distribution (mean = 10, standard deviation = 1). The resulting protein profiles were used as basis to simulate 10 peptides for each protein by adding peptide-specific technical noise, sampling from a normal distribution (mean = protein intensity and standard deviation (SD) = 1.5, 2, 2.5, 3, 3.5 or 4).
Exemplary simulated peptide intensity profiles for a single protein (SD = 1.5) are shown in Figure 2D, top panel. In addition to the 1000 single proteins, we further  Figure 2F). The results of the simulation study demonstrate the boundaries of the COPF approach. As expected, increasing the technical variation between samples at a given proteoform-specific fold change decreases the sensitivity of proteoform detection. For such cases, proteoform group detectability is better for proteoform groups with multiple distinguishing peptides as compared to proteoforms with only two distinguishing peptides.

Benchmarking by means of auxiliary data
To finally assess and characterize the proteoform groups determined in a specific dataset, we propose evaluation criteria based on auxiliary data on three different levels: (1) agreement with prior knowledge, (2) sequence proximity analysis and (3) dataset specific criteria ( Figure 2G). We demonstrate these evaluations in the two applications (below). Prior knowledge can be derived from public databases or prior publications. Here, we assessed the agreement of proteoform groups derived by the purely data driven COPF approach with annotated sequence variants in UniProt 31 . We further evaluated overlapping information with previously reported regulated phosphorylation sites (see below). For the sequence proximity analysis, we assume that many proteoforms differ by extended sequence stretches, e.g. when generated by alternative splicing or proteolytic cleavage. The sequence proximity analysis evaluates if peptides assigned to the same proteoform group are in closer relative sequence proximity than expected for random peptide grouping. Figure 2G (center) schematically illustrates the sequence proximity analysis strategy (for details also see Methods section). Finally, a key promise of the COPF approach is that the derived proteoforms can directly be linked to functional or phenotypic characteristics of the studied dataset. The reason for this is that proteoform groups are only detectable by COPF if they show different behavior across the studied dataset.
Detected proteoform groups can therefore be assessed for their agreement with known underlying technical or biological structures of the given dataset, e.g. different biological conditions.
The three presented evaluation criteria can be applied to each dataset analyzed by COPF to further prioritize proteoform candidates to find promising candidates for follow-up analyses. Simulation studies were performed to evaluate the parameters that influence the success of our approach. Data for 1000 proteins was simulated by random sampling from a normal distribution across 100 samples (mean = 10, standard deviation = 1).
The resulting protein profiles were used as basis to simulate 10 peptides for each protein by adding peptide-specific technical noise, sampling from a normal distribution (mean = protein intensity and standard deviation (SD) = 1.5, 2, 2.5, 3, 3.5 or 4). Exemplary simulated peptide intensity profiles for a single protein (SD = 1.5) are shown in the top panel. In addition to the 1000 single proteins, we further simulated 1000 'mixed' proteins. Mixed proteins were generated by multiplying the intensity values of

SWATH-MS dataset
We applied the COPF strategy to our previously published native complex cofractionation dataset of Hela CCL2 cells synchronized in interphase and mitosis 28 to identify cell-cycle and assembly-specific proteoforms. In this case the dataset consists of native complexes isolated from cells in two cell cycle states, separated by size-exclusion chromatography (SEC), and then analyzed by bottom-up proteomic analysis using data-independent acquisition mass spectrometry (DIA/SWATH-MS) 21,29 . The workflow with individual steps is schematically illustrated in Figure 3A. First, cells are lysed under close to native conditions to keep protein complexes intact. Second, the protein complex mixture is separated by SEC into 65 fractions. Third, each sampled fraction is separately processed for bottom-up proteomic measurements by SWATH-MS 17 followed by peptidecentric analysis [32][33][34] . In a fourth step, peptide elution profiles across the SEC fractions are evaluated to infer protein complex assemblies by CCprofiler as described before 21,29 and additionally by the COPF method to identify proteoforms differentially associating with complexes within or across cell cycle states.
For the COPF analysis, we took the original peptide-level elution profiles from Heusel and Bludau et al. 28 as a starting point. We imported the data into the R-CCprofiler framework for subsequent data preprocessing as follows. First, we annotated peptides with their start and end position in the canonical protein sequence. Second, peptides with overlapping start and end positions (e.g. peptides resulting from missed cleavages) were reduced to one representative peptide, selected based on the highest overall intensity across the dataset. Third, missing values were imputed for fractions with a valid value in both the preceding and following SEC fraction. Fourth, peptides detected in fewer than three consecutive fractions and peptides with zero variance along the SEC dimension were removed.
Finally, only proteins with two or more remaining peptides were considered for further analysis by COPF.
The proteoform score distribution for the remaining 5451 proteins of the SEC-SWATH-MS dataset is shown in Figure 3B. We manually selected a conservative score threshold of 0.25 based on the criterion that the threshold precedes the exponential increase in the number of proteoforms observed at scores ≤ 0.2 Figure 1A  were significantly closer in sequence proximity than expected by chance (p-value ≤ 10%) and proteoforms for an additional 43 proteins (37%) scored among the lowest 10% of possible p-values, given the number of peptides in the protein (for details also see Method section). We further analyzed the dataset with respect to proteoforms that associate with different complexes (assembly-specific) and to proteoforms that differ between cell-cycle states (cell cycle specific). Proteoform groups for 41 proteins (36%) could be classified as assembly-specific, since we observed them in multiple distinct assembly states resolved along the SEC dimension (for details see Method section). Additionally, COPF predicted proteoform groups of 53 proteins (46%) that were significantly differentially expressed between the two cell cycle stages (log2-fold change ≥ 1 and Benjamini-

(Supplementary
A summary of the proteoform characterization is provided in Figure 3C. As a final assessment of the global characteristics of the set of proteins detected as having multiple proteoform groups, we performed an enrichment analysis ( Figure 3D). We observed that the protein set is highly enriched in phosphoproteins. We also detected a significant enrichment in proline and polyserine sequence features. This is particularly interesting because many cyclin dependent kinases act on sequence motifs including proline and serine [36][37][38][39][40] .
Additionally, the protein set was enriched for proteins related to rearrangement of the cytoskeleton. This result is in agreement with cellular rearrangements required during mitosis. Finally, an enrichment of proteins with UniProt annotated sequence-(e.g. known polymorphisms) and splice-variants is observed. In addition to these global insights, our dataset provided a rich source of new biological information. In the following we present selected examples that highlight different mechanisms generating proteoforms as well as different functional associations of proteoforms with either the cell cycle or protein assembly.
The first example is the proteasome subunit beta type-7 (PSMB7, UniProt ID: Q99436) which is convincingly resolved in our SEC data. The proteasome assembly line is a well-studied yet still heavily investigated system. Figure 4A shows a simplified schematic of the process. A key step in the prevalent model of 20S particle assembly is the integration of PSMB7 as the last beta subunit, triggering proteolytic cleavage of its pro-peptide, followed by formation of the full 20S core proteasome complex 41 . Our data-driven COPF strategy identified two assembly specific proteoforms for PSMB7 (also see Supplementary Figure 2A).
The SEC profiles of all detected PSMB7 peptides are shown in Figure 4B. The peptides assigned to the two different proteoform groups are highlighted in blue and orange. While the peptides of both proteoform groups participate in the lower molecular weight peak around fraction 33, only peptides of the orange proteoform group were detected in the higher molecular weight peak around fraction 25.
From protein co-elution analysis (Supplementary Figure 2C) as well as our previous study 21 , we know that the peak group around fraction 33 corresponds to a proteasome assembly intermediate and the peak group around fraction 25 corresponds to the full 20S core proteasome. Checking the location of the detected peptides along the PSMB7 sequence reveals that the peptides of the blue proteoform correspond to the two N-terminal peptides ( Figure 4C, also see Supplementary Figure 2B). The second peptide (TGTTIAGVVYK) spans the known proteolytic cleavage site of the PSMB7 pro-peptide. To verify our finding, we performed a targeted re-extraction of the semi-tryptic peptide (TTIAGVVYK) that is produced by tryptic digestion of the processed, short proteoform using Skyline 42,43 . The extracted signal of this semi-tryptic peptide is highlighted in green in Figure 4B and C (also see Supplementary Figure 3). In contrast to the fully tryptic (blue) peptide, the cleaved peptide sequence co-elutes with the peptides of the orange proteoform group, indicating that the processed form is integrated in the 20S proteasome core complex as expected and identifying the precise location of the proteolytic processing that generates the proteoform.
The nuclear pore complex (NPC) protein Nup98-Nup96 (UniProt ID: P52948) presents a second example of the capacity of COPF for data-driven proteoform assignment. The Nup98-Nup96 proteoform groups were identified by the algorithm as both assembly and cell cycle specific ( Figure 4D). The Nup98 gene is known to encode a 186 kDa precursor protein that undergoes autoproteolytic cleavage, to generate a 98 kDa nucleoporin (NUP98) and a 96 kDa nucleoporin (NUP96) ( Figure 4D SEH1L (Q96EE3) and NUP85 (Q9BW27). Here our purely data-driven approach was capable of correctly grouping peptides according to the two known proteoforms NUP98 and NUP96 ( Figure 4E and F, also see Supplementary Figure   2D and E). Based on this assignment we can now also demonstrate that only the NUP96 proteoform integrates into the Nup107-160 sub-complex and follows its behavior across the cell cycle (Supplementary Figure 2F). This is in line with reports from previous studies 48 .  49 . In our HeLa cell cycle dataset COPF detected two assembly-specific proteoform groups that correctly matched the annotated proteoforms sNASP and tNASP ( Figure 5A, also see Supplementary Figure 4A).
The two distinct peptide peak groups in different molecular weight regions of the SEC separation indeed confirm previous findings that the two proteoforms engage in different assemblies.
Whereas the examples described above refer to well annotated proteoforms that we were able to resolve without including prior knowledge in the analysis, our findings also uncovered less-well understood proteoforms. Figure 5C and D show the profile and sequence location for peptides of Transmembrane protein 106B (TMEM106B, UniProt ID: Q9NUM4), which has no annotated sequence variants or specific post-processing steps annotated in UniProt. Nevertheless, COPF identified two clearly distinguishable, assembly-specific proteoforms (Supplementary Figure 4B). In recent literature, we found evidence that TMEM106B is a lysosomal membrane protein that, upon membrane integration, undergoes evolutionarilyconserved regulated intramembrane proteolysis 50,51 . This involves a two-step mechanism where the luminal domain is first cleaved off by an unknown enzyme at amino-acid (AA) residue 127, followed by a second cleavage (at AA 106) of the N-terminal fragment that is still anchored in the membrane (cleavage sites are indicated by the scissors in Figure 5D). Our data suggest that the peptides of the Making use of the possibility to re-extract data from SWATH-MS maps, we performed a targeted analysis in Skyline to detect and quantify the expected phosphopeptides in our dataset (Supplementary Figure 6). As predicted, we could confirm mitosis-specific phosphorylation for two of the four peptides (EIESS[+80]PQYR and TLSS[+80]PSNRPSGETSVPPPPAVGR, Figure 5F). One

phosphopeptide (GPPQS[+80]PVFEGVYNNSR) had only a weak signal (purple)
and the fourth could not be detected. Nevertheless, it is remarkable that the phosphopeptides could be detected and quantified in the samples given the long processing protocol without enrichment and without specific phosphatase inhibition treatment. These findings highlight that the COPF approach is capable of determining phospho-specific proteoform groups, given that at least two peptides are involved. Here, it is important to emphasize again that the strategy by which phospho-specific proteoform groups were detected with COPF directly links these detections to their biological relevance in the cell cycle.

Tissue-specific proteoforms in SWATH-MS data of different mouse tissues
In  Figure 6A) we tested the assumption that the COPF strategy to identify functional proteoforms was also applicable to peptide intensity vs. sample data matrices from sample sets of sufficient size and variability between samples.
We initially imported the peptide-level data matrix (intensity vs. sample) into the R-CCprofiler framework and applied the same data processing steps as for the application of COPF for SEC-SWATH-MS data described above, except of the consecutive identification filter and missing value imputation. Proteins with two or more remaining peptides were considered for further analysis by COPF and resulted in the proteoform score distribution for 2885 proteins shown in Figure   6B. We manually selected a score threshold of 0.4 based on the criteria that the threshold precedes the exponential increase in the number of proteoforms observed for scores < 0.4 (Supplementary Figure 1C). Based on the selected threshold, the algorithm predicted 104 proteins with potential functional proteoform groups. 25 (24%) of these are annotated with multiple isoforms in UniProt. Peptide proximity analysis within the coding sequence further revealed that the proteoforms for 30 proteins (29%) were significantly closer in sequence proximity than expected by chance (p-value ≤ 10%) and proteoforms for an additional 9 proteins (9%) scored among the lowest 10% of possible p-values, given the number of peptides identified for the respective protein (for details also see Method section). Further evaluation of the detected proteoform groups revealed that proteoform groups for 101 proteins (97%) could be classified as tissue specific. This classification is based on the protein being differentially regulated when using tissue and predicted proteoform group information as prior knowledge for an ANOVA analysis (Bonferroni corrected p-value ≤ 0.01). A summary of the proteoform characterization is provided in Figure 6C.
We further performed an enrichment analysis of the proteins annotated with multiple proteoforms, showing that they are significantly enriched in keywords related to energy metabolism in mitochondria. This is in line with the expectation that energy metabolism is regulated differently between different tissues.
Interestingly, many proteins are associated with acetylation. Among the 104 proteins assigned to have functional proteoform groups, multiple biologically interesting instances stood out, exemplified by LIM domain-binding protein 3 (Ldb3, also known as Cypher, UniProt ID: Q9JKS4) and sorbin and SH3 domain-containing protein 2 (Sorbs2, UniProt ID: Q3UTJ2), respectively.
The protein Ldb3 was previously described as muscle specific 52 and accordingly was found to be highly expressed only in heart and quadriceps samples of our dataset ( Figure 7A). The COPF strategy clearly assigned the peptides of Ldb3 into two tissue-specific proteoform groups, indicated in orange and blue ( Figure 7B).
These proteoforms directly match the previously annotated splice variants of Ldb3, where peptides of the first proteofrom group, Q9JKS4-1, exactly map to the canonical sequence region that also has an alternative sequence variant (alternative sequence 1 in Figure 7C). To further validate this finding, we performed a targeted extraction of peptides in the alternative sequence region For Sorbs2 ( Figure 7D) our fully data-driven approach assigned the peptides to two clearly distinct proteoform groups ( Figure 7E). While the peptides of the orange proteoform group are abundant in brain, heart and liver tissues, the peptides of the second, blue proteoform group are observed exclusively in the brain. Mapping the identified peptides to the canonical protein sequence and matching them with annotated sequence variants ( Figure 7F), it can be observed that the two blue peptides map to the region of a brain-specific splice variant (alternative sequence 10) that includes an exon that has previously been shown to only be expressed in brain tissue and exclusively in neurons 53-55 . Until recently, proteoforms have mainly been inferred from transcriptomics data of next-generation sequencing (NGS) studies that aim to discover different mRNA transcripts of the same gene, mostly derived from alternative splicing. In these studies, it is assumed that the alternative transcripts are further translated to protein sequences, which is not always the case 56,57 . One approach to study proteoforms directly on the protein level is to generate and apply proteoformspecific antibodies, which can only be achieved at fairly low throughput. One recent example represents a study that characterized alternative CD6 isoforms Because the connection between protein and peptide is lost at an early stage of bottom-up proteomics, this method has mostly been considered unsuitable for the analysis of proteoforms to date. In this study we present a novel analysis concept and software we term COPF that i) systematically assigns peptides identified by bottom-up proteomics to proteoform groups and ii) directly links their identification to the biological context of the dataset at hand. We define a proteoform group as a group of peptides, derived from the same gene, that co-vary across large multi-condition datasets. COPF follows a fully data-driven approach, which necessitates a change in relative abundance between proteoform groups across the studied samples in order to enable proteoform group detection. The approach therefore directly couples detection to the assessment of a biological context.
An important distinction between functional proteoform groups assigned by COPF and those determined by top-down proteomics approaches is that COPF does not fully characterize the proteoform's complete primary amino acid sequence and all of its modifications. It merely determines whether peptides exist that can differentiate the different biological contexts of a protein. Importantly, proteoform groups detected by COPF can directly imply a functional consequence depending on the study design. The power of the method is thereby directly linked to specific dataset properties. SEC-SWATH-MS data, as presented in this study, for example provides the opportunity to link the detection of proteoform groups directly to protein complex assembly, a property which makes such datasets unique and especially interesting for systematic proteoform investigation by COPF. A second factor based on which COPF can detect proteoform groups and distinguish functional associations is the biological context of a study, namely from the different biological conditions at hand. In the two datasets presented herein, these correspond to cell cycle-and tissue-specific proteoform groups.
A key challenge during the development of COPF was the lack of a ground truth dataset with known functional proteoforms. To nevertheless enable a performance evaluation of our software, we generated different in silico reference datasets. Our analysis demonstrated that COPF can both identify mixed proteins and assign peptides to the correct origin (Figure 2A-C). Further simulation studies allowed us to investigate the different parameters that influence the sensitivity of our approach. We could demonstrate that reducing technical variation between samples increases the sensitivity of mixed protein detection. Increased technical variation could further be rescued by increasing the number of peptides in a specific proteoform group ( Figure 2D-F). These findings suggest that a high sequence coverage per protein as well as high reproducibility and quantitative accuracy are the key determinants for a successful proteoform group detection in bottom-up datasets. In this study, COPF was applied to two datasets generated by SWATH-MS. When compared to more classical data-dependent acquisition (DDA), DIA (or SWATH-MS) has the advantage of both increased data completeness and quantitative accuracy 59 . Recent developments on MS instrument level as well as in data acquisition and analysis promise further improvements with regard to proteome and sequence coverage 60,61 . Based on the results of the simulation benchmark, it can be expected that COPF performance and sensitivity will improve in concert. In addition to these technical considerations about sequence coverage and quantitative accuracy, it is also important to consider expected effect sizes of proteoform differences that can be observed in a given dataset. A limitation of COPF in its current implementation is that proteins will only be split into maximally two proteoform groups. It is expected that some proteins might have more than two functionally relevant proteoforms which could, at this point, not be resolved. Future work will focus on strategies towards a more finegrained separation of proteins into a variable number of proteoform groups. This process will strongly benefit from a higher sequence coverage, that could potentially be achieved by newest DIA technologies in the near future 60,61 .
The presented examples in this study include proteoform groups generated by proteolytic cleavage (Figure 4A-C, Figure

COPF analysis of the cell cycle SEC-SWATH-MS dataset
The peptide-level data and annotation of the cell cycle SEC-SWATH-MS dataset,
The differential analysis was performed as previously described by Heusel et al. 28 with minor modifications. Intensity values per condition and replicated were first extracted by extractFeatureVals and fillFeatureVals.

COPF analysis of the mouse SWATH-MS dataset
The peptide-level quantitative data from Williams et al. 30 was downloaded and only the whole proteome samples were selected for the analysis with CCprofiler and COPF. The quantitative data were loaded in R and imported as traces object, using importPCPdata. The proteins were further annotated with general information from UniProt using annotateTraces. Peptide positions within the canonical protein sequence were determined by annotatePeptideSequences.
Peptides from the same protein with similar start or end position, e.g. generated by missed cleavages, were summarized into a single peptide based on the highest intensity across the dataset by summarizeAlternativePeptideSequences. Peptides with zero variance were removed and only proteins with multiple remaining peptides were kept for downstream analysis (filterSinglePeptideHits). Due to its suspicious properties, the protein with UniProt accession A2ASS6 was removed.
For COPF analysis, all pairwise Pearson correlations between sibling peptides were calculated by calculateGeneCorrMatrices. Hierachical clustering based on an average linkage was performed by clusterPeptides. The tree was cut into two clusters by cutClustersInNreal. Each cluster is required to contain at least two peptides. Peptides that would form a single peptide cluster were marked as outliers. Finally, the proteoform score was calculated by calculateProteoformScore. Proteoform groups were finally annotated by annotateTracesWithProteoforms.
To evaluate the sequence proximity of the resulting clusters, the evaluateProteoformLocation function was applied.

ANOVA analysis of the mouse SWATH-MS dataset
For the ANOVA analysis, we selected proteoforms based on a score cutoff of 0.4.
Outlier peptides from the clustering were removed prior to proteoform quantification, which was performed by proteinQuantification (quantLevel = "proteoform_id", topN = 1000, keep_less = TRUE). ANOVA analysis for each protein was performed using the aov function of the R stats library, applying following function: log2-intensity ~ tissue * proteoform. Multiple testing across all proteins was performed using a Bonferroni correction (p.adjust of the R stats library).

Enrichment analyses
All presented enrichment analyses were performed on the DAVID website 69,70 .
Results were filtered for a p-value of 0.01, a minimum count of 5 and a minimum fold-enrichment of 1.2.

Sequence proximity analysis
The sequence proximity analysis evaluates if peptides assigned to the same proteoform group are in closer relative sequence proximity than expected for random peptide grouping. To test this hypothesis, the peptides of each protein are ranked by their relative peptide position start site (i.e. the position of the first amino acid of the given peptide in the canonical combined protein sequence).
Subsequently, the normalized standard deviation of the derived peptide position ranks is calculated by dividing the standard deviation of the observed rank vector by the standard deviation of a uniform distribution of equal length. To estimate a probability of whether the observed proximity score is more extreme than expected by chance, we randomly shuffle the peptide ranks 1000 times for each protein and calculate the according proximity score for each of the permutations.
An empirical p-value for each proteoform group is subsequently derived by dividing the number of random permutations with a more extreme proximity score compared to the real observation by the proximity score of the real observation. For proteins with very few detected peptides, it is impossible to reach statistical significance by means of a classical p-value. For these cases we implemented a second pseudo p-value that does not necessitate values 'smaller or equal' (as for the classical p-value), but that only considers 'smaller' values. This means that at a pseudo p-value of 5%, maximally 5% of the random permutations score better than the real observation. This criterion covers extreme cases with few peptides, which might still be taken into consideration for follow-up investigations. One example for a protein with two proteoforms that do not reach statistical significance in the sequence proximity analysis, because only two peptides in the short proteoform are too few, is PSMB7 (Supplementary Figure   2B). You can appreciate that the true observed proximity score is as low as statistically possible, therefore still potentially presenting an interesting candidate, as shown in this example. We report these extreme cases by stating that the protein scored among the lowest 10% of possible p-values.

Phospho-enrichment analysis
To test our hypothesis that some of the proteoform groups detected by the cell cycle SEC-SWATH-MS analysis might be muti-phosphorylation derived, we retrieved the cell cycle phospho-proteomic dataset from Karayel et al. 35

Targeted analysis of selected peptides in Skyline
Selected raw data of both the cell cycle SEC-SWATH-SM and mouse tissue SWATH-MS data was downloaded from their respective Pride repositories, https://www.ebi.ac.uk/pride/archive/projects/PXD010288 and https://www.ebi.ac.uk/pride/archive/projects/PXD005044. Targeted extraction of the selected peptides was performed using Skyline version 4.1. The acquisition method was set to DIA and the isolation scheme was matched to the original publication of the datasets. Complete y and b ion series without retention time filtering were extracted for all peptides of interest. The data was than manually inspected for matching peak groups and the fragment ions were filtered for coelution. The quantification was subsequently based on the total area under the identified peak groups.

Website
Data preprocessing and visualization for the dashboard was performed using the python programming language. The following libraries were utilized for data processing: numpy, pandas, scipy, re and pyteomics 71,72