Folding a Small Protein Using Harmonic Linear Discriminant Analysis

Many processes of scientific importance are characterized by time scales that extend far beyond the reach of standard simulation techniques. To circumvent this impediment a plethora of enhanced sampling methods has been developed. One important class of such methods relies on the application of a bias that is function of a set of collective variables specially designed for the problem under consideration. The design of good collective variables can be challenging and thereby constitutes the main bottle neck in the application of these methods. To address this problem, recently we have introduced Harmonic Linear Discriminant Analysis, a method to systematically construct collective variables. The method uses as input information on the metastable states visited during the process that is being considered, information that can be gathered in short unbiased MD simulations, to construct the collective variables as linear combinations of a set of descriptors. Here, to scale up our examination of the method's efficiency, we applied it to the folding of Chignolin in water. Interestingly, already before any biased simulations were run, the constructed one dimensional collective variable revealed much of the physics that underlies the folding process. In addition, using it in Metadynamics we were able to run simulations in which the system goes from the folded state to the unfolded one and back, where to get fully converged results we combined Metadynamics with Parallel Tempering. Finally, we examined how the collective variable performs when different sets of descriptors are used in its construction.

Many processes of scientific importance are characterized by time scales that extend far beyond the reach of standard simulation techniques. To circumvent this impediment a plethora of enhanced sampling methods has been developed. One important class of such methods relies on the application of a bias that is function of a set of collective variables specially designed for the problem under consideration. The design of good collective variables can be challenging and thereby constitutes the main bottle neck in the application of these methods. To address this problem, recently we have introduced Harmonic Linear Discriminant Analysis, a method to systematically construct collective variables. The method uses as input information on the metastable states visited during the process that is being considered, information that can be gathered in short unbiased MD simulations, to construct the collective variables as linear combinations of a set of descriptors. Here, to scale up our examination of the method's efficiency, we applied it to the folding of Chignolin in water. Interestingly, already before any biased simulations were run, the constructed one dimensional collective variable revealed much of the physics that underlies the folding process. In addition, using it in Metadynamics we were able to run simulations in which the system goes from the folded state to the unfolded one and back, where to get fully converged results we combined Metadynamics with Parallel Tempering. Finally, we examined how the collective variable performs when different sets of descriptors are used in its construction. Simulations of complex processes such as drug binding, protein association, protein folding, phase transitions, etc. have proven to be of great value and are a pillar of contemporary scientific investigation. However, many such processes are characterized by very long time scales which prohibit their simulation using conventional simulation techniques. Hence, to circumvent this limitation, a plethora of enhanced sampling methods has been developed over the years including replica exchange based methods such as Parallel Tempering 1 and bias based techniques such as Umbrella Sampling 2 , Metadynamics 3 and Variationally Enhanced Sampling 4 . The latter category relies on the use of collective variables (CVs) which describe the most essential degrees of freedom of the processes being considered. Constructing appropriate CVs however, can be challenging and time consuming. Thus, and in light of the expected continuing increase in the complexity and size of the systems being studied, devising techniques for the systematic construction of efficient CVs is regarded as an important objective of the enhanced sampling community. Also, finding good CVs is not only a technical issue, but is a way of encoding in a compact and transparent way the essence of the process being considered.
In the effort to address this challenge, in a recent publication 5,6 we have proposed a new scheme for constructing systematically viable CVs through the utilization of the supervised learning class classification paradigm, and in particular using a modification of Fisher's Linear Discriminant Analysis (LDA), termed Harmonic Linear Discriminant Analysis (HLDA) (see also ref. 7 ). The LDA assumes a multivariate normal distribution of the descriptors. For this reason we also examine how deviation from multivariate normality affect the CVs efficiency. The scheme of choice requires as input only short unbiased trajectories, for each metastable state. Using these data, HLDA can estimate the direction within an N d dimensional space of selected system descriptors upon which the projections of these sets of data are best separated. The linear combination corresponding to this direction is then utilized as the CV.
To test its applicability, HLDA has been used in ref. 5 in two examples taken from the realm of Materials Science and Chemistry. In both cases HLDA was found to be able to generate good CVs, leading to biased runs characterized by high frequency of transitions between the metastable states of interest and to a rapidly converging sampling.
The application of HLDA in ref. 5 was, however, still confined to a set of relatively simple problems. Hence, if its use (and the class classification paradigm underlying it) is to be adopted for scientific and technological problems of increasing complexity, it would need to prove effective in such systems. Here, we would like to investigate the performance of HLDA for a relatively more complex system. We consider the case of a small protein, Chignolin, for which extensive simulations on purpose build machines are available 8 . The procedure to determine the HLDA CVs requires the use of a convenient set of descriptors d i (R) that are capable of describing the initial and final states. This study will give us information on the effect of the choice of descriptors on the CVs quality.
We show also how the HLDA CVs bring out much of the physics even before performing the simulations. Moreover, we find that employing the HLDA CVs within Metadynamics simulations enables sampling numerous folding and unfolding path-ways and that through their incorporation in Parallel Tempered Metadynamics (PTMetaD) simulations 9 , estimates for the system Free Energy Surface (FES) can be obtained.

Constructing the collective variable
To construct CVs which describe the Chignolin folding process we utilize the paradigm introduced in ref. 5 that estimates the direction W in an N d dimensional descriptor space in which the projections of the unbiased distributions of the folded and unfolded states are best separated. As in ref. 5 we do this using HLDA, a modification of Fisher's LDA. Thus, as in Fisher's LDA, the estimation of W is done through the maximization of the ratio between the system's so called between class S b and within class S w scatter matrices. Like LDA, the former is measured by the square of the distances between the projected means, and can be written as where µ A,B are the expectation values of the two metastable states. In contrast to LDA, in which the within class matrix is estimated using the average of the two metastable states multivariate variances Σ A,B , here it is estimated from the spreads harmonic average The HLDA objective function which has the form of a Rayleigh ratio is then maximized by which in turn yields the HLDA CV Computational Details Simulations of Chignolin (sequence TYR-TYR-ASP-PRO-GLU-THR-GLY-THR-TRP-TYR) were conducted using GROMACS 5.1.2 10,11 and the PLUMED 2.4 plugin 12 . The CHARMM22* force field 13 and the three-site transferable inter-molecular potential (TIP3P) water model 14 were used to make direct comparisons with ref. 15 . ASP and GLU residues were simulated in their charged states, as were the N-and C-terminal amino acids. A time step of 2 fs was used for all systems and a constant temperature of 340 K (in agreement with ref. 15 ) was maintained by the velocity rescaling thermostat of Bussi et al. 16 . All bonds involving H atoms were constrained with the linear constraint solver (LINCS) algorithm 17 . Electrostatic interactions were calculated with the particle mesh Ewald scheme 18 and 1 nm cutoff was applied to all non-bonded interactions. Runs were all conducted with a box containing 1649 water molecules, along with 2 sodium ions to neutralize the system. Parallel Tempering simulations were run with 40 replicas, each at a different temperature. The replica with the lowest temperature was set to be at T=340 K while the rest of the temperatures were arranged in a geometrical series with a factor a = 1.011. (Details regarding the utilized descriptor sets can be found in the Supplementary Information).

RESULTS
In accordance with the basic requirements of the HLDA approach, we began the study by acquiring two unbiased trajectories spanning roughly 2µs each in the system's folded and unfolded states. The first step in applying HLDA requires the selection of a set of system descriptors that can be instrumental in describing the folding and unfolding of the miniprotein. Ideally, since this step should not require an expert's understanding of the system we proceeded by selecting fairly naively three different sets of descriptors to observe how in the present context this selection can influence the outcome of the method implementation. Here, we also chose to asses the HLDA ability to perform beyond its strict theoretical limitations, namely that the descriptor unbiased fluctuations are normal in form.
Thus, the first set D 1 consisted of 12 distances between different atomic sites on the protein. Six of these distances were selected between atoms situated on the backbone, while six more where taken between atoms situated on the protein's side chains. The second set of descriptors D 2 consisted of 6 contacts placed on the protein's backbone while the third D 3 consisted of αβ functions corresponding to the protein's backbone dihedral angles, i.e. αβ = 1 2 (1 + cos(φ i )) with i = 1..18. (For a detailed list of the atom pairs used for the construction of D 1 and D 2 and the contacts parameters see the SI). The two covariance matrices Σ f , Σ u and two mean vectors µ f , µ u corresponding to the folded and unfolded states respectively, were thus constructed for each of the descriptor sets.
Using this information and applying HLDA we could next obtain an estimation of the hyper planes that best separated the unbiased distributions corresponding to the folded and unfolded states within the space spanned by each set of descriptors. Concomitantly, the sought after CVs were obtained using Eq. 7. The weights of the HLDA CVs attained for each of the descriptor sets are plotted in Fig. 1 along with an illustrations of the utilized descriptors.
Interestingly, analysis of the weight distributions reveals much valuable information about the system showing that the main features of the folding process are encoded in the CVs themselves. Thus, we find that for both the sets D 1 and D 2 most of the weight is assigned to the descriptors d 1 , d 2 and d 3 which correspond to the distances and contacts between facing amino acids located away from the backbone beta-turn. Similarly, in both D 1 and D 2 the distances/contacts located in the beta-turn are found to be comparably less important, alluding to the fact that in the unfolded state the beta-turn is intermittently formed. In the case of D 1 the distances between the side chains are assigned lower weights as well. Nevertheless, for the distances d 10 , d 11 and d 12 non negligible weight is assigned, reflecting the associated side-chain's contact formation in the folded state, due to their hydrophobic nature.
Inspecting the weight distribution obtained for D 3 reveals interesting trends as well. Thus, one can observe that by and large the higher weights of the CV are assigned to αβs of the backbone dihedral angles situated in and near the backbone beta-turn, reflecting these angles' importance in the folding process. In addition, we find that in comparison it is the αβ(Ψ) that attain higher weights. Inspection here shows that while the αβ(Φ) fluctuations do not change much between the folded and unfolded states, clearly configurational changes of αβ(Ψ) between the folded and unfolded states are present. Another interesting feature of the CV weight distribution is that with the exception of d 6 , the weights of αβ(Ψ) and αβ(Φ) tend to be in anti-phase. Here, inspection shows that this results from the correlation between the αβ(Ψ) and αβ(Φ) fluctuations in the α-helical basin that is visited in the unfolded state. Moreover, we attribute the single positive value of d 6 (which corresponds to the Glycine Φ) to the fact that unlike the other Φ backbone dihedral angles it can also assume a left helix conformation 19 . Finally, examination of d 13 , the descriptor assigned with the largest weight, shows that it corresponds to the αβ(Ψ) of the Proline amino acid, coinciding with the fact that such angles are associated with a relatively high energetic rotational barrier 20 .
Side chains

Biased Simulations
With the CVs at hand we could next launch Metadynamics simulations with the objective of sampling the system phase space. Monitoring these simulations, we found that several folding events could be observed with D 1 and D 2 taking the lead in the transition frequencies. Thus, with the little initial information with which we commenced, an assortment of folding and unfolding pathways could be harvested, thereby shedding light on the mechanisms underlying these events. Fig. 2 presents segments of three simulations, each run with a CV generated by a different descriptor set, showing the Cα RMSD of the protein with respect to it's folded crystal structure as function of Metadynamics simulations time. All three segments exhibit both folding and unfolding events.
Despite observing even multiple transitions between the folded and unfolded states in some of the simulations, attaining estimates of converged FES using Metadynamics alone was not possible. Observing the simulations dynamics we found this to be caused by the system's phase space intricate multidimensional nature with a profusion of kinetic bottlenecks and free energy barriers. Thus, to circumvent this impediment we resorted to the utilization of Parallel Tempering Metadynamics which has previously been shown to be very effective for such problems. By running such simulations, now with the HLDA generated CVs, we could observe that for all three sets of descriptors effective sampling of the relevant system phase space was achieved. Moreover, estimates of the system FES could be obtained using the Well tempered Metadynamics (WTMD) 21 relation Eq. 8 where γ is the WTMD bias factor and V (s) is the simulation bias. Fig. 3 presents the FES obtained from three different simulations using the CVs generated by the three different sets of descriptors. For the sake of comparison, the corresponding FES obtained using a histogram analysis on a 100µs unbiased simulations taken from the D.E. Shaw data bank 15 is presented as well. As can be seen in all three cases reasonable estimates of the FES could be obtained. However, the results obtained using the sets D 1 and D 2 are markedly more accurate. Additionally, differences in the convergence times between the different simulations were observed, namely t conv (D 1 ) ≈ 25ns, t conv (D 2 ) ≈ 35ns and t conv (D 3 ) ≈ 45ns, where we defined convergence when the calculated FES fluctuations as function of simulation time reached their minimal amplitudes around their final average result. One probable reason for the differences in performance between the different CVs, is the extent to which their underlying descriptors unbiased fluctuations deviate from multivariate normality. To asses the possible influence of such deviations on the CVs quality we thus computed the Kurtosis and Skewness 22 of the covariance matrices, Σ f,u , eigenvectors. 4 presents these data for each of the descriptor sets in both the folded and unfolded states. As can be seen the values obtained for D 1 largely match those that correspond to perfect normal distributions (indicated by dashed lines). In slight contrast, the set D 2 can be seen to exhibit larger deviations from normality, while the set D 3 seems to exhibit the most and largest instances of such deviations. Observation of these data thus indicates that while HLDA seems to be forgiving when applied to data which is not strictly multinormal, it is likely that a price is to be paid in their quality as deviations increase.

CONCLUSIONS
As the size and complexity of systems simulated using molecular dynamics increases, the need for systematic ways of constructing viable CVs for these systems which do not require an expert's knowledge is becoming more evident. In the present study we have applied HLDA, a recently developed modification of LDA to develop one dimensional CVs for the folding problem of Chignolin. In doing so we have found that given a naive selection of descriptors the method is able to generate CVs that, when biased, are able to drive the system back and forth between the system's folded and unfolded states. In addition, we have found that incorporating these CVs in PTMetaD can enable obtaining of good estimates of the systems FES. In both cases we found that some deviation from multivariate normality is tolerated by HLDA, yet increasing the amount of deviation may lead to a reduction in the constructed CVs' quality. Finally, we found that within the weight distributions of the calculated HLDA CVs themselves reside abundant useful information and physical insight about the process being studied. We thus conclude that the present study suggests that HLDA can be applied to increasingly more complex systems for the systematic construction of CVs, a path which we wish to continue and explore in the near future.