Geometric Methods in Structural Computational Biology by Lydia Kavraki - HTML preview


"main directions" followed by the molecule's 3N degrees of freedom, or, in other words, the

directions followed collectively by the atoms. Interpolating along each PC makes each atom

follow a linear trajectory that corresponds to the direction of motion that explains the most data

variance. For this reason, the PCs are often called Main Modes of Motion or Collective Modes of

Motion when computed from molecular motion data. Interpolating along the first few PCs has the

effect of removing atomic "vibrations" that are normally irrelevant for the molecule's larger-scale

motions, since the vibration directions have little data variance and would correspond to the last

(least important) modes of motion. It is now possible to define a lower-dimensional subspace of

protein motion spanned by the first few principal components and to use these to project the initial

high-dimensional data onto this subspace. Since the PCs are displacements (and not positions), in

order to interpolate conformations along the main modes of motion one has to start from one of

the known structures and add a multiple of the PCs as perturbations. In mathematical terms, in

order to produce conformations interpolated along the first PC, one can compute:

c' = c + α p1

where c is a molecular conformation from the (aligned) data set, p1 is the first principal component, and the interpolating parameter α adds a deviation from the structure c along the main direction of motion. The parameter α can be either positive or negative. However, it is important to keep in mind that large

values of the interpolating parameter will start stretching the molecule beyond physically

acceptable shapes, since the PCs make all atoms follow straight lines and will fairly quickly

destroy the molecule's topology. A typical way of improving the physical feasibility of

interpolated conformations is to subject them to a few iterations of energy minimization following

some empirical force field.
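The interpolation described above can be sketched in a few lines of NumPy. This is a minimal illustration on synthetic data (the toy data set, array shapes, and variable names are all invented for the example); a real application would use aligned conformations and follow each interpolated structure with energy minimization:

```python
import numpy as np

# Synthetic stand-in for an aligned data set: m conformations, each a flat
# vector of 3N coordinates (here N = 4 atoms, so 12 degrees of freedom).
rng = np.random.default_rng(0)
m, n_atoms = 200, 4
direction = rng.normal(size=3 * n_atoms)
direction /= np.linalg.norm(direction)            # dominant collective motion
X = (rng.normal(size=3 * n_atoms)                 # a base conformation
     + np.outer(rng.normal(size=m), direction)    # motion along one direction
     + 0.01 * rng.normal(size=(m, 3 * n_atoms)))  # small "vibrations"

# PCA: eigenvectors of the covariance matrix, sorted by decreasing variance.
eigvals, eigvecs = np.linalg.eigh(np.cov(X, rowvar=False))
order = np.argsort(eigvals)[::-1]
pc1 = eigvecs[:, order[0]]                        # first PC (a displacement)

# Interpolate: start from a known conformation c and add multiples of pc1.
c = X[0]
for alpha in (-2.0, 0.0, 2.0):
    interpolated = c + alpha * pc1
```

Because pc1 is a straight-line displacement, large |α| values stretch the toy "molecule" along a line, which is exactly the limitation discussed above.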

Although it is possible to determine as many principal components as the number of original

variables (3N), PCA is typically used to determine the smallest number of uncorrelated principal

components that explain a large percentage of the total variation in the data, as quantified by the

residual variance. The exact number of principal components chosen is application-dependent and

constitutes a truncated basis of representation.
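Choosing the number of components from the eigenvalue spectrum can be sketched as follows; the eigenvalues and the 90% cutoff below are invented for illustration:

```python
import numpy as np

# Hypothetical eigenvalue spectrum (variance captured per PC), sorted in
# decreasing order, as PCA would produce for 3N = 12 original variables.
eigvals = np.array([5.0, 2.5, 1.0, 0.5, 0.3, 0.2, 0.2, 0.1, 0.1, 0.05, 0.03, 0.02])

explained = np.cumsum(eigvals) / eigvals.sum()  # fraction explained by first k PCs
residual = 1.0 - explained                      # residual variance after k PCs

# Smallest number of PCs whose cumulative variance reaches 90%:
k = int(np.searchsorted(explained, 0.90) + 1)   # here k = 4
```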


As an example of the application of PCA to produce low-dimensional points for high-dimensional

input, consider the Cyanovirin-N (CV-N) protein depicted in figure 4 a), corresponding to PDB

code 2EZM. This protein acts as a virucide against many viruses, such as HIV. Protein folding

simulations of this protein can be used as input for PCA. A model of this protein that

considers atoms at the alpha-carbon positions only has 101 such atoms, for a total of 303 degrees

of freedom. Folding/unfolding simulations starting from the native PDB structure produce

abundant conformation samples of CV-N along the folding reaction.

Figure 72.

CV-N protein and PCA projections. a) A cartoon rendering of the CV-N protein. b) The projection of each simulated conformation onto the first two PCs. The concentrated cluster to the left corresponds to the "folded" protein, and the more spread cluster to the right corresponds to the "unfolded" protein. c) The PCA coordinates used as a basis for a free-energy (probability) plot.

Figure 4 b) shows the projection of each simulated conformation onto the first two principal

components. Each point in the plot corresponds to exactly one of the input conformations, which

have been classified along the first two PCs. Note that even the first PC alone can be used to

approximately classify a conformation as being "folded" or "unfolded", and the second PC

explains more of the data variability by assigning a wider spread to the unfolded region (for which

the coordinate variability is much larger). The low-dimensional projections can be used as a basis

to compute probability and free-energy plots, for example as in Das et al. [link]. Figure 4 c) shows such a plot for CV-N, which makes the identification of clusters easier. Using PCA to interpolate

conformations along the PCs is not a good idea for protein folding data since all of the protein

experiences large, non-linear motions of its atoms. Using only a few PCs would quickly destroy

the protein topology when large-scale motions occur. In some cases, however, when only parts of

a protein exhibit a small deviation from its equilibrium shape, the main modes of motion can be

exploited effectively as the next example shows.
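A free-energy (probability) surface of the kind just described can be estimated from the two-dimensional projections with a simple histogram, using F = -kT ln P (in units of kT). The sketch below fakes the "folded" and "unfolded" clusters with Gaussian blobs; all names and data here are illustrative:

```python
import numpy as np

# Hypothetical 2-D projections (each point = one conformation on PC1/PC2).
rng = np.random.default_rng(1)
folded = rng.normal(loc=(-2.0, 0.0), scale=0.3, size=(500, 2))   # tight cluster
unfolded = rng.normal(loc=(2.0, 0.0), scale=1.0, size=(500, 2))  # wide cluster
points = np.vstack([folded, unfolded])

# Probability density on a grid, then free energy F = -ln P (kT = 1 unit).
H, xedges, yedges = np.histogram2d(points[:, 0], points[:, 1],
                                   bins=30, density=True)
with np.errstate(divide="ignore"):
    F = -np.log(H)            # empty bins get +inf (infinitely unfavorable)
F -= F.min()                  # shift so the most probable bin has F = 0
```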

As an example of the application of PCA to analyze the main modes of motion, consider the work

of Teodoro et al. [link], who studied the HIV-1 protease, depicted in figure 5. The HIV-1


protease plays a vital role in the maturation of the HIV-1 virus by targeting amino acid sequences

in the gag and gag-pol polyproteins. Cleavage of these polyproteins produces proteins that

contribute to the structure of the virion, RNA packaging, and condensation of the nucleoprotein

core. The active site of HIV-1 protease is formed by the homodimer interface and is capped by

two identical beta-hairpin loops from each monomer, which are usually referred to as "flaps". The

structure of the HIV-1 protease complexed with an inhibitor is shown in figure 5. The active site

structure for the bound form is significantly different from the structure of the unbound

conformation. In the bound state the flaps adopt a closed conformation acting as clamps on the

bound inhibitors or substrates, whereas in the unbound conformation the flaps are more open.

Molecular dynamics simulations can be run to sample the flexibility of the flaps and produce an

input data set of HIV-1 conformations to which PCA can be applied. A backbone-only

representation of HIV-1 has 594 atoms, which amounts to 1,782 degrees of freedom for each

conformation.

Figure 73.

Alternative conformations for HIV-1 protease. Tube representation of HIV-1 protease (PDB codes 4HVP and 1AID) bound to

different inhibitors represented by spheres. The plasticity of the binding site of the protein allows the protease to change its shape in order to accommodate ligands with widely different shapes and volumes.

Applying the PCA procedure as outlined before to a set of HIV-1 samples from simulation

produces 1,782-dimensional principal components. Since the physical interpretation of the PCs is

quite intuitive in this case, the PC coordinates can be split into groups of 3 to obtain the (x,y,z)

components for each of the 594 atoms. These components are 3-dimensional vectors that point in

the direction each atom would follow along the first PC. In figure 6 a), the per-atom components

of the first PC have been superimposed in purple.
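Splitting a PC into per-atom displacement vectors is a simple reshape. The sketch below uses a random unit vector in place of a real PC (the variable names are illustrative):

```python
import numpy as np

# A hypothetical principal component for a 594-atom backbone model:
# one flat vector of 3 * 594 = 1782 coordinates.
n_atoms = 594
rng = np.random.default_rng(2)
pc = rng.normal(size=3 * n_atoms)
pc /= np.linalg.norm(pc)

# Group the coordinates in threes: one (x, y, z) displacement per atom.
per_atom = pc.reshape(n_atoms, 3)

# The norm of each row tells how strongly that atom moves along this mode;
# in the HIV-1 example, the "flap" atoms would have the largest norms.
magnitudes = np.linalg.norm(per_atom, axis=1)
most_mobile = int(np.argmax(magnitudes))
```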


Figure 74.

First mode of motion for the HIV-1 protease. a) The purple arrows are a convenient representation of the first PC grouped every 3

coordinates -(x,y,z) for each atom- to indicate the linear path each atom would follow. Note that the "flaps" have the most important motion component, which is consistent with the simulation data. b) A reference structure (middle) can be interpolated along the first PC in a negative direction (left) or a positive one (right). Using only one degree of freedom, the flap motion can be approximated quite accurately.

Figure 6 b) shows the effect of interpolating HIV-1 conformations along the first PC, or first mode

of motion. Starting from an aligned conformation from the original data set, multiples of the first

PC can be added to produce interpolated conformations. Note that the first mode of motion

corresponds mostly to the "opening" and "closing" of the flaps, as can be inferred from the relative

magnitude of the first PC components in the flap region. Thus, interpolating in the direction of the

first PC produces an approximation of this motion, but using only one degree of freedom. This

way, the complex dynamics of the system and the 1,782 apparent degrees of freedom have been

approximated by just one, effectively reducing the dimensionality of the representation.


Figure 75.

The residual variance (solid line) and percentage of overall variance explained (dashed line) after each principal component.

Figure 7 (solid line) shows the residual variance left unexplained by discarding the lower-ranked

principal components. Residual variance plots always decrease, and in this case, the first PC

accounts for approximately 40% of the total variance along the motion (the dashed line shows the

percentage of total variance explained up to the given PC). Also, the first 10 PCs account for more

than 70% of the total data variance. Given the dominance of only a few degrees of freedom it is

possible to represent the flexibility of the protein in a drastically reduced search space.

Non-Linear Methods

PCA falls in the category of what is called a linear method, since mathematically, the PCs are

computed as a series of linear operations on the input coordinates. Linear methods such as PCA

work well only when the collective atom motions are small (or linear), which is hardly the case

for most interesting biological processes. Non-linear dimensionality reduction methods do exist,

but are normally much more computationally expensive and have other disadvantages as well.

However, non-linear methods are much more effective in describing complex processes using

far fewer parameters.

As a very simple and abstract example of when non-linear dimensionality reduction would be

preferred over a linear method such as PCA, consider the data set depicted in figure 8 a). The data

is apparently two-dimensional, and naively considering the data variance in this way leads to two

"important" principal components, as figure 8 b) shows. However, the data has been sampled from

a one-dimensional, but non-linear, process. Figure 8 c) shows the correct interpretation of this

data set as non-linear but one-dimensional.


Figure 76.

Effects of dimensionality reduction on an inherently non-linear data set. a) The original data given as a two-dimensional set. b) PCA identifies two PCs as contributing significantly to explain the data variance. c) However, the inherent topology (connectivity) of the data helps identify the set as being one-dimensional, but non-linear.

Most interesting molecular processes share this characteristic of being low-dimensional but

highly non-linear in nature. For example, in the protein folding example depicted in figure 1, the

atom positions follow very complicated, curved paths to achieve the folded shape. However, the

process can still be thought of as mainly one-dimensional. Linear methods such as PCA would fail

to correctly identify collective modes of motion that do not destroy the protein when followed,

much like in the simple example of figure 8.

Several non-linear dimensionality reduction techniques exist; they can be classified as either

parametric or non-parametric.

Parametric methods need to be given a model to try to fit the data, in the form of a

mathematical function called a kernel. Kernel PCA is a variant of PCA that projects the points onto a mathematical hypersurface provided as input together with the points. When using

parametric methods, the data is forced to lie on the supplied surface, so in general this family

of methods does not work well with molecular motion data, for which the non-linearity is

unknown.

Non-parametric methods use the data itself in an attempt to infer the non-linearity from it,

much like deducing the inherent connectivity in figure 8 c). The most popular methods are

Isometric Feature Mapping (Isomap) from Tenenbaum et al. [link] and Locally Linear Embedding (LLE) from Roweis et al. [link].

The Isomap algorithm is more intuitive and numerically stable than LLE and will be discussed

next.

Isometric Feature Mapping (Isomap)

Isomap was introduced by Tenenbaum et al. [link] in 2000. It is based on an improvement over a


previous technique known as MultiDimensional Scaling (MDS). Isomap aims at capturing the

non-linearity of a data set from the set itself, by computing relationships between neighboring

points. Both MDS and the non-linear improvement will be presented first, and then the Isomap

algorithm will be detailed.

MDS. Multidimensional Scaling (see Cox et al. [link]) is a technique that produces a low-dimensional representation for an input set of n points, where the dimensions are ordered by

variance, so it is similar to PCA in this respect. However, MDS does not require coordinates as

input, but rather, distances between points. More precisely, MDS requires as input a data set of

points, and a distance measure d(i,j) between any pair of points xi and xj. MDS works best when the distance measure is a metric, i.e., when it is non-negative, symmetric, zero only between identical points, and satisfies the triangle inequality. MDS starts by computing all possible pairwise distances between the input points, and placing them in a matrix D so that D_ij = d(i,j). The goal of MDS is to produce, for every point, a set of n Euclidean coordinates such that the Euclidean distances between all pairs match as closely as possible the original pairwise distances D_ij, as depicted in figure 9.

Figure 77.

MDS at work. A matrix of interpoint distances D is used to compute a set of Euclidean coordinates for each of the input points.

If the distance measure between points has the metric properties as explained above, then n

Euclidean coordinates can always be found for a set of n points, such that the Euclidean distance

between them matches the original distances. In order to obtain these Euclidean coordinates, such

coordinates are assumed to have a mean of zero, i.e., they are centered. In this way, the matrix of

pairwise distances D can be converted into a matrix of dot products by squaring the distances and

performing a "double centering" on them to produce the matrix B of pairwise dot products:

B = -(1/2) J D⁽²⁾ J,   where J = I - (1/n) 1 1ᵀ

Here D⁽²⁾ is the matrix of squared distances, (D⁽²⁾)_ij = D_ij², 1 is the all-ones vector, and J is the centering matrix.

Let's assume that n Euclidean coordinates exist for each data point and that such coordinates can


be placed in a matrix X. Then, multiplying X by its transpose should equal the matrix B of dot products computed before: X Xᵀ = B. Finally, in order to retrieve the coordinates in the unknown matrix X, we can perform an SVD of B, which can be expressed as:

B = U Λ Uᵀ

where the left and right singular vectors coincide because B is a symmetric matrix. The diagonal matrix of eigenvalues Λ can be split into two identical matrices, each holding the square roots of the eigenvalues, so that B = (U Λ^(1/2))(U Λ^(1/2))ᵀ, and a solution for X can be found:

X = U Λ^(1/2)

Since X has been computed through an SVD, the coordinates are ordered by data variance like in

PCA, so the first component of X is the most important, and so on. Keeping this in mind, the

dimensionality reduction can be performed by ignoring the higher dimensions in X and only

keeping the first few, like in PCA.
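The double-centering and eigendecomposition steps just described can be sketched compactly in NumPy (the function name is ours, and eigh is used since B is symmetric); the three-point example checks that distances along a line are recovered:

```python
import numpy as np

def classical_mds(D, k=2):
    """Classical MDS: recover k Euclidean coordinates from a distance matrix D."""
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n             # centering matrix
    B = -0.5 * J @ (D ** 2) @ J                     # double-centered dot products
    eigvals, eigvecs = np.linalg.eigh(B)            # B is symmetric
    order = np.argsort(eigvals)[::-1]               # sort by decreasing variance
    L = np.sqrt(np.maximum(eigvals[order[:k]], 0))  # clip tiny negative values
    return eigvecs[:, order[:k]] * L                # X = U * Lambda^(1/2)

# Sanity check: three collinear points at mutual distances 1, 2, and 3 are
# recovered up to rotation/translation, so pairwise distances match D.
D = np.array([[0.0, 1.0, 3.0],
              [1.0, 0.0, 2.0],
              [3.0, 2.0, 0.0]])
X = classical_mds(D, k=1)
recovered = np.abs(X[:, 0][:, None] - X[:, 0][None, :])
```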

Geodesic distances. The notion of "geodesic" distance was originally defined as the length of a

geodesic path, where the geodesic path between two points on the surface of the Earth is

considered the "shortest path" since one is constrained to travel on the Earth's surface. The

concept of geodesic distance can be generalized to any mathematical surface, and defined as "the

length of the shortest path between two points that lie on a surface, when the path is constrained to

lie on the surface." Figure 10 shows a schematic of this generalized concept of geodesic path and

distance.

Figure 78.

Geodesic distance. The geodesic distance between the two red points is the length of the geodesic path, which is the shortest path between the points, that lies on the surface.


The Isomap algorithm. The Isomap method augments classical MDS with the notion of geodesic

distance, in order to capture the non-linearity of a data set. To achieve its goal, Isomap uses MDS

to compute a few Euclidean coordinates that best preserve pairwise geodesic distances, rather than

direct distances. Since the coordinates computed by MDS are Euclidean, these can be plotted on a

Cartesian set of axes. The effect of Isomap is similar to "unrolling" the non-linear surface into a

natural parameterization, as depicted in figure 11. Isomap approximates the geodesic distances

from the data itself, by first building a neighborhood graph for the data. A neighborhood graph

consists of the original set of points, together with a connection between "neighboring" points

(neighboring points are defined as the closest points to a given point, as determined by the

distance measure used) as shown in figure 11-b. After the neighborhood graph has been built, it

can be used to approximate the geodesic distance between all pairs of points as the shortest path

distance along the graph (red path in figure 11-b). Naturally, the sampling of the data set has to be dense

enough to capture the inherent topology of the non-linear space for this approximation to work.

MDS then takes these geodesic distances and produces the Euclidean coordinates for the set.

Figure 11-c shows two Euclidean coordinates for each point in the example.

Figure 79.

Isomap at work on the "swiss roll" data set. a) The input data is given as three-dimensional, but is really two-dimensional in nature. b) Each point in the data set is connected to its neighbors to form a neighborhood graph, overlaid in light grey. Geodesic distances are approximated by computing shortest paths along the neighborhood graph (red). c) MDS applied to the geodesic distances has the effect of "unrolling" the swiss roll into its natural, two-dimensional parameterization. The neighborhood graph has been overlaid for comparison. Now, the Euclidean distances (in blue) approximate the original geodesic distances (in red).

The Isomap method can be put in algorithmic form in the following way:

The Isomap algorithm

1. Build the neighborhood graph. Take all input points and connect each point to the closest

ones according to the distance measure used. Different criteria can be used to select the closest

neighbors, such as the k closest or all points within some threshold distance.

2. Compute all pairwise geodesic distances. These are computed as the shortest paths on the

neighborhood graph. Efficient algorithms exist for this, such as running Dijkstra's single-source algorithm from every node, or the Floyd-Warshall all-pairs algorithm. All geodesic distances can be put in a matrix D.

3. Perform MDS on geodesic distances. Take the matrix D and apply MDS to it. That is, apply

the double-centering formula explained above to produce B, and compute the low-dimensional

coordinates by computing B's eigenvectors and eigenvalues.
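The three steps can be put together in a minimal Isomap sketch, here using SciPy's shortest-path routine on a toy arc-shaped data set (the function and all names are illustrative, not a production implementation):

```python
import numpy as np
from scipy.sparse.csgraph import shortest_path
from scipy.spatial.distance import cdist

def isomap(points, n_neighbors, k):
    """Minimal Isomap sketch: kNN graph -> geodesic distances -> classical MDS."""
    n = len(points)
    D = cdist(points, points)                  # pairwise input distances
    # Step 1: neighborhood graph (inf = no edge, in SciPy's dense convention).
    G = np.full((n, n), np.inf)
    for i in range(n):
        nbrs = np.argsort(D[i])[1:n_neighbors + 1]
        G[i, nbrs] = D[i, nbrs]
        G[nbrs, i] = D[i, nbrs]                # keep the graph symmetric
    # Step 2: geodesic distances as shortest paths (Dijkstra from every node).
    geo = shortest_path(G, method="D")
    # Step 3: classical MDS on the geodesic distance matrix.
    J = np.eye(n) - np.ones((n, n)) / n
    B = -0.5 * J @ (geo ** 2) @ J              # double centering
    eigvals, eigvecs = np.linalg.eigh(B)
    order = np.argsort(eigvals)[::-1][:k]
    return eigvecs[:, order] * np.sqrt(np.maximum(eigvals[order], 0.0))

# Toy "curved" data set: points on a circular arc, inherently one-dimensional.
t = np.linspace(0.0, np.pi, 60)
arc = np.column_stack([np.cos(t), np.sin(t)])
Y = isomap(arc, n_neighbors=4, k=1)            # one coordinate suffices
```

The single recovered coordinate tracks the position along the arc, i.e., Isomap "unrolls" the curve into its natural one-dimensional parameterization.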

The Isomap algorithm captures the non-linearity of the data set automatically, from the data itself.

It returns a low-dimensional projection for each point; these projections can be used to understand

the underlying data distribution better. However, Isomap does not return "modes of motion" like

PCA does, along which other points can be interpolated. Also, Isomap is much more expensive

than PCA, since building a neighborhood graph and computing all-pairs-shortest-paths can have

quadratic complexity on the number of input points. In addition, there is the hidden cost of the distance

measure itself, which can also be quite expensive. Regardless of its computational cost, Isomap

has proven extremely useful in describing non-linear processes with few parameters. In particular,

the work in Das et al. [link] applied Isomap successfully to the study of a protein folding reaction.

In order to apply Isomap to a set of molecular conformations (gathered, for example, through

molecular dynamics simulations) all that is needed is a distance measure between two

conformations. The most popular distance measure for conformations is lRMSD, discussed in the

module Molecular Distance Measures. There is no need to pre-align all the conformations in this case, since the lRMSD distance already includes pairwise alignment. Thus, the Isomap algorithm

as described above can be directly applied to molecular conformations. Choosing an appropriate

value for a neighborhood parameter (such as k) may require a bit of experience, though, and it

may depend on the amount and variability of the data. It should be noted that the shape of the low-dimensional surface on which we hypothesize the points lie is now unknown. In the swiss roll

example above, it was obvious that the data was sampled from a two-dimensional surface. For

molecular data, however, we do not know, a priori, what the surface looks like. But we know that

the process should be low-dimensional and highly non-linear in nature. Of course, the distance

measure used as an underlying operation has an impact on the final coordinates as well. The

validation of the Isomap method using the lRMSD distance on protein folding data was done in

Das et al. [link], where a statistical analysis method was used to demonstrate its usefulness.

Figure 12 shows the results of applying Isomap to the same molecular trajectory used for the CV-

N example in the PCA section. CV-N is a protein that is known to have an "intermediate" state in

the folding mechanism, and it is also known to fold through a very ordered process following a

well-defined "folding route".


Figure 80.

The first two Isomap coordinates for the CV-N protein folding trajectory. Left: the projection of each simulated point onto the first two low-dimensional coordinates. Right: A free-energy (probability) surface computed using the Isomap coordinates.

The free-energy plot in figure 12 (right) clearly shows the superior quality of the Isomap

coordinates when compared with PCA. Isomap clearly identifies the CV-N intermediate (seen as a

free-energy minimum in blue) in between the unfolded and folded states. Also, the highest

probability route connecting the intermediate to the folded state is clearly seen in this rendering.

The PCA coordinates did not identify an intermediate or a folding route, since the PCA coordinates

are simply a linear projection of the original coordinates. Isomap, on the contrary, finds the

underlying connectivity of the data and always returns coordinates that clearly show the

progression of the reaction along its different stages.

The Isomap procedure can be applied to different molecular models. As another, smaller example,

consider the Alanine Dipeptide molecule shown in figure 13 (left). This is an all-atom molecule

that has been studied thoroughly and is known to have two predominant shapes: an "extended"

shape (also called C5 and a variant called PII) and a "helical" shape (also called alpha, with three

variants: right-handed, P-like, and left-handed). The application of the Isomap algorithm to this

molecule, without any bias by a priori known information, yields the low-dimensional coordinates

on the right of figure 13 (shown again as a free-energy plot). The first two coordinates (top right)

are already enough to identify all five major states of the peptide, and the possible transitions

between them. Applying PCA to such a simple molecule yields comparable results (not shown),

but PCA cannot differentiate the left-handed helix shape from the right-handed one. Only the

geodesic formulat