Summary: This module introduces concepts on dimensionality reduction, including Principal Components Analysis (PCA) and Isomap. It also shows how to apply them to molecular motion data, and how to interpret the resulting coordinates.
Note: Your browser may not currently support MathML. See our browser support page for additional details. You can always view the correct math in the PDF version.
The study of many biological processes at the molecular level involves understanding how biological molecules (especially proteins) behave dynamically. The three-dimensional shape of these molecules, which we call a conformation, usually determines the chemical action they perform. Both the stable (also called native) shape of a biomolecule and dynamical deviations from it are important to understand how it interacts with other molecules such as pharmaceutical drugs or other complexes. For these reasons, understanding the main shapes and motions of these molecules is of utmost importance.
Current structural biology experimental methods are restricted in the amount of information they can provide regarding protein motions because they were designed mainly to determine the three-dimensional static representation of a molecule. For this reason, in silico methods (i.e., run in a computer) are used to extensively sample protein conformations. As a summary, the most popular methods to gather protein conformations are:
The computations required to simulate protein motion in silico are very expensive and involve non-trivial force field evaluations, as explained above. These simulations provide us with the
![]() |
However, many biological processes are known to be very structured at the molecular level, since the constituent atoms self-organize to achieve their bio-chemical goal. An example of such a process is protein folding, the process by which a protein achieves its thermodynamically-stable three-dimensional shape to perform its biological function (a depiction of a protein folding process is shown in figure 1). To study such processes based on data gathered through simulations, there is a need to "summarize" the high-dimensional conformational data. Simply visualizing the time-series of a moving protein as produced by simulation packages does not provide a lot of insight into the process itself. One way of summarizing these motions is to turn conformations into a low-dimensional representation, such as a vector with very few components, that somehow give the "highlights" of the process. This data analysis process -turning high-dimensional data into low-dimensional data for better interpretation- is called dimensionality reduction and is discussed next.
When molecular shapes are sampled throughout some physical-chemical process that involves the motion of the molecule, there is a need to simplify the high-dimensional (albeit redundant) representation of a molecule given as a 3N-dimensional point, since it is believed that the actual degrees of freedom (DoFs) of the process are much less, as explained before. The resulting, simplified representation has to be useful to classify the different conformations along one or more "directions" or "axes" that provide enough discrimination between them.
Dimensionality Reduction techniques aim at analyzing a set of points, given as input, and producing the corresponding low-dimensional representation for each. The goal is to discover the true dimensionality of a data set that is only apparently high-dimensional. There exist mathematical tools to perform automatic dimensionality reduction, based on arbitrary input data in the form of high-dimensional points (not just molecules). Although different techniques achieve their goals in different ways, and have both advantages and disadvantages, the most general definition for dimensionality reduction could be stated as:
As a simple example of dimensionality reduction, consider the case of a bending string of beads, as depicted in figure 2. The input data has 3x7=21 dimensions (if given as the
![]() |
When dimensionality reduction methods are applied to molecular motion data, the goal is to find the main "directions" or "axes" collectively followed by the atoms, and the placement of each input conformation along these axes. The meaning of such axes can be intuitive or abstract, depending on the technique used and how complex the system is. We can reword the definition of dimensionality reduction when working with molecular motion samples as:
The remainder of this module describes two dimensionality reduction techniques and their application to molecular motion data. These are Principal Components Analysis (PCA), a linear method, and ISOmetric feature MAPping (Isomap), a non-linear method.
We will start our discussion of Principal Components Analysis, or PCA, considering a very general data set of points as depicted in figure 3. Each point in this simple data set is given as a 3-dimensional vector
| Principal Components Analysis |
|---|
![]() |
For data points of dimensionality M, the goal of PCA is to compute M so-called Principal Components (PCs), which are M-dimensional vectors that are aligned with the directions of maximum variance (in the mathematical sense) of the data. These PCs have the following properties:
For M-dimensional input data, the Principal Components (PCs) are M-dimensional vectors and have two main uses. These are to:
There is one detail left to introduce before going into the mechanics of computing the principal components. Notice that in figure 3-a the data set is not shown at any particular position in 3D space, that is, the data set could be centered at any point in space, not necessarily around the origin
To compute the principal components, let

We want

So then, by multiplying by

Since

Where

This fact, together with the fact that
It is important to note at this point that the eigenvalues actually correspond to the variance along each PC. By computing the ratio of each eigenvalue

which is a typical measure of the error made by approximating the data set using only the first
The above derivation can be written as an algorithm fairly easily.
PCA is very well established as a dimensionality reduction technique and efficient algorithms with guaranteed convergence for its computation are readily available. Software packages that perform SVD and PCA are freely available and trivial to use. For example Matlab has built-in commands for both, and moderately experienced C and Fortran programmers can use the popular and extremely-efficient LAPACK linear algebra package. The concepts explained in this section have not assumed any particular number of dimensions. Even though a 3D example was given at the beginning, the concepts can be lifted to deal with spaces of arbitrary dimensionality, where the input data set can have a large dimension M. PCA has the advantage over other available methods that the principal components have a direct physical interpretation, especially when working with molecular motion data.
In structural bioinformatics we want to apply PCA to a set of molecular conformations, which will serve as our high-dimensional points. The input dimensionality of each point is 3N, where N is the number of atoms in the molecule. We will have n such conformations, that have been gathered through some form of sampling (for example through molecular dynamics simulations), and we want to reduce the dimensionality of each "point" (conformation) for analysis purposes. The data used as input for PCA is in the form of several atomic position vectors corresponding to different structural conformations which together constitute a vector set. Each vector in the conformational vector set has dimension 3N and is of the form
[
However, before running the PCA procedure outlined above, all conformations need to be aligned with a reference structure first, as discussed in the module Molecular Distance Measures. The reason why alignment is important is that most simulation packages model the exchange of heat between the molecule and a thermal bath, in the form of random velocity perturbations on the atoms. These perturbations will in general add non-zero linear and angular momenta to the molecule structure, and not all simulation packages remove them. As a result, molecular conformations that are almost identical in shape but translated/rotated with respect to each other will have significantly different coordinates, and will be considered as different 3N-dimensional points. The reference structure (to which all structures should be aligned to) can be chosen from the set (e.g., the native structure) or provided from outside the set. Results may vary a little, but aligning all conformations to the same reference structure yields comparable results in general.
After all conformations have been aligned, the PCA procedure can be used exactly as detailed above. The first step is to determine the average vector from the conformational vector set, so it can be subtracted from all conformations to build the matrix
The PCs for molecular conformation data can be used for the two purposes explained before, i.e., to obtain a low-dimensional representation of each point and to synthesize (or interpolate) new conformations by following the PCs. Now, the PCs have a physical meaning: they represent the "main directions" followed by the molecule's 3N degrees of freedom, or, in other words, the directions followed collectively by the atoms. Interpolating along each PC makes each atom follow a linear trajectory, that corresponds to the direction of motion that explains the most data variance. For this reason, the PCs are often called Main Modes of Motion or Collective Modes of Motion when computed from molecular motion data. Interpolating along the first few PCs has the effect of removing atomic "vibrations" that are normally irrelevant for the molecule's bigger scale motions, since the vibration directions have little data variance and would correspond to the last (least important) modes of motion. It is now possible to define a lower-dimensional subspace of protein motion spanned by the first few principal components and to use these to project the initial high-dimensional data onto this subspace. Since the PCs are displacements (and not positions), in order to interpolate conformations along the main modes of motion one has to start from one of the known structures and add a multiple of the PCs as perturbations. In mathematical terms, in order to produce conformations interpolated along the first PC one can compute:

Where
Although it is possible to determine as many principal components as the number of original variables (3N), PCA is typically used to determine the smallest number of uncorrelated principal components that explain a large percentage of the total variation in the data, as quantified by the residual variance. The exact number of principal components chosen is application-dependent and constitutes a truncated basis of representation.
As an example of the application of PCA to produce low-dimensional points for high-dimensional input, consider the Cyanovirin-N (CV-N) protein depicted in figure 4 a), corresponding to PDB code 2EZM. This protein acts as a virucidal for many viruses, such as HIV. Protein folding simulations of this protein can be used as input for a PCA analysis. A model of this protein that considers atoms at the alpha-carbon positions only has 101 such atoms, for a total of 303 degrees of freedom. Folding/unfolding simulations starting from the native PDB structure produce abundant conformation samples of CV-N along the folding reaction.
![]() |
Figure 4 b) shows the projection of each simulated conformation onto the first two principal components. Each point in the plot corresponds to exactly one of the input conformations, which have been classified along the first two PCs. Note that even the first PC alone can be used to approximately classify a conformation as being "folded" or "unfolded", and the second PC explains more of the data variability by assigning a wider spread to the unfolded region (for which the coordinate variability is much larger). The low-dimensional projections can be used as a basis to compute probability and free-energy plots, for example as in Das et al. Figure 4 c) shows such a plot for CV-N, which makes the identification of clusters easier. Using PCA to interpolate conformations along the PCs is not a good idea for protein folding data since all of the protein experiences large, non-linear motions of its atoms. Using only a few PCs would quickly destroy the protein topology when large-scale motions occur. In some cases, however, when only parts of a protein exhibit a small deviation from its equilibrium shape, the main modes of motion can be exploited effectively as the next example shows.
As an example of the application of PCA to analyze the main modes of motion, consider the work of Teodoro et al., which worked on the HIV-1 protease, depicted in figure 5. The HIV-1 protease plays a vital role in the maturation of the HIV-1 virus by targeting amino acid sequences in the gag and gag-pol polyproteins. Cleavage of these polyproteins produces proteins that contribute to the structure of the virion, RNA packaging, and condensation of the nucleoprotein core. The active site of HIV-1 protease is formed by the homodimer interface and is capped by two identical beta-hairpin loops from each monomer, which are usually referred to as "flaps". The structure of the HIV-1 protease complexed with an inhibitor is shown in figure 5. The active site structure for the bound form is significantly different from the structure of the unbound conformation. In the bound state the flaps adopt a closed conformation acting as clamps on the bound inhibitors or substrates, whereas in the unbound conformation the flaps are more open. Molecular dynamics simulations can be run to sample the flexibility of the flaps and produce an input data set of HIV-1 conformations to which PCA can be applied. A backbone-only representation of HIV-1 has 594 atoms, which amounts to 1,782 degrees of freedom for each conformation.
![]() |
Applying the PCA procedure as outlined before to a set of HIV-1 samples from simulation produces 1,782-dimensional principal components. Since the physical interpretation of the PCs is quite intuitive in this case, the PC coordinates can be split in groups of 3 to obtain the
![]() |
Figure 6 b) shows the effect of interpolating HIV-1 conformations along the first PC, or first mode of motion. Starting from an aligned conformation from the original data set, multiples of the first PC can be added to produce interpolated conformations. Note that the first mode of motion corresponds mostly to the "opening" and "closing" of the flaps, as can be inferred from the relative magnitued of the first PC components in the flap region. Thus, interpolating in the direction of the first PC produces an approximation of this motion, but using only one degree of freedom. This way, the complex dynamics of the system and the 1,782 apparent degrees of freedom have been approximated by just one, effectively reducing the dimensionality of the representation.
![]() |
Figure 7 (solid line) shows the residual variance left unexplained by discarding the lower-ranked principal components. Residual variance plots always decrease, and in this case, the first PC accounts for approximately 40% of the total variance along the motion (the dashed line shows the percentage of total variance explained up to the given PC). Also, the first 10 PCs account for more than 70% of the total data variance. Given the dominance of only a few degrees of freedom it is possible to represent the flexibility of the protein in a drastically reduced search space.
PCA falls in the category of what is called a linear method, since mathematically, the PCs are computed as a series of linear operations on the input coordinates. Linear methods such as PCA work well only when the collective atom motions are small (or linear), which is hardly the case for most interesting biological processes. Non-linear dimensionality reduction methods do exist, but are normally much more computationally expensive and have other disadvantages as well. However, non-linear methods are much more effective in describing complex processes using much fewer parameters.
As a very simple and abstract example of when non-linear dimensionality reduction would be preferred over a linear method such as PCA, consider the data set depicted in figure 8 a). The data is apparently two-dimensional, and naively considering the data variance in this way leads to two "important" principal components, as figure 8 b) shows. However, the data has been sampled from a one-dimensional, but non-linear, process. Figure 8 c) shows the correct interpretation of this data set as non-linear but one-dimensional.
![]() |
Most interesting molecular processes share this characteristic of being low-dimensional but highly non-linear in nature. For example, in the protein folding example depicted in figure 1, the atom positions follow very complicated, curved paths to achieve the folded shape. However, the process can still be thought of as mainly one-dimensional. Linear methods such as PCA would fail to correctly identify collective modes of motion that do not destroy the protein when followed, much like in the simple example of figure 8.
Several non-linear dimensionality reduction techniques exist, that can be classified as either parametric or non-parametric.
Isomap was introduced by Tenenbaum et al. in 2000. It is based on an improvement over a previous technique known as MultiDimensional Scaling (MDS). Isomap aims at capturing the non-linearity of a data set from the set itself, by computing relationships between neighboring points. Both MDS and the non-linear improvement will be presented first, and then the Isomap algorithm will be detailed.
MDS. Multidimensional Scaling (see Cox et al.) is a technique that produces a low-dimenisional representation for an input set of n points, where the dimensions are ordered by variance, so it is similar to PCA in this respect. However, MDS does not require coordinates as input, but rather, distances between points. More precisely, MDS requires as input a data set of points, and a distance measure
![]() |
If the distance measure between points has the metric properties as explained above, then n Euclidean coordinates can always be found for a set of n points, such that the Euclidean distance between them matches the original distances. In order to obtain these Euclidean coordinates, such coordinates are assumed to have a mean of zero, i.e., they are centered. In this way, the matrix of pairwise distances

Let's assume that n Euclidean coordinates exist for each data point and that such coordinates can be placed in a matrix

Where the left and right singular vectors coincide because

Since
Geodesic distances. The notion of "geodesic" distance was originally defined as the length of a geodesic path, where the geodesic path between two points on the surface of the Earth is considered the "shortest path" since one is constrained to travel on the Earth's surface. The concept of geodesic distance can be generalized to any mathematical surface, and defined as "the length of the shortest path between two points that lie on a surface, when the path is constrained to lie on the surface." Figure 10 shows a schematic of this generalized concept of geodesic path and distance.
![]() |
The Isomap algorithm. The Isomap method augments classical MDS with the notion of geodesic distance, in order to capture the non-linearity of a data set. To achieve its goal, Isomap uses MDS to compute few Euclidean coordinates that best preserve pairwise geodesic distances, rather than direct distances. Since the coordinates computed by MDS are Euclidean, these can be plotted on a Cartesian set of axes. The effect of Isomap is similar to "unrolling" the non-linear surface into a natural parameterization, as depicted in figure 11. Isomap approximates the geodesic distances from the data itself, by first building a neighborhood graph for the data. A neighborhood graph consists of the original set of points, together with a connection between "neighboring" points (neighboring points are defined as the closest points to a given point, as determined by the distance measure used) as shown in figure 11-b. After the neighborhood graph has been built, it can be used to approximate the geodesic distance between all pairs of points as the shortest path distance along the graph (red path in figure 11-b). Naturally, the sampling of the data set has to be enough to capture the inherent topology of the non-linear space for this approximation to work. MDS then takes these geodesic distances and produces the Euclidean coordinates for the set. Figure 11-c shows two Euclidean coordinates for each point in the example.
![]() |
The Isomap method can be put in algorithmic form in the following way:
The Isomap algorithm captures the non-linearity of the data set automatically, from the data itself. It returns a low-dimensional projection for each point; these projections can be used to understand the underlying data distribution better. However, Isomap does not return "modes of motion" like PCA does, along which other points can be interpolated. Also, Isomap is much more expensive than PCA, since building a neighborhood graph and computing all-pairs-shortest-paths can have quadratic complexity on the number of input points. Plus, ther is the hidden cost of the distance measure itself, which can also be quite expensive. Regardless of its computational cost, Isomap has proven extremely useful in describing non-linear processes with few parameters. In particular, the work in Das et al. applied Isomap successfully to the study of a protein folding reaction.
In order to apply Isomap to a set of molecular conformations (gathered, for example, through molecular dynamics simulations) all that is needed is a distance measure between two conformations. The most popular distance measure for conformations is lRMSD, discussed in the module Molecular Distance Measures. There is no need to pre-align all the conformations in this case, since the lRMSD distance already includes pairwise alignment. Thus, the Isomap algorithm as described above can be directly applied to molecular conformations. Choosing an appropriate value for a neighborhood parameter (such as
Figure 12 shows the results of applying Isomap to the same molecular trajectory used for the CV-N example in the PCA section. CV-N is a protein that is known to have an "intermediate" state in the folding mechanism, and it is also known to fold through a very ordered process following a well-defined "folding route".
![]() |
The free-energy plot in figure 12 (right) clearly shows the superior quality of the Isomap coordinates when compared with PCA. Isomap clearly identifies the CV-N intermediate (seen as a free-energy minimum in blue) in between the unfolded and folded states. Also, the highest probability route connecting the intermediate to the folded state is clearly seen in this rendering. The PCA coordinates did not identify an itermediate or a folding route, since the PCA coordinates are simply a linear projection of the original coordinates. Isomap, on the contrary, finds the underlying connectivity of the data and always returns coordinates that clearly show the progression of the reaction along its different stages.
The Isomap procedure can be applied to different molecular models. As another, smaller example, consider the Alanine Dipeptide molecule shown in figure 13 (left). This is an all-atom molecule that has been studied thoroughly and is known to have two predominant shapes: an "extended" shape (also called C5 and a variant called PII) and a "helical" shape (also called alpha, with three variants: right-handed, P-like, and left-handed). The application of the Isomap algorithm to this molecule, without any bias by a priori known information, yields the low-dimensional coordinates on the left of figure 13 (shown again as a free-energy plot). The first two coordinates (top right) are already enough to identify all five major states of the peptide, and the possible transitions between them. Applying PCA to such a simple molecule yields comparable results (not shown), but PCA cannot differentiate the left-handed helix shape from the right-handed one. Only the geodesic formulation of Isomap can.
![]() |
Although the Isomap coordinates describe non-linear molecular processes extremely well, their computation is very expensive, as mentioned earlier. The most expensive step is the computation of the neighborhood graph. Some approximations are possible to speed up its computation, at the expense of some accuracy in the identification of the "true" nearest neighbors for each point. The work in Plaku et al. proposed the use of an approximation technique that allows the computation of nearest-neighbors for high-dimensional data sets orders of magnitude faster, by quickly reducing the dimensionality of all points, and then querying for neighbors. The tradeoff between precision and speed is controllable through several algorithm parameters.