This module is part of the collection "Geometric Methods in Structural Computational Biology" (Digital Scholarship at Rice University).


# Molecular Distance Measures

Module by: Lydia E. Kavraki

Summary: Given a set of structures of the same molecule, it is often necessary to decide which are more similar or less similar to each other. This module presents a few ways to approach that problem, including root mean squared distance (RMSD), least RMSD, and intramolecular distance measures.

## Comparing Molecular Conformations

Molecules are not rigid. On the contrary, they are highly flexible objects, capable of changing shape dramatically through the rotation of dihedral angles. Each distinct shape of a given molecule is called a conformation. We need a measure to express how much a molecule changes going from one conformation to another or, equivalently, how different two conformations are from each other. Although one could conceivably measure the shape change by computing the volume of the intersection of the alpha shapes for two conformations (see Molecular Shapes and Surfaces for an explanation of alpha shapes), this is prohibitively expensive computationally.

Simpler measures of distance between conformations have been defined, based on variables such as the Cartesian coordinates for each atom, or the bond and torsion angles within the molecule. When working with Cartesian coordinates, one can represent a molecular conformation as a vector whose components are the Cartesian coordinates of the molecule's atoms. Therefore, a conformation for a molecule with N atoms can be represented as a 3N-dimensional vector of real numbers.

## RMSD and lRMSD

One of the most widely accepted difference measures for conformations of a molecule is the least root mean square deviation (lRMSD), which builds on the plain root mean square deviation (RMSD). To calculate the RMSD of a pair of structures (say x and y), each structure must be represented as a 3N-length (assuming N atoms) vector of coordinates. The RMSD is the square root of the average of the squared distances between corresponding atoms of x and y. It is a measure of the average atomic displacement between the two conformations:

$$\mathrm{RMSD}(x, y) = \sqrt{\frac{1}{N} \sum_{i=1}^{N} \lVert x_i - y_i \rVert^2}$$

where $x_i$ and $y_i$ are the positions of atom $i$ in the two conformations.

However, when molecular conformations are sampled from molecular dynamics or other forms of sampling, it is often the case that the molecule drifts away from the origin and rotates in an arbitrary way. The lRMSD compensates for this by taking the minimum RMSD over all possible relative positions and orientations of the two conformations under consideration. Calculating the lRMSD consists of first finding an optimal alignment of the two structures, and then calculating their RMSD. Note that aligning two conformations may require both a translation and a rotation. In other words, before computing the RMSD, it is necessary to remove the translation between the centroids of the two conformations and to perform an "optimal alignment" or "optimal rotation" of them, since these two factors artificially increase the RMSD between them.
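The RMSD definition, and the way a rigid motion inflates it, can be illustrated with a minimal sketch. It assumes conformations are stored as (N, 3) NumPy arrays with atoms in corresponding order; the function name `rmsd` and the toy coordinates are illustrative choices, not part of the original module.

```python
import numpy as np

def rmsd(x, y):
    """RMSD between two conformations given as (N, 3) arrays.

    Atom i of x is assumed to correspond to atom i of y.
    No alignment is performed.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    if x.shape != y.shape:
        raise ValueError("conformations must have the same shape")
    sq_dist = np.sum((x - y) ** 2, axis=1)  # squared displacement per atom
    return float(np.sqrt(sq_dist.mean()))

# Toy three-atom "molecule" and a copy translated by 1 along the z axis.
x = np.array([[0.0, 0.0, 0.0], [1.5, 0.0, 0.0], [1.5, 1.5, 0.0]])
shifted = x + np.array([0.0, 0.0, 1.0])

print(rmsd(x, x))        # 0.0
print(rmsd(x, shifted))  # 1.0
```

The second value is nonzero even though the two conformations have exactly the same shape, which is the artifact that the alignment step of lRMSD removes.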

Finding the optimal rotation to minimize the RMSD between two point sets is a well-studied problem, and several algorithms exist. The Kabsch Algorithm [1][2], which is implemented in several molecular modeling packages, solves a matrix equation for the three-dimensional rotation matrix corresponding to the optimal rotation. An alternative approach, discussed in detail after the matrix method, uses a compact representation of rotational transformations called quaternions [3][4]. Quaternions are currently the preferred representation for global rotation in calculating lRMSD, since they require fewer numbers to be stored and are easy to re-normalize. In contrast, re-normalization of orthonormal matrices is quite expensive and potentially numerically unstable. Both quaternions and their application to global alignment of conformations will be presented after the next section.

## Optimal Alignment for lRMSD Using Rotation Matrices

This section presents a method for computing the optimal rotation between two data sets as an orthonormal rotation matrix. As stated earlier, this approach is somewhat less numerically stable (since guaranteeing the orthonormality of a matrix is harder than guaranteeing the unit length of a quaternion) and requires taking care of the special case in which the resulting matrix is not a proper rotation, as discussed below.

As stated earlier, the optimal alignment requires both a translation and a rotation. The translational part of the alignment is easy to calculate. It can be proven that the optimal alignment is obtained by translating one set so that its centroid coincides with the other set's centroid (see section 2-C of [3] for proof). The centroid of a point set $a$ with $N$ points is simply the average position of all its points:

$$\bar{a} = \frac{1}{N} \sum_{i=1}^{N} a_i$$

We can then redefine each point in two sets A and B as a deviation from the centroid:

$$a'_i = a_i - \bar{a}, \qquad b'_i = b_i - \bar{b}$$

Given this notation relative to the centroid, we can explicitly set the centroids to be equal and proceed with the rotational part of the alignment.
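As a small illustration of the translational step, the sketch below computes a centroid and re-expresses each point as a deviation from it. The helper name `center` and the sample coordinates are assumptions made for the example.

```python
import numpy as np

def center(points):
    """Return (centered points, centroid) for an (N, 3) array of coordinates."""
    points = np.asarray(points, dtype=float)
    centroid = points.mean(axis=0)      # average position of all points
    return points - centroid, centroid

a = np.array([[1.0, 2.0, 3.0], [3.0, 2.0, 1.0], [2.0, 5.0, 2.0]])
a_centered, a_bar = center(a)

print(a_bar)  # [2. 3. 2.]
```

After centering, the centroid of `a_centered` sits at the origin, so two conformations centered this way already have coinciding centroids.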

One of the first references to the solution of this problem in matrix form is from Kabsch [1][2]. The Kabsch method uses Lagrange multipliers to solve a minimization problem to find the optimal rotation. Here, we present a slightly more intuitive method, based on matrix algebra, that achieves the same result. This formulation can be found in [4] and [5]. Imagine we wish to align two conformations composed of N atoms each, whose Cartesian coordinates are given by the vectors $x$ and $y$. The main idea behind this approach is to find a 3x3 orthonormal matrix $U$ such that applying $U$ to the atom positions of one of the data vectors, $x$, aligns it as well as possible with the other data vector, $y$, in the sense that the quantity to minimize is the distance $d(Ux, y)$. Both $x$ and $y$ are assumed to be centered, that is, both their centroids coincide at the origin (centering both conformations is the first step). Mathematically, this problem can be stated as the minimization of the following quantity:

$$E = \frac{1}{N} \sum_{i=1}^{N} \lVert U x_i - y_i \rVert^2$$

When $E$ is a minimum, the square root of its value becomes the least RMSD (lRMSD) between $x$ and $y$. Being an orthonormal rotation matrix, $U$ needs to satisfy the orthonormality property $U U^T = I$, where $I$ is the identity matrix. The orthonormality constraint ensures that the rows and columns are mutually orthogonal, and that their length (as vectors) is one. Every orthonormal matrix encodes a rigid transformation in space, but if its rows/columns do not constitute a right-handed system, the rotation is said to be improper. In an improper rotation, one of the three directions may be "mirrored". Fortunately, this case can be detected easily by computing the determinant of the matrix $U$ and, if it is negative, correcting the matrix. Denoting $U x_i$ as $x'_i$, and moving the constant factor N to the left, the formula for the error becomes:

$$N E = \sum_{i=1}^{N} \lVert x'_i - y_i \rVert^2$$

An alternative way to represent the two point sets, rather than as a one-dimensional vector or as separate atom coordinates, is to use two 3xN matrices (N atoms, 3 coordinates for each). Using this scheme, $x$ is represented by the matrix $X$ and $y$ is represented by the matrix $Y$. Note that column $i$ ($1 \le i \le N$) in these matrices stands for point (atom) $x_i$ and $y_i$, respectively. Using this new representation, we can write:

$$N E = \sum_{i=1}^{N} \lVert x'_i - y_i \rVert^2 = \mathrm{Tr}\left( (X' - Y)^T (X' - Y) \right)$$

where $X' = UX$ and $\mathrm{Tr}(A)$ stands for the trace of matrix $A$, the sum of its diagonal elements. It is easy to see that the trace of the matrix on the right amounts precisely to the sum on the left (simply carrying out the multiplication of the first row/column should convince the reader). The right-hand side of the equation can be expanded into:

$$\mathrm{Tr}\left( (X' - Y)^T (X' - Y) \right) = \mathrm{Tr}(X'^T X') + \mathrm{Tr}(Y^T Y) - 2\,\mathrm{Tr}(Y^T X')$$

This follows from the properties of the trace operator, namely: $\mathrm{Tr}(A+B) = \mathrm{Tr}(A) + \mathrm{Tr}(B)$, $\mathrm{Tr}(AB) = \mathrm{Tr}(BA)$, $\mathrm{Tr}(A^T) = \mathrm{Tr}(A)$, and $\mathrm{Tr}(kA) = k\,\mathrm{Tr}(A)$. Furthermore, the first two terms in the expansion above represent the sum of the squares of the components of $x_i$ and $y_i$, so it can be rewritten as:

$$N E = \sum_{i=1}^{N} \left( \lVert x_i \rVert^2 + \lVert y_i \rVert^2 \right) - 2\,\mathrm{Tr}(Y^T X')$$

Note that the $x$ components do not need to be primed (i.e., $x'$) since the rotation $U$ around the origin does not change the length of $x_i$. Note also that the summation above does not depend on $U$, so minimizing $E$ is equivalent to maximizing $\mathrm{Tr}(Y^T X')$. For this reason, the rest of the discussion focuses on finding a proper rotation matrix $U$ that maximizes $\mathrm{Tr}(Y^T X')$. Remembering that $X' = UX$, the quantity to maximize is then $\mathrm{Tr}(Y^T U X)$. From the property $\mathrm{Tr}(AB) = \mathrm{Tr}(BA)$ of the trace operator, this is equivalent to $\mathrm{Tr}(X Y^T U)$. Since $X Y^T$ is a square 3x3 matrix, it can be decomposed through the Singular Value Decomposition technique (SVD) into $X Y^T = V S W^T$, where $V$ and $W^T$ are the matrices of left and right singular vectors (which are orthonormal matrices), respectively, and $S$ is a diagonal 3x3 matrix containing the singular values $s_1$, $s_2$, $s_3$ in decreasing order. Again from the properties of the trace operator, we obtain that:

$$\mathrm{Tr}(X Y^T U) = \mathrm{Tr}(V S W^T U) = \mathrm{Tr}(S W^T U V)$$

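If we introduce the 3x3 matrix $T$ as the product $T = W^T U V$, we can rewrite the above expression as:

$$\mathrm{Tr}(X Y^T U) = \mathrm{Tr}(S T) = s_1 T_{11} + s_2 T_{22} + s_3 T_{33} \le s_1 + s_2 + s_3$$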

Since $T$ is the product of orthonormal matrices, it is itself an orthonormal matrix and $\det(T) = \pm 1$. This means that the absolute value of each element of this matrix is no more than one, from which the last inequality follows (the singular values are non-negative). The maximum value of the left-hand side is reached when the diagonal elements of $T$ are equal to 1, and since $T$ is an orthonormal matrix, all other elements must then be zero. This results in $T = I$. Moreover, since $T = W^T U V$, we can write that $W^T U V = I$, and because $W$ and $V$ are orthonormal, $W W^T = I$ and $V V^T = I$. Multiplying $W^T U V$ by $W$ on the left and $V^T$ on the right yields a solution for $U$:

$$U = W V^T$$

Here $V$ and $W^T$ are the matrices of left and right singular vectors, respectively, of the covariance matrix $C = X Y^T$. This formula ensures that $U$ is orthonormal (the reader should carry out the high-level matrix multiplication and verify this fact).
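The derivation so far can be checked numerically. The sketch below uses random test data (an assumption for illustration only) to confirm that the sum of squared distances equals the trace form and its expansion, and that the matrix built from the SVD is orthonormal and attains the trace value $s_1 + s_2 + s_3$. Note that `np.linalg.svd` returns the factors in the order $V$, singular values, $W^T$ in this document's notation.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 6
X = rng.normal(size=(3, N))          # columns are atom positions x_i
Y = rng.normal(size=(3, N))          # columns are atom positions y_i
X -= X.mean(axis=1, keepdims=True)   # center both sets at the origin
Y -= Y.mean(axis=1, keepdims=True)

def random_orthonormal(rng):
    """Orthonormal 3x3 matrix from the QR factorization of a random matrix."""
    Q, _ = np.linalg.qr(rng.normal(size=(3, 3)))
    return Q

# 1. Sum of squared distances == trace form == expanded form.
U0 = random_orthonormal(rng)
Xp = U0 @ X                          # X' = U X
sum_form = np.sum((Xp - Y) ** 2)
trace_form = np.trace((Xp - Y).T @ (Xp - Y))
expanded = np.sum(X ** 2) + np.sum(Y ** 2) - 2.0 * np.trace(Y.T @ Xp)

# 2. U = W V^T from the SVD of C = X Y^T = V S W^T.
C = X @ Y.T
V, s, Wt = np.linalg.svd(C)
U = Wt.T @ V.T
best = np.trace(Y.T @ U @ X)         # should equal s1 + s2 + s3

# No other orthonormal matrix should give a larger trace.
others = [np.trace(Y.T @ random_orthonormal(rng) @ X) for _ in range(1000)]

print(np.isclose(sum_form, trace_form), np.isclose(trace_form, expanded))
print(np.allclose(U @ U.T, np.eye(3)), np.isclose(best, s.sum()))
```

This check says nothing yet about the sign of $\det(U)$; a matrix obtained this way can still be an improper rotation.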

The only remaining detail to take care of is to make sure that $U$ is a proper rotation, as discussed before. It could indeed happen that $\det(U) = -1$ if its rows/columns do not make up a right-handed system. When this happens, we need to compromise between two goals: maximizing $\mathrm{Tr}(Y^T X')$ and respecting the constraint that $\det(U) = +1$. Therefore, we need to settle for the second largest value of $\mathrm{Tr}(Y^T X')$. It is easy to see what the second largest value is; since:

$$\mathrm{Tr}(Y^T X') = \mathrm{Tr}(S T) = s_1 T_{11} + s_2 T_{22} + s_3 T_{33}, \qquad s_1 \ge s_2 \ge s_3 \ge 0$$

then the second largest value occurs when $T_{11} = T_{22} = +1$ and $T_{33} = -1$. In this case $T$ cannot be the identity matrix as before; instead, it has its lower-right corner set to -1. We now have a unified way to represent the solution. If $\det(C) > 0$, $T$ is the identity; otherwise, it has a -1 as its last diagonal element. Finally, these facts can be expressed in a single formula for the optimal rotation $U$ by stating:

$$U = W \begin{pmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & d \end{pmatrix} V^T$$

where $d = \mathrm{sign}(\det(C))$. In the light of the preceding derivation, all the facts that have been presented as a proof can be succinctly put as an algorithm for computing the optimal rotation to align two data sets $x$ and $y$:

### Optimal rotation

1. Build the 3xN matrices $X$ and $Y$ containing, for the sets $x$ and $y$ respectively, the coordinates for each of the N atoms after centering the atoms by subtracting the centroids.
2. Compute the covariance matrix $C = X Y^T$.
3. Compute the SVD (Singular Value Decomposition) $C = V S W^T$.
4. Compute $d = \mathrm{sign}(\det(C))$.
5. Compute the optimal rotation $U$ as $U = W \,\mathrm{diag}(1, 1, d)\, V^T$, where $\mathrm{diag}(1, 1, d)$ is the 3x3 diagonal matrix with diagonal entries 1, 1, and $d$.
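The steps above can be sketched with NumPy as follows. This is an illustrative sketch, not a reference implementation: the function names `optimal_rotation` and `lrmsd` are made up for the example, and conformations are assumed to be (N, 3) arrays with atoms in corresponding order. One deliberate deviation is labeled in the code: $d$ is computed as the sign of $\det(W V^T)$, which agrees with $\mathrm{sign}(\det(C))$ whenever $C$ is nonsingular but stays well defined when $\det(C)$ is zero (for example, for planar point sets).

```python
import numpy as np

def optimal_rotation(x, y):
    """Proper rotation U (3x3) minimizing sum_i |U x_i - y_i|^2.

    x and y are (N, 3) arrays of corresponding atom coordinates that
    have already been centered at the origin.
    """
    X = np.asarray(x, dtype=float).T        # step 1: 3xN matrices
    Y = np.asarray(y, dtype=float).T
    C = X @ Y.T                             # step 2: covariance matrix
    V, s, Wt = np.linalg.svd(C)             # step 3: C = V S W^T
    # Step 4 (assumption): sign of det(W V^T) instead of sign(det(C));
    # the two agree when C is nonsingular.
    d = np.sign(np.linalg.det(Wt.T @ V.T))
    return Wt.T @ np.diag([1.0, 1.0, d]) @ V.T   # step 5

def lrmsd(x, y):
    """Least RMSD: center both conformations, rotate x onto y, take RMSD."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    xc = x - x.mean(axis=0)
    yc = y - y.mean(axis=0)
    U = optimal_rotation(xc, yc)
    diff = xc @ U.T - yc                    # rows are U x_i - y_i
    return float(np.sqrt(np.mean(np.sum(diff ** 2, axis=1))))

# Example: y is a rotated and translated copy of x.
rng = np.random.default_rng(1)
x = rng.normal(size=(10, 3))
theta = 0.7
R = np.array([[np.cos(theta), -np.sin(theta), 0.0],
              [np.sin(theta),  np.cos(theta), 0.0],
              [0.0,            0.0,           1.0]])
y = x @ R.T + np.array([5.0, -2.0, 3.0])

print(lrmsd(x, y))   # close to 0, up to floating-point error
```

Because `y` differs from `x` only by a rigid motion, the recovered rotation should match `R` and the lRMSD should vanish up to round-off.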

## Optimal Alignment for lRMSD Using Quaternions

Another way of solving the optimal rotation for the purposes of computing the lRMSD between two conformations is to use quaternions. These provide a very compact way of representing rotations (only 4 numbers as compared to 9 or 16 for a rotation matrix) and are extremely easy to normalize after performing operations on them. Next, a general introduction to quaternions is given, and then they will be used to compute the optimal rotation between two point sets.

### Introduction to Quaternions

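Quaternions are an extension of complex numbers. Recall that complex numbers are numbers of the form a + bi, where a and b are real numbers and i is the canonical imaginary number, equal to the square root of -1. Quaternions add two more imaginary numbers, j and k. These numbers are related by the following set of equalities:

$$i^2 = j^2 = k^2 = ijk = -1$$

These equalities give rise to some unusual properties, especially with respect to multiplication. In particular, products of distinct imaginary units do not commute:

$$ij = k, \quad jk = i, \quad ki = j, \qquad ji = -k, \quad kj = -i, \quad ik = -j$$

Given this definition of i, j, and k, we can now define a quaternion as a number of the form

$$q = q_0 + q_x i + q_y j + q_z k$$

where $q_0$, $q_x$, $q_y$, and $q_z$ are real numbers.

Based on the definitions of i, j and k, we can also derive rules for addition and multiplication of quaternions. Assume we have two quaternions, p and q, defined as follows:

$$p = p_0 + p_x i + p_y j + p_z k, \qquad q = q_0 + q_x i + q_y j + q_z k$$

Addition of p and q is fairly intuitive:

$$p + q = (p_0 + q_0) + (p_x + q_x) i + (p_y + q_y) j + (p_z + q_z) k$$

The dot product and magnitude of a quaternion also closely resemble those operations for vectors. Note that a unit quaternion is a quaternion with magnitude 1 under this definition:

$$p \cdot q = p_0 q_0 + p_x q_x + p_y q_y + p_z q_z, \qquad \lVert q \rVert = \sqrt{q \cdot q}$$

Multiplication, however, is not, due to the definitions of i, j, and k:

$$\begin{aligned} pq = {} & (p_0 q_0 - p_x q_x - p_y q_y - p_z q_z) \\ & + (p_0 q_x + p_x q_0 + p_y q_z - p_z q_y)\, i \\ & + (p_0 q_y - p_x q_z + p_y q_0 + p_z q_x)\, j \\ & + (p_0 q_z + p_x q_y - p_y q_x + p_z q_0)\, k \end{aligned}$$

Quaternion multiplication also has two equivalent matrix forms which will become relevant later in the derivation of the alignment method:

$$pq = \begin{pmatrix} p_0 & -p_x & -p_y & -p_z \\ p_x & p_0 & -p_z & p_y \\ p_y & p_z & p_0 & -p_x \\ p_z & -p_y & p_x & p_0 \end{pmatrix} \begin{pmatrix} q_0 \\ q_x \\ q_y \\ q_z \end{pmatrix} = \begin{pmatrix} q_0 & -q_x & -q_y & -q_z \\ q_x & q_0 & q_z & -q_y \\ q_y & -q_z & q_0 & q_x \\ q_z & q_y & -q_x & q_0 \end{pmatrix} \begin{pmatrix} p_0 \\ p_x \\ p_y \\ p_z \end{pmatrix}$$

The following useful properties of quaternion multiplication can be derived easily using the matrix form for multiplication, or they can be proved by carrying out the products. Here $r$ is a third quaternion and $q^* = q_0 - q_x i - q_y j - q_z k$ denotes the conjugate of $q$:

$$(pq) \cdot (pr) = (p \cdot p)(q \cdot r), \qquad (pq) \cdot (pq) = (p \cdot p)(q \cdot q), \qquad (pq) \cdot r = p \cdot (r q^*)$$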
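The quaternion operations described in this section can be sketched in a few lines of plain Python. The representation as a 4-tuple `(q0, qx, qy, qz)` and the function names are choices made for this example only.

```python
import math

def qadd(p, q):
    """Component-wise sum of two quaternions."""
    return tuple(a + b for a, b in zip(p, q))

def qdot(p, q):
    """Dot product, treating quaternions as 4-vectors."""
    return sum(a * b for a, b in zip(p, q))

def qnorm(q):
    """Magnitude of a quaternion; unit quaternions have magnitude 1."""
    return math.sqrt(qdot(q, q))

def qconj(q):
    """Conjugate: negate the three imaginary components."""
    q0, qx, qy, qz = q
    return (q0, -qx, -qy, -qz)

def qmul(p, q):
    """Product pq, which follows from i^2 = j^2 = k^2 = ijk = -1."""
    p0, px, py, pz = p
    q0, qx, qy, qz = q
    return (p0 * q0 - px * qx - py * qy - pz * qz,
            p0 * qx + px * q0 + py * qz - pz * qy,
            p0 * qy - px * qz + py * q0 + pz * qx,
            p0 * qz + px * qy - py * qx + pz * q0)

i, j, k = (0, 1, 0, 0), (0, 0, 1, 0), (0, 0, 0, 1)
print(qmul(i, j))   # (0, 0, 0, 1), that is, ij = k
print(qmul(j, i))   # (0, 0, 0, -1), that is, ji = -k
```

The two printed products differ in sign, which shows directly that quaternion multiplication is not commutative.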

### Quaternions and Three-Dimensional Rotations

A number of different methods exist for denoting rotations of rigid objects in three-dimensional space. These are introduced in a module on protein kinematics. Unit quaternions represent a rotation by an angle around an arbitrary axis. A rotation by the angle $\theta$ about an axis represented by the unit vector v = [x, y, z] is represented by the unit quaternion:

$$q = \cos\frac{\theta}{2} + \sin\frac{\theta}{2}\,(x\,i + y\,j + z\,k)$$

Like rotation matrices, quaternions may be composed with each other via multiplication. The major advantage of the quaternion representation is that it is more robust to numerical instability than orthonormal matrices. Numerical instability results from the fact that, because computers use a finite number of bits to represent real numbers, most real numbers are actually represented by the nearest number the computer is capable of representing. Over a series of floating point operations, the error caused by this inexact representation accumulates, quite rapidly in the case of repeated multiplications and divisions. In manipulating orthonormal transformation matrices, this can result in matrices that are no longer orthonormal, and therefore not valid rigid transformations. Restoring orthonormality, that is, finding the "nearest" orthonormal matrix to an arbitrary matrix, is comparatively expensive. Unit-length quaternions can accumulate the same kind of numerical error as rotation matrices, but finding the nearest unit-length quaternion to an arbitrary quaternion is simple: divide the quaternion by its magnitude.

Additionally, because quaternions correspond more directly to the axis-angle representation of three-dimensional rotations, it could be argued that they have a more intuitive interpretation than rotation matrices. Quaternions, with four parameters, are also more memory efficient than 3x3 matrices. For all of these reasons, quaternions are currently the preferred representation for three-dimensional rotations in most modeling applications.
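Re-normalizing a quaternion after many compositions can be sketched as below. The step angle and the number of compositions are arbitrary values chosen for the illustration, and the size of the accumulated drift will depend on the platform.

```python
import math

def qmul(p, q):
    """Quaternion product for 4-tuples (q0, qx, qy, qz)."""
    p0, px, py, pz = p
    q0, qx, qy, qz = q
    return (p0 * q0 - px * qx - py * qy - pz * qz,
            p0 * qx + px * q0 + py * qz - pz * qy,
            p0 * qy - px * qz + py * q0 + pz * qx,
            p0 * qz + px * qy - py * qx + pz * q0)

def magnitude(q):
    return math.sqrt(sum(c * c for c in q))

def normalize(q):
    """Nearest unit quaternion: divide every component by the magnitude."""
    n = magnitude(q)
    return tuple(c / n for c in q)

# Unit quaternion for a rotation of 0.1 rad about the z axis.
step = (math.cos(0.05), 0.0, 0.0, math.sin(0.05))

q = (1.0, 0.0, 0.0, 0.0)              # identity rotation
for _ in range(100000):
    q = qmul(q, step)                 # compose many small rotations

drift = abs(magnitude(q) - 1.0)       # round-off accumulated in the magnitude
q = normalize(q)                      # one division restores unit length
```

A single call to `normalize` is all that is needed, whereas restoring a drifted 3x3 matrix to orthonormality requires a matrix factorization.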

Vectors can be represented as purely imaginary quaternions, that is, quaternions whose scalar component is 0. The quaternion corresponding to the vector v = [x, y, z] is q = xi + yj + zk.

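We can perform rotation of a vector in quaternion notation as follows:

$$v' = q\, v\, q^*$$

where $v$ is the purely imaginary quaternion corresponding to the vector, $q$ is the unit quaternion representing the rotation, and $q^* = q_0 - q_x i - q_y j - q_z k$ is the conjugate of $q$. The result $v'$ is again a purely imaginary quaternion, whose imaginary components are the coordinates of the rotated vector.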
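As an illustration, the sketch below builds the unit quaternion for a rotation from an axis and an angle and applies it to a vector as the product of the quaternion, the vector written as a purely imaginary quaternion, and the conjugate of the quaternion. The function names are hypothetical helpers for this example.

```python
import math

def qmul(p, q):
    """Quaternion product for 4-tuples (q0, qx, qy, qz)."""
    p0, px, py, pz = p
    q0, qx, qy, qz = q
    return (p0 * q0 - px * qx - py * qy - pz * qz,
            p0 * qx + px * q0 + py * qz - pz * qy,
            p0 * qy - px * qz + py * q0 + pz * qx,
            p0 * qz + px * qy - py * qx + pz * q0)

def qconj(q):
    q0, qx, qy, qz = q
    return (q0, -qx, -qy, -qz)

def axis_angle_quaternion(axis, theta):
    """Unit quaternion for a rotation by theta (radians) about axis."""
    x, y, z = axis
    n = math.sqrt(x * x + y * y + z * z)     # normalize the axis
    s = math.sin(theta / 2.0) / n
    return (math.cos(theta / 2.0), x * s, y * s, z * s)

def rotate(v, q):
    """Rotate the 3-vector v by the unit quaternion q."""
    p = (0.0, v[0], v[1], v[2])              # purely imaginary quaternion
    r = qmul(qmul(q, p), qconj(q))
    return r[1:]                             # drop the (zero) scalar part

q = axis_angle_quaternion((0.0, 0.0, 1.0), math.pi / 2.0)
print(rotate((1.0, 0.0, 0.0), q))   # approximately (0, 1, 0)
```

A quarter turn about the z axis carries the x axis onto the y axis, as expected for a right-handed rotation.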

### Optimal Alignment with Quaternions

The method presented here is from Berthold K. P. Horn, "Closed-form solution of absolute orientation using unit quaternions." Journal of the Optical Society of America A, 4:629-642 [3].

The alignment problem may be stated as follows:

• We have two sets of points (atoms) A and B for which we wish to find an optimal alignment, defined as the alignment for which the root mean square difference between each point in A and its corresponding point in B is minimized.
• We know which point in A corresponds to which point in B. This is necessary for any RMSD-based method.

As for the case of rotation matrices, the translational part of the alignment consists of making the centroids of the two data sets coincide. To find the optimal rotation using quaternions, recall that the dot product of two vectors is maximized when the vectors are in the same direction. The same is true when the vectors are represented as quaternions. Using this property, we can define a quantity that we want to maximize, where $a_i$ and $b_i$ are the purely imaginary quaternions corresponding to the centered points of A and B, and $q$ is the unit quaternion of the rotation (see [3] for the proof):

$$\sum_{i=1}^{N} (q\, a_i\, q^*) \cdot b_i$$

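Equivalently, using the last property from the section "Introduction to Quaternions", we get the following, where $a_i$ and $b_i$ are the purely imaginary quaternions of the centered points and $q$ is the unit quaternion of the rotation:

$$\sum_{i=1}^{N} (q\, a_i) \cdot (b_i\, q)$$

Now, recall that quaternion multiplication can be represented by matrices, and that the quaternions a and b have a 0 real component:

$$q\, a_i = A_i\, q, \quad A_i = \begin{pmatrix} 0 & -a_x & -a_y & -a_z \\ a_x & 0 & a_z & -a_y \\ a_y & -a_z & 0 & a_x \\ a_z & a_y & -a_x & 0 \end{pmatrix}, \qquad b_i\, q = B_i\, q, \quad B_i = \begin{pmatrix} 0 & -b_x & -b_y & -b_z \\ b_x & 0 & -b_z & b_y \\ b_y & b_z & 0 & -b_x \\ b_z & -b_y & b_x & 0 \end{pmatrix}$$

(the components $a_x, a_y, a_z$ and $b_x, b_y, b_z$ are those of atom $i$). Using these matrices, we can derive a new form for the objective function:

$$\sum_{i=1}^{N} (A_i\, q) \cdot (B_i\, q) = q^T \left( \sum_{i=1}^{N} A_i^T B_i \right) q = q^T N q$$

where:

$$N = \sum_{i=1}^{N} A_i^T B_i$$

The quaternion that maximizes this product is the eigenvector of N that corresponds to its most positive eigenvalue (see [3] for the proof). The eigenvalues can be found by solving the following equation, which is quartic in $\lambda$:

$$\det(N - \lambda I) = 0$$

This quartic equation can be solved by a number of standard approaches. Finally, given the maximum eigenvalue $\lambda_{max}$, the quaternion corresponding to the optimal rotation is the eigenvector v:

$$(N - \lambda_{max} I)\, v = 0$$

A closed-form solution to this equation for v can be found by applying techniques from linear algebra. One possible algorithm, based on constructing a matrix of cofactors, is presented in appendix A5 of the source paper [3].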

In summary, the alignment algorithm works as follows:

• Recalculate atom coordinates as displacements from the centroid of each molecule. The optimal translation superimposes the centroids.
• Construct the matrix N based on matrices A and B for each atom.
• Find the maximum eigenvalue by solving the quartic eigenvalue equation.
• Find the eigenvector corresponding to this eigenvalue. This vector is the quaternion corresponding to the optimal rotation.

This method appears computationally intensive, but has the major advantage over other approaches of being a closed-form, unique solution.
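The summary above can be sketched with NumPy. Two assumptions are made for this illustration and should be read as such: the 4x4 matrix is assembled directly from the sums of coordinate products $S_{uv} = \sum_i a_{i,u}\, b_{i,v}$ (which is what the sum of the per-atom matrix products expands to), and the largest eigenvalue is found with the general symmetric eigensolver `np.linalg.eigh` rather than by solving the quartic equation in closed form. The function names are hypothetical.

```python
import numpy as np

def optimal_quaternion(a, b):
    """Unit quaternion (q0, qx, qy, qz) rotating centered points a onto b.

    a and b are (N, 3) arrays of corresponding, centroid-centered points.
    """
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    S = a.T @ b                       # S[u, v] = sum_i a_i[u] * b_i[v]
    Sxx, Sxy, Sxz = S[0]
    Syx, Syy, Syz = S[1]
    Szx, Szy, Szz = S[2]
    N = np.array([
        [Sxx + Syy + Szz, Syz - Szy, Szx - Sxz, Sxy - Syx],
        [Syz - Szy, Sxx - Syy - Szz, Sxy + Syx, Szx + Sxz],
        [Szx - Sxz, Sxy + Syx, -Sxx + Syy - Szz, Syz + Szy],
        [Sxy - Syx, Szx + Sxz, Syz + Szy, -Sxx - Syy + Szz],
    ])
    # eigh returns eigenvalues in ascending order with unit eigenvectors
    # as columns, so the last column belongs to the largest eigenvalue.
    eigenvalues, eigenvectors = np.linalg.eigh(N)
    return eigenvectors[:, -1]

def rotation_matrix(q):
    """3x3 rotation matrix equivalent to the unit quaternion q."""
    q0, qx, qy, qz = q
    return np.array([
        [q0*q0 + qx*qx - qy*qy - qz*qz, 2*(qx*qy - q0*qz), 2*(qx*qz + q0*qy)],
        [2*(qy*qx + q0*qz), q0*q0 - qx*qx + qy*qy - qz*qz, 2*(qy*qz - q0*qx)],
        [2*(qz*qx - q0*qy), 2*(qz*qy + q0*qx), q0*q0 - qx*qx - qy*qy + qz*qz],
    ])

# Example: b is a rotated copy of the centered point set a.
rng = np.random.default_rng(2)
a = rng.normal(size=(10, 3))
a -= a.mean(axis=0)
cz, sz, cx, sx = np.cos(0.7), np.sin(0.7), np.cos(0.3), np.sin(0.3)
Rz = np.array([[cz, -sz, 0.0], [sz, cz, 0.0], [0.0, 0.0, 1.0]])
Rx = np.array([[1.0, 0.0, 0.0], [0.0, cx, -sx], [0.0, sx, cx]])
R = Rz @ Rx
b = a @ R.T

q = optimal_quaternion(a, b)
print(np.allclose(rotation_matrix(q), R))
```

The eigenvector is only determined up to sign, but a quaternion and its negative describe the same rotation, so the resulting rotation matrix is unaffected.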

## Intramolecular Distance and Related Measures

RMSD and lRMSD are not ideally suited for all applications. For example, consider the case of a given conformation A, and a set S of other conformations generated by some means. The goal is to estimate which conformations in S are closest in potential energy to A, making the assumption that they will be the conformations most structurally similar to A. The lRMSD measure will find the conformations in which the overall average atomic displacement is least. The problem is that if the quantity of interest is the potential energy of conformations, not all atoms can be treated equally. Those on the outside of the protein can often move a fair amount without dramatically affecting the energy. In contrast, the core of the molecule tends to be more compact, and therefore a slight change in the relative positions of a pair of atoms could lead to overlap of the atoms, and therefore a completely infeasible structure and high potential energy. A class of distance measures and pseudo-measures based on intramolecular distances have been developed to address this shortcoming of RMSD-based measures.

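Assume we wish to compare two conformations P and Q of a molecule with N atoms. Let $p_{ij}$ be the distance between atom i and atom j in conformation P, and let $q_{ij}$ be the same distance for conformation Q. Then the intramolecular distance is defined as

$$d(P, Q) = \sqrt{\frac{2}{N(N-1)} \sum_{i<j} \left( p_{ij} - q_{ij} \right)^2}$$

where the sum runs over all pairs of distinct atoms. The normalization shown here, an average over the $N(N-1)/2$ atom pairs, is a common convention; other normalizations differ only by a constant factor.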

One of the main computational advantages of this class of approaches is that we do not have to compute the alignment between P and Q. On the other hand, for this metric we need to sum over a quadratic number of terms, whereas for RMSD the number of terms is linear in the number of atoms. Approximations can be made to speed up this computation, as shown in [7]. Also, the intramolecular distance measure given above, which is sometimes referred to as the dRMSD, is subject to the problem that the pairs of atoms most distant from each other are the ones that contribute the greatest amount to the measured difference.
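A minimal sketch of the intramolecular distance (dRMSD) follows. It assumes (N, 3) coordinate arrays and averages the squared differences over the N(N-1)/2 distinct atom pairs; that normalization and the function names are assumptions of this example.

```python
import numpy as np

def pairwise_distances(p):
    """(N, N) matrix of distances between all pairs of atoms."""
    p = np.asarray(p, dtype=float)
    diff = p[:, None, :] - p[None, :, :]
    return np.sqrt(np.sum(diff ** 2, axis=-1))

def drmsd(p, q):
    """Root mean square difference of corresponding intramolecular distances."""
    dp = pairwise_distances(p)
    dq = pairwise_distances(q)
    i, j = np.triu_indices(len(dp), k=1)     # each pair i < j exactly once
    return float(np.sqrt(np.mean((dp[i, j] - dq[i, j]) ** 2)))

p = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
stretched = np.array([[0.0, 0.0, 0.0], [2.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
moved = p[:, [1, 2, 0]] + np.array([4.0, -1.0, 2.0])   # rigid motion of p

print(drmsd(p, moved))       # 0.0: rigid motions do not change the distances
print(drmsd(p, stretched))   # positive: the shape itself has changed
```

No alignment step is needed because the pairwise distances are unchanged by translation and rotation. The quadratic cost is visible in the (N, N) distance matrices.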

An interesting open problem is to come up with a physically meaningful molecular distance metric that allows for fast nearest neighbor computations. Such a metric would be useful, for example, for clustering conformations. One proposed method is the contact distance. Contact distance requires constructing a contact map matrix for each conformation indicating which pairs of atoms are less than some threshold separation. The distance measure is then a measure of the difference of the contact maps.
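The contact distance idea can be sketched as follows. The text does not fix how the difference between two contact maps is measured, so this example assumes one simple choice: counting the atom pairs whose contact status differs (a Hamming distance). The threshold value and function names are likewise illustrative.

```python
import numpy as np

def contact_map(p, threshold):
    """Boolean (N, N) matrix: True where two distinct atoms are closer
    than threshold."""
    p = np.asarray(p, dtype=float)
    diff = p[:, None, :] - p[None, :, :]
    dist = np.sqrt(np.sum(diff ** 2, axis=-1))
    contacts = dist < threshold
    np.fill_diagonal(contacts, False)   # an atom is not in contact with itself
    return contacts

def contact_distance(p, q, threshold):
    """Number of atom pairs whose contact status differs between p and q
    (assumed Hamming-style difference of the two contact maps)."""
    cp = contact_map(p, threshold)
    cq = contact_map(q, threshold)
    i, j = np.triu_indices(len(cp), k=1)   # each pair i < j exactly once
    return int(np.count_nonzero(cp[i, j] != cq[i, j]))

extended = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [2.0, 0.0, 0.0]])
bent = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [1.0, 1.0, 0.0]])

print(contact_distance(extended, bent, threshold=1.5))   # 1
```

Bending the chain brings the first and last atoms within the threshold, so exactly one pair changes its contact status.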

Other distance measures attempt to weight each pair in the dRMSD based on how close the atoms are, with closer pairs given more weight, in keeping with the intuition that small changes in the relative positions of nearby atoms are more likely to result in collisions. One such measure is the normalized Holm and Sander Score. This score is technically a pseudo-measure rather than a measure because it does not necessarily obey the triangle inequality.

The definition of distance measures remains an open problem. For reference on ongoing work, see articles that compare several methods, such as [6].

The first two papers in the reference list below are the original descriptions of the Kabsch Algorithm, and use rotations represented as orthonormal matrices to find the correct rotational transformation. Many software packages use this alignment method. The third and fourth papers use quaternions. The quaternion alignment method presented above comes from the third paper.

## References

1. Kabsch, W. (1976). A Solution for the Best Rotation to Relate Two Sets of Vectors. Acta Crystallographica, 32, 922-923.
2. Kabsch, W. (1978). A Discussion of the Solution for the Best Rotation to Relate Two Sets of Vectors. Acta Crystallographica, 34, 827-828.
3. Horn, Berthold K. P. (1987). Closed-form solution of absolute orientation using unit quaternions. Journal of the Optical Society of America A, 4, 629-642.
4. Coutsias, E. A., C. Seok and K. A. Dill. (2004). Using quaternions to calculate RMSD. Journal of Computational Chemistry, 25, 1849-1857.
5. Golub, G. H. and Van Loan, C. F. (1996). Matrix Computations. (3rd ed.). Johns Hopkins University Press.
6. Wallin, S., J. Farwer and U. Bastolla. (2003). Testing similarity measures with continuous and discrete protein models. Proteins, 50, 144-157.
7. Schwarzer, F. and Lotan, I. (2003). Approximation of protein structure for fast similarity measures. ACM. Proceedings of the seventh annual international conference on research in computational molecular biology.
