Summary: This module introduces students to a family of algorithms for assessing molecular shape, volume, surface area, and negative space (i.e., pockets and cavities).
Many problems in structural biology require a researcher to understand the shape of a protein. At first glance, this may seem obvious: by opening a molecular visualizer, one can easily see the shape of a protein. But what about calculating the surface area or volume of the protein? What about performing analyses of the surface, such as looking for concave pockets in a protein that might be binding sites for other molecules? What about calculating the volume and shape of those empty binding pockets, in order to find molecules that might fit in them? What about determining whether a particular small molecule can fit in a binding pocket?
All of these problems require some formal notion of the shape of a protein. A protein structure file usually provides no more information than a list of atom locations in space and their types. We assume that, for any given application, a radius can be defined for each atom type. This leads to the space-filling representation of a protein, in which each atom is treated as an impenetrable sphere.
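As a minimal sketch of the space-filling representation, the following Python snippet assigns a radius to each atom type and tests whether a point lies inside any atomic ball. The radius table holds commonly quoted van der Waals values and is an assumption for illustration, as are the function name `inside_space_filling_model` and the three-atom fragment; a real application would choose radii to suit its own model.

```python
import math

# Illustrative van der Waals radii in angstroms (assumed values; real
# applications choose radii to suit their force field or surface definition).
RADII = {"C": 1.70, "N": 1.55, "O": 1.52, "H": 1.20, "S": 1.80}

def inside_space_filling_model(point, atoms):
    """Return True if `point` lies inside at least one atomic ball.

    `atoms` is a list of (element, (x, y, z)) pairs, mimicking the
    type-plus-coordinates information found in a structure file.
    """
    for element, center in atoms:
        if math.dist(point, center) <= RADII[element]:
            return True
    return False

# A hypothetical three-atom fragment, not taken from a real structure.
atoms = [("C", (0.0, 0.0, 0.0)), ("O", (1.2, 0.0, 0.0)), ("N", (-1.3, 0.5, 0.0))]
print(inside_space_filling_model((0.5, 0.0, 0.0), atoms))  # point near the carbon
print(inside_space_filling_model((5.0, 5.0, 5.0), atoms))  # point far from every atom
```

A point-membership test like this is enough to estimate volume by sampling, but it says nothing about surfaces or pockets, which is why a more structured description of shape is needed.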
| HIV-1 Protease |
|---|
![]() |
Using the sphere model for atoms, one way to define the shape of a molecule is as the union of (possibly overlapping) balls in three-dimensional space.
| Space Filling Diagram |
|---|
![]() |
| Representations of Molecular Shape |
|---|

| Solvent Accessible Surface Area |
|---|
Part of the problem with defining the shape of a protein is that we start with nothing but a point set, and the "shape" of a set of discrete points is poorly defined. As you saw above, the shape of a molecule depends on what is being used to measure it. To handle this ambiguity, we will introduce a method of shape calculation based on a parameter, α, which determines the radius of a spherical probe that defines the surface. The method defines a class of shapes, called α-shapes [4], for any given point set. It allows fast and accurate calculations of volume and surface area.
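α-shapes are a generalization of the convex hull. Consider a point set S. Define an α-ball as a ball of radius α. An α-ball is empty if it contains no points of S in its interior. For any α between zero and infinity, the α-hull of S is the complement of the union of all empty α-balls. As α grows toward infinity, the empty α-balls approach half-spaces and the α-hull approaches the convex hull of S; as α shrinks toward zero, the α-hull shrinks to the point set itself.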
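The empty-ball definition can be tried out directly in two dimensions. The sketch below, under the assumption that points are 2D tuples, asks whether two points of S both lie on the boundary of some empty α-ball, which is a common test for whether the segment between them belongs to the boundary of the shape. The function names are hypothetical and the brute-force scan over S is chosen for clarity, not speed.

```python
import math

def alpha_ball_centers(p, q, alpha):
    """Centers of the two circles of radius alpha passing through p and q (2D)."""
    d = math.dist(p, q)
    if d == 0 or d > 2 * alpha:
        return []                                   # no such circle exists
    mx, my = (p[0] + q[0]) / 2, (p[1] + q[1]) / 2   # midpoint of pq
    h = math.sqrt(alpha ** 2 - (d / 2) ** 2)        # midpoint-to-center distance
    ux, uy = (q[1] - p[1]) / d, (p[0] - q[0]) / d   # unit normal to pq
    return [(mx + h * ux, my + h * uy), (mx - h * ux, my - h * uy)]

def is_empty(center, alpha, points, eps=1e-9):
    """An alpha-ball is empty if no point of S lies strictly inside it."""
    return all(math.dist(center, s) >= alpha - eps for s in points)

def on_alpha_boundary(p, q, alpha, points):
    """True if some empty alpha-ball has both p and q on its boundary."""
    return any(is_empty(c, alpha, points) for c in alpha_ball_centers(p, q, alpha))

# Corners of a unit square plus its center (a made-up point set).
S = [(0, 0), (1, 0), (1, 1), (0, 1), (0.5, 0.5)]
print(on_alpha_boundary((0, 0), (1, 0), 1.0, S))   # side of the square
print(on_alpha_boundary((0, 0), (1, 1), 1.0, S))   # diagonal of the square
print(on_alpha_boundary((0, 0), (1, 0), 0.4, S))   # probe too small to span the side
```

With α = 1 the side of the square is supported by an empty ball sitting outside the square, while every ball of that radius through the two diagonal corners swallows another point of S.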
| Two-Dimensional α-Shapes |
|---|
![]() |
A triangulation of a three-dimensional point set S is any decomposition of the convex hull of S into non-overlapping tetrahedra whose vertices are the points of S (triangles for two-dimensional point sets). The Delaunay triangulation of S is the triangulation satisfying the additional requirement that no sphere circumscribing a tetrahedron in the triangulation contains any point of S in its interior; it is unique when the points are in general position. Although it is incidental to α-shapes, it is worth noting that in two dimensions the Delaunay triangulation maximizes the smallest angle over all triangles. In other words, it favors relatively even-sided triangles over sharp and stretched ones.
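The empty-circumsphere requirement can be checked literally for a small two-dimensional point set. The sketch below is a brute-force illustration of the definition, not one of the efficient algorithms analyzed in [1]: it tries every triple of points and keeps those whose circumcircle contains no other point. The function names are hypothetical, and the method takes on the order of n^4 steps, so it is only suitable for toy inputs.

```python
from itertools import combinations

def circumcircle(a, b, c):
    """Return (center, squared radius) of the circle through a, b, c, or None if collinear."""
    ax, ay = a
    bx, by = b
    cx, cy = c
    d = 2 * (ax * (by - cy) + bx * (cy - ay) + cx * (ay - by))
    if abs(d) < 1e-12:
        return None
    a2, b2, c2 = ax * ax + ay * ay, bx * bx + by * by, cx * cx + cy * cy
    ux = (a2 * (by - cy) + b2 * (cy - ay) + c2 * (ay - by)) / d
    uy = (a2 * (cx - bx) + b2 * (ax - cx) + c2 * (bx - ax)) / d
    return (ux, uy), (ax - ux) ** 2 + (ay - uy) ** 2

def brute_force_delaunay(points, eps=1e-9):
    """Keep every triangle whose circumcircle has no other point strictly inside."""
    triangles = []
    for i, j, k in combinations(range(len(points)), 3):
        circle = circumcircle(points[i], points[j], points[k])
        if circle is None:
            continue                      # skip degenerate (collinear) triples
        (ux, uy), r2 = circle
        if all((px - ux) ** 2 + (py - uy) ** 2 >= r2 - eps
               for n, (px, py) in enumerate(points) if n not in (i, j, k)):
            triangles.append((i, j, k))
    return triangles

# Four made-up points; triangles are reported as index triples into this list.
pts = [(0, 0), (3, 0), (0, 3), (2, 2)]
print(brute_force_delaunay(pts))
```

For these four points the triangle on indices (0, 1, 2) is rejected because the point (2, 2) falls inside its circumcircle, so the quadrilateral is split along the other diagonal instead.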
| Two-Dimensional Delaunay Triangulation |
|---|
![]() |
O(n^2) time, but expected O(n^(3/2)) time. Without the sort in the first step, the expected case would be O(n log n). A full description and analysis of Delaunay triangulation algorithms is given in [1], chapter 9.
From the Delaunay triangulation, the α-shape is computed by removing all edges, triangles, and tetrahedra that have circumscribing spheres with radius greater than α. Formally, the α-complex is the part of the Delaunay triangulation that remains after this removal, and the α-shape is the boundary of the α-complex.
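The filtering step can be sketched in two dimensions as follows. The snippet assumes a Delaunay triangulation is already available as index triples (here a hand-checked one for four made-up points), keeps the triangles whose circumradius is at most α, and reports the edges that belong to exactly one surviving triangle as the boundary. It is a simplification: edges and vertices that survive on their own, without any triangle, are ignored, and the function names are hypothetical.

```python
import math
from collections import Counter

def circumradius(a, b, c):
    """Circumradius of a 2D triangle: product of side lengths over 4 times the area."""
    la, lb, lc = math.dist(b, c), math.dist(a, c), math.dist(a, b)
    area = abs((b[0] - a[0]) * (c[1] - a[1]) - (c[0] - a[0]) * (b[1] - a[1])) / 2
    return la * lb * lc / (4 * area)

def alpha_complex_triangles(points, triangles, alpha):
    """Drop every triangle whose circumscribing circle has radius greater than alpha."""
    return [t for t in triangles
            if circumradius(*(points[i] for i in t)) <= alpha]

def boundary_edges(triangles):
    """Edges used by exactly one triangle form the boundary of the kept region."""
    counts = Counter()
    for i, j, k in triangles:
        for edge in ((i, j), (j, k), (i, k)):
            counts[tuple(sorted(edge))] += 1
    return sorted(e for e, n in counts.items() if n == 1)

points = [(0, 0), (1, 0), (0, 1), (4, 4)]
delaunay = [(0, 1, 2), (1, 2, 3)]     # assumed input: Delaunay triangles as index triples

kept = alpha_complex_triangles(points, delaunay, alpha=1.0)
print(kept)                           # only the small triangle survives
print(boundary_edges(kept))
```

Raising α to 3 keeps both triangles, and the boundary then runs around the whole quadrilateral; the shared edge (1, 2) drops out because two triangles use it.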
Pockets [3] can be detected by comparing the α-shape to the whole Delaunay triangulation. Missing tetrahedra represent indentations, concavity, and generally negative space in the overall volume occupied by the protein. Particularly large or deep pockets may indicate a substrate binding site.
Regular α-shapes can be extended to deal with varying weights (i.e., spheres with different radii, such as different types of atoms) [2]. The formal definitions become complicated, but the key idea is to use a pseudo distance measure that uses the weights. Suppose we have two atoms at positions p1 and p2 with weights w1 and w2. Then the pseudo distance is defined as the square of the Euclidean distance minus the weights: |p1 − p2|^2 − w1 − w2. The pseudo distance is zero if and only if the two spheres centered at p1 and p2 with radii equal to sqrt(w1) and sqrt(w2) intersect at right angles, that is, when |p1 − p2|^2 = w1 + w2. It is positive when the spheres are farther apart than that and negative when they overlap more deeply.
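The pseudo distance is simple enough to write down directly. The sketch below is a plain transcription of "squared Euclidean distance minus the weights"; the function name `pseudo_distance` and the sample coordinates are assumptions for illustration. With radii 3 and 4 and centers 5 apart, the squared distance equals the sum of the weights, so the value is zero.

```python
def pseudo_distance(p1, w1, p2, w2):
    """Squared Euclidean distance between p1 and p2, minus both weights."""
    squared = sum((a - b) ** 2 for a, b in zip(p1, p2))
    return squared - w1 - w2

# Weights are squared radii: radius 3 gives weight 9, radius 4 gives weight 16.
print(pseudo_distance((0, 0, 0), 9.0, (5, 0, 0), 16.0))   # 25 - 9 - 16
print(pseudo_distance((0, 0, 0), 0.0, (5, 0, 0), 0.0))    # zero weights: plain squared distance
```

Setting both weights to zero recovers the ordinary squared distance, which is why the unweighted α-shape is a special case of the weighted one.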
The volume of a molecule can be approximated using the space-filling model, in which each atom is modeled as a ball whose radius is α, where α is selected depending on the model being used: Van der Waals surface, molecular surface, solvent accessible surface, etc. Unfortunately, calculating the volume is not as simple as taking the sum of the ball volumes, because the balls may overlap and the overlapping regions would be counted more than once. If two balls overlap, the volume is the sum of the volumes of the balls minus the volume of the overlap, which was counted twice. If three overlap, the volume is the sum of the ball volumes, minus the volume of each pairwise overlap, plus the volume of the three-way overlap, which was subtracted one too many times in accounting for the pairwise overlaps. In the general case, all pairwise, three-way, four-way, and so on up to n-way intersections (assuming there are n atoms) must be considered, for up to 2^n − 1 terms in total. Proteins generally have thousands or tens of thousands of atoms, so the general n-way case is computationally impractical and, because it adds and subtracts so many terms, prone to accumulated numerical error.
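The two-ball case can be worked through in a few lines. The sketch below uses the standard closed-form volume of the lens where two balls intersect and subtracts it once from the sum of the ball volumes; it also counts how many terms the unrestricted n-ball expansion can contain. The function names are hypothetical, and nothing here extends to three or more balls.

```python
import math

def ball_volume(r):
    return 4.0 / 3.0 * math.pi * r ** 3

def overlap_volume(r1, r2, d):
    """Volume of the intersection of two balls whose centers are d apart."""
    if d >= r1 + r2:
        return 0.0                           # disjoint or just touching
    if d <= abs(r1 - r2):
        return ball_volume(min(r1, r2))      # one ball lies inside the other
    return (math.pi * (r1 + r2 - d) ** 2
            * (d * d + 2 * d * (r1 + r2) - 3 * (r1 - r2) ** 2) / (12 * d))

def union_volume(r1, r2, d):
    """Inclusion-exclusion for two balls: add both volumes, subtract the overlap once."""
    return ball_volume(r1) + ball_volume(r2) - overlap_volume(r1, r2, d)

def inclusion_exclusion_terms(n):
    """Number of non-empty subsets of n balls, each a potential term in the expansion."""
    return 2 ** n - 1

print(union_volume(1.0, 1.0, 1.0))       # two unit balls with centers 1 apart
print(inclusion_exclusion_terms(3))      # 3 balls + 3 pairs + 1 triple
print(inclusion_exclusion_terms(50))     # already far too many to enumerate
```

For two unit balls one unit apart, the overlap is 5π/12 and the union is 9π/4, which matches the sum of two spherical caps computed by hand.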
α-shapes provide a way around this undesirable combinatorial complexity [2], and this issue has been one of the motivating factors for introducing α-shapes. To calculate the volume of a protein, we take the sum of all ball volumes, then subtract only those pairwise intersections for which a corresponding edge exists in the α-complex. Only those three-way intersections for which the corresponding triangle is in the α-complex must then be added back. Finally, only four-way intersections corresponding to tetrahedra in the α-complex need to be subtracted. No higher-order intersections are necessary, and the number of volume calculations necessary corresponds directly to the complexity of the α-complex, which is O(n log n) in the number of atoms.
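The idea of keeping only the terms that correspond to simplices of the α-complex can be illustrated with a one-dimensional analogue, offered here as an assumption-laden toy and not as the three-dimensional formulas of [2]. On a line, balls of radius r are intervals, the Delaunay triangulation joins consecutive points, and an edge is in the α-complex (with α = r) when its two intervals overlap. The sketch compares the restricted sum against a direct merge of the intervals; all function names are hypothetical.

```python
def union_length_merge(centers, r):
    """Reference answer: sweep the intervals [c - r, c + r] in order and merge them."""
    total, current_end = 0.0, None
    for c in sorted(centers):
        lo, hi = c - r, c + r
        if current_end is None or lo > current_end:
            total += hi - lo                 # starts a new connected piece
        else:
            total += hi - current_end        # extends the current piece
        current_end = hi
    return total

def union_length_alpha(centers, r):
    """Sum of interval lengths minus overlaps along alpha-complex edges only."""
    xs = sorted(centers)
    total = 2 * r * len(xs)                  # one term per ball
    for left, right in zip(xs, xs[1:]):      # 1D Delaunay edges: consecutive points
        gap = right - left
        if gap <= 2 * r:                     # edge is in the alpha-complex
            total -= 2 * r - gap             # subtract that pairwise overlap
    return total

centers = [0.0, 1.0, 1.5, 5.0]
print(union_length_merge(centers, 1.0))
print(union_length_alpha(centers, 1.0))
```

In this example the intervals around 0 and 1.5 also overlap, and all three leftmost intervals share a common stretch, but those two extra terms have equal size and opposite sign in the full expansion, so dropping both leaves the total unchanged.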
An example of how this approach works is given on page 4 of the Liang et al. article in the Recommended Reading section below. A proof of correctness and derivation is also provided in the article. Surface area calculations, such as solvent-accessible surface area, which is often used to estimate the strength of interactions between a protein and the solvent molecules surrounding it, are made by a similar use of the α-complex.