The rational development of pharmaceutical drugs is one major purpose of structural computational biology. In this chapter we investigate the problem of pharmacophore identification, and the unknown receptor problem.
A pharmacophore is a specific, three dimensional map of biological properties common to all active conformations of a set of ligands which exhibit a particular activity. Conceptually, a pharmacophore is a distillation of the functional attributes of ligands which accomplish a specific task.
Pharmacophores are conceptual templates for drug design. Once it is extracted from a set of ligands, a pharmacophore can be used as a model for the design of other molecules that can accomplish the same activity. For example, consider the inhibitors of thermolysin in the following figure. These small molecules have between 5 and 11 torsional degress of freedom, generating a few hundred distinct, low energy conformations. Of these configurations, only a handful are functionally active in the inhibition of thermolysin. The pharmacophore, a 3 to 5 point abstraction, must be congruent to the functional components of at least one active conformation for each inhibitor, in order to represent the entire set.
| Inhibitors of Thermolysin |
|---|
The problem of pharmacophore identification is to generate the pharmacophore from structural data describing ligands and their interaction with the receptor. As many protein structures are described as sets of points, pharmacophore identification is commonly reduced directly to the problem of finding points common to all functional ligand conformations. This is a geometric problem called the Largest Common Pointset problem. However, since precisely congruent points are never actually part of the data, it is more accurate to classify this as the Largest Approximate Commpon Pointset problem. We begin by considering two manifestations of this problem, where only a few ligand conformations are considered, and very exacting methods are applied to determining the pharmacophore. We then describe pharmacophore identification when considering numerous ligand conformations.
Finding the Largest Common Pointset is well known to be NP-complete, and for ligand structures of size n, cannot approximated within n^e unless P = NP. Despite this, many approaches to this problem exist. One prominent approach is known as Geometric Hashing, which we will describe in the next section.
Geometric Hashing is a pattern matching technique originally developed by Y Lamdan for image recognition in computer vision research. This flexible technique has been widely studied, and has been shown to be quite successful in biological problems such as active site recognition and identification, functional annotation, and pharmacophore identification. Embarassingly parallel online processing and efficient data structures built in a preprocessing phase make this technique very successful in many environments. In this section we quickly describe Geometric Hashing and its application to pharmacophore identification.
Geometric Hashing was designed so that during a preprocessing phase, the system will learn a Motif (in this case, a active ligand configuration described by a series of points in space). Then, during an online processing phase, the system is exposed to a new pattern of points, the Target, from which it is to identify a subset of reasonable geometric similarity to the motif.
Geometric Hashing begins with a source motif we wish to match (Figure \ref{MotifAssemblyFig}a-b). We processes this by decomposing it into subpatterns (Figure \ref{MotifAssemblyFig}c-d). Each subpattern is a set of three points in space, but in order to get all possible subpatterns, we decompose the source motif into all subsets of three points (abbreviated in Figure FIXME). We refer to these as 3-plets. From each triple we record invariant measurements that remain the same despite rotation and translation. This is key to invariant matching techniques like Geometric Hashing.
| Preprocessing |
|---|
![]() |
The measurement that Geometric Hashing uses is the spatial relationship between 3-plets of points in 3D. Since three points in space define a triangle in a plane, we can take several simple measurements of this triangle, and use these measurements to compare with any other 3-plet, regardless of the orientation of the other 3-plet. The following description refers to Figure \ref{3PletMeasurement}.
For Geometric Hashing, these measurements are described as the length of one side ($\overline{ac}$), as well as the position of the point not on that side (point b) with respect to a 2D coordinate system centered at the centroid of all three points (the 2D coordinate system defined by axis vectors B and C). This can be calculated by simply defining the coordinate system with one axis parallel to the side we measured, and one perpendicular, always facing away from $\overline{ac}$. This produces a triple, $\{i,j,x\}$ that uniquely describes the triangle.
| How 3-plets are measured |
|---|
![]() |
Three points in general position also define a plane, and a centroid. At that centroid we can place the origin of a three dimensional orthogonal coordinate system, where one axis (A) is perpendicular to the plane, one axis is parallel with the side we measured above (B), and the last is perpendicular with these two (C). This coordinate system can be used as a local coordinate system for simple alignments. Two triangles with similar measurements can be aligned by simply considering the position of each point relative to this coordinate system - effectively aligning the two triangles by aligning the coordinate systems.
Geometric Hashing uses these 3-plet measurements as hash keys, for constant time lookup. Each 3-plet, then, becomes a tool for three dimensional alignment in the way we just described. We store each 3-plet in a hash bin associated with its key. Because 3-plets may vary continuously, they are stored in such a way that 3-plets stored in adjacent or close bins are very similar in shape (see Figure \ref{MotifHashBinFig}). This is a strong design advantage that is used later.
| Geometric Hashing Hash Bins |
|---|
![]() |
After the triangles of the source motif have all been generated and stored, preprocessing is complete. The hash table can be stored for later recognition of this motif in other structures later, and never needs to be recalculated.
Now that the source motif is processed and stored in the hash table, we need to compare it with the target pattern. This is the primary purpose of the Online Processing phase.
Much like the decomposition that occurred with the source motif, we repeat the same decomposition process for the target pattern. However, each time a new 3-plet is generated, rather than storing it, we calculate the hash key for this 3-plet and then query the hash table. Using the property we discussed earlier, querying the hash table for the key results in finding several similar 3-plets which were part of the source. As we mentioned earlier, it is simple to align these triangles and find an acceptable alignment. Symmetrical cases are detected and processed according to the number of symmetries. The transformation necessary for the structural alignment is then recorded into a tally, which records each transformation generated.
Many spurious matches may have three matching points; a spurious 3-plet alignment. But an alignment of more than three points clearly involves more than one pair of matching 3-plets. Since the source motif is rigid, then the transformations aligning these 3-plets must be identical. Therefore, the largest alignment has the most pairs of similar 3-plets, and thus the most identical transformations associated with it. In practice, not all 3-plets are perfectly aligned, but this still means that the largest clusters of similar transformations mark the region where the transformation generating the largest alignment can be found.
Finding the transformation that corresponds to this alignment is therefore a clustering problem: the densest clusters represent areas in the space of rigid transformations where structural similarity (in terms of numbers of well aligned 3-plets) is highest. A standard implementation of Geometric Hashing calls for application of a clustering algorithm to discriminate clusters of similar transformations, and then generate a representative optimum alignment from each cluster. In this stage we take the set of representative alignments and align the entire source and target. These ``best'' alignments are then submitted as output.
Another approach towards finding the Largest Common Pointset is an algorithm called DISCO. DISCO operates on two ligand structures by generating a graph along the following definition:
This means that for a DISCO graph G, finding a clique, a set of nodes n1, n2, ... nk where for any i,j less than k, the edge (ni,nj) is in G, implies finding a set of reasonably congruent points common to both ligand structures. However, finding the largest clique is a well known NP complete problem referred to as MAX-CLIQUE. This is a significant problem, but because ligand structures often have very few points, computation of the largest clique in G is often tractable.
To this end, a standard clique detection algorithm due to Bron and Kerbosch can be applied to detect cliques in G. In addition, if multiple ligands are available for pharmacophore identification, then one can be chosen as a reference, and the rest compared to it, to find a consensus pharmacophore.
When considering numerous ligand structures, it becomes apparent that these techniques are too time consuming. Geometric Hashing runs in worst case O(n^3), and on averge in O(n^2) in the number of points in the ligand structure, and Bron-Kerbosch is an exponential algorithm for solving MAX-CLIQUE. If this is the comparison cost for just a pair of structures, the overall cost for considering tens of structures can be expensive.
When considering numerous ligand structures, it becomes apparent that these techniques are too time consuming. Geometric Hashing runs in worst case O(n^3), and on averge in O(n^2) in the number of points in the ligand structure, and Bron-Kerbosch is an exponential algorithm for solving MAX-CLIQUE. If this is the comparison cost for just a pair of structures, the overall cost for considering tens of structures can be expensive.
ADD STUFF HERE