In a previous module, the topic of comparing and quantifying the distance between different conformations of a given molecule was explored. Structure-based comparison is also of interest for distinct proteins, which lack the atom-by-atom correspondence necessary for RMSD calculations. In this case, an alignment is performed either based on amino acid sequence or on three-dimensional structure, and the subset of atoms successfully aligned are used as the basis for calculating conformational distance. Computing distances among entire proteins by doing a global alignment of their structures is useful for protein classification.
Protein classification is motivated by the notion of "descriptive biology". When faced with tremendous amounts of highly complex data, such as with the set of all proteins, one way to understand the data is by classification: the act of associating or grouping proteins into classes using certain criteria. One such criterion is protein sequence identity, where sequential similarity led to the development of phylogenetic trees and multiple sequence analyses. The same is done in protein structure classification, where the effort is to identify groups of similar proteins, with the hope that this will yield information about their biochemical function and biological purpose.
Proteins are classified by simultaneously applying a number of criteria, including sequence homology (evolutionary relatedness), function, folding motifs, structural features, and so on. The resulting hierarchies and clusters of protein structures provide a notion of the distance between two proteins and their structures. A couple of popular classification schemes are linked below. Note that a fair amount of manual annotation and classification was necessary to build these systems.
The core computational problem of protein classification, using sequence or structure, is the problem of comparing two proteins. For structural classification, one method for comparison is structural alignment, which identifies an ideal superimposition between two protein structures, in order to compare them.
SSAP, Dali, Foldminer, Lock, and Geometric Hashing [1] are algorithms which have been designed in part to align whole protein structures. Despite differences in algorithmic approach, all of these algorithms essentially evolved from the need to assign the best possible correlation between points in one structure and points in another. The problem of finding the optimal alignment is polynomial in the number of atoms in biological data, where we are assured that atoms cannot fall within a certain distance to each other (Van der Waals forces enforce this), but without this constraint the problem is exponential.
Protein alignment has been used for the classification and comparison of proteins in many existing algorithms. These include:
Once alignment algorithms have been implemented, it is possible to explore different classifications of proteins. Naturally, it would be intuitive to classify proteins solely according to their structure, but much richer data is available as well. Current classifications of proteins integrate sequence and structure information in order to maximize their utility. These include the following:
The biological purpose of designing Global Stuructural Alignment algorithms was the identification, classification, and ultimately prediction, of protein function, under the hypothesis that protein structure dictates protein function. Simultaneously, it was realized that small changes to proteins could lead to massive changes in function, or nothing at all. This suggested that global alignment and global structure comparison (as well as global sequence alignment and global sequence comparison) should not be the only tools used for function prediction.
In particular, it was realized that active sites, clusterings of amino acids on the surface of proteins and a tiny minority of the entire protein, were often strongly related to protein function. In a continuation of the effort to predict protein function through structural comparison, algorithms were developed to compare functionally relevant substructures of proteins. We refer to these algorithms collectively as local structure alignment algorithms.
Algorithms for local structure alignment address the similar computational problem of selecting a correspondence between a motif, a tiny substructure of a protein, often between 3 and 20 amino acids, and a target, a full protein structure. Once a correspondence has been established, the "distance" of the motif to the identified part of the target is measured using lRMSD.
![]() |
Some algorithms for local structure alignment are based on pattern matching algorithms. Pattern matching algorithms seek target substructures called matches which have maximal geometric similarity to the motif. An excellent example of this type of algorithm is Geometric Hashing [1] , a very flexible framework for geometric pattern matching under noisy constraints. Geometric Hashing [1] , has been adapted to alignment by atom position [2] , by backbone C-alpha [3] , multiple structural alignment [4] , and alignment of hinge-bending and flexible protein models [3] . Other algorithms for substructural alignment include JESS [5] , PINTS [6] , webFEATURE [7] , and pvSOAR [8] . The description below concentrates on the work developed in [15] .
A motif S is a set of m points
Other motifs have been designed with other approaches. Motifs have been composed of points on the Connolly surface [11] representing electrostatic potentials [12] , of hinge-bending sets of points in space [13] , of sets of "pseudo-centers" representing protein-ligand interactions [14] , or of points taken from atom coordinates with evolutionary data [15] , [9] , to name a few. Depending on how motif points are defined, they have different labels associated with them and these labels need to be taken into account when comparing motifs.
Local structure comparison algorithms mainly target the biological problem of Protein Function Prediction. Ideally, a function prediction pipeline should address the following subproblems:
Suppose for a moment that we have hand-designed motifs. In the next section, we concentrate on an efficient algorithm for local structure comparison.
Many such algorithms exist, but differ fundamentally in that they are optimized for comparing different types of motifs. There are algorithms for comparing graph-based motifs [16] , algorithms for finding catalytic sites [5] , and the seminal Geometric Hashing framework [1] which can search for many types of motifs, including motifs based on atom position [2] , points on Connolly face centers [17] , catalytic triads [18] , and flexible protein models [13] . Match Augmentation (MA) [15] is described below.
Rank information prioritizes motif data and MA was designed in a prioritized fashion, where correspondences with higher ranked points are identified first. MA is composed of two parts: Seed Matching and Augmentation. The purpose of Seed Matching is to identify a match for the seed S' = {
Seed Matching identifies all sets of 3 target points T' = {
Augmentation is an application of depth first search that begins with the list of seed matches. Assuming that there are more than four motif points, we must find correspondences for the unmatched motif points within the target. Interpret the list of seed matches as a stack of partially complete matches. Pop off the first match, and considering the lRMSD alignment of this match, plot the position p of the next unmatched motif point
Structural similarity is important to functional annotation only if a strong correlation exists between identifiably significant structural similarity and functional similarity. However, the existence of a match alone does not guarantee functional similarity. lRMSD can be a differentiating factor. If matches of homologous proteins represent statistically significant structural similarity over what is expected by random chance, we could differentiate on lRMSD, as long as we can evaluate the statistical significance of the lRMSD of a match.
BLAST [19] first calculated the statistical significance of sequence matches with a combinatorial model of the space of similar sequences. Determining the statistical significance of structural matches has also been attempted. Modeling was applied for the PINTS database [6] to estimate the probability of a structural match given a particular LRMSD. An artificial distribution was parameterized by motif size and amino acid composition in order to fit a given data set, and the p-value is calculated relative to that distribution. Another approach was taken in the algorithm JESS [5] , using comparative analysis to generate a significance score relative to a specific population of known motifs. Both methods have some disadvantages. The artificial models of PINTS are not parameterized by the geometry of motifs, and, all else equal, produce identical distributions for motifs of different geometry. JESS, on the other hand, is dependent on a set of known motifs; should this set change, all significance scores would have to be revised.
Local structural alignment methods operate on the assumption that local structural and chemical similarity implies functional similarity. A statistical model has been developed that can be used to identify the degree of similarity sufficient to follow this implication. Given a match m with lRMSD r between motif S and target T, exactly one of two hypotheses must hold:
![]() |
Given a motif representing a specific biochemical function, the task of any algorithm seeking to predict protein functions is to identify other proteins with similar biological functions. This is immediately affected by how well the motif represents the biological function - protein structures are flexible, dynamic objects, and any computational representation of these objects is likely to be inaccurate in some manner. In particular, since it is the computational representation that is compared, the fidelity of representation is essential for the effectiveness of the structural comparison approach.
An effective motif must simultaneously fulfill two criteria:
Motifs must maintain geometric and chemical similarity, in respect to the characteristics compared by the comparison algorithm, to functional homologs. This ensures that algorithms searching for the motif identify proteins with similar function.
Sensitivity is measured as the proportion of acceptable matches to functional homologs relative to the total number of functional homologs. Specific motifs must maintain geometric and chemical dissimilarity, in respect to the characteristics compared by the comparison algorithm, to functionally unrelated proteins. This ensures that algorithms searching for the motif accidentally identify less matches to functionally unrelated proteins. Specificity is measured as the proportion of rejected matches to functionally unrelated proteins, relative to all functionally unrelated proteins. The ideally effective motif is 100% sensitive and 100% specific.
Under most conditions, effective motifs have been designed by hand. However, a few recently studies exist which examine the relationship between elements of the motif and the sensitivity and specificity of the motif. These studies resulted in the design of MultiBind [14] , an algorithm which identifies conserved binding patterns among functionally homologous active sites, in order to generate motifs which represent only conserved binding patterns. Another study of motif design produced Geometric Sieving [20] , an algorithm for maximizing the Geometric Uniqueness of motifs from functionally unrelated proteins.
Putting together effective motif design, an efficient matching algorithm, and a statistical scoring of the results can lead to an automated functional annotation pipeline for proteins.