Skip to content Skip to navigation

Connexions

You are here: Home » Content » Protein Classification, Local Alignment, and Motifs

Navigation

Content Actions

  • Download module PDF
  • Add to ...
    Add the module to:
    • My Favorites
    • A lens
    • An external social bookmarking service
    • My Favorites (What is 'My Favorites'?)
      'My Favorites' is a special kind of lens which you can use to bookmark modules and collections directly in Connexions. 'My Favorites' can only be seen by you, and collections saved in 'My Favorites' can remember the last module you were on. You need a Connexions account to use 'My Favorites'.
    • A lens (What is a lens?)

      Definition of a lens

      Lenses

      A lens is a custom view of Connexions content. You can think of it as a fancy kind of list that will let you see Connexions through the eyes of organizations and people you trust.

      What is in a lens?

      Lens makers point to Connexions materials (modules and collections), creating a guide that includes their own comments and descriptive tags about the content.

      Who can create a lens?

      Any individual Connexions member, a community, or a respected organization.

    • External bookmarks
  • E-mail the author

Recently Viewed

This feature requires Javascript to be enabled.

Protein Classification, Local Alignment, and Motifs

Module by: Lydia E. Kavraki

In a previous module, the topic of comparing and quantifying the distance between different conformations of a given molecule was explored. Structure-based comparison is also of interest for distinct proteins, which lack the atom-by-atom correspondence necessary for RMSD calculations. In this case, an alignment is performed either based on amino acid sequence or on three-dimensional structure, and the subset of atoms successfully aligned are used as the basis for calculating conformational distance. Computing distances among entire proteins by doing a global alignment of their structures is useful for protein classification.

Protein Classification

Protein classification is motivated by the notion of "descriptive biology". When faced with tremendous amounts of highly complex data, such as with the set of all proteins, one way to understand the data is by classification: the act of associating or grouping proteins into classes using certain criteria. One such criterion is protein sequence identity, where sequential similarity led to the development of phylogenetic trees and multiple sequence analyses. The same is done in protein structure classification, where the effort is to identify groups of similar proteins, with the hope that this will yield information about their biochemical function and biological purpose.

Proteins are classified by simultaneously applying a number of criteria, including sequence homology (evolutionary relatedness), function, folding motifs, structural features, and so on. The resulting hierarchies and clusters of protein structures provide a notion of the distance between two proteins and their structures. A couple of popular classification schemes are linked below. Note that a fair amount of manual annotation and classification was necessary to build these systems.

Protein Alignment

The core computational problem of protein classification, using sequence or structure, is the problem of comparing two proteins. For structural classification, one method for comparison is structural alignment, which identifies an ideal superimposition between two protein structures, in order to compare them.

SSAP, Dali, Foldminer, Lock, and Geometric Hashing [1] are algorithms which have been designed in part to align whole protein structures. Despite differences in algorithmic approach, all of these algorithms essentially evolved from the need to assign the best possible correlation between points in one structure and points in another. The problem of finding the optimal alignment is polynomial in the number of atoms in biological data, where we are assured that atoms cannot fall within a certain distance to each other (Van der Waals forces enforce this), but without this constraint the problem is exponential.

Protein alignment has been used for the classification and comparison of proteins in many existing algorithms. These include:

  • Dali is a structural comparison algorithm based on pairwise distance matrices. Dali uses patterns of residue contacts, similar to contact maps described above in the intramolecular distances section, in order to align structures. The alignments are found using a randomized (Monte Carlo) search.
  • FoldMiner and LOCK 2. FoldMiner finds protein structures similar to an input structure by performing alignment the query structures secondary structure elements with proteins in its database using the LOCK 2 algorithm. LOCK 2 uses a combination of geometric hashing [1] and dynamic programming to optimize the alignments of secondary structure elements of different proteins. Once a set of alignments to similar structures are found, motifs consisting of similar secondary structure arrangements are constructed and used to refine the similarity search.
  • Sequential Structure Alignment Program (SSAP) Given two protein structures, SSAP returns a structural alignment.

Protein Classification

Once alignment algorithms have been implemented, it is possible to explore different classifications of proteins. Naturally, it would be intuitive to classify proteins solely according to their structure, but much richer data is available as well. Current classifications of proteins integrate sequence and structure information in order to maximize their utility. These include the following:

  • SCOP (Structural Classification of Protein) is a database of all proteins whose structures have been determined, organized by family (evolutionary relationship), superfamily (structural and functional similarity), and fold (similar secondary structure, with similar arrangement and topological connections). SCOP was constructed largely by manual inspection and annotation.
  • CATH Protein Structure Classification is another database of protein structures, organized by Class (secondary structure content), Architecture (orientation of secondary structures), Topology (overall shape and connectivity), and Homologous Superfamily (evolutionary relationship inferred based on sequence and structure similarity). CATH uses SSAP (the Sequential Structural Alignment Program, a secondary structure element-based method) for structural comparison.

The biological purpose of designing Global Stuructural Alignment algorithms was the identification, classification, and ultimately prediction, of protein function, under the hypothesis that protein structure dictates protein function. Simultaneously, it was realized that small changes to proteins could lead to massive changes in function, or nothing at all. This suggested that global alignment and global structure comparison (as well as global sequence alignment and global sequence comparison) should not be the only tools used for function prediction.

In particular, it was realized that active sites, clusterings of amino acids on the surface of proteins and a tiny minority of the entire protein, were often strongly related to protein function. In a continuation of the effort to predict protein function through structural comparison, algorithms were developed to compare functionally relevant substructures of proteins. We refer to these algorithms collectively as local structure alignment algorithms.

Local Matching: Geometric Hashing, Pose Clustering and Match Augmentation

Algorithms for local structure alignment address the similar computational problem of selecting a correspondence between a motif, a tiny substructure of a protein, often between 3 and 20 amino acids, and a target, a full protein structure. Once a correspondence has been established, the "distance" of the motif to the identified part of the target is measured using lRMSD.

Figure 1
Figure 1 (MatchingPic.jpg)

Some algorithms for local structure alignment are based on pattern matching algorithms. Pattern matching algorithms seek target substructures called matches which have maximal geometric similarity to the motif. An excellent example of this type of algorithm is Geometric Hashing [1] , a very flexible framework for geometric pattern matching under noisy constraints. Geometric Hashing [1] , has been adapted to alignment by atom position [2] , by backbone C-alpha [3] , multiple structural alignment [4] , and alignment of hinge-bending and flexible protein models [3] . Other algorithms for substructural alignment include JESS [5] , PINTS [6] , webFEATURE [7] , and pvSOAR [8] . The description below concentrates on the work developed in [15] .

Motifs

A motif S is a set of m points s1s1, ..., smsm in three dimensions, whose coordinates are taken from backbone and side-chain atoms. Each motif point sisi in the motif has an associated rank psipsi , a measure of the functional significance of the motif point. Each sisi also has a set of alternate amino acid labels lsilsi in {GLY, ALA, ...}, which represent residues this amino acid has mutated to during evolution. Labels permit our motifs to simultaneously represent many homologous active sites with slight mutations, not just a single active site. In this paper, we obtain labels and ranks using Evolutionary Trace (ET) [9] , [10] .

Other motifs have been designed with other approaches. Motifs have been composed of points on the Connolly surface [11] representing electrostatic potentials [12] , of hinge-bending sets of points in space [13] , of sets of "pseudo-centers" representing protein-ligand interactions [14] , or of points taken from atom coordinates with evolutionary data [15] , [9] , to name a few. Depending on how motif points are defined, they have different labels associated with them and these labels need to be taken into account when comparing motifs.

Protein Function Prediction

Local structure comparison algorithms mainly target the biological problem of Protein Function Prediction. Ideally, a function prediction pipeline should address the following subproblems:

  • The design of effective motifs
  • The efficient identification of matches
  • The effective filtering of matches

Suppose for a moment that we have hand-designed motifs. In the next section, we concentrate on an efficient algorithm for local structure comparison.

Identification of Matches

Many such algorithms exist, but differ fundamentally in that they are optimized for comparing different types of motifs. There are algorithms for comparing graph-based motifs [16] , algorithms for finding catalytic sites [5] , and the seminal Geometric Hashing framework [1] which can search for many types of motifs, including motifs based on atom position [2] , points on Connolly face centers [17] , catalytic triads [18] , and flexible protein models [13] . Match Augmentation (MA) [15] is described below.

cnx.jpg

Match Augmentation

Rank information prioritizes motif data and MA was designed in a prioritized fashion, where correspondences with higher ranked points are identified first. MA is composed of two parts: Seed Matching and Augmentation. The purpose of Seed Matching is to identify a match for the seed S' = {s1s1,s2s2,s3s3}, the three highest ranked motif points. The k lowest lRMSD seed matches are passed to Augmentation to be iteratively expanded into matches for the remaining motif points, in descending rank order. Augmentation outputs the match with smallest lRMSD.

Seed Matching

Seed Matching identifies all sets of 3 target points T' = {tAtA,tBtB,tCtC} which fulfill our matching criteria with the highest ranked 3 motif points, S'= {s1s1,s2s2,s3s3 }. In this stage, we represent the target as a geometric graph with colored edges. There are exactly three unordered pairs of points in S', and we name them red, blue and green. In the target, if any pair of target points titi,tjtj fulfills our first two criteria with either red, blue or green, we draw a corresponding red blue or green edge between titi,tjtj in the target. Once we have processed all pairs of target points, we find all three-colored triangles in T. These are the Seed Matches, a set of three-point correlations to S' that we sort by lRMSD and pass to Augmentation.

Augmentation

Augmentation is an application of depth first search that begins with the list of seed matches. Assuming that there are more than four motif points, we must find correspondences for the unmatched motif points within the target. Interpret the list of seed matches as a stack of partially complete matches. Pop off the first match, and considering the lRMSD alignment of this match, plot the position p of the next unmatched motif point sisi relative to the aligned orientation of the motif. In the spherical region V around p, identify all target points titi, compatible with sisi, inside V. Now compute the lRMSD alignment of all correlated points, include the new correlation (sisi, titi). If the new alignment satisfies our first two criteria and there are no more unmatched motif points, put this match into a heap which maintains the match with smallest lRMSD. If there are more unmatched motif points, put this partial match back onto the stack. Continue to test correlations in this manner, until V contains no more target points that satisfy our criteria. Then, return to the stack, and begin again by popping off the first match on the stack, repeating this process until the stack is empty.

Filtering Matches

Structural similarity is important to functional annotation only if a strong correlation exists between identifiably significant structural similarity and functional similarity. However, the existence of a match alone does not guarantee functional similarity. lRMSD can be a differentiating factor. If matches of homologous proteins represent statistically significant structural similarity over what is expected by random chance, we could differentiate on lRMSD, as long as we can evaluate the statistical significance of the lRMSD of a match.

BLAST [19] first calculated the statistical significance of sequence matches with a combinatorial model of the space of similar sequences. Determining the statistical significance of structural matches has also been attempted. Modeling was applied for the PINTS database [6] to estimate the probability of a structural match given a particular LRMSD. An artificial distribution was parameterized by motif size and amino acid composition in order to fit a given data set, and the p-value is calculated relative to that distribution. Another approach was taken in the algorithm JESS [5] , using comparative analysis to generate a significance score relative to a specific population of known motifs. Both methods have some disadvantages. The artificial models of PINTS are not parameterized by the geometry of motifs, and, all else equal, produce identical distributions for motifs of different geometry. JESS, on the other hand, is dependent on a set of known motifs; should this set change, all significance scores would have to be revised.

Local structural alignment methods operate on the assumption that local structural and chemical similarity implies functional similarity. A statistical model has been developed that can be used to identify the degree of similarity sufficient to follow this implication. Given a match m with lRMSD r between motif S and target T, exactly one of two hypotheses must hold:

  • H0H0: S and T are structurally dissimilar
  • HAHA: S and T are structurally similar
The proposed statistical model implements this measurement by computing a motif profile. Motif profiles are frequency distributions (see Figure below) of match lRMSDs between S and the entire Protein Data Bank (PDB) [site], which is essentially a large set of functionally unrelated proteins. A motif profile is essentially a histogram, where the vertical axis measures the number of matches at each specific lRMSD, which is measured on the horizontal axis. Motif profiles provide very complete information about matches typical of H0H0. If we suspect that a match m has lRMSD r indicative of functional similarity, we can use the motif profile to determine the probability p of observing another match m' with smaller lRMSD by computing the volume under the curve to the left of r, relative to the entire volume. The probability p, often referred to as the p-value, is the measure of statistical significance. With a standard of statistical significance alpha, if p is less than alpha, then we say that the probability of observing a match m' with lRMSD r' less than r is so low that we reject the null hypothesis in favor of the alternative hypothesis. This process means that if a match m with lRMSD r has a p-value exactly equal to alpha, then this lRMSD is the highest lRMSD for which our statistical model predicts that m identifies structural and chemical similarity sufficient to imply functional similarity. Matches with this property are considered to be statistically significant.

Figure 2: The use of motif profiles.
Figure 2 (PvalueCalc.jpg)

Designing Effective Motifs

Given a motif representing a specific biochemical function, the task of any algorithm seeking to predict protein functions is to identify other proteins with similar biological functions. This is immediately affected by how well the motif represents the biological function - protein structures are flexible, dynamic objects, and any computational representation of these objects is likely to be inaccurate in some manner. In particular, since it is the computational representation that is compared, the fidelity of representation is essential for the effectiveness of the structural comparison approach.

An effective motif must simultaneously fulfill two criteria:

  • The motif must be sensitive
  • The motif must be specific

Motifs must maintain geometric and chemical similarity, in respect to the characteristics compared by the comparison algorithm, to functional homologs. This ensures that algorithms searching for the motif identify proteins with similar function.

Sensitivity is measured as the proportion of acceptable matches to functional homologs relative to the total number of functional homologs. Specific motifs must maintain geometric and chemical dissimilarity, in respect to the characteristics compared by the comparison algorithm, to functionally unrelated proteins. This ensures that algorithms searching for the motif accidentally identify less matches to functionally unrelated proteins. Specificity is measured as the proportion of rejected matches to functionally unrelated proteins, relative to all functionally unrelated proteins. The ideally effective motif is 100% sensitive and 100% specific.

Under most conditions, effective motifs have been designed by hand. However, a few recently studies exist which examine the relationship between elements of the motif and the sensitivity and specificity of the motif. These studies resulted in the design of MultiBind [14] , an algorithm which identifies conserved binding patterns among functionally homologous active sites, in order to generate motifs which represent only conserved binding patterns. Another study of motif design produced Geometric Sieving [20] , an algorithm for maximizing the Geometric Uniqueness of motifs from functionally unrelated proteins.

Putting together effective motif design, an efficient matching algorithm, and a statistical scoring of the results can lead to an automated functional annotation pipeline for proteins.

Recommended Reading:

References

  1. Wolfson H.J. and Rigoutsos I. (1997). Geometric Hashing: An Overview. IEEE Comp. Sci. Eng., 4, 10-21.
  2. Bachar O. et. al. (1993). A computer vision based technique for 3-D sequence independent structural comparison of proteins. Prot. Eng., 6, 279-288.
  3. Verbitsky G. and Nussinov R. and Wolfson H.J. (1999). Structural comparison allowing hinge bending. Prot: Struct. Funct. Genet., 34(2), 232-254.
  4. Liebowitz N. et. al. (2001). Automated Multiple Structure Alignment and Detection of a Common Substructural Motif. Prot: Struct. Func. Genet., 43, 235-245.
  5. Barker J.A. and Thornton J.M. (2003). An algorithm for constraint-based structural template matching: application to {3D} templates with statistical analysis. Bioinf., 19, 1644-1649.
  6. Stark A. and Sunyaev S. and Russell R.B. (2003). A Model for Statistical Significance of Local Similarities in Structure. J. Mol. Biol., 326, 1307-1316.
  7. Laing M.P. et.al. (2003). WebFEATURE: an interactive web tool for identifying and visualizing functional sites on macromolecular structures. Nucl. Acids Res., 31, 3324-7.
  8. Binkowski et al. (2003). Inferring Functional Relationships of Proteins from Local sequence and Spatial Surface Patterns. J. Mol. Biol., 332, 505-526.
  9. Lichtarge O. and Bourne H.R. and Cohen F.E. (1996). An Evolutionary Trace Method Defines Binding Surfaces Common to Protein Families. J. Mol. Biol., 257, 342-358.
  10. Lichtarge O. and Yamamoto K.R. and Cohen F.E. (1997). Identification of functional surfaces of the zinc binding domains of intracellular receptors. J.Mol.Biol., 274, 325-7.
  11. Connolly M.L. (1983). Solvent-accessible surfaces of proteins and nucleic acids. Science, 221, 709-713.
  12. Kinoshita K. and Nakamura H. (2003). Identification of protein biochemical functions by similarity search using the molecular surface database eF-site. Protein Science, 12, 1589–1595.
  13. Shatsky M. and Nussinov R. and Wolfson H.J. (2004). FlexProt: Alignment of Flexible Protein Structures Without a Predefinition of Hinge Regions. Journal of Computational Biology, 11, 83-106.
  14. Shatsky M. and Shulman-Peleg A. and Nussinov R. and Wolfson H.J. (2005). Recognition of Binding Patterns Common to a Set of Protein Structures. Proceedings of RECOMB 2005, 440-55.
  15. Chen B.Y. et al. (2005). Algorithms for Structural Comparison and Statistical Analysis of 3D Protein Motifs. Proceedings of Pacific Symposium on Biocomputing 2005, 334-45.
  16. Artymuik P.J. et. al. (1994). A Graph-theoretic Approach to the Identification of Three dimensional Patterns of Amino Acid Side Chains in Protein Structures. J. Mol. Biol., 243, 327-344.
  17. Rosen M. et. al. (1998). Molecular Shape comparisons in searches for active sites and functional similarity. Prot. Eng., 11, 263-277.
  18. Wallace A.C. and Laskowski R.A. and Thornton J.M. (1996). Derivation of 3D coordinate templates for searching structural databases. Prot. Sci., 5, 1001-13.
  19. Altschul S.F. et. al. (1997). Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucl. Acids. Res., 25, 3389-3402.
  20. Chen B.Y. et al. (2006). Geometric Sieving: Automated Distributed Optimization of 3D Motifs for Protein Function Prediction. Proceedings of The Tenth Annual International Conference on Computational Molecular Biology (RECOMB 2006), 500-512.

Comments, questions, feedback, criticisms?

Send feedback