Inside Collection (Course): Bios 533 Bioinformatics
Summary: This module is designed to introduce the student to PSI-BLAST. PSI-BLAST is a bioinformatics tool, offered through NCBI, for identifying weak but biologically relevant sequence similarities.
PSI-BLAST (1) (Position-Specific Iterated BLAST) is a tool that constructs a position-specific scoring matrix from a multiple alignment of the top-scoring BLAST hits to a given query sequence. This scoring matrix serves as a profile that captures the key positions of conserved amino acids within a motif. When a profile is used to search a database, it can often detect subtle relationships between proteins that are distant structural or functional homologues. These relationships are often not detected by a BLAST search with a single query sequence.
For an oversimplified example of what a consensus sequence, or profile, looks like, consider that the EF-hand binding loop of the calmodulin family could be represented as follows:
Loop Position #   1         3    4    5    6         8                   12
Profile           D    x    D    G    D/N  G    x    I    x    x    x    E
Here "x" stands for positions where there is variability in amino acid type, and therefore, that position is not heavily weighted in the alignment. Comparing the profile to some actual binding loop sequences from different calmodulins is the best way to illustrate the derivation of this profile.
POSITION #      1       3   4   5   6       8               12
CALM_HUMAN_1    D   K   D   G   D   G   T   I   T   T   K   E
CALF_NAEGR_1    D   K   D   G   D   G   T   I   T   T   S   E
CALM_SCHPO_1    D   R   D   Q   D   G   N   I   T   S   N   E
CALM_HUMAN_2    D   A   D   G   N   G   T   I   D   F   P   E
CALF_NAEGR_2    D   A   D   G   N   G   T   I   D   F   T   E
CALM_SCHPO_2    D   A   D   G   N   G   T   I   D   F   T   E
CALM_HUMAN_3    D   K   D   G   N   G   Y   I   S   A   A   E
CALF_NAEGR_3    D   K   D   G   N   G   F   I   S   A   Q   E
CALM_SCHPO_3    D   K   D   G   N   G   Y   I   T   V   E   E
CALM_HUMAN_4    D   I   D   G   D   G   Q   V   N   Y   E   E
CALF_NAEGR_4    D   I   D   G   D   N   Q   I   N   Y   T   E
CALM_SCHPO_4    D   T   D   G   D   G   V   I   N   Y   E   E
The rules for deriving this simple profile are: 1) any position with 90% amino acid identity or greater is considered conserved in the profile, and thus a higher score would be given when the conserved amino acid is found at that position in the sequence, and 2) any position that always contains one of only two types of amino acids would be up-weighted to give a higher score whenever either of those two amino acids appears at that position. A program such as PSI-BLAST will employ more sophisticated rules to create a profile than this example, of course. It is easy to see even with these sequences that amino acid similarity could be taken into consideration in addition to amino acid identity, and exploited in the profile.
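The two rules above can be applied mechanically to the twelve binding loops listed in the alignment. The following Python sketch is only an illustration of this simplified example, not of how PSI-BLAST itself builds a profile; the function name derive_profile and the parameter identity_cutoff are our own, chosen for this example. Rule 1 is checked before rule 2, so a position such as 4 (eleven G and one Q) is reported as conserved rather than as a two-residue position.

```python
from collections import Counter

# The twelve EF-hand binding loops from the alignment above, one string per row.
LOOPS = [
    "DKDGDGTITTKE", "DKDGDGTITTSE", "DRDQDGNITSNE",
    "DADGNGTIDFPE", "DADGNGTIDFTE", "DADGNGTIDFTE",
    "DKDGNGYISAAE", "DKDGNGFISAQE", "DKDGNGYITVEE",
    "DIDGDGQVNYEE", "DIDGDNQINYTE", "DTDGDGVINYEE",
]


def derive_profile(seqs, identity_cutoff=0.90):
    """Apply the two simple rules column by column (illustrative only)."""
    profile = []
    for column in zip(*seqs):
        counts = Counter(column)
        residue, n = counts.most_common(1)[0]
        if n / len(column) >= identity_cutoff:
            # Rule 1: at least 90% identity, so the position is conserved.
            profile.append(residue)
        elif len(counts) == 2:
            # Rule 2: the position always holds one of exactly two residues.
            profile.append("/".join(sorted(counts)))
        else:
            # Anything else is treated as variable.
            profile.append("x")
    return profile


print(" ".join(derive_profile(LOOPS)))
```

Run on these sequences, the sketch reproduces the profile shown earlier: D x D G D/N G x I x x x E.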
Three common categories of homology are studied in relation to biological molecules: sequence homology, structural homology, and functional homology. Sequence homology is the easiest to identify, and is therefore the primary target of many bioinformatics methods. Sequence homology has direct implications for the relatedness of proteins and their potential pathways of derivation. However, to help understand how a protein is implicated in a certain disease state, or how to design a pharmaceutical that interacts with a given protein, functional and/or structural information is necessary. Functional homologues are relatively easy to define: they are any two proteins, or protein domains, that perform similar functions. Structural homologues contain similar "folds", which are localized regions of a molecule that comprise a structural feature such as a "beta barrel" or "four helical bundle" motif. The fold can encompass the entire protein, or just one domain of the protein. A good introduction to the topic of protein folds can be found at the website for the Internet Course on The Principles of Protein Structure organized by Birkbeck College (2). When considering sequence, functional, or structural homology, it is important to understand that one type of homology between proteins does not always imply another. Nevertheless, it is a reasonable assumption that proteins that are related through evolutionary pathways are likely to have some degree of all three types of homology. PSI-BLAST was engineered to identify distant relationships between sequences that are too subtle to discover with a regular BLAST search.
In the first round, PSI-BLAST is just like a normal BLAST; it finds sequence homologues. In the second round, or "iteration", PSI-BLAST builds a multiple alignment of the first-round hits and derives from it a profile, in the form of a position-specific scoring matrix, that records which positions evolution has conserved and which it has allowed to vary. Another search is then performed with this profile in place of the original query. The significant sequences found in each round are added to the profile for the next round, allowing PSI-BLAST to detect more distant homologues with each iteration.
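To make the idea of a position-specific scoring matrix more concrete, the sketch below builds a toy log-odds matrix from the EF-hand loops used earlier and scores two sequences against it. It rests on simplifying assumptions that are ours and not PSI-BLAST's: a uniform background frequency for all twenty amino acids, a flat pseudocount of 1 per residue, and no sequence weighting or gaps. The names build_pssm and score are hypothetical.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Same twelve EF-hand loops as in the alignment earlier in this module.
LOOPS = [
    "DKDGDGTITTKE", "DKDGDGTITTSE", "DRDQDGNITSNE",
    "DADGNGTIDFPE", "DADGNGTIDFTE", "DADGNGTIDFTE",
    "DKDGNGYISAAE", "DKDGNGFISAQE", "DKDGNGYITVEE",
    "DIDGDGQVNYEE", "DIDGDNQINYTE", "DTDGDGVINYEE",
]


def build_pssm(alignment, pseudocount=1.0):
    """Return one {residue: log-odds score} dict per alignment column."""
    background = 1.0 / len(AMINO_ACIDS)  # assumption: uniform background
    pssm = []
    for column in zip(*alignment):
        counts = Counter(column)
        total = len(column) + pseudocount * len(AMINO_ACIDS)
        pssm.append({
            aa: math.log2(((counts[aa] + pseudocount) / total) / background)
            for aa in AMINO_ACIDS
        })
    return pssm


def score(pssm, seq):
    """Sum the position-specific scores of seq (ungapped, same length)."""
    return sum(column[aa] for column, aa in zip(pssm, seq))


pssm = build_pssm(LOOPS)
print(round(score(pssm, "DKDGDGTITTKE"), 2))  # a real calmodulin loop
print(round(score(pssm, "AAAAAAAAAAAA"), 2))  # an unrelated sequence
```

Conserved positions contribute large positive scores for the conserved residue and negative scores for everything else, while variable positions contribute little either way. This is why a profile can reward a distant homologue that keeps only the key residues.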
One of the known weaknesses of PSI-BLAST is that its ability to detect distant relationships between proteins is critically dependent on the choice of the query sequence. For this reason, a recommended strategy with PSI-BLAST is to query using individual functional domains. PSI-BLAST will then find other proteins that share this domain, even if they do not possess overall homology. To acquaint the new user with PSI-BLAST, this tutorial mimics an investigation performed by Aravind and Koonin (3) in 1999, wherein new members of the HSP70/actin protein family were identified, although the analysis in this tutorial is on a much smaller scale than that presented in the paper. The HSP70/actin family members were originally recognized to have a common evolutionary origin as a result of a study performed by Bork et al. (4) in 1992. A structural superposition of the structures of actin, hexokinase, and the molecular chaperonin hsp70, and alignment of many sequences in each of the three families, uncovered a set of common conserved residues, distributed in five sequence motifs, that are involved in ATP binding and in a flexible interdomain hinge. Although these proteins perform very different functions, and their sequences are quite divergent, the similarity in the fold of the ATP-binding domain is visually recognizable. These are all ATP-dependent enzymes, and the patterns discovered by Bork and associates could not be detected by traditional BLAST-type sequence searches. Therefore, Aravind and Koonin chose this family as a test of PSI-BLAST's ability to detect distant evolutionary relationships.
Aravind and Koonin chose actin from the PDB file with accession code 1atn as one of their query sequences. Begin the query by retrieving this sequence from the PDB. Check the box for searching the PDB archive that says "PDB ID", then enter the accession code 1atn as the query. Notice that the crystal structure deposited in this entry contains DNase I complexed with actin. In the menu in the blue border on the left there is a link entitled "FASTA Sequence"; select this link to download the sequence file. This file will contain two sequences: 1ATN:A (actin) and 1ATN:D (DNase I). Copy the sequence for 1ATN:A and paste it into the query box that appears after choosing Protein BLAST, then select "PSI-BLAST" under the algorithm section. Change the database from "nr" to "swissprot", but accept the default values for everything else. Click on BLAST, then view the results. Note that each time another iteration of PSI-BLAST runs, the results page will indicate the iteration number. This is very helpful for keeping track of the stage of the results.
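For readers who would rather pull the actin chain out of the downloaded file with a script than copy it by hand, a minimal Python sketch follows. It assumes that each FASTA header begins with the chain label used above (for example ">1ATN:A|..."); the exact header layout of the file you download may differ, so adjust the matching accordingly. The sequences in EXAMPLE are short placeholders, not the real actin or DNase I sequences, and the function names are hypothetical.

```python
def parse_fasta(text):
    """Return {header: sequence} for every record in a FASTA-formatted string."""
    records = {}
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:]
            records[header] = []
        elif header is not None:
            records[header].append(line)
    return {h: "".join(parts) for h, parts in records.items()}


def select_chain(records, chain_label):
    """Pick the record whose header starts with chain_label before the first '|'."""
    for header, sequence in records.items():
        if header.split("|")[0] == chain_label:
            return sequence
    raise KeyError(chain_label)


# Placeholder file contents: the header layout is an assumption and the
# residues are dummy letters standing in for the real sequences.
EXAMPLE = """\
>1ATN:A|PDBID|CHAIN|SEQUENCE
AAAAACCCCC
DDDDD
>1ATN:D|PDBID|CHAIN|SEQUENCE
EEEEEFFFFF
"""

actin = select_chain(parse_fasta(EXAMPLE), "1ATN:A")
print(actin)
```

The string returned for 1ATN:A is what would be pasted into the PSI-BLAST query box.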
There is a statistical section at the very end of the BLAST report. BLAST query sequence windows will be lined up with similar regions in the database, then the algorithm will try to extend the alignment with the database sequence in both directions along the query sequence. What is the number of successful extensions for the first BLAST report?
What is, by far, the most common protein shown as a hit in the list of scores?
At the end of the list of scores, note the section entitled "Sequences with E-values WORSE than threshold". Why is it that PSI-BLAST includes this section, but BLAST does not?
Do any members from the Actin-like ATPase domain superfamily, other than actin (or actin-like proteins), show up as hits in this section? If so, name one.
There are several buttons within the results document entitled "Run PSI-BLAST iteration 2"; if they are difficult to locate, there is one at the end of the list of descriptions and scores. Click on one of these buttons to run a second iteration of PSI-BLAST. In the results window, the line above the graphical display should say "Results of PSI-Blast iteration 2". Keep track of what the next iteration number should be, and make sure it matches the iteration number that PSI-BLAST displays, to avoid getting lost within the PSI-BLAST search. When the results appear, look at the legend under the color alignment graph. It says that a yellow starburst with the word "NEW" inside it indicates a new sequence that was identified as a result of the most recent iteration, and a green dot indicates a sequence that was already present prior to the most recent iteration.
What is the number of successful extensions for the second BLAST iteration?
Did the second iteration locate any new sequences within (meaning less than or equal to) the threshold E-value? If so, how many?
Are there new members of the Actin-like ATPase domain superfamily in the section entitled "Sequences with E-value WORSE than threshold"? If so, name one.
Run a third iteration of PSI-BLAST. How many new sequences are identified as a result of the third iteration?
Are there new sequences returned that are within the threshold E-value that are NOT actin or actin-like proteins? If so, name one.
Finally, perform the first iteration of the PSI-BLAST search with the same query sequence as above, but with the database set to "nr".
What is the number of successful extensions for the first BLAST report when searching the nr database?
The "swissprot" database is a curated database, and all the descriptions on the return list were informative. However, searching against the large "nr" database yields a list of returns that should contain some proteins described as "unknown", or "unnamed". The advantage of "nr" is that usually many more hits are returned. Sometimes this is desirable, and sometimes it is not, but it is one thing to consider when designing a PSI-BLAST search strategy.
A PSI-BLAST search can be forced to include more sequences in the profile at a given iteration by altering the inclusion threshold. By default, any result with an E-value less than 0.005 is accepted. It is common to change this threshold to 0.05 if PSI-BLAST converges too quickly at 0.005. We did not take our search to convergence; that would require continuing iterations until an iteration returned no new hits. As this tutorial illustrates, PSI-BLAST is a useful tool, but it often requires putting some thought into the search strategy before it will produce meaningful results.
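The interplay between the inclusion threshold and convergence can be sketched as a simple loop. The code below is a toy model only: toy_search is a made-up stand-in for a real database search, its sequence names and E-values are invented for illustration, and iterate_to_convergence is a hypothetical helper rather than part of any BLAST software. It shows how a looser threshold admits more sequences into the profile, and how convergence is recognized when an iteration adds nothing new.

```python
def iterate_to_convergence(search, query, threshold=0.005, max_iterations=10):
    """Repeat the search until an iteration adds no new sequences.

    Returns (included ids, iterations run, whether the search converged).
    """
    included = set()
    for iteration in range(1, max_iterations + 1):
        hits = search(query, included)  # {sequence id: E-value}
        new = {sid for sid, e in hits.items() if e < threshold} - included
        if not new:
            return included, iteration, True
        included |= new
    return included, max_iterations, False


def toy_search(query, included):
    """Invented E-values: distant hits improve once a relative is in the profile."""
    hits = {"actin_like": 1e-50, "distant_1": 0.02}
    if "actin_like" in included:
        hits["distant_1"] = 1e-4
        hits["distant_2"] = 0.01
    return hits


for cutoff in (0.005, 0.05):
    found, rounds, converged = iterate_to_convergence(toy_search, "query", cutoff)
    print(cutoff, sorted(found), rounds, converged)
```

In this toy model the default cutoff never admits distant_2, whereas the looser cutoff of 0.05 does. The same trade-off applies to a real search, where a looser cutoff also raises the risk of admitting unrelated sequences into the profile.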