PSI-BLAST

Module by: Susan Cates.

Summary: This module introduces the student to PSI-BLAST, a bioinformatics tool offered through NCBI for identifying weak but biologically relevant sequence similarities.

PSI-BLAST (Position-Specific Iterated BLAST) (1) is a tool that builds a position-specific scoring matrix from a multiple alignment of the top-scoring BLAST hits for a given query sequence. This scoring matrix serves as a profile designed to capture the key positions of conserved amino acids within a motif. When a profile is used to search a database, it can often detect subtle relationships between proteins that are distant structural or functional homologues. These relationships are often missed by an ordinary BLAST search with a single query sequence.

For an oversimplified example of what a consensus sequence, or profile, looks like, consider that the EF-hand binding loop of the calmodulin family could be represented as follows:


        Loop Position #         1   3 4  5  6   8       12
        Profile                 D x D G D/N G x I x x x E
        

Here "x" stands for positions where there is variability in amino acid type, and therefore, that position is not heavily weighted in the alignment. Comparing the profile to some actual binding loop sequences from different calmodulins is the best way to illustrate the derivation of this profile.


                POSITION #      1   3 4 5 6   8       12
                CALM_HUMAN_1    D K D G D G T I T T K E
                CALF_NAEGR_1    D K D G D G T I T T S E
                CALM_SCHPO_1    D R D Q D G N I T S N E

                CALM_HUMAN_2    D A D G N G T I D F P E
                CALF_NAEGR_2    D A D G N G T I D F T E
                CALM_SCHPO_2    D A D G N G T I D F T E

                CALM_HUMAN_3    D K D G N G Y I S A A E
                CALF_NAEGR_3    D K D G N G F I S A Q E 
                CALM_SCHPO_3    D K D G N G Y I T V E E

                CALM_HUMAN_4    D I D G D G Q V N Y E E
                CALF_NAEGR_4    D I D G D N Q I N Y T E
                CALM_SCHPO_4    D T D G D G V I N Y E E

The rules for deriving this simple profile are: 1) any position with 90% amino acid identity or greater is considered conserved in the profile, and thus a higher score would be given when the conserved amino acid is found at that position in the sequence, and 2) any position that always contains one of only two types of amino acids would be up-weighted to give a higher score whenever either of those two amino acids appears at that position. A program such as PSI-BLAST will employ more sophisticated rules to create a profile than this example, of course. It is easy to see even with these sequences that amino acid similarity could be taken into consideration in addition to amino acid identity, and exploited in the profile.
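The two rules above can be sketched in a few lines of Python. This is only an illustration of the simplified rules in this example, not of how PSI-BLAST builds its profile. The function name `simple_profile` and the `identity_cutoff` parameter are made up for this sketch, and the sequences are the twelve binding loops listed above with the spaces removed.

```python
from collections import Counter

# The twelve EF-hand binding loops from the alignment above (spaces removed).
LOOPS = [
    "DKDGDGTITTKE", "DKDGDGTITTSE", "DRDQDGNITSNE",  # loop 1: HUMAN, NAEGR, SCHPO
    "DADGNGTIDFPE", "DADGNGTIDFTE", "DADGNGTIDFTE",  # loop 2
    "DKDGNGYISAAE", "DKDGNGFISAQE", "DKDGNGYITVEE",  # loop 3
    "DIDGDGQVNYEE", "DIDGDNQINYTE", "DTDGDGVINYEE",  # loop 4
]


def simple_profile(seqs, identity_cutoff=0.90):
    """Apply the two rules from the text to each column of an ungapped alignment."""
    profile = []
    for column in zip(*seqs):
        counts = Counter(column)
        residue, n = counts.most_common(1)[0]
        if n / len(column) >= identity_cutoff:
            # Rule 1: at least 90% identity, so the position counts as conserved.
            profile.append(residue)
        elif len(counts) == 2:
            # Rule 2: the position always holds one of exactly two amino acids.
            profile.append("/".join(sorted(counts)))
        else:
            # Anything else is treated as variable.
            profile.append("x")
    return profile


print(" ".join(simple_profile(LOOPS)))  # prints: D x D G D/N G x I x x x E
```

Rule 1 is checked before rule 2, which is why position 4 (eleven G and one Q) is reported as G rather than G/Q.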

Three categories of homology are commonly studied in biological molecules: sequence homology, structural homology, and functional homology. Sequence homology is the easiest to identify and is therefore the primary target of many bioinformatics methods. It bears directly on the relatedness of proteins and their potential pathways of derivation. However, understanding how a protein is implicated in a certain disease state, or designing a pharmaceutical that interacts with a given protein, requires functional and/or structural information. Functional homologues are relatively easy to define: they are any two proteins, or protein domains, that perform similar functions. Structural homologues contain similar "folds", which are localized regions of a molecule that comprise a structural feature such as a "beta barrel" or "four helical bundle" motif. The fold can encompass the entire protein or just one domain of the protein. A good introduction to the topic of protein folds can be found at the website for the Internet Course on The Principles of Protein Structure organized by Birkbeck College (2). When considering sequence, functional, or structural homology, it is important to understand that one type of homology between proteins does not always imply another. Nevertheless, it is reasonable to assume that proteins related through evolutionary pathways are likely to share some degree of all three types of homology. PSI-BLAST was engineered to identify distant relationships between sequences that are too subtle to discover with a regular BLAST search.

In the first round, PSI-BLAST behaves just like a normal BLAST search: it finds sequence homologues of the query. In the second round, or "iteration", PSI-BLAST builds a multiple alignment from the significant first-round hits and works out which residues tend to be conserved, creating a custom profile with a score for each position of the sequence. This position-specific scoring matrix reflects which positions evolution has conserved and which positions it has allowed to vary. The database is then searched again, using the profile in place of the original query. The sequences that pass the threshold in each round are added to the profile, allowing PSI-BLAST to detect more distant homologues with each iteration.
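To make the idea of a position-specific scoring matrix more concrete, the sketch below turns the columns of a small ungapped alignment into log-odds scores. It is a deliberately simplified scheme and rests on assumptions that are not PSI-BLAST's own: a uniform background frequency of 1/20 for every amino acid, a fixed pseudocount of 1, and no sequence weighting. The function names are invented for this illustration, and the input is the set of calmodulin binding loops shown earlier in this module.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Calmodulin EF-hand binding loops from the alignment earlier in this module.
LOOPS = [
    "DKDGDGTITTKE", "DKDGDGTITTSE", "DRDQDGNITSNE",
    "DADGNGTIDFPE", "DADGNGTIDFTE", "DADGNGTIDFTE",
    "DKDGNGYISAAE", "DKDGNGFISAQE", "DKDGNGYITVEE",
    "DIDGDGQVNYEE", "DIDGDNQINYTE", "DTDGDGVINYEE",
]


def column_scores(column, pseudocount=1.0):
    """Log-odds score (in bits) of every amino acid at one alignment column."""
    counts = Counter(column)
    total = len(column) + pseudocount * len(AMINO_ACIDS)
    background = 1.0 / len(AMINO_ACIDS)  # assumption: uniform background
    return {
        aa: math.log2(((counts[aa] + pseudocount) / total) / background)
        for aa in AMINO_ACIDS
    }


def build_pssm(seqs, pseudocount=1.0):
    """One score table per alignment position."""
    return [column_scores(col, pseudocount) for col in zip(*seqs)]


def score(pssm, seq):
    """Sum the position-specific scores of a sequence of the same length."""
    return sum(col[aa] for col, aa in zip(pssm, seq))


pssm = build_pssm(LOOPS)
print(round(pssm[0]["D"], 2), round(pssm[0]["A"], 2))  # conserved D scores high, A scores negative
print(score(pssm, "DKDGDGTITTKE") > score(pssm, "AAAAAAAAAAAA"))
```

The same residue receives a different score at different positions, which is what distinguishes a profile search from a search with a single substitution matrix applied uniformly along the query.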

One known weakness of PSI-BLAST is that its ability to detect distant relationships between proteins depends critically on the choice of query sequence. For this reason, a recommended strategy with PSI-BLAST is to query using individual functional domains. PSI-BLAST will then find other proteins that share this domain, even if they do not possess overall homology. To acquaint the new user with PSI-BLAST, this tutorial mimics an investigation performed by Aravind and Koonin (3) in 1999, in which new members of the HSP70/actin protein family were identified, although the analysis in this tutorial is on a much smaller scale than the one presented in the paper. The HSP70/actin family members were first recognized to have a common evolutionary origin in a 1992 study by Bork et al. (4). A superposition of the structures of actin, hexokinase, and the molecular chaperone hsp70, together with an alignment of many sequences from each of the three families, uncovered a set of common conserved residues, distributed in five sequence motifs, that are involved in ATP binding and in a flexible interdomain hinge. Although these proteins perform very different functions and their sequences are quite divergent, the similarity in the fold of the ATP-binding domain is visually recognizable. All of them are ATP-dependent enzymes, yet the patterns discovered by Bork and associates could not be detected by traditional BLAST-type sequence searches. Aravind and Koonin therefore chose this family as a test of PSI-BLAST's ability to detect distant evolutionary relationships.

Aravind and Koonin chose actin from the PDB file with accession code 1atn as one of their query sequences. Begin by retrieving this sequence from the PDB. Check the box labeled "PDB ID" for searching the PDB archive, then enter the accession code 1atn as the query. Notice that the crystal structure deposited in this entry contains DNase I complexed with actin. In the menu in the blue border on the left there is a link entitled "FASTA Sequence"; select this link to download the sequence file. The file contains two sequences: 1ATN:A (actin) and 1ATN:D (DNase I). Copy the sequence for 1ATN:A and paste it into the BLAST query box that appears after choosing Protein BLAST, then select "PSI-BLAST" in the algorithm section. Change the database from "nr" to "swissprot", but accept the default values for everything else. Click on BLAST, then view the results. Note that each time another iteration of PSI-BLAST runs, the results page indicates the iteration number, which is very helpful for keeping track of the stage of the results.
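Copying one chain out of a multi-sequence FASTA file can also be done with a short script instead of by hand. The sketch below is one possible way to do it and makes two assumptions: that each header line starts with an identifier such as 1ATN:A followed by a "|" character (header layouts vary between download sources and over time), and that the file has already been read into a string. The sequences in `EXAMPLE` are placeholders, not the real actin or DNase I sequences, and the function names are invented for this illustration.

```python
def read_fasta(text):
    """Parse FASTA text into a dict mapping each header (without '>') to its sequence."""
    records = {}
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:]
            records[header] = []
        elif header is not None:
            records[header].append(line)
    return {h: "".join(parts) for h, parts in records.items()}


def pick_chain(records, chain_id):
    """Return the sequence whose header starts with chain_id before the first '|'."""
    for header, seq in records.items():
        if header.split("|")[0] == chain_id:
            return seq
    raise KeyError(chain_id)


# Placeholder content only: the header layout is assumed and the residues are dummies.
EXAMPLE = """>1ATN:A|PDBID|CHAIN|SEQUENCE
AAAAAAAAAA
CCCCC
>1ATN:D|PDBID|CHAIN|SEQUENCE
GGGGGGGG
"""

actin = pick_chain(read_fasta(EXAMPLE), "1ATN:A")
print(actin)  # the chain A placeholder, joined across its two lines
```

With the real downloaded file, the returned string is what would be pasted into the PSI-BLAST query box.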

Exercise 1

There is a statistical section at the very end of the BLAST report. BLAST lines up windows of the query sequence with similar regions in the database sequences, then tries to extend each alignment in both directions along the query sequence. What is the number of successful extensions for the first BLAST report?

Exercise 2

What is, by far, the most common protein shown as a hit in the list of scores?

Exercise 3

At the end of the list of scores, note the section entitled "Sequences with E-values WORSE than threshold". Why is it that PSI-BLAST includes this section, but BLAST does not?

Exercise 4

Do any members from the Actin-like ATPase domain superfamily, other than actin (or actin-like proteins), show up as hits in this section? If so, name one.

There are several buttons within the results document entitled "Run PSI-BLAST iteration 2". If they are difficult to locate, there is one at the end of the list of descriptions and scores. Click on one of these buttons to run a second iteration of PSI-BLAST. Above the graphical display, the results window should say "Results of PSI-Blast iteration 2". Keep track of what the next iteration number should be, and make sure it matches the iteration number that PSI-BLAST displays, to avoid getting lost within the PSI-BLAST search. When the results appear, look at the legend under the color alignment graph. It explains that a yellow starburst with the word "NEW" inside it marks a sequence that was newly identified in the most recent iteration, and a green dot marks a sequence that was already present before the most recent iteration.

Exercise 5

What is the number of successful extensions for the second BLAST iteration?

Exercise 6

Did the second iteration locate any new sequences within (meaning less than or equal to) the threshold E-value? If so, how many?

Exercise 7

Are there new members of the Actin-like ATPase domain superfamily in the section entitled "Sequences with E-value WORSE than threshold"? If so, name one.

Exercise 8

Run a third iteration of PSI-BLAST. How many new sequences are identified as a result of the third iteration?

Exercise 9

Are there new sequences returned that are within the threshold E-value that are NOT actin or actin-like proteins? If so, name one.

Finally, perform the first iteration of the PSI-BLAST search with the same query sequence as above, but with the database set to "nr".

Exercise 10

What is the number of successful extensions for the first BLAST report when searching the nr database?

The "swissprot" database is a curated database, and all the descriptions on the return list were informative. However, searching against the large "nr" database yields a list of returns that should contain some proteins described as "unknown", or "unnamed". The advantage of "nr" is that usually many more hits are returned. Sometimes this is desirable, and sometimes it is not, but it is one thing to consider when designing a PSI-BLAST search strategy.
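This tutorial uses the NCBI web form, but the same choices (database, number of iterations, E-value threshold for inclusion) appear as options when a search is run locally. The sketch below only assembles a command line for the standalone `psiblast` program from the NCBI BLAST+ package; it assumes that the program is installed and that a database with the given name has been set up locally. The file name `actin.fasta` and the helper `psiblast_command` are hypothetical. The inclusion threshold is passed explicitly because the standalone default may differ from the value used by the web form.

```python
def psiblast_command(query_fasta, db="swissprot", iterations=3, inclusion_ethresh=0.005):
    """Build an argument list for the BLAST+ psiblast program (nothing is executed here)."""
    return [
        "psiblast",
        "-query", query_fasta,                         # FASTA file holding the query sequence
        "-db", db,                                     # local database name, e.g. swissprot or nr
        "-num_iterations", str(iterations),            # how many PSI-BLAST rounds to run
        "-inclusion_ethresh", str(inclusion_ethresh),  # E-value cutoff for adding hits to the profile
    ]


# Hypothetical file name; one round against nr, as in the last part of the tutorial.
cmd = psiblast_command("actin.fasta", db="nr", iterations=1)
print(" ".join(cmd))

# To actually run the search (requires BLAST+ and the database to be installed):
# import subprocess
# subprocess.run(cmd, check=True)
```

Keeping the command as a list of arguments rather than a single string avoids shell quoting problems if it is later handed to `subprocess.run`.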

A PSI-BLAST search can be forced to include more sequences in the profile at a given iteration by altering the inclusion threshold. By default, any result with an E-value less than 0.005 is accepted. It is common to raise this threshold to 0.05 if PSI-BLAST converges too quickly at 0.005. We did not take our search to convergence; doing so would require continuing to iterate until an iteration returned no new hits. As this tutorial illustrates, PSI-BLAST is a useful tool, but it often requires some thought about the search strategy before it will produce meaningful results.
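The interplay between the inclusion threshold and convergence can be illustrated with a toy loop. Everything in the sketch below is invented for illustration: the sequence names, the E-values in `TOY_EVALUES`, and the `toy_search` function, which stands in for a real database search by reporting, for each candidate, the best E-value reachable from any sequence already in the profile. Only the control flow (include the hits that pass the threshold, search again, stop when nothing new appears) mirrors the procedure described above.

```python
# Made-up E-values: TOY_EVALUES[member][candidate] is the E-value a search would
# report for `candidate` once `member` is part of the profile.
TOY_EVALUES = {
    "query": {"hitA": 1e-50, "hitB": 0.01},
    "hitA": {"hitB": 0.001, "hitC": 0.03},
    "hitB": {"hitC": 0.004},
}


def toy_search(included):
    """Stand-in for a database search: best E-value per candidate over all members."""
    best = {}
    for member in included:
        for candidate, evalue in TOY_EVALUES.get(member, {}).items():
            best[candidate] = min(evalue, best.get(candidate, float("inf")))
    return best


def iterate_to_convergence(search, query, threshold=0.005, max_iterations=20):
    """Repeat the search until an iteration adds no new sequence to the profile.

    Returns the set of included sequences and the number of iterations run,
    counting the final iteration that found nothing new.
    """
    included = {query}
    for iteration in range(1, max_iterations + 1):
        hits = search(included)
        new = {sid for sid, evalue in hits.items() if evalue <= threshold} - included
        if not new:
            return included, iteration
        included |= new
    return included, max_iterations


strict = iterate_to_convergence(toy_search, "query", threshold=0.005)
loose = iterate_to_convergence(toy_search, "query", threshold=0.05)
print(sorted(strict[0]), strict[1])
print(sorted(loose[0]), loose[1])
```

In this toy data a looser threshold admits more sequences in the first round, so the same final set is reached in fewer iterations. With real data a looser threshold also raises the risk of admitting unrelated sequences into the profile, which is why the choice deserves some thought.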

References

  1. Altschul et al. (1997). Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Research, 25(17):3389-3402.
  2. Internet Course on The Principles of Protein Structure, a collaboration between Birkbeck College and the Virtual School of Natural Sciences (VSNS) of the Globewide Network Academy (GNA). [http://www.cryst.bbk.ac.uk/PPS95/].
  3. Aravind and Koonin. (1999). Gleaning non-trivial structural, functional and evolutionary information about proteins by iterative database searches. J Mol Biol, 287:1023-1040.
  4. Bork P, Sander C, Valencia A. (1992). An ATPase domain common to prokaryotic cell cycle proteins, sugar kinases, actin, and hsp70 heat shock proteins. Proc Natl Acad Sci U S A, 89(16):7290-7294.
