Inside Collection (Course): Bios 533 Bioinformatics
Summary: This module describes the many proteomics tools available from the ExPASy website. Tools are introduced for protein identification and characterization from amino acid composition, peptide mass fingerprinting and other mass spectrometry techniques. Also included is an introduction to profile and pattern searches, and to tools for predicting post-translational protein modifications, protein topology, primary structure properties, secondary structure and tertiary structure, along with tools for structure visualization.
A proteome is the collection of all the proteins within a given organism, in the same way a genome is the collection of all the genes within a given organism. A proteome differs from a genome in some important ways, however. A principal difference is that while an organism carries essentially the same DNA in every undamaged, healthy cell throughout its lifetime, its proteins differ greatly from one tissue to another and from one life stage to another. Furthermore, proteins commonly undergo a variety of chemical modifications after they are made. These modifications are critical for proper protein function and/or regulation, and they cannot be determined with certainty from the DNA sequence alone. In a contemporary high-throughput proteomics laboratory, the number of proteins identified and analyzed in one day can be on the order of hundreds.
The term "proteome" was originally coined by an Australian scientist, Marc Wilkins (1), to describe the "PROTEin complement of the genOME". The term "proteomics" is used relatively loosely to describe the collection of high-throughput techniques that have emerged to enable scientists to analyze all the proteins expressed under a certain set of conditions within an individual cell or organism. The ExPASy (Expert Protein Analysis System) website (2) of the Swiss Institute of Bioinformatics offers the definition that "proteomics can be defined as the qualitative and quantitative comparison of proteomes under different conditions to further unravel biological processes."
Common techniques for identifying the proteins within a proteome are 2D-PAGE (two-dimensional polyacrylamide gel electrophoresis), amino acid (AA) composition analysis, peptide mass fingerprinting and other mass spectrometry applications. A good starting point for becoming acquainted with 2D gels is the 2D PAGE tutorial offered by the Institute of Biological Sciences, University of Wales at Aberystwyth. ExPASy offers a good synopsis of peptide mass fingerprinting and AA composition analysis techniques for those who are unfamiliar with these methods.
At the ExPASy Proteomics Tools server, the first category of tools is for protein identification and characterization. Take a look at the tools listed in this section. These tools are designed to identify the proteins that make up the proteome under study, using the data obtained from gels, AA analysis and mass spectrometry experiments.
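To make the idea of an amino acid composition concrete, here is a minimal Python sketch that computes the mole percent of each residue in a sequence. The function name `aa_composition` is chosen for this illustration only and is not part of any ExPASy tool; the short peptide is the final line of Seq2 from the pattern-search exercise in this module. In an actual identification, an experimentally measured composition would be compared against compositions computed this way for every database entry.

```python
from collections import Counter

def aa_composition(sequence: str) -> dict[str, float]:
    """Return the mole percent of each residue present in the sequence."""
    counts = Counter(sequence.upper())
    total = sum(counts.values())
    if total == 0:
        return {}
    # Sort by residue letter so the output is easy to compare by eye.
    return {aa: 100.0 * n / total for aa, n in sorted(counts.items())}

# Short fragment used only for illustration.
fragment = "KEALSEEIKKAQRTGA"
for aa, pct in aa_composition(fragment).items():
    print(f"{aa}: {pct:.1f}%")
```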
What tool from the ExPASy "protein identification and characterization" section would you use for identifying a protein for which you only know the amino acid composition?
What is the name of at least one peptide mass fingerprint tool at the ExPASy site?
Generally outline the underlying principles that allow the identification of a protein through peptide mass fingerprinting.
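As a companion to the questions above, the following sketch shows the computational half of peptide mass fingerprinting under simplifying assumptions: the protein is cut by trypsin (after K or R, but not before P), there are no missed cleavages, and no residues are modified. The function names are hypothetical; the residue masses are standard monoisotopic values. The input is the first line of Seq2 from the pattern-search exercise in this module.

```python
# Monoisotopic residue masses in daltons (standard values).
MONO = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056  # one water is added per free peptide

def tryptic_digest(seq: str) -> list[str]:
    """Cleave after K or R, except when the next residue is P."""
    peptides, start = [], 0
    for i, aa in enumerate(seq):
        next_aa = seq[i + 1] if i + 1 < len(seq) else ""
        if aa in "KR" and next_aa != "P":
            peptides.append(seq[start:i + 1])
            start = i + 1
    if start < len(seq):
        peptides.append(seq[start:])  # C-terminal peptide
    return peptides

def peptide_mass(peptide: str) -> float:
    """Monoisotopic mass of an unmodified, uncharged peptide."""
    return sum(MONO[aa] for aa in peptide) + WATER

seq = "SGPRPVVLSGPSGAGKSTLLKRLLQEHSGIFGFSVSHTTRNPRPGEENGKDYYFVTREVM"
for pep in tryptic_digest(seq):
    print(f"{pep:<20} {peptide_mass(pep):10.4f}")
```

A fingerprinting tool compares a list of measured peptide masses against lists like this one, computed for every protein in a database, and ranks proteins by how many masses agree within a tolerance.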
Scroll down on the ExPASy tools webpage to the section entitled "pattern and profile searches". The tools that populate this section are designed to identify proteins that belong to well characterized protein families, usually identified by conserved domains within family members. Also, well known protein motifs, or domains, are represented independently of their protein families in pattern databases that contain the conserved aspects of the domain sequence. Select the tool entitled "InterPro Scan" (3) to perform an integrated search in PROSITE, Pfam, PRINTS and other family and domain databases. This tool is useful for identifying specific domains or motifs within a protein, once the sequence has been determined, and can sometimes recognize the protein as a member of an established protein family. Test the efficacy of this tool with the following sequences, one at a time, but make sure the interactive run button is selected. An email address will be required to submit the job, but the results can be viewed in the browser interactively.
>Seq1
MAGIAAKLAKDREAAEGLGSHERAIKYLNQDYEALRNECLEAGTLFQDPSFPAIPSALGFKELGPYSSKT
RGIEWKRPTEICADPQFIIGGATRTDICQGALGDCWLLAAIASLTLNEEILARVVPLNQSFQENYAGIFH
FQFWQYGEWVEVVVDDRLPTKDGELLFVHSAEGSEFWSALLEKAYAKINGCYEALSGGATTEGFEDFTGG
IAEWYELKKPPPNLFKIIQKALQKGSLLGCSIDITSAADSEAITFQKLVKGHAYSVTGAEEVESNGSLQK
LIRIRNPWGEVEWTGRWNDNCPSWNTIDPEERERLTRRHEDGEFWMSFSDFLRHYSRLEICNLTPDTLTS
DTYKKWKLTKMDGNWRRGSTAGGCRNYPNTFWMNPQYLIKLEEEDEDEEDGESGCTFLVGLIQKHRRRQR
KMGEDMHTIGFGIYEVPEELSGQTNIHLSKNFFLTNRARERSDTFINLREVLNRFKLPPGEYILVPSTFE
PNKDGDFCIRVFSEKKADYQAVDDEIEANLEEFDISEDDIDDGVRRLFAQLAGEDAEISAFELQTILRRV
LAKRQDIKSDGFSIETCKIMVDMLDSDGSGKLGLKEFYILWTKIQKYQKIYREIDVDRSGTMNSYEMRKA
LEEAGFKMPCQLHQVIVARFADDQLIIDFDNFVRCLVRLETLFKIFKQLDPENTGTIELDLISWLCFSVL
>Seq2
SGPRPVVLSGPSGAGKSTLLKRLLQEHSGIFGFSVSHTTRNPRPGEENGKDYYFVTREVM
QRDIAAGDFIEHAEFSGNLYGTSKVAVQAVQAMNRICVLDVDLQGVRNIKATDLRPIYIS
VQPPSLHVLEQRLRQRNTETEESLVKRLAAAQADMESSKEPGLFDVVIINDSLDQAYAEL
KEALSEEIKKAQRTGA
>Seq3
MTEVISNKITAKDGATSLKDIDDKRWVWISDPETAFTKAWIKEDLPDKKYVVRYNNSRDE
KIVGEDEIDPVNPAKFDRVNDMAELTYLNEPAVTYNLEQRYLSDQIYTYSGLFLVAVNPY
CGLPIYTKDIIQLYKDKTQERKLPHVFAIADLAYNNLLENKENQSILVTGESGAGKTENT
KRIIQYLAAIASSTTVGSSQVEEQIIKTNPVLESFGNARTVRNNNSSRFGKFIKVEFSLS
GEISNAAIEWYLLEKSRVVHQNEFERNYHVFYQLLSGADTALKNKLLLTDNCNDYRYLKD
SVHIIDGVDDKEEFKTLLAAFKTLGFDDKENFDLFNILSIILHMGNIDVGADRSGIARLL
NPDEIDKLCHLLGVSPELFSQNLVRPRIKAGHEWVISARSQTQVISSIEALAKAIYERNF
GWLVKRLNTSLNHSNAQSYFIGILDIAGFEIFEKNSFEQLCINYTNEKLQQFFNHHMFVL
EQEEYMKEEIVWDFIDFGHDLQPTIDLIEKANPIGILSCLDEECVMPKATDATFTSKLDA
LWRNKSLKYKPFKFADQGFILTHYAADVPYSTEGWLEKNTDPLNENVAKLLAQSTNKHVA
TLFSDYQETETKTVRGRTKKGLFRTVAQRHKEQLNQLMNQFNSTQPHFIRCIVPNEEKKM
HTFNRPLVLGQLRCNGVLEGIRITRAGFPNRLPFNDFRVRYEIMAHLPTGTYVESRRASV
MILEELKIDEASYRIGVSKIFFKAGVLAELEERRVATLQRLMTMLQTRIRGFLQRKIFQK
RLKDIQAIKLLQANLQVYNEFRTFPWAKLFFNLRPLLSSTQNDKQLKKRDAEIIELKYEL
KKQQNSKSEVERDLVETNNSLTAVENLLTTERAIALDKEEILRRTQERLANIEDSFSETK
QQNENLQRESASLKQINNELESELLEKTSKVETLLSEQNELKEKLSLEEKDLLDTKGELE
SLRENNATVLSEKAEFNEQCKSLQETIVTKDAELDKLTKYISDYKTEIQEMRLTNQKMNE
KSIQQEGSLSESLKRVKKLERENSTLISDVSILKQQKEELSVLKGVQELTINNLEEKVNY
LEADVKQLPKLKKELESLNDKDQLYQLQATKNKELEAKVKECLNNIKSLTKELENKEEKC
QNLSDASLKYIELQEIHENLLLKVSDLENYKKKYEGLQLDLEGLKDVDTNFQELSKKHRD
LTFNHESLLRQSASYKEKLSLASSENKDLSNKVSSLTKQVNELSPKASKVPELERKITNL
MHEYSQLGKTFEDEKRKALIASRDNEELRSLKSELESKRKLEVEYQKVLEEVKTTRSLRS
EVTLLRNKVADHESIRSKLSEVEMKLVDTRKELNSALDSCKKREAEIHRLKEHRPSGKEN
NIPAVKTTEPVLKNIPQRKTIFDLQQRNANQALYENLKRDYDRLNLEKHNLEKQVNELKG
AEVSPQPTGQSLQHVNLAHAIELKALKDQINSEKAKMFSVQVQYEKREQELQKRIASLEK
VNKDSLIDVRALRDRIASLEDELRAA
View the results for Sequence 1. The first column of the results table identifies whether or not the match is of type "family" or of type "domain". The family and domain names appear at the top of each box in the second column of the results page, the same column that contains the diagrams which show the localization of the section of sequence that has been identified with the referenced family or domain.
How many matches were of the type "family"?
How many were domains?
What are the names of the families identified with this sequence?
List any domains that were identified within Sequence 1.
View the results for Sequence 2.
How many families were returned as matches?
How many domains?
What families were identified with this sequence?
List any domains that were identified within Sequence 2.
View the results for Sequence 3.
How many families were returned as matches?
How many domains?
What families were identified with this sequence?
List any domains that were identified within Sequence 3.
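Pattern databases such as PROSITE store many motifs as short patterns that behave much like regular expressions. The sketch below translates a subset of PROSITE pattern syntax (elements separated by `-`, `x` for any residue, `[...]` for allowed residues, `{...}` for excluded residues, `(n)` or `(n,m)` for repeats) into a Python regular expression. The helper `prosite_to_regex` is hypothetical and written for this illustration; it is not how InterPro Scan is implemented, and it ignores anchors and other syntax details. The example pattern is the widely cited ATP/GTP-binding P-loop motif, applied to the first line of Seq2.

```python
import re

def prosite_to_regex(pattern: str) -> str:
    """Translate a subset of PROSITE pattern syntax into a Python regex."""
    regex = []
    for element in pattern.strip().rstrip(".").split("-"):
        # Split each element into its core and an optional (n) or (n,m) repeat.
        m = re.fullmatch(r"(.+?)(?:\((\d+)(?:,(\d+))?\))?", element)
        core, low, high = m.groups()
        if core == "x":
            core = "."                          # any residue
        elif core.startswith("{"):
            core = "[^" + core[1:-1] + "]"      # excluded residues
        if low and high:
            core += "{" + low + "," + high + "}"
        elif low:
            core += "{" + low + "}"
        regex.append(core)
    return "".join(regex)

p_loop = prosite_to_regex("[AG]-x(4)-G-K-[ST]")
seq2_start = "SGPRPVVLSGPSGAGKSTLLKRLLQEHSGIFGFSVSHTTRNPRPGEENGKDYYFVTREVM"
for hit in re.finditer(p_loop, seq2_start):
    # Report 1-based positions, as sequence tools usually do.
    print(hit.start() + 1, hit.end(), hit.group())
```

Short patterns like this one match by chance fairly often, which is one reason integrated searches also rely on profiles and hidden Markov models rather than on patterns alone.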
Return to the ExPASy Proteomics Tools server. Now, scroll down to the section entitled "post-translational modification prediction". Use NetPhos (4) to predict possible sites for serine, threonine and tyrosine phosphorylation on the three sequences above (all 3 sequences can be entered as one query). Accept the default values and select "submit". For help interpreting the results, view the NetPhos output format.
How many (a) serine, (b) threonine, and (c) tyrosine phosphorylation sites are predicted for Sequence 1?
How many (a) serine, (b) threonine, and (c) tyrosine phosphorylation sites are predicted for Sequence 2?
How many (a) serine, (b) threonine, and (c) tyrosine phosphorylation sites are predicted for Sequence 3?
Are there any serine, threonine or tyrosine residues in the sequences that were not listed as potential phosphorylation sites? If so, explain why some of these residues were not listed as predicted phosphorylation sites. (Those uncertain about the answer to this question should view the above link explaining the output.)
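The distinction between a candidate residue and a predicted site can be sketched in a few lines of Python. The code below lists every S, T and Y in a sequence together with a nine-residue context window, then keeps only the candidates whose score reaches a cutoff. The function names, the cutoff of 0.5 and especially the scores are assumptions made for this illustration; real scores come from the trained predictor and are not computed here.

```python
def candidate_sites(seq: str, flank: int = 4) -> list[tuple[int, str, str]]:
    """List (1-based position, residue, context) for every S, T or Y."""
    sites = []
    for i, aa in enumerate(seq):
        if aa in "STY":
            left = seq[max(0, i - flank):i].rjust(flank, "-")
            right = seq[i + 1:i + 1 + flank].ljust(flank, "-")
            sites.append((i + 1, aa, left + aa + right))
    return sites

def above_cutoff(scored: dict[int, float], cutoff: float = 0.5) -> list[int]:
    """Return the positions whose score is at or above the cutoff."""
    return sorted(pos for pos, score in scored.items() if score >= cutoff)

fragment = "KEALSEEIKKAQRTGA"      # final line of Seq2
for site in candidate_sites(fragment):
    print(site)

# Hypothetical scores keyed by position, for illustration only.
made_up_scores = {5: 0.91, 14: 0.12}
print(above_cutoff(made_up_scores))
```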
Once a protein sequence has been determined through proteomics techniques, bioinformatics can be used to predict certain types of topology. Topology is the sequence of secondary structure elements within a protein. The most basic secondary structure elements within proteins are the alpha helix, the beta sheet and the random coil. However, some algorithms will predict topological features that are closely related to in vivo localization, such as signal sequences and transmembrane helices.
At the ExPASy Proteomics Tools server, scroll down to the section entitled "topology prediction". This section contains tools that predict localization and sorting signals, as well as transmembrane regions within proteins. PSORT (5) is a computer program for the prediction of protein localization. It requires an amino acid sequence and its source organism as input, and it searches for known, organism-specific protein sorting signals. It returns a list of candidate localization sites, each accompanied by a score indicating the probability that the protein encoded by the input sequence would be localized to that site. To explore the use of PSORT, click on the PSORT link on the ExPASy tools page. Choose "PSORT II" for eukaryotic sequences, and select the PSORT II Prediction. Cut and paste the following sequence for diacylglycerol kinase from Rattus norvegicus into the query box and click "Submit".
MEPRDPSPEARSSDSESASASSSGSERDADPEPDKAPRRLTKRRFPGLRLFGHRKAITKSGLQHLAPPPP
TPGAPCGESERQIRSTVDWSESAAYGEHIWFETNVSGDFCYVGEQYCVAKMLPKSAPRRKCAACKIVVHT
PCIGQLEKINFRCKPSFRESGSRNVREPTFVRHHWVHRRRQDGKCRHCGKGFQQKFTFHSKEIVAISCSW
CKQAYHSKVSCFMLQQIEEPCSLGVHAAVVIPPTWILRARRPQNTLKASKKKKRASFKRRSSKKGPEEGR
WRPFIIRPTPSPLMKPLLVFVNPKSGGNQGAKIIQSFLWYLNPRQVFDLSQGGPREALEMYRKVHNLRIL
ACGGDGTVGWILSTLDQLRLKPPPPVAILPLGTGNDLARTLNWGGGYTDEPVSKILSHVEEGNVVQLDRW
DLRAEPNPEAGPEERDDGATDRLPLDVFNNYFSLGFDAHVTLEFHESREANPEKFNSRFRNKMFYAGTAF
SDFLMGSSKDLAKHIRVVCDGMDLTPKIQDLKPQCIVFLNIPRYCAGTMPWGHPGEHHDFEPQRHDDGYL
EVIGFTMTSLAALQVGGHGERLTQCREVLLTTAKAIPVQVDGEPCKLAASRIRIALRNQATMVQKAKRRS
TAPLHSDQQPVPEQLRIQVSRVSMHDYEALHYDKEQLKEASVPLGTVVVPGDSDLELCRAHIERLQQEPD
GAGAKSPMCHPLSSKWCFLDATTASRFYRIDRAQEHLNYVTEIAQDEIYILDPELLGASARPDLPTPTSP
LPASPCSPTPGSLQGDAALPQGEELIEAAKRNDFCKLQELHRAGGDLMHRDHQSRTLLHHAVSTGSKEVV
RYLLDHAPPEILDAVEENGETCLHQAAALGQRTICHYIVEAGASLMKTDQQGDTPRQRAEKAQDTELAAY
LENRQHYQMIQREDQETAV
First, view the "k-NN" results by scrolling to the bottom of the page. The k-nearest neighbor (k-NN) algorithm takes the output of the many subprograms and determines a probability of localization at each candidate site within the cell using all of the predictions.
What is the probability the sequence encodes a protein that is (a) secreted by vesicles? (b) localized to the endoplasmic reticulum? (c) cytoplasmic? or (d) localized to the nucleus?
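The k-NN idea can be illustrated with a small, generic Python sketch: each protein of known localization is represented by a vector of feature values, and the query is assigned, for each site, the fraction of its k closest neighbors that carry that label. Everything in the example data is made up for illustration (two invented features, five invented training proteins), and the function `knn_probabilities` is hypothetical; it does not reproduce the features or the training set PSORT II uses.

```python
import math
from collections import Counter

def knn_probabilities(query, training, k=3):
    """Fraction of the k nearest training points carrying each label."""
    nearest = sorted(training, key=lambda item: math.dist(query, item[0]))[:k]
    votes = Counter(label for _, label in nearest)
    return {label: n / len(nearest) for label, n in votes.items()}

# Hypothetical (feature vector, known localization) pairs.
training = [
    ((0.0, 0.0), "cytoplasmic"),
    ((0.0, 1.0), "cytoplasmic"),
    ((5.0, 5.0), "nuclear"),
    ((6.0, 5.0), "nuclear"),
    ((5.0, 6.0), "nuclear"),
]

print(knn_probabilities((0.2, 0.2), training, k=3))
print(knn_probabilities((5.2, 5.1), training, k=3))
```

Because the result is a vote among neighbors, the reported percentages depend on the choice of k and on which proteins are in the training set.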
Now, scroll through the results of the subprograms. Clicking on the links will reveal a brief description of the algorithm each individual subprogram utilizes.
What is the localization prediction and reliability score produced by the NNCN subprogram, Reinhardt's methods for cytoplasmic/nuclear discrimination?
The first two subprograms, PSG and GvH, are tools that predict N-terminal signal peptide sequences. Just after their results are listed, there is a statement summarizing whether or not an N-terminal signal peptide has been predicted for the query sequence.
Do these subprograms predict an N-terminal signal peptide for the diacylglycerol kinase query?
After looking over all the results, what is the most likely localization of our query protein?
Read the title and abstract of this article on the rat diacylglycerol kinase used for the query sequence.
Was PSORT able to predict the correct localization, using the sequence information alone?
Return to the ExPASy tools, and scroll to the section entitled "primary structure analysis". Click on the link for the ProtParam tool. ProtParam is a suite of programs designed to predict various chemical and physical properties of a protein from its sequence. ProtParam will yield an estimated extinction coefficient at selected wavelengths based on protein sequence (6), an estimate of the in vivo half-life of the protein (7, 8, 9, 10), an instability index (11), an aliphatic index (12), and an average value for hydropathicity (13). Cut and paste the rat diacylglycerol kinase sequence above into the query box and click on "compute parameters".
What is the molecular weight computed from the sequence?
What does the amino acid composition analysis show as the most common amino acid in this protein? (Is that unusual?)
What is the chemical formula for the query protein?
What is the predicted extinction coefficient at 280 nm, in 6 M guanidinium HCl, 0.02 M phosphate, pH 6.5 buffer, assuming all cysteines appear as half cystines?
In what way could it be helpful to know the extinction coefficient?
According to the instability index, is this protein classified as stable or unstable?
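Several of these parameters are simple functions of residue counts, which the following Python sketch illustrates. It computes the grand average of hydropathicity from the Kyte-Doolittle scale, the aliphatic index from the mole percent of Ala, Val, Ile and Leu, and an extinction coefficient at 280 nm from the numbers of Trp, Tyr and cystine. The function names are hypothetical. The molar absorptivities (5690, 1280 and 120 per M per cm) are the values commonly associated with the guanidinium HCl conditions quoted in the question above; this is an assumption, and the constants used by the current ProtParam server may differ, so small discrepancies with the server output should be expected.

```python
from collections import Counter

# Kyte-Doolittle hydropathy values.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}

def gravy(seq: str) -> float:
    """Grand average of hydropathicity: mean hydropathy per residue."""
    return sum(KYTE_DOOLITTLE[aa] for aa in seq) / len(seq)

def aliphatic_index(seq: str) -> float:
    """Relative volume occupied by aliphatic side chains (A, V, I, L)."""
    counts = Counter(seq)
    pct = {aa: 100.0 * counts[aa] / len(seq) for aa in "AVIL"}
    return pct["A"] + 2.9 * pct["V"] + 3.9 * (pct["I"] + pct["L"])

def extinction_280(seq: str, all_cys_paired: bool = True) -> int:
    """Estimated molar extinction coefficient at 280 nm (per M per cm)."""
    # If all cysteines are half cystines, every pair forms one cystine.
    cystines = seq.count("C") // 2 if all_cys_paired else 0
    return 5690 * seq.count("W") + 1280 * seq.count("Y") + 120 * cystines

# First line of the diacylglycerol kinase query, for illustration.
fragment = "MEPRDPSPEARSSDSESASASSSGSERDADPEPDKAPRRLTKRRFPGLRLFGHRKAITKSGLQHLAPPPP"
print(f"GRAVY: {gravy(fragment):.3f}")
print(f"Aliphatic index: {aliphatic_index(fragment):.2f}")
print(f"Extinction coefficient: {extinction_280(fragment)}")
```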
Return again to the ExPASy tools. Notice there are two sections dealing with structure prediction, secondary structure prediction tools and tertiary structure prediction and visualization tools. The secondary structure prediction tools are designed to predict features such as the helical content, the beta sheet formations, and the turns, loops, and coil regions within a protein, given the sequence.
Explore the secondary structure tools independently, and submit the diacylglycerol kinase sequence above to any of the available secondary structure prediction tools. Most of these tools will email the results, with at least a 20 minute delay between submission and receipt of results. Forward a results summary to the instructor, outlining the predictions created by the program of choice.
Tertiary structure prediction tools match the query sequence with sequences, or partial sequences, of proteins where the 3-D structure has been published in the Protein Data Bank (PDB). These tools will produce a model of the query protein by piecing together the structural regions from the best matches in the PDB, and threading the query sequence through the predicted structure. For more detailed explanations of available 3-D structure prediction software, view the Swiss-Model demo page and the Geno3D reference page. Although both of these tools are searching for templates from existing PDB entries, they are doing this in different ways.
What program does Swiss-Model use to match the query sequence with sequences of known structures?
What program does Geno3D use to match the query sequence with sequences of known structures?
Notice that the template selection process and the model structure refinement processes are different between these two programs as well.
Finally, in the tertiary structure section of the ExPASy tools page, Swiss PDB Viewer is a graphical tool for the visualization, comparison and analysis of 3-D coordinate files. Swiss PDB Viewer can superimpose 3-D structures by finding the rotation and translation that most closely aligns the two protein structures. Additionally, the Swiss PDB Viewer will perform amino acid mutations, prediction of hydrogen bonds, and calculation of angles and distances between atoms. Best of all, Swiss PDB Viewer is freeware and available for many different platforms, including Macintosh, PC, SGI IRIX, and Linux.
View this supplemental SPDBV web page. What other function does Swiss PDB Viewer have, when used in conjunction with other applications such as OpenGL or POV-Ray?
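The distance and angle measurements mentioned above reduce to elementary vector arithmetic on atomic coordinates. The sketch below shows one way to compute them in Python; the function names and the coordinates are hypothetical and are not taken from any PDB entry or from Swiss PDB Viewer itself.

```python
import math

def distance(a, b) -> float:
    """Euclidean distance between two atoms given as (x, y, z) tuples."""
    return math.dist(a, b)

def angle_deg(a, b, c) -> float:
    """Angle in degrees at atom b, formed by the atoms a-b-c."""
    ba = [p - q for p, q in zip(a, b)]
    bc = [p - q for p, q in zip(c, b)]
    dot = sum(x * y for x, y in zip(ba, bc))
    cos_theta = dot / (math.hypot(*ba) * math.hypot(*bc))
    # Clamp to guard against rounding slightly outside [-1, 1].
    return math.degrees(math.acos(max(-1.0, min(1.0, cos_theta))))

# Made-up coordinates in angstroms, for illustration only.
atom1, atom2, atom3 = (1.0, 0.0, 0.0), (0.0, 0.0, 0.0), (0.0, 1.5, 0.0)
print(f"distance 1-2: {distance(atom1, atom2):.2f}")
print(f"angle 1-2-3: {angle_deg(atom1, atom2, atom3):.1f}")
```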
ExPASy provides a very large library of tools for proteomics as well as other bioinformatics applications. For students interested in future research in the field of proteomics, this web server will be an important resource.