A proteome is the collection of all the proteins within a given organism,
in the same way a genome is the collection of all the genes within a given
organism. A proteome has some characteristics that are quite different from
a genome, however. A principal difference is that while an organism carries
essentially the same DNA in every undamaged, healthy cell throughout its
lifetime, its proteins differ greatly from one tissue to another, and from
one life stage to another. Furthermore, proteins
commonly incur a variety of chemical modifications after they are made.
These modifications are critical for proper protein functioning and/or
regulation, and moreover, these modifications cannot be determined with
certainty by looking at the DNA sequence alone. In a contemporary high-throughput
proteomics laboratory, the number of proteins identified and analyzed in one
day can be on the order of hundreds.
The term "proteome" was originally coined by an Australian scientist,
Marc Wilkins (1), to describe the "PROTEin complement of the genOME".
The term "proteomics" is used relatively loosely to describe the collection
of high-throughput techniques that have emerged to enable scientists to
analyze all the proteins expressed under a certain set of conditions within
an individual cell or organism.
The ExPASy (Expert Protein Analysis System) website (2) of the
Swiss Institute of Bioinformatics offers the definition that
"proteomics can be defined as the qualitative and quantitative comparison of
proteomes under different conditions to further unravel biological processes."
Common techniques for identifying the proteins within a proteome are
two-dimensional polyacrylamide gel electrophoresis (2D-PAGE), amino acid (AA)
composition analysis, peptide mass fingerprinting and other mass spectrometry
applications.
A good starting point for becoming acquainted with 2D gels is the
2D PAGE tutorial offered by the Institute of Biological Sciences,
University of Wales at Aberystwyth. ExPASy offers a good synopsis of
peptide mass fingerprinting and AA composition analysis techniques,
for those who are unfamiliar with these methods.
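As a rough illustration of the idea behind peptide mass fingerprinting, the Python sketch below cleaves a sequence in silico and computes a mass for each resulting peptide. It assumes a simplified trypsin rule (cut after K or R unless the next residue is P) and standard monoisotopic residue masses. The function names are hypothetical, and this is not the algorithm of any particular ExPASy tool. Real tools also account for missed cleavages, chemical modifications and mass tolerance.

```python
# Monoisotopic residue masses in daltons (standard values, rounded).
MONOISOTOPIC = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056  # one water is added per free peptide


def tryptic_digest(sequence):
    """Split a sequence after K or R, except when the next residue is P."""
    peptides, start = [], 0
    for i, residue in enumerate(sequence):
        next_residue = sequence[i + 1] if i + 1 < len(sequence) else ""
        if residue in "KR" and next_residue != "P":
            peptides.append(sequence[start:i + 1])
            start = i + 1
    if start < len(sequence):
        peptides.append(sequence[start:])  # C-terminal peptide
    return peptides


def peptide_mass(peptide):
    """Monoisotopic mass of an unmodified peptide."""
    return sum(MONOISOTOPIC[residue] for residue in peptide) + WATER


def fingerprint(sequence):
    """Sorted list of peptide masses, the 'fingerprint' of the sequence."""
    return sorted(round(peptide_mass(p), 4) for p in tryptic_digest(sequence))


print(tryptic_digest("AAKGGRPLLK"))  # ['AAK', 'GGRPLLK']
print(fingerprint("AAKGGRPLLK"))
```

A measured list of peptide masses can then be compared with the lists computed for every sequence in a database; the entry whose computed masses agree best with the measured ones is the most likely identification.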
At the ExPASy Proteomics Tools server, the first category of tools is for
protein identification and characterization. Take a look at the tools listed
in this section. These tools are designed to identify the proteins that make
up the proteome under study, using the data obtained from gels, AA analysis
and mass spectrometry experiments.
Problem 1
What tool from the ExPASy "protein identification and characterization"
section would you use for identifying a protein for which you only know the
amino acid composition?
Problem 2
What is the name of at least one peptide mass fingerprint tool
at the ExPASy site?
Problem 3
Generally outline the underlying principles that allow the identification of a protein through peptide mass fingerprinting.
Scroll down on the ExPASy tools webpage to the section entitled "pattern and
profile searches". The tools that populate this section are designed to
identify proteins that belong to well characterized protein families,
usually identified by conserved domains within family members.
Also, well known protein motifs, or domains, are represented independently
of their protein families in pattern databases that contain the conserved
aspects of the domain sequence. Select the tool entitled
"InterPro Scan" (3)
to perform an integrated search in PROSITE, Pfam, PRINTS and other family
and domain databases. This tool is useful for identifying specific domains
or motifs within a protein, once the sequence has been determined, and can
sometimes recognize the protein as a member of an established protein family.
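To give a feel for what a pattern database entry encodes, the Python sketch below converts a PROSITE-style pattern into a regular expression and scans a sequence with it. The pattern used is the widely cited ATP/GTP-binding site motif A (P-loop), [AG]-x(4)-G-K-[ST], and the input is the opening residues of Seq2 below. The function names are hypothetical and the converter handles only the basic pattern syntax. This is not how InterPro Scan itself works, since that tool combines several methods (patterns, profiles, hidden Markov models).

```python
import re


def prosite_to_regex(pattern):
    """Translate basic PROSITE syntax into a Python regular expression.

    Supported: x (any residue), [..] (allowed set), {..} (forbidden set),
    and repeat counts such as x(4) or x(2,3).
    """
    regex_parts = []
    for element in pattern.strip(".").split("-"):
        match = re.fullmatch(r"(.+?)(?:\((\d+)(?:,(\d+))?\))?", element)
        core, low, high = match.groups()
        if core == "x":
            core = "."
        elif core.startswith("{"):
            core = "[^" + core[1:-1] + "]"
        if low and high:
            core += "{%s,%s}" % (low, high)
        elif low:
            core += "{%s}" % low
        regex_parts.append(core)
    return "".join(regex_parts)


def find_motif(pattern, sequence):
    """Return (1-based start, matched text) for each non-overlapping hit."""
    regex = prosite_to_regex(pattern)
    return [(m.start() + 1, m.group()) for m in re.finditer(regex, sequence)]


p_loop = "[AG]-x(4)-G-K-[ST]"
print(prosite_to_regex(p_loop))                      # [AG].{4}GK[ST]
print(find_motif(p_loop, "SGPRPVVLSGPSGAGKSTLL"))    # [(10, 'GPSGAGKS')]
```

Short patterns like this one match many unrelated proteins by chance, which is one reason integrated searches across several family and domain databases are more informative than a single pattern hit.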
Test this tool with each of the following sequences, one at a time, making sure
the interactive run button is selected. An email address will be required
to submit the job, but the results can be viewed in the browser interactively.
>Seq1
MAGIAAKLAKDREAAEGLGSHERAIKYLNQDYEALRNECLEAGTLFQDPSFPAIPSALGFKELGPYSSKT
RGIEWKRPTEICADPQFIIGGATRTDICQGALGDCWLLAAIASLTLNEEILARVVPLNQSFQENYAGIFH
FQFWQYGEWVEVVVDDRLPTKDGELLFVHSAEGSEFWSALLEKAYAKINGCYEALSGGATTEGFEDFTGG
IAEWYELKKPPPNLFKIIQKALQKGSLLGCSIDITSAADSEAITFQKLVKGHAYSVTGAEEVESNGSLQK
LIRIRNPWGEVEWTGRWNDNCPSWNTIDPEERERLTRRHEDGEFWMSFSDFLRHYSRLEICNLTPDTLTS
DTYKKWKLTKMDGNWRRGSTAGGCRNYPNTFWMNPQYLIKLEEEDEDEEDGESGCTFLVGLIQKHRRRQR
KMGEDMHTIGFGIYEVPEELSGQTNIHLSKNFFLTNRARERSDTFINLREVLNRFKLPPGEYILVPSTFE
PNKDGDFCIRVFSEKKADYQAVDDEIEANLEEFDISEDDIDDGVRRLFAQLAGEDAEISAFELQTILRRV
LAKRQDIKSDGFSIETCKIMVDMLDSDGSGKLGLKEFYILWTKIQKYQKIYREIDVDRSGTMNSYEMRKA
LEEAGFKMPCQLHQVIVARFADDQLIIDFDNFVRCLVRLETLFKIFKQLDPENTGTIELDLISWLCFSVL
>Seq2
SGPRPVVLSGPSGAGKSTLLKRLLQEHSGIFGFSVSHTTRNPRPGEENGKDYYFVTREVM
QRDIAAGDFIEHAEFSGNLYGTSKVAVQAVQAMNRICVLDVDLQGVRNIKATDLRPIYIS
VQPPSLHVLEQRLRQRNTETEESLVKRLAAAQADMESSKEPGLFDVVIINDSLDQAYAEL
KEALSEEIKKAQRTGA
>Seq3
MTEVISNKITAKDGATSLKDIDDKRWVWISDPETAFTKAWIKEDLPDKKYVVRYNNSRDE
KIVGEDEIDPVNPAKFDRVNDMAELTYLNEPAVTYNLEQRYLSDQIYTYSGLFLVAVNPY
CGLPIYTKDIIQLYKDKTQERKLPHVFAIADLAYNNLLENKENQSILVTGESGAGKTENT
KRIIQYLAAIASSTTVGSSQVEEQIIKTNPVLESFGNARTVRNNNSSRFGKFIKVEFSLS
GEISNAAIEWYLLEKSRVVHQNEFERNYHVFYQLLSGADTALKNKLLLTDNCNDYRYLKD
SVHIIDGVDDKEEFKTLLAAFKTLGFDDKENFDLFNILSIILHMGNIDVGADRSGIARLL
NPDEIDKLCHLLGVSPELFSQNLVRPRIKAGHEWVISARSQTQVISSIEALAKAIYERNF
GWLVKRLNTSLNHSNAQSYFIGILDIAGFEIFEKNSFEQLCINYTNEKLQQFFNHHMFVL
EQEEYMKEEIVWDFIDFGHDLQPTIDLIEKANPIGILSCLDEECVMPKATDATFTSKLDA
LWRNKSLKYKPFKFADQGFILTHYAADVPYSTEGWLEKNTDPLNENVAKLLAQSTNKHVA
TLFSDYQETETKTVRGRTKKGLFRTVAQRHKEQLNQLMNQFNSTQPHFIRCIVPNEEKKM
HTFNRPLVLGQLRCNGVLEGIRITRAGFPNRLPFNDFRVRYEIMAHLPTGTYVESRRASV
MILEELKIDEASYRIGVSKIFFKAGVLAELEERRVATLQRLMTMLQTRIRGFLQRKIFQK
RLKDIQAIKLLQANLQVYNEFRTFPWAKLFFNLRPLLSSTQNDKQLKKRDAEIIELKYEL
KKQQNSKSEVERDLVETNNSLTAVENLLTTERAIALDKEEILRRTQERLANIEDSFSETK
QQNENLQRESASLKQINNELESELLEKTSKVETLLSEQNELKEKLSLEEKDLLDTKGELE
SLRENNATVLSEKAEFNEQCKSLQETIVTKDAELDKLTKYISDYKTEIQEMRLTNQKMNE
KSIQQEGSLSESLKRVKKLERENSTLISDVSILKQQKEELSVLKGVQELTINNLEEKVNY
LEADVKQLPKLKKELESLNDKDQLYQLQATKNKELEAKVKECLNNIKSLTKELENKEEKC
QNLSDASLKYIELQEIHENLLLKVSDLENYKKKYEGLQLDLEGLKDVDTNFQELSKKHRD
LTFNHESLLRQSASYKEKLSLASSENKDLSNKVSSLTKQVNELSPKASKVPELERKITNL
MHEYSQLGKTFEDEKRKALIASRDNEELRSLKSELESKRKLEVEYQKVLEEVKTTRSLRS
EVTLLRNKVADHESIRSKLSEVEMKLVDTRKELNSALDSCKKREAEIHRLKEHRPSGKEN
NIPAVKTTEPVLKNIPQRKTIFDLQQRNANQALYENLKRDYDRLNLEKHNLEKQVNELKG
AEVSPQPTGQSLQHVNLAHAIELKALKDQINSEKAKMFSVQVQYEKREQELQKRIASLEK
VNKDSLIDVRALRDRIASLEDELRAA
View the results for Sequence 1.
The first column of the results table identifies whether the match is of
type "family" or of type "domain". The family and domain names appear at the
top of each box in the second column of the results page, the same column
that contains the diagrams showing where in the sequence the matched family
or domain is located.
Problem 4
How many matches were of the type "family"?
Problem 5
How many were domains?
Problem 6
What are the names of the families identified with this sequence?
Problem 7
List any domains that were identified within Sequence 1.
View the results for Sequence 2.
Problem 8
How many families were returned as matches?
Problem 9
How many domains?
Problem 10
What families were identified with this sequence?
Problem 11
List any domains that were identified within Sequence 2.
View the results for Sequence 3.
Problem 12
How many families were returned as matches?
Problem 13
How many domains?
Problem 14
What families were identified with this sequence?
Problem 15
List any domains that were identified within Sequence 3.
Return to the ExPASy Proteomics Tools server. Now, scroll down to the section
entitled "post-translational modification prediction". Use NetPhos (4) to
predict possible sites for serine, threonine and tyrosine phosphorylation on
the three sequences above (all 3 sequences can be entered as one query).
Accept the default values and select "submit". For help interpreting the
results, view the NetPhos output format.
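Before looking at the predictions, it can help to know how many candidate residues each sequence contains. The short Python sketch below lists the 1-based positions of every serine, threonine and tyrosine in a sequence. The function name is hypothetical and the code makes no prediction itself; it only counts the residues that a predictor could score.

```python
def candidate_sites(sequence):
    """Return 1-based positions of every S, T and Y in the sequence."""
    sites = {"S": [], "T": [], "Y": []}
    for position, residue in enumerate(sequence, start=1):
        if residue in sites:
            sites[residue].append(position)
    return sites


sites = candidate_sites("MSTAYS")
print(sites)                                  # {'S': [2, 6], 'T': [3], 'Y': [5]}
print({r: len(p) for r, p in sites.items()})  # {'S': 2, 'T': 1, 'Y': 1}
```

Comparing these counts with the number of sites reported by the predictor shows how many candidate residues were not called as phosphorylation sites.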
Problem 16
How many (a) serine, (b) threonine, and
(c) tyrosine phosphorylation sites are predicted for Sequence 1?
Problem 17
How many (a) serine, (b) threonine, and
(c) tyrosine phosphorylation sites are predicted for Sequence 2?
Problem 18
How many (a) serine, (b) threonine, and
(c) tyrosine phosphorylation sites are predicted for Sequence 3?
Problem 19
Are there any serine, threonine or tyrosine residues in the sequences that were not listed as potential phosphorylation sites?
If so, explain why some of the residues were not listed as predicted phosphorylation sites. (Those uncertain about the answer to this question should view the above link explaining the output.)
Once a protein sequence has been determined through proteomics techniques,
bioinformatics can be used to predict certain types of topology. Topology is
the sequence of secondary structure elements within a protein. The most basic
secondary structure elements within proteins are the alpha helix, the beta
sheet and the random coil. However, some algorithms will predict topological
features that are closely related to in vivo localization, such as signal
sequences and transmembrane helices.
At the ExPASy Proteomics Tools server, scroll down to the section entitled
"topology prediction". This section contains tools that predict localization
and sorting signals, as well as transmembrane regions within proteins.
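One classic way to flag possible transmembrane segments is to average hydropathy values over a sliding window, using the Kyte-Doolittle scale (13). The Python sketch below does this. The window length of 19 residues and the cutoff of 1.6 are commonly quoted defaults and should be treated as assumptions, as should the function names. The tools in this section use more elaborate methods.

```python
# Kyte-Doolittle hydropathy values (13).
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def window_hydropathy(sequence, window=19):
    """Average hydropathy of every full-length window along the sequence."""
    scores = []
    for start in range(len(sequence) - window + 1):
        segment = sequence[start:start + window]
        scores.append(sum(KYTE_DOOLITTLE[r] for r in segment) / window)
    return scores


def hydrophobic_windows(sequence, window=19, cutoff=1.6):
    """1-based start positions of windows whose average exceeds the cutoff."""
    scores = window_hydropathy(sequence, window)
    return [i + 1 for i, score in enumerate(scores) if score > cutoff]


# Tiny demonstration with a short window so the arithmetic is easy to follow.
print(window_hydropathy("IIIKK", window=3))
print(hydrophobic_windows("IIIKK", window=3))  # [1, 2]
```

A sequence shorter than the window yields no scores at all, so the window length must be chosen with the sequence length in mind.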
PSORT (5) is a computer program for the prediction of protein localization.
It requires input of an amino acid sequence and its source organism, and it
searches for known, organism-specific protein sorting signals. It returns a
list of candidate localization sites, each accompanied by a score indicating
the probability that the protein encoded by the input sequence would be
localized to that site. To explore the use of PSORT, click on the PSORT link
on the ExPASy tool page. Choose "PSORT II" for eukaryotic sequences, and
select the PSORT II Prediction. Cut and paste the following sequence for
diacylglycerol kinase from Rattus norvegicus into the query box and click
"Submit".
MEPRDPSPEARSSDSESASASSSGSERDADPEPDKAPRRLTKRRFPGLRLFGHRKAITKSGLQHLAPPPP
TPGAPCGESERQIRSTVDWSESAAYGEHIWFETNVSGDFCYVGEQYCVAKMLPKSAPRRKCAACKIVVHT
PCIGQLEKINFRCKPSFRESGSRNVREPTFVRHHWVHRRRQDGKCRHCGKGFQQKFTFHSKEIVAISCSW
CKQAYHSKVSCFMLQQIEEPCSLGVHAAVVIPPTWILRARRPQNTLKASKKKKRASFKRRSSKKGPEEGR
WRPFIIRPTPSPLMKPLLVFVNPKSGGNQGAKIIQSFLWYLNPRQVFDLSQGGPREALEMYRKVHNLRIL
ACGGDGTVGWILSTLDQLRLKPPPPVAILPLGTGNDLARTLNWGGGYTDEPVSKILSHVEEGNVVQLDRW
DLRAEPNPEAGPEERDDGATDRLPLDVFNNYFSLGFDAHVTLEFHESREANPEKFNSRFRNKMFYAGTAF
SDFLMGSSKDLAKHIRVVCDGMDLTPKIQDLKPQCIVFLNIPRYCAGTMPWGHPGEHHDFEPQRHDDGYL
EVIGFTMTSLAALQVGGHGERLTQCREVLLTTAKAIPVQVDGEPCKLAASRIRIALRNQATMVQKAKRRS
TAPLHSDQQPVPEQLRIQVSRVSMHDYEALHYDKEQLKEASVPLGTVVVPGDSDLELCRAHIERLQQEPD
GAGAKSPMCHPLSSKWCFLDATTASRFYRIDRAQEHLNYVTEIAQDEIYILDPELLGASARPDLPTPTSP
LPASPCSPTPGSLQGDAALPQGEELIEAAKRNDFCKLQELHRAGGDLMHRDHQSRTLLHHAVSTGSKEVV
RYLLDHAPPEILDAVEENGETCLHQAAALGQRTICHYIVEAGASLMKTDQQGDTPRQRAEKAQDTELAAY
LENRQHYQMIQREDQETAV
First, view the "k-NN" results by scrolling to the bottom of the page.
The k-nearest neighbor (k-NN) algorithm takes the output of the many
subprograms and determines a probability of localization at each candidate
site within the cell using all of the predictions.
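The general k-nearest neighbor idea can be sketched in a few lines of Python. Each protein is represented by a vector of feature values, the k training proteins closest to the query are found, and the fraction of those neighbors carrying each label is reported as its probability. The two-feature vectors, the labels and the function name below are invented for illustration, and the real classifier (5) uses the outputs of the PSORT subprograms as features.

```python
import math
from collections import Counter


def knn_probabilities(query, training, k=3):
    """Estimate label probabilities from the k nearest training examples.

    training is a list of (feature_vector, label) pairs; distances are
    Euclidean.
    """
    ranked = sorted(training, key=lambda item: math.dist(query, item[0]))
    votes = Counter(label for _, label in ranked[:k])
    return {label: count / k for label, count in votes.items()}


# Made-up training set: two features per protein, one known localization each.
training = [
    ((0.9, 0.1), "nuclear"),
    ((0.8, 0.2), "nuclear"),
    ((0.7, 0.3), "nuclear"),
    ((0.1, 0.9), "cytoplasmic"),
    ((0.2, 0.8), "cytoplasmic"),
]

print(knn_probabilities((0.85, 0.15), training, k=3))  # {'nuclear': 1.0}
```

Because the probabilities are simple vote fractions, they can only take values that are multiples of 1/k.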
Problem 20
What is the probability the sequence encodes a protein that is
(a) secreted by vesicles?
(b) localized to the endoplasmic reticulum?
(c) cytoplasmic? or
(d) localized to the nucleus?
Now, scroll through the results of the subprograms. Clicking on the links will
reveal a brief description of the algorithm each individual subprogram utilizes.
Problem 21
What is the localization prediction and reliability
score produced by the NNCN subprogram, Reinhardt's method for
cytoplasmic/nuclear discrimination?
The first two subprograms, PSG and GvH, are tools that predict N-terminal signal peptide
sequences. Just after their results are listed, there is a statement
summarizing whether or not an N-terminal signal peptide has been predicted
for the query sequence.
Problem 22
Do these subprograms predict an N-terminal
signal peptide for the diacylglycerol kinase query?
Problem 23
After looking over all the results,
what is the most likely localization of our query protein?
Read the title and abstract for this article on the rat diacylglycerol kinase
used for the query sequence.
Problem 24
Was PSORT able to predict the correct localization,
using the sequence information alone?
Return to the ExPASy tools, and scroll to the section entitled "primary
structure analysis". Click on the link for the ProtParam tool. ProtParam is a
suite of programs designed to predict various chemical and physical properties
of a protein from its sequence. ProtParam will yield an estimated extinction
coefficient at selected wavelengths based on protein sequence (6), an
estimation of the in vivo half-life of the protein (7, 8, 9, 10), an
instability index (11), an aliphatic index (12), and an average value for
hydropathicity (13).
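Two of these quantities are simple enough to sketch directly in Python. The extinction coefficient is a weighted count of tyrosine, tryptophan and cystine residues (6), and the aliphatic index is a weighted sum of the mole percentages of alanine, valine, isoleucine and leucine (12). The 280 nm coefficients below (1280, 5690 and 120) are the values usually attributed to Gill and von Hippel for denaturing conditions, and they are an assumption here, as are the function names. Current versions of ProtParam may use different coefficients, so results need not match the server exactly.

```python
# Assumed molar extinction coefficients at 280 nm (M^-1 cm^-1).
EXT_TYR = 1280
EXT_TRP = 5690
EXT_CYSTINE = 120


def extinction_coefficient(sequence, all_cys_paired=True):
    """Estimate the 280 nm extinction coefficient from residue counts.

    With all_cys_paired=True every pair of cysteines is counted as one
    cystine (that is, all cysteines appear as half cystines).
    """
    n_cystine = sequence.count("C") // 2 if all_cys_paired else 0
    return (sequence.count("Y") * EXT_TYR
            + sequence.count("W") * EXT_TRP
            + n_cystine * EXT_CYSTINE)


def aliphatic_index(sequence):
    """Aliphatic index: X(Ala) + 2.9 X(Val) + 3.9 (X(Ile) + X(Leu)),
    where X is the mole percent of each residue."""
    def mole_percent(residue):
        return 100.0 * sequence.count(residue) / len(sequence)

    return (mole_percent("A")
            + 2.9 * mole_percent("V")
            + 3.9 * (mole_percent("I") + mole_percent("L")))


print(extinction_coefficient("WYCC"))                        # 7090
print(extinction_coefficient("WYCC", all_cys_paired=False))  # 6970
print(aliphatic_index("AVIL"))
```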
Cut and paste the rat diacylglycerol
kinase sequence above into the query box and click on "compute parameters".
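The amino acid composition and the molecular weight reported by such a tool follow directly from the sequence. The Python sketch below is a minimal version, assuming standard average residue masses and an unmodified protein. The function names are hypothetical, and small differences from the server's output are possible because of rounding or different mass tables.

```python
from collections import Counter

# Average residue masses in daltons (standard values).
AVERAGE_MASS = {
    "A": 71.0788, "R": 156.1875, "N": 114.1038, "D": 115.1086,
    "C": 103.1388, "E": 129.1155, "Q": 128.1307, "G": 57.0519,
    "H": 137.1411, "I": 113.1594, "L": 113.1594, "K": 128.1741,
    "M": 131.1926, "F": 147.1766, "P": 97.1167, "S": 87.0782,
    "T": 101.1051, "W": 186.2132, "Y": 163.1760, "V": 99.1326,
}
WATER = 18.01528  # one water is added for the free N- and C-termini


def composition(sequence):
    """Percent composition per residue, most common residue first."""
    counts = Counter(sequence)
    return {residue: 100.0 * count / len(sequence)
            for residue, count in counts.most_common()}


def molecular_weight(sequence):
    """Average molecular weight of an unmodified protein chain."""
    return sum(AVERAGE_MASS[residue] for residue in sequence) + WATER


print(composition("AAGK"))                # {'A': 50.0, 'G': 25.0, 'K': 25.0}
print(round(molecular_weight("G"), 2))    # 75.07
```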
Problem 25
What is the molecular weight computed from the sequence?
Problem 26
What does the amino acid composition analysis show as the most common amino acid in this protein? (Is that unusual?)
Problem 27
What is the chemical formula for the query protein?
Problem 28
What is the predicted extinction coefficient at 280 nm, in 6 M guanidinium HCl, 0.02 M phosphate buffer at pH 6.5, assuming all cysteines appear as half cystines?
Problem 29
In what way could it be helpful to know the extinction coefficient?
Problem 30
According to the instability index, is this protein classified as stable or unstable?
Return again to the ExPASy tools. Notice there are two sections dealing with
structure prediction: secondary structure prediction tools, and tertiary
structure prediction and visualization tools. The secondary structure
prediction tools are designed to predict features such as the helical content,
the beta sheet formations, and the turns, loops, and coil regions within a
protein, given the sequence.
Problem 31
Explore the secondary
structure tools independently, and submit the diacylglycerol kinase sequence
above to any of the available secondary structure prediction tools.
Most of these tools will email the results, with at least
a 20-minute delay between submission and receipt of results. Forward a
results summary to the instructor, outlining the predictions made by
the program of choice.
Tertiary structure prediction tools match the query
sequence with sequences, or partial sequences, of proteins where the 3-D
structure has been published in the Protein Data Bank (PDB). These tools
will produce a model of the query protein by piecing together the structural
regions from the best matches in the PDB, and threading the query sequence
through the predicted structure. For more detailed explanations of available
3-D structure prediction software, view the Swiss-Model demo page and the
Geno3D reference page. Although both of these tools search for templates
among existing PDB entries, they do so in different ways.
Problem 32
What program does Swiss-Model use to match the
query sequence with sequences of known structures?
Problem 33
What program does Geno3D use to match the query
sequence with sequences of known structures?
Notice that the template selection process and the model structure refinement
processes are different between these two programs as well.
Finally, in the tertiary structure section of the ExPASy tools page, Swiss PDB
Viewer is a graphical tool for the visualization, comparison and analysis of
3-D coordinate files. Swiss PDB Viewer can superimpose 3-D structures by
finding the rotation and translation that most closely aligns the two protein
structures. Additionally, the Swiss PDB Viewer will perform amino acid
mutations, prediction of hydrogen bonds, and calculation of angles and
distances between atoms. Best of all, Swiss PDB Viewer is freeware and
available for many different platforms, including Macintosh, PC, SGI IRIX,
and Linux.
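The distance and angle calculations mentioned above are plain vector geometry on atomic coordinates. The Python sketch below shows one way to compute them. The coordinates and function names are made up for illustration and do not reflect how Swiss PDB Viewer is implemented.

```python
import math


def distance(atom_a, atom_b):
    """Euclidean distance between two atoms given as (x, y, z) tuples."""
    return math.dist(atom_a, atom_b)


def angle(atom_a, atom_b, atom_c):
    """Angle in degrees at atom_b, formed by atom_a, atom_b and atom_c."""
    b_to_a = [a - b for a, b in zip(atom_a, atom_b)]
    b_to_c = [c - b for c, b in zip(atom_c, atom_b)]
    dot = sum(x * y for x, y in zip(b_to_a, b_to_c))
    cosine = dot / (math.hypot(*b_to_a) * math.hypot(*b_to_c))
    # Clamp to [-1, 1] to guard against floating-point round-off.
    return math.degrees(math.acos(max(-1.0, min(1.0, cosine))))


print(distance((0.0, 0.0, 0.0), (3.0, 4.0, 0.0)))                 # 5.0
print(angle((1.0, 0.0, 0.0), (0.0, 0.0, 0.0), (0.0, 1.0, 0.0)))   # 90.0
```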
Problem 34
View this supplemental SPDBV web page.
What other function does Swiss PDB Viewer have, when used in conjunction with
other applications such as OpenGL or POV-Ray?
ExPASy provides a very large library of tools, for proteomics as well as other
bioinformatics applications. For those students interested in future research
in the field of proteomics, this web server will be an important resource.
References
1. Wilkins et al. (1995). Progress with gene product mapping of the Mollicutes. Electrophoresis, 16:1090-1094.
2. Appel R.D., Bairoch A., Hochstrasser D.F. (1994). A new generation of information retrieval tools for biologists: the example of the ExPASy WWW server. Trends Biochem. Sci., 19:258-260.
3. Mulder N.J., Apweiler R., Attwood T.K., Bairoch A., Barrell D., Bateman A., Binns D., Biswas M., Bradley P., Bork P., Bucher P., Copley R.R., Courcelle E., Das U., Durbin R., Falquet L., Fleischmann W., Griffiths-Jones S., Haft D., Harte N., Hulo N., Kahn D., Kanapin A., Krestyaninova M., Lopez R., Letunic I., Lonsdale D., Silventoinen V., Orchard S.E., Pagni M., Peyruc D., Ponting C.P., Selengut J.D., Servant F., Sigrist C.J.A., Vaughan R., Zdobnov E.M. (2003). The InterPro Database, 2003 brings increased coverage and new features. Nucl. Acids Res., 31:315-318.
4. Blom N., Gammeltoft S., Brunak S. (1999). Sequence- and structure-based prediction of eukaryotic protein phosphorylation sites. J. Mol. Biol., 294:1351-1362.
5. Horton P., Nakai K. (1997). Better prediction of protein cellular localization sites with the k nearest neighbors classifier. Intelligent Systems for Molecular Biology, 5:147-152.
6. Gill S.C., von Hippel P.H. (1989). Calculation of protein extinction coefficients from amino acid sequence data. Anal. Biochem., 182:319-326.
7. Bachmair A., Finley D., Varshavsky A. (1986). In vivo half-life of a protein is a function of its amino-terminal residue. Science, 234:179-186.
8. Gonda D.K., Bachmair A., Wunning I., Tobias J.W., Lane W.S., Varshavsky A. (1989). Universality and structure of the N-end rule. J. Biol. Chem., 264:16700-16712.
9. Tobias J.W., Shrader T.E., Rocap G., Varshavsky A. (1991). The N-end rule in bacteria. Science, 254:1374-1377.
10. Ciechanover A., Schwartz A.L. (1989). How are substrates recognized by the ubiquitin-mediated proteolytic system? Trends Biochem. Sci., 14:483-488.
11. Guruprasad K., Reddy B.V.B., Pandit M.W. (1990). Correlation between stability of a protein and its dipeptide composition: a novel approach for predicting in vivo stability of a protein from its primary sequence. Protein Engineering, 4:155-161.
12. Ikai A. (1980). Thermostability and aliphatic index of globular proteins. J. Biochem., 88:1895-1898.
13. Kyte J., Doolittle R.F. (1982). A simple method for displaying the hydropathic character of a protein. J. Mol. Biol., 157:105-132.