# Expasy Proteomics Tools

Module by: Susan Cates

Summary: This module describes the many proteomics tools available from the ExPASy website. Tools are introduced for protein identification and characterization from amino acid composition, peptide mass fingerprinting and other mass spectrometry techniques. Also included in this module is an introduction to profile and pattern searches, tools for predicting post-translational protein modifications, tools for protein topology prediction, primary structure analysis, secondary structure prediction, and tertiary structure prediction and visualization.

A proteome is the collection of all the proteins within a given organism, in the same way that a genome is the collection of all the genes within a given organism. A proteome differs from a genome in important ways, however. A principal difference is that, while an organism carries essentially the same DNA in every undamaged, healthy cell throughout its lifetime, its proteins differ greatly from one tissue to another and from one life stage to another. Furthermore, proteins commonly undergo a variety of chemical modifications after they are made. These modifications are critical for proper protein function and regulation, and they cannot be determined with certainty from the DNA sequence alone. In a contemporary high-throughput proteomics laboratory, the number of proteins identified and analyzed in one day can be on the order of hundreds.

The term "proteome" was originally coined by an Australian scientist, Mark Wilkins (1), to describe the "PROTEin complement of the genOME". The term "proteomics" is used relatively loosely to describe the collection of high-throughput techniques that enable scientists to analyze all the proteins expressed under a given set of conditions within an individual cell or organism. The ExPASy (Expert Protein Analysis System) website (2) of the Swiss Institute of Bioinformatics offers the definition that "proteomics can be defined as the qualitative and quantitative comparison of proteomes under different conditions to further unravel biological processes."

Common techniques for identifying the proteins within a proteome are 2D-PAGE (two-dimensional polyacrylamide gel electrophoresis), amino acid (AA) composition analysis, peptide mass fingerprinting and other mass spectrometry applications. A good starting point for becoming acquainted with 2D gels is the 2D PAGE tutorial offered by the Institute of Biological Sciences, University of Wales at Aberystwyth. For those who are unfamiliar with these methods, ExPASy offers a good synopsis of peptide mass fingerprinting and AA composition analysis.
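As a rough illustration of the idea behind peptide mass fingerprinting, the sketch below digests a sequence in silico and computes the mass of each resulting peptide. It assumes a simplified trypsin rule (cleave after K or R unless the next residue is P), uses standard monoisotopic residue masses, and ignores missed cleavages and modifications. The function names and the matching tolerance are illustrative choices, not part of any ExPASy tool.

```python
# Monoisotopic residue masses in daltons (standard values).
MONO = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056  # one water is added per free peptide


def tryptic_digest(sequence):
    """Split after K or R, except when the next residue is P (simplified rule)."""
    peptides, start = [], 0
    for i, aa in enumerate(sequence):
        is_last = i + 1 == len(sequence)
        if aa in "KR" and not is_last and sequence[i + 1] != "P":
            peptides.append(sequence[start:i + 1])
            start = i + 1
    if start < len(sequence):
        peptides.append(sequence[start:])
    return peptides


def peptide_mass(peptide):
    """Monoisotopic mass of an unmodified peptide."""
    return sum(MONO[aa] for aa in peptide) + WATER


def shared_masses(observed, theoretical, tolerance=0.5):
    """Count observed masses that fall within `tolerance` Da of any theoretical mass."""
    return sum(any(abs(o - t) <= tolerance for t in theoretical) for o in observed)


if __name__ == "__main__":
    for pep in tryptic_digest("AGKRPLKD"):
        print(pep, round(peptide_mass(pep), 4))
```

A database search repeats this calculation for every candidate protein and ranks the candidates by how many observed masses they explain.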

At the ExPASy Proteomics Tools server, the first category of tools is for protein identification and characterization. Take a look at the tools listed in this section. These tools are designed to identify the proteins that make up the proteome under study, using the data obtained from gels, AA analysis and mass spectrometry experiments.
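To make the notion of an amino acid composition concrete, here is a minimal sketch that converts a sequence into percent composition and compares two compositions. The distance measure (a sum of squared differences) is a simple stand-in chosen for illustration; it is not claimed to be the scoring used by any ExPASy tool.

```python
from collections import Counter


def aa_composition(sequence):
    """Return {residue: percent of all residues} for a protein sequence."""
    sequence = "".join(sequence.split()).upper()  # drop whitespace and line breaks
    total = len(sequence)
    if total == 0:
        return {}
    counts = Counter(sequence)
    return {aa: 100.0 * n / total for aa, n in counts.items()}


def composition_distance(comp_a, comp_b):
    """Sum of squared differences between two composition profiles (illustrative)."""
    residues = set(comp_a) | set(comp_b)
    return sum((comp_a.get(r, 0.0) - comp_b.get(r, 0.0)) ** 2 for r in residues)


if __name__ == "__main__":
    measured = aa_composition("AAGK")
    candidate = aa_composition("AAGR")
    print(measured)
    print(composition_distance(measured, candidate))
```

A smaller distance means the candidate's composition agrees more closely with the measured one.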

## Exercise 1

What tool from the ExPASy "protein identification and characterization" section would you use for identifying a protein for which you only know the amino acid composition?

## Exercise 2

What is the name of at least one peptide mass fingerprint tool at the ExPASy site?

## Exercise 3

Generally outline the underlying principles that allow the identification of a protein through peptide mass fingerprinting.

Scroll down on the ExPASy tools webpage to the section entitled "pattern and profile searches". The tools in this section are designed to identify proteins that belong to well-characterized protein families, which are usually recognized by domains conserved among family members. Well-known protein motifs, or domains, are also represented independently of their protein families in pattern databases that capture the conserved aspects of the domain sequence. Select the tool entitled "InterPro Scan" (3) to perform an integrated search in PROSITE, Pfam, PRINTS and other family and domain databases. This tool is useful for identifying specific domains or motifs within a protein once the sequence has been determined, and it can sometimes recognize the protein as a member of an established protein family. Test the tool with the following sequences, one at a time, making sure that the interactive run button is selected. An email address is required to submit the job, but the results can be viewed interactively in the browser.


>Seq1
MAGIAAKLAKDREAAEGLGSHERAIKYLNQDYEALRNECLEAGTLFQDPSFPAIPSALGFKELGPYSSKT
FQFWQYGEWVEVVVDDRLPTKDGELLFVHSAEGSEFWSALLEKAYAKINGCYEALSGGATTEGFEDFTGG
LIRIRNPWGEVEWTGRWNDNCPSWNTIDPEERERLTRRHEDGEFWMSFSDFLRHYSRLEICNLTPDTLTS
DTYKKWKLTKMDGNWRRGSTAGGCRNYPNTFWMNPQYLIKLEEEDEDEEDGESGCTFLVGLIQKHRRRQR
KMGEDMHTIGFGIYEVPEELSGQTNIHLSKNFFLTNRARERSDTFINLREVLNRFKLPPGEYILVPSTFE
LAKRQDIKSDGFSIETCKIMVDMLDSDGSGKLGLKEFYILWTKIQKYQKIYREIDVDRSGTMNSYEMRKA

>Seq2
SGPRPVVLSGPSGAGKSTLLKRLLQEHSGIFGFSVSHTTRNPRPGEENGKDYYFVTREVM
QRDIAAGDFIEHAEFSGNLYGTSKVAVQAVQAMNRICVLDVDLQGVRNIKATDLRPIYIS
KEALSEEIKKAQRTGA

>Seq3
MTEVISNKITAKDGATSLKDIDDKRWVWISDPETAFTKAWIKEDLPDKKYVVRYNNSRDE
KIVGEDEIDPVNPAKFDRVNDMAELTYLNEPAVTYNLEQRYLSDQIYTYSGLFLVAVNPY
KRIIQYLAAIASSTTVGSSQVEEQIIKTNPVLESFGNARTVRNNNSSRFGKFIKVEFSLS
NPDEIDKLCHLLGVSPELFSQNLVRPRIKAGHEWVISARSQTQVISSIEALAKAIYERNF
GWLVKRLNTSLNHSNAQSYFIGILDIAGFEIFEKNSFEQLCINYTNEKLQQFFNHHMFVL
EQEEYMKEEIVWDFIDFGHDLQPTIDLIEKANPIGILSCLDEECVMPKATDATFTSKLDA
TLFSDYQETETKTVRGRTKKGLFRTVAQRHKEQLNQLMNQFNSTQPHFIRCIVPNEEKKM
HTFNRPLVLGQLRCNGVLEGIRITRAGFPNRLPFNDFRVRYEIMAHLPTGTYVESRRASV
MILEELKIDEASYRIGVSKIFFKAGVLAELEERRVATLQRLMTMLQTRIRGFLQRKIFQK
RLKDIQAIKLLQANLQVYNEFRTFPWAKLFFNLRPLLSSTQNDKQLKKRDAEIIELKYEL
KKQQNSKSEVERDLVETNNSLTAVENLLTTERAIALDKEEILRRTQERLANIEDSFSETK
QQNENLQRESASLKQINNELESELLEKTSKVETLLSEQNELKEKLSLEEKDLLDTKGELE
SLRENNATVLSEKAEFNEQCKSLQETIVTKDAELDKLTKYISDYKTEIQEMRLTNQKMNE
KSIQQEGSLSESLKRVKKLERENSTLISDVSILKQQKEELSVLKGVQELTINNLEEKVNY
QNLSDASLKYIELQEIHENLLLKVSDLENYKKKYEGLQLDLEGLKDVDTNFQELSKKHRD
MHEYSQLGKTFEDEKRKALIASRDNEELRSLKSELESKRKLEVEYQKVLEEVKTTRSLRS
NIPAVKTTEPVLKNIPQRKTIFDLQQRNANQALYENLKRDYDRLNLEKHNLEKQVNELKG
AEVSPQPTGQSLQHVNLAHAIELKALKDQINSEKAKMFSVQVQYEKREQELQKRIASLEK
VNKDSLIDVRALRDRIASLEDELRAA



View the results for Sequence 1. The first column of the results table indicates whether each match is of type "family" or of type "domain". The family and domain names appear at the top of each box in the second column. That column also contains the diagrams showing where in the sequence the matched family or domain is located.
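To give a feel for what a pattern search does under the hood, the sketch below translates a PROSITE-style pattern into a regular expression and scans a sequence with it. The example uses the well-known P-loop (ATP/GTP-binding site motif A) pattern `[AG]-x(4)-G-K-[ST]` and the first line of Seq2 above. The converter handles only the basic PROSITE syntax elements and is an illustration, not the implementation used by InterPro Scan.

```python
import re


def prosite_to_regex(pattern):
    """Convert basic PROSITE pattern syntax to a Python regular expression."""
    regex = pattern.rstrip(".").replace("-", "")
    regex = regex.replace("x", ".")                     # x = any residue
    regex = regex.replace("{", "[^").replace("}", "]")  # {P} = anything but P
    regex = regex.replace("(", "{").replace(")", "}")   # x(4) = repeat count
    regex = regex.replace("<", "^").replace(">", "$")   # sequence termini
    return regex


def find_motif(pattern, sequence):
    """Return (1-based start, matched text) for each non-overlapping match."""
    compiled = re.compile(prosite_to_regex(pattern))
    return [(m.start() + 1, m.group()) for m in compiled.finditer(sequence)]


if __name__ == "__main__":
    seq2_start = "SGPRPVVLSGPSGAGKSTLLKRLLQEHSGIFGFSVSHTTRNPRPGEENGKDYYFVTREVM"
    print(prosite_to_regex("[AG]-x(4)-G-K-[ST]"))
    print(find_motif("[AG]-x(4)-G-K-[ST]", seq2_start))
```

Short patterns like this one match many unrelated proteins by chance, which is one reason integrated searches also rely on profiles and other domain models.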

## Exercise 4

How many matches were of the type "family"?

## Exercise 5

How many were domains?

## Exercise 6

What are the names of the families identified with this sequence?

## Exercise 7

List any domains that were identified within Sequence 1.

View the results for Sequence 2.

## Exercise 8

How many families were returned as matches?

## Exercise 9

How many domains?

## Exercise 10

What families were identified with this sequence?

## Exercise 11

List any domains that were identified within Sequence 2.

View the results for Sequence 3.

## Exercise 12

How many families were returned as matches?

## Exercise 13

How many domains?

## Exercise 14

What families were identified with this sequence?

## Exercise 15

List any domains that were identified within Sequence 3.

Return to the ExPASy Proteomics Tools server. Now, scroll down to the section entitled "post-translational modification prediction". Use NetPhos (4) to predict possible sites for serine, threonine and tyrosine phosphorylation on the three sequences above (all 3 sequences can be entered as one query). Accept the default values and select "submit". For help interpreting the results, view the NetPhos output format.
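Before reading the prediction output, it can help to know how many residues are candidates at all. The sketch below simply lists every serine, threonine and tyrosine in a sequence with its position. It performs no prediction; it only provides the total against which the number of predicted sites can be compared. The function names are illustrative.

```python
from collections import Counter


def candidate_sites(sequence, residues="STY"):
    """Return (1-based position, residue) for every S, T or Y in the sequence."""
    return [(i, aa) for i, aa in enumerate(sequence, start=1) if aa in residues]


def candidate_counts(sequence, residues="STY"):
    """Count how many of each candidate residue the sequence contains."""
    counts = Counter(aa for aa in sequence if aa in residues)
    return {aa: counts.get(aa, 0) for aa in residues}


if __name__ == "__main__":
    example = "MSTAYKS"  # toy sequence, not one of the module's sequences
    print(candidate_sites(example))
    print(candidate_counts(example))
```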

## Exercise 16

How many (a) serine, (b) threonine, and (c) tyrosine phosphorylation sites are predicted for Sequence 1?

## Exercise 17

How many (a) serine, (b) threonine, and (c) tyrosine phosphorylation sites are predicted for Sequence 2?

## Exercise 18

How many (a) serine, (b) threonine, and (c) tyrosine phosphorylation sites are predicted for Sequence 3?

## Exercise 19

Are there any serine, threonine or tyrosine residues in the sequences that were not listed as potential phosphorylation sites? If so, explain why some of these residues were not predicted to be phosphorylation sites. (Those uncertain about the answer to this question should view the above link explaining the output.)

Once a protein sequence has been determined through proteomics techniques, bioinformatics can be used to predict certain types of topology. Topology is the sequence of secondary structure elements within a protein. The most basic secondary structure elements within proteins are the alpha helix, the beta sheet and the random coil. However, some algorithms will predict topological features that are closely related to in vivo localization, such as signal sequences and transmembrane helices.

At the ExPASy Proteomics Tools server, scroll down to the section entitled "topology prediction". This section contains tools that predict localization and sorting signals, as well as transmembrane regions within proteins. PSORT (5) is a computer program for the prediction of protein localization. It takes an amino acid sequence and its source organism as input and searches for known, organism-specific protein sorting signals. It returns a list of candidate localization sites, each accompanied by a score indicating the probability that the protein encoded by the input sequence is localized to that site. To explore the use of PSORT, click on the PSORT link on the ExPASy tools page. Choose "PSORT II" for eukaryotic sequences, and select the PSORT II Prediction. Cut and paste the following sequence for diacylglycerol kinase from Rattus norvegicus into the query box and click "Submit".


TPGAPCGESERQIRSTVDWSESAAYGEHIWFETNVSGDFCYVGEQYCVAKMLPKSAPRRKCAACKIVVHT
PCIGQLEKINFRCKPSFRESGSRNVREPTFVRHHWVHRRRQDGKCRHCGKGFQQKFTFHSKEIVAISCSW
WRPFIIRPTPSPLMKPLLVFVNPKSGGNQGAKIIQSFLWYLNPRQVFDLSQGGPREALEMYRKVHNLRIL
ACGGDGTVGWILSTLDQLRLKPPPPVAILPLGTGNDLARTLNWGGGYTDEPVSKILSHVEEGNVVQLDRW
DLRAEPNPEAGPEERDDGATDRLPLDVFNNYFSLGFDAHVTLEFHESREANPEKFNSRFRNKMFYAGTAF
SDFLMGSSKDLAKHIRVVCDGMDLTPKIQDLKPQCIVFLNIPRYCAGTMPWGHPGEHHDFEPQRHDDGYL
EVIGFTMTSLAALQVGGHGERLTQCREVLLTTAKAIPVQVDGEPCKLAASRIRIALRNQATMVQKAKRRS
TAPLHSDQQPVPEQLRIQVSRVSMHDYEALHYDKEQLKEASVPLGTVVVPGDSDLELCRAHIERLQQEPD
GAGAKSPMCHPLSSKWCFLDATTASRFYRIDRAQEHLNYVTEIAQDEIYILDPELLGASARPDLPTPTSP
LPASPCSPTPGSLQGDAALPQGEELIEAAKRNDFCKLQELHRAGGDLMHRDHQSRTLLHHAVSTGSKEVV
RYLLDHAPPEILDAVEENGETCLHQAAALGQRTICHYIVEAGASLMKTDQQGDTPRQRAEKAQDTELAAY
LENRQHYQMIQREDQETAV



First, view the "k-NN" results by scrolling to the bottom of the page. The k-nearest neighbor (k-NN) algorithm takes the output of the many subprograms and determines a probability of localization at each candidate site within the cell using all of the predictions.
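The k-nearest neighbor idea can be sketched in a few lines: each protein is described by a vector of feature scores, and the query is assigned to localization sites according to the labels of the k most similar training proteins. The feature vectors, labels and value of k below are made-up toy data for illustration only; they do not reflect the features or training set that PSORT II actually uses.

```python
import math
from collections import Counter


def knn_probabilities(query, training, k=3):
    """Estimate label probabilities as label fractions among the k nearest neighbors.

    `training` is a list of (feature_vector, label) pairs.
    """
    nearest = sorted(training, key=lambda item: math.dist(query, item[0]))[:k]
    votes = Counter(label for _, label in nearest)
    return {label: n / len(nearest) for label, n in votes.items()}


# Hypothetical two-feature training set (toy values, not real PSORT features).
TOY_TRAINING = [
    ((0.9, 0.1), "nuclear"),
    ((0.8, 0.2), "nuclear"),
    ((0.7, 0.3), "nuclear"),
    ((0.2, 0.9), "cytoplasmic"),
    ((0.1, 0.8), "cytoplasmic"),
]

if __name__ == "__main__":
    print(knn_probabilities((0.75, 0.25), TOY_TRAINING, k=3))
    print(knn_probabilities((0.75, 0.25), TOY_TRAINING, k=5))
```

With a small k only the closest neighbors vote; a larger k smooths the estimate but lets more distant, possibly unrelated proteins influence the result.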

## Exercise 20

What is the probability the sequence encodes a protein that is (a) secreted by vesicles? (b) localized to the endoplasmic reticulum? (c) cytoplasmic? or (d) localized to the nucleus?

Now, scroll through the results of the subprograms. Clicking on the links will reveal a brief description of the algorithm each individual subprogram utilizes.

## Exercise 21

What is the localization prediction and reliability score produced by the NNCN subprogram, Reinhardt's methods for cytoplasmic/nuclear discrimination?

The first two subprograms, PSG and GvH, are tools that predict N-terminal signal peptide sequences. Just after their results are listed, there is a statement summarizing whether or not an N-terminal signal peptide has been predicted for the query sequence.

## Exercise 22

Do these subprograms predict an N-terminal signal peptide for the diacylglycerol kinase query?

## Exercise 23

After looking over all the results, what is the most likely localization of our query protein?

Read the title and abstract for this article on the Rat diacylglycerol kinase used for the query sequence.

## Exercise 24

Was PSORT able to predict the correct localization, using the sequence information alone?

Return to the ExPASy tools, and scroll to the section entitled "primary structure analysis". Click on the link for the ProtParam tool. ProtParam is a suite of programs designed to predict various chemical and physical properties of a protein from its sequence. ProtParam will yield an estimated extinction coefficient at selected wavelengths based on protein sequence (6), an estimate of the in vivo half-life of the protein (7, 8, 9, 10), an instability index (11), an aliphatic index (12), and an average value for hydropathicity (13). Cut and paste the rat diacylglycerol kinase sequence above into the query box and click on "compute parameters".
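Several of these parameters are simple functions of the sequence, and a short sketch shows how they can be computed. The hydropathy values are the Kyte and Doolittle scale (13), and the aliphatic index follows Ikai's formula (12) using mole percent of Ala, Val, Ile and Leu. The extinction coefficient defaults are assumed to be the Gill and von Hippel model-compound values at 280 nm (Trp 5690, Tyr 1280, cystine 120); treat them as assumptions and check them against reference 6 or the ProtParam documentation, since the server's current coefficients may differ.

```python
# Kyte-Doolittle hydropathy scale.
KD = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5, "Q": -3.5,
    "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5, "L": 3.8, "K": -3.9,
    "M": 1.9, "F": 2.8, "P": -1.6, "S": -0.8, "T": -0.7, "W": -0.9,
    "Y": -1.3, "V": 4.2,
}


def gravy(sequence):
    """Grand average of hydropathicity: mean Kyte-Doolittle value per residue."""
    return sum(KD[aa] for aa in sequence) / len(sequence)


def aliphatic_index(sequence):
    """Ikai's aliphatic index from mole percent of Ala, Val, Ile and Leu."""
    n = len(sequence)

    def mole_percent(aa):
        return 100.0 * sequence.count(aa) / n

    return (mole_percent("A") + 2.9 * mole_percent("V")
            + 3.9 * (mole_percent("I") + mole_percent("L")))


def extinction_280(sequence, all_cys_paired=True,
                   e_trp=5690, e_tyr=1280, e_cystine=120):
    """Estimated molar extinction coefficient at 280 nm (M^-1 cm^-1).

    Default coefficients are assumed values; pass others to override.
    If all_cys_paired is True, every pair of Cys counts as one cystine.
    """
    cystines = sequence.count("C") // 2 if all_cys_paired else 0
    return (sequence.count("W") * e_trp
            + sequence.count("Y") * e_tyr
            + cystines * e_cystine)


if __name__ == "__main__":
    toy = "AVILWYCC"  # toy sequence, not the module's query
    print(gravy(toy), aliphatic_index(toy), extinction_280(toy))
```

Together with a measured absorbance, the extinction coefficient gives the protein concentration through the Beer-Lambert law.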

## Exercise 25

What is the molecular weight computed from the sequence?

## Exercise 26

What does the amino acid composition analysis show as the most common amino acid in this protein? (Is that unusual?)

## Exercise 27

What is the chemical formula for the query protein?

## Exercise 28

What is the predicted extinction coefficient at 280 nm, in 6 M guanidinium HCl, 0.02 M phosphate, pH 6.5 buffer, assuming all cysteines appear as half cystines?

## Exercise 29

In what way could it be helpful to know the extinction coefficient?

## Exercise 30

According to the instability index, is this protein classified as stable or unstable?

Return again to the ExPASy tools. Notice that there are two sections dealing with structure prediction: secondary structure prediction tools, and tertiary structure prediction and visualization tools. Given a sequence, the secondary structure prediction tools are designed to predict features such as helical content, beta sheet formation, and the turns, loops and coil regions within a protein.

## Exercise 31

Explore the secondary structure tools independently, and submit the diacylglycerol kinase sequence above to any of the available secondary structure prediction tools. Most of these tools will email the results, with at least a 20 minute delay between submission and receipt of results. Forward a results summary to the instructor, outlining the predictions created by the program of choice.

Tertiary structure prediction tools match the query sequence with sequences, or partial sequences, of proteins where the 3-D structure has been published in the Protein Data Bank (PDB). These tools will produce a model of the query protein by piecing together the structural regions from the best matches in the PDB, and threading the query sequence through the predicted structure. For more detailed explanations of available 3-D structure prediction software, view the Swiss-Model demo page and the Geno3D reference page. Although both of these tools are searching for templates from existing PDB entries, they are doing this in different ways.

## Exercise 32

What program does Swiss-Model use to match the query sequence with sequences of known structures?

## Exercise 33

What program does Geno3D use to match the query sequence with sequences of known structures?

Notice that the template selection process and the model structure refinement processes are different between these two programs as well.

Finally, in the tertiary structure section of the ExPASy tools page, Swiss PDB Viewer is a graphical tool for the visualization, comparison and analysis of 3-D coordinate files. Swiss PDB Viewer can superimpose 3-D structures by finding the rotation and translation that most closely aligns the two protein structures. Additionally, the Swiss PDB Viewer will perform amino acid mutations, prediction of hydrogen bonds, and calculation of angles and distances between atoms. Best of all, Swiss PDB Viewer is freeware and available for many different platforms, including Macintosh, PC, SGI IRIX, and Linux.
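The distance and angle calculations mentioned above are straightforward geometry on atomic coordinates. The sketch below shows both, assuming each atom is given as an (x, y, z) tuple in ångströms such as the coordinates found in a PDB file; the function names are illustrative and unrelated to Swiss PDB Viewer's internals.

```python
import math


def distance(a, b):
    """Euclidean distance between two atoms given as (x, y, z) tuples."""
    return math.dist(a, b)


def angle(a, b, c):
    """Angle in degrees at atom b formed by atoms a-b-c."""
    ba = [p - q for p, q in zip(a, b)]
    bc = [p - q for p, q in zip(c, b)]
    dot = sum(p * q for p, q in zip(ba, bc))
    cos_theta = dot / (math.hypot(*ba) * math.hypot(*bc))
    # Clamp to [-1, 1] to guard against floating-point rounding.
    return math.degrees(math.acos(max(-1.0, min(1.0, cos_theta))))


if __name__ == "__main__":
    n_atom = (0.0, 0.0, 0.0)   # toy coordinates, not from a real structure
    ca_atom = (1.5, 0.0, 0.0)
    c_atom = (1.5, 1.5, 0.0)
    print(distance(n_atom, ca_atom))
    print(angle(n_atom, ca_atom, c_atom))
```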

## Exercise 34

View this supplemental SPDBV web page. What other function does Swiss PDB Viewer have, when used in conjunction with other applications such as OpenGL or POV-Ray?

ExPASy provides a very large library of tools for proteomics as well as other bioinformatics applications. For students interested in future research in the field of proteomics, this web server will be an important resource.

## References

1. Wilkins et al. (1995). Progress with gene product mapping of the Mollicutes. Electrophoresis, 16:1090-1094.
2. Appel R.D., Bairoch A., Hochstrasser D.F. (1994). A new generation of information retrieval tools for biologists: the example of the ExPASy WWW server. Trends Biochem. Sci., 19:258-260.
3. Mulder N.J., Apweiler R., Attwood T.K., Bairoch A., Barrell D., Bateman A., Binns D., Biswas M., Bradley P., Bork P., Bucher P., Copley R.R., Courcelle E., Das U., Durbin R., Falquet L., Fleischmann W., Griffiths-Jones S., Haft D., Harte N., Hulo N., Kahn D., Kanapin A., Krestyaninova M., Lopez R., Letunic I., Lonsdale D., Silventoinen V., Orchard S.E., Pagni M., Peyruc D., Ponting C.P., Selengut J.D., Servant F., Sigrist C.J.A., Vaughan R, Zdobnov E.M. (2003). The InterPro Database, 2003 brings increased coverage and new features. Nucl. Acids. Res., 31:315-318.
4. Blom N., Gammeltoft S., Brunak S. (1999). Sequence- and structure-based prediction of eukaryotic protein phosphorylation sites. J. Mol. Biol., 294:1351-1362.
5. Horton P., Nakai K. (1997). Better prediction of protein cellular localization sites with the k nearest neighbors classifier. Intelligent Systems for Molecular Biology, 5:147-152.
6. Gill S.C., von Hippel P.H. (1989). Calculation of protein extinction coefficients from amino acid sequence data. Anal. Biochem., 182:319-326.
7. Bachmair A., Finley D., Varshavsky A. (1986). In vivo half-life of a protein is a function of its amino-terminal residue. Science, 234:179-186.
8. Gonda D.K., Bachmair A., Wunning I., Tobias J.W., Lane W.S., Varshavsky A. (1989). Universality and structure of the N-end rule. J. Biol. Chem., 264:16700-16712.
9. Tobias J.W., Shrader T.E., Rocap G., Varshavsky A. (1991). The N-end rule in bacteria. Science, 254:1374-1377.
10. Ciechanover A., Schwartz A.L. (1989). How are substrates recognized by the ubiquitin-mediated proteolytic system? Trends Biochem. Sci., 14:483-488.
11. Guruprasad K., Reddy B.V.B., Pandit M.W. (1990). Correlation between stability of a protein and its dipeptide composition: a novel approach for predicting in vivo stability of a protein from its primary sequence. Protein Engineering, 4:155-161.
12. Ikai A. (1980). Thermostability and aliphatic index of globular proteins. J. Biochem., 88:1895-1898.
13. Kyte J., Doolittle R.F. (1982). A simple method for displaying the hydropathic character of a protein. J. Mol. Biol., 157:105-132.
