
# Protein Folding and Secondary Structure Prediction

Module by: Susan Cates

Summary: This is a brief introduction to the protein folding problem and methods for protein secondary structure prediction. The "Folding@home" project, Stanford University, is used as an example of one approach to studying the protein folding problem. Various secondary structure prediction tools are introduced.

Proteins are the building blocks of cells and organs. The biochemical processes that keep organisms alive are catalyzed and regulated by a particular category of proteins called enzymes. Proteins are linear polymers of amino acids that fold into complex conformations dictated by the physical and chemical properties of the amino acid chain. The biological function of a protein depends on the protein folding into the correct, or "native", state.

Biologists describe protein structure at four levels: primary structure, which is the amino acid sequence; secondary structure, in which the polypeptide backbone assembles into local regions of alpha-helices, beta-sheets, coils and turns; tertiary structure, which is the entire 3-dimensional structure of the protein; and quaternary structure, which describes interactions between separate polypeptide chains, called subunits, in some large protein complexes.

Computational methods have been developed that can predict protein secondary structure with a reasonable degree of accuracy. Methods also exist for predicting tertiary structure, but their accuracy depends heavily on whether the protein in question is related in sequence to any member of the existing library of known protein structures. The development of ab initio tools to predict the complete structural fold of a protein from its amino acid sequence alone is a burgeoning field in computational biology, but attainment of this goal remains distant.

Protein folding is usually a spontaneous process. Often, a protein that has unfolded because of heat or chemical denaturation can refold into the correct conformation as soon as it is removed from the denaturing environment, meaning that folding and unfolding under these circumstances are reversible. Protein folding can also go wrong for many reasons. When an egg is boiled, the proteins in the white unfold and misfold into a solid mass of protein that will not refold or redissolve. In a similar way, irreversibly misfolded proteins form the insoluble protein aggregates that are found in certain tissues and are characteristic of some diseases, such as Alzheimer's Disease.

Determining the process by which proteins fold into particular shapes, characteristic of their amino acid sequence, is commonly called "the protein folding problem". One approach to studying the folding process applies statistical mechanics techniques and simulations. (1) These methods allow the investigation of larger systems than methods that try to represent atomic detail in their simulations of biological molecules, and have had success correlating the computational folding model with folding intermediates and transition states that have been experimentally measured for a limited test set of relatively large proteins.

The Stanford University Folding@home project (2) takes a different approach, using an atomistic model of protein folding in a solvent environment. Large-scale distributed computing allows it to reach timescales thousands to millions of times longer than previously achievable with a model of this detail. Look at the menu on the left border of the Stanford Folding@home web page. Click on the "Science" link to read the scientific background behind the protein folding distributed computing project.

## Exercise 1

What are the 3 functions of proteins that are mentioned in the "What are proteins?" section of the scientific background?

## Exercise 2

What are 3 diseases that are believed to result from protein misfolding?

## Exercise 3

What are typical timescales for molecular dynamics simulations?

## Exercise 4

What are typical timescales at which the fastest proteins fold?

## Exercise 5

How does the Stanford group break the microsecond barrier with their simulations?

Return to the Stanford Folding@home home page. Click on the "Results" link in the left border of the web page. Look at the information on the folding simulations of the villin headpiece.

## Exercise 6

How many amino acids are in the simulated villin headpiece?

## Exercise 7

How does this compare with the number of amino acids in a typical protein?

## Exercise 8

Taking into consideration the size of the biological molecules in these simulations and the requirements that necessitated using large scale distributed computing methods for the simulations, what are the biggest impediments to understanding the protein folding problem?

Although attempts at predicting tertiary and quaternary structure from the amino acid sequence of proteins are relatively new, methods for predicting protein secondary structure have existed for some time. Depending on the method, secondary structure can be predicted with approximately 60 - 70% accuracy. Originally, empirical prediction methods were based on tables that listed, for each amino acid, the frequency with which it was found in alpha-helices, beta-sheets, turns and random coil. Current methods usually employ machine learning in the form of neural networks trained on sets of sequences with known structure. The choice of this training set is critical to the accuracy of the method. Given the ever-increasing number of known structural folds, selecting a representative training set that includes many proteins of diverse structure has become easier.
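
The table-based idea described above can be illustrated with a minimal sketch. It is not any published method: the three labelled sequences are invented toy data, the state letters (H = helix, E = strand, C = coil), the window size and the function names are all assumptions made for illustration. The sketch counts how often each amino acid occurs in each state, converts the counts to propensities, and assigns each residue the state with the highest average propensity over a short window.

```python
from collections import Counter, defaultdict

# Toy labelled data, invented for illustration only: (sequence, states).
# Real frequency tables were derived from experimentally solved structures.
TRAINING = [
    ("AEELLKKAEELLKK", "HHHHHHHHHHHHHH"),
    ("VTVTVYVGSGNGPG", "EEEEEEECCCCCCC"),
    ("GSPNGAEELLKVTV", "CCCCCHHHHHEEEE"),
]


def build_propensity_table(examples):
    """Return table[aa][state] = P(state | aa) / P(state)."""
    aa_state = defaultdict(Counter)
    state_total = Counter()
    for seq, states in examples:
        for aa, st in zip(seq, states):
            aa_state[aa][st] += 1
            state_total[st] += 1
    n = sum(state_total.values())
    table = {}
    for aa, counts in aa_state.items():
        aa_n = sum(counts.values())
        table[aa] = {
            st: (counts[st] / aa_n) / (state_total[st] / n)
            for st in state_total
        }
    return table


def predict(seq, table, window=5):
    """Give each residue the state with the highest mean propensity
    over a window centred on it (shorter at the sequence ends)."""
    half = window // 2
    states = sorted({st for row in table.values() for st in row})
    out = []
    for i in range(len(seq)):
        neighbours = seq[max(0, i - half): i + half + 1]

        def score(st):
            # Residues absent from the table get a neutral propensity of 1.0.
            vals = [table.get(aa, {}).get(st, 1.0) for aa in neighbours]
            return sum(vals) / len(vals)

        out.append(max(states, key=score))
    return "".join(out)


table = build_propensity_table(TRAINING)
print(predict("AEELLKKAEELLKK", table))
```

With only three toy sequences the table is far too small to be meaningful; the point is the mechanism, which also shows why the choice of training data matters so much.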

Use the amino acid sequence below to explore some structure prediction tools. This is the sequence for lac repressor, a protein involved in gene regulation that is known to have both alpha-helical and beta-sheet structure:


>gi|33112645|sp|P03023|LACI_ECOLI Lactose operon repressor
MKPVTLYDVAEYAGVSYQTVSRVVNQASHVSAKTREKVEAAMAELNYIPNRVAQQLAGKQSLLIGVATSS
TNVPALFLDVSDQTPINSIIFSHEDGTRLGVEHLVALGHQQIALLAGPLSSVSARLRLAGWHKYLTRNQI
RQVSRLESGQ


A quick and simple analysis of protein secondary structure can be performed with the nnpredict tool at UCSF. (3) When pasting the above sequence into the query page, notice that there is a separate field for the name of the sequence: the first line of the FASTA-format sequence above should be entered there, separately from the rest of the sequence. Compare the nnpredict results to the actual secondary structure of lac repressor, known from the crystal structure with PDB ID 1LBI. (4)
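
Separating the name line from the residues can also be done programmatically. The following is a minimal sketch for a single FASTA record; the function name `split_fasta` is hypothetical, and the record is copied from the listing above.

```python
def split_fasta(record):
    """Split one FASTA record into (name, sequence).

    The name is the first line without its leading '>'; the sequence is
    the remaining lines joined together with whitespace removed.
    """
    lines = [ln.strip() for ln in record.strip().splitlines() if ln.strip()]
    if not lines or not lines[0].startswith(">"):
        raise ValueError("expected a FASTA record starting with '>'")
    name = lines[0][1:]
    sequence = "".join(lines[1:]).upper()
    return name, sequence


record = """>gi|33112645|sp|P03023|LACI_ECOLI Lactose operon repressor
MKPVTLYDVAEYAGVSYQTVSRVVNQASHVSAKTREKVEAAMAELNYIPNRVAQQLAGKQSLLIGVATSS
TNVPALFLDVSDQTPINSIIFSHEDGTRLGVEHLVALGHQQIALLAGPLSSVSARLRLAGWHKYLTRNQI
RQVSRLESGQ
"""

name, sequence = split_fasta(record)
print(name)       # goes in the "name" field of a query form
print(sequence)   # goes in the "sequence" field
```

For files containing many records, an established parser such as Biopython's `SeqIO` is usually a better choice than hand-written code.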


Sequence and secondary structure - lac Repressor
Data is from PDB accession number 1LBI, RCSB Protein Data Bank.
The legend for the assignments is:
H=helix; B=residue in isolated beta bridge; E=extended beta strand;
G=3-10 helix; I=pi helix; T=hydrogen bonded turn; S=bend.

1 MKPVTLYDVA EYAGVSYQTV SRVVNQASHV SAKTREKVEA AMAELNYIPN

51 RVAQQLAGKQ SLLIGVATSS LALHAPSQIV AAIKSRADQL GASVVVSMVE
EEEEEEES S  HHHHHHH HHHHHHHHHH T EEEEEEE

101 RSGVEACKTA VHNLLAQRVS GLIINYPLDD QDAIAVEAAC TNVPALFLDV
SSHHHHHHHH HHHHHHS  S EEEEES   S TTHHHHHHTS  SS EEESSS

151 SDQTPINSII FSHEDGTRLG VEHLVALGHQ QIALLAGPLS SVSARLRLAG
TTSSS EEE E TTHHHHHH HHHHHHHT    EEEEE  SS SSHHHHTHHH

201 WHKYLTRNQI QPIAEREGDW SAMSGFQQTM QMLNEGIVPT AMLVANDQMA
HHHHHTTTT    SEEEE  S SHHHHHHHHH HHHTTT   S EEEESSHHHH

251 LGAMRAITES GLRVGADISV VGYDDTEDSS CYIPPLTTIK QDFRLLGQTS
HHHHHHHHTT TTTBTTTEEE E SB  TTGG GSSS   EEE   HHHHHHHH

301 VDRLLQLSQG QAVKGNQLLP VSLVKRKTTL APNTQTASPR ALADSLMQLA
HHHHHHHHT   S  S EEE   EEE  TT   S TTS   HH HHHHHHHHHH

351 RQVSRLESGQ
HHHHHH



## Exercise 9

Look for regions identified as alpha-helical by nnpredict, but not identified as alpha-helical in the actual secondary structure features listed above, and vice versa. Are there any regions of 3 consecutive amino acids or more that differ? If so, how many alpha-helical regions differ and what residue numbers are involved?

## Exercise 10

Look for regions identified as beta-sheet by nnpredict, but not identified as beta-sheet in the actual secondary structure features listed above, and vice versa. Are there any regions of 3 consecutive amino acids or more that differ? If so, how many beta-sheet regions differ and what residue numbers are involved?
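
The comparison asked for in Exercises 9 and 10 can be checked with a short script once the predicted and observed assignments are written as two strings of equal length, one character per residue. The sketch below is one way to do this; the function name and the two example strings are hypothetical and are not real nnpredict or PDB output.

```python
def differing_regions(predicted, actual, state, min_len=3):
    """Return 1-based (start, end) runs of at least min_len residues
    where exactly one of the two strings assigns `state`."""
    if len(predicted) != len(actual):
        raise ValueError("strings must be aligned to the same length")
    regions, start = [], None
    for i, (p, a) in enumerate(zip(predicted, actual), start=1):
        differs = (p == state) != (a == state)
        if differs and start is None:
            start = i                      # a disagreement run begins
        elif not differs and start is not None:
            if i - start >= min_len:       # run covered start .. i-1
                regions.append((start, i - 1))
            start = None
    # Close a run that extends to the end of the sequence.
    if start is not None and len(predicted) - start + 1 >= min_len:
        regions.append((start, len(predicted)))
    return regions


# Hypothetical aligned strings: H = helix, E = strand, '-' = neither.
predicted = "--HHHHHHHH--EEEE--HHHH"
actual    = "--HHHHH-----EEEEEEEHHH"

print(differing_regions(predicted, actual, "H"))
print(differing_regions(predicted, actual, "E"))
```

Residue numbers in the output are relative to the start of the strings, so an offset must be added if the strings do not begin at residue 1.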

## Exercise 11

The PDB entry for the crystal structure of lac repressor remarks that the N-terminal residues 1 - 61 and the C-terminal residues 358 - 360 are not seen in the electron density. How would this affect the assignment of actual secondary structure shown above?

A more complete sequence analysis tool that includes secondary structure prediction can be found at the PredictProtein server at Columbia University. (5) On the home page, select the tab for submission to enter a protein sequence for secondary structure prediction. The instructions indicate that you should enter only the amino acid sequence, so omit the first line when you paste in the FASTA-format sequence above for lac repressor. Click on the "Results on the site, not in email" option. The server will still send an email with a link to the results. Run the prediction tool for the lac repressor sequence. When the email arrives, click on the link for your results. The secondary structure prediction algorithms are within the section entitled PROF. This is the section needed to answer the following questions.

## Exercise 12

Give a brief summary comparing the PredictProtein results with the actual secondary structure from the PDB, listed above, and with the results from nnpredict.
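
A single number can support such a comparison: the fraction of residues whose predicted state matches the observed one. The sketch below makes several assumptions that should be checked against the tools used. It reduces the eight structure codes in the legend above to three states using one common convention (H, G, I as helix; E, B as strand; everything else as coil), it treats any predicted character other than H or E as coil, and it uses a hypothetical `?` marker for residues that have no experimental assignment. The example strings are invented.

```python
# One common 8-to-3 state reduction; other conventions exist, so this
# mapping is an assumption rather than a standard.
EIGHT_TO_THREE = {"H": "H", "G": "H", "I": "H", "E": "E", "B": "E"}


def three_state_accuracy(predicted, observed, unobserved="?"):
    """Fraction of residues whose predicted state (H, E, or coil) matches
    the observed code after reduction to three states. Positions marked
    with `unobserved` in the observed string are skipped."""
    if len(predicted) != len(observed):
        raise ValueError("strings must be aligned to the same length")
    pairs = [(p, o) for p, o in zip(predicted, observed) if o != unobserved]
    if not pairs:
        raise ValueError("no residues with an observed assignment")
    hits = sum(
        1
        for p, o in pairs
        if (p if p in "HE" else "C") == EIGHT_TO_THREE.get(o, "C")
    )
    return hits / len(pairs)


# Invented example: 16 residues, the first two without an assignment.
predicted = "CCHHHHHHHCEEEECC"
observed  = "??HHHHGGTTEEEBSS"

print(round(three_state_accuracy(predicted, observed), 3))
```

Running the same function on the nnpredict and PredictProtein outputs against the same observed string gives two directly comparable scores.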

With the publication of entire genomes containing the sequences of many uncharacterized proteins, the ability to predict the final folded structure of a protein from its sequence would be extremely valuable. Although this is not yet a practical reality, tools exist that can predict secondary structure with some accuracy, and inroads are being made toward solving the protein folding problem. Elucidating the mechanisms behind protein folding would provide important knowledge for fighting diseases in which misfolded proteins are implicated.

## References

1. C. Clementi, H. Nymeyer and J. N. Onuchic. (2000). Topological and energetic factors: what determines the structural details of the transition state ensemble and 'on-route' intermediates for protein folding? An investigation for small globular proteins. Journal of Molecular Biology, 298: 937-953.
2. B. Zagrovic, E. Sorin and V. Pande. (2001). Beta-hairpin folding simulations in atomistic detail using an implicit solvent model. Journal of Molecular Biology, 313: 151-169.
3. D. G. Kneller, F. E. Cohen and R. Langridge. (1990). Improvements in protein secondary structure prediction by an enhanced neural network. Journal of Molecular Biology, 214: 171-182.
4. M. Lewis, G. Chang, N. C. Horton, M. A. Kercher, H. C. Pace, M. A. Schumacher, R. G. Brennan and P. Lu. (1996). Crystal structure of the lactose operon repressor and its complexes with DNA and inducer. Science, 271: 1247.
5. B. Rost. (1996). PHD: predicting one-dimensional protein structure by profile based neural networks. Methods in Enzymology, 266: 525-539.
