Structural Computational Biology: Introduction and Background

Module by: Lydia E. Kavraki

Summary: This module contains motivational and biochemical background material for a computer scientist beginning to learn about computational structural biology.

Proteins and Their Significance to Biology and Medicine

Proteins are the molecular workhorses of all known biological systems. Among other functions, they are the motors that cause muscle contraction, the catalysts that drive life-sustaining chemical processes, and the molecules that hold cells together to form tissues and organs.

The following is a list of a few of the diverse biological processes mediated by proteins:

  • Proteins called enzymes catalyse vital reactions, such as those involved in metabolism, cellular reproduction, and gene expression.
  • Regulatory proteins control the location and timing of gene expression.
  • Cytokines, hormones, and other signalling proteins transmit information between cells.
  • Immune system proteins recognize and tag foreign material for attack and removal.
  • Structural proteins prevent cells from collapsing on themselves, as well as forming large structures such as hair, nails, and the protective, largely impermeable outer layer of skin. They also provide a framework along which molecules can be transported within cells.

Estimates of the number of genes in the human genome have changed considerably since the genome was first annotated (the latest gene count estimates can be found in the Wikipedia article on the human genome). Each protein-coding gene encodes one or more distinct proteins, and because of alternative splicing the total number of distinct proteins in the human body is larger than the number of genes. Of those, only a small fraction have been isolated and studied to the point that their purpose and mechanism of activity are well understood. If the functions of all proteins and the relationships among them were fully understood, we would most likely have a much better understanding of how our bodies work and what goes wrong in diseases such as cancer, amyotrophic lateral sclerosis, Parkinson's, heart disease, and many others. As a result, protein science is a very active field. As the field has progressed, computer-aided modeling and simulation of proteins have found their place among the methods available to researchers.

Protein Structure

An amino acid is a simple organic molecule consisting of a basic (hydrogen-accepting) amine group bound to an acidic (hydrogen-donating) carboxyl group via a single intermediate carbon atom:

Figure 1: A generic α-amino acid. The "R" group is variable, and is the only difference between the 20 common amino acids. This form is called a zwitterion, because it carries both positively and negatively charged groups. The zwitterionic state results from the amine group (NH2) gaining a hydrogen ion from solution, and the acidic carboxyl group (COOH) losing one.
An α-amino acid (aminoacid.jpg)

During the translation of a gene into a protein, the protein is formed by the sequential joining of amino acids end-to-end to form a long chain-like molecule, or polymer. A polymer of amino acids is often referred to as a polypeptide. The genome is capable of coding for 20 different amino acids whose chemical properties depend on the composition of their side chains ("R" in the above figure). Thus, to a first approximation, a protein is nothing more than a sequence of these amino acids (or, more properly, amino acid residues, because both the amine and acid groups lose their acid/base properties when they are part of a polypeptide). This sequence is called the primary structure of the protein.

Figure 2: A generic polypeptide chain. The bonds shown in yellow, which connect separate amino acid residues, are called peptide bonds.
A polypeptide (peptide_chain.jpg)

The Wikipedia entry on amino acids provides a more detailed background, including the structure, properties, abbreviations, and genetic codes for each of the 20 common amino acids.
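
Because the primary structure is simply an ordered sequence over a 20-letter alphabet, it maps naturally onto a string. The following is a minimal sketch in Python, assuming the standard one-letter amino acid codes. The function names and the example peptide are hypothetical and chosen only for illustration.

```python
from collections import Counter
from typing import Dict

# Standard one-letter codes for the 20 common amino acids.
AMINO_ACIDS = frozenset("ACDEFGHIKLMNPQRSTVWY")


def validate_sequence(sequence: str) -> str:
    """Return the sequence in upper case, or raise ValueError on unknown letters."""
    cleaned = sequence.strip().upper()
    unknown = sorted(set(cleaned) - AMINO_ACIDS)
    if unknown:
        raise ValueError(f"not one of the 20 common amino acids: {unknown}")
    return cleaned


def composition(sequence: str) -> Dict[str, int]:
    """Count how many times each residue type occurs in the sequence."""
    return dict(Counter(validate_sequence(sequence)))


if __name__ == "__main__":
    peptide = "MKTAYIAKQR"  # hypothetical 10-residue peptide, not a real protein
    print(len(validate_sequence(peptide)), "residues")
    print(composition(peptide))
```

A plain string is enough for the primary structure; the later sections explain why it is not enough to describe function, which depends on the folded three-dimensional shape.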

The primary structure of a protein is easily obtainable from its corresponding gene sequence, as well as by experimental manipulation. Unfortunately, the primary structure is only indirectly related to the protein's function. In order to work properly, a protein must fold to form a specific three-dimensional shape, called its native structure or native conformation. The three-dimensional structure of a protein is usually understood in a hierarchical manner. Secondary structure refers to folding in a small part of the protein that forms a characteristic shape. The most common secondary structure elements are α-helices and β-sheets, one or both of which are present in almost all natural proteins.

Figure 3: α-helices, rendered three different ways. Left is a typical cartoon rendering, in which the helix is depicted as a cylinder. Center shows a trace of the backbone of the protein. Right shows a space-filling model of the helix, and is the only rendering that shows all atoms (including those on side chains).
Secondary Structure: α-helix (alpha_helices.JPG)

Figure 4: β-sheets represented in three different rendering modes: cartoon, ribbon, and bond representations.
Secondary Structure: β-sheet
(a) Cartoon representation (beta_sheet_cartoon.JPG). Different parts of the polypeptide strand align with each other to form a β-sheet. This β-sheet is anti-parallel, because adjacent segments of the protein run in opposite directions.
(b) Ribbon representation (beta_sheet_ribbon.JPG). β-sheets are sometimes referred to as β pleated sheets, because of the regular zig-zag of the strands evident in this representation.
(c) Bond representation (beta_sheet_bond.JPG). Each segment in this representation represents a bond. Unlike the other two representations, side chains are illustrated. Note the alignment of oxygen atoms (red) toward nitrogen atoms (blue) on adjacent strands. This alignment is due to hydrogen bonding, the primary interaction involved in stabilizing secondary structure.

Tertiary structure refers to structural elements formed by bringing more distant parts of a chain together into structural domains. The spatial arrangement of these domains with respect to each other is also considered part of the tertiary structure. Finally, many proteins consist of more than one polypeptide folded together, and the spatial relationship between these separate polypeptide chains is called the quaternary structure.

It is important to note that the native conformation of a protein is a direct consequence of its primary sequence and its chemical environment, which for most proteins is either aqueous solution with a biological pH (roughly neutral) or the oily interior of a cell membrane. Nevertheless, predicting the native structure from the amino acid sequence alone is a difficult computational problem and a topic of ongoing research. Thus, in order to find the native structure of a protein, experimental techniques are deployed. The most common approaches are outlined in the next section.

Experimental Methods for Protein Structure Determination

A structure of a protein is a three-dimensional arrangement of the atoms such that the integrity of the molecule (its connectivity) is maintained. The goal of a protein structure determination experiment is to find a set of three-dimensional (x, y, z) coordinates for each atom of the molecule in some natural state. Of particular interest is the native structure, that is, the structure assumed by the protein under its biological conditions, as well as structures assumed by the protein when in the process of interacting with other molecules. Brief sketches of the major structure determination methods follow:

X-ray Crystallography

The most commonly used and usually highest-resolution method of structure determination is x-ray crystallography. To obtain structures by this method, laboratory biochemists obtain a very pure, crystalline sample of a protein. X-rays are then passed through the sample, in which they are diffracted by the electrons of each atom of the protein. The diffraction pattern is recorded, and can be used to reconstruct the three-dimensional pattern of electron density, and therefore, within some error, the location of each atom. A high-resolution crystal structure has a resolution on the order of 1 to 2 Angstroms (Å). One Angstrom is roughly the diameter of a hydrogen atom (10^-10 meter, or one hundred-millionth of a centimeter).
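
The unit arithmetic above can be checked with a few lines of Python. This is only an illustrative sketch; the constant and function names are hypothetical.

```python
ANGSTROM_IN_METERS = 1e-10  # 1 Å = 10^-10 m, as stated above


def angstroms_to(value_in_angstroms: float, unit: str) -> float:
    """Convert a length in Angstroms to meters ("m"), centimeters ("cm"), or nanometers ("nm")."""
    meters = value_in_angstroms * ANGSTROM_IN_METERS
    factors = {"m": 1.0, "cm": 1e2, "nm": 1e9}  # multipliers applied to meters
    return meters * factors[unit]


if __name__ == "__main__":
    for resolution in (1.0, 2.0):  # the high-resolution range quoted above
        print(resolution, "Å =", angstroms_to(resolution, "nm"), "nm")
```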

Unlike other structure determination methods, x-ray crystallography has no fundamental limit on the size of the molecule or complex to be studied. However, in order for the method to work, a pure, crystalline sample of the protein must be obtained. For many proteins, including many membrane-bound receptors, this is not possible. In addition, a single x-ray diffraction experiment provides only static information: it describes the structure of the protein only under the particular experimental conditions used. As we will see later, proteins are often flexible, dynamic objects when in their natural state in solution, so a single structure, while useful, may not tell the full story. More information on x-ray crystallography is available at Crystallography 101 and on Wikipedia.

NMR

Nuclear Magnetic Resonance (NMR) spectroscopy has recently come into its own as a protein structure determination method. In an NMR experiment, a sample of the protein being studied is placed in a very strong magnetic field, which forces any magnetic atomic nuclei into alignment, and radio-frequency pulses are transiently applied to perturb that alignment. The signal given off by a nucleus as it relaxes back to equilibrium is characteristic of its chemical environment. Information about the atoms within two chemical bonds of the resonating nucleus can be deduced, and, more importantly, information about which atoms are spatially near each other can also be found. The latter information leads to a large system of distance constraints between the atoms of the protein, which can then be solved to find a three-dimensional structure. Resolution of NMR structures is variable and depends strongly on the flexibility of the protein. Because NMR is performed on proteins in solution, they are free to undergo spatial rearrangements, so for flexible parts of the protein, there may be more than one detectable structure. In fact, NMR structures are generally reported as ensembles of 20-50 distinct structures. This makes NMR well suited to elucidating the behavior of intrinsically unstructured proteins, that is, proteins that lack a well-defined tertiary structure. The reported ensemble may also provide insight into the dynamics of the protein, that is, the ways in which it tends to move.
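
To make the idea of a system of distance constraints concrete, the sketch below checks a candidate set of atomic coordinates against upper bounds on interatomic distances. It only tests whether a given structure satisfies the constraints; it does not solve for a structure, and it is not how any particular NMR software package works. The atom labels, coordinates, and bounds are made up for illustration.

```python
import math
from typing import Dict, List, Tuple

Point = Tuple[float, float, float]
# A restraint says: atoms `a` and `b` are at most `upper` Angstroms apart.
Restraint = Tuple[str, str, float]


def violations(coords: Dict[str, Point],
               restraints: List[Restraint]) -> List[Tuple[str, str, float]]:
    """Return (a, b, excess) for every restraint whose bound is exceeded."""
    broken = []
    for a, b, upper in restraints:
        distance = math.dist(coords[a], coords[b])
        if distance > upper:
            broken.append((a, b, distance - upper))
    return broken


if __name__ == "__main__":
    # Made-up coordinates (in Angstroms) for three hypothetical atoms.
    candidate = {
        "H1": (0.0, 0.0, 0.0),
        "H2": (3.0, 0.0, 0.0),
        "H3": (0.0, 8.0, 0.0),
    }
    restraints = [("H1", "H2", 5.0), ("H1", "H3", 5.0)]
    for a, b, excess in violations(candidate, restraints):
        print(f"{a}-{b} exceeds its bound by {excess:.1f} Å")
```

A structure calculation would search for coordinates that leave this list of violations empty, or as small as possible. Because many different coordinate sets can satisfy the same constraints, an ensemble of structures is a natural way to report the result.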

NMR structure determination is generally limited to proteins smaller than 25-30 kilodaltons (kDa), because the signals from different atoms start to overlap and become difficult to resolve in that range. Additionally, the proteins must be soluble in concentrations of 0.2-0.5 mM without aggregation or precipitation. For more information on how NMR is used to find molecular structures, please see NMR Basics and The World of NMR: Magnets, Radio Waves, and Detective Work at the National Institutes of Health's The Structures of Life website.
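
As a rough illustration of the size limit above, the sketch below estimates a protein's mass from its length and compares it with the lower end of the quoted range. It assumes an average residue mass of about 110 Da, a common rule of thumb that does not come from this module; real masses depend on the actual sequence. The function names are hypothetical.

```python
AVERAGE_RESIDUE_MASS_DA = 110.0  # assumed rule-of-thumb average, not an exact value


def estimated_mass_kda(num_residues: int) -> float:
    """Estimate the mass of a polypeptide in kilodaltons from its residue count."""
    return num_residues * AVERAGE_RESIDUE_MASS_DA / 1000.0


def within_nmr_size_range(num_residues: int, limit_kda: float = 25.0) -> bool:
    """Return True if the estimated mass is at or below the given limit."""
    return estimated_mass_kda(num_residues) <= limit_kda


if __name__ == "__main__":
    for length in (100, 200, 300):
        print(length, "residues:", estimated_mass_kda(length), "kDa,",
              "within range" if within_nmr_size_range(length) else "likely too large")
```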

Electron Diffraction

Electron diffraction works under the same principle as x-ray crystallography, but instead of x-rays, electrons are used to probe the structure. Because of difficulties in obtaining and interpreting electron diffraction data, it is rarely used for protein structure determination. Nevertheless, ED structures do exist in the PDB. For more on ED, see this Wikipedia article.

Structure Prediction of Large Complexes

Large macromolecular complexes and molecular machines present a particular challenge in structure determination. These objects are generally too large to be crystallized and too complex to solve by NMR, so determining their structure usually requires high-resolution microscopy combined with computational refinement and analysis. The main techniques used are cryo-electron microscopy (Cryo-EM) and standard light microscopy.

Protein Structure Repositories

Most of the protein structures discovered to date can be found in a large public-domain repository called the RCSB Protein Data Bank (PDB), which contains experimentally determined three-dimensional structures of proteins. The majority of the structures in the PDB have been determined by x-ray crystallography, but the number determined using NMR methods has been increasing as efficient computational techniques to derive structures from NMR data have been developed. A few electron diffraction structures are also available.

The PDB was originally established at Brookhaven National Laboratory in October 1971, with 7 structures. Currently, the database is maintained by Rutgers, the State University of New Jersey; the San Diego Supercomputer Center at the University of California, San Diego; and the National Institute of Standards and Technology. The European Bioinformatics Institute Macromolecular Structure Database group (UK) and the Institute for Protein Research at Osaka University (Japan) are international contributors to the contents of the PDB. The current number of structures (proteins and/or nucleic acids) in the PDB is displayed at the top-right corner of the main PDB page. Statistics on the experimental methods used (i.e., which methods were used for what fraction of the structures), as well as other classifications, are also published by the PDB.
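
Structures from the PDB can be downloaded as text files. In the legacy PDB text format, each atom occupies one fixed-width line, with the atom name, residue name, chain identifier, residue number, and (x, y, z) coordinates in set columns. The following is a minimal sketch of reading such lines in Python. The sample lines are hand-written for illustration and are not taken from any real entry, and the class and function names are hypothetical.

```python
from typing import List, NamedTuple


class Atom(NamedTuple):
    name: str
    residue: str
    chain: str
    residue_number: int
    x: float
    y: float
    z: float


# Hand-written sample records in the fixed-column PDB layout (not real data).
SAMPLE_LINES = [
    "ATOM      1  N   MET A   1      27.340  24.430   2.614",
    "ATOM      2  CA  MET A   1      26.266  25.413   2.842",
    "ATOM      3  N   ILE B   1      10.000  12.500  -3.250",
    "HETATM    4  O   HOH A 201       5.000   6.000   7.000",
]


def parse_atoms(lines: List[str]) -> List[Atom]:
    """Parse ATOM records; all other record types are skipped."""
    atoms = []
    for line in lines:
        if not line.startswith("ATOM"):
            continue  # skip HETATM (e.g. water), headers, and other records
        atoms.append(Atom(
            name=line[12:16].strip(),          # columns 13-16
            residue=line[17:20].strip(),       # columns 18-20
            chain=line[21],                    # column 22
            residue_number=int(line[22:26]),   # columns 23-26
            x=float(line[30:38]),              # columns 31-38
            y=float(line[38:46]),              # columns 39-46
            z=float(line[46:54]),              # columns 47-54
        ))
    return atoms


if __name__ == "__main__":
    atoms = parse_atoms(SAMPLE_LINES)
    chains = sorted({atom.chain for atom in atoms})
    print(len(atoms), "atoms in chains", chains)
```

Slicing by column rather than splitting on whitespace matters here, because adjacent fields in real files can run together without a separating space.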

Visualizing Protein Structures

A Few Molecular Visualization Programs

  • Visual Molecular Dynamics (VMD) was originally developed for viewing molecular simulation trajectories. It is a very powerful, full-featured, and customizable molecular viewing package. Customization is available using Tcl/Tk scripting. Information on Tcl/Tk scripting can be found at this Tcl/Tk website.
  • PyMOL is an open-source molecular viewer that can be used to generate professional-looking images. PyMOL is highly customizable through the Python scripting language.
  • Protein Explorer is an easy-to-use, web browser-based visualization tool. Protein Explorer is built using the MDL Chime browser plugin, which in turn is based on the RasMol viewer. Because Chime only works under Windows and Macintosh OS, the use of Protein Explorer is restricted to those platforms.
  • Jmol is a Java-based molecular viewer. In applet form, it can be downloaded on-the-fly to view structures from the web. A stand-alone version also exists, which can be used independently of a web browser.
  • Chimera is a powerful visualizer and analysis tool that can be comfortably used with very large molecular complexes. It can also produce very high-quality images for use in presentations and publications.

Visualizing HLA-AW with VMD

What follows is a very brief introduction to what can be done with VMD. Only the most basic viewing functionality will be discussed. For a complete description of the capabilities of VMD and how to use them, please refer to the VMD web site.

In this section, a human leukocyte antigen, HLA-AW (PDB structure ID 2HLA), will be shown under various rendering methods in VMD. This section is intended to convey, first, a general idea of the types of visual representations that are available for protein structures, and second, what information is and is not conveyed by each representation.

VMD allows the user to load and view molecule description files in a wide variety of common formats, including trajectory files with multiple structures of the same molecule, such as might be generated by a simulation. Once the molecules are loaded, the way each molecule is rendered may be controlled using the Graphical Representations menu:

Figure 5: The built-in rendering options of VMD.
(a) VMD Graphical Representations menu (vmd_graphical_reps_interface.JPG). This menu allows the user to control in detail how each molecule is rendered.
(b) VMD atom coloring methods (vmd_color_methods.JPG). Coloring schemes to highlight features of interest.
(c) VMD molecule drawing methods (vmd_representations.JPG). Rendering methods in VMD. Which one to use depends on the features to highlight.

Molecules may be displayed by various rendering modes:

Figure 6: In this representation, each line represents a bond between two atoms. The color of each half-bond corresponds to the element of the atom at the corresponding end of the bond (red for oxygen, blue for nitrogen, yellow for sulfur, and teal for carbon). Line representation gives a clear idea of the molecule's connectivity, but for large molecules it can be difficult to isolate protein sub-structures.
HLA-AW. Drawing method: LINES. Coloring method: NAME (VMD_MHC_1_lines-type.JPG)

Figure 7: Here each atom is represented by a sphere whose radius is the van der Waals radius of the atom. The van der Waals radius is half the separation of unbonded atoms packed as tightly as possible, and provides a rough notion of a collision radius, although it is not a firm barrier. This representation of the molecule gives a rough sense of its shape, and is sometimes called a space-filling model.
HLA-AW. Drawing method: VDW. Coloring method: NAME (VMD_MHC_1_VDW-type.JPG)

Figure 8: This rendering is the same as in the previous figure, except that now the atoms are colored based on which polypeptide chain they belong to. HLA-AW consists of two chains: the alpha chain (blue), which folds into three domains, and the smaller β2 microglobulin (red), which is a component of a whole class of HLA proteins. Coloring by chain allows an inspection of how the polypeptide subunits come together to form the whole quaternary structure of the protein. The black balls are water molecules near the surface of the protein that always appear in the same place in crystal structures, and may therefore be considered part of the structure for some applications.
HLA-AW. Drawing method: VDW. Coloring method: CHAIN (VMD_MHC_1_VDW-chain.JPG)

Figure 9: The Surf drawing mode renders a surface swept out by a sphere of some set size skimming the protein. Usually, this size is approximately that of a water molecule, in which case the rendered surface is very similar to the solvent-accessible surface. Note that it is impossible to deduce the connectivity of the atoms from this image or from the space-filling image in the previous figure. Overall shape, rather than connectivity, is the information conveyed by these representations. Hence, both backbone-based and surface-based renderings are necessary to fully understand a protein's structure.
HLA-AW. Drawing method: SURF. Coloring method: CHAIN (VMD_MHC_1_surf-chain.JPG)

Figure 10: Here the protein has been rotated approximately 90 degrees toward the viewer, so that, compared to the previous image, we are looking down from above. The deep groove running from the top left to lower right is the binding pocket of the protein.
HLA-AW. Drawing method: SURF. Coloring method: CHAIN (VMD_MHC_1_surf-chain_tilted.JPG)

Figure 11: Cartoon rendering places an emphasis on secondary structure. Beta sheets appear as flattened arrows, and alpha helices appear as cylinders. These are common conventions in representing protein secondary structure. By examining this image, we can see that the walls of the binding pocket observed in the previous figure consist of alpha helices, and the floor is an anti-parallel beta sheet. In anti-parallel beta sheets, adjacent strands run in opposite directions (notice that the arrow points alternate in direction). Note that this representation only conveys information about the backbone connectivity of the protein. Side chain atoms are omitted, and therefore the overall shape is only a very coarse approximation.
HLA-AW. Drawing method: CARTOON. Coloring method: CHAIN (VMD_MHC_1_cartoon-chain_tilted.JPG)

Figure 12: Alternative coloring methods can provide additional insight into a protein's structure and function. Here each atom is colored based on whether the side chain of the amino acid residue to which it belongs is acidic (red), basic (blue), polar neutral (green), or apolar (gray). Note that residues on the surface of the protein tend to be hydrophilic (attracted to water, in red, blue, and green), whereas residues closer to the core of the protein tend to be hydrophobic (greasy or water repellent, in gray). This is characteristic of proteins that exist in aqueous solution in nature. Their native structure is stabilized by a tendency for the hydrophilic residues to interact with the solvent water molecules, while the hydrophobic residues are driven together away from the solvent. Clusters of hydrophobic residues on the surface often indicate a location that is usually protected from solvent in the natural state, either by interaction with another molecule or by part of the protein itself.
HLA-AW. Drawing method: SURF. Coloring method: RESTYPE (VMD_MHC_1_surf-restype.JPG)
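
The residue-type coloring described in the caption above amounts to a lookup from residue name to chemical class to color. The sketch below uses one common textbook grouping of the 20 amino acids. Several assignments (for example glycine, cysteine, tyrosine, and histidine) are borderline and differ between sources, so this table is an assumption and is not necessarily the one VMD uses internally. The names are hypothetical.

```python
from typing import Iterable

# Assumed grouping of the 20 common amino acids by side-chain chemistry.
RESIDUE_CLASS = {
    "ASP": "acidic", "GLU": "acidic",
    "LYS": "basic", "ARG": "basic", "HIS": "basic",
    "SER": "polar", "THR": "polar", "ASN": "polar", "GLN": "polar",
    "TYR": "polar", "CYS": "polar", "GLY": "polar",
    "ALA": "apolar", "VAL": "apolar", "LEU": "apolar", "ILE": "apolar",
    "MET": "apolar", "PHE": "apolar", "TRP": "apolar", "PRO": "apolar",
}

# Colors as described in the caption above.
CLASS_COLOR = {"acidic": "red", "basic": "blue", "polar": "green", "apolar": "gray"}


def color_for(residue_name: str) -> str:
    """Return the display color for a three-letter residue name."""
    return CLASS_COLOR[RESIDUE_CLASS[residue_name.upper()]]


def hydrophilic_fraction(residue_names: Iterable[str]) -> float:
    """Fraction of residues that are acidic, basic, or polar (i.e. not apolar)."""
    names = [name.upper() for name in residue_names]
    hydrophilic = sum(1 for name in names if RESIDUE_CLASS[name] != "apolar")
    return hydrophilic / len(names)


if __name__ == "__main__":
    surface_patch = ["ASP", "LYS", "SER", "LEU"]  # hypothetical residue list
    print([color_for(name) for name in surface_patch])
    print(hydrophilic_fraction(surface_patch))
```

Comparing the hydrophilic fraction of surface residues with that of buried residues is one simple way to quantify the trend that the coloring makes visible.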

Visualizing HLA-AW with Protein Explorer

Protein Explorer is designed as a user-friendly but fairly full-featured visualizer. It is not as scriptable or as powerful as some other visualizers such as VMD and PyMol, but it is one of the quickest and easiest to get started with. It is used through a web browser, either by accessing it through the Protein Explorer website (via the Quick-Start Protein Explorer link), or as an offline version, downloadable from this page. Both versions require the MDL Chime molecular viewing plugin, which you can download from here (registration required).

As with VMD above, a human leukocyte antigen, HLA-AW (PDB structure ID 2HLA), will be shown in various renditions.

Upon opening, Protein Explorer will load a default molecule and display it (this feature may be disabled via a setting under "preferences" in the lower left frame):

Figure 13: The interface contains three areas. The frame on the right contains the rendering window, where the molecule is displayed. The lower left frame contains an input box for text commands and a text box that displays general text output from the program: what commands have been executed, what the program is currently doing, etc. The top left frame generally contains the user interface in the form of buttons and links. Its exact contents vary with use.
Protein Explorer at Startup (protein_explorer_interface.JPG)

Clicking on the "PE Site Map" link pops up a window containing Protein Explorer's top-level menu:

Figure 14: Each option contains a helpful tooltip which can be seen by hovering the mouse cursor over it. "New Molecule" allows the user to load a molecule either directly from the PDB or from the local filesystem. "Reset Session" returns to the default view and rendering style, which can be a useful shortcut. "Quick Views" opens up a menu from which the user can select how the molecule is rendered.
Protein Explorer Site Map Window (protein_explorer_site_map.JPG)

Once a molecule is loaded, the "Quick Views" menu allows the user to control how it is displayed:

Figure 15: The "SELECT" pulldown menu allows the user to pick a group of atoms based on their properties, their location, the structural elements in which they are involved, or by directly clicking them. The "DISPLAY" pulldown menu then allows the user to determine the style in which the selected atoms are rendered. Most of the styles available through VMD are also available in Protein Explorer. The "COLOR" pulldown menu allows the user to determine how the atoms are colored. Options include coloring by secondary structure elements, atom type, subunit (chain), a spectrum from end to end of the protein, and by properties such as charge and polarity.
Protein Explorer QuickViews Interface (protein_explorer_quick_views.JPG)

Figure 16: This rendering mode shows the protein backbone (no side chains) through the alpha carbons of each amino acid residue. It gives the user a sense of how the chains fold to form the structure, but not its full shape, since all side chain atoms have been removed. The yellow bars are disulfide bonds, which are covalent bonds that lock distant parts of the chain together to help maintain the structure.
Protein Explorer: HLA-AW Backbone Rendering (protein_explorer_2HLA.JPG)

Figure 17: Cartoon rendering works as for VMD. As in the backbone rendering above, side chains are ignored, and the protein backbone is rendered as a smoothly curving tube. Beta sheets appear as flattened arrows, and alpha helices appear as spiraling ribbons.
Protein Explorer: HLA-AW Cartoon Style (protein_explorer_2HLA_cartoon.JPG)

Figure 18: More advanced rendering methods are available through the Advanced Explorer Menu.
Protein Explorer Advanced Explorer Menu (protein_explorer_advanced.JPG)

Figure 19: The Surfaces menu allows the user to display the surface of the protein. Several variables are available, including the radius of the probe used to define the surface, as well as several methods of coloring the surface based on chemical and physical properties.
Protein Explorer Surfaces Menu (protein_explorer_surfaces.JPG)

Figure 20: This rendering style shows the surface of the protein accessible to water. This image is tilted 90 degrees toward the viewer from the previous images.
Protein Explorer: HLA-AW Surface Rendering (protein_explorer_2HLA_surface.JPG)

Figure 21: By setting the surface to be transparent, it is possible to superimpose another rendering style over it, and see how it fits into the surface. This can convey an idea of how the fold of the chain relates to the overall three-dimensional shape of the protein.
Protein Explorer: HLA-AW Superimposed Images (protein_explorer_2HLA_surface_cartoon.JPG)

Recommended Reading and Resources:

  • A detailed introduction to protein structure and function can be found in most introductory biochemistry textbooks. For example, Lehninger Principles of Biochemistry, 4th Edition, by D. L. Nelson and M. Cox (sections 2.1, 3.1-3.5, 4.1-4.4, 5.1-5.3).
  • The Structures of Life at the NIH web site. This site is an introduction to protein structure, structure determination methods, drug design techniques, and other applications of structural biology.
  • Protein Structure and Function, by Gregory A. Petsko and Dagmar Ringe. This book provides an overview of the basic biochemistry of structural biology. Topics covered include protein structure, mechanisms of protein function, regulation of protein function, and case studies of the kinds of problems that arise in structural biology.
  • The MIT Biology Hypertextbook. This online textbook provides introductory level coverage of the field of microbiology. It includes cell biology, protein biochemistry, genetics, metabolism, and molecular biology. New content is typically added over time.
  • Artificial Intelligence and Molecular Biology. This online book includes chapters on classifying protein structures, predicting protein structure, and analyzing crystallographic and NMR data to determine protein structure. Of particular interest to readers of the current page who have a computer science background but need to understand more of the basic underlying biology is Chapter 1: Molecular Biology for Computer Scientists.
