Summary: This module contains motivational and biochemical background material for a computer scientist beginning to learn about computational structural biology.
Proteins are the molecular workhorses of all known biological systems. Among other functions, they are the motors that cause muscle contraction, the catalysts that drive life-sustaining chemical processes, and the molecules that hold cells together to form tissues and organs.
The following is a list of a few of the diverse biological processes mediated by proteins:
The estimated number of genes in the human genome has changed dramatically since the genome was first annotated (the latest gene count estimates can be found in this Wikipedia article on the human genome). Each gene encodes one or more distinct proteins. The total number of distinct proteins in the human body is larger than the number of genes because of alternative splicing. Of those, only a small fraction have been isolated and studied to the point that their purpose and mechanism of activity are well understood. If the functions of all proteins and the relationships among them were fully understood, we would most likely have a much better understanding of how our bodies work and what goes wrong in diseases such as cancer, amyotrophic lateral sclerosis, Parkinson's, heart disease and many others. As a result, protein science is a very active field. As the field has progressed, computer-aided modeling and simulation of proteins have found their place among the methods available to researchers.
An amino acid is a simple organic molecule consisting of a basic (hydrogen-accepting) amine group bound to an acidic (hydrogen-donating) carboxyl group via a single intermediate carbon atom:
| An α-amino acid |
|---|
| A polypeptide |
|---|
The primary structure of a protein is easily obtainable from its corresponding gene sequence, as well as by experimental manipulation. Unfortunately, the primary structure is only indirectly related to the protein's function. In order to work properly, a protein must fold to form a specific three-dimensional shape, called its native structure or native conformation. The three-dimensional structure of a protein is usually understood in a hierarchical manner. Secondary structure refers to folding in a small part of the protein that forms a characteristic shape. The most common secondary structure elements are α-helices and β-sheets, one or both of which are present in almost all natural proteins.
| Secondary Structure: α-helix |
|---|
| Secondary Structure: β-sheet |
|---|
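The statement that the primary structure is easily obtainable from the gene sequence can be made concrete with a short sketch. The code below translates a coding DNA sequence into one-letter amino acid codes using the standard genetic code. It assumes the input is already a spliced coding sequence read from its first base; the function name `translate` and the toy sequence are invented for this illustration.

```python
from itertools import product

# Standard genetic code (NCBI translation table 1), with bases in TCAG order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(coding_dna: str) -> str:
    """Translate a coding DNA sequence into one-letter amino acid codes.

    Translation starts at the first base and ends at the first stop codon,
    or when fewer than three bases remain.
    """
    seq = coding_dna.upper()
    protein = []
    for i in range(0, len(seq) - 2, 3):
        aa = CODON_TABLE[seq[i:i + 3]]  # raises KeyError on non-ACGT input
        if aa == "*":  # stop codon
            break
        protein.append(aa)
    return "".join(protein)


# Hypothetical toy sequence, invented for this example.
print(translate("ATGGCCAAGTGGTAA"))  # MAKW
```

The output is only the primary structure. Nothing in this lookup says how the chain folds, which is why the rest of this module is concerned with three-dimensional structure.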
A structure of a protein is a three-dimensional arrangement of the atoms such that the integrity of the molecule (its connectivity) is maintained. The goal of a protein structure determination experiment is to find a set of three-dimensional (x, y, z) coordinates for each atom of the molecule in some natural state. Of particular interest is the native structure, that is, the structure assumed by the protein under its biological conditions, as well as structures assumed by the protein when in the process of interacting with other molecules. Brief sketches of the major structure determination methods follow:
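For a computer scientist, it may help to see this definition as a data structure: a list of atoms with (x, y, z) coordinates plus a list of bonded pairs. The sketch below is one minimal way to represent this, not a standard library type. The coordinates are invented, and the 2.0 Å bond-length cutoff used to test whether connectivity is maintained is an assumption chosen for illustration.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Atom:
    name: str
    x: float  # coordinates in Angstroms
    y: float
    z: float


def distance(a: Atom, b: Atom) -> float:
    """Euclidean distance between two atoms, in Angstroms."""
    return math.dist((a.x, a.y, a.z), (b.x, b.y, b.z))


def connectivity_intact(atoms, bonds, max_bond_length=2.0):
    """Return True if every bonded pair lies within max_bond_length.

    The 2.0 Angstrom default is an illustrative assumption, not a
    universal chemical constant.
    """
    return all(
        distance(atoms[i], atoms[j]) <= max_bond_length for i, j in bonds
    )


# Invented coordinates for three backbone atoms of one residue.
atoms = [
    Atom("N", 0.00, 0.00, 0.00),
    Atom("CA", 1.46, 0.00, 0.00),
    Atom("C", 2.00, 1.40, 0.00),
]
bonds = [(0, 1), (1, 2)]  # N-CA and CA-C, as index pairs into atoms

print(connectivity_intact(atoms, bonds))
```

Moving an atom changes the structure but not the bond list, so a conformation is valid only while each bonded pair stays within a chemically reasonable distance.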
The most commonly used and usually highest-resolution method of structure determination is x-ray crystallography. To obtain structures by this method, laboratory biochemists obtain a very pure, crystalline sample of a protein. X-rays are then passed through the sample, in which they are diffracted by the electrons of each atom of the protein. The diffraction pattern is recorded, and can be used to reconstruct the three-dimensional pattern of electron density, and therefore, within some error, the location of each atom. A high-resolution crystal structure has a resolution on the order of 1 to 2 Angstroms (Å). One Angstrom is roughly the diameter of a hydrogen atom (10^-10 meter, or one hundred-millionth of a centimeter).
Unlike other structure determination methods, x-ray crystallography places no fundamental limit on the size of the molecule or complex to be studied. However, in order for the method to work, a pure, crystalline sample of the protein must be obtained. For many proteins, including many membrane-bound receptors, this is not possible. In addition, a single x-ray diffraction experiment provides only static information - that is, it provides only information about the native structure of the protein under the particular experimental conditions used. As we will see later, proteins are often flexible, dynamic objects when in their natural state in solution, so a single structure, while useful, may not tell the full story. More information on x-ray crystallography is available at Crystallography 101 and on Wikipedia.
Nuclear Magnetic Resonance (NMR) spectroscopy has recently come into its own as a protein structure determination method. In an NMR experiment, a very strong magnetic field is transiently applied to a sample of the protein being studied, forcing any magnetic atomic nuclei into alignment. The signal given off by a nucleus as it returns to an unaligned state is characteristic of its chemical environment. Information about the atoms within two chemical bonds of the resonating nucleus can be deduced, and, more importantly, information about which atoms are spatially near each other can also be found. The latter information leads to a large system of distance constraints between the atoms of the protein, which can then be solved to find a three-dimensional structure. Resolution of NMR structures is variable and depends strongly on the flexibility of the protein. Because NMR is performed on proteins in solution, they are free to undergo spatial rearrangements, so for flexible parts of the protein, there may be more than one detectable structure. In fact, NMR structures are generally reported as ensembles of 20-50 distinct structures. This makes NMR the only structure determination technique suited to elucidating the behavior of intrinsically unstructured proteins, that is, proteins that lack a well-defined tertiary structure. The reported ensemble may also provide insight into the dynamics of the protein, that is, the ways in which it tends to move.
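To give a feel for what a system of distance constraints looks like computationally, the sketch below scores a candidate structure against a list of constraints. It treats each constraint as a simple upper bound on the distance between two atoms, which is a simplifying assumption, and it only checks a given set of coordinates rather than searching for one. The function name, coordinates, and bounds are all invented for illustration.

```python
import math


def restraint_violations(coords, restraints):
    """List the constraints that a candidate structure fails to satisfy.

    coords:     list of (x, y, z) tuples, one per atom, in Angstroms
    restraints: list of (i, j, upper_bound) tuples, where atoms i and j
                are expected to lie within upper_bound of each other
    Returns a list of (i, j, excess_distance) for each violated constraint.
    """
    violations = []
    for i, j, upper in restraints:
        d = math.dist(coords[i], coords[j])
        if d > upper:
            violations.append((i, j, d - upper))
    return violations


# Invented three-atom example.
coords = [(0.0, 0.0, 0.0), (3.0, 0.0, 0.0), (3.0, 4.0, 0.0)]
restraints = [(0, 1, 5.0), (0, 2, 4.5), (1, 2, 5.0)]

print(restraint_violations(coords, restraints))  # [(0, 2, 0.5)]
```

A structure calculation can be thought of as a search for coordinates that make this list empty, or as small as possible. Because many different coordinate sets may satisfy the same constraints, it is natural for the result to be an ensemble rather than a single structure.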
NMR structure determination is generally limited to proteins smaller than 25-30 kilodaltons (kDa), because the signals from different atoms start to overlap and become difficult to resolve in that range. Additionally, the proteins must be soluble in concentrations of 0.2-0.5 mM without aggregation or precipitation. For more information on how NMR is used to find molecular structures, please see NMR Basics and The World of NMR: Magnets, Radio Waves, and Detective Work at the National Institutes of Health's The Structures of Life website.
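These limits are easier to picture after a little arithmetic. The sketch below converts a molecular mass into an approximate chain length and a molar concentration into a mass concentration. It assumes an average mass of about 110 daltons per amino acid residue, a common rule of thumb that is not taken from this module, and the function names are invented.

```python
AVG_RESIDUE_MASS_DA = 110.0  # assumption: rough average mass of one residue


def approx_residue_count(mass_kda: float) -> int:
    """Estimate the number of residues in a protein of the given mass."""
    return round(mass_kda * 1000.0 / AVG_RESIDUE_MASS_DA)


def mass_concentration_mg_per_ml(mass_kda: float, conc_mm: float) -> float:
    """Convert a molar concentration (mM) to mg/mL.

    (mmol/L) * (kDa, i.e. kg/mol) gives g/L, which equals mg/mL.
    """
    return conc_mm * mass_kda


for mass in (25.0, 30.0):
    print(mass, "kDa is about", approx_residue_count(mass), "residues")

print(mass_concentration_mg_per_ml(25.0, 0.5), "mg/mL at 0.5 mM")
```

Under this assumption, the 25-30 kDa limit corresponds to chains of roughly 230 to 270 residues, and a 25 kDa protein at 0.5 mM amounts to 12.5 mg of protein per milliliter, a high concentration at which many proteins aggregate.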
Electron diffraction (ED) works on the same principle as x-ray crystallography, but instead of x-rays, electrons are used to probe the structure. Because of difficulties in obtaining and interpreting electron diffraction data, it is rarely used for protein structure determination. Nevertheless, ED structures do exist in the PDB. For more on ED, see this Wikipedia article.
Large macromolecular complexes and molecular machines present a particular challenge in structure determination. These objects are generally too large to be crystallized and too complex to solve by NMR, so determining their structure usually requires high-resolution microscopy combined with computational refinement and analysis. The main techniques used are cryo-electron microscopy (Cryo-EM) and standard light microscopy.
Most of the protein structures discovered to date can be found in a large protein repository called the RCSB Protein Data Bank (PDB). The PDB is a public domain repository that contains experimentally determined three-dimensional structures of proteins. The majority of the structures in the PDB have been determined by x-ray crystallography, but the number of structures determined using NMR methods has been increasing as efficient computational techniques to derive structures from NMR data have been developed. A few electron diffraction structures are also available. The PDB was originally established at Brookhaven National Laboratory in October 1971, with 7 structures. Currently, the database is maintained by Rutgers University, the State University of New Jersey, the San Diego Supercomputer Center at the University of California, San Diego, and the National Institute of Standards and Technology. The current number of proteins (and/or nucleic acids) in the PDB is displayed at the top-right corner of the main PDB page. The imaging method statistics of these structures (i.e., which methods were used for what fraction of the structures), as well as other classifications, can be found here. The European Bioinformatics Institute Macromolecular Structure Database group (UK) and the Institute for Protein Research at Osaka University (Japan) are international contributors to the contents of the PDB.
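Structures from the PDB are commonly distributed as text files in which each atom occupies one fixed-width `ATOM` (or `HETATM`) line. As a rough sketch of how a program might read the coordinates, the code below slices one such line by column position. The record class, function name, and sample values are invented for this example, and only a subset of the fields is read.

```python
from dataclasses import dataclass


@dataclass
class AtomRecord:
    serial: int
    name: str
    res_name: str
    chain_id: str
    res_seq: int
    x: float  # coordinates in Angstroms
    y: float
    z: float


def parse_atom_line(line: str) -> AtomRecord:
    """Parse one fixed-width ATOM/HETATM line (0-based Python slices)."""
    return AtomRecord(
        serial=int(line[6:11]),
        name=line[12:16].strip(),
        res_name=line[17:20].strip(),
        chain_id=line[21],
        res_seq=int(line[22:26]),
        x=float(line[30:38]),
        y=float(line[38:46]),
        z=float(line[46:54]),
    )


# Invented sample line, assembled field by field to show the column layout.
SAMPLE_LINE = (
    "ATOM  "      # columns 1-6: record type
    "    1"       # columns 7-11: atom serial number
    " "           # column 12: unused
    " N  "        # columns 13-16: atom name
    " "           # column 17: alternate location indicator
    "MET"         # columns 18-20: residue name
    " "           # column 21: unused
    "A"           # column 22: chain identifier
    "   1"        # columns 23-26: residue sequence number
    "    "        # columns 27-30: insertion code and padding
    "  27.340"    # columns 31-38: x
    "  24.430"    # columns 39-46: y
    "   2.614"    # columns 47-54: z
)

print(parse_atom_line(SAMPLE_LINE))
```

Slicing by column rather than splitting on whitespace matters here, because adjacent fields in real files can run together without a separating space.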
Numerous tools are available for visualizing the structures stored in the PDB and other repositories. Most such tools allow a detailed examination of the molecule in a variety of rendering modes. For example, sometimes it may be useful to have a detailed image of the surface of the molecule as experienced by a molecule of water. For other purposes, a simple, cartoonish representation of the major structural features may be sufficient.
What follows will be a very brief introduction to what can be done with VMD. Only the most basic viewing functionality will be discussed. For a complete description of the capabilities of VMD and how to use them, please refer to the VMD web site.
In this section, a human leukocyte-associated antigen, HLA-AW (PDB structure ID 2HLA), will be shown under various rendering methods in VMD. This section is intended to convey, first, a general idea of the types of visual representations that are available for protein structures, and second, what information is and is not conveyed by each representation.
VMD allows the user to load and view molecule description files in a wide variety of common formats, including trajectory files with multiple structures of the same molecule, such as might be generated by a simulation. Once the molecules are loaded, the way each molecule is rendered may be controlled using the Graphical Representations menu:
Molecules may be displayed by various rendering modes:
| HLA-AW. Drawing method: LINES. Coloring method: NAME |
|---|
| HLA-AW. Drawing method: VDW. Coloring method: NAME |
|---|
| HLA-AW. Drawing method: VDW. Coloring method: CHAIN |
|---|
| HLA-AW. Drawing method: SURF. Coloring method: CHAIN |
|---|
| HLA-AW. Drawing method: SURF. Coloring method: CHAIN |
|---|
| HLA-AW. Drawing method: CARTOON. Coloring method: CHAIN |
|---|
| HLA-AW. Drawing method: SURF. Coloring method: RESTYPE |
|---|
Protein Explorer is designed as a user-friendly but fairly full-featured visualizer. It is not as scriptable or as powerful as some other visualizers such as VMD and PyMOL, but it is one of the quickest and easiest to get started with. It is used through a web browser, either by accessing it through the Protein Explorer website (via the Quick-Start Protein Explorer link), or as an offline version, downloadable from this page. Both versions require the MDL Chime molecular viewing plugin, which you can download from here (registration required).
As with VMD above, a human leukocyte-associated antigen, HLA-AW (PDB structure ID 2HLA), will be shown in various renditions.
Upon opening, Protein Explorer will load a default molecule and display it (this feature may be disabled via a setting under "preferences" in the lower left frame):
| Protein Explorer at Startup |
|---|
Clicking on the "PE Site Map" link pops up a window containing Protein Explorer's top-level menu:
| Protein Explorer Site Map Window |
|---|
Once a molecule is loaded, the "Quick Views" menu allows the user to control how it is displayed:
| Protein Explorer QuickViews Interface |
|---|
| Protein Explorer: HLA-AW Backbone Rendering |
|---|
| Protein Explorer: HLA-AW Cartoon Style |
|---|
| Protein Explorer Advanced Explorer Menu |
|---|
| Protein Explorer Surfaces Menu |
|---|
| Protein Explorer: HLA-AW Surface Rendering |
|---|
| Protein Explorer: HLA-AW Superimposed Images |
|---|