Summary: This module discusses how to represent proteins in terms of the Cartesian coordinates of their atoms and in terms of the angle values of their rotatable bonds. It then discusses Forward Kinematics, which allows the computation of Cartesian coordinates when the torsional angle values are known.
To build efficient, maintainable software that manipulates protein structures, one must first choose a suitable way to store them. Depending on the application, different representations have different advantages and disadvantages from a software perspective. For example, in a simple visualization program, the Cartesian (x,y,z) coordinates of each atom are useful and simple to render on the screen. However, if the program is to manipulate bond angles and bond lengths, a representation based on the internal degrees of freedom (see below) may be more appropriate. Some applications may even need to store more than one representation at a time; for example, a simulation program that computes a protein's potential energy, which is a function of both Cartesian and internal coordinates, would benefit from keeping both representations at the same time.
The structure of a protein is the set of atoms it contains, and the bonds that join them, that is, its inherent connectivity. A particular geometric shape of a protein (that is, the spatial arrangement of the atoms in the molecule) is called its conformation. Thus, a given protein structure can have many different conformations. Next, we discuss the two most common ways to model protein structures and conformations for software applications: Cartesian and Dihedral representations.
The most essential information for modeling a protein structure is the relative position of each atom, given as (x,y,z) Cartesian coordinates. Experimental methods such as X-ray crystallography, nuclear magnetic resonance (NMR), and cryogenic electron microscopy (cryo-EM) are used to obtain relative atom positions from protein crystals and solutions. This is precisely the information provided by Protein Data Bank (PDB) format coordinate files:
| First 19 atom coordinate records of PDB entry 2HLA |
|---|
The degrees of freedom of a system are a set of parameters that may be varied independently to define the state of the system. For example, the location of a point in the Cartesian 2D plane may be defined as a displacement along the x-axis and a displacement along the y-axis, given as an (x,y) pair. It may also be given as a distance r from the origin and a rotation about the origin by θ degrees, given as an (r,θ) pair. In either case, a point moving freely in a plane has exactly two degrees of freedom.
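The equivalence of the two descriptions can be checked with a short sketch. The function names below are illustrative and not part of the module; the point is only that two numbers go in and two numbers come out in either direction.

```python
import math

def polar_to_cartesian(r, theta_deg):
    """Convert an (r, theta) pair, theta in degrees, to an (x, y) pair."""
    theta = math.radians(theta_deg)
    return (r * math.cos(theta), r * math.sin(theta))

def cartesian_to_polar(x, y):
    """Convert an (x, y) pair to an (r, theta) pair, theta in degrees."""
    return (math.hypot(x, y), math.degrees(math.atan2(y, x)))

# Round trip: the same point described with either pair of parameters.
x, y = polar_to_cartesian(2.0, 90.0)
r, theta = cartesian_to_polar(x, y)
```

Either pair fully determines the point, which is why the number of degrees of freedom does not depend on the parameterization chosen.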
As mentioned before, the spatial arrangement of the atoms in a protein constitutes its conformation. In the PDB coordinate file above, we can see that one obvious way to define a protein conformation is by giving x, y, and z coordinates for each atom, relative to some arbitrary origin. These are not independent degrees of freedom, however, because atoms within a molecule are not allowed to leave the vicinity of their neighboring atoms (if no chemical reaction takes place). Pairs of atoms bonded to each other, for example, are constrained to remain close, so moving one atom causes others connected to it to move in a dependent fashion. In kinematics terminology, this means that the true (effective, or independent) number of degrees of freedom is much smaller than the number of input parameters, which is one (x,y,z) tuple for each atom. The remainder of this section defines a set of independent degrees of freedom that more readily model how proteins and other organic molecules can actually move.
The atoms in proteins are connected to one another through covalent bonds. Each pair of bonded atoms has a preferred separation distance called the bond length. The bond length can vary slightly with a spring-like vibration, and is thus a degree of freedom, but realistic variations in bond length are so small that most simulations assume it is fixed for any pair of atoms. This is a very common assumption in the literature and reduces the effective degrees of freedom of a protein; the remainder of this module makes this assumption.
Although bond lengths will not be allowed to vary in this work, the presence of bonds is still important because it allows us to represent the connectivity of the protein as an undirected graph data structure, where the atoms are the nodes and the bonds between them are undirected edges. In some cases, it is helpful to artificially break any cycles in the graph, and choose an atom from the interior as an anchor atom. The graph can then be treated as a tree data structure, with the anchor atom as the root.
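As a minimal sketch of this idea, the following code stores bonds as undirected edges and uses a breadth-first traversal from the anchor atom to obtain a tree; bonds that would close a cycle are simply skipped. The function name `build_tree` and the toy atom labels (the heavy atoms of a proline-like residue, whose side chain forms a ring with the backbone nitrogen) are assumptions made for illustration.

```python
from collections import deque

def build_tree(bonds, anchor):
    """Return a {atom: parent} map describing a tree rooted at the anchor atom."""
    adjacency = {}
    for a, b in bonds:
        # Each bond is an undirected edge, so record it in both directions.
        adjacency.setdefault(a, []).append(b)
        adjacency.setdefault(b, []).append(a)
    parent = {anchor: None}
    queue = deque([anchor])
    while queue:
        atom = queue.popleft()
        for neighbor in adjacency[atom]:
            if neighbor not in parent:
                # Atoms already reached are skipped, which breaks any cycle.
                parent[neighbor] = atom
                queue.append(neighbor)
    return parent

# Illustrative connectivity only: a ring N-CA-CB-CG-CD-N plus the C=O group.
bonds = [("N", "CA"), ("CA", "CB"), ("CB", "CG"), ("CG", "CD"),
         ("CD", "N"), ("CA", "C"), ("C", "O")]
tree = build_tree(bonds, anchor="CA")
```

In this example the ring is opened at the CG-CD bond, so every atom has exactly one parent and the anchor atom is the root.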
| A Protein as a Graph Data Structure |
|---|
For two bonded atoms, the bond length is an independent degree of freedom. A set of three atoms bonded in sequence defines another degree of freedom: the angle between the two adjacent bonds. This is, appropriately, referred to as the bond angle. The bond angle can be calculated as the angle between the two vectors corresponding to the bonds from the central atom to each of its neighbors. As a reminder, the angle between two vectors is the inverse cosine of the ratio of the dot product of the vectors to the product of their lengths. Like bond lengths, bond angles tend to be characteristic of the atom types involved and, with few exceptions, vary little. Thus, as with bond lengths, this module treats all bond angles as fixed (again, this is a common assumption).
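The bond angle calculation described above can be sketched as follows. The function name and the example coordinates are illustrative assumptions; the clamping step only guards against round-off pushing the cosine slightly outside [-1, 1].

```python
import math

def bond_angle(a, b, c):
    """Angle in degrees at the central atom b, between bonds b-a and b-c."""
    u = [a[k] - b[k] for k in range(3)]
    v = [c[k] - b[k] for k in range(3)]
    dot = sum(ui * vi for ui, vi in zip(u, v))
    norm_u = math.sqrt(sum(ui * ui for ui in u))
    norm_v = math.sqrt(sum(vi * vi for vi in v))
    # Clamp to the valid domain of acos to absorb floating-point round-off.
    cos_angle = max(-1.0, min(1.0, dot / (norm_u * norm_v)))
    return math.degrees(math.acos(cos_angle))

# Made-up coordinates: two perpendicular bonds meeting at the origin.
angle = bond_angle((1.0, 0.0, 0.0), (0.0, 0.0, 0.0), (0.0, 1.0, 0.0))
```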
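In most organic molecules, including proteins, the most important internal degrees of freedom are rotations about dihedral (torsional) angles. A dihedral angle is defined by four consecutively bonded atoms. Given four consecutive atoms A, B, C, and D, the dihedral angle is the angle between the plane containing A, B, and C and the plane containing B, C, and D, which corresponds to a rotation about the central bond B-C.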
| A Dihedral Angle |
|---|
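A dihedral angle can be computed from the Cartesian coordinates of its four atoms. The sketch below uses a common atan2-based formula built from the three bond vectors; the function name and test coordinates are assumptions for illustration, and the sign convention is the one in which the cis arrangement gives 0 degrees.

```python
import math

def _sub(u, v):
    return [u[k] - v[k] for k in range(3)]

def _dot(u, v):
    return sum(u[k] * v[k] for k in range(3))

def _cross(u, v):
    return [u[1] * v[2] - u[2] * v[1],
            u[2] * v[0] - u[0] * v[2],
            u[0] * v[1] - u[1] * v[0]]

def dihedral(p0, p1, p2, p3):
    """Signed dihedral angle in degrees defined by four consecutive atoms."""
    b1, b2, b3 = _sub(p1, p0), _sub(p2, p1), _sub(p3, p2)
    n1 = _cross(b1, b2)   # normal to the plane of the first three atoms
    n2 = _cross(b2, b3)   # normal to the plane of the last three atoms
    y = math.sqrt(_dot(b2, b2)) * _dot(b1, n2)
    x = _dot(n1, n2)
    return math.degrees(math.atan2(y, x))

# Made-up coordinates: the last atom is a quarter turn away from the first.
quarter_turn = dihedral((1.0, 0.0, 0.0), (0.0, 0.0, 0.0), (0.0, 0.0, 1.0), (0.0, 1.0, 1.0))
cis = dihedral((1.0, 0.0, 0.0), (0.0, 0.0, 0.0), (0.0, 0.0, 1.0), (1.0, 0.0, 1.0))
```

Using atan2 rather than an inverse cosine keeps the sign of the angle, so rotations in opposite directions are distinguished.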
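All amino acids share the same core of one nitrogen atom, two carbon atoms, and one oxygen atom. This shared core makes up the backbone of the protein. There are two freely rotatable backbone dihedral angles per amino acid residue in the protein chain: the first, designated φ (phi), is the rotation about the bond between the backbone nitrogen and the alpha carbon, and the second, designated ψ (psi), is the rotation about the bond between the alpha carbon and the carbonyl carbon.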
Every amino acid has 2 backbone dihedrals, but the number of side chain dihedrals varies with the length of the side chain. It ranges from 0, in the case of glycine, whose side chain is a single hydrogen atom, to 5 in the case of arginine.
| Dihedral Angles in Arginine |
|---|
Kinematics is the branch of mechanics that describes how objects move without considering the masses (inertia) and forces involved. You can imagine that varying the dihedral angles will move a protein's atoms relative to each other in space. The problem of computing the new spatial locations of the atoms given a set of dihedral rotations is known as the forward kinematics problem.
The importance of this problem to protein modeling and simulation should be clear: as stated earlier, the only internal degrees of freedom usually considered for a protein are its dihedral angles. Thus, moving a protein is achieved by setting some of its dihedral angles to new values. For some applications, however, such as rendering an image of the protein or computing its energy, the Cartesian (x,y,z) coordinates of each atom are needed. These are obtained by forward kinematics.
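To make the idea concrete, here is a minimal sketch of one forward kinematics step: placing a new atom from a bond length, a bond angle, and a dihedral angle, given the three atoms that precede it. This is a different technique from the rotation matrices used in the rest of this module; it is a standard internal-to-Cartesian placement, sometimes called the natural extension reference frame approach. The function name `place_atom` and the example values are assumptions, and the convention is that a dihedral of 0 degrees gives the cis arrangement.

```python
import math

def _sub(u, v):
    return [u[k] - v[k] for k in range(3)]

def _cross(u, v):
    return [u[1] * v[2] - u[2] * v[1],
            u[2] * v[0] - u[0] * v[2],
            u[0] * v[1] - u[1] * v[0]]

def _unit(u):
    n = math.sqrt(sum(x * x for x in u))
    return [x / n for x in u]

def place_atom(a, b, c, bond_length, bond_angle_deg, dihedral_deg):
    """Position of an atom d bonded to c, given the preceding atoms a, b, c."""
    theta = math.radians(bond_angle_deg)   # angle b-c-d
    phi = math.radians(dihedral_deg)       # dihedral a-b-c-d
    bc = _unit(_sub(c, b))
    n = _unit(_cross(_sub(b, a), bc))      # normal to the a-b-c plane
    m = _cross(n, bc)                      # completes the local frame at c
    local = (-bond_length * math.cos(theta),
             bond_length * math.sin(theta) * math.cos(phi),
             bond_length * math.sin(theta) * math.sin(phi))
    return tuple(c[k] + local[0] * bc[k] + local[1] * m[k] + local[2] * n[k]
                 for k in range(3))

# Made-up internal coordinates: unit bond length, 90 degree angle and dihedral.
d = place_atom((1.0, 0.0, 0.0), (0.0, 0.0, 0.0), (0.0, 0.0, 1.0), 1.0, 90.0, 90.0)
```

Repeating this step atom by atom along the chain would rebuild all Cartesian coordinates from the internal ones, under the fixed bond length and bond angle assumption made earlier.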
Solving the forward kinematics problem requires some background in linear algebra, specifically in the anatomy and application of transformation matrices. The links provided in this section should give enough mathematical background to understand the rest of this module and eventually write a simple protein manipulation program.
As stated earlier, a common operation when manipulating proteins in silico is to retrieve the Cartesian coordinates of each atom in the protein from our knowledge of its dihedral angles and rotations applied to them. For simplicity, assume we have an anchor atom and we are modeling the protein backbone only, that is, the protein consists of a serial linkage composed of consecutive backbone atoms, as shown in Figure 5.
The simplest way to represent a protein chain is to store the Cartesian (x,y,z) coordinates of each atom at all times. These coordinates are relative to some global coordinate frame whose choice is unimportant, for example the frame in which the atomic positions were determined by X-ray crystallography, which is typically the one read from PDB files. The coordinates can be shifted to a different frame if desired. Common changes are to subtract the center of mass (thus centering the protein at the global origin) or to subtract the position of the anchor atom (thus centering the protein at this atom).
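Both changes of frame amount to subtracting one vector from every atom position. The sketch below assumes coordinates are stored as a list of (x,y,z) tuples; the function names are illustrative, and when no masses are supplied the unweighted centroid is used in place of the true center of mass.

```python
def center_of_mass(coords, masses=None):
    """Mass-weighted mean position; unweighted centroid if masses is None."""
    if masses is None:
        masses = [1.0] * len(coords)
    total = sum(masses)
    return tuple(sum(m * c[k] for m, c in zip(masses, coords)) / total
                 for k in range(3))

def shift(coords, origin):
    """Express every position relative to the given origin."""
    return [tuple(c[k] - origin[k] for k in range(3)) for c in coords]

# Made-up coordinates for two atoms of equal mass.
coords = [(1.0, 2.0, 3.0), (3.0, 2.0, 1.0)]
centered = shift(coords, center_of_mass(coords))   # center at the global origin
anchored = shift(coords, coords[0])                # center at the anchor atom
```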
As discussed earlier, however, the "natural" degrees of freedom for kinematic manipulations are usually the dihedral angles alone. This means that algorithms that operate on dihedral angles will normally need a way to update the Cartesian coordinates whenever dihedral rotations are performed, so that the coordinates reflect the new atomic positions. This can be done easily with rotation matrices, as follows.
When a rotation of θ degrees around bond i is performed, one can think of all atom positions starting at i+2 rotating around the axis defined by bond i, and all other atoms (from anchor to atom i+1 inclusive) remaining stationary. Thus, upon such a rotation, the Cartesian coordinates of the atoms after the bond need to be updated, and their new values are given by:

Where [x,y,z,1] is the position of a generic atom in homogeneous form, [x',y',z',1] is its position after the rotation (T is the transpose operator), and R(i,θ) is a 4x4 matrix that encodes a rotation of θ degrees around an axis coinciding with bond i that passes through atom

In the above formula, T(x) is a translation by the vector x and
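The following sketch builds a matrix of this kind as a product of a translation to the origin, a rotation about the bond direction, and a translation back, and applies it to the atoms after the bond. It assumes zero-based atom indices with bond i joining atoms i and i+1, stores 4x4 matrices as nested lists, and uses Rodrigues' formula for the rotation about an axis through the origin; all function names are illustrative.

```python
import math

def matmul(a, b):
    """Product of two 4x4 matrices stored as nested lists."""
    return [[sum(a[r][k] * b[k][c] for k in range(4)) for c in range(4)]
            for r in range(4)]

def translation(v):
    """Homogeneous translation by the vector v."""
    return [[1.0, 0.0, 0.0, v[0]],
            [0.0, 1.0, 0.0, v[1]],
            [0.0, 0.0, 1.0, v[2]],
            [0.0, 0.0, 0.0, 1.0]]

def axis_rotation(axis, theta_deg):
    """Rotation by theta_deg about an axis through the origin (Rodrigues)."""
    norm = math.sqrt(sum(a * a for a in axis))
    x, y, z = (a / norm for a in axis)
    t = math.radians(theta_deg)
    c, s = math.cos(t), math.sin(t)
    v = 1.0 - c
    return [[c + x * x * v, x * y * v - z * s, x * z * v + y * s, 0.0],
            [y * x * v + z * s, c + y * y * v, y * z * v - x * s, 0.0],
            [z * x * v - y * s, z * y * v + x * s, c + z * z * v, 0.0],
            [0.0, 0.0, 0.0, 1.0]]

def bond_rotation(atoms, i, theta_deg):
    """Rotation about bond i: translate atom i to the origin, rotate, translate back."""
    p = atoms[i]
    axis = [atoms[i + 1][k] - p[k] for k in range(3)]
    to_origin = translation([-a for a in p])
    return matmul(translation(p), matmul(axis_rotation(axis, theta_deg), to_origin))

def transform(m, point):
    """Apply a 4x4 homogeneous matrix to an (x, y, z) point."""
    h = (point[0], point[1], point[2], 1.0)
    return tuple(sum(m[r][k] * h[k] for k in range(4)) for r in range(3))

def rotate_about_bond(atoms, i, theta_deg):
    """Atoms up to i+1 stay fixed; atoms from i+2 onward are rotated."""
    m = bond_rotation(atoms, i, theta_deg)
    return [transform(m, a) if idx >= i + 2 else a for idx, a in enumerate(atoms)]

# Made-up zigzag chain of four atoms with unit bond lengths.
chain = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (1.0, 1.0, 0.0), (2.0, 1.0, 0.0)]
rotated = rotate_about_bond(chain, 1, 90.0)
```

Because the transformation is a rigid rotation about the bond, the bond lengths and bond angles of the chain are unchanged; only the dihedral angle around bond 1 differs.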
Alternatively, if many rotations need to be performed at the same time (and the intermediate Cartesian coordinates are not needed), these rotations could be sorted by bond number and applied simultaneously, by noting that rotations can be performed in a cumulative way as the backbone is traversed from anchor to end atom. The ability to chain rotations around arbitrary vectors in space (i.e. not through the origin) is one of the main benefits of homogeneous transformations. For example, if two rotations need to be applied at the same time, one around bond 3 by 30 degrees and another around bond 7 by 15 degrees, the atoms between bonds 3 and 7 get updated by:

But the atoms after bond 7 are updated by:

In the above, bond n is the unit vector defined along bond n, easily computed by subtracting the coordinates of atom n from those of atom n+1 and then dividing the result by its norm. Chaining transformations in this way makes it possible to achieve arbitrary rotations of bonds within a protein. Sections of the protein (i.e. atoms belonging to certain residues) can be updated when a dihedral rotation is performed simply by constructing the overall matrix that should affect them.
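The cumulative traversal can be sketched as below. It assumes zero-based indices with bond i joining atoms i and i+1, takes every rotation axis from the original (pre-rotation) coordinates, and accumulates one matrix while walking from the anchor to the end of the chain, so each atom is transformed by the product of the rotations of all bonds that precede it. The function names and the dictionary format for the requested rotations are assumptions.

```python
import math

def rotation_about_bond(p, q, theta_deg):
    """4x4 matrix for a rotation of theta_deg about the line from p to q."""
    d = [q[k] - p[k] for k in range(3)]
    n = math.sqrt(sum(a * a for a in d))
    x, y, z = (a / n for a in d)
    t = math.radians(theta_deg)
    c, s = math.cos(t), math.sin(t)
    v = 1.0 - c
    r = [[c + x * x * v, x * y * v - z * s, x * z * v + y * s],
         [y * x * v + z * s, c + y * y * v, y * z * v - x * s],
         [z * x * v - y * s, z * y * v + x * s, c + z * z * v]]
    # Translate-rotate-translate collapses to the block form [r | p - r p].
    offset = [p[i] - sum(r[i][k] * p[k] for k in range(3)) for i in range(3)]
    return [r[i] + [offset[i]] for i in range(3)] + [[0.0, 0.0, 0.0, 1.0]]

def matmul(a, b):
    return [[sum(a[r][k] * b[k][c] for k in range(4)) for c in range(4)]
            for r in range(4)]

def transform(m, point):
    h = (point[0], point[1], point[2], 1.0)
    return tuple(sum(m[r][k] * h[k] for k in range(4)) for r in range(3))

def apply_rotations(atoms, rotations):
    """rotations maps a bond index to an angle in degrees."""
    cumulative = [[float(r == c) for c in range(4)] for r in range(4)]
    updated = []
    for idx, atom in enumerate(atoms):
        bond = idx - 2   # atom idx is the first atom moved by this bond
        if bond in rotations:
            step = rotation_about_bond(atoms[bond], atoms[bond + 1], rotations[bond])
            cumulative = matmul(cumulative, step)
        updated.append(transform(cumulative, atom))
    return updated

# Made-up zigzag chain; rotate bond 0 and bond 2 by 90 degrees each.
chain = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (1.0, 1.0, 0.0),
         (2.0, 1.0, 0.0), (2.0, 2.0, 0.0)]
moved = apply_rotations(chain, {0: 90.0, 2: 90.0})
```

Each atom is visited once and only one matrix product is formed per rotated bond, which is the benefit of sorting the rotations by bond number instead of updating all downstream atoms after every individual rotation.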