Skip to content Skip to navigation

Connexions

You are here: Home » Content » Modeling the genefinding problem

Navigation

Content Actions

  • Download module PDF
  • Add to ...
    Add the module to:
    • My Favorites
    • A lens
    • An external social bookmarking service
    • My Favorites (What is 'My Favorites'?)
      'My Favorites' is a special kind of lens which you can use to bookmark modules and collections directly in Connexions. 'My Favorites' can only be seen by you, and collections saved in 'My Favorites' can remember the last module you were on. You need a Connexions account to use 'My Favorites'.
    • A lens (What is a lens?)

      Definition of a lens

      Lenses

      A lens is a custom view of Connexions content. You can think of it as a fancy kind of list that will let you see Connexions through the eyes of organizations and people you trust.

      What is in a lens?

      Lens makers point to Connexions materials (modules and collections), creating a guide that includes their own comments and descriptive tags about the content.

      Who can create a lens?

      Any individual Connexions member, a community, or a respected organization.

    • External bookmarks
  • E-mail the author

Recently Viewed

Modeling the genefinding problem

Module by: Devika Subramanian

Summary: A brief description of how genefinding is computationally modeled.

Our task is to find coding regions in eukaryotic genomes. We have already studied the complexity of the structure of eukaryotic genes. if you need a refresher, check out this self-paced tutorial from the University of Glasgow.

Early approaches to genefinding

One fundamental approach to finding genes is to detect functional sites in genomic DNA. Fixed length sites like splice sites, start and stop codons, polyA sites, ribosomal binding sites, and transcription factor binding sites are called signals, and algorithms that detect them are called signal sensors. Variable length regions like exons and introns in eukaryotic DNA are recognized by another family of methods called content sensors.

Figure 1
A consensus sequence
 A consensus sequence  (tatabox.png)

Show above is a sequence motif of size 6. The letters in each position are drawn in proportion to the probability of having that letter in that position. This probability information is summarized in a weight matrix shown below. Weight matrices are the simplest form of signal sensors.

Weight matrix
Position A C G T
1 0.028 0.034 0.026 0.912
2 0.805 0.031 0.123 0.041
3 0.046 0.158 0.022 0.774
4 0.669 0.019 0.253 0.059
5 0.024 0.044 0.028 0.904
6 0.962 0.012 0.014 0.012
The weight matrix is an example of a probabilistic sequence model. We treat each sequence as a string in the alphabet {A,C,G,T}. Each entry (i,j) in the matrix represents the probability of a base i in position j of the string. Given a DNA sequence or string s of length 6, we can evaluate its likelihood with respect to this model. That is, we can calculate the probability that the sequence is generated by the weight matrix above.

P(TATATA) = 0.912 * 0.805 * 0.774 * 0.669 * 0.904 * 0.962 = 0.33

P(ATATAT) = 0.028 * 0.041 * 0.046 * 0.059 * 0.024 * 0.012 = 8.9 * 10^(-10)

We can see that with respect to the sequence model described by the weight matrix, the string TATATA is overwhelmingly more likely than the string ATATAT. The weight matrix model assumes that the bases at each position are independent of each other. We will study more sophisticated models, called Markov models, which take dependencies between bases at different positions into account. The advantage of weight matrix models is their simplicity which allows them to be estimated with very little data. The disadvantage is that the models are quite rigid and do not accommodate the kind of variability seen in real biological sequences.

Content sensors include detectors of CpG islands, an example we will consider in detail later in this module. Exon and intron detectors are some of the most widely studied in the literature. The GRAIL system detects exons, polyAs,and CpG islands. You can submit a DNA sequence on the form linked above to check a content sensor program out.

Signal and content sensors alone cannot solve the genefinding problem. The statistical signals they are trying to recognize are too weak, and there are dependencies between signals and content that they cannot capture. Since the late nineties, attempts have been made to develop probabilistic systems that combine signal and content sensors to try to identify complete gene structure. One of the best known of these systems is Genscan, developed by Chris Burge and his advisor Samuel Karlin at Stanford University in 1997. Genscan is based on hidden Markov models.

An example: finding CpG islands

This example is taken from the excellent textbook Biological Sequence Analysis: probabilistic models of proteins and nucleic acids by Durbin, Eddy, Krogh and Mitchison. CpG islands are regions of the genome with a higher than normal percentage of C and G bases adjacent to each other. The usual percentage of adjacent CG bases in the genome is about 1%, but in CpG islands that percentage is over 6%. The reason that C followed by G is relatively rare in The "p" in "CpG" refers to the phosphodiester bond between the cytosine and the guanine, and serves to distinguish it from the C and G pairing on the double stranded DNA helix. CpG islands are bioogically intersting because they are in or near 40% of the promoters in mammalian genes and 70% in human promoter genes. CpG islands vary in length between 300 and 3000 basepairs. Thus fixed-length consensus sequence based approaches do not work well for detecting them. Effective identification of of CpG islands can aid in localizing genes in eukaryotes. CpG island detection also serves as an excellent problem to illustrate the power of Markov models.

We will consider two problems.

  • Given a short DNA sequence, does it come from a CpG island or not?
  • Given a long DNA sequence, find all the CpG islands on it, if any.

Generative models of biological sequences

We will construct generative models of CpG islands. A generative model produces strings, and the model parameters are tuned to reflect the characteristics of CpG islands.

Figure 2
Generative models for CpG island detection
 Generative models for CpG island detection  (genModel.png)

The simplest probabilistic generative DNA sequence model associates a probability with the occurrence of each base: P(A), P(C), P(G) and P(T) such that these probabilities all sum to 1. For H. influenzae, these probabilities are P(A) = 0.3, P(C) = 0.2, P(G) = 0.2, and P(T) = 0.3. To generate a sequence based on this model, we first choose the length L of the sequence that we wish to construct. Then we draw bases for each position based on the discrete distribution above, as shown in the code fragement below.


i = 1; 
while i less-than-or-equal-to L do  
 S[i] = a base drawn from the discrete probability distribution [0.3,0.2,0.2,0.3] (for A,C,G,T) 
 i = i+1 
end  

This model does not capture interdependencies between bases. It assumes that the choice of base in each position of the generated sequence is independent of the bases surrounding it. A more complex model of DNA sequences can be constructed using the theory of Markov chains. In Markov chains, the probability of observing a base at a given position in a sequence is conditioned on the bases preceding it. Thus, Markov chains can model local correlations among the nucleotides. A Markov chain of order 1 assumes that the probability of a base at position i is dependent only on the base at position i - 1. A first order Markov chain can be specified by a probability matrix as shown below.

A first order Markov model for generating DNA sequences
A C G T
A 0.6 0.2 0.1 0.1
C 0.1 0.1 0.8 0.0
G 0.2 0.2 0.3 0.3
T 0.1 0.8 0.0 0.1

Note that every row of the first order Markov model above sums to 1. Since the probability of T following G (row 3, column 4) is zero, no sequence generated by this model will contain the subsequence "GT". We can use the model to generate a sequence of length L, using a small variation of the code snippet shown above.


i = 1; S[i] = a base chosen uniformly randomly from {A,C,T,G}.
for (i = 1; i less-than L; i++) do
   S[i+1] = a base chosen from a discrete distribution from row corresponding to base S[i] in Markov model
end

We can use the model for evaluating the likelihood that a given sequence is generated from it. It represents the probability of that sequence given the model.

P(s1...sn) = P(s1) * P(s2|s1) * P(s3|s1) * ... * P(s{n-1}|sn)

We can factor the probability of the whole sequence into the probabilities of observing each transition starting from the first base. We can simplify the probabiltiy computation because of the first order Markov condition -- the probability of a base at a given position depends only the base before it in the sequence. Thus, the probability of observing sequence ACGT based on the first order Markov model shown above is:

P(ACGT) = P(A) * P(C|A) * P(G|C) * P(T|G) =0.25 * 0.2 * 0.8 * 0.3

We can use the model to compare the likelihood of two sequences. Thus, given sequence "ACGT" and "TGAC", we can calculate their individual likelihoods as shown above. P(TGAC) = P(T) * P(G|T) * P(A|G) * P(C|A) = 0.25 * 0 * 0.2 * 0.2 = 0. The sequence "TGAC" can never be generated by this model!

If we have a first order Markov model of CpG islands, we can now calculate the probability of a DNA sequence with respect to that model. If that probability exceeds a given threshold (say, 0.8), then we will assert that the given DNA sequence is in fact from a CpG island. How can we acquire a first order Markov model of CpG islands? There are databases of CpG islands available from the NCBI. The CpG islands for human chromosomes can be obtained from here.

Given a set of CpG island sequences, we can calculate the probability P(a|b) in the model, for a, b in {A,C,G,T} by counting the percentage of times the subsequence "ba" occurs in those sequences. We can then estimate all sixteen entries in the first order Markov model over the nucleotides. These models are extremely easy to acquire, requiring just counting operations. A first order Markov model learned from CpG islands, and another from a data set of non-CpG islands are shown below.

A first order Markov model for CpG islands
A C G T
A 0.18 0.27 0.43 0.12
C 0.17 0.37 0.28 0.19
G 0.16 0.34 0.38 0.13
T 0.08 0.36 0.38 0.18
A first order Markov model for non-CpG islands
A C G T
A 0.30 0.21 0.29 0.21
C 0.32 0.30 0.08 0.30
G 0.25 0.25 0.30 0.20
T 0.18 0.24 0.30 0.30

You can see that P(G|C) = 0.28 in the CpG island model, while P(G|C)=0.08 in the non-CpG islands model.

Using generative models to classify sequences

Given first order models of CpG and non-CpG islands and a DNA sequence s, we can determine whether the sequence comes from a CpG island or not by computing its log-odds ratio with respect to the models. If P(s|CpG) > P(s|non-CpG), then s is classified as a CpG island. The decision rule can be alternately cast as P(s|CpG)/P(s|non-CpG) > 1, or taking logarithms on both sides, we get log[P(s|CpG)/P(s|non-CpG)] > 0. The logarithm of the ratios of the two probabilities is called the log-odds ratio. If the log-odds ratio is greater than 0, then s is part of a CpG island.

Figure 3
Histogram of log-odds ratios
 Histogram of log-odds ratios (histogram.png)

Comments, questions, feedback, criticisms?

Send feedback