<?xml version="1.0" encoding="utf-8" standalone="no"?>
<!DOCTYPE document PUBLIC "-//CNX//DTD CNXML 0.5 plus MathML//EN" "http://cnx.rice.edu/technology/cnxml/schema/dtd/0.5/cnxml_mathml.dtd">
<document xmlns="http://cnx.rice.edu/cnxml" xmlns:md="http://cnx.rice.edu/mdml/0.4" xmlns:bib="http://bibtexml.sf.net/" xmlns:m="http://www.w3.org/1998/Math/MathML" id="new">
  <name>Modeling the genefinding problem</name>
  <metadata>
  <md:version>1.2</md:version>
  <md:created>2007/05/30 17:17:55 GMT-5</md:created>
  <md:revised>2007/09/05 12:44:08.963 GMT-5</md:revised>
  <md:authorlist>
      <md:author id="devika">
      <md:firstname>Devika</md:firstname>
      
      <md:surname>Subramanian</md:surname>
      <md:email>devika@rice.edu</md:email>
    </md:author>
  </md:authorlist>

  <md:maintainerlist>
    <md:maintainer id="devika">
      <md:firstname>Devika</md:firstname>
      
      <md:surname>Subramanian</md:surname>
      <md:email>devika@rice.edu</md:email>
    </md:maintainer>
  </md:maintainerlist>
  
  <md:keywordlist>
    <md:keyword>computational genefinding</md:keyword>
    <md:keyword>content and signal sensors</md:keyword>
    <md:keyword>hidden Markov models</md:keyword>
    <md:keyword>prokaryotic and eukaryotic genomes</md:keyword>
  </md:keywordlist>

  <md:abstract>A brief description of how genefinding is computationally modeled.</md:abstract>
</metadata>
  <content>
    <para id="delete_me"><!-- Insert module text here -->
Our task is to find coding regions in eukaryotic genomes.
We have already studied the complexity of the structure of eukaryotic genes. if you need a refresher, check out this self-paced <link src="http://www.gla.ac.uk/faculties/vet/teaching/CAL/biomolecular/module1.htm"> tutorial</link> from the University of Glasgow. 
</para>  

<section id="section1">
<name> Early approaches to genefinding </name>
<para id="para2">One fundamental approach to finding genes is to detect functional sites in genomic DNA. Fixed length sites like splice sites, start and stop codons, polyA sites, ribosomal binding sites, and transcription factor binding sites are called signals, and algorithms that detect them are called <emphasis> signal sensors</emphasis>. Variable length regions like exons and introns in eukaryotic DNA are recognized by another family of methods called <emphasis>content sensors</emphasis>.

</para><figure id="element-946"><name> A consensus sequence </name>
<media type="image/png" src="tatabox.png"/></figure><para id="element-338">Show above is a sequence motif of size 6. The letters in each position are drawn in proportion to the probability of having that letter in that position. This probability information is summarized in a weight matrix shown below. Weight matrices are the simplest form of signal sensors.
<table id="wm1" frame="all">
<name> Weight matrix </name>
<tgroup cols="5" colsep="1" rowsep="1">
<thead>
<row>
  <entry> Position </entry>
  <entry> A </entry>
  <entry> C </entry>
  <entry> G </entry>
  <entry> T </entry>
</row>
</thead>
<tbody>
<row>
<entry>1 </entry>
<entry> 0.028 </entry>
<entry> 0.034 </entry>
<entry> 0.026 </entry>
<entry> 0.912 </entry>
</row>
<row>
<entry>2 </entry>
<entry> 0.805 </entry>
<entry> 0.031 </entry>
<entry> 0.123 </entry>
<entry> 0.041 </entry>
</row>
<row>
<entry>3 </entry>
<entry> 0.046 </entry>
<entry> 0.158 </entry>
<entry> 0.022 </entry>
<entry> 0.774 </entry>
</row>
<row>
<entry>4 </entry>
<entry> 0.669 </entry>
<entry> 0.019 </entry>
<entry> 0.253 </entry>
<entry> 0.059 </entry>
</row>
<row>
<entry>5</entry>
<entry> 0.024 </entry>
<entry> 0.044 </entry>
<entry> 0.028 </entry>
<entry> 0.904 </entry>
</row>
<row>
<entry>6</entry>
<entry> 0.962 </entry>
<entry> 0.012 </entry>
<entry> 0.014 </entry>
<entry> 0.012 </entry>
</row>
</tbody>
</tgroup>
</table>


The weight matrix is an example of a probabilistic sequence model. We treat each sequence as a string in the alphabet {A,C,G,T}. Each entry (i,j) in the matrix represents the probability of a base i in  position j of the string. Given a DNA sequence or string s of length 6, we can evaluate its likelihood with respect to this model. That is, we can calculate the probability that the sequence is generated by the weight matrix above.


</para><para id="element-959">P(TATATA) = 0.912 * 0.805 * 0.774 * 0.669 * 0.904 * 0.962 = 0.33</para><para id="element-920">P(ATATAT) = 0.028 * 0.041 * 0.046 * 0.059 * 0.024 * 0.012 = 8.9 * 10^(-10)</para><para id="element-749">We can see that with respect to the sequence model described by the weight matrix, the string TATATA is overwhelmingly more likely than the string ATATAT. The weight matrix model assumes that the bases at each position are independent of each other. We will study more sophisticated models, called Markov models, which take dependencies between bases at different positions into account. The advantage of weight matrix models is their simplicity which allows them to be estimated with very little data. The disadvantage is that the models are quite rigid and do not accommodate the kind of variability seen in real biological sequences.</para><para id="element-384">Content sensors include detectors of CpG islands, an example we will consider in detail later in this module. Exon and intron detectors are some of the most widely studied in the literature. The <link src="http://compbio.ornl.gov/Grail-1.3/">GRAIL </link>  system detects exons, polyAs,and CpG islands. You can submit a DNA sequence on the form linked above to check a content sensor program out.</para><para id="element-608">Signal and content sensors alone cannot solve the genefinding problem. The statistical signals they are trying to recognize are too weak, and there are dependencies between signals and content that they cannot capture. Since the late nineties, attempts have been made to develop probabilistic systems that combine signal and content sensors to try to identify complete gene structure. One of the best known of these systems is Genscan, developed by Chris Burge and his advisor Samuel Karlin at Stanford University in 1997. Genscan is based on hidden Markov models. </para>

</section> 

<section id="cpg-islands">
<name> An example: finding CpG islands </name>

<para id="pp1">
This example is taken from the excellent textbook <link src="http://www.amazon.com/Biological-Sequence-Analysis-Probabilistic-Proteins/dp/0521629713">Biological Sequence Analysis: probabilistic models of proteins and nucleic acids</link> by Durbin, Eddy, Krogh and Mitchison. CpG islands are regions of the genome with a higher than normal percentage of C and G bases adjacent to each other. The usual percentage of adjacent CG bases in the genome is about 1%, but in CpG islands that percentage is over 6%. The reason that C followed by G is relatively rare in The "p" in "CpG" refers to the phosphodiester bond between the cytosine and the guanine, and serves to distinguish it from the C and G pairing on the double stranded DNA helix. CpG islands are bioogically intersting because they are in or near 40% of the promoters in mammalian genes and 70% in human promoter genes. CpG islands vary in length between 300 and 3000 basepairs. Thus fixed-length consensus sequence based approaches do not work well for detecting them. Effective identification of of CpG islands can aid in localizing genes in eukaryotes. CpG island detection also serves as an excellent problem to illustrate the power of Markov models.

</para>
<para id="pp2">
We will consider two problems. 
<list id="l1">
<name> </name>
<item> Given a short DNA sequence, does it come from a CpG island or not?
</item>
<item> Given a long DNA sequence, find all the CpG islands on it, if any.</item>
</list>
</para>
</section>
<section id="s222">
<name> Generative models of biological sequences </name>
<para id="genpara">
We will construct generative models of CpG islands. A generative model produces strings, and the model parameters are tuned to reflect the characteristics of CpG islands.
<figure id="cpg-model"><name> Generative models for CpG island detection </name>
	<media type="image/png" src="genModel.png">
	</media>
</figure>
</para><para id="element-547">The simplest probabilistic generative DNA sequence model associates a probability with the occurrence of each base: P(A), P(C), P(G) and P(T) such that these probabilities all sum to 1. For H. influenzae, these probabilities are P(A) = 0.3, P(C) = 0.2, P(G) = 0.2, and P(T) = 0.3. To generate a sequence based on this model, we first choose the length L of the sequence that we wish to construct. Then we draw bases for each position based on the discrete distribution above, as shown in the code fragement below.</para>
<code type="block">
i = 1; 
while i less-than-or-equal-to L do  
 S[i] = a base drawn from the discrete probability distribution [0.3,0.2,0.2,0.3] (for A,C,G,T) 
 i = i+1 
end  
</code><para id="element-464">This model does not capture interdependencies between bases. It assumes that the choice of base in each position of the generated sequence is independent of the bases surrounding it. A more complex model of DNA sequences can be constructed using the theory of Markov chains. In Markov chains, the probability of observing a base at a given position in a sequence is conditioned on the bases preceding it. Thus, Markov chains can model local correlations among the nucleotides. A Markov chain of order 1 assumes that the probability of a base at position i is dependent only on the base at position i - 1. A first order Markov chain can be specified by a probability matrix as shown below.
</para>

<table id="markov1" frame="all">
<name> A first order Markov model for generating DNA sequences</name>
<tgroup cols="5" colsep="1" rowsep="1">
<thead>
<row>
<entry> </entry>
<entry> A </entry>
<entry> C </entry>
<entry> G </entry>
<entry> T </entry>
</row>
</thead>
<tbody>
<row>
<entry> <emphasis> A </emphasis> </entry>
<entry> 0.6 </entry>
<entry> 0.2 </entry>
<entry> 0.1 </entry>
<entry> 0.1 </entry>
</row>
<row>
<entry> <emphasis> C </emphasis></entry>
<entry> 0.1 </entry>
<entry> 0.1 </entry>
<entry> 0.8 </entry>
<entry> 0.0 </entry>
</row>
<row>
<entry> <emphasis> G </emphasis> </entry>
<entry> 0.2 </entry>
<entry> 0.2 </entry>
<entry> 0.3 </entry>
<entry> 0.3 </entry>
</row>
<row>
<entry> <emphasis> T </emphasis> </entry>
<entry> 0.1 </entry>
<entry> 0.8 </entry>
<entry> 0.0 </entry>
<entry> 0.1 </entry>
</row>
</tbody>
</tgroup>
</table><para id="element-928">Note that every row of the first order Markov model above sums to 1. Since the probability of T following G (row 3, column 4) is zero, no sequence generated by this model will contain the subsequence "GT".  We can use the model to generate a sequence of length L, using a small variation of the code snippet shown above.
</para>

<code id="m2code" type="block">
i = 1; S[i] = a base chosen uniformly randomly from {A,C,T,G}.
for (i = 1; i less-than L; i++) do
   S[i+1] = a base chosen from a discrete distribution from row corresponding to base S[i] in Markov model
end
</code><para id="element-985">We can use the model for evaluating the likelihood that a given sequence is generated from it. It represents the probability of that sequence given the model.</para><para id="element-804">P(s1...sn) = P(s1) * P(s2|s1) * P(s3|s1) * ... * P(s{n-1}|sn)</para><para id="element-36">We can factor the probability of the whole sequence into the probabilities of observing each transition starting from the first base. We can simplify the probabiltiy computation because of the first order Markov condition -- the probability of a base at a given position depends only the base before it in the sequence. Thus, the probability of observing sequence ACGT based on the first order Markov model shown above is:</para><para id="element-647">P(ACGT) = P(A) * P(C|A) * P(G|C) * P(T|G) =0.25 * 0.2 * 0.8 * 0.3 </para><para id="element-950">We can use the model to compare the likelihood of two sequences. Thus, given sequence "ACGT" and "TGAC", we can calculate their individual likelihoods as shown above.
P(TGAC) = P(T) * P(G|T) * P(A|G) * P(C|A) = 0.25 * 0 * 0.2 * 0.2 = 0. The sequence "TGAC" can never be generated by this model!</para><para id="element-324">If we have a first order Markov model of CpG islands, we can now calculate the probability of a DNA sequence with respect to that model. If that probability exceeds a given threshold (say, 0.8), then we will assert that the given DNA sequence is in fact from a CpG island. How can we acquire a first order Markov model of CpG islands? There are databases of CpG islands available from the NCBI. The CpG islands for human chromosomes can be obtained from
<link src="http://www.ncbi.nlm.nih.gov/projects/genome/guide/human/"> here</link>. </para><para id="element-332">Given a set of CpG island sequences, we can calculate the probability P(a|b) in the model, for a, b in {A,C,G,T} by counting the percentage of times the subsequence "ba" occurs in those sequences. We can then estimate all sixteen entries in the first order Markov model over the nucleotides. These models are extremely easy to acquire, requiring just  counting operations. A first order Markov model learned from CpG islands, and another from a data set of non-CpG islands are shown below.</para>

<table id="markov2" frame="all">
<name> A first order Markov model for CpG islands</name>
<tgroup cols="5" colsep="1" rowsep="1">
<thead>
<row>
<entry> </entry>
<entry> A </entry>
<entry> C </entry>
<entry> G </entry>
<entry> T </entry>
</row>
</thead>
<tbody>
<row>
<entry> <emphasis> A </emphasis> </entry>
<entry> 0.18 </entry>
<entry> 0.27 </entry>
<entry> 0.43 </entry>
<entry> 0.12 </entry>
</row>
<row>
<entry> <emphasis> C </emphasis></entry>
<entry> 0.17 </entry>
<entry> 0.37 </entry>
<entry> 0.28 </entry>
<entry> 0.19 </entry>
</row>
<row>
<entry> <emphasis> G </emphasis> </entry>
<entry> 0.16 </entry>
<entry> 0.34 </entry>
<entry> 0.38 </entry>
<entry> 0.13 </entry>
</row>
<row>
<entry> <emphasis> T </emphasis> </entry>
<entry> 0.08 </entry>
<entry> 0.36 </entry>
<entry> 0.38 </entry>
<entry> 0.18 </entry>
</row>
</tbody>
</tgroup>
</table>


<table id="markov3" frame="all">
<name> A first order Markov model for non-CpG islands</name>
<tgroup cols="5" colsep="1" rowsep="1">
<thead>
<row>
<entry> </entry>
<entry> A </entry>
<entry> C </entry>
<entry> G </entry>
<entry> T </entry>
</row>
</thead>
<tbody>
<row>
<entry> <emphasis> A </emphasis> </entry>
<entry> 0.30 </entry>
<entry> 0.21 </entry>
<entry> 0.29 </entry>
<entry> 0.21 </entry>
</row>
<row>
<entry> <emphasis> C </emphasis></entry>
<entry> 0.32 </entry>
<entry> 0.30 </entry>
<entry> 0.08 </entry>
<entry> 0.30 </entry>
</row>
<row>
<entry> <emphasis> G </emphasis> </entry>
<entry> 0.25 </entry>
<entry> 0.25 </entry>
<entry> 0.30 </entry>
<entry> 0.20 </entry>
</row>
<row>
<entry> <emphasis> T </emphasis> </entry>
<entry> 0.18 </entry>
<entry> 0.24 </entry>
<entry> 0.30 </entry>
<entry> 0.30 </entry>
</row>
</tbody>
</tgroup>
</table>

<para id="gen23">
You can see that P(G|C) = 0.28 in the CpG island model, while P(G|C)=0.08 in the non-CpG islands model.
</para>

</section>

<section id="using1">
<name> Using generative models to classify sequences </name>
<para id="modeluse">
Given first order models of CpG and non-CpG islands and a DNA sequence s, we can determine whether the sequence comes from a CpG island or not by computing its log-odds ratio with respect to the models. If P(s|CpG) &gt; P(s|non-CpG), then s is classified as a CpG island. The decision rule can be alternately cast as
P(s|CpG)/P(s|non-CpG) &gt; 1, or taking logarithms on both sides, we get
log[P(s|CpG)/P(s|non-CpG)] &gt; 0. The logarithm of the ratios of the two probabilities is called the <emphasis> log-odds ratio</emphasis>. If the log-odds ratio is greater than 0, then s is part of a CpG island.

<figure id="element-958"><name> Histogram of log-odds ratios</name>
<media type="image/png" src="histogram.png"/></figure>
</para>

</section>
  </content>
  
</document>
