Summary: This course is a short series of lectures on Statistical Bioinformatics. Topics covered are listed in the Table of Contents. The notes were prepared by Ewa Paszek, Lukasz Wita and Marek Kimmel. Development of this course was supported by NSF grant 0203396.
In a Boolean network, each (target) gene is ‘predicted’ by several other genes by means of a Boolean function (predictor). Thus, once such a function has been inferred from gene expression data, one could conclude that observing the values of the predictive genes determines the value of the target gene with full certainty. Conceptually, such inherent determinism seems problematic, as it assumes an environment with no uncertainty. However, the data used for the inference exhibit uncertainty on several levels.
Another class of models, called Probabilistic Boolean Networks (PBNs) (Shmulevich et al., 2002), shares the appealing properties of Boolean networks but is able to cope with uncertainty, both in the data and in model selection. A model incorporates only a partial description of a physical system. This means that a Boolean function giving the next state of a variable is likely to be only partially accurate.
The basic idea is to extend the Boolean network to accommodate more than one possible function for each node. Thus, to every node xi there corresponds a set Fi = {fj}, j = 1, ..., l(i), where each fj is a possible function determining the value of gene xi and l(i) is the number of possible functions for gene xi. A realization of the PBN at a given instant of time is determined by a vector of Boolean functions, where the ith element of that vector contains the predictor selected at that instant for gene xi. In other words, the vector function fk: {0,1}^n → {0,1}^n acts as a transition function (mapping) representing a possible realization of the entire PBN. Such functions are commonly referred to as multiple-output Boolean functions. Each of the N possible realizations can be thought of as a standard Boolean network that operates for one time step. In other words, at every state x(t) ∈ {0,1}^n, one of the N Boolean networks is chosen and used to make the transition to the next state x(t+1) ∈ {0,1}^n. The probability Pi that the ith (Boolean) network or realization is selected can be easily expressed in terms of the individual selection probabilities Cj; see Shmulevich et al. (2002). The dynamics of the PBN are essentially the same as for Boolean networks, but at any given point in time, the value of each node is determined by one of the possible predictors, chosen according to its corresponding probability. This can be interpreted by saying that at any point in time, we have one out of N possible networks. The basic building block of a PBN is shown in Figure 1.
| AN EXAMPLE |
|---|
![]() |
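The PBN update rule described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the three-gene network, its predictors, and the selection probabilities are invented for the example, and the predictors of different genes are assumed to be selected independently, so that the probability of a realization is the product of the selection probabilities of its predictors and N is the product of the l(i).

```python
import itertools
import random

# Hypothetical 3-gene PBN (not taken from any data set).
# PREDICTORS[i] lists the candidate predictors for gene x_i as
# (Boolean function of the full state, selection probability) pairs.
PREDICTORS = [
    [(lambda x: x[1] & x[2], 0.6), (lambda x: x[1] | x[2], 0.4)],  # l(1) = 2
    [(lambda x: x[0], 1.0)],                                       # l(2) = 1
    [(lambda x: 1 - x[0], 0.5), (lambda x: x[0] ^ x[1], 0.5)],     # l(3) = 2
]


def realizations(predictors):
    """Yield every realization (one predictor per gene) with its probability.

    Assumes independent predictor selection across genes, so the
    probability of a realization is the product of its selection
    probabilities.
    """
    for combo in itertools.product(*predictors):
        funcs = [f for f, _ in combo]
        prob = 1.0
        for _, c in combo:
            prob *= c
        yield funcs, prob


def step(state, predictors, rng):
    """Make one PBN transition x(t) -> x(t+1).

    For each gene, one predictor is drawn according to its selection
    probability and applied to the current state.
    """
    next_state = []
    for options in predictors:
        funcs = [f for f, _ in options]
        weights = [c for _, c in options]
        chosen = rng.choices(funcs, weights=weights, k=1)[0]
        next_state.append(chosen(state))
    return tuple(next_state)


if __name__ == "__main__":
    rng = random.Random(0)
    state = (1, 0, 1)
    for t in range(5):
        print(t, state)
        state = step(state, PREDICTORS, rng)
```

Drawing one predictor per gene in `step` is equivalent, under the independence assumption, to drawing one of the N realizations listed by `realizations` and applying it as an ordinary Boolean network for a single time step.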
Bayesian networks (Friedman et al., 2000; Pearl, 1988), a well-studied statistical tool, represent the dependence structure between multiple interacting quantities (e.g., expression levels of different genes). Bayesian networks are a promising tool for analyzing gene expression patterns. First, they are particularly useful for describing processes composed of locally interacting components; that is, the value of each component directly depends on the values of a relatively small number of components. Second, statistical foundations for learning Bayesian networks from observations, and computational algorithms to do so, are well understood and have been used successfully in many applications. Finally, Bayesian networks provide models of causal influence: although Bayesian networks are mathematically defined strictly in terms of probabilities and conditional independence statements, a connection can be made between this characterization and the notion of direct causal influence (Heckerman et al., 1999; Pearl and Verma, 1991; Spirtes et al., 1993). Although this connection depends on several assumptions that do not necessarily hold in gene expression data, the conclusions of Bayesian network analysis might be indicative of some causal connections in the data.
A Bayesian network (also known as a causal probabilistic network) is an annotated directed acyclic graph that encodes a joint probability distribution of a set of random variables X. Formally, a Bayesian network for X is a pair B=(G,Q). The first component, G, is a directed acyclic graph (DAG) whose vertices correspond to the random variables x1, . . . , xn, and whose edges represent direct dependencies between the variables. The graph G encodes the Markov assumption: each variable xi is independent of its nondescendants, given its parents in G. The second component of the pair, namely Q, represents the set of parameters that quantifies the network and describes a conditional distribution for each variable, given its parents in G. Together, these two components specify a unique distribution on x1, . . . , xn. The conditional independence assumptions encoded by G allow the joint distribution to be decomposed into a product of these conditional distributions, economizing on the number of parameters. Given a Bayesian network, we might want to answer many types of questions that involve the joint probability (e.g., what is the probability of X = x given observation of some of the other variables?) or independencies in the domain (e.g., are X and Y independent once we observe Z?). The literature contains a suite of algorithms that can answer such queries efficiently by exploiting the explicit representation of structure (Jensen, 1996; Pearl, 1988).
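As a rough illustration of how the pair B=(G,Q) determines a joint distribution and supports the two kinds of queries mentioned above, the following Python sketch builds a tiny binary network and answers queries by brute-force enumeration. The three variables, the graph A → B, A → C, and all probability values are hypothetical, and enumeration over all assignments is used only for clarity; it is not one of the efficient, structure-exploiting algorithms referred to in the literature.

```python
from itertools import product

# Hypothetical network over binary variables with edges A -> B and A -> C.
# PARENTS plays the role of the graph G: each variable maps to its parents.
PARENTS = {"A": (), "B": ("A",), "C": ("A",)}

# CPT plays the role of the parameters Q:
# CPT[var][parent values] = P(var = 1 | parents). Values are invented.
CPT = {
    "A": {(): 0.3},
    "B": {(0,): 0.2, (1,): 0.9},
    "C": {(0,): 0.5, (1,): 0.1},
}


def joint(assignment):
    """Joint probability of a full assignment.

    Uses the factorization implied by the Markov assumption:
    the product over variables of P(x_i | parents of x_i).
    """
    p = 1.0
    for var, parents in PARENTS.items():
        p_one = CPT[var][tuple(assignment[q] for q in parents)]
        p *= p_one if assignment[var] == 1 else 1.0 - p_one
    return p


def probability(query, evidence=None):
    """P(query | evidence), summing the joint over all consistent assignments."""
    evidence = evidence or {}
    names = list(PARENTS)
    numerator = 0.0
    denominator = 0.0
    for values in product((0, 1), repeat=len(names)):
        assignment = dict(zip(names, values))
        if any(assignment[k] != v for k, v in evidence.items()):
            continue
        p = joint(assignment)
        denominator += p
        if all(assignment[k] == v for k, v in query.items()):
            numerator += p
    return numerator / denominator


if __name__ == "__main__":
    # B and C are dependent marginally, but independent once A is observed.
    print(probability({"B": 1}))
    print(probability({"B": 1}, {"C": 1}))
    print(probability({"B": 1}, {"A": 1}))
    print(probability({"B": 1}, {"A": 1, "C": 1}))
```

In this toy network the full joint over three binary variables would need seven free parameters, while the factored form needs only five (one for A, two each for B and C), which is the economy in parameters that the graph structure provides.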
Let us apply the approach to the data of Spellman et al. (1998). This data set contains 76 gene expression measurements of the mRNA levels of 6177 S. cerevisiae ORFs. These experiments measure six time series under different cell cycle synchronization methods. Spellman et al. (1998) identified 800 genes whose expression varied over the different cell-cycle stages. In learning from this data, each measurement is treated as an independent sample from a distribution, and the temporal aspect of the measurement is not taken into account. Since the cell cycle process is clearly of a temporal nature, this is compensated for by introducing an additional variable denoting the cell cycle phase. This variable is forced to be a root in all the networks learned. Its presence allows one to model the dependency of expression levels on the current cell cycle phase. Two experiments were performed, one with the discrete multinomial distribution, the other with the linear Gaussian distribution. The learned features show that intricate structure can be recovered even from such small data sets. It is important to note that the learning algorithm uses no prior biological knowledge or constraints. All learned networks and relations are based solely on the information conveyed in the measurements themselves. These results are available at the following web page: http://www.cs.huji.ac.il/labs/compbio/expression. Figure 2 illustrates the graphical display of some results from this analysis.
| SVS1 Gene Interaction Network |
|---|
![]() |