Inside Collection (Course): Bios 533 Bioinformatics
Summary: This module is designed to familiarize the student with the basic principles behind microarray experiments and microarray data analysis.
Microarray chips are devices that enable the scientist to simultaneously measure the transcription level of every gene within a cell. Microarrays are commercially available from a number of companies, such as Affymetrix, Invitrogen and Sigma-Genosys, to name a few. The chip is usually constructed by amplifying all the genes within the selected genome (yeast, for example) using polymerase chain reaction (PCR) methodology. The PCR products are then "spotted" onto the chips by a robot, as single-stranded DNA that is linked by covalent bonds to the glass slide. The spots are positioned in an array on a grid pattern, where each spot contains many identical copies of an individual gene. A discussion of the chemistry involved in creating a microarray can be found on the technology page of the Affymetrix website. The positions of the genes are recorded by spot location, so that the appropriate gene can be identified any time a probe hybridizes with, or binds to, its complementary DNA strand on the chip.
Microarray chips measure transcriptomes, which are the entire collection of RNA transcripts within a cell under the given conditions. Using the chip to measure an experimental transcriptome against a reference transcriptome requires cells grown under two different conditions, the experimental conditions and the reference conditions. The mRNA from each of the two conditions is harvested separately, and reverse transcriptase (1) is used to reverse-transcribe the mRNA into cDNA. The nucleotides used to synthesize the cDNA are labeled with either a green or red dye, one color for the reference conditions and the other for the experimental conditions. The microarray chip is then incubated overnight with both populations of cDNAs, and a given cDNA will hybridize with the complementary strand from its gene that is covalently bound to a grid spot on the chip. The chips are washed to remove any unbound cDNAs, and then two computerized images are produced by scanning first to detect the grid spots containing cDNAs labeled with green dye, and second to detect the spots containing red-labeled cDNAs. The computer also produces a merged image that will show a yellow spot for grid spots that contain both red- and green-labeled cDNAs, indicating transcripts that are expressed under both sets of conditions. A very nice on-line, animated demonstration of the entire protocol is offered by the Genomics Course on the Davidson College website (2).
In addition to producing a qualitative image that is easy to visualize, a microarray experiment yields quantitative data for each spot, consisting of the measured fluorescence intensity of the red signal, the fluorescence intensity of the green signal, and the ratio of red signal to green signal. It is in storing and analyzing the quantitative data that bioinformatics really comes into play in microarray technology. These data sets are very large. For instance, a typical mammalian cell is estimated to have between 10,000 and 20,000 different species of mRNA expressed at a given time.
As a demonstration, view the Stanford Microarray Database website. Under the Public Data section, click on the "Public Login" link. Limit the data set search to the organism Arabidopsis thaliana and the author Gutierrez. Click on the button entitled "Display data", and a table of microarray datasets should be returned. Choose one of the experiments, making note of the Experiment ID number from the table and select the clickable image icon. (There is a legend for the icons at the top of the web page, if there is uncertainty as to which is the clickable image icon.) This yields the qualitative visualization of the microarray experiment, as described previously in this module. Take a look at the array image and note that it is difficult to draw many conclusions from this kind of visualization.
What is the Experiment ID number for the viewed microarray image?
Is it possible to get a feeling for which color dot, green, red, or yellow, is most predominant just by viewing the image? (If so, which color?)
Are all of the dots, over the entire grid, well-shaped? (Give a brief explanation.)
Click on one of the individual spots in the microarray grid. This will open a new window that contains a close-up of the individual spot and all the experimental information about that spot.
Retrieve and list the following information about the chosen spot: a. spot number, b. description (under biological information), c. the Channel 1 intensity (mean), d. the Channel 1 background (median), e. the Channel 1 net intensity (mean), and f. the Log(base2) of R/G Normalized Ratio (Mean).
Return to the table of Arabidopsis thaliana data sets and for Experiment ID #11374, select the "Data" icon, which is the first icon under the "Options" column. Next to "Sort By", select "Log(base2) of R/G Normalized Ratio (Mean)", and "Descending". Under "Display:", click on "Spot", then scroll down and hold down the control key (or the Apple key on Macintosh computers) while selecting "Log(base2) of R/G Normalized Ratio (Mean)". (The control key allows selection of additional choices without deselecting the previous choice.) Accept the default values for all remaining options and select "Display" at the bottom of the page. Recall that the data are converted to numbers representing the fluorescence intensity of red dye, green dye, and the ratios of red to green. Scientists commonly use a log transformation of the ratio data, because the log values are more tractable for statistical analysis. The results page will show the top ranking spots from this chip, ranked from highest log red/green value to lowest.
What are the spot numbers of the three highest ranking spots?
What are the "Log(base2) of R/G Normalized Ratio (Mean)" values for the three highest ranking spots?
Use the browser's back button to go back and select new data for display. Change the sorting selection to R/G Normalized (Mean), in descending order. Under the "Display:" window select "Spot", then scroll down and hold down the control key while selecting "R/G Normalized (Mean)".
What are the spot numbers for the three highest ranking spots?
What are the "R/G Normalized (Mean)" values for the three highest ranking spots?
Does it change the ranking of the spots to use the log transformations of the ratios instead of the ratios?
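The log transformation used in the steps above can be illustrated with a minimal sketch. The spot names and ratio values below are hypothetical, chosen only to show that a base-2 log places up- and down-regulation on a symmetric scale around zero; they are not taken from the Stanford Microarray Database.

```python
import math

# Hypothetical normalized red/green ratios for four spots (illustrative only).
ratios = {"spot_a": 4.0, "spot_b": 2.0, "spot_c": 0.5, "spot_d": 0.25}


def log2_ratio(ratio):
    """Return the base-2 logarithm of a red/green ratio (must be positive)."""
    if ratio <= 0:
        raise ValueError("ratio must be positive")
    return math.log2(ratio)


# Transform every ratio; a two-fold increase and a two-fold decrease
# end up the same distance from zero, with opposite signs.
log_ratios = {spot: log2_ratio(r) for spot, r in ratios.items()}

for spot, value in log_ratios.items():
    print(f"{spot}: ratio={ratios[spot]:.2f}  log2={value:+.1f}")
```

On the raw ratio scale, a two-fold decrease (0.5) sits much closer to 1 than a two-fold increase (2.0) does; after the transformation both are one unit from zero.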
To demonstrate a different method of visualizing and analyzing microarray data, take a look at the MIT Cancer Genomics Microarray Data Sets. Scroll down to the section entitled "Gene Expression Correlates of Clinical Prostate Cancer Behavior". Click on the first data set, for Prostate tumor and normal samples, entitled "Prostate_TN_final0701_allmeanScale.res". This data set originates from Affymetrix chips. In this case, the signal is recorded as "A" for absent, "P" for present, and "M" for marginal, as determined by the Affymetrix GeneChip software. The numerical values are scaled average difference units for tumor vs. normal prediction, and these values are also generated by the Affymetrix software. A more complete discussion of gene expression data analysis for Affymetrix GeneChip Arrays can be found at the Affymetrix web site.
So far, the discussion has been primarily about visualizing and quantifying the fluorescence signal from a microarray experiment. However, analysis of gene expression under experimental conditions versus reference conditions requires determining whether observed differences are significant or not. There are many sources of noise and variability in microarray data, including experimental sources such as image scanning inconsistencies, issues involved in computer interpretation and quantification of spots, hybridization variables such as temperature and time discrepancies between experiments, and experimental errors caused by differential probe labeling and efficacy of RNA extraction. In addition, as the size of the sample increases, so does the probability of finding some large differences due to chance. Therefore, statistical analysis is required to show that gene expression differences are real.
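The effect of data set size on chance findings can be made concrete with a short calculation. This is a sketch under a simplifying assumption that is not stated in the module: every gene is tested independently at the same significance level and no gene truly differs between conditions. The function names are hypothetical.

```python
def expected_false_positives(num_genes, alpha):
    """Expected number of genes flagged by chance alone."""
    return num_genes * alpha


def prob_at_least_one_false_positive(num_genes, alpha):
    """Probability that at least one gene is flagged by chance,
    assuming the tests are independent."""
    return 1.0 - (1.0 - alpha) ** num_genes


alpha = 0.05  # significance level used elsewhere in this module
for num_genes in (1, 10, 100, 10_000):
    expected = expected_false_positives(num_genes, alpha)
    prob = prob_at_least_one_false_positive(num_genes, alpha)
    print(f"{num_genes:>6} genes: expected false positives = {expected:8.1f}, "
          f"P(at least one) = {prob:.4f}")
```

With 10,000 genes tested at the 0.05 level, roughly 500 would be expected to appear different by chance alone under these assumptions, which is why a per-gene test by itself is not sufficient evidence.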
There are some complex problems underlying statistical analysis of microarray data, primarily related to the fact that the number of genes measured is very large, but the number of times that each measurement is repeated is comparatively very small. (This is due mostly to cost and time issues.) Also, the simplest statistical techniques commonly assume a normal distribution, which cannot necessarily be assumed in microarray experiments. For a detailed discussion, D. K. Slonim (3) has authored a good review of approaches to gene expression data analysis.
This tutorial will provide an oversimplified example of the type of statistical analysis that needs to be applied to microarray data, using the t-test. For a given gene, A, the gene will have two associated vectors: {a(ref)1, ..., a(ref)n} and {a(exp)1, ..., a(exp)n}, where a(ref) contains n measurements of expression levels under reference conditions and a(exp) contains n measurements of expression levels under experimental conditions.
The mean of each vector will be equal to:
The standard deviation of each vector will be equal to:
The standard error of each vector will be equal to:
The formula for the t test is as follows:
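Assuming the conventional definitions, the four quantities named above can be written as follows for a vector a = {a_1, ..., a_n}. Two choices here are assumptions rather than something the module states: the standard deviation uses n - 1 in the denominator (the sample standard deviation), and the t statistic is the unpooled two-sample form built from the two standard errors, with each vector using its own n.

```latex
% Mean of a vector a = {a_1, ..., a_n}
\bar{a} = \frac{1}{n} \sum_{i=1}^{n} a_i

% Sample standard deviation (assumes the n - 1 denominator)
s = \sqrt{\frac{1}{n-1} \sum_{i=1}^{n} \left(a_i - \bar{a}\right)^2}

% Standard error of the mean
SE = \frac{s}{\sqrt{n}}

% Two-sample t statistic (assumes the unpooled form)
t = \frac{\bar{a}_{exp} - \bar{a}_{ref}}{\sqrt{SE_{exp}^{2} + SE_{ref}^{2}}}
```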
The t-test is used to test the difference between the means of two test sets, as in before and after studies or matched-pairs studies. There is a confidence interval for the mean and a critical value for t for the chosen level of significance associated with the t-test. For instance, a level of significance equal to 0.05 means that 95% of the cases will be within the confidence range if there is no significant difference between the means of the two test sets, or experiments, being compared. The confidence limits set upper and lower bounds on an estimate of the mean for the chosen level of significance (0.05). The confidence interval is the range within the bounds of the confidence limits. The confidence interval can be computed, if you know the shape of your distribution. For normally distributed data, the confidence limits at the 0.05 significance level for an estimated mean are the sample mean plus or minus 1.96 times the standard error.
confidence interval (normal distribution): mean +/- 1.96 * SE
For example, if the sample mean is 10 and the standard error is 1.2, then 95% of the cases will be within the range of 10 plus or minus 1.96 times 1.2, or approximately 10 plus or minus 2.4, which is the range from 7.6 to 12.4. Thus, if the experimental mean is outside the limits of this range computed for the reference mean, then the difference between the means of the two test sets is considered to be significant at the 0.05 level. The critical value for t at a given significance level for a specific type of distribution can be looked up in a table; most statistics books contain them. In the case of microarray data, if the absolute value of t is greater than the critical value, this indicates a significant difference in gene expression between the reference and experimental test sets. Because the t-test is a parametric test that assumes a normal distribution, the statistical tests commonly used to analyze microarray data are more complex variations designed for data that are not normally distributed.
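The calculation described above can be sketched in a few lines of Python. The gene and its measurements are hypothetical and unrelated to any real data set. The sketch assumes the sample standard deviation (n - 1 denominator) and an unpooled two-sample t statistic; both are assumptions made for illustration, and the function names are invented here.

```python
import math


def mean(values):
    """Arithmetic mean of a list of measurements."""
    return sum(values) / len(values)


def sample_std(values):
    """Sample standard deviation (n - 1 denominator, an assumption)."""
    m = mean(values)
    return math.sqrt(sum((v - m) ** 2 for v in values) / (len(values) - 1))


def standard_error(values):
    """Standard error of the mean."""
    return sample_std(values) / math.sqrt(len(values))


def t_statistic(ref, exp):
    """Unpooled two-sample t statistic for experimental vs. reference."""
    pooled_se = math.sqrt(standard_error(ref) ** 2 + standard_error(exp) ** 2)
    return (mean(exp) - mean(ref)) / pooled_se


def normal_ci95(values):
    """Confidence limits as mean +/- 1.96 * SE (normal approximation)."""
    m = mean(values)
    half_width = 1.96 * standard_error(values)
    return (m - half_width, m + half_width)


# Hypothetical red:green ratios for one gene (illustrative only).
gene_x_ref = [1.0, 1.2, 0.8, 1.1, 0.9]
gene_x_exp = [2.0, 2.2, 1.8, 2.1, 1.9]

t_value = t_statistic(gene_x_ref, gene_x_exp)
lower, upper = normal_ci95(gene_x_ref)
print(f"t = {t_value:.2f}")
print(f"95% interval for reference mean: {lower:.3f} to {upper:.3f}")
print("experimental mean outside interval:",
      not (lower <= mean(gene_x_exp) <= upper))
```

For these made-up values the t statistic is about 10 and the experimental mean of 2.0 lies well outside the interval around the reference mean of 1.0. The t value would still need to be compared against a critical value from a table for the appropriate degrees of freedom.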
All values are red:green ratios.

Gene | measurement 1 | measurement 2 | measurement 3 | measurement 4 | measurement 5 | measurement 6 | measurement 7 |
---|---|---|---|---|---|---|---|
A(ref) | 0.97 | 1.54 | 1.32 | 0.89 | 1.06 | 1.21 | |
A(exp) | 1.37 | 1.25 | 1.15 | 0.99 | 1.30 | 1.53 | 1.07 |
B(ref) | 1.67 | 1.78 | 2.01 | 1.89 | 1.75 | 1.81 | 1.69 |
B(exp) | 6.21 | 6.03 | 5.94 | 6.14 | 6.11 | | |
Assumptions for example problem:
What are the means for each row of data?
What are the standard deviations for each row of data?
What is the standard error for each row of data?
What is the value for t for the comparison between the reference and experimental test sets for Gene A?
What is the 95% confidence interval computed for the Gene A reference set?
Is there a significant difference between the mean values of the experimental versus the reference set for Gene A? (Explain the answer both in terms of the t value and the confidence interval.)
What is the value for t for the comparison between the reference and experimental test sets for Gene B?
What is the 95% confidence interval computed for the Gene B reference set?
Is there a significant difference between the mean values of the experimental versus the reference set for Gene B? (Explain the answer both in terms of the t value and the confidence interval.)
If there was a significant difference between the gene expression under experimental conditions versus the gene expression under reference conditions for either Gene A or Gene B, then estimate the significant increase or decrease observed.
There are many software packages available that have been designed expressly for microarray data analysis. In addition to testing gene expression under a set of experimental conditions versus reference conditions, it is possible to identify "clustered" genes that seem to have similar responses under similar conditions. Also, genes can be identified that show related responses under similar conditions, such as one gene's expression always increasing when another's decreases. When two or more genes show this kind of clustered behavior, it can be an indication that they are part of the same pathway, or that they are regulating each other. Using this type of microarray data analysis, the scientist can combine the cluster analysis results with what is known through laboratory experiments and often come up with new hypotheses about biochemical pathways and regulation.
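One simple way to quantify whether two genes respond together or in opposition is the Pearson correlation between their expression profiles across conditions. This is only a sketch of that single idea, not the method used by any particular package; real clustering software typically builds more elaborate procedures on top of a similarity measure like this one. The gene names and profiles below are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation between two equal-length expression profiles."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("profiles must have the same length (at least 2)")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if spread_x == 0 or spread_y == 0:
        raise ValueError("a constant profile has no defined correlation")
    return covariance / (spread_x * spread_y)


# Hypothetical log ratios for three genes across four conditions.
profiles = {
    "gene_p": [1.0, 2.0, 3.0, 4.0],
    "gene_q": [2.0, 4.0, 6.0, 8.0],   # rises whenever gene_p rises
    "gene_r": [4.0, 3.0, 2.0, 1.0],   # falls whenever gene_p rises
}

names = sorted(profiles)
for i, first in enumerate(names):
    for second in names[i + 1:]:
        r = pearson(profiles[first], profiles[second])
        print(f"{first} vs {second}: r = {r:+.2f}")
```

A correlation near +1 suggests two genes respond in the same direction across conditions, and a value near -1 suggests one increases as the other decreases. Either pattern is a hint worth following up experimentally, not proof of a shared pathway.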