In this activity you will work in groups of two to use bioinformatics methods to build
Part I: An rRNA-based tree of life
In this part and the next, you will use some standard bioinformatics software to build phylogenetic trees. In this part you will build a tree from the DNA sequences that code for the 18S ribosomal RNA. As we discussed in class, the ribosomal RNAs make a good sequence for building such trees since they are relatively conserved across all of life. This process will consist of two steps:
You will be working with 10 DNA sequences coding for the 18S rRNAs in the following organisms (the names have been deliberately chosen to be a bit obscure):
| 1. C. Elegans |
| 2. Drosophila |
| 3. Homo Sapiens |
| 4. Musculus |
| 5. Norvegius |
| 6. Cerevisiae |
| 7. Pombe |
| 8. Xenopus |
| 9. Zea Mays |
| 10. Arabidopsis |
Step 1: Go to the Activity 13 folder which will be in the My Documents folder on the computer desktop. Double click on the clustalx icon to start ClustalX. A window should open on your desktop.
Step 2: Input the 10 DNA sequences into ClustalX. For the first sequence, click on File::Load Sequences and then select the first DNA sequence. All 10 DNA sequences are in the rRNA sub-folder and have file names like Pombe rRNA. After loading the first sequence, you should see a sequence appear in the ClustalX window. For each of the next 9 sequences, click on File::Append Sequences. In the end, all 10 sequences should appear in the ClustalX window.
Step 3: Run the alignment by clicking Alignment::Do Complete Alignment. Before the alignment begins, the program will ask you for two output file names. You can just click Align since you won’t be using these output files. The alignment may take a few minutes to complete; progress will be displayed at the bottom of the window.
Step 4: Once the alignment is complete, use the scroll bar to look at the overall alignment. Since each DNA base is highlighted with a different color, it’s easy to see where the alignment is good. An asterisk (*) is printed at the top of the alignment wherever all sequences agree on the base at that location, and a hyphen (-) is put in wherever the program has inserted a gap to make the alignment work well. Approximately what fraction of the overall rRNA sequence appears to have a good alignment (i.e. at least 7 of the sequences agree on the bases)?
Step 5: Now save to disk the tree representation of this alignment by clicking Trees::Draw N-J Tree. When asked for a filename, be sure that the directory is rRNA and set the filename to rRNA.phy.
Step 6: Now you will plot the tree arising from this alignment. Without closing the ClustalX program (just in case you need to go back to it), open the Activity 13 folder and double click on drawtree, which will open a window. When you are asked for the filename of the input data, type rRNA/rRNA.phy. Now you will specify a few settings:
Step 7 [Optional]: If you’d like to see a rooted tree diagram of your alignment results, you can follow the exact same procedure as in Step 6, using the program drawgram.
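If you would like to check your visual estimate from Step 4, the counting can be sketched in a few lines of Python. This is only an illustration and not part of the ClustalX workflow: the function name `fraction_well_aligned` and the toy sequences are made up for this example, and it assumes you have the aligned sequences as equal-length strings with `-` marking gaps.

```python
from collections import Counter


def fraction_well_aligned(aligned, min_agree=7):
    """Fraction of alignment columns where at least `min_agree`
    sequences share the same base. Gaps ('-') never count as agreement."""
    if not aligned or not aligned[0]:
        raise ValueError("need at least one non-empty aligned sequence")
    length = len(aligned[0])
    if any(len(seq) != length for seq in aligned):
        raise ValueError("aligned sequences must all have the same length")

    good_columns = 0
    for column in zip(*aligned):  # walk the alignment column by column
        counts = Counter(base for base in column if base != "-")
        if counts and max(counts.values()) >= min_agree:
            good_columns += 1
    return good_columns / length


# Toy alignment of four short sequences (invented data, not real rRNA).
toy = [
    "ACGT-A",
    "ACGTTA",
    "ACCTTA",
    "A-GTTC",
]
print(fraction_well_aligned(toy, min_agree=4))  # only 2 of 6 columns agree fully
```

With the real 10-sequence alignment you would keep the default `min_agree=7`, matching the threshold suggested in Step 4.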
Part II: Phylogeny from a Protein Sequence
In this part you will follow almost the exact same procedure as in Part I, but will be using protein sequences, rather than rRNA sequences, to build your tree. (ClustalX is smart enough to automatically recognize that you are now working with protein sequences.) You will be aligning and tree-building from a set of sequences, taken from different species, of a protein chosen from the NIH’s HomoloGene database of homologous (related) genes among eukaryotes. This protein is Ca2+/calmodulin-dependent protein kinase (abbreviated CaMk in this write-up). Note that for some of the species, the protein function is inferred by comparison with similar protein sequences in other species where it has been biochemically characterized. You will be working with 11 CaMk sequences; the names are a lot less obscure than in Part I:
| 1. Boar |
| 2. C. Elegans |
| 3. Chicken |
| 4. Ferret |
| 5. Frog |
| 6. Human |
| 7. Mouse |
| 8. Rabbit |
| 9. Rat |
| 10. Sponge |
| 11. Zebrafish |
Step 1: Go to the Activity 13 folder which will be in the My Documents folder on the computer desktop. Double click on the clustalx icon to start ClustalX. A window should open on your desktop.
Step 2: Input the 11 protein sequences into ClustalX. For the first sequence, click on File::Load Sequences and then select the first protein sequence. All 11 protein sequences are in the CaMk sub-folder and have file names like Boar CaMk. After loading the first sequence, you should see a sequence appear in the ClustalX window. For each of the next 10 sequences, click on File::Append Sequences. In the end, all 11 sequences should appear in the ClustalX window.
Step 3: Run the alignment by clicking the Alignment::Do Complete Alignment button. Before the alignment begins, the program will ask you for two output file names. You can just click Align since you won’t be using these output files.
Step 4: Now save to disk the tree representation of this alignment by clicking Trees::Draw N-J Tree. When asked for a filename, be sure that the directory is CaMk and set the filename to CaMk.phy.
Step 5: Now you will plot the tree arising from this alignment. Without closing the ClustalX program (just in case you need to go back to it), open the Activity 13 folder and double click on drawtree, which will open a window. When you are asked for the filename of the input data, type CaMk/CaMk.phy. Now you will specify a few settings:
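As background for the neighbor-joining (N-J) tree you saved in Step 4: neighbor-joining is a distance-based method, so it starts from a table of pairwise distances between the aligned sequences. The sketch below shows one simple distance, the fraction of aligned positions that differ. It is an illustration only; the exact distance calculation and any corrections ClustalX applies are not described in this write-up, and the function names and sequence fragments here are invented.

```python
def p_distance(a, b):
    """Fraction of differing positions between two aligned sequences,
    skipping every column where either sequence has a gap ('-')."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have the same length")
    compared = 0
    differing = 0
    for x, y in zip(a, b):
        if x == "-" or y == "-":
            continue  # gapped columns are ignored in this simple version
        compared += 1
        if x != y:
            differing += 1
    if compared == 0:
        raise ValueError("no ungapped columns to compare")
    return differing / compared


def distance_matrix(aligned):
    """Pairwise distances for a dict mapping name -> aligned sequence."""
    names = list(aligned)
    return {(m, n): p_distance(aligned[m], aligned[n])
            for m in names for n in names}


# Invented six-residue fragments, NOT real CaMk sequences.
toy = {
    "Human": "MKTLLA",
    "Mouse": "MKTLIA",
    "Sponge": "MR-LVG",
}
matrix = distance_matrix(toy)
for pair in [("Human", "Mouse"), ("Human", "Sponge"), ("Mouse", "Sponge")]:
    print(pair, round(matrix[pair], 3))
```

In this toy table the two mammals are much closer to each other than either is to the sponge, which is the kind of pattern a distance-based tree builder turns into branch lengths.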
Part IV: Analysis and Questions
1. Using the internet (e.g. google.com searches), identify each of the organisms on the phylogenetic tree you printed out in Part I by their common name (e.g. Rabbit). Next, circle the sets of nodes on your tree that are similar organisms (e.g. plants, yeast, etc.). Looking at the organisms listed on your tree, does your tree seem reasonable? [Optional: compare your tree to the locations of the organisms on the Tree of Life website or the following protein-based tree: http://www.tarweed.com/pgr/PGR98-058.figure1.jpeg]
2. [Optional] If you chose to create a rooted tree (Step 7), label the tree produced just as you labeled the unrooted tree in Part IV, Step 1. Does your rooted tree look like a reasonable evolutionary tree? (If not, ask the instructor for a quick overview of the challenges to properly “rooting” an unrooted tree.)
3. Now look at the phylogenetic tree you created using the CaMk protein sequences. Does the tree seem to reasonably represent the actual evolutionary distances between all selected organisms? (Look both at the big picture, i.e. the distances between the obviously very different organisms, as well as at individual pairs of organisms.) How do your results compare with those from Part I?
4. What hypotheses would you propose to explain any unexpected features of the tree you produced in Part II? Some concepts to stimulate your thinking about this: a. The “higher” eukaryotes, especially mammals, often have many versions of a given protein, all somewhat specialized relative to the ancestral protein. b. Sometimes the optimal protein sequence to perform a particular task can emerge from two different protein precursors (so-called convergent evolution). c. Building the optimal phylogenetic tree is actually a very hard computational task (technically it is “NP-complete”), so all practical software tree-builders use approximate methods to estimate the optimal tree.
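To get a feel for concept c, it helps to count how many different unrooted, fully branching trees can be drawn for a given number of organisms. The standard count for n organisms is the product of the odd numbers 3 × 5 × ... × (2n − 5). The short Python sketch below (the function name is just for this example) evaluates it for the tree sizes used in this activity.

```python
def count_unrooted_trees(n_taxa):
    """Number of distinct unrooted binary trees with n_taxa labeled tips:
    the product of the odd numbers 3 * 5 * ... * (2n - 5)."""
    if n_taxa < 3:
        raise ValueError("need at least 3 taxa for an unrooted binary tree")
    count = 1
    for factor in range(3, 2 * n_taxa - 4, 2):  # odd factors up to 2n - 5
        count *= factor
    return count


for n in (4, 5, 10, 11):
    print(n, count_unrooted_trees(n))
# 4 3
# 5 15
# 10 2027025
# 11 34459425
```

Even for the 10 and 11 sequences used here there are millions of candidate trees, and the count grows explosively as more organisms are added, which is one reason tree-building software relies on approximate methods rather than checking every possible tree.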