Prioritize Organism Selection for the Genomic Encyclopedia Project to Optimize Phylogenetic Diversity
Dongying Wu
April 10, 2007
Special Thanks: Sourav Chatterji, Jason Raymond
The lack of phylogenetic diversity is evident in the current whole-genome databases
  Certain phyla have been heavily sampled
  Others have only sparse representatives
  Many phyla have been ignored
The gaps in the current genome data are obstacles to:
  Getting the full picture of the "tree of life"
  Understanding the full range of ecosystems and biological mechanisms
  Anchoring metagenomic sequencing data
[Figure labels: Proteobacteria, Firmicutes]
Solutions
Tree of Life
The Genomic Encyclopedia Project for Bacteria and Archaea
  Greengenes SSU rRNAs: 134423 sequence entries
  ATCC: 18000 strains in more than 750 genera
  DSMZ: 13000 cultures representing 6900 species and 1400 genera (1207 bacteria and 77 archaea genera)
Prioritize Organism Selection to Optimize Phylogenetic Diversity
Phylogenetic diversity (PD): if T is a tree whose leaf labels comprise a set X of species and whose edges have non-negative real-valued lengths, then for a subset Y of X, the PD score of Y is the sum of the lengths of the edges of the minimal subtree of T that connects Y.
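To make the PD definition concrete, here is a minimal Python sketch that computes the PD score of a subset Y on a four-taxon tree (taxa A, B, C, D with internal nodes Node1 and Node2). The branch lengths and the names `EDGES`, `build_adjacency`, `path_edges` and `pd_score` are assumptions made for illustration; this is not the code of the program described in these slides.

```python
from collections import defaultdict

# Hypothetical branch lengths for a four-taxon example tree.
EDGES = {
    ("A", "Node1"): 1.0,
    ("B", "Node1"): 2.0,
    ("Node1", "Node2"): 0.5,
    ("C", "Node2"): 1.5,
    ("D", "Node2"): 3.0,
}


def build_adjacency(edges):
    """Turn an edge-length table into a symmetric adjacency map."""
    adj = defaultdict(dict)
    for (u, v), length in edges.items():
        adj[u][v] = length
        adj[v][u] = length
    return adj


def path_edges(adj, start, goal):
    """Return the set of edges on the unique tree path from start to goal."""
    stack = [(start, None, [])]
    while stack:
        node, parent, edges = stack.pop()
        if node == goal:
            return set(edges)
        for nxt in adj[node]:
            if nxt != parent:
                stack.append((nxt, node, edges + [frozenset((node, nxt))]))
    raise ValueError(f"no path from {start} to {goal}")


def pd_score(edges, taxa):
    """Sum the edge lengths of the minimal subtree that connects the taxa."""
    adj = build_adjacency(edges)
    taxa = list(taxa)
    used = set()
    # The minimal subtree is the union of the paths from one taxon to the rest.
    for taxon in taxa[1:]:
        used |= path_edges(adj, taxa[0], taxon)
    return sum(adj[u][v] for u, v in map(tuple, used))
```

With these assumed lengths, `pd_score(EDGES, ["A", "C"])` sums the edges A-Node1, Node1-Node2 and Node2-C, and adding B contributes only the edge B-Node1.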
Input:
  A tree (optional: a subtree)
  A number (N)
Output:
  A list of N taxa that gives the maximum PD for the subtree
Algorithm: Greedy Algorithm
Reference: Vincent Moulton, Charles Semple and Mike Steel, Optimizing phylogenetic diversity under constraints, Journal of Theoretical Biology, doi:10.1016/j.jtbi.2006.12.021, 2006
Take a tree and a subtree
Calculate the PD that each taxon would add to the subtree
Grow the subtree to the taxon that adds the maximum PD
Repeat the above steps N times; the resulting subtree is the one that gives the maximum PD under the imposed constraints
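The greedy steps above can be sketched in Python as follows. This is an illustrative sketch under stated assumptions, not the Perl implementation: the branch lengths and the function names (`distances_to_subtree`, `grow`, `greedy_max_pd`) are hypothetical, and distances are recomputed from scratch on every pick for clarity rather than speed.

```python
from collections import defaultdict

# Hypothetical branch lengths for a four-taxon example tree.
EDGES = {
    ("A", "Node1"): 1.0,
    ("B", "Node1"): 2.0,
    ("Node1", "Node2"): 0.5,
    ("C", "Node2"): 1.5,
    ("D", "Node2"): 3.0,
}


def build_adjacency(edges):
    adj = defaultdict(dict)
    for (u, v), length in edges.items():
        adj[u][v] = length
        adj[v][u] = length
    return adj


def distances_to_subtree(adj, subtree):
    """Distance from every node to the (connected) subtree, plus the
    parent pointer that leads each outside node back toward it."""
    dist = {node: 0.0 for node in subtree}
    parent = {}
    stack = list(subtree)
    while stack:
        node = stack.pop()
        for nxt, length in adj[node].items():
            if nxt not in dist:
                dist[nxt] = dist[node] + length
                parent[nxt] = node
                stack.append(nxt)
    return dist, parent


def grow(adj, subtree, taxon):
    """Attach taxon to the subtree in place and return the PD it adds."""
    dist, parent = distances_to_subtree(adj, subtree)
    gain = dist[taxon]
    node = taxon
    while node not in subtree:
        subtree.add(node)
        node = parent[node]
    return gain


def greedy_max_pd(edges, start_taxa, n_picks):
    """Repeatedly add the taxon that contributes the most PD."""
    adj = build_adjacency(edges)
    leaves = [node for node in adj if len(adj[node]) == 1]
    subtree = {start_taxa[0]}
    for taxon in start_taxa[1:]:  # build the starting subtree
        grow(adj, subtree, taxon)
    picks = []
    for _ in range(n_picks):
        dist, _parent = distances_to_subtree(adj, subtree)
        candidates = [t for t in leaves if t not in subtree]
        if not candidates:
            break
        best = max(candidates, key=lambda t: dist[t])
        picks.append((best, grow(adj, subtree, best)))
    return picks
```

Starting from the subtree that connects A and B, `greedy_max_pd(EDGES, ["A", "B"], 2)` picks D first (it adds the edges Node1-Node2 and Node2-D) and then C, which by then adds only the edge Node2-C.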
Gory Details
How is the tree structure stored in Perl? As a two-dimensional matrix.
[Figure: example tree with taxa A, B, C, D and internal nodes Node1, Node2]
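The slide does not show the matrix layout, so the following Python sketch is only one plausible reading: a square node-by-node matrix whose entry holds the branch length when two nodes are directly connected and is empty otherwise. The node order, the branch lengths and the helper names `make_matrix` and `neighbors` are assumptions for illustration.

```python
# Node order used for the matrix rows and columns (an assumption).
NODES = ["A", "B", "C", "D", "Node1", "Node2"]
INDEX = {name: i for i, name in enumerate(NODES)}

# Hypothetical branch lengths for the example tree.
EDGES = {
    ("A", "Node1"): 1.0,
    ("B", "Node1"): 2.0,
    ("Node1", "Node2"): 0.5,
    ("C", "Node2"): 1.5,
    ("D", "Node2"): 3.0,
}


def make_matrix(edges):
    """Build a symmetric matrix: matrix[i][j] is the branch length
    between nodes i and j, or None when they are not adjacent."""
    size = len(NODES)
    matrix = [[None] * size for _ in range(size)]
    for (u, v), length in edges.items():
        matrix[INDEX[u]][INDEX[v]] = length
        matrix[INDEX[v]][INDEX[u]] = length
    return matrix


def neighbors(matrix, name):
    """List the nodes directly connected to the named node."""
    row = matrix[INDEX[name]]
    return [NODES[j] for j, length in enumerate(row) if length is not None]


MATRIX = make_matrix(EDGES)
```

A matrix like this makes neighbor lookup a single row scan, at the cost of memory that grows with the square of the node count.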
Build Subtree: Based upon Indexed Paths
Choose any taxon from the original subtree as the reference taxon, then index all the paths that connect the reference taxon to the other taxa.
A is the reference taxon:
  B: B, Node1, A
  C: C, Node2, Node1, A
  D: D, Node2, Node1, A
[Figure: example tree with taxa A, B, C, D and internal nodes Node1, Node2]
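Indexed paths (A is the reference taxon):
  B: B, Node1, A
  C: C, Node2, Node1, A
  D: D, Node2, Node1, A
Build subtree: combine the paths
  Subtree A-B-C:
    B: B, Node1, A
    C: C, Node2, Node1, A
Calculate and grow subtree: follow each path
  To calculate the added PD if the subtree grows to D, follow D's path (D, Node2, Node1, A) until it reaches a node that is already in the subtree; here only the edge D-Node2 is new.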
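The path bookkeeping above can be sketched in Python as follows. The branch lengths and the names `index_paths`, `subtree_nodes` and `added_pd` are hypothetical; the sketch assumes, as in the slides, that the reference taxon belongs to the subtree.

```python
# Hypothetical branch lengths, stored as a symmetric adjacency map.
ADJ = {
    "A": {"Node1": 1.0},
    "B": {"Node1": 2.0},
    "C": {"Node2": 1.5},
    "D": {"Node2": 3.0},
    "Node1": {"A": 1.0, "B": 2.0, "Node2": 0.5},
    "Node2": {"Node1": 0.5, "C": 1.5, "D": 3.0},
}


def index_paths(adj, reference, taxa):
    """For each taxon, list the nodes on its path to the reference taxon."""
    parent = {reference: None}
    stack = [reference]
    while stack:  # record, for every node, its neighbor toward the reference
        node = stack.pop()
        for nxt in adj[node]:
            if nxt not in parent:
                parent[nxt] = node
                stack.append(nxt)
    paths = {}
    for taxon in taxa:
        path = [taxon]
        while path[-1] != reference:
            path.append(parent[path[-1]])
        paths[taxon] = path
    return paths


def subtree_nodes(paths, members):
    """Combine the indexed paths of the member taxa into one node set."""
    nodes = set()
    for taxon in members:
        nodes.update(paths[taxon])
    return nodes


def added_pd(adj, paths, nodes, taxon):
    """Walk the taxon's path and sum edges until the subtree is reached."""
    gain = 0.0
    for here, nxt in zip(paths[taxon], paths[taxon][1:]):
        if here in nodes:
            break
        gain += adj[here][nxt]
    return gain


PATHS = index_paths(ADJ, "A", ["B", "C", "D"])
SUBTREE_ABC = subtree_nodes(PATHS, ["B", "C"])
```

Here `PATHS["D"]` is D, Node2, Node1, A, and `added_pd(ADJ, PATHS, SUBTREE_ABC, "D")` stops at Node2 because the subtree A-B-C already contains it.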
If no starting subtree is defined, the program identifies the longest path in the tree and uses it as the starting subtree
  Step 1: pick any taxon and identify the taxon farthest from it
  Step 2: starting from the farthest taxon found in step 1, identify the taxon farthest from it. The path between these two taxa is the longest path in the whole tree.
[Figure: two copies of the example tree with taxa A, B, C, D, one per step]
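The two-step search for the longest path can be sketched in Python as follows. The branch lengths and the names `farthest_from` and `longest_path` are assumptions for illustration, and the sketch takes step 2 to start from the far end found in step 1.

```python
# Hypothetical branch lengths, stored as a symmetric adjacency map.
ADJ = {
    "A": {"Node1": 1.0},
    "B": {"Node1": 2.0},
    "C": {"Node2": 1.5},
    "D": {"Node2": 3.0},
    "Node1": {"A": 1.0, "B": 2.0, "Node2": 0.5},
    "Node2": {"Node1": 0.5, "C": 1.5, "D": 3.0},
}


def farthest_from(adj, start):
    """Return the node farthest from start, its distance, and the
    parent pointers that lead from every node back to start."""
    dist = {start: 0.0}
    parent = {start: None}
    stack = [start]
    while stack:
        node = stack.pop()
        for nxt, length in adj[node].items():
            if nxt not in dist:
                dist[nxt] = dist[node] + length
                parent[nxt] = node
                stack.append(nxt)
    far = max(dist, key=dist.get)
    return far, dist[far], parent


def longest_path(adj, any_taxon):
    """Two sweeps: find one end of the longest path, then the other."""
    first_end, _, _ = farthest_from(adj, any_taxon)           # step 1
    second_end, length, parent = farthest_from(adj, first_end)  # step 2
    path = [second_end]
    while parent[path[-1]] is not None:
        path.append(parent[path[-1]])
    return path, length


PATH, LENGTH = longest_path(ADJ, "A")
```

With these assumed lengths, step 1 starting at A reaches D, and step 2 starting at D reaches B, so the starting subtree would be the path between B and D.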
Run the program
On Bobcat:
  /home/dwu/dwu_scripts/public_scripts/maxPD.pl -t input_tree -n number -o output -l input_list(optional)
Options:
  -i: input tree
  -n: the number of taxa that the user needs in the output list
  -o: output
  -l: input list; the user can define a list of taxa that must be included in the PD calculations (for example, species the user has to include)
  -gml: yes or no; option to output a GML file
Output Format:
  Taxon ID    PD Addition to the Subtree
  ID00032     2.3960
  ID99033     0.6701
  ID23890     0.5024
Results Visualization
Free software to visualize network/tree structures: yEd
http://www.yworks.com/en/products_yed_about.htm
GML Input Format:
graph [
  node [id 1 label "A" graphics [ w 50 h 50 type "circle" fill "#AA0000"]]
  node [id 2 label "B" graphics [ w 50 h 50 type "circle" fill "#666666"]]
  node [id 3 label "C" graphics [ w 50 h 50 type "circle" fill "#AA0000"]]
  node [id 4 label "D" graphics [ w 50 h 50 type "circle" fill "#666666"]]
  node [id 5 label "node1" graphics [ w 3 h 3 type "circle" fill "#666666"]]
  node [id 6 label "node2" graphics [ w 3 h 3 type "circle" fill "#666666"]]
  edge [source 1 target 5 graphics [ fill "#AA0000" width 4 ]]
  edge [source 2 target 5 graphics [ fill "#666666" width 4 ]]
  edge [source 5 target 6 graphics [ fill "#AA0000" width 4 ]]
  edge [source 6 target 3 graphics [ fill "#AA0000" width 4 ]]
  edge [source 6 target 4 graphics [ fill "#666666" width 4 ]]
]
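As an illustration of how a file in this format could be produced, here is a small Python sketch that writes the same kind of GML text, coloring the selected taxa and the edges of the selected subtree. The function name `to_gml` and its parameters are hypothetical; the colors, node sizes and edge width are copied from the example above, and nothing here is taken from the -gml option of the actual program.

```python
SELECTED = "#AA0000"    # fill used for selected taxa and subtree edges
UNSELECTED = "#666666"  # fill used for everything else


def to_gml(nodes, edges, selected_taxa, selected_edges):
    """Render a tree as GML text in the layout shown above.

    nodes: node labels in output order (ids are assigned from 1)
    edges: (source, target) label pairs
    selected_taxa: labels of the taxa to highlight
    selected_edges: set of frozenset({u, v}) pairs to highlight
    """
    ids = {name: i for i, name in enumerate(nodes, start=1)}
    degree = {name: 0 for name in nodes}
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    lines = ["graph ["]
    for name in nodes:
        size = 50 if degree[name] == 1 else 3  # leaves large, internal nodes small
        fill = SELECTED if name in selected_taxa else UNSELECTED
        lines.append(
            f'node [id {ids[name]} label "{name}" '
            f'graphics [ w {size} h {size} type "circle" fill "{fill}"]]'
        )
    for u, v in edges:
        fill = SELECTED if frozenset((u, v)) in selected_edges else UNSELECTED
        lines.append(
            f'edge [source {ids[u]} target {ids[v]} '
            f'graphics [ fill "{fill}" width 4 ]]'
        )
    lines.append("]")
    return "\n".join(lines)


NODES = ["A", "B", "C", "D", "node1", "node2"]
EDGES = [("A", "node1"), ("B", "node1"), ("node1", "node2"),
         ("node2", "C"), ("node2", "D")]
SUBTREE_EDGES = {frozenset(e) for e in
                 [("A", "node1"), ("node1", "node2"), ("node2", "C")]}
GML_TEXT = to_gml(NODES, EDGES, {"A", "C"}, SUBTREE_EDGES)
```

Writing `GML_TEXT` to a file with a `.gml` extension should give something yEd can open, with the subtree connecting A and C drawn in red.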
Select 300 out of 30000 taxa based upon an SSU rRNA neighbor-joining tree
[Plot: Y axis: added PD; X axis: added taxon (30000 picks / 30000 taxa)]
[Plot: Y axis: PD of subtree; X axis: added taxon (30000 picks / 30000 taxa)]