Prioritize Organism Selection for the Genomic Encyclopedia Project to Optimize Phylogenetic Diversity Dongying Wu April 10, 2007.


1 Prioritize Organism Selection for the Genomic Encyclopedia Project to Optimize Phylogenetic Diversity Dongying Wu April 10, 2007

2 Special Thanks: Sourav Chatterji, Jason Raymond

3 The lack of phylogenetic diversity is evident in the current whole-genome databases
Certain phyla have been heavily sampled; others have only sparse representatives.
Many phyla have been ignored.
The gaps in the current genome data are obstacles to:
  Getting the full picture of the "tree of life"
  Understanding the full range of ecosystems and biological mechanisms
  Anchoring metagenomic sequencing data
(Figure labels: Proteobacteria, Firmicutes)

4 Solutions
Tree of Life
The Genomic Encyclopedia Project for Bacteria and Archaea
Greengenes ssu rRNAs: sequence entries
ATCC: strains in more than 750 genera
DSMZ: cultures representing 6900 species and 1400 genera (1207 bacteria and 77 archaea genera)

5 Prioritize Organism Selection to Optimize Phylogenetic Diversity
Phylogenetic diversity (PD): if T is a tree whose leaf labels comprise a set X of species, and whose edges have non-negative real-valued lengths, then for a subset Y of X, the PD score of Y is the sum of the lengths of the edges of the minimal subtree of T that connects Y
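The definition above can be sketched in a few lines of Python (not the Perl of the original script). This is a minimal illustration, assuming the tree is given as a list of `(node, node, length)` edges. The four-taxon topology is the one used on the later slides, but the edge lengths in `EDGES` are hypothetical values chosen only for the example.

```python
# Hypothetical edge lengths for the example tree A, B, C, D, Node1, Node2.
EDGES = [
    ("A", "Node1", 1.0),
    ("B", "Node1", 2.0),
    ("Node1", "Node2", 0.5),
    ("Node2", "C", 1.5),
    ("Node2", "D", 3.0),
]


def pd_score(edges, subset):
    """Sum of edge lengths of the minimal subtree that connects `subset`."""
    subset = set(subset)
    if len(subset) < 2:
        return 0.0

    # Undirected adjacency map: node -> {neighbor: edge length}.
    adj = {}
    for u, v, length in edges:
        adj.setdefault(u, {})[v] = length
        adj.setdefault(v, {})[u] = length

    # Root the tree at one member of the subset and record discovery order.
    root = next(iter(subset))
    parent = {root: None}
    order = [root]
    stack = [root]
    while stack:
        node = stack.pop()
        for neighbor in adj[node]:
            if neighbor not in parent:
                parent[neighbor] = node
                order.append(neighbor)
                stack.append(neighbor)

    # Visit children before parents. The edge above a node belongs to the
    # minimal subtree exactly when a subset member lies at or below the node.
    contains = {node: node in subset for node in order}
    total = 0.0
    for node in reversed(order):
        up = parent[node]
        if up is not None and contains[node]:
            total += adj[node][up]
            contains[up] = True
    return total


print(pd_score(EDGES, ["A", "C"]))       # 3.0
print(pd_score(EDGES, ["A", "B", "C"]))  # 5.0
```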

6 Input:
A tree (optional: a subtree)
A number (N)
Output: a list of N taxa that gives the maximum PD for the subtree

7 Algorithm: Greedy Algorithm
Reference: Vincent Moulton, Charles Semple and Mike Steel, Optimizing phylogenetic diversity under constraints, Journal of Theoretical Biology, doi: /j.jtbi ,2006

8 Take a tree and a sub-tree
Calculate the PD that each taxon would add to the subtree.
Grow the subtree to the taxon that adds the maximum PD.
Repeat the above steps N times; the resulting subtree is the one that gives the maximum PD under the imposed constraints.
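The loop described on this slide might look like the following Python sketch. It is an illustration, not the code of maxPD.pl. The function name `greedy_max_pd`, the edge-list input, and the edge lengths are all assumptions made for the example. The added PD of a taxon is taken to be the length of its path up to the first node already in the subtree.

```python
# Hypothetical edge lengths for the example tree A, B, C, D, Node1, Node2.
EDGES = [
    ("A", "Node1", 1.0),
    ("B", "Node1", 2.0),
    ("Node1", "Node2", 0.5),
    ("Node2", "C", 1.5),
    ("Node2", "D", 3.0),
]


def greedy_max_pd(edges, start, n):
    """Grow the subtree spanned by the taxa in `start` by up to `n` taxa.

    Returns a list of (taxon, added PD) pairs in the order they were picked.
    """
    adj = {}
    for u, v, length in edges:
        adj.setdefault(u, {})[v] = length
        adj.setdefault(v, {})[u] = length
    leaves = {node for node, nbrs in adj.items() if len(nbrs) == 1}

    # Root the tree at the first starting taxon and record every parent.
    root = start[0]
    parent = {root: None}
    stack = [root]
    while stack:
        node = stack.pop()
        for neighbor in adj[node]:
            if neighbor not in parent:
                parent[neighbor] = node
                stack.append(neighbor)

    def path_to_subtree(taxon, in_subtree):
        """Walk from `taxon` toward the root until the subtree is reached."""
        new_nodes, length = [], 0.0
        node = taxon
        while node not in in_subtree:
            new_nodes.append(node)
            length += adj[node][parent[node]]
            node = parent[node]
        return new_nodes, length

    # Build the starting subtree from the starting taxa.
    in_subtree = {root}
    for taxon in start[1:]:
        in_subtree.update(path_to_subtree(taxon, in_subtree)[0])

    picks = []
    candidates = leaves - set(start)
    for _ in range(min(n, len(candidates))):
        # sorted() makes tie-breaking deterministic (alphabetical).
        best = max(sorted(candidates),
                   key=lambda t: path_to_subtree(t, in_subtree)[1])
        new_nodes, gain = path_to_subtree(best, in_subtree)
        in_subtree.update(new_nodes)
        candidates.remove(best)
        picks.append((best, gain))
    return picks


print(greedy_max_pd(EDGES, ["A", "B"], 2))  # [('D', 3.5), ('C', 1.5)]
```

Recomputing every candidate's path on each round keeps the sketch short; on a tree with tens of thousands of taxa a real implementation would presumably cache or update these values instead.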

9 Gory Details
How is the tree structure stored in Perl?
As a two-dimensional matrix.
(Diagram: taxa A, B, C and D joined through the internal nodes Node1 and Node2.)
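As a rough illustration of the matrix idea, here is a Python sketch (the original is Perl, and its exact layout is not shown on the slide). It assumes each cell holds the length of the edge between two nodes, or `None` when the nodes are not directly connected. The edge lengths are hypothetical.

```python
NODES = ["A", "B", "C", "D", "Node1", "Node2"]

# Hypothetical edge lengths for the example tree.
EDGES = [
    ("A", "Node1", 1.0),
    ("B", "Node1", 2.0),
    ("Node1", "Node2", 0.5),
    ("Node2", "C", 1.5),
    ("Node2", "D", 3.0),
]


def build_matrix(nodes, edges):
    """Return (name -> row index, square matrix of edge lengths or None)."""
    index = {name: i for i, name in enumerate(nodes)}
    size = len(nodes)
    matrix = [[None] * size for _ in range(size)]
    for u, v, length in edges:
        # The tree is undirected, so the matrix is symmetric.
        matrix[index[u]][index[v]] = length
        matrix[index[v]][index[u]] = length
    return index, matrix


def neighbors(name, nodes, index, matrix):
    """Names of the nodes directly connected to `name`."""
    row = matrix[index[name]]
    return [nodes[j] for j, length in enumerate(row) if length is not None]


INDEX, MATRIX = build_matrix(NODES, EDGES)
print(neighbors("Node1", NODES, INDEX, MATRIX))  # ['A', 'B', 'Node2']
```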

10 Build Subtree: Based upon Indexed Paths
Choose any taxon from the original subtree as a reference taxon, then index all the paths that connect the reference taxon to the other taxa.
Example, with A as the reference taxon:
B: B, Node1, A
C: C, Node2, Node1, A
D: D, Node2, Node1, A
(Diagram: taxa A, B, C and D joined through the internal nodes Node1 and Node2.)
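The path index for the example tree can be reproduced with a short Python sketch. The function name `index_paths` and the edge-list input are assumptions made for illustration; only the topology comes from the slide.

```python
# Topology of the example tree; edge lengths are not needed for indexing.
EDGES = [
    ("A", "Node1"),
    ("B", "Node1"),
    ("Node1", "Node2"),
    ("Node2", "C"),
    ("Node2", "D"),
]


def index_paths(edges, reference):
    """Map every other taxon to its node path back to `reference`."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, []).append(v)
        adj.setdefault(v, []).append(u)

    # Traverse from the reference taxon, remembering each node's parent.
    parent = {reference: None}
    stack = [reference]
    while stack:
        node = stack.pop()
        for neighbor in adj[node]:
            if neighbor not in parent:
                parent[neighbor] = node
                stack.append(neighbor)

    # A taxon is a leaf (one neighbor); follow parents back to the reference.
    paths = {}
    for node, nbrs in adj.items():
        if len(nbrs) == 1 and node != reference:
            path = [node]
            while path[-1] != reference:
                path.append(parent[path[-1]])
            paths[node] = path
    return paths


for taxon, path in sorted(index_paths(EDGES, "A").items()):
    print(f"{taxon}: {', '.join(path)}")
```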

11 Build subtree: combine the paths
Indexed paths:
B: B, Node1, A
C: C, Node2, Node1, A
D: D, Node2, Node1, A
Subtree A-B-C is the combination of the paths:
B: B, Node1, A
C: C, Node2, Node1, A
Calculate and grow subtree: follow each path.
Calculate the added PD if the subtree grows to D by following its path: D, Node2, Node1, A
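A small Python sketch of this step, using the paths listed on the slide and hypothetical edge lengths (`LENGTHS` is not from the source). It combines the paths of B and C into subtree A-B-C, then follows D's path and sums the edges until the path reaches an edge the subtree already contains.

```python
# Paths to the reference taxon A, as listed on the slide.
PATHS = {
    "B": ["B", "Node1", "A"],
    "C": ["C", "Node2", "Node1", "A"],
    "D": ["D", "Node2", "Node1", "A"],
}

# Hypothetical edge lengths, keyed by the unordered pair of end nodes.
LENGTHS = {
    frozenset(("A", "Node1")): 1.0,
    frozenset(("B", "Node1")): 2.0,
    frozenset(("Node1", "Node2")): 0.5,
    frozenset(("Node2", "C")): 1.5,
    frozenset(("Node2", "D")): 3.0,
}


def path_edges(path):
    """Edges along a path, in order, as unordered node pairs."""
    return [frozenset(pair) for pair in zip(path, path[1:])]


def combine_paths(paths, taxa):
    """Edge set of the subtree spanned by the reference taxon and `taxa`."""
    edges = set()
    for taxon in taxa:
        edges.update(path_edges(paths[taxon]))
    return edges


def added_pd(paths, lengths, subtree_edges, taxon):
    """Length of the part of the taxon's path that is not yet in the subtree."""
    total = 0.0
    for edge in path_edges(paths[taxon]):
        if edge in subtree_edges:
            break  # the rest of the path already belongs to the subtree
        total += lengths[edge]
    return total


SUBTREE_ABC = combine_paths(PATHS, ["B", "C"])
print(added_pd(PATHS, LENGTHS, SUBTREE_ABC, "D"))  # 3.0
```

Stopping at the first shared edge is valid here because every path ends at the reference taxon, which is always part of the subtree.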

12 If no starting subtree is defined, the program will identify the longest path as the starting subtree
Step 1: Pick any taxon and identify the taxon farthest from it.
Step 2: Starting from the farthest taxon found in step 1, identify the longest path. It is the longest path in the whole tree.
(Diagram: two trees, each with taxa A, B, C and D.)
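The two steps can be sketched in Python as two traversals of the tree. This is an illustrative version under assumed inputs (an edge list with hypothetical lengths), not the original program.

```python
# Hypothetical edge lengths for the example tree A, B, C, D, Node1, Node2.
EDGES = [
    ("A", "Node1", 1.0),
    ("B", "Node1", 2.0),
    ("Node1", "Node2", 0.5),
    ("Node2", "C", 1.5),
    ("Node2", "D", 3.0),
]


def farthest(adj, start):
    """Return (farthest node, its distance, path from `start` to it)."""
    dist = {start: 0.0}
    prev = {start: None}
    stack = [start]
    while stack:
        node = stack.pop()
        for neighbor, length in adj[node].items():
            if neighbor not in dist:
                dist[neighbor] = dist[node] + length
                prev[neighbor] = node
                stack.append(neighbor)
    end = max(sorted(dist), key=dist.get)
    path = [end]
    while prev[path[-1]] is not None:
        path.append(prev[path[-1]])
    return end, dist[end], path[::-1]


def longest_path(edges):
    """Longest leaf-to-leaf path of a tree, found with two traversals."""
    adj = {}
    for u, v, length in edges:
        adj.setdefault(u, {})[v] = length
        adj.setdefault(v, {})[u] = length
    any_node = next(iter(adj))                   # step 1: pick any taxon...
    first_end, _, _ = farthest(adj, any_node)    # ...and find the farthest one
    _, length, path = farthest(adj, first_end)   # step 2: farthest from that
    return path, length


print(longest_path(EDGES))  # (['D', 'Node2', 'Node1', 'B'], 5.5)
```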

13 Run the program: On Bobcat:
/home/dwu/dwu_scripts/public_scripts/maxPD.pl -t input_tree -n number -o output -l input_list(optional)
-i: input tree
-n: the number of taxa that the user needs in the output list
-o: output
-l: input list; the user can define a list of taxa that must be included in the PD calculations (for example, species the user has to include)
-gml: yes or no; option to output a GML file

14 Output Format:
Taxon ID    PD Addition to the subtree
ID
ID
ID

15 Results Visualization
Free software to visualize network/tree structure: yEd

16 GML Input format:
graph [
  node [id 1 label "A" graphics [ w 50 h 50 type "circle" fill "#AA0000"]]
  node [id 2 label "B" graphics [ w 50 h 50 type "circle" fill "#666666"]]
  node [id 3 label "C" graphics [ w 50 h 50 type "circle" fill "#AA0000"]]
  node [id 4 label "D" graphics [ w 50 h 50 type "circle" fill "#666666"]]
  node [id 5 label "node1" graphics [ w 3 h 3 type "circle" fill "#666666"]]
  node [id 6 label "node2" graphics [ w 3 h 3 type "circle" fill "#666666"]]
  edge [source 1 target 5 graphics [ fill "#AA0000" width 4 ]]
  edge [source 2 target 5 graphics [ fill "#666666" width 4 ]]
  edge [source 5 target 6 graphics [ fill "#AA0000" width 4 ]]
  edge [source 6 target 3 graphics [ fill "#AA0000" width 4 ]]
  edge [source 6 target 4 graphics [ fill "#666666" width 4 ]]
]
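A file in this format could be generated with a short Python sketch like the one below. The function name `to_gml` and its parameters are hypothetical. The colors and sizes are copied from the example above, where the selected taxa (A and C) and the edges connecting them are drawn in red.

```python
RED, GREY = "#AA0000", "#666666"


def to_gml(taxa, internal, edges, selected_taxa, selected_edges):
    """Render a tree as GML text, coloring the selected subtree red."""
    # Taxa get ids first, then internal nodes, starting from 1.
    ids = {name: i for i, name in enumerate(list(taxa) + list(internal), start=1)}
    lines = ["graph ["]
    for name in taxa:
        fill = RED if name in selected_taxa else GREY
        lines.append(f'node [id {ids[name]} label "{name}" graphics '
                     f'[ w 50 h 50 type "circle" fill "{fill}"]]')
    for name in internal:
        # Internal nodes are drawn as small grey dots.
        lines.append(f'node [id {ids[name]} label "{name}" graphics '
                     f'[ w 3 h 3 type "circle" fill "{GREY}"]]')
    for u, v in edges:
        fill = RED if frozenset((u, v)) in selected_edges else GREY
        lines.append(f'edge [source {ids[u]} target {ids[v]} graphics '
                     f'[ fill "{fill}" width 4 ]]')
    lines.append("]")
    return "\n".join(lines)


GML = to_gml(
    taxa=["A", "B", "C", "D"],
    internal=["node1", "node2"],
    edges=[("A", "node1"), ("B", "node1"), ("node1", "node2"),
           ("node2", "C"), ("node2", "D")],
    selected_taxa={"A", "C"},
    selected_edges={frozenset(("A", "node1")),
                    frozenset(("node1", "node2")),
                    frozenset(("node2", "C"))},
)
print(GML)
```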

17 Select 300 out of 30000 based upon an SSU rRNA neighbor-joining tree

18 Y: added PD; X: added taxon (30000 picks / 30000 taxa)

19 Y: PD of subtree; X: added taxon (30000 picks / 30000 taxa)

