Prioritize Organism Selection for the Genomic Encyclopedia Project to Optimize Phylogenetic Diversity Dongying Wu April 10, 2007.


Special thanks: Sourav Chatterji and Jason Raymond

The lack of phylogenetic diversity is evident in the current whole-genome databases:
- Certain phyla have been heavily sampled, while others have only sparse representatives
- Many phyla have been ignored

The gaps in the current genome data are obstacles to:
- Getting the full picture of the "tree of life"
- Understanding the full range of ecosystems and biological mechanisms
- Anchoring metagenomic sequencing data

(Figure labels: Proteobacteria, Firmicutes)

Solutions

Tree of Life: the Genomic Encyclopedia Project for Bacteria and Archaea
- Greengenes ssu rRNAs: 134423 sequence entries
- ATCC: 18000 strains in more than 750 genera
- DSMZ: 13000 cultures representing 6900 species and 1400 genera (1207 bacteria and 77 archaea genera)

Prioritize Organism Selection to Optimize Phylogenetic Diversity

Phylogenetic diversity (PD): if T is a tree whose leaf labels comprise a set X of species, and whose edges have non-negative real-valued lengths, then for a subset Y of X, the PD score of Y is the sum of the lengths of the edges of the minimal subtree of T that connects Y.
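The definition can be illustrated with a minimal Python sketch (not the author's Perl code). It uses the four-taxon example tree that appears later in the talk; the branch lengths are hypothetical, since the slides show only the topology, and the names EDGES, build_adjacency and pd_score are illustrative.

```python
from collections import defaultdict

# Topology from the talk's example tree; the branch lengths are assumed.
EDGES = [
    ("A", "Node1", 1.0),
    ("B", "Node1", 2.0),
    ("Node1", "Node2", 0.5),
    ("C", "Node2", 1.5),
    ("D", "Node2", 3.0),
]


def build_adjacency(edges):
    """Store the undirected tree as {node: {neighbor: branch_length}}."""
    adj = defaultdict(dict)
    for u, v, length in edges:
        adj[u][v] = length
        adj[v][u] = length
    return adj


def pd_score(edges, taxa):
    """Sum the edge lengths of the minimal subtree connecting `taxa`."""
    taxa = list(taxa)
    if len(taxa) < 2:
        return 0.0
    adj = build_adjacency(edges)
    # Root the tree at the first taxon and record each node's parent.
    root = taxa[0]
    parent = {root: None}
    stack = [root]
    while stack:
        node = stack.pop()
        for neighbor in adj[node]:
            if neighbor not in parent:
                parent[neighbor] = node
                stack.append(neighbor)
    # The minimal subtree is the union of the paths from each taxon to the root.
    used = set()
    for taxon in taxa[1:]:
        node = taxon
        while parent[node] is not None:
            used.add((node, parent[node]))
            node = parent[node]
    return sum(adj[child][par] for child, par in used)


print(pd_score(EDGES, ["A", "C"]))  # path A-Node1-Node2-C: 1.0 + 0.5 + 1.5 = 3.0
```

With these assumed lengths, adding B to {A, C} raises the score by 2.0, the length of the one new edge B-Node1.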

Input:
- A tree (optional: a subtree)
- A number (N)

Output:
- A list of N taxa that gives the maximum PD for the subtree

Algorithm: Greedy Algorithm

Reference: Vincent Moulton, Charles Semple and Mike Steel, Optimizing phylogenetic diversity under constraints, Journal of Theoretical Biology, doi:10.1016/j.jtbi.2006.12.021, 2006

1. Take a tree and a subtree.
2. Calculate the PD that each taxon would add to the subtree.
3. Grow the subtree to the taxon that adds the maximum PD.
4. Repeat steps 2 and 3 N times. The resulting subtree is the one that gives the maximum PD under the imposed constraints.
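The greedy loop above can be sketched in Python as follows. This is an illustration under assumptions, not the maxPD.pl implementation: the branch lengths are hypothetical, the function names are invented for the example, and ties are broken alphabetically.

```python
from collections import defaultdict

# Example topology from the talk; the branch lengths are assumed.
EDGES = [
    ("A", "Node1", 1.0),
    ("B", "Node1", 2.0),
    ("Node1", "Node2", 0.5),
    ("C", "Node2", 1.5),
    ("D", "Node2", 3.0),
]


def build_adjacency(edges):
    adj = defaultdict(dict)
    for u, v, length in edges:
        adj[u][v] = length
        adj[v][u] = length
    return adj


def distances_to_subtree(adj, subtree):
    """Distance from every node to the (connected) subtree, plus the
    next node on the way back toward the subtree."""
    dist = {node: 0.0 for node in subtree}
    back = {node: None for node in subtree}
    stack = list(subtree)
    while stack:
        node = stack.pop()
        for neighbor, length in adj[node].items():
            if neighbor not in dist:
                dist[neighbor] = dist[node] + length
                back[neighbor] = node
                stack.append(neighbor)
    return dist, back


def grow(subtree, back, taxon):
    """Add the path from `taxon` to the subtree."""
    node = taxon
    while node is not None and node not in subtree:
        subtree.add(node)
        node = back[node]


def greedy_max_pd(edges, start_taxa, n_picks):
    """Return [(taxon, added_pd), ...] for up to n_picks greedy additions."""
    adj = build_adjacency(edges)
    leaves = {node for node in adj if len(adj[node]) == 1}
    # Build the starting subtree by connecting the starting taxa.
    subtree = {start_taxa[0]}
    for taxon in start_taxa[1:]:
        _, back = distances_to_subtree(adj, subtree)
        grow(subtree, back, taxon)
    picks = []
    for _ in range(n_picks):
        candidates = sorted(leaves - subtree)
        if not candidates:
            break
        dist, back = distances_to_subtree(adj, subtree)
        # The PD a taxon adds equals its distance to the current subtree.
        best = max(candidates, key=lambda taxon: dist[taxon])
        picks.append((best, dist[best]))
        grow(subtree, back, best)
    return picks


print(greedy_max_pd(EDGES, ["A", "C"], 2))  # [('D', 3.0), ('B', 2.0)]
```

Recomputing all distances on every pick keeps the sketch short; for a tree with tens of thousands of taxa a real implementation would presumably update them incrementally.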

Gory Details

How is the tree structure stored in Perl? As a two-dimensional matrix.

(Example tree: taxa A and B attach to Node1, taxa C and D attach to Node2, and Node1 connects to Node2.)
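The slide does not show the matrix layout, so the following Python sketch is only one plausible reading: a symmetric node-by-node matrix whose entries hold branch lengths, with None where two nodes are not directly connected. The branch lengths and helper names are assumptions.

```python
# Node order fixes the row and column indices of the matrix.
NODES = ["A", "B", "C", "D", "Node1", "Node2"]

# Example topology from the talk; the branch lengths are assumed.
EDGES = [
    ("A", "Node1", 1.0),
    ("B", "Node1", 2.0),
    ("Node1", "Node2", 0.5),
    ("C", "Node2", 1.5),
    ("D", "Node2", 3.0),
]


def build_matrix(nodes, edges):
    """Return (index, matrix); matrix[i][j] is a branch length or None."""
    index = {name: i for i, name in enumerate(nodes)}
    size = len(nodes)
    matrix = [[None] * size for _ in range(size)]
    for u, v, length in edges:
        i, j = index[u], index[v]
        matrix[i][j] = length
        matrix[j][i] = length  # the tree is undirected
    return index, matrix


def neighbors(nodes, index, matrix, name):
    """List the nodes directly connected to `name`."""
    row = matrix[index[name]]
    return [nodes[j] for j, length in enumerate(row) if length is not None]


index, matrix = build_matrix(NODES, EDGES)
print(neighbors(NODES, index, matrix, "Node1"))  # ['A', 'B', 'Node2']
```

In Perl the same idea is often written as a hash of hashes, which avoids storing the empty cells of a sparse tree.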

Build Subtree: Based upon Indexed Paths

Choose any taxon from the original subtree as a reference taxon, then index all the paths that connect the reference taxon to the other taxa.

A is the reference taxon:
  B: B, Node1, A
  C: C, Node2, Node1, A
  D: D, Node2, Node1, A

Indexed paths:
  B: B, Node1, A
  C: C, Node2, Node1, A
  D: D, Node2, Node1, A

Build subtree: combine the paths.
Subtree A-B-C:
  B: B, Node1, A
  C: C, Node2, Node1, A

Calculate and grow subtree: follow each path.
Calculate the added PD if the subtree grows to D:
  D: D, Node2, Node1, A
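The path index and the added-PD calculation can be sketched in Python like this. The branch lengths are hypothetical and the function names are illustrative; only the paths themselves come from the slides.

```python
from collections import defaultdict

# Example topology from the talk; the branch lengths are assumed.
EDGES = [
    ("A", "Node1", 1.0),
    ("B", "Node1", 2.0),
    ("Node1", "Node2", 0.5),
    ("C", "Node2", 1.5),
    ("D", "Node2", 3.0),
]


def index_paths(edges, reference):
    """Return (adj, paths), where paths[taxon] lists the nodes from the
    taxon back to the reference taxon."""
    adj = defaultdict(dict)
    for u, v, length in edges:
        adj[u][v] = length
        adj[v][u] = length
    parent = {reference: None}
    stack = [reference]
    while stack:
        node = stack.pop()
        for neighbor in adj[node]:
            if neighbor not in parent:
                parent[neighbor] = node
                stack.append(neighbor)
    paths = {}
    for node in adj:
        if len(adj[node]) == 1 and node != reference:  # leaves are taxa
            path = [node]
            while parent[path[-1]] is not None:
                path.append(parent[path[-1]])
            paths[node] = path
    return adj, paths


def combine_paths(paths, taxa):
    """Build the subtree (as a node set) by merging the indexed paths."""
    subtree = set()
    for taxon in taxa:
        subtree.update(paths[taxon])
    return subtree


def added_pd(adj, paths, subtree, taxon):
    """Follow the taxon's path until it reaches the subtree, summing lengths."""
    total = 0.0
    path = paths[taxon]
    for child, par in zip(path, path[1:]):
        if child in subtree:
            break
        total += adj[child][par]
    return total


adj, paths = index_paths(EDGES, "A")
subtree = combine_paths(paths, ["B", "C"])  # subtree A-B-C
print(paths["D"])                           # ['D', 'Node2', 'Node1', 'A']
print(added_pd(adj, paths, subtree, "D"))   # 3.0
```

Only the edge D-Node2 is counted for D, because Node2 already belongs to the subtree A-B-C.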

If no starting subtree is defined, the program identifies the longest path and uses it as the starting subtree.

Step 1: Pick any taxon and identify the taxon farthest from it.
Step 2: Starting from the farthest taxon found in step 1, identify the longest path. It is the longest path for the whole tree.
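The two-step search can be sketched in Python as below. It relies on the standard property that, in a tree with non-negative branch lengths, the node farthest from any starting point is an endpoint of a longest path. Branch lengths and function names are assumptions made for the example.

```python
from collections import defaultdict

# Example topology from the talk; the branch lengths are assumed.
EDGES = [
    ("A", "Node1", 1.0),
    ("B", "Node1", 2.0),
    ("Node1", "Node2", 0.5),
    ("C", "Node2", 1.5),
    ("D", "Node2", 3.0),
]


def build_adjacency(edges):
    adj = defaultdict(dict)
    for u, v, length in edges:
        adj[u][v] = length
        adj[v][u] = length
    return adj


def farthest_from(adj, start):
    """Return (farthest_node, distance, parent_map) for a traversal from start."""
    dist = {start: 0.0}
    parent = {start: None}
    stack = [start]
    while stack:
        node = stack.pop()
        for neighbor, length in adj[node].items():
            if neighbor not in dist:
                dist[neighbor] = dist[node] + length
                parent[neighbor] = node
                stack.append(neighbor)
    # Sorting first makes tie-breaking deterministic.
    far = max(sorted(dist), key=dist.get)
    return far, dist[far], parent


def longest_path(edges):
    adj = build_adjacency(edges)
    any_taxon = sorted(adj)[0]                 # step 1: pick any taxon
    end1, _, _ = farthest_from(adj, any_taxon)
    end2, length, parent = farthest_from(adj, end1)  # step 2
    path = [end2]
    while parent[path[-1]] is not None:
        path.append(parent[path[-1]])
    return path, length


print(longest_path(EDGES))  # (['B', 'Node1', 'Node2', 'D'], 5.5)
```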

Run the program:

On Bobcat: /home/dwu/dwu_scripts/public_scripts/maxPD.pl -t input_tree -n number -o output -l input_list(optional)

  -i: input tree
  -n: the number of taxa that the user needs in the output list
  -o: output
  -l: input list; the user can define a list of taxa that must be included in the PD calculations (for example, species the user has to include)
  -gml: yes or no; whether to output a GML file

Output Format:

Taxon ID    PD addition to the subtree
ID00032     2.3960
ID99033     0.6701
ID23890     0.5024

Results Visualization Free software to visualize network/tree structure: yEd http://www.yworks.com/en/products_yed_about.htm

GML input format:

graph [
  node [id 1 label "A" graphics [ w 50 h 50 type "circle" fill "#AA0000"]]
  node [id 2 label "B" graphics [ w 50 h 50 type "circle" fill "#666666"]]
  node [id 3 label "C" graphics [ w 50 h 50 type "circle" fill "#AA0000"]]
  node [id 4 label "D" graphics [ w 50 h 50 type "circle" fill "#666666"]]
  node [id 5 label "node1" graphics [ w 3 h 3 type "circle" fill "#666666"]]
  node [id 6 label "node2" graphics [ w 3 h 3 type "circle" fill "#666666"]]
  edge [source 1 target 5 graphics [ fill "#AA0000" width 4 ]]
  edge [source 2 target 5 graphics [ fill "#666666" width 4 ]]
  edge [source 5 target 6 graphics [ fill "#AA0000" width 4 ]]
  edge [source 6 target 3 graphics [ fill "#AA0000" width 4 ]]
  edge [source 6 target 4 graphics [ fill "#666666" width 4 ]]
]
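A file in this format could be generated with a short Python sketch such as the one below. It is not the script's actual output routine: the coloring rule (red for selected taxa and for edges inside the selected subtree, gray otherwise, small circles for internal nodes) is inferred from the example above, and all names are illustrative.

```python
NODES = ["A", "B", "C", "D", "node1", "node2"]
EDGES = [("A", "node1"), ("B", "node1"), ("node1", "node2"),
         ("node2", "C"), ("node2", "D")]
TAXA = {"A", "B", "C", "D"}
SUBTREE = {"A", "C", "node1", "node2"}  # nodes on the selected subtree

RED, GRAY = "#AA0000", "#666666"


def to_gml(nodes, edges, taxa, subtree):
    """Render the tree as GML text, highlighting the selected subtree."""
    ids = {name: i for i, name in enumerate(nodes, start=1)}
    lines = ["graph ["]
    for name in nodes:
        is_taxon = name in taxa
        size = 50 if is_taxon else 3  # internal nodes are drawn small
        fill = RED if is_taxon and name in subtree else GRAY
        lines.append(
            f'node [id {ids[name]} label "{name}" graphics '
            f'[ w {size} h {size} type "circle" fill "{fill}"]]'
        )
    for u, v in edges:
        fill = RED if u in subtree and v in subtree else GRAY
        lines.append(
            f'edge [source {ids[u]} target {ids[v]} graphics '
            f'[ fill "{fill}" width 4 ]]'
        )
    lines.append("]")
    return "\n".join(lines)


print(to_gml(NODES, EDGES, TAXA, SUBTREE))
```

The resulting text can be saved with a .gml extension and opened in yEd.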

Select 300 out of 30000 taxa based upon an ssu rRNA neighbor-joining tree

(Plot: Y axis, added PD; X axis, added taxon; 30000 picks out of 30000 taxa)

(Plot: Y axis, PD of subtree; X axis, added taxon; 30000 picks out of 30000 taxa)