Supertrees: Algorithms and Databases Roderic Page University of Glasgow DIMACS Working Group Meeting on Mathematical and Computational Aspects Related to the Study of The Tree of Life

What do we mean by the “Tree of Life”? Supertrees, datatypes, databases, taxonomy; or tree algorithms, models, genomics, lateral gene transfer. Our perception of what the tree is may affect what we view as being the “interesting” problems.

Topics: Supertrees (MinCut). Phylogenetic databases.

Tree terminology. [Figure: rooted tree on leaves a, b, c, d with clusters {a,b}, {a,b,c}, and {a,b,c,d}; labeled parts: root, leaf, internal node, cluster, edge]

Nestings and triplets. Nestings: {a,b} <_T {a,b,c,d}; {b,c} <_T {a,b,c,d}. Triplets: (bc)d, also written bc|d. [Figure: rooted tree T on leaves a, b, c, d]
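The triplet notation can be made concrete with a short sketch. It assumes a tree is written as nested Python tuples with string leaf labels (a representation chosen here for illustration, not taken from the slides) and lists every rooted triplet xy|z that the tree displays. The function names `leaves` and `triplets` are hypothetical.

```python
from itertools import combinations

def leaves(tree):
    """Return the set of leaf labels of a nested-tuple tree."""
    if isinstance(tree, str):
        return {tree}
    return set().union(*(leaves(child) for child in tree))

def triplets(tree):
    """Return the rooted triplets xy|z displayed by the tree.

    Each triplet is stored as (frozenset({x, y}), z): x and y sit together
    in one child subtree of some node, and z sits in a different child subtree.
    """
    if isinstance(tree, str):
        return set()
    found = set()
    child_leaves = [leaves(child) for child in tree]
    for i, inside in enumerate(child_leaves):
        outside = set().union(*(s for j, s in enumerate(child_leaves) if j != i))
        for x, y in combinations(sorted(inside), 2):
            for z in outside:
                found.add((frozenset((x, y)), z))
    for child in tree:
        found |= triplets(child)
    return found

# Assumed example tree (((a,b),c),d), matching the clusters {a,b}, {a,b,c}, {a,b,c,d}.
T = ((("a", "b"), "c"), "d")
for pair, z in sorted(triplets(T), key=lambda t: (sorted(t[0]), t[1])):
    print("".join(sorted(pair)) + "|" + z)
```

For this assumed tree the listing includes bc|d, the triplet named on the slide.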

Supertree. [Figure: T1 (leaves a, b, c) + T2 (leaves b, c, d) = supertree (leaves a, b, c, d)]

Some desirable properties of a supertree method (Steel et al., 2000): The supertree can be computed in polynomial time. A grouping in one or more trees that is not contradicted by any other tree occurs in the supertree.

Aho et al.’s algorithm (OneTree). Aho, A. V., Sagiv, Y., Szymanski, T. G., and Ullman, J. D. Inferring a tree from lowest common ancestors with an application to the optimization of relational expressions. SIAM J. Comput. 10. Input: set of rooted trees. 1. If the set is compatible (i.e., the trees agree on a single tree), output that tree. 2. If the set is not compatible, stop!
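The two steps above can be sketched as a small recursive function. This is a minimal illustration in the spirit of Aho et al.'s algorithm, not the talk's implementation: it assumes the input trees have already been reduced to rooted triplets, each written as a tuple (a, b, c) meaning ab|c, and the name `build` and the nested-tuple output are choices made here.

```python
def build(leaf_set, triples):
    """Return a tree (nested tuples) displaying all triples, or None if none exists.

    triples: iterable of (a, b, c), each meaning the rooted triplet ab|c.
    """
    leaf_set = set(leaf_set)
    if len(leaf_set) == 1:
        return next(iter(leaf_set))
    # Only triplets whose three leaves are all still present matter here.
    relevant = [t for t in triples if set(t) <= leaf_set]

    # Union-find: join a and b for every relevant triplet ab|c.
    parent = {x: x for x in leaf_set}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b, _ in relevant:
        parent[find(a)] = find(b)

    components = {}
    for x in leaf_set:
        components.setdefault(find(x), set()).add(x)
    if len(components) == 1:
        return None  # graph is connected: the input is not compatible, stop
    subtrees = []
    for comp in sorted(components.values(), key=min):
        sub = build(comp, relevant)
        if sub is None:
            return None
        subtrees.append(sub)
    return tuple(subtrees)

# Assumed inputs: T1 = ((a,b),c) gives ab|c, T2 = ((b,c),d) gives bc|d.
print(build("abcd", [("a", "b", "c"), ("b", "c", "d")]))
# Conflicting triplets ab|c and bc|a have no common tree.
print(build("abc", [("a", "b", "c"), ("b", "c", "a")]))
```

The first call returns a tree on a, b, c, d; the second returns None, which corresponds to the "stop!" case.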

Aho et al.’s OneTree algorithm. [Figure: input trees T1 (leaves a, b, c) and T2 (leaves b, c, d), the graphs built on the leaf sets {a, b, c, d} and {a, b, c}, and the resulting supertree]

Mincut supertrees. Semple, C., and Steel, M. 2000. A supertree method for rooted trees. Discrete Appl. Math. 105. Modifies OneTree by cutting the graph when the input trees are incompatible, instead of stopping. Requires rooted trees (no analogue of OneTree for unrooted trees). Recursive. Polynomial time.

[Figure: input trees T1 (leaves a, b, c, d, e) and T2 (leaves a, b, c, d), and the graph S_{T1,T2} on vertices a, b, c, d, e. Semple and Steel (2000)]

Collapsing the graph (Semple and Steel mincut algorithm). [Figure: S_{T1,T2} on a, b, c, d, e is collapsed to S_{T1,T2}/E^max_{T1,T2} on {a,b}, c, d, e; the edge joining a and b has maximum weight]

Cut the graph to get supertree. [Figure: cutting S_{T1,T2}/E^max_{T1,T2} yields the supertree on a, b, c, d, e]
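The graph construction and the search for maximum-weight edges can be sketched as follows. This covers only the first part of the Semple and Steel procedure (building S_T and finding the edges to collapse); the minimum-cut step is not implemented here. The two input trees are hypothetical stand-ins, since the slide's tree shapes did not survive extraction, and the function names are illustrative.

```python
from collections import Counter
from itertools import combinations

def leaves(tree):
    """Return the set of leaf labels of a nested-tuple tree."""
    if isinstance(tree, str):
        return {tree}
    return set().union(*(leaves(child) for child in tree))

def proper_clusters(tree, is_root=True):
    """Return the leaf sets of all internal nodes except the root."""
    if isinstance(tree, str):
        return []
    found = [] if is_root else [frozenset(leaves(tree))]
    for child in tree:
        found += proper_clusters(child, is_root=False)
    return found

def cluster_graph(trees):
    """Edge {a, b} is weighted by the number of trees in which
    a and b lie together in some proper cluster."""
    weights = Counter()
    for tree in trees:
        linked = set()
        for cluster in proper_clusters(tree):
            linked.update(frozenset(p) for p in combinations(sorted(cluster), 2))
        weights.update(linked)  # each tree contributes at most 1 per pair
    return weights

# Hypothetical input trees (the slide's actual shapes are not recoverable).
T1 = ((("a", "b"), "c"), ("d", "e"))
T2 = (("a", "b"), ("c", "d"))
weights = cluster_graph([T1, T2])
# Maximum-weight edges: pairs grouped together in every input tree.
max_edges = [pair for pair, w in weights.items() if w == 2]
print({"".join(sorted(p)): w for p, w in weights.items()})
print(["".join(sorted(p)) for p in max_edges])
```

With these assumed trees only the edge between a and b reaches the maximum weight, so a and b would be merged into a single vertex before any cut is made.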

My mincut supertree implementation: darwin.zoology.gla.ac.uk/~rpage/supertree. Written in C++. Uses GTL (Graph Template Library) to handle graphs (formerly a free alternative to LEDA). Finds all mincuts of a graph faster than Semple and Steel’s algorithm.

A counterexample: two input trees... [Figure: two input trees, with leaf labels a, b, c, x1, x2, x3 and c, b, a, y1, y2, y3, y4]

Mincut gives this (strange) result. Disputed relationships among a, b, and c are resolved. x1, x2, and x3 are collapsed into a polytomy. [Figure: mincut supertree on c, x1, x2, x3, b, a, y1, y2, y3, y4]

Problem: cuts depend on connectivity (in this example it is a function of tree size). [Figure: graph S_{T1,T2} on a, b, c, x1, x2, x3, y1, y2, y3, y4]

So, mincut doesn’t work. But Semple and Steel said it did. My program seems to work. Argh!!! What is happening...?

What mincut does... and does not do. Mincut supertree is guaranteed to include any nesting which occurs in all input trees. Makes no claims about nestings which occur in only some of the trees. “Does exactly what it says on the tin™”

Modifying mincut supertree. Can we incorporate more of the information in the input trees? Three categories of information: Unanimous (all trees have that grouping). Contradicted (trees explicitly disagree). Uncontradicted (some trees have information that no other tree disagrees with).

Uncontradicted information. Assume we have k input trees. For each pair of taxa a and b, let c be the number of trees in which a and b co-occur and n the number of trees in which a and b are nested. c - n = 0 → uncontradicted (if c = k then unanimous). c - n > 0 → contradicted.

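Uncontradicted information. Assume we have k input trees. For each pair of taxa a and b, let c be the number of trees in which a and b co-occur, n the number of trees in which a and b are nested, and f the number of trees in which a and b are in a fan. c - n - f = 0 → uncontradicted (if c = k then unanimous). c - n - f > 0 → contradicted.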
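The rule above is simple arithmetic on three counts, which the following sketch spells out. It takes c, n, and f as already computed; how to count fans in a given tree is not specified on the slide and is not attempted here. The function name and the returned labels are illustrative.

```python
def classify_pair(c, n, f, k):
    """Classify the pair (a, b) from its counts over k input trees.

    c: trees in which a and b co-occur
    n: trees in which a and b are nested
    f: trees in which a and b are in a fan
    """
    if not (1 <= c <= k) or n < 0 or f < 0 or n + f > c:
        raise ValueError("need 1 <= c <= k and 0 <= n + f <= c")
    if c - n - f > 0:
        return "contradicted"
    # c - n - f == 0: no tree containing both taxa disagrees.
    return "unanimous" if c == k else "uncontradicted"

print(classify_pair(c=2, n=2, f=0, k=2))  # unanimous
print(classify_pair(c=1, n=1, f=0, k=2))  # uncontradicted
print(classify_pair(c=2, n=1, f=0, k=2))  # contradicted
```

Setting f = 0 throughout reproduces the simpler c - n rule from the previous slide.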

Classifying edges. [Figure: graph S_{T1,T2} on a, b, c, x1, x2, x3, y1, y2, y3, y4, with each edge marked as uncontradicted, uncontradicted but adjacent to contradicted, or contradicted]

Modified mincut. Species a, b, and c form a polytomy. x1, x2, and x3 are resolved as per the input tree. [Figure: modified mincut supertree on a, b, c, x1, x2, x3, y1, y2, y3, y4]

If no tree contradicts an item of information, is that information always in the supertree? [Figure: four input trees displaying (12)5, (45)1, (23)5, and (34)1]

No! (Steel, Dress, & Böcker 2000.) The four trees display (12)5, (23)5, (34)1, and (45)1. No tree displays (IK)J or (JK)I for any (IJ)K above. Triplets are uncontradicted, but cannot form a tree.
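Both claims can be checked mechanically for these four triplets. The sketch below first confirms that no triplet in the set is directly contradicted by another, then builds the graph used by Aho et al.'s algorithm (join I and J for each (IJ)K) and tests whether it is connected; a connected graph on more than one leaf means no tree displays all the triplets. The function names are illustrative.

```python
TRIPLETS = [("1", "2", "5"), ("2", "3", "5"), ("3", "4", "1"), ("4", "5", "1")]

def directly_contradicted(triplets):
    """True if the set contains (IK)J or (JK)I for some (IJ)K in the set."""
    present = {(frozenset((i, j)), k) for i, j, k in triplets}
    for i, j, k in triplets:
        if (frozenset((i, k)), j) in present or (frozenset((j, k)), i) in present:
            return True
    return False

def aho_graph_connected(triplets):
    """True if joining I and J for every (IJ)K yields one connected component."""
    labels = {x for t in triplets for x in t}
    neighbours = {x: set() for x in labels}
    for i, j, _ in triplets:
        neighbours[i].add(j)
        neighbours[j].add(i)
    start = min(labels)
    seen, stack = {start}, [start]
    while stack:
        for y in neighbours[stack.pop()]:
            if y not in seen:
                seen.add(y)
                stack.append(y)
    return seen == labels

print(directly_contradicted(TRIPLETS))  # False: each triplet is uncontradicted
print(aho_graph_connected(TRIPLETS))    # True: so no single tree displays all four
```

The graph here is the path 1-2-3-4-5, which cannot be split into components, so the recursion has no valid first division of the leaf set.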

Future directions for supertrees: Improve handling of uncontradicted information. Add support for constraints. Visualising very big trees. Better integration into phylogeny databases. darwin.zoology.gla.ac.uk/~rpage/supertree

Supertree Challenge (proposed by Mike Sanderson): The TreeBASE database currently contains over 1000 phylogenies with over 11,000 taxa among them. Many of these trees share taxa with each other and are therefore candidates for the construction of composite phylogenies, or "supertrees", by various algorithms. A challenging problem is the construction of the largest and "best" supertree possible from this database. "Largest" and "best" may represent conflicting goals, however, because resolution of a supertree can be easily diminished by addition of "inappropriate" trees or taxa.

It’s a scandal. We cannot answer even the most basic question: “what is the phylogeny for group x?” GenBank is currently the best phylogenetic database (!). Can't even say how many species are in a given group. Little idea of who is doing what.

Tree of Life (tolweb.org). Provides text and images. Relies on extensive manual effort (e.g., writing text). Can’t do any computations with it. Limited research value.

TreeBASE. Relational database. Query by author, taxon, study number. Compute supertrees. Submit NEXUS data files.

TreeBASE

TreeBASE and mincut supertrees. User selects two or more trees, then clicks on a button, and a script on darwin.zoology.gla.ac.uk is run to create the supertree. Can view as PS, PDF, treefile, or in a Java applet (ATV).

What’s wrong with TreeBASE? No consistency of taxon names (e.g., Human, Homo sapiens, Homo sapiens X). No consistency of data names (e.g., gene names, morphological characters, etc.).

The same organism may have multiple names

Starting December 1, the ALL Species Foundation will close its San Francisco office because of a lack of funding for the Foundation. Press Release: November 13, 2002 “The ALL Species Foundation is a non-profit organization dedicated to the complete inventory of all species of life on Earth within the next 25 years - a human generation.”

The first challenge. We need a taxonomic name server that can resolve the name of any organism. This server needs to reconcile multiple classifications (e.g., GenBank, ITIS, etc.). Must handle at least 1 million names, perhaps 100 million.

Second challenge. How do we query trees? Trees can be classifications or phylogenies.

SQL queries on trees. Oracle SQL. Transitive closure query (recursion). Nested queries. Node path queries.

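1. All ancestors of node A. [Figure: tree with node A marked]

2. Least Common Ancestor (LCA) of A and B. [Figure: tree with nodes A and B marked]

3. Spanning clade of A and B. [Figure: tree with nodes A and B marked]

4. Path length from A to B. [Figure: tree with nodes A and B marked; path length 5]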

Node paths. [Figure: tree with each node labeled by its path: /1, /1/1, /1/1/1, /1/1/1/1, /1/1/1/2, /1/1/2, /1/2, /1/2/1, /1/2/2, /2]

Node paths - selecting subtree. [Figure: tree with node paths /1, /1/1, /1/1/1, /1/1/1/1, /1/1/1/2, /1/1/2, /1/2, /1/2/1, /1/2/2, /2] SELECT node WHERE (path LIKE '/1/1/%') AND (path < '/1/10/%');

Node paths - selecting the leaves of a subtree. [Figure: tree with node paths /1, /1/1, /1/1/1, /1/1/1/1, /1/1/1/2, /1/1/2, /1/2, /1/2/1, /1/2/2, /2] SELECT node WHERE (path LIKE '/1/1/%') AND (path < '/1/10/%') AND (num_children = 0);
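The two subtree queries can be tried out against an in-memory SQLite database. This is a sketch under assumptions: the table name `nodes` and its layout are hypothetical (the slides omit the FROM clause), SQLite stands in for whatever database the talk used, and the query returns `path` rather than `node` so the output is easy to read. The node paths are the ones shown on the slides.

```python
import sqlite3

PATHS = ["/1", "/1/1", "/1/1/1", "/1/1/1/1", "/1/1/1/2", "/1/1/2",
         "/1/2", "/1/2/1", "/1/2/2", "/2"]

def parent(path):
    """Path of the parent node, e.g. '/1/1/2' -> '/1/1'."""
    return path.rsplit("/", 1)[0]

conn = sqlite3.connect(":memory:")
# Hypothetical schema: one row per node, with its path and child count.
conn.execute(
    "CREATE TABLE nodes (node INTEGER PRIMARY KEY, path TEXT, num_children INTEGER)"
)
for path in PATHS:
    n_children = sum(1 for other in PATHS if parent(other) == path)
    conn.execute(
        "INSERT INTO nodes (path, num_children) VALUES (?, ?)", (path, n_children)
    )

SUBTREE = "SELECT path FROM nodes WHERE path LIKE '/1/1/%' AND path < '/1/10/%'"

# All descendants of node /1/1.
subtree = [row[0] for row in conn.execute(SUBTREE + " ORDER BY path")]
# Only the leaves below /1/1.
subtree_leaves = [
    row[0] for row in conn.execute(SUBTREE + " AND num_children = 0 ORDER BY path")
]
print(subtree)
print(subtree_leaves)
```

In this small example the `path < '/1/10/%'` bound does not change the result, because every string matching `'/1/1/%'` already sorts before `'/1/10/%'`; it reads as an upper bound that lets an index on `path` be scanned as a range.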

Node paths - LCA. The LCA of two nodes is given by the common substring of their paths, starting from the left. [Figure: tree with node paths /1, /1/1, /1/1/1, /1/1/1/1, /1/1/1/2, /1/1/2, /1/2, /1/2/1, /1/2/2, /2]
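Three of the numbered tree queries (ancestors, LCA, path length) reduce to string operations on node paths, as the sketch below shows. It compares whole path components rather than raw characters, so that /1/1 and /1/10 are not mistaken for sharing the prefix /1/1. It also assumes the root has the path "/", which the slides do not state. The function names are illustrative.

```python
def components(path):
    """Split '/1/1/2' into ['1', '1', '2']; the assumed root '/' gives []."""
    return [part for part in path.split("/") if part]

def ancestors(path):
    """Paths of all proper ancestors, from the root down."""
    parts = components(path)
    return ["/" + "/".join(parts[:i]) for i in range(len(parts))]

def lca(a, b):
    """Least common ancestor: the longest shared run of leading components."""
    shared = []
    for x, y in zip(components(a), components(b)):
        if x != y:
            break
        shared.append(x)
    return "/" + "/".join(shared)

def path_length(a, b):
    """Number of edges between a and b, going through their LCA."""
    depth = lambda p: len(components(p))
    return depth(a) + depth(b) - 2 * depth(lca(a, b))

print(ancestors("/1/1/2"))
print(lca("/1/1/1/2", "/1/1/2"))
print(path_length("/1/1/1/2", "/1/2/1"))
```

When one path is an ancestor of the other, `lca` returns the ancestor itself, and the path length is just the difference in depth.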

What do we do now...? Set up a taxonomic name server (TNS). Develop a phylogenetic database linked to TNS, PubMed, GenBank, etc. Develop easy ways to populate the database (e.g., from TreeBASE, GenBank, journal databases). Develop a standard set of tree queries. Deploy.