1
Supertrees: Algorithms and Databases
Roderic Page
University of Glasgow
r.page@bio.gla.ac.uk
DIMACS Working Group Meeting on Mathematical and Computational Aspects Related to the Study of The Tree of Life
2
What do we mean by the “Tree of Life”?
Supertrees, datatypes, databases, taxonomy
or
Tree algorithms, models, genomics, lateral gene transfer
Our perception of what the tree is may affect what we view as being the “interesting” problems
3
Topics
Supertrees (MinCut)
Phylogenetic databases
4
Tree terminology
[Figure: rooted tree with leaves a, b, c, d and clusters {a,b}, {a,b,c}, {a,b,c,d}; labelled parts: root, leaf, internal node, cluster, edge]
5
Nestings and triplets
[Figure: rooted tree with leaves a, b, c, d]
Nestings: {a,b} <_T {a,b,c,d}; {b,c} <_T {a,b,c,d}
Triplets: (bc)d, also written bc|d
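As a rough illustration of the triplet notation, the sketch below lists every rooted triplet xy|z displayed by a tree. The nested-tuple tree encoding and the function names `leaves` and `triplets` are assumptions made for this example, not part of the talk.

```python
from itertools import combinations

def leaves(tree):
    """Return the set of leaf labels below a node (a leaf is a plain string)."""
    if isinstance(tree, str):
        return {tree}
    result = set()
    for child in tree:
        result |= leaves(child)
    return result

def triplets(tree):
    """Return all triplets xy|z displayed by the tree as (frozenset({x, y}), z)."""
    found = set()
    if isinstance(tree, str):
        return found
    child_leaves = [leaves(child) for child in tree]
    for i, child in enumerate(tree):
        # Leaves that hang from the other children of this node.
        outside = set()
        for j, other in enumerate(child_leaves):
            if j != i:
                outside |= other
        # x and y share a child of this node, z does not: xy|z.
        for x, y in combinations(sorted(child_leaves[i]), 2):
            for z in outside:
                found.add((frozenset((x, y)), z))
        found |= triplets(child)
    return found

# Hypothetical encoding of a fully resolved tree on a, b, c, d.
example_tree = ((("a", "b"), "c"), "d")
example_triplets = triplets(example_tree)
```

For this tree the result contains bc|d, the triplet shown on the slide, alongside ab|c, ab|d, and ac|d.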
6
Supertree
[Figure: T1 (leaves a, b, c) + T2 (leaves b, c, d) = supertree (leaves a, b, c, d)]
7
Some desirable properties of a supertree method (Steel et al., 2000)
The supertree can be computed in polynomial time
A grouping in one or more trees that is not contradicted by any other tree occurs in the supertree
8
Aho et al.’s algorithm (OneTree)
Aho, A. V., Sagiv, Y., Szymanski, T. G., and Ullman, J. D. 1981. Inferring a tree from lowest common ancestors with an application to the optimization of relational expressions. SIAM J. Comput. 10: 405-421.
Input: set of rooted trees
1. If set is compatible (i.e., will agree on a tree), output that tree.
2. If set is not compatible, stop!
9
Aho et al.’s OneTree algorithm
[Figure: OneTree applied to T1 (leaves a, b, c) and T2 (leaves b, c, d): graphs on the leaf sets {a, b, c, d} and {a, b, c}, and the resulting supertree]
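A minimal sketch of the recursive idea behind OneTree, working on rooted triplets rather than whole trees: connect x and y for every triplet xy|z, recurse on the connected components, and give up if the graph is connected. The function name `build`, the tuple encoding, and the example triplets ab|c and bc|d (one plausible reading of T1 and T2 on the slide) are assumptions for illustration.

```python
def build(leaf_set, triplets):
    """Return a nested-tuple tree displaying all triplets, or None if none exists.

    Each triplet (x, y, z) stands for xy|z.
    """
    leaf_set = set(leaf_set)
    if len(leaf_set) == 1:
        return next(iter(leaf_set))
    # Keep only triplets whose three leaves all lie in the current leaf set.
    relevant = [t for t in triplets if set(t) <= leaf_set]
    # Union-find: x and y are joined for every relevant triplet xy|z.
    parent = {leaf: leaf for leaf in leaf_set}

    def find(v):
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v

    for x, y, _ in relevant:
        parent[find(x)] = find(y)
    components = {}
    for leaf in leaf_set:
        components.setdefault(find(leaf), set()).add(leaf)
    if len(components) == 1:
        return None  # connected graph: the triplets are incompatible
    subtrees = []
    for comp in sorted(components.values(), key=sorted):
        sub = build(comp, relevant)
        if sub is None:
            return None
        subtrees.append(sub)
    return tuple(subtrees)

# Assumed triplets: ab|c from T1 and bc|d from T2.
supertree = build({"a", "b", "c", "d"}, [("a", "b", "c"), ("b", "c", "d")])
conflict = build({"a", "b", "c"}, [("a", "b", "c"), ("b", "c", "a")])
```

Here `supertree` is `((("a", "b"), "c"), "d")`, while the conflicting pair ab|c and bc|a yields `None`, matching step 2 of the algorithm ("stop!").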
10
Mincut supertrees
Semple, C., and Steel, M. 2000. A supertree method for rooted trees. Discrete Appl. Math. 105: 147-158.
Modifies OneTree by cutting graph
Requires rooted trees (no analogue of OneTree for unrooted trees)
Recursive
Polynomial time
11
Semple and Steel (2000)
[Figure: input trees T1 (leaves a, b, c, d, e) and T2 (leaves a, b, c, d), and the graph S_{T1,T2} on vertices a, b, c, d, e]
12
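Collapsing the graph (Semple and Steel mincut algorithm)
[Figure: graph S_{T1,T2} with edge weights; the edge joining a and b has weight 2 and all other edges have weight 1. Annotation: "This edge has maximum weight". Collapsing it gives S_{T1,T2}/E^max_{T1,T2} with merged vertex {a,b}]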
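The first two steps of this construction can be sketched as follows: join two leaves when they sit together below the same child of the root in some input tree, weight the edge by the number of trees where that happens, and pick out the edges whose weight equals the number of trees. The two input trees below are hypothetical (the slide's trees are not recoverable from the text), the helper names are assumptions, and the minimum-cut step itself is not shown.

```python
from collections import Counter
from itertools import combinations

def leaves(tree):
    """Return the set of leaf labels below a node (a leaf is a plain string)."""
    if isinstance(tree, str):
        return {tree}
    result = set()
    for child in tree:
        result |= leaves(child)
    return result

def proximity_graph(trees):
    """Map each leaf pair (a, b), a < b, to the number of trees nesting it.

    A pair is counted for a tree when both leaves lie below the same child
    of that tree's root, i.e. in a proper cluster of the tree.
    """
    weights = Counter()
    for tree in trees:
        for child in tree:
            for a, b in combinations(sorted(leaves(child)), 2):
                weights[(a, b)] += 1
    return weights

def max_weight_edges(weights, k):
    """Edges present in all k trees; these are the ones that get collapsed."""
    return {edge for edge, w in weights.items() if w == k}

# Hypothetical input trees, chosen only for illustration.
t1 = ((("a", "b"), "c"), ("d", "e"))
t2 = (("a", "b"), ("c", "d"))
weights = proximity_graph([t1, t2])
collapsed = max_weight_edges(weights, k=2)
```

With these two trees only the edge between a and b reaches weight 2, so a and b would be merged into a single vertex before the graph is cut.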
13
Cut the graph to get supertree
[Figure: minimum cut of S_{T1,T2}/E^max_{T1,T2} (vertices {a,b}, c, d, e; edge weights 1) and the resulting supertree on leaves a, b, c, d, e]
14
My mincut supertree implementation
darwin.zoology.gla.ac.uk/~rpage/supertree
Written in C++
Uses GTL (Graph Template Library) to handle graphs (formerly a free alternative to LEDA)
Finds all mincuts of a graph faster than Semple and Steel’s algorithm
15
A counterexample: two input trees...
[Figure: one tree on leaves a, b, c, x1, x2, x3; the other on leaves c, b, a, y1, y2, y3, y4]
16
Mincut gives this (strange) result
[Figure: supertree on leaves c, x1, x2, x3, b, a, y1, y2, y3, y4]
Disputed relationships among a, b, and c are resolved
x1, x2, and x3 collapsed into polytomy
17
Problem: Cuts depend on connectivity (in this example it is a function of tree size)
[Figure: graph S_{T1,T2} on vertices a, b, c, x1, x2, x3, y1, y2, y3, y4]
18
So, mincut doesn’t work
But, Semple and Steel said it did
My program seems to work
Argh!!! What is happening...?
19
What mincut does... and does not do
Mincut supertree is guaranteed to include any nesting which occurs in all input trees
Makes no claims about nestings which occur in only some of the trees
“Does exactly what it says on the tin™”
20
Modifying mincut supertree
Can we incorporate more of the information in the input trees?
Three categories of information:
Unanimous (all trees have that grouping)
Contradicted (trees explicitly disagree)
Uncontradicted (some trees have information that no other tree disagrees with)
21
Uncontradicted information
Assume we have k input trees
c = number of trees in which a and b co-occur
n = number of trees in which a and b are nested
c - n = 0: uncontradicted (if c = k then unanimous)
c - n > 0: contradicted
22
Uncontradicted information
Assume we have k input trees
c = number of trees in which a and b co-occur
n = number of trees in which a and b are nested
f = number of trees in which a and b are in a fan
c - n - f = 0: uncontradicted (if c = k then unanimous)
c - n - f > 0: contradicted
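The counting rule above can be written as a small function. This is only a sketch: it takes the counts c, n, f, and k as given rather than deriving them from trees, the name `classify_pair` is an assumption, and treating "unanimous" as n = k (every input tree nests the pair) is one reading of the slide that coincides with its "c = k" condition whenever f = 0.

```python
def classify_pair(c, n, f, k):
    """Classify a leaf pair from its counts over k input trees.

    c: trees in which the two leaves co-occur
    n: trees in which they are nested
    f: trees in which they are in a fan
    """
    if not (0 < c <= k) or n < 0 or f < 0 or n + f > c:
        raise ValueError("counts must satisfy 0 < c <= k and n + f <= c")
    if c - n - f > 0:
        # Some tree contains both leaves but neither nests them nor has a fan.
        return "contradicted"
    # Assumed reading: unanimous means every one of the k trees nests the pair.
    return "unanimous" if n == k else "uncontradicted"
```

For example, with k = 2 trees, a pair nested in one tree and absent from the other (c = 1, n = 1, f = 0) is uncontradicted, while a pair nested in one tree and split in the other (c = 2, n = 1, f = 0) is contradicted.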
23
Classifying edges
[Figure: the two input trees and the graph S_{T1,T2}, with edges classified as uncontradicted, uncontradicted but adjacent to contradicted, or contradicted]
24
Modified mincut
[Figure: modified mincut supertree on leaves a, b, c, x1, x2, x3, y1, y2, y3, y4]
Species a, b, and c form a polytomy
x1, x2, and x3 resolved as per the input tree
25
If no tree contradicts an item of information, is that information always in the supertree?
[Figure: four trees on leaves 1, 2, 3, 4, 5, labelled (12)5, (45)1, (23)5, (34)1]
26
No! (Steel, Dress, & Böcker 2000)
[Figure: leaves 1, 2, 3, 4, 5]
The four trees display (12)5, (23)5, (34)1, and (45)1
No tree displays (IK)J or (JK)I for any (IJ)K above
Triplets are uncontradicted, but cannot form a tree
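One way to see why these four triplets cannot form a tree is to draw an edge between the first two leaves of each triplet and look at the connected components, which is the first step of the OneTree-style construction. If all leaves fall into one component, no tree can display every triplet. The sketch below assumes triplets are encoded as (x, y, z) for (xy)z; the function name `components` is illustrative.

```python
def components(leaf_set, triplets):
    """Connected components of the graph with an edge x-y for each triplet (xy)z."""
    adjacency = {leaf: set() for leaf in leaf_set}
    for x, y, _ in triplets:
        adjacency[x].add(y)
        adjacency[y].add(x)
    seen, result = set(), []
    for start in sorted(leaf_set):
        if start in seen:
            continue
        # Depth-first search from an unvisited leaf.
        stack, comp = [start], set()
        while stack:
            v = stack.pop()
            if v in comp:
                continue
            comp.add(v)
            stack.extend(adjacency[v] - comp)
        seen |= comp
        result.append(comp)
    return result

# (12)5, (23)5, (34)1, (45)1 from the slide.
slide_triplets = [(1, 2, 5), (2, 3, 5), (3, 4, 1), (4, 5, 1)]
all_four = components({1, 2, 3, 4, 5}, slide_triplets)
without_last = components({1, 2, 3, 4, 5}, slide_triplets[:3])
```

The edges 1-2, 2-3, 3-4, and 4-5 chain all five leaves into a single component, so the leaf set cannot be split at the root. Dropping (45)1 leaves leaf 5 in a component of its own.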
27
Future directions for supertrees
Improve handling of uncontradicted information
Add support for constraints
Visualising very big trees
Better integration into phylogeny databases (www.treebase.org)
darwin.zoology.gla.ac.uk/~rpage/supertree
28
Supertree Challenge (proposed by Mike Sanderson mjsanderson@ucdavis.edu) The TreeBASE database currently contains over 1000 phylogenies with over 11,000 taxa among them. Many of these trees share taxa with each other and are therefore candidates for the construction of composite phylogenies, or "supertrees", by various algorithms. A challenging problem is the construction of the largest and "best" supertree possible from this database. "Largest" and "best" may represent conflicting goals, however, because resolution of a supertree can be easily diminished by addition of "inappropriate" trees or taxa.
29
It’s a scandal
We cannot answer even the most basic question: “what is the phylogeny for group x?”
GenBank is currently the best phylogenetic database (!)
Can't even say how many species are in a given group
Little idea of who is doing what
31
Tree of Life
tolweb.org
Provides text and images
Relies on extensive manual effort (e.g., writing text)
Can’t do any computations with it
Limited research value
32
TreeBASE
www.treebase.org
Relational database
Query by author, taxon, study number
Compute supertrees
Submit NEXUS data files
33
TreeBASE
34
TreeBASE and mincut supertrees
User selects two or more trees
Clicking a button runs a script on darwin.zoology.gla.ac.uk to create the supertree
Can view as PS, PDF, treefile, or in Java applet (ATV)
35
What’s wrong with TreeBASE?
No consistency of taxon names (e.g., Human, Homo sapiens, Homo sapiens X54666-1)
No consistency of data names (e.g., gene names, morphological characters, etc.)
36
The same organism may have multiple names
37
Starting December 1, the ALL Species Foundation will close its San Francisco office because of a lack of funding for the Foundation.
www.all-species.org
Press Release: November 13, 2002
“The ALL Species Foundation is a non-profit organization dedicated to the complete inventory of all species of life on Earth within the next 25 years - a human generation.”
38
The first challenge
We need a taxonomic name server that can resolve the name of any organism
This server needs to reconcile multiple classifications (e.g., GenBank, ITIS, etc.)
Must handle at least 1 million names, perhaps 100 million
39
Second challenge
How do we query trees?
Trees can be classifications or phylogenies
40
SQL Queries on Trees
Oracle SQL
Transitive Closure Query (recursion)
Nested queries
Node path queries
41
1. All ancestors of node A
42
2. Least Common Ancestor (LCA) of A and B
43
3. Spanning Clade of A and B
44
4. Path Length from A to B (5 in the slide's example)
46
Node paths
[Figure: tree with each node labelled by its path: /1, /1/1, /1/1/1, /1/1/1/1, /1/1/1/2, /1/1/2, /1/2, /1/2/1, /1/2/2, /2]
47
Node paths - selecting subtree
[Figure: tree with each node labelled by its path: /1, /1/1, /1/1/1, /1/1/1/1, /1/1/1/2, /1/1/2, /1/2, /1/2/1, /1/2/2, /2]
SELECT node WHERE (path LIKE '/1/1/%') AND (path < '/1/10/%');
48
Node paths - selecting subtree (leaves only)
[Figure: tree with each node labelled by its path: /1, /1/1, /1/1/1, /1/1/1/1, /1/1/1/2, /1/1/2, /1/2, /1/2/1, /1/2/2, /2]
SELECT node WHERE (path LIKE '/1/1/%') AND (path < '/1/10/%') AND (num_children = 0);
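To try the subtree queries, the sketch below loads the slide's node paths into an in-memory SQLite database and runs the `LIKE` prefix query. The table name `nodes` is hypothetical (the slide's queries give no FROM clause), the helper names are assumptions, `num_children` is derived here from the paths, and the slide's extra upper bound on `path` is left out of this simplified version.

```python
import sqlite3

# Node paths as listed on the slide.
PATHS = ["/1", "/1/1", "/1/1/1", "/1/1/1/1", "/1/1/1/2",
         "/1/1/2", "/1/2", "/1/2/1", "/1/2/2", "/2"]

def make_db(paths):
    """Create an in-memory table nodes(path, num_children); the schema is assumed."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE nodes (path TEXT PRIMARY KEY, num_children INTEGER)")
    for p in paths:
        # A child of p is any path whose parent (path minus last component) is p.
        n = sum(1 for q in paths if q.rsplit("/", 1)[0] == p)
        conn.execute("INSERT INTO nodes VALUES (?, ?)", (p, n))
    return conn

def subtree(conn, root_path, leaves_only=False):
    """Return paths of all descendants of root_path, optionally leaves only."""
    sql = "SELECT path FROM nodes WHERE path LIKE ?"
    if leaves_only:
        sql += " AND num_children = 0"
    sql += " ORDER BY path"
    return [row[0] for row in conn.execute(sql, (root_path + "/%",))]

conn = make_db(PATHS)
below = subtree(conn, "/1/1")
tips = subtree(conn, "/1/1", leaves_only=True)
```

Note that the pattern `/1/1/%` matches only proper descendants, so the node `/1/1` itself is not returned.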
49
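Node paths - LCA
[Figure: tree with each node labelled by its path: /1, /1/1, /1/1/1, /1/1/1/1, /1/1/1/2, /1/1/2, /1/2, /1/2/1, /1/2/2, /2]
The LCA of two nodes is the common substring of their paths, starting from the left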
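The ancestor, LCA, and path-length queries can all be answered from the path strings alone. The sketch below compares paths component by component rather than character by character, so that `/1/1` and `/1/10` are not mistaken for sharing the prefix `/1/1`. The function names are assumptions, and returning `/` when two paths share no component assumes an implicit root above the top-level nodes.

```python
def components_of(path):
    """Split '/1/1/2' into ['1', '1', '2']."""
    return path.strip("/").split("/")

def ancestors(path):
    """All proper ancestors of a node, nearest the root first."""
    parts = components_of(path)
    return ["/" + "/".join(parts[:i]) for i in range(1, len(parts))]

def common_prefix(a, b):
    """Leading path components shared by a and b."""
    shared = []
    for x, y in zip(components_of(a), components_of(b)):
        if x != y:
            break
        shared.append(x)
    return shared

def lca(a, b):
    """Least common ancestor; '/' stands for an assumed implicit root."""
    return "/" + "/".join(common_prefix(a, b))

def path_length(a, b):
    """Number of edges on the path from a up to the LCA and down to b."""
    shared = len(common_prefix(a, b))
    return (len(components_of(a)) - shared) + (len(components_of(b)) - shared)
```

For instance, `lca("/1/1/1/2", "/1/2/1")` is `/1` and the path between those two nodes has 5 edges. The spanning clade of A and B would then be the subtree selected under `lca(A, B)`.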
50
What do we do now...?
Set up a taxonomic name server (TNS)
Develop a phylogenetic database linked to TNS, PubMed, GenBank, etc.
Develop easy ways to populate database (e.g., from TreeBASE, GenBank, journal databases)
Develop standard set of tree queries
Deploy