Species Trees & Constraint Programming: recent progress and new challenges. By Patrick Prosser. Presented by Chris Unsworth at CP06.



Outline
- Tree of life (what’s that then?)
- Previous work (conventional and CP model)
- What’s new? (enhanced model, new problems)
- Conclusions (what have I told you!?)
- Future work (will this never end?)

Tree of life
A central goal of systematics is to construct the tree of life: a tree that represents the relationship between all living things. The leaf nodes of the tree are species. The interior nodes are hypothesized extinct species, where species diverged.

Not to be confused with this

Not to be confused with this either

Something like this

To date, biologists have cataloged about 1.7 million species, yet estimates of the total number of species range from 4 to 100 million. “Of the 1.7 million species identified only about 80,000 species have been placed in the tree of life” E. Pennisi, “Modernizing the Tree of Life”, Science 300:

Properties of a Species Tree
We have a set of leaf nodes, each labelled with a species.
- the interior nodes have no labels (maybe)
- each interior node has 2 children and one parent (maybe/ideally)
- a bifurcating tree (maybe/ideally)
Note: recently there has been a requirement that
- interior nodes have divergence dates
- leaf nodes correspond to other trees (such as a leaf “cats”)
- trees might not bifurcate

Super Trees
We are given two trees, T1 and T2.
S1 and S2 are the sets of leaves for T1 and T2 respectively (remember, leaves are species!).
S1 and S2 have a non-empty intersection: some species appear in both trees.
We want to combine T1 and T2, respecting the relationships in T1 and T2, to form a “super tree”.

[Figure: T1 and T2 are combined into a superTree]

Overlap is highlighted in the trees and the superTree

Overlap is leaves “a” and “f”. A simple wee example.

Most Recent Common Ancestors (mrca)
[Figure: tree with leaves a, b, c, in which a and b are siblings]
mrca(a,b) > mrca(a,c)
mrca(a,b) > mrca(b,c)
mrca(a,c) = mrca(b,c)
We have 3 species, a, b, and c. Species a and b are more closely related to each other than they are to c. The most recent common ancestor of a and b is further from the root than the most recent common ancestor of a and c (and of b and c): mrca(a,b) > mrca(a,c) = mrca(b,c), so a is closer to b than to c. Here x > y reads “x is deeper in the tree than y”.
NOTE: mrca(x,y) = mrca(y,x)

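Most Recent Common Ancestors (mrca)
[Figure: tree with leaves a, b, c, in which a and b are siblings]
mrca(a,b) > mrca(a,c)
mrca(a,b) > mrca(b,c)
mrca(a,c) = mrca(b,c)
mrca(a,b) > mrca(a,c) = mrca(b,c)
Note: the first two relations define the third.
Think of mrca(x,y) as having an integer value, “depth”, so that x > y reads “x is deeper in the tree than y”.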
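
To make the "integer depth" reading concrete, here is a minimal sketch in Python. It assumes a toy tree stored as a child-to-parent map; the tree, the node names `x` and `root`, and all function names are hypothetical and not taken from the talk.

```python
# Hypothetical toy tree: root -> x -> {a, b}, and root -> c.
PARENT = {"a": "x", "b": "x", "x": "root", "c": "root", "root": None}

def ancestors(node, parent):
    """Return the path from node up to the root, node first."""
    path = []
    while node is not None:
        path.append(node)
        node = parent[node]
    return path

def depth(node, parent):
    """Number of edges between node and the root."""
    return len(ancestors(node, parent)) - 1

def mrca(x, y, parent):
    """Deepest node that is an ancestor of both x and y."""
    seen = set(ancestors(x, parent))
    for node in ancestors(y, parent):
        if node in seen:
            return node
    raise ValueError("nodes are not in the same tree")

def mrca_depth(x, y, parent):
    """Depth of mrca(x, y); symmetric in x and y."""
    return depth(mrca(x, y, parent), parent)
```

In this toy tree `mrca_depth("a", "b", PARENT)` is larger than both `mrca_depth("a", "c", PARENT)` and `mrca_depth("b", "c", PARENT)`, and the latter two are equal, which is the situation on the slide.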

Ultrametric relationship
Given 3 leaf nodes labelled a, b, and c there are only 4 possible situations: the three triples ab|c, ac|b and bc|a, and the fan abc.

[Figure: the triples ab|c, ac|b, bc|a and the fan abc] That’s all that there can be, for 3 leaves.

Another view: a space made up of triangles. Given any three vertices a, b and c, the triangle is either isosceles or equilateral. [Figure: the triples ab|c, ac|b, bc|a and the fan abc, beside a triangle with vertices a, b, c]

Ultrametric relationship
Given 3 leaf nodes labelled a, b, and c there are only 4 possible situations. We can represent this using primitive constraints, where D[i,j] is a constrained integer variable representing the depth in the tree of the most recent common ancestor of the i-th and j-th species.

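Ultrametric constraint
Therefore the ultrametric constraint is as follows:
(D[a,b] > D[a,c] = D[b,c]) or (D[a,c] > D[a,b] = D[b,c]) or (D[b,c] > D[a,b] = D[a,c]) or (D[a,b] = D[a,c] = D[b,c])
This is a constraint acting between leaf nodes/species a, b, and c, where D[x,y] is the depth in the tree of mrca(x,y). D[x,y] can also be thought of as distance.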
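
As a quick sanity check on the four situations, here is a small Python sketch that classifies three depth values. The function names and the string labels are illustrative assumptions, not part of the CP model itself.

```python
def ultrametric(d_ab, d_ac, d_bc):
    """True if the two smallest of the three depths are equal."""
    lo, mid, _hi = sorted((d_ab, d_ac, d_bc))
    return lo == mid

def classify(d_ab, d_ac, d_bc):
    """Name the situation the three depths describe, or None if not ultrametric."""
    if d_ab > d_ac == d_bc:
        return "ab|c"   # a and b share the deepest ancestor
    if d_ac > d_ab == d_bc:
        return "ac|b"
    if d_bc > d_ab == d_ac:
        return "bc|a"
    if d_ab == d_ac == d_bc:
        return "fan"
    return None
```

For example, `classify(2, 1, 1)` names the triple ab|c, while `classify(1, 2, 3)` returns `None` because no tree gives three distinct depths for three leaves.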

How it goes (part 1)
1. Take 2 species trees T1 and T2
2. Use the “breakUp” algorithm (Ng & Wormald 1996) on T1 then T2
   - This produces a set of triples and fans
3. Use the “oneTree” algorithm (Ng & Wormald 1996)
   - Generates a superTree or fails
This is the “conventional” (non-CP) approach.
There are different versions of oneTree and breakUp from Semple and Steel (I think) that treat fans differently (they ignore them).
oneTree is essentially the algorithm of Aho, Sagiv, Szymanski and Ullman in SIAM J. Comput. 1981.
Conventional technology (circa 1981)
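
The tree-building step can be sketched as follows. This is a minimal Python sketch in the spirit of the Aho, Sagiv, Szymanski and Ullman algorithm, assuming triples only (fans are ignored here) and trees written as nested tuples; it is not the oneTree code from the talk, and the name `build` is hypothetical.

```python
def build(leaves, triples):
    """Build a tree (nested tuples) consistent with all triples, or return None.

    Each triple (x, y, z) means xy|z: x and y are closer to each other than to z.
    """
    leaves = sorted(leaves)
    if len(leaves) == 1:
        return leaves[0]
    if len(leaves) == 2:
        return tuple(leaves)

    # Union-find over the leaves: x and y of every relevant triple must
    # end up below the same child of the root.
    comp = {s: s for s in leaves}

    def find(s):
        while comp[s] != s:
            comp[s] = comp[comp[s]]
            s = comp[s]
        return s

    leaf_set = set(leaves)
    relevant = [t for t in triples if set(t) <= leaf_set]
    for x, y, _z in relevant:
        comp[find(x)] = find(y)

    groups = {}
    for s in leaves:
        groups.setdefault(find(s), []).append(s)
    if len(groups) == 1:
        return None  # everything is forced together: the triples are incompatible

    subtrees = []
    for members in groups.values():
        sub = build(members, relevant)
        if sub is None:
            return None
        subtrees.append(sub)
    return tuple(subtrees)
```

Calling `build("abc", [("a", "b", "c")])` gives the tree with a and b as siblings, while adding the conflicting triple `("a", "c", "b")` makes it fail and return `None`.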

breakUp generates constraints!
[Figure: tree with leaves A, B, C, D, E, F, G]
1. Find the deepest interior node
2. Get its descendants (leaf nodes)
3. Get a cousin or uncle leaf node
4. Generate a triple or fan
5. Delete one of the leaves from step 2
6. Take the other leaf from step 2 and replace its parent with that leaf
7. Go to 1 unless we are at the root with degree 2

breakUp generates constraints! Generate triple AB|C. This is the constraint D[A,C] = D[B,C] < D[A,B]. [Figure: tree with leaves A, B, C, D, E, F, G; a deepest interior node is marked]

breakUp generates constraints! Generate triple DE|C. This is the constraint D[D,C] = D[E,C] < D[D,E]. [Figure: tree with leaves B, C, D, E, F, G; a deepest interior node is marked]

breakUp generates constraints! Generate fan BCE. This is the constraint D[B,C] = D[B,E] = D[C,E]. [Figure: tree with leaves B, C, E, F, G; a deepest interior node is marked]

breakUp generates constraints! Generate triple FG|E. This is the constraint D[E,F] = D[E,G] < D[F,G]. [Figure: tree with leaves E, F, G; a deepest interior node is marked]

breakUp generates constraints! Done. [Figure: tree with leaves E and G] The triples and fans can be viewed as constraints that break the ultrametric disjunctions.
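
To see the triples and fans acting as constraints on depths, here is a small Python sketch. It assumes the example tree has the shape (((A,B),C,(D,E)),(F,G)), which is a guess read off the walk-through since the figure itself is lost; the helper names are hypothetical.

```python
from itertools import combinations

def collect(tree, depth, table):
    """Record the mrca depth of every leaf pair; return the leaves under tree."""
    if isinstance(tree, str):
        return [tree]
    child_leaves = [collect(child, depth + 1, table) for child in tree]
    # Leaves that sit under different children meet at this node.
    for left, right in combinations(child_leaves, 2):
        for x in left:
            for y in right:
                table[frozenset((x, y))] = depth
    return [leaf for group in child_leaves for leaf in group]

def depth_table(tree):
    """Map each unordered leaf pair to the depth of its mrca (root is depth 0)."""
    table = {}
    collect(tree, 0, table)
    return table

def triple_holds(table, x, y, z):
    """xy|z as a constraint: D[x,z] = D[y,z] < D[x,y]."""
    d = lambda p, q: table[frozenset((p, q))]
    return d(x, z) == d(y, z) < d(x, y)

def fan_holds(table, x, y, z):
    """Fan xyz as a constraint: D[x,y] = D[x,z] = D[y,z]."""
    d = lambda p, q: table[frozenset((p, q))]
    return d(x, y) == d(x, z) == d(y, z)

# Assumed shape of the example tree (the original figure is not available).
TREE = ((("A", "B"), "C", ("D", "E")), ("F", "G"))
```

With `table = depth_table(TREE)`, the four generated constraints AB|C, DE|C, fan BCE and FG|E all hold, whereas a triple the tree does not contain, such as AC|B, does not.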

The 1st CP approach

How it goes (part 2)
This is the CP approach proposed by Gent, Prosser, Smith & Wei in CP03 (a great great paper, go read it).
1. Generate an n by n array of constrained integer variables
2. For all 0 ≤ i < j < k < n post the ultrametric constraint
   - Yes, we have a cubic number of constraints
   - Yes, we have a quadratic number of variables
   - This gives us an “ultrametric matrix”
3. Use breakUp on trees T1 and T2 to produce triples and fans
4. Post the triples and fans as constraints, breaking disjunctions
5. Find a first solution
6. Convert the ultrametric matrix to an ultrametric tree
The algorithm for converting an ultrametric matrix to an ultrametric tree is given by Dan Gusfield.
CP approach (circa 2003)
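
The shape of this model can be sketched without a solver. The following Python sketch swaps constraint propagation for plain brute-force enumeration, so it only works for a handful of species; the function names, the depth bound `max_depth` and the example triples are assumptions made for illustration.

```python
from itertools import combinations, product

def ultrametric(x, y, z):
    """True if the two smallest of the three values are equal."""
    lo, mid, _hi = sorted((x, y, z))
    return lo == mid

def solve(species, triples, max_depth):
    """Enumerate all D assignments that are ultrametric and satisfy the triples.

    D maps each pair (i, j), i before j in species order, to a depth in
    1..max_depth. A triple (x, y, z) means xy|z: D[x,z] = D[y,z] < D[x,y].
    """
    pairs = list(combinations(species, 2))
    solutions = []
    for values in product(range(1, max_depth + 1), repeat=len(pairs)):
        D = dict(zip(pairs, values))

        def d(x, y):
            # The matrix is symmetric, so look the pair up in either order.
            return D[(x, y)] if (x, y) in D else D[(y, x)]

        # Cubic number of ultrametric constraints over a quadratic number of variables.
        if not all(ultrametric(d(i, j), d(i, k), d(j, k))
                   for i, j, k in combinations(species, 3)):
            continue
        # Triples from the input trees break the disjunctions.
        if all(d(x, z) == d(y, z) < d(x, y) for x, y, z in triples):
            solutions.append(D)
    return solutions
```

For instance, `solve("abcd", [("a", "b", "c"), ("c", "d", "b")], 3)` combines ab|c from one tree with cd|b from another. Every solution it returns describes the same tree shape and differs only in the depth values chosen.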

The key here is that we have an array of variables representing distances, and this space must be ultrametric.

A min ultrametric tree and its min ultrametric matrix. [Figure: tree on leaves A, B, C, D, E with valued interior nodes, beside its matrix] As we go down a branch, values on interior nodes increase. A matrix value is the value of the most recent common ancestor of two leaf nodes. The matrix is symmetric.
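
Going from matrix back to tree can be sketched with a simple recursive partition. This Python sketch is not Gusfield's algorithm itself; it assumes the matrix is already ultrametric and is stored as a dict keyed by unordered leaf pairs, and the name `to_tree` is hypothetical.

```python
from itertools import combinations

def to_tree(leaves, D):
    """Turn a min ultrametric matrix into a tree written as nested tuples.

    D maps frozenset({x, y}) to the value of mrca(x, y). The smallest value
    among the leaves belongs to the root; leaves whose pairwise value is
    larger than that sit together below one child of the root.
    """
    leaves = list(leaves)
    if len(leaves) == 1:
        return leaves[0]
    root_value = min(D[frozenset(pair)] for pair in combinations(leaves, 2))
    groups = []
    for leaf in leaves:
        for group in groups:
            # For an ultrametric matrix, comparing against one member of the
            # group is enough, because "deeper than the root" is transitive.
            if D[frozenset((group[0], leaf))] > root_value:
                group.append(leaf)
                break
        else:
            groups.append([leaf])
    return tuple(to_tree(group, D) for group in groups)
```

Given a matrix in which only the pair (a, b) has a value above the minimum, `to_tree` puts a and b under one child and c under another; if all values are equal it returns a fan.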

The state of play in 2003: coded up in claire & choco; more a “proof of concept” than a useful tool; small data sets only.

Two species trees of sea birds from the CP03 paper

Resultant superTree: on the left by oneTree and on the right by the CP model.

What’s new (2006)
1. Reimplemented in Java & JChoco (so faster)
2. More robust (thanks to Pierre Flener’s help)
3. Can now deal with larger trees (about 70 species)
4. Can generate all solutions up to symmetry
5. Can handle divergence dates on interior nodes
6. Reimplemented breakUp & oneTree in Java
7. All code available on the web

Bigger Trees
Attempted to reconstruct the supertree in Kennedy & Page’s “Seabird supertrees: Combining partial estimates of procellariiform phylogeny” in “The Auk: A Quarterly Journal of Ornithology” 119:
Trees of seabirds (A through G), varying in size from 14 to 90 species.

From the paper
The table shows on the diagonal the size of each tree, A through G.
- A table entry is the size of the combined tree
- A table entry is in () if the trees are incompatible
- A table entry is – if the trees are too big for the CP model
The only compatible trees are A, B, D and F. The resultant supertree has 69 species. This takes 20 seconds to produce.

A “lifted” representation
Rather than instantiate the “D” variables, why not just break the disjunctions? Now the decision variables are P[i,j,k]. And yes, we have a cubic number of P variables.

A “lifted” representation
Rather than instantiate the “D” variables, why not just break the disjunctions? Now the decision variables are P[i,j,k]. Now we can:
1. Enumerate all solutions, eliminating value symmetries
2. Allow ranges of values on interior nodes of trees (input and output!)
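
One way to picture the lifting is to let each P[i,j,k] say which of the four situations holds for that triple of species. The Python sketch below assumes the encoding 1, 2, 3 for the three triples and 4 for the fan; that numbering and the function names are illustrative assumptions, not the encoding used in the actual model.

```python
from itertools import combinations

def case(d_ij, d_ik, d_jk):
    """Which disjunct of the ultrametric constraint the three depths satisfy."""
    if d_ij > d_ik == d_jk:
        return 1  # ij|k
    if d_ik > d_ij == d_jk:
        return 2  # ik|j
    if d_jk > d_ij == d_ik:
        return 3  # jk|i
    if d_ij == d_ik == d_jk:
        return 4  # fan
    return None   # not ultrametric

def lift(D, species):
    """Map a D assignment (keyed by ordered pairs) to its P assignment."""
    return {(i, j, k): case(D[i, j], D[i, k], D[j, k])
            for i, j, k in combinations(species, 3)}

# Two D assignments that differ only in the depth chosen for mrca(a, b).
D1 = {("a", "b"): 2, ("a", "c"): 1, ("b", "c"): 1}
D2 = {("a", "b"): 3, ("a", "c"): 1, ("b", "c"): 1}
```

Both `D1` and `D2` lift to the same P assignment, which is the sense in which deciding on P rather than D removes value symmetries: the tree shape is fixed while the depths are left as ranges.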

Ranked Trees
A new problem where input trees have ancestral divergence dates on interior nodes. A new “conventional” technique is the RANKED TREE algorithm.

Ranked Trees using “lifted” CP model
A new problem where input trees have ancestral divergence dates on interior nodes. We do this in the “lifted” model by merely
1. reading in divergence dates for pairs of species and posting these as constraints on the “D” variables
2. then solving using the disjunction-breaking “P” variables
3. interior nodes retain range values
4. in addition, we can enumerate all solutions, eliminating value symmetries

Two trees of cats. Ranks (divergence information) on interior nodes. Common species in boxes.

Two ranked cat trees on the left, and on the right one of the ranked supertrees. NOTE: range of values [6..9] on mrca(PTE,LTI).

7 of the 17 solutions have ranges on interior nodes. Without the “lifted” representation we get 30 solutions (some redundant).

Is this a 1st? We thinks so (or at least Patrick thinks so)
1. enumerate all solutions for ranked supertrees
2. remove value symmetries

What next?
Reduce the size of the model with a specialised ultrametric constraint
- over 3 variables
- over 3 variables plus the P decision variable
- over an entire n by n array
Improve propagation of the ultrametric constraint
- Bound GAC
- GAC
New applications
- Identify common features (backbone) of all supertrees
- Address nested taxa
- Combine all we have
Already underway with Neil Moore

Conclusion
- presented a new (non-conventional) way of addressing the supertree problem
- the constraint model has been shown to be versatile
  - enumerate all solutions removing symmetries
  - address divergence dates on interior nodes
  - enumerate all solutions for ranked trees
- the model is bulky/large
  - we are working on this
- future extensions
  - find the backbone of a forest of supertrees
  - address nested taxa

I did it all on my own? NO WAY!

Thanks for helping: Pierre Flener, Xavier Lorca, Rod Page, Mike Steel, Charles Semple, Chris Unsworth, Neil Moore, Christine Wu Wei, Barbara Smith, Ian Gent

Any questions?