Absolute Fast Converging Methods


Absolute Fast Converging Methods CS 581 Algorithmic Computational Genomics

Part 1: Absolute Fast Convergence

DNA Sequence Evolution. [Figure: sequences evolving down a tree along a time axis (-3 mil yrs, -2 mil yrs, -1 mil yrs, today). Sequences shown: AAGACTT, TGGACTT, AAGGCCT, AGGGCAT, TAGCCCT, AGCACTT, AGCGCTT, AGCACAA, TAGACTT, TAGCCCA.]

[Figure: input sequences U = AGGGCAT, V = TAGCCCA, W = TAGACTT, X = TGCACAA, Y = TGCGCTT, and a tree on the leaves X, U, Y, V, W.]

Markov Model of Site Evolution. Simplest (Jukes-Cantor): The model tree T is binary and has substitution probabilities p(e) on each edge e. The state at the root is randomly drawn from {A,C,T,G} (nucleotides). If a site (position) changes on an edge, it changes with equal probability to each of the remaining states. The expected number of changes on edge e is given by w(e), and is a function of the probability p(e) of a change being observed on the edge: w(e) = -(3/4) ln(1 - (4/3) p(e)). The evolutionary process is Markovian. More complex models (such as the General Markov model) are also considered, often with little change to the theory.
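
The Jukes-Cantor relation between p(e) and w(e) can be sketched in a few lines of Python. This is a minimal illustration, not part of the slides: the function names are hypothetical, and `jc_distance` assumes the usual practice of plugging the observed fraction of differing sites between two aligned sequences into the same formula.

```python
import math

def jc_expected_changes(p):
    """w = -(3/4) ln(1 - (4/3) p); only defined for 0 <= p < 3/4."""
    if not 0.0 <= p < 0.75:
        raise ValueError("p must satisfy 0 <= p < 3/4")
    return -0.75 * math.log(1.0 - (4.0 / 3.0) * p)

def jc_change_probability(w):
    """Inverse map: p = (3/4) (1 - exp(-(4/3) w))."""
    return 0.75 * (1.0 - math.exp(-(4.0 / 3.0) * w))

def jc_distance(seq_a, seq_b):
    """Estimated distance between two aligned sequences (assumed estimator)."""
    if len(seq_a) != len(seq_b) or not seq_a:
        raise ValueError("sequences must be non-empty and of equal length")
    # Fraction of sites at which the two sequences differ.
    p_hat = sum(a != b for a, b in zip(seq_a, seq_b)) / len(seq_a)
    return jc_expected_changes(p_hat)
```

Note that w(e) is always at least p(e), and it grows without bound as p(e) approaches 3/4, which is one way to see why large distances are hard to estimate from short sequences.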

Quantifying Error. FN: false negative (missing edge). FP: false positive (incorrect edge). [Figure: a true tree compared with an estimated tree, with a 50% error rate.]
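
As a rough sketch of how FN and FP rates might be computed, the code below represents each internal edge of a tree by the bipartition of the taxa it induces. The representation, the function names, and the two example trees on U, V, W, X, Y are all assumptions made for illustration.

```python
def bipartition(side, taxa):
    """Normalize an edge to the side that does not contain the smallest taxon."""
    side, taxa = frozenset(side), frozenset(taxa)
    anchor = min(taxa)
    return taxa - side if anchor in side else side

def error_rates(true_edges, est_edges):
    """Return (FN rate, FP rate) for two sets of internal-edge bipartitions."""
    true_edges, est_edges = set(true_edges), set(est_edges)
    fn = len(true_edges - est_edges) / len(true_edges)  # true edges that are missing
    fp = len(est_edges - true_edges) / len(est_edges)   # estimated edges that are wrong
    return fn, fp

taxa = "UVWXY"
# Hypothetical true tree: internal edges UX|VWY and VW|UXY.
true_tree = {bipartition("UX", taxa), bipartition("VW", taxa)}
# Hypothetical estimate: internal edges UX|VWY and VY|UWX.
est_tree = {bipartition("UX", taxa), bipartition("VY", taxa)}
rates = error_rates(true_tree, est_tree)
```

Here one of the two true edges is missing and one of the two estimated edges is incorrect, so both rates are 0.5.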

Statistical consistency, exponential convergence, and absolute fast convergence (afc)

“Convergence rate” or sequence length requirement. The sequence length (number of sites) that a phylogeny reconstruction method M needs to reconstruct the true tree with probability at least 1 - ε depends on: M (the method), ε, f = min w(e), g = max w(e), and n, the number of leaves. We fix everything but n.

Afc methods. A method M is “absolute fast converging”, or afc, if for all positive f, g, and ε, there is a polynomial p(n) s.t. Pr(M(S)=T) > 1 - ε, when S is a set of sequences generated on T of length at least p(n). Notes: 1. The polynomial p(n) will depend upon M, f, g, and ε. 2. The method M is not “told” the values of f and g.
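
Written out with explicit quantifiers, the definition might read as follows. This is one reading, assuming that f and g bound the edge weights w(e) of the model tree from below and above (as in the definitions f = min w(e) and g = max w(e)), that k denotes the sequence length, and that ε denotes the error tolerance.

```latex
\forall f, g, \varepsilon > 0 \;\; \exists \text{ a polynomial } p \text{ such that, for every model tree } (T, w)
\text{ with } n \text{ leaves and } f \le w(e) \le g \text{ for all edges } e,
\quad k \ge p(n) \;\Longrightarrow\; \Pr\bigl[ M(S) = T \bigr] > 1 - \varepsilon ,
\text{ where } S \text{ is a set of sequences of length } k \text{ generated on } (T, w).
```

The order of the quantifiers matters: the polynomial p is chosen after f, g, and ε, but it must work for every tree within those bounds, whatever its size n.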

Distance-based estimation

Are distance-based methods statistically consistent? And if so, what are their sequence length requirements?

Theorem (Erdos et al., Atteson): Neighbor joining (and some other methods) will return the true tree w.h.p. provided sequence lengths are exponential in the evolutionary diameter of the tree. Sketch of proof: NJ (and other distance methods) are guaranteed to be correct if all entries in the estimated distance matrix have sufficiently low error. Estimating large distances with low error w.h.p. requires long sequences.
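
The phrase "sufficiently low error" can be made concrete. The sketch below assumes the bound usually attributed to Atteson, which the slide itself does not state: NJ returns the true tree whenever every entry of the estimated matrix is within f/2 of the true distance, where f is the smallest edge weight. The function names are hypothetical.

```python
def max_abs_error(d_est, d_true):
    """Largest entrywise difference between two square distance matrices."""
    n = len(d_true)
    return max(abs(d_est[i][j] - d_true[i][j]) for i in range(n) for j in range(n))

def within_safety_radius(d_est, d_true, f):
    """True if every entry is within f/2 of the truth (assumed Atteson-style bound)."""
    return max_abs_error(d_est, d_true) < f / 2.0

# Toy example: true distances on three taxa, and a slightly perturbed estimate.
d_true = [[0.0, 2.0, 3.0], [2.0, 0.0, 3.0], [3.0, 3.0, 0.0]]
d_est = [[0.0, 2.2, 2.9], [2.2, 0.0, 3.1], [2.9, 3.1, 0.0]]
```

Such a check is only possible in simulations, where the true distances are known; in practice the method cannot tell which entries are accurate.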

Performance on large diameter trees. Simulation study based upon fixed edge lengths, K2P model of evolution, sequence lengths fixed to 1000 nucleotides. Error rates reflect proportion of incorrect edges in inferred trees. [Nakhleh et al. ISMB 2001] [Figure: error rate of NJ (y-axis, 0 to 0.8) against number of taxa (x-axis, 400 to 1600).]

Designing an afc method. You often don’t need the entire distance matrix to get the true tree (think of the caterpillar tree). The problem is you don’t know which entries have sufficiently low error, and which ones are needed to determine the tree. But you can guess!

Fast converging methods (and related work). 1997: Erdos, Steel, Szekely, and Warnow (ICALP). 1999: Erdos, Steel, Szekely, and Warnow (RSA, TCS); Huson, Nettles and Warnow (J. Comp Bio.). 2001: Warnow, St. John, and Moret (SODA); Cryan, Goldberg, and Goldberg (SICOMP); Csuros and Kao (SODA); Nakhleh, St. John, Roshan, Sun, and Warnow (ISMB). 2002: Csuros (J. Comp. Bio.). 2006: Daskalakis, Mossel, Roch (STOC); Daskalakis, Hill, Jaffe, Mihaescu, Mossel, and Rao (RECOMB). 2007: Mossel (IEEE TCBB). 2008: Gronau, Moran and Snir (SODA). 2010: Roch (Science). 2017: Roch and Sly (Prob. Theory and Related Fields), and others.

Part 2: Short Quartet Methods. The first “absolute fast converging” methods were based on “short quartets”, which are quartet trees formed by taking the nearest leaf in each subtree around some edge. “Nearest” can be based on any branch lengths, including just unit branch lengths.

Short Quartets Define the Tree. Theorem: Let (T,w) be a tree with branch lengths, and let Q be the set of short quartet trees of T. If T’ is some tree on the same leaf set, and Q is a subset of Q(T’), then T=T’. Proof: Recall that T=T’ iff Q(T)=Q(T’). We show that the dyadic closure of Q equals Q(T), and the result follows.

Dyadic Closure. Rule 1: AB|CD + BC|DE defines a tree on A,B,C,D,E, and so implies the quartets AB|CE, AB|DE, and AC|DE. Rule 2: AB|CD + AB|CE => AB|DE.
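
Both inferences can be checked by brute force. The sketch below does not pattern-match the rules; instead it enumerates all 15 binary trees on the five taxa involved, keeps those that display both input quartets, and reports the quartets common to every surviving tree. The representation and function names are hypothetical.

```python
def quartet(a, b, c, d):
    """The quartet tree ab|cd, stored without regard to order."""
    return frozenset([frozenset([a, b]), frozenset([c, d])])

def five_leaf_trees(taxa):
    """All 15 binary unrooted trees on five taxa, each given as its set of five quartets."""
    taxa = sorted(taxa)
    trees = []
    for m in taxa:                      # m is the middle leaf of the caterpillar
        rest = [t for t in taxa if t != m]
        a = rest[0]
        for b in rest[1:]:              # {a, b} is one cherry, {d, e} the other
            d, e = [t for t in rest if t not in (a, b)]
            trees.append(frozenset([
                quartet(a, b, d, e), quartet(a, b, m, d), quartet(a, b, m, e),
                quartet(a, m, d, e), quartet(b, m, d, e),
            ]))
    return trees

def dyadic_inference(q1, q2):
    """Quartets implied by q1 and q2 together (empty if they do not span five taxa)."""
    taxa = set().union(*q1, *q2)
    if len(taxa) != 5:
        return set()
    consistent = [t for t in five_leaf_trees(taxa) if q1 in t and q2 in t]
    if not consistent:
        return set()                    # the two quartets conflict
    return set(frozenset.intersection(*consistent)) - {q1, q2}

rule1 = dyadic_inference(quartet("A", "B", "C", "D"), quartet("B", "C", "D", "E"))
rule2 = dyadic_inference(quartet("A", "B", "C", "D"), quartet("A", "B", "C", "E"))
```

In the first case exactly one five-leaf tree survives, so three new quartets are implied; in the second case three trees survive and they agree only on AB|DE.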

The first short quartet method. Given distance matrix D and threshold q, do: 1. Erase all entries in D that are bigger than q. 2. For all quartets i,j,k,l such that all pairwise distances are at most q, use the Four Point Method to compute a tree on i,j,k,l. 3. Compute the Dyadic Closure Q of this set of quartet trees. 4. If no conflicts occur, then Q = Q(T) for some tree; compute Tq using the Naïve Quartet Method. Else reject q.
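
The Four Point Method step can be sketched as follows: among the three ways of pairing up four taxa, pick the pairing whose two within-pair distances have the smallest sum. The function names and the toy matrix are assumptions for illustration.

```python
from itertools import combinations

def four_point_method(d, i, j, k, l):
    """Return the split ((x, y), (z, w)) minimizing d[x][y] + d[z][w]."""
    candidates = [
        (d[i][j] + d[k][l], ((i, j), (k, l))),
        (d[i][k] + d[j][l], ((i, k), (j, l))),
        (d[i][l] + d[j][k], ((i, l), (j, k))),
    ]
    return min(candidates, key=lambda c: c[0])[1]

def within_threshold(d, quad, q):
    """True if every pairwise distance among the four taxa is at most q."""
    return all(d[x][y] <= q for x, y in combinations(quad, 2))

# Additive distances for the tree ((0,1),(2,3)) with all edge lengths equal to 1.
d = [
    [0, 2, 3, 3],
    [2, 0, 3, 3],
    [3, 3, 0, 2],
    [3, 3, 2, 0],
]
split = four_point_method(d, 0, 1, 2, 3)
```

On this matrix the three sums are 4, 6, and 6, so the method returns the split 01|23. With threshold q = 2 the quartet would be skipped, because some pairwise distances equal 3.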

The Short Quartet Method. After you compute Tq for each q in D, see which case is true: 1. All threshold values for q are rejected. 2. At least one value is not rejected, and all non-rejected values return the same tree. 3. At least two values are not rejected but they return different trees.
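
The case analysis can be sketched as a small function. It assumes, purely for illustration, that the per-threshold results are held in a dictionary mapping each threshold q to a tree (any value comparable with ==) or to None when q was rejected.

```python
def classify_outcome(trees_by_threshold):
    """Return (case label, tree or None) for a dict of threshold -> tree/None."""
    accepted = [t for t in trees_by_threshold.values() if t is not None]
    if not accepted:
        return "all thresholds rejected", None
    if all(t == accepted[0] for t in accepted):
        return "unique tree", accepted[0]
    return "conflicting trees", None

# Hypothetical results: trees are written as Newick-like strings for readability.
results = {1.0: None, 2.0: "((A,B),C,(D,E))", 3.0: "((A,B),C,(D,E))"}
outcome = classify_outcome(results)
```

Comparing strings stands in for a real tree-equality test; an actual implementation would compare bipartition or quartet sets.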

The Short Quartet Method. The outcome we want is: at least one value is not rejected, and all non-rejected values return the same tree. We can prove that this outcome happens with high probability given polynomial length sequences, and that it returns the true tree! In other words, the Dyadic Closure Method is absolute fast converging.

Nice, but... Although the Dyadic Closure method is absolute fast converging, it generally has bad performance: it returns the true tree or no tree, and most often it will return no tree. So it has good theory but bad performance, like the Naïve Quartet Method.

DCM1: another afc method. DCM: disk-covering method. The idea is to use divide-and-conquer to decompose a dataset into subsets, apply your favored method to construct trees on the subsets, and then combine these trees into a tree on the full dataset. But the details matter (see Stendhal).

DCM1-boosting: Warnow, St. John, and Moret, SODA 2001. [Diagram: an exponentially converging (base) method is passed through DCM1 and then SQS, yielding an absolute fast converging (DCM1-boosted) method.] The DCM1 phase produces a collection of trees (one for each threshold), and the SQS phase picks the “best” tree. For a given threshold, the base method is used to construct trees on small subsets (defined by the threshold) of the taxa. These small trees are then combined into a tree on the full set of taxa.

DCM1-boosting distance-based methods [Nakhleh et al. ISMB 2001]. Theorem (Warnow et al., SODA 2001): DCM1-NJ converges to the true tree from polynomial length sequences. [Figure: error rates of NJ and DCM1-NJ (y-axis, 0 to 0.8) against number of taxa (x-axis, 400 to 1600).]

DCM1-NJ+SQS. Theorem 1: For all f, g, ε, there is a polynomial p(n) such that given sequences of length at least p(n), with probability at least 1 - ε, the DCM1 phase produces a set containing the true tree. Theorem 2: For all f, g, ε, there is a polynomial p(n) such that given sequences of length at least p(n), with probability at least 1 - ε, if the set contains the true tree, then the SQS phase selects the true tree.

DCM1-boosting: Warnow, St. John, and Moret, SODA 2001. [Diagram: an exponentially converging (base) method is passed through DCM1 and then SQS, yielding an absolute fast converging (DCM1-boosted) method.] The DCM1 phase produces a collection of trees (one for each threshold), and the SQS phase picks the “best” tree. How to compute a tree for a given threshold: Handwaving description: erase all the entries in the distance matrix above that threshold, and compute a tree from the remaining entries using the “base” method. The real technique uses chordal graph decompositions.

Chordal (triangulated) graphs A graph is chordal iff it has no simple induced cycles of at least four vertices.

More about chordal graphs. If G is chordal and not a clique, then for any pair of vertices a,b that are not adjacent, every minimal vertex separator of a and b is a clique.

Chordal graphs A chordal graph has a perfect elimination scheme (an ordering on the vertices so that for every vertex, the set of neighbors of the vertex that follow it in the ordering form a clique). In fact, any graph that has a perfect elimination scheme is chordal! Hence we can determine if a graph is chordal using a greedy algorithm.
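
The greedy chordality test can be sketched as follows: repeatedly delete a vertex whose remaining neighbors form a clique (a simplicial vertex). If this empties the graph, the deletion order is a perfect elimination scheme; if it gets stuck, the graph is not chordal. This is a simple quadratic-or-worse illustration with hypothetical function names, not an optimized algorithm, and graphs are assumed to be given as dictionaries mapping each vertex to its set of neighbors.

```python
def is_clique(adj, vertices):
    """True if every two distinct vertices in the collection are adjacent."""
    vs = list(vertices)
    return all(v in adj[u] for idx, u in enumerate(vs) for v in vs[idx + 1:])

def perfect_elimination_ordering(adj):
    """Return a perfect elimination scheme as a list, or None if none exists."""
    remaining = {v: set(nbrs) for v, nbrs in adj.items()}
    order = []
    while remaining:
        simplicial = next(
            (v for v, nbrs in remaining.items() if is_clique(remaining, nbrs)), None
        )
        if simplicial is None:
            return None                  # stuck: the remaining graph has no simplicial vertex
        order.append(simplicial)
        for u in remaining.pop(simplicial):
            remaining[u].discard(simplicial)
    return order

def is_chordal(adj):
    return perfect_elimination_ordering(adj) is not None

four_cycle = {1: {2, 4}, 2: {1, 3}, 3: {2, 4}, 4: {1, 3}}
with_chord = {1: {2, 3, 4}, 2: {1, 3}, 3: {1, 2, 4}, 4: {1, 3}}
```

The four-cycle has no simplicial vertex and is rejected immediately, while adding the chord between 1 and 3 makes the graph chordal.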

More about chordal graphs A graph is chordal if and only if it is the intersection graph of a set of subtrees of a tree. This theorem is why the Perfect Phylogeny Problem and the Triangulating Colored Graphs problem are equivalent.

More about chordal graphs. If D is an additive distance matrix and q is a positive number, then the Threshold Graph TG(D,q) is chordal, where TG(D,q) has n vertices v1, v2, …, vn and has an edge (vi,vj) if and only if D[i,j] <= q.
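
Building the threshold graph from a distance matrix is straightforward. In the sketch below the function name is hypothetical, vertices are numbered from 0 rather than from 1, and the example matrix is the additive matrix of a four-leaf tree with unit edge lengths.

```python
def threshold_graph(d, q):
    """Adjacency sets of the graph with an edge (i, j), i != j, iff d[i][j] <= q."""
    n = len(d)
    adj = {i: set() for i in range(n)}
    for i in range(n):
        for j in range(i + 1, n):
            if d[i][j] <= q:
                adj[i].add(j)
                adj[j].add(i)
    return adj

# Additive distances for the tree ((0,1),(2,3)) with all edge lengths equal to 1.
d = [
    [0, 2, 3, 3],
    [2, 0, 3, 3],
    [3, 3, 0, 2],
    [3, 3, 2, 0],
]
small = threshold_graph(d, 2)   # only the two cherries are joined
large = threshold_graph(d, 3)   # every pair is joined
```

As q increases, the graph gains edges until it becomes a clique, so only the distinct entries of the matrix need to be tried as thresholds.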

Every chordal graph has at most n maximal cliques, and the Maxclique decomposition can be found in polynomial time.
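
One way to see both claims is through a perfect elimination scheme: each time a simplicial vertex is removed, the vertex together with its remaining neighbors forms a clique, and every maximal clique arises this way. That gives at most n candidates, from which the non-maximal ones are discarded. The sketch below is a simple illustration with a hypothetical function name; graphs are dictionaries mapping each vertex to its set of neighbors.

```python
def maximal_cliques_chordal(adj):
    """Maximal cliques of a chordal graph; raises ValueError if the graph is not chordal."""
    remaining = {v: set(nbrs) for v, nbrs in adj.items()}
    candidates = []
    while remaining:
        # Find a vertex whose remaining neighbors are pairwise adjacent.
        v = next(
            (u for u, nbrs in remaining.items()
             if all(b in remaining[a] for a in nbrs for b in nbrs if a != b)),
            None,
        )
        if v is None:
            raise ValueError("graph is not chordal")
        nbrs = remaining.pop(v)
        candidates.append(frozenset(nbrs | {v}))   # one candidate clique per vertex
        for u in nbrs:
            remaining[u].discard(v)
    # Keep only candidates that are not strictly contained in another candidate.
    return [c for c in candidates if not any(c < other for other in candidates)]

# Two triangles sharing the edge 1-2, plus a pendant vertex 4 attached to 3.
graph = {0: {1, 2}, 1: {0, 2, 3}, 2: {0, 1, 3}, 3: {1, 2, 4}, 4: {3}}
cliques = maximal_cliques_chordal(graph)
```

For this five-vertex graph the candidates are {0,1,2}, {1,2,3}, {2,3}, {3,4}, and {4}, of which three are maximal.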

DCM1. Given a distance matrix for the species: 1. Define a triangulated (i.e. chordal) graph so that its vertices correspond to the input taxa. 2. Compute the max clique decomposition of the graph, thus defining a decomposition of the taxa into overlapping subsets. 3. Compute a tree on each max clique using the “base method”. 4. Merge the subtrees into a single tree on the full set of taxa.

DCM1 Decompositions. Input: set S of sequences, distance matrix d, threshold value q. 1. Compute the threshold graph. 2. Perform a minimum weight triangulation (note: if d is an additive matrix, then the threshold graph is provably triangulated). 3. DCM1 decomposition: compute the maximal cliques.

DCM1-boosting distance-based methods [Nakhleh et al. ISMB 2001]. Theorem (Warnow et al., SODA 2001): DCM1-NJ converges to the true tree from polynomial length sequences. There are many other afc methods, but none (so far) outperform NJ in practice. [Figure: error rates of NJ and DCM1-NJ (y-axis, 0 to 0.8) against number of taxa (x-axis, 400 to 1600).]

Summary and Open Questions. DCM-NJ has better accuracy than NJ. DCM-boosting of other distance-based methods also produces very big improvements in accuracy. Other afc methods have been developed with even better theoretical performance. Roch and collaborators have established a threshold for branch lengths, below which logarithmic sequence lengths can suffice for accuracy. Still to be developed: other afc methods with improved empirical performance compared to NJ and other methods. Sebastien Roch recently proved that maximum likelihood is afc.

What about more complex models? These results only apply when sequences evolve under these nice substitution-only models. What can we say about estimating trees when sequences evolve with insertions and deletions (“indels”)?