Based on the paper by D.Huson, S.Nettles, T.Warnow

Presentation transcript:

Disk-Covering Method
Based on the paper by D. Huson, S. Nettles, T. Warnow
Presented by Galiya S., Eduard S.
[Figure: phylogenetic tree of Orangutan, Gorilla, Chimpanzee, Human. From the Tree of Life Website, University of Arizona]

Phylogenetic Tree
A phylogenetic tree is a tree showing the evolutionary interrelationships among various species.
[Figure from Desert Vista High School, Phoenix, Arizona]

Jukes-Cantor model
Definition 1: Let T be a fixed rooted tree with leaves labeled 1,…,n. The Jukes-Cantor model makes the following assumptions:
1. The possible states for each site are A, C, T, G.
2. The sequence length is an input parameter, and for each site the state at the root is drawn from a distribution (typically uniform).
3. The sites evolve identically and independently (i.i.d.) down the tree from the root.
[Figure: example sequences AGACTT, GGACTT, AGGCCT with a single site highlighted]

Jukes-Cantor model (cont.)
4. For each edge e = (u, v) with u the parent of v, if the state of a site at v differs from its state at u, then each of the three remaining states is equally likely at v.
[Figure: tree with edge e = (u, v) and sequences GGGCAT, AGCCCT, GCACTT, AGACTT, GGACTT, AGGCCT. The example is based on a CIPRES presentation, University of Texas at Austin.]

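Jukes-Cantor model (cont.)
5. Each edge e in the tree T has an associated Poisson random variable for the number of mutations of a randomly selected site on that edge.
6. Each edge has an expectancy, the mean of its Poisson random variable.
Multiple changes at a single site are hidden changes: comparing two present-day sequences counts at most one difference per site, however many mutations occurred there.
[Figure: sequences AGTCAC, AGTCAG, AGTCTG; seq1 = AGTCAG and seq2 = AGTCAC with the number of changes along each lineage]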

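Definition 2
Split: Removing an edge e from an unrooted phylogenetic tree T partitions the leaf set S of the tree into two non-empty sets. This partition is the split induced by e.
[Figure: example tree T with leaves 1,…,5 and edge e; S = {1,2,3,4,5}]
Definition 2: Let T be the unrooted true tree and T' the unrooted inferred tree, both with leaves labeled 1,…,n. Define $C(T)$ as the set of splits induced by the internal edges of $T$, and $C(T')$ likewise for $T'$.

Definition 2 (cont.)
Any split in $C(T) \setminus C(T')$ is called a false negative (FN).
Any split in $C(T') \setminus C(T)$ is called a false positive (FP).
An edge e of T is recovered in T' if the split induced by e appears in $C(T')$.
[Figure: true tree T and inferred tree T' on leaves 1,…,5 with internal edges e1 and e2; one edge of T is marked FN and one edge of T' is marked FP]

Definition 2 (cont.)
FN rate: $|C(T) \setminus C(T')| \,/\, |C(T)|$, the fraction of internal edges of the true tree that are not recovered.
FP rate: $|C(T') \setminus C(T)| \,/\, |C(T')|$, the fraction of internal edges of the inferred tree that are not in the true tree.
Example: with two internal edges in each tree, one FN and one FP, FN rate = 0.5 = 50% and FP rate = 0.5 = 50%.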
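
The FN and FP rates can be computed directly from the two sets of splits. The sketch below is illustrative only: it assumes each internal edge is given as one side of its split, and the helper names `normalize` and `error_rates` as well as the two example trees are hypothetical, not taken from the paper.

```python
def normalize(split, leaves):
    """Canonical form of a split A|B: the side that does NOT contain the smallest leaf."""
    side = frozenset(split)
    anchor = min(leaves)
    return frozenset(leaves) - side if anchor in side else side


def error_rates(true_splits, inferred_splits, leaves):
    """Return (FN rate, FP rate) from the internal-edge splits of T and T'."""
    c_true = {normalize(s, leaves) for s in true_splits}
    c_inf = {normalize(s, leaves) for s in inferred_splits}
    # FN: splits of the true tree missing from the inferred tree.
    fn = len(c_true - c_inf) / len(c_true) if c_true else 0.0
    # FP: splits of the inferred tree that are not in the true tree.
    fp = len(c_inf - c_true) / len(c_inf) if c_inf else 0.0
    return fn, fp


leaves = {1, 2, 3, 4, 5}
true_splits = [{1, 2}, {4, 5}]      # T:  12|345 and 123|45
inferred_splits = [{1, 2}, {3, 4}]  # T': 12|345 and 125|34
print(error_rates(true_splits, inferred_splits, leaves))  # (0.5, 0.5)
```

Storing each split as a single canonical side makes the comparison independent of which side of an edge was written down.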

Additive matrix
Definition 3: A matrix D is called additive if there exists a tree T with positive edge weighting w such that $D_{ij} = \sum_{e \in P_{ij}} w(e)$, where $P_{ij}$ is the path in T between leaves i and j.
Given an additive matrix D, the tree T can be uniquely reconstructed in polynomial time.
A dissimilarity matrix is a symmetric matrix that is 0 on the diagonal.

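True distance (reminder)
Let T be the unrooted true tree and let $P_{ij}$ be the path in T between leaves i and j. We represent the evolutionary process by a set of Poisson processes, one per edge.
[Figure: path from i to j through edges e1, e2, e3 with random variables Xe1, Xe2, Xe3]
Xij = Xe1 + Xe2 + Xe3 is the number of mutations along the path. Its expected value is called the true distance between i and j, and the matrix of true distances is an additive matrix.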

Hamming Distance
k is the sequence length.
H(i, j) is the number of sites at which sequences i and j differ; it is called the Hamming distance.
h(i, j) = H(i, j) / k is the normalized Hamming distance.
Example: s1 = CAACCCCGGT, s2 = TAATTTCGGT, H(s1, s2) = 4, k = 10, h(s1, s2) = 4/10 = 0.4
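
A minimal sketch of both quantities, assuming the two sequences are already aligned and have equal length. The function names `hamming` and `normalized_hamming` are chosen here for illustration.

```python
def hamming(s1, s2):
    """Number of sites at which two equal-length sequences differ."""
    if len(s1) != len(s2):
        raise ValueError("sequences must have the same length")
    return sum(a != b for a, b in zip(s1, s2))


def normalized_hamming(s1, s2):
    """Hamming distance divided by the sequence length k."""
    return hamming(s1, s2) / len(s1)


s1 = "CAACCCCGGT"
s2 = "TAATTTCGGT"
print(hamming(s1, s2))             # 4
print(normalized_hamming(s1, s2))  # 0.4
```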

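Distance correction
The Jukes-Cantor distance correction for each two leaves i, j is: $d_{ij} = -\frac{3}{4} \ln\left(1 - \frac{4}{3} h_{ij}\right)$, where $h_{ij}$ is the normalized Hamming distance.
If $h_{ij} \ge \frac{3}{4}$, the corrected distance is undefined. Afterwards, compute the maximum Jukes-Cantor distance, multiply that value by the number n of leaves, and replace all undefined values with the result.
Example: the 4 leaves are TCAAG (3), TTGGA (4), TTGCC (1), TGGCC (2).
[Figure: the matrix d with undefined entries marked *, followed by the replacement of * (the value 0.778 appears in this step)]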
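
The correction can be sketched as follows, assuming the standard Jukes-Cantor formula d = -3/4 ln(1 - 4h/3) on the normalized Hamming distance h, with undefined entries (h ≥ 3/4) replaced by n times the largest defined distance. The function names are hypothetical, and the output is not claimed to reproduce the matrix on the slide.

```python
import math


def jc_distance(h):
    """Jukes-Cantor corrected distance, or None when it is undefined (h >= 3/4)."""
    if h >= 0.75:
        return None
    return -0.75 * math.log(1.0 - 4.0 * h / 3.0)


def jc_matrix(seqs):
    """Corrected distance matrix for a list of aligned, equal-length sequences."""
    n = len(seqs)
    k = len(seqs[0])
    d = [[0.0] * n for _ in range(n)]
    undefined = []
    for i in range(n):
        for j in range(i + 1, n):
            h = sum(a != b for a, b in zip(seqs[i], seqs[j])) / k
            dist = jc_distance(h)
            if dist is None:
                undefined.append((i, j))
            else:
                d[i][j] = d[j][i] = dist
    # Replace undefined entries by n times the largest defined distance.
    # (If every pair is undefined this leaves zeros; a real tool would need a policy for that.)
    largest = max(max(row) for row in d)
    for i, j in undefined:
        d[i][j] = d[j][i] = n * largest
    return d


seqs = ["TTGCC", "TGGCC", "TCAAG", "TTGGA"]  # leaves 1..4 at indices 0..3
for row in jc_matrix(seqs):
    print([round(x, 3) for x in row])
```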

The error
Definition 7: Let q be a real number. […]
[Figure: example matrices for q = 3.2]

Threshold Graph
Let d be an n×n dissimilarity matrix and let q be any real number. The threshold graph Thresh(d,q) is defined as follows:
The vertex set is {1,2,…,n}.
The edges: (i,j) is an edge if and only if $d_{ij} \le q$.
[Figure: example matrix d with q = 4.5 and the resulting graph Thresh(d,4.5)]
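
A short sketch of the construction, assuming an edge (i, j) is present exactly when d[i][j] ≤ q. The matrix below is a made-up example, not the one shown on the slide.

```python
def threshold_graph(d, q):
    """Edges (i, j), i < j, of Thresh(d, q) with vertices labeled 1..n."""
    n = len(d)
    return {
        (i + 1, j + 1)
        for i in range(n)
        for j in range(i + 1, n)
        if d[i][j] <= q
    }


# Hypothetical 4x4 dissimilarity matrix (symmetric, zero diagonal).
d = [
    [0, 3, 5, 6],
    [3, 0, 4, 7],
    [5, 4, 0, 2],
    [6, 7, 2, 0],
]
print(sorted(threshold_graph(d, 4.5)))  # [(1, 2), (2, 3), (3, 4)]
```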

Triangulated graph
Definition: A graph is triangulated if no subset of its nodes induces a cycle of size four or more, that is, every cycle of length at least four has a chord.
[Figure taken from Wikipedia]

Disk Covering Method
A generic disk-covering method has four steps:
1. Decomposition: Compute a decomposition of the dataset into overlapping subsets.
2. Solution: Construct trees on the subsets using a base method.
3. Merge: Use a supertree method to merge the trees on the subsets into a tree on the full dataset.
4. Refinement: Compute the asymmetric median tree of all possible supertrees.
The example above is based on a CIPRES presentation, University of Texas at Austin.

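Simplicial elimination order
A simplicial elimination ordering is an ordering v1,…,vn of the vertices of G such that, for each i, the neighbors of vi that come later in the ordering form a clique.
Lemma: Every triangulated graph G has a simplicial elimination ordering.
Every maximal clique in G consists of some vertex vi together with its later neighbors.
This ordering can be found in polynomial time, so the maximal cliques of G can also be found in polynomial time.
[Figure: example graph]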
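
One simple way to obtain such an ordering is to repeatedly remove a vertex whose remaining neighbors form a clique. The sketch below takes that naive approach for readability; it is not the fastest method (linear-time algorithms exist), and the function name and example graph are assumptions made for illustration.

```python
def simplicial_elimination_ordering(adj):
    """adj maps each vertex to the set of its neighbors.

    Returns (ordering, maximal_cliques); raises ValueError if the graph
    is not triangulated.
    """
    remaining = set(adj)
    order, cliques = [], []
    while remaining:
        for v in sorted(remaining):
            later = adj[v] & remaining
            # v is simplicial if its remaining neighbors are pairwise adjacent.
            if all(b in adj[a] for a in later for b in later if a != b):
                break
        else:
            raise ValueError("graph is not triangulated")
        order.append(v)
        cliques.append(frozenset(later | {v}))
        remaining.remove(v)
    # Keep only the cliques that are not strictly contained in another one.
    maximal = [c for c in cliques if not any(c < other for other in cliques)]
    return order, maximal


# Hypothetical triangulated graph: two triangles sharing edge 2-3, plus edge 4-5.
adj = {1: {2, 3}, 2: {1, 3, 4}, 3: {1, 2, 4}, 4: {2, 3, 5}, 5: {4}}
order, maximal = simplicial_elimination_ordering(adj)
print(order)                         # [1, 2, 3, 4, 5]
print([sorted(c) for c in maximal])  # [[1, 2, 3], [2, 3, 4], [4, 5]]
```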

Constructing Tq
Input: a dissimilarity matrix d and a real number q > 0.
Output: the reconstructed tree Tq.
1. Compute Thresh(d,q).
2. Triangulate Thresh(d,q) (polynomial complexity).
3. Compute Buneman trees for all maximal cliques in the triangulated Thresh(d,q).
4. Merge the subtrees into a supertree.
Overall complexity: polynomial.

Intersection graph
An intersection graph is an undirected graph formed from a family of sets by creating one vertex for each set and connecting two vertices whenever the corresponding sets have a non-empty intersection.
[Figure taken from Wikipedia]

Triangulating Thresh(d,q): Complexity
Lemma: If d is an additive matrix, then Thresh(d,q) is triangulated.
Proof: Let d be an arbitrary additive matrix, and let (T,w) be the edge-weighted tree uniquely associated with d. Let q > 0. Add intermediate vertices to the edges of T and re-weight the edges so that the path lengths between leaf pairs are unchanged, but for every pair of leaves u and v in T with $d_{uv} \le q$ there is a node x in the enlarged tree T' at the midpoint of the path between u and v, so that x is within distance q/2 of both u and v.
[Figure: tree T' and a subtree of T']

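Triangulating Thresh(d,q): Complexity (cont.)
Now let $T_u$ denote the subtree of T' consisting of the nodes within distance q/2 of u. Note that $d_{uv} \le q$ if and only if $T_u \cap T_v \ne \emptyset$, and so Thresh(d,q) is identical to the intersection graph of the $T_u$ as u ranges over the leaves of T. The intersection graph of a family of subtrees of a tree is always triangulated. Consequently Thresh(d,q) is triangulated.
[Figure: tree T and the intersection graph Thresh(d,q). Taken from Wikipedia]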

Supertree Construction Algorithm (SCA)
Step 1: First obtain a simplicial elimination ordering for G and compute the sets Ci from it. For each Ci find a maximal clique C containing Ci, and compute a tree ti for Ci by deleting the leaves in C − Ci from the tree TC computed for C.
Step 2: Construct the initial tree; then for i = n−3, n−4, …, 1 compute the tree Ti formed by merging ti and the tree built in the previous iteration, using the Strict Consensus Subtree Merger method.
Example: C = {1,2,3,4}, C2 = {2,3,4}, C − C2 = {1}; deleting leaf 1 leaves a tree on {2,3,4}.

Strict Consensus Subtree Merger
This method contracts a minimum set of edges in each tree in order to make the two trees identical on the subtree induced by their shared leaves. Denote that subtree by X and call it the backbone. Merging the two trees is done by attaching the pieces of each tree appropriately to the edges of the backbone. A collision is the situation in which some piece of each tree attaches onto the same edge of the backbone.
[Figure: two trees on leaves 1,2,3,4,7 and 1,2,3,4,5,6, their backbone on leaves 1,2,3,4, and the merged tree on leaves 1,…,7]

Short Quartet Definition
Let (T,w) be a binary tree, edge-weighted by w and leaf-labeled by the set of species S. Let e be an edge in T that is not incident to a leaf of T. Around e there are four subtrees A, B, C, D. Let a, b, c, d be the leaves of the subtrees A, B, C, D respectively that are closest to e, where the distance between leaves i and j is measured as the sum of the edge weights on the path between them. We call {a,b,c,d} a short quartet around e. The short quartets around all internal edges of T together form the collection of short quartets of T.
[Figure: edge e with subtrees A, B, C, D and their closest leaves a, b, c, d]

Gsq Definition
Let D be the additive distance matrix associated with T. The graph Gsq on the vertex set S = {1,2,…,n} is defined by: (i,j) is an edge if i and j are in the same short quartet.
[Figure: examples of a tree T with leaf pairs i, j]

Proof of Tq correctness
Theorem: Let T be a leaf-labeled tree and let G be a triangulated graph that contains Gsq as a subgraph. Let the Buneman trees be computed on the maximal cliques of G, assume every tree in this collection is the correct subtree of T, and let T* be the tree obtained by applying SCA to G and this collection. Then T* = T.
Proof: We will show that under these conditions each Ti is identical to T restricted to the same leaves, and that no collision occurs.
Part I: Let T be a tree whose leaves are labeled by S. Let G be a triangulated graph on S, with one tree on leaf set A for every maximal clique A in G. Let v1,…,vn be a simplicial elimination ordering of G. We show that, for every i, Ti equals T restricted to the leaves of Ti.
Base: this is true since we assumed that all Buneman trees are correct.

Proof of Tq correctness (cont.)
Inductive step: assume the claim holds for some i. The leaves shared by the two trees being merged form the leaf set of the backbone of their strict consensus merger. Both trees equal T restricted to their own leaf sets, so they agree on the shared leaves. Consequently there is no edge contraction when we compute the backbone.
Part II: There can be a collision only if the backbone contains an edge onto which pieces of both trees attach; denote this edge by e. Thus some subtree t' of Ti attaches onto e. Let Y be the leaf set of t'. Let P be the path in T corresponding to edge e, and let its endpoints be a and b. Let T0 be the subtree of T obtained by deleting all the nodes in T that are separated from a by the deletion of b, and vice versa. Let S0 be the leaves of T0. The following are true:
1. […] and all leaves in t' are also in S0.
2. […] restricted to S0 is path connected.
3. […]

Proof of Tq correctness (cont.)
Now let P' be a path lying in […] from […] to some node in Y, and let y be the first node in Y on the path P'. By (3), […] also lies entirely in […], so […]. Consequently […]. But this contradicts the earlier assumption that […].

Experimental Results - Buneman
The FN rate of DCM-Buneman is lower than that of Buneman for every sequence length.
The FP rate of DCM-Buneman is slightly higher than that of Buneman: 3% and 0% respectively.
The FN rate of DCM-Buneman reaches 5% at sequence length 10,000; Buneman does not reach this value.

Experimental Results - NJ
The FN and FP rates of DCM-NJ are significantly lower than those of NJ.
The DCM-NJ error rate drops below 5% at sequence length 250.
DCM-NJ can reconstruct the true tree at sequence lengths beyond 900.

Distance Methods
A distance matrix D is a symmetric, non-negative matrix with zero diagonal. The goal is a phylogenetic tree T such that the distances between species in T approximate the distances in D. We now describe some distance methods.

Buneman
Input: a dissimilarity matrix d. Output: a tree T.
1. The topology on every four-leaf subset is inferred using the Four-Point Method:
Input: a 4×4 dissimilarity matrix on i, j, k, l.
Output: if dij + dkl < min{dik + djl, dil + djk}, the topology ij | kl is returned (i, j are separated from k, l by an edge e). If the smallest of the three sums is attained by more than one pairing, a star tree is returned.
[Figure: quartet tree ij | kl with internal edge e, and a star tree on i, j, k, l]
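
The Four-Point Method can be sketched as a comparison of the three pairings of the four leaves. This is an illustrative version that assumes a star tree is returned whenever the smallest sum is not unique; the function name and the `None` return value for a star are choices made here, not part of the original method description.

```python
def four_point_method(d, i, j, k, l):
    """Return the pairing ((a, b), (c, d)) meaning ab | cd, or None for a star tree."""
    sums = {
        ((i, j), (k, l)): d[i][j] + d[k][l],
        ((i, k), (j, l)): d[i][k] + d[j][l],
        ((i, l), (j, k)): d[i][l] + d[j][k],
    }
    best = min(sums.values())
    winners = [pairing for pairing, s in sums.items() if s == best]
    # A unique strict minimum gives a resolved quartet; otherwise a star.
    # Exact equality is used here; noisy data may call for a tolerance.
    return winners[0] if len(winners) == 1 else None


# Hypothetical additive matrix for the quartet 01 | 23 (all edge weights 1).
d = [
    [0, 2, 3, 3],
    [2, 0, 3, 3],
    [3, 3, 0, 2],
    [3, 3, 2, 0],
]
print(four_point_method(d, 0, 1, 2, 3))  # ((0, 1), (2, 3))
```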

Buneman (cont.)
Let Q be the set of four-leaf trees defined by the Four-Point Method. The Buneman tree is the maximally resolved tree T satisfying: for all quartets i, j, k, l, if T restricted to i, j, k, l induces a binary tree, then the tree in Q on i, j, k, l is the same binary tree.
Lemma 1: Let d be an input dissimilarity matrix and let T be the Buneman tree defined by d. Then C(T) is the set of splits (A, B) such that for all i, j in A and all k, l in B, the quartet ij | kl is in Q.
Complexity: polynomial time.
Example: A = {1,2,3}, B = {4,5}, Q = {1,2 | 4,5; 1,3 | 4,5; 2,3 | 4,5}, so C(T) = {(A,B)}.

Neighbor-Joining
Input: a distance matrix d. Output: an unrooted binary tree T.
Algorithm description: For every pair of species, the algorithm determines a score based on the distance matrix. At each step it joins the pair with the minimum score: it makes a subtree whose root replaces the two chosen species in the matrix, and the distances to this new node are recalculated. This is repeated until only two nodes remain. Finally, it connects the remaining two nodes with an edge.
Complexity: polynomial time, $O(n^3)$.
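
The description above can be sketched as follows. The slide does not spell out the score, so this version assumes the standard neighbor-joining criterion (n − 2)·d(a,b) − r(a) − r(b), where r is the row sum, together with the usual branch-length and distance-update formulas. Leaf names are assumed to be strings, and internal node names such as `u0` are invented for illustration.

```python
def neighbor_joining(d, names):
    """Return the edges (node, node, length) of the NJ tree for distance matrix d."""
    dist = {a: {b: d[i][j] for j, b in enumerate(names)} for i, a in enumerate(names)}
    nodes = list(names)
    edges = []
    next_id = 0
    while len(nodes) > 2:
        n = len(nodes)
        total = {a: sum(dist[a][b] for b in nodes) for a in nodes}
        # Score every pair; the pair with the minimum score is joined.
        _, a, b = min(
            ((n - 2) * dist[a][b] - total[a] - total[b], a, b)
            for idx, a in enumerate(nodes)
            for b in nodes[idx + 1:]
        )
        u = f"u{next_id}"
        next_id += 1
        # Branch lengths from the joined pair to the new internal node.
        la = 0.5 * dist[a][b] + (total[a] - total[b]) / (2 * (n - 2))
        lb = dist[a][b] - la
        edges.append((a, u, la))
        edges.append((b, u, lb))
        # Distances from the new node to every remaining node.
        dist[u] = {u: 0.0}
        for c in nodes:
            if c not in (a, b):
                dist[u][c] = dist[c][u] = 0.5 * (dist[a][c] + dist[b][c] - dist[a][b])
        nodes = [c for c in nodes if c not in (a, b)] + [u]
    a, b = nodes
    edges.append((a, b, dist[a][b]))  # connect the last two nodes
    return edges


# Hypothetical additive matrix: tree AB | CD, leaf edges 1, 2, 3, 4, internal edge 5.
names = ["A", "B", "C", "D"]
d = [
    [0, 3, 9, 10],
    [3, 0, 10, 11],
    [9, 10, 0, 7],
    [10, 11, 7, 0],
]
for edge in neighbor_joining(d, names):
    print(edge)
```

On an additive matrix like this one the procedure recovers the original edge lengths; on noisy data it returns a tree whose distances only approximate the input.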

THE END!