Reconstruction on trees and Phylogeny 3


Reconstruction on trees and Phylogeny 3 Elchanan Mossel, U.C. Berkeley mossel@stat.berkeley.edu, http://www.cs.berkeley.edu/~mossel/ Supported by Microsoft Research and the Miller Institute 1/18/2019

Phylogeny
“Phylogeny is the true evolutionary relationships between groups of living things.”
[Figure: family tree with node labels Noah; Shem, Japheth, Ham; Cush, Kannan, Mizraim.]

History of Phylogeny
Prehistory: “animal kingdom” or “plant kingdom.”
Intuitively: [Figure: organisms labeled “No brain, can’t move”, “Stupid, walks”, “Stupid, swims”, “Stupid, flies”, “Too smart, barely moves”.]
More scientifically: morphology, fossils, etc.
Darwin …
But: Is a human more like a great ape or like a chimpanzee?

Molecular Phylogeny
Molecular phylogeny: based on DNA, RNA or protein sequences of organisms.
Rooted / unrooted trees: evolution from a common ancestor is modeled on a rooted tree. Usually we reconstruct unrooted trees.
Mutation mechanisms: substitutions, transpositions, insertions, deletions, etc.
We will only consider substitutions and assume the sequences are aligned.
[Figure: the family tree (Noah; Shem, Japheth, Ham; Put, Cush, Kannan, Mizraim) with a DNA sequence at each node, e.g. acctga, acctaa, agctga.]

Genetic substitution models and trees
Assumption 1: Letters of sequences (“characters”) evolve independently and identically.
Assumption 2: Trees are binary: all internal degrees are 3 (bifurcating speciation; results remain valid if degrees are ≥ 3).
Given a set of species (labeled vertices) X, an X-tree is a tree which has X as the set of leaves.
Two X-trees T1 and T2 are identical if there is a graph isomorphism between T1 and T2 that is the identity map on X.
[Figure: X-trees on the leaves a, b, c, d with internal vertices u, v, w and edge matrices Me’, Me’’.]

Substitution model – finite state space
Finite set A of information values (|A| = 4 for DNA).
Tree T = (V, E) rooted at r.
Each vertex v ∈ V has information σ_v ∈ A.
Each edge e = (v, u), where v is the parent of u, has a mutation matrix M_e of size |A| × |A|: M_{i,j}(v, u) = P[σ_u = j | σ_v = i].
Will focus on the CFN model.
A character is σ = (σ_v)_{v ∈ T}. For each character σ, the data is σ_∂T = (σ_v)_{v ∈ ∂T}, where ∂T is the boundary of the tree; |∂T| = n.
We are given k independent characters σ^1_∂T, …, σ^k_∂T.

A diagram
[Figure: diagram of the setup, annotated “Length of sequence!”]
Interested to know k = #characters needed to reconstruct the tree with n = #leaves, given a range [θ_min, θ_max] for the mutation rate θ.

Phylogeny: Conjectures and results
Statistical physics                 | Phylogeny
Binary tree in ordered phase        | conj: k = O(log n)
Binary tree unordered               | conj: k = poly(n)
Percolation, critical θ = 1/2       | Random Cluster, M-Steel 2003
Ising model, critical θ: 2θ² = 1    | CFN, M-2003
Sub-critical representation         | High mutation, M-2003
Problems: How general? What is the critical point? (extremality vs. spectral)

The CFN model
Cavender-Farris-Neyman model: 2 data types, 1 and -1 (“purine-pyrimidine”).
Mutation along edge e: with probability θ(e), copy the data from the parent. Otherwise, choose 1 or -1 with probability ½ each, independently of everything else.
Thm [CFN]: Suppose that for all e, 1 - ε > θ(e) > ε > 0. Then given k characters of the process at n leaves, it is possible to reconstruct the underlying topology with probability 1 - δ, if k = n^{O(-log ε)}.
Steel 94: Trick to extend to general M_e, provided that det(M_e) ∉ [-1, -1+ε] ∪ [-ε, ε] ∪ [1 - ε, 1].
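
The mutation rule above can be simulated directly. The following is a minimal sketch, assuming a complete binary tree and a single copy probability theta shared by all edges (the slides allow a different θ(e) per edge); the function name sample_cfn_character is hypothetical. Under this rule a child agrees with its parent with probability (1 + θ(e))/2, so E[σ_v σ_u] = θ(e) across a single edge.

```python
import random

def sample_cfn_character(depth, theta, rng):
    """Sample one CFN character on a complete binary tree of the given depth.

    Assumption for this sketch: every edge uses the same copy probability
    theta. Returns the root state and the list of leaf states (+1 / -1).
    """
    root = rng.choice((1, -1))              # root state is uniform on {+1, -1}
    layer = [root]
    for _ in range(depth):
        next_layer = []
        for parent in layer:
            for _child in range(2):         # binary tree: two children per vertex
                if rng.random() < theta:
                    next_layer.append(parent)                # copy the parent
                else:
                    next_layer.append(rng.choice((1, -1)))   # fresh uniform value
        layer = next_layer
    return root, layer

rng = random.Random(1)
root, leaves = sample_cfn_character(depth=3, theta=0.9, rng=rng)
print(root, leaves)
```

Calling the function k times with the same tree gives k independent characters; only the leaf lists would be visible to a reconstruction algorithm.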

Phase transition for the CFN model
Th1 [M2003]: Suppose that n = 3 × 2^q and T is a uniformly chosen (q+1)-level 3-regular X-tree, that θ(e) < θ for all e, and that 2θ² < 1. Then in order to reconstruct the topology with probability δ > 0.1, at least k = Ω(n^{-2 log_2(θ) - 1}) characters are needed.
Proof: Information-theoretic variant of the proof for the random cluster model.
The same proof applies to any model for which the reconstruction problem is unsolvable; more formally, to models for which I(σ_r, σ_n) decays exponentially fast in n.
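
The exponent in the lower bound can be rewritten so that the role of the threshold 2θ² = 1 is visible. This is plain arithmetic on the stated bound, for 0 < θ < 1, and is not taken from the slides:

```latex
n^{-2\log_2\theta - 1}
  = 2^{(\log_2 n)\,(-2\log_2\theta - 1)}
  = \left(\frac{\theta^{-2}}{2}\right)^{\log_2 n}
  = \left(\frac{1}{2\theta^{2}}\right)^{\log_2 n}
```

So the exponent -2 log_2(θ) - 1 is positive exactly when 2θ² < 1, and in that regime the required number of characters grows polynomially in n.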

CFN Logarithmic reconstruction
Th2 [M2003]: Let T be an X-tree on n leaves such that for all e, θ_min < θ(e) < θ_max, with 2θ_min² > 1 and θ_max < 1. Then k = O(log n - log δ) characters suffice to reconstruct the topology with probability 1 - δ.
Need either a “balanced tree”: all leaves at the same distance from a root.
Or a “molecular clock”: θ(e) = e^{-t(e)}, where t(e) is the time interval between the two endpoints of the edge e, and all leaves are at the same time.

Main Lemma [M2003]
Lemma: Suppose that 2θ_min² > 1. Then there exist an L and an α > 0 such that the CFN model on the binary tree of L levels with
- θ(e) ≥ θ_min for all e not adjacent to ∂T,
- θ(e) ≥ α θ_min for all e adjacent to ∂T,
satisfies E[σ_r Maj(σ_∂)] ≥ α.
Roughly: given boundary data of “quality α”, we can reconstruct the root data with “quality α”.
In phylogeny, we can treat known pieces of the tree as vertices.
Main problem: how to reconstruct pieces of the tree?
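
The quantity E[σ_r Maj(σ_∂)] in the lemma can be explored numerically. The sketch below is a Monte Carlo estimate under assumptions that are not in the slides: a complete binary tree, one copy probability for inner edges and one for boundary edges, and ties in the majority counted as 0. All function names are hypothetical.

```python
import random

def sample_root_and_leaves(levels, theta_inner, theta_boundary, rng):
    """One CFN sample on a complete binary tree with the given number of levels.

    Edges adjacent to the leaves use theta_boundary; all other edges use
    theta_inner (an assumption of this sketch).
    """
    root = rng.choice((1, -1))
    layer = [root]
    for depth in range(levels):
        theta = theta_boundary if depth == levels - 1 else theta_inner
        next_layer = []
        for parent in layer:
            for _child in range(2):
                copied = rng.random() < theta
                next_layer.append(parent if copied else rng.choice((1, -1)))
        layer = next_layer
    return root, layer

def majority(values):
    """Sign of the sum of +1/-1 values; ties give 0 (an assumption)."""
    total = sum(values)
    return (total > 0) - (total < 0)

def estimate_quality(levels, theta_inner, theta_boundary, samples, seed=0):
    """Monte Carlo estimate of E[root * Maj(leaves)]."""
    rng = random.Random(seed)
    acc = 0
    for _ in range(samples):
        root, leaves = sample_root_and_leaves(levels, theta_inner, theta_boundary, rng)
        acc += root * majority(leaves)
    return acc / samples

print(estimate_quality(levels=6, theta_inner=0.9, theta_boundary=0.5, samples=2000))
```

Varying theta_inner around the value where 2θ² = 1 gives a rough empirical feel for the condition in the lemma, although a simulation of this kind proves nothing.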

Metric spaces on trees
Let D be a positive function on the edges E. Define D(u, v) = Σ {D(e) : e ∈ path(u, v)}.
Claim: Given D(v, u) for all v and u in ∂T, it is possible to reconstruct the topology of T.
Proof: It suffices to find d(u, v) for all u, v ∈ ∂T, where d is the graph metric distance.
d(u1, u2) = 2 iff for all w1 and w2 it holds that D’(u1, u2, w1, w2) := D(u1, w1) + D(u2, w2) - D(u1, u2) - D(w1, w2) ≥ 0 (“four point condition”).
[Figure: two quartet trees on the leaves u1, u2, w1, w2.]
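
The sibling test in the four point condition can be checked on a small example. The sketch below assumes the tree is given as a dictionary of weighted edges (a representation chosen here for illustration) and that exact leaf-to-leaf distances are available; the function names are hypothetical. It computes D on the leaves and reports every pair (u1, u2) for which D’(u1, u2, w1, w2) ≥ 0 for all other leaves w1, w2.

```python
from itertools import combinations

def leaf_distances(edges):
    """edges: {(u, v): length}. Returns (sorted leaves, D) where
    D[(leaf, x)] is the path length from the leaf to any vertex x."""
    adj = {}
    for (u, v), w in edges.items():
        adj.setdefault(u, []).append((v, w))
        adj.setdefault(v, []).append((u, w))
    leaves = sorted(x for x in adj if len(adj[x]) == 1)
    D = {}
    for src in leaves:
        stack = [(src, None, 0.0)]          # depth-first walk from each leaf
        while stack:
            node, parent, dist = stack.pop()
            D[(src, node)] = dist
            for nb, w in adj[node]:
                if nb != parent:
                    stack.append((nb, node, dist + w))
    return leaves, D

def d_prime(D, u1, u2, w1, w2):
    """D'(u1,u2,w1,w2) = D(u1,w1) + D(u2,w2) - D(u1,u2) - D(w1,w2)."""
    return D[(u1, w1)] + D[(u2, w2)] - D[(u1, u2)] - D[(w1, w2)]

def sibling_pairs(leaves, D, tol=1e-9):
    """Leaf pairs passing the four point test against all other leaf pairs."""
    pairs = []
    for u1, u2 in combinations(leaves, 2):
        others = [w for w in leaves if w not in (u1, u2)]
        if all(d_prime(D, u1, u2, w1, w2) >= -tol
               for w1 in others for w2 in others if w1 != w2):
            pairs.append((u1, u2))
    return pairs

# Example tree: leaves a, b hang off x; c hangs off y; d, e hang off z.
edges = {("a", "x"): 1.0, ("b", "x"): 2.0, ("x", "y"): 0.5,
         ("c", "y"): 1.0, ("y", "z"): 1.5, ("d", "z"): 1.0, ("e", "z"): 3.0}
leaves, D = leaf_distances(edges)
print(sibling_pairs(leaves, D))
```

The tolerance tol only guards against floating-point rounding here; with estimated distances it would have to reflect the estimation accuracy instead.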

Metric spaces on trees
Continue by replacing known sub-trees T on vertices (v1, …, vr) by a single vertex v.
The distance between (v1, …, vr) and (u1, …, us) is defined as d(v1, u1).
D’(u1, u2, w1, w2) > 0 ⇒ D’(u1, u2, w1, w2) ≥ 2 min_e D(e).
It suffices to have D with accuracy min_e D(e)/4.

Metric spaces on trees
Let T be a balanced tree. The L-topology of T is d*(u, v) := min{d(u, v), 2L}.
Claim: If T is balanced, then in order to recover the L-topology of T it suffices to have:
- for each leaf u of T, a set U(u) containing all elements at distance ≤ 2L+2 from u;
- for all u and all w1, w2, w3, w4 ∈ U(u), the sign of D’(w1, w2, w3, w4).
“Proof”: If d(u1, u2) > 2, then either u2 is not in U(u1), or, letting v be a sister of u1 and v’ a cousin of v, D’(u1, v, u2, v’) > 0, so we have a witness that u1 and u2 are not siblings.
[Figure: tree showing the leaves u1, v, v’, u2.]

Proof of CFN theorem
Define D(e) = -log θ(e). Then D(u, v) = -log(Cov(v, u)), where Cov(v, u) = E[σ_v σ_u].
Estimate Cov(v, u) by Cor(v, u), where Cor(v, u) = (1/k) Σ_{i=1..k} σ^i_v σ^i_u is the empirical average over the k characters.
Need D with accuracy m = min D(e)/4 = c, or Cor = (1 ± c) Cov.
Cor(v, u) is an average of k i.i.d. ±1 variables with expected value Cov(v, u).
Cov(v, u) may be as small as ε^{2 depth(T)} = n^{-O(-log ε)}.
Given k = n^{Ω(-log ε)} characters, it is possible to estimate D and therefore reconstruct T with high probability.
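
The distance estimate used in this proof can be sketched as follows. The example assumes the simplest possible setting, two leaves u and v attached to a common parent with copy probabilities theta_u and theta_v, so that Cov(v, u) = theta_u · theta_v and D(u, v) = -log theta_u - log theta_v. The function names, and the choice to return infinity when the empirical correlation is not positive, are assumptions of this sketch.

```python
import math
import random

def cfn_step(state, theta, rng):
    """Copy the state with probability theta, else draw a uniform +1/-1."""
    return state if rng.random() < theta else rng.choice((1, -1))

def empirical_correlation(xs, ys):
    """Average of the products x_i * y_i over the k characters."""
    return sum(x * y for x, y in zip(xs, ys)) / len(xs)

def estimated_distance(xs, ys):
    """Estimate D(u, v) = -log Cov(u, v) from samples at the two leaves."""
    cor = empirical_correlation(xs, ys)
    if cor <= 0:
        return math.inf      # correlation not resolved with this many samples
    return -math.log(cor)

def sample_cherry(theta_u, theta_v, k, seed=0):
    """k independent characters observed at two sibling leaves."""
    rng = random.Random(seed)
    us, vs = [], []
    for _ in range(k):
        parent = rng.choice((1, -1))
        us.append(cfn_step(parent, theta_u, rng))
        vs.append(cfn_step(parent, theta_v, rng))
    return us, vs

us, vs = sample_cherry(0.8, 0.8, k=20000)
print(estimated_distance(us, vs), -2 * math.log(0.8))
```

For distant leaves the true covariance is much smaller, which is why the number of characters needed by this approach grows polynomially with n.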

Reconstructing the topology [M2003]
The algorithm: repeat the following.
- Reconstruct the topology up to ℓ levels from the boundary using the 4-points method.
- For each sample, reconstruct the data ℓ levels from the boundary using the majority algorithm.
[Figure: tree with leaf data + - - +.]
Reconstruction near the boundary takes O(log n) samples.
By the main lemma, the quality stays above α.

Proving main Lemma
Need to estimate E[σ_r Maj(σ_∂)]. The estimate has two parts:
Case 1: For all e adjacent to ∂T, θ(e) is small. Here we use a perturbation argument, i.e. estimate partial derivatives of E[σ_r Maj(σ_∂)] with respect to various variables (using something like Russo’s formula).
Case 2: Some e adjacent to ∂T has large θ(e). Use percolation theory arguments.
Both cases use isoperimetric estimates for the discrete cube.