Reconstruction on trees and Phylogeny 3

1 Reconstruction on trees and Phylogeny 3
Elchanan Mossel, U.C. Berkeley Supported by Microsoft Research and the Miller Institute 1/18/2019

2 Phylogeny
“Phylogeny is the true evolutionary relationships between groups of living things”
[Figure: family tree with nodes Noah; Shem, Japheth, Ham; Cush, Kannan, Mizraim]

3 History of Phylogeny
Prehistory: “animal kingdom” or “plant kingdom.”
Intuitively: [Figure: organisms sorted by the labels “No brain, can’t move”; “Stupid, walks”; “Stupid, swims”; “Stupid, flies”; “Too smart, barely moves”]
More scientifically: morphology, fossils, etc.
Darwin …
But: Is a human more like a great ape or like a chimpanzee?

4 Molecular Phylogeny
Molecular Phylogeny: Based on DNA, RNA or protein sequences of organisms.
Rooted / Unrooted trees: Evolution from common ancestor modeled on a rooted tree. Usually reconstruct unrooted trees.
Mutation mechanisms: substitutions, transpositions, insertions, deletions, etc.
Will only consider substitutions and assume sequences are aligned.
[Figure: the family tree (Noah; Shem, Japheth, Ham; Put, Cush, Kannan, Mizraim) with a DNA sequence at each node: acctga, acctaa or agctga]

5 Genetic substitution models and trees
Assumption 1: Letters of sequences (“characters”) evolve independently and identically.
Assumption 2: Trees are binary: all internal degrees are 3 (bifurcating speciation; results valid if degrees are ≥ 3).
Given a set of species (labeled vertices) X, an X-tree is a tree which has X as the set of leaves.
Two X-trees T1 and T2 are identical if there’s a graph isomorphism between T1 and T2 that is the identity map on X.
[Figure: example X-trees on leaves a, b, c, d, with internal vertices u, v, w and edge matrices M_e’, M_e’’]

6 Substitution model – finite state space
Finite set A of information values (|A| = 4 for DNA).
Tree T = (V, E) rooted at r.
Vertex v ∈ V has information σ_v ∈ A.
Edge e = (v, u), where v is the parent of u, has a mutation matrix M_e of size |A| × |A|: M_{i,j}(v, u) = P[σ_u = j | σ_v = i].
Will focus on the CFN model.
A character is σ = (σ_v)_{v ∈ T}.
For each character σ, the data is σ_∂T = (σ_v)_{v ∈ ∂T}, where ∂T is the boundary of the tree; |∂T| = n.
We are given k independent characters σ^1_∂T, …, σ^k_∂T.

7 A diagram
[Figure: diagram, annotated “Length of sequence!”]
Interested to know k = #characters needed to reconstruct the tree with n = #leaves, given a range [θ_min, θ_max] for mutation rate θ.

8 Phylogeny: Conjectures and results
Statistical physics                  | Phylogeny
Binary tree in ordered phase         | conj. k = O(log n)
Binary tree unordered                | conj. k = poly(n)
Percolation, critical θ = 1/2        | Random Cluster, M-Steel2003
Ising model, critical θ: 2θ² = 1     | CFN, M-2003
Sub-critical representation          | High mutation, M-2003
Problems: How general? What is the critical point? (extremality vs. spectral)

9 The CFN model
Cavender-Farris-Neyman model:
2 data types: 1 and -1 (“purine-pyrimidine”).
Mutation along edge e: with probability θ(e) copy data from parent. Otherwise, choose 1/-1 with probability ½ independently of everything else.
Thm [CFN]: Suppose that for all e, 1 - ε > θ(e) > ε > 0. Then given k characters of the process at n leaves, it is possible to reconstruct the underlying topology with probability 1 - δ, if k = n^{O(-log ε)}.
Steel 94: Trick to extend to general M_e provided that det(M_e) ∉ [-1, -1 + ε] ∪ [-ε, ε] ∪ [1 - ε, 1].
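The mutation rule above can be sketched in a few lines of Python. This is only an illustrative sketch: the function name `simulate_cfn`, the `children` and `theta` dictionaries, the 4-leaf example tree and the uniform root state are all assumptions made for the example, not part of the slides.

```python
import random

def simulate_cfn(children, root, theta, rng):
    """Sample one CFN character on a rooted tree.

    children: dict vertex -> list of children (leaves map to []).
    theta:    dict (parent, child) -> probability of copying the parent's state.
    Returns a dict vertex -> state in {+1, -1}.
    """
    sigma = {root: rng.choice([1, -1])}  # assumption: uniform root state
    stack = [root]
    while stack:
        v = stack.pop()
        for u in children[v]:
            if rng.random() < theta[(v, u)]:
                sigma[u] = sigma[v]             # copy data from parent
            else:
                sigma[u] = rng.choice([1, -1])  # fresh uniform +-1 state
            stack.append(u)
    return sigma

# Hypothetical 4-leaf tree with the same copy probability on every edge.
children = {"r": ["u", "w"], "u": ["a", "b"], "w": ["c", "d"],
            "a": [], "b": [], "c": [], "d": []}
theta = {(p, c): 0.8 for p, cs in children.items() for c in cs}

rng = random.Random(0)
sample = simulate_cfn(children, "r", theta, rng)
leaf_data = {v: s for v, s in sample.items() if not children[v]}
print(leaf_data)
```

Calling `simulate_cfn` k times gives k independent characters; only the leaf entries would be visible to a reconstruction algorithm.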

10 Phase transition for the CFN model
Th1 [M2003]: Suppose that n = 3 × 2^q and T is a uniformly chosen (q+1)-level 3-regular X-tree, that θ(e) < θ for all e, and that 2θ² < 1. Then in order to reconstruct the topology with probability δ > 0.1, at least k = Ω(n^{-2 log_2(θ) - 1}) characters are needed.
Proof: Information theoretic variant of the proof for random cluster model.
Same proof applies to any model for which the reconstruction problem is unsolvable; more formally, for models for which I(σ_r, σ_n) decays exponentially fast in n.
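To get a feel for the lower bound, the exponent -2 log_2(θ) - 1 can be evaluated for a few values of the copy probability (written `theta` here). The helper names below are hypothetical and serve only this worked example.

```python
import math

def lower_bound_exponent(theta):
    """Exponent a in k = Omega(n**a), with a = -2*log2(theta) - 1."""
    return -2.0 * math.log2(theta) - 1.0

def in_high_mutation_regime(theta):
    """True when 2*theta**2 < 1, the hypothesis of the lower bound."""
    return 2.0 * theta ** 2 < 1.0

print(in_high_mutation_regime(0.5))   # True
print(lower_bound_exponent(0.5))      # 1.0, so k must grow at least linearly in n
print(in_high_mutation_regime(0.8))   # False: the lower bound does not apply
```

The exponent is positive exactly when 2θ² < 1 and tends to 0 as θ approaches 1/√2.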

11 CFN Logarithmic reconstruction
Th2 [M2003]: If T is an X-tree on n leaves s.t. for all e, θ_min < θ(e) < θ_max, with 2θ_min² > 1 and θ_max < 1, then k = O(log n - log δ) characters suffice to reconstruct the topology with probability 1 - δ.
Need either a “balanced tree”: all leaves at the same distance from a root.
Or a “molecular clock”: θ(e) = e^{-t(e)}, where t(e) is the time interval between the two endpoints of the edge e, and all leaves are at the same time.

12 Main Lemma [M2003]
Lemma: Suppose that 2θ_min² > 1. Then there exist an L and an ε > 0 such that the CFN model on the binary tree of L levels with
θ(e) ≥ θ_min for all e not adjacent to ∂T,
θ(e) ≥ ε θ_min for all e adjacent to ∂T,
satisfies E[σ_r Maj(σ_∂)] ≥ ε.
Roughly, given boundary data of “quality ≥ ε”, we can reconstruct the root data with “quality ≥ ε”.
In phylogeny: can treat known pieces of the tree as vertices.
Main problem: how to reconstruct pieces of the tree?
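The quantity in the lemma, the correlation between the root state and the majority vote of the boundary, can be estimated by simulation. The sketch below is a simplification and not the lemma's exact setting: it uses the same copy probability `theta` on every edge of a complete binary tree, breaks ties in the majority at random, and the function name `root_majority_correlation` is made up for the example.

```python
import random

def root_majority_correlation(levels, theta, trials, rng):
    """Monte Carlo estimate of E[sigma_root * Maj(sigma_leaves)]."""
    total = 0
    for _ in range(trials):
        root = rng.choice([1, -1])
        layer = [root]
        for _ in range(levels):
            nxt = []
            for s in layer:
                for _ in range(2):  # two children per vertex
                    if rng.random() < theta:
                        nxt.append(s)                    # copy parent
                    else:
                        nxt.append(rng.choice([1, -1]))  # fresh uniform state
            layer = nxt
        m = sum(layer)
        maj = 1 if m > 0 else -1 if m < 0 else rng.choice([1, -1])
        total += root * maj
    return total / trials

rng = random.Random(0)
print(root_majority_correlation(6, 0.9, 2000, rng))  # 2*theta^2 = 1.62 > 1
print(root_majority_correlation(6, 0.5, 2000, rng))  # 2*theta^2 = 0.50 < 1
```

In runs of this kind the first estimate should stay well away from 0 while the second is close to 0, in line with the threshold 2θ² = 1.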

13 Metric spaces on trees
Let D be a positive function on the edges E.
Define D(u, v) = Σ {D(e) : e ∈ path(u, v)}.
Claim: Given D(v, u) for all v and u in ∂T, it is possible to reconstruct the topology of T.
Proof: Suffices to find d(u, v) for all u, v ∈ ∂T, where d is the graph metric distance.
d(u1, u2) = 2 iff for all w1 and w2 it holds that
D’(u1, u2, w1, w2) := D(u1, w1) + D(u2, w2) - D(u1, u2) - D(w1, w2) ≥ 0 (“Four point condition”).
[Figure: quartet trees on u1, u2, w1, w2]
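The four point condition translates directly into a test for sibling leaves. The following is a minimal sketch on a hypothetical 5-leaf tree with made-up edge lengths; the helper names (`path_metric`, `d_prime`, `are_siblings`) and the numerical tolerance are assumptions of the example.

```python
from itertools import combinations

def path_metric(edges):
    """All-pairs path distances D(u, v) on a tree given as {(x, y): D(e)}."""
    adj = {}
    for (x, y), w in edges.items():
        adj.setdefault(x, []).append((y, w))
        adj.setdefault(y, []).append((x, w))
    D = {}
    for s in adj:
        dist = {s: 0.0}
        stack = [s]
        while stack:
            x = stack.pop()
            for y, w in adj[x]:
                if y not in dist:
                    dist[y] = dist[x] + w
                    stack.append(y)
        for t, d in dist.items():
            D[(s, t)] = d
    return D

def d_prime(D, u1, u2, w1, w2):
    """D'(u1,u2,w1,w2) = D(u1,w1) + D(u2,w2) - D(u1,u2) - D(w1,w2)."""
    return D[(u1, w1)] + D[(u2, w2)] - D[(u1, u2)] - D[(w1, w2)]

def are_siblings(D, leaves, u1, u2, tol=1e-9):
    """u1, u2 are siblings iff D' >= 0 for every ordered pair of other leaves."""
    others = [x for x in leaves if x not in (u1, u2)]
    return all(d_prime(D, u1, u2, w1, w2) >= -tol
               for w1 in others for w2 in others if w1 != w2)

# Hypothetical tree: leaves a..e, internal vertices x, y, z.
edges = {("a", "x"): 1.0, ("b", "x"): 2.0, ("x", "y"): 0.5,
         ("c", "y"): 1.5, ("y", "z"): 0.7, ("d", "z"): 1.0, ("e", "z"): 3.0}
leaves = ["a", "b", "c", "d", "e"]
D = path_metric(edges)
cherries = [p for p in combinations(leaves, 2) if are_siblings(D, leaves, *p)]
print(cherries)  # [('a', 'b'), ('d', 'e')]
```

Only the leaf-to-leaf entries of `D` are used by `are_siblings`, matching the claim that boundary distances suffice.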

14 Metric spaces on trees
Continue by replacing known sub-trees T on vertices (v1,…,vr) by a single vertex v.
The distance between (v1,…,vr) and (u1,…,us) is defined as d(v1, u1).
D’(u1, u2, w1, w2) > 0 ⇒ D’(u1, u2, w1, w2) ≥ 2 min_e D(e).
Suffices to have D with accuracy min_e D(e)/4.

15 Metric spaces on trees
Let T be a balanced tree.
The L-topology of T is d*(u, v) := min{d(u, v), 2L}.
Claim: If T is balanced, then in order to recover the L-topology of T it suffices to have:
For each leaf u of T, a set U(u) containing all elements at distance ≤ 2L+2 from u.
For all u and all w1, w2, w3, w4 ∈ U(u), the sign of D’(w1, w2, w3, w4).
“Proof”: If d(u1, u2) > 2, then either u2 is not in U(u1), or: let v be a sister of u1 and v’ a cousin of v; then D’(u1, v, u2, v’) > 0, so we have a witness that u1 and u2 are not siblings.
[Figure: leaves u1, v, v’, u2]

16 Proof of CFN theorem
Define D(e) = -log θ(e).
D(u, v) = -log(Cov(v, u)), where Cov(v, u) = E[σ_v σ_u].
Estimate Cov(v, u) by the empirical correlation Cor(v, u) = (1/k) Σ_{t=1..k} σ^t_v σ^t_u.
Need D with accuracy m = min D(e)/4 = c, or Cor = (1 ± c) Cov.
Cor(v, u) is a normalized sum of k i.i.d. ±1 variables with expected value Cov(v, u).
Cov(v, u) may be as small as ε^{2 depth(T)} = n^{-O(-log ε)}.
Given k = n^{Ω(-log ε)} characters, it is possible to estimate D and therefore reconstruct T with high probability.
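The estimation step can be sketched as follows: average the product of the two leaf states over the k characters and take minus the logarithm. The sample-size helper uses the standard Hoeffding bound for averages of ±1 variables, which is an assumption added here to illustrate why k scales like Cov^{-2}; the slides do not say which concentration bound is used. All names, including the toy generator `sample_pair`, are hypothetical.

```python
import math
import random

def empirical_correlation(samples, u, v):
    """Average of sigma_u * sigma_v over the characters (dicts leaf -> +-1)."""
    return sum(s[u] * s[v] for s in samples) / len(samples)

def estimated_distance(samples, u, v):
    """Plug-in estimate of D(u, v) = -log Cov(u, v); inf if the estimate is <= 0."""
    cor = empirical_correlation(samples, u, v)
    return -math.log(cor) if cor > 0 else math.inf

def samples_needed(cov, c, delta):
    """Hoeffding: k >= 2*ln(2/delta)/(c*cov)**2 gives |Cor - Cov| <= c*Cov
    with probability at least 1 - delta."""
    return math.ceil(2.0 * math.log(2.0 / delta) / (c * cov) ** 2)

def sample_pair(theta_path, rng):
    """Toy generator: two leaves whose connecting path has total copy probability theta_path."""
    su = rng.choice([1, -1])
    sv = su if rng.random() < theta_path else rng.choice([1, -1])
    return {"u": su, "v": sv}

rng = random.Random(0)
theta_path = 0.8 * 0.8  # two edges with theta = 0.8 each
samples = [sample_pair(theta_path, rng) for _ in range(20000)]
print(estimated_distance(samples, "u", "v"), -math.log(theta_path))
print(samples_needed(theta_path, 0.1, 0.05), samples_needed(0.01, 0.1, 0.05))
```

The second call to `samples_needed` shows how quickly the requirement grows once the covariance between distant leaves becomes small.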

17 Reconstructing the topology [M2003]
The algorithm: Repeat the following:
Reconstruct the topology up to l levels from the boundary using the 4-points method.
For each sample, reconstruct the data l levels from the boundary using the majority algorithm.
[Figure: boundary data + - - +]
Reconstruction near the boundary takes O(log n) samples.
By the main lemma, quality stays above ε.
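The majority step can be illustrated on an already reconstructed piece of the tree: the state of an internal vertex is guessed as the majority of the leaf states below it. This is a sketch under assumptions, with a hypothetical `children` dictionary for the known subtree and a fixed tie-breaking value; the slides do not specify how ties are handled.

```python
def leaves_below(children, v):
    """All leaves in the subtree rooted at v (leaves map to [])."""
    if not children[v]:
        return [v]
    out = []
    for c in children[v]:
        out.extend(leaves_below(children, c))
    return out

def majority_estimate(children, v, leaf_state, tie=1):
    """Guess the state at v as the majority of the +-1 leaf states below it."""
    total = sum(leaf_state[x] for x in leaves_below(children, v))
    return 1 if total > 0 else -1 if total < 0 else tie

# Hypothetical known subtree and one sample of leaf data.
children = {"r": ["u", "w"], "u": ["a", "b"], "w": ["c", "d"],
            "a": [], "b": [], "c": [], "d": []}
leaf_state = {"a": 1, "b": 1, "c": -1, "d": 1}

print(majority_estimate(children, "u", leaf_state))  # 1
print(majority_estimate(children, "r", leaf_state))  # 1
```

Applied to every sample, these guesses play the role of the data at the new boundary in the next round of the 4-points method.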

18 Proving main Lemma
Need to estimate E[σ_r Maj(σ_∂)]. Estimate has two parts:
Case 1: For all e adjacent to ∂T, θ(e) is small. Here we use a perturbation argument, i.e. estimate partial derivatives of E[σ_r Maj(σ_∂)] with respect to various variables (using something like Russo's formula).
Case 2: Some e adjacent to ∂T has large θ(e). Use percolation theory arguments.
Both cases use isoperimetric estimates for the discrete cube.

