Reconstruction on trees and Phylogeny 1 Elchanan Mossel, U.C. Berkeley, Supported by Microsoft Research and the Miller Institute 9/18/2018
General plan Study stochastic processes on bounded-degree trees. Vertices of the tree T are labeled by random variables. We are interested in asymptotic problems where $|T| \to \infty$.
The reconstruction problem We discuss two related problems. In both, we want to reconstruct/estimate unknown parameters from observations. The first is the “reconstruction problem”. Here we are given the tree and the values of the random variables at the leaves. We want to reconstruct the value of the random variable at a specific vertex (the “root”). Algorithmically “easy”, but when does it “work”?
Phylogeny Here the tree is unknown. We are given a sequence of collections of random variables at the leaves (“species”). The collections are i.i.d.! We want to reconstruct the (un-rooted) tree.
Phylogeny Algorithmically “hard”.
Lecture plan
Talk 1 [GW 19th cent.; M-Steel, 2003]: Introduction. The “random cluster” model, reconstruction. The random cluster model, phylogeny.
Talk 2 [Hi77, EKPS2000, M1998, MSW2003]: The Ising = CFN model, reconstruction.
Talk 3 [M2003]: The Ising = CFN model, phylogeny.
Talk 4 [M2003, $\geq$ 2004]: General Markov model. Open problems etc.
Trees (3-)regular trees. Binary: all internal degrees are 3 (bifurcating speciation; results remain valid if degrees are $\geq 3$, or $\geq b+1$). General trees.
Trees In biology, all internal degrees are $\geq 3$. Given a set of species (labeled vertices) X, an X-tree is a tree which has X as its set of leaves. Two X-trees $T_1$ and $T_2$ are identical if there is a graph isomorphism between $T_1$ and $T_2$ that is the identity map on X. [Figure: example trees with internal vertices u, v, w, edge labels Me′ and Me″, and leaves a, b, c, d.]
The “random cluster” model Infinite set A of colors. In “real life”: large |A|, e.g. gene order. Defined on an un-rooted tree T = (V, E). Edge e has (non-mutation) probability $\theta(e)$. Character: perform percolation, where edge e is open with probability $\theta(e)$. All the vertices v in the same open cluster have the same color $\sigma(v)$. Different clusters get different colors. This is the “random cluster” model.
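The percolation step that defines a character can be sketched in a few lines. This is only an illustration under stated assumptions: the function name `sample_character`, the edge-list input, and the use of a union-find forest are choices made here, not part of the slides; `theta[(u, v)]` stands for the open (non-mutation) probability of edge (u, v).

```python
import random

def sample_character(n_vertices, edges, theta, rng=None):
    """Sample one character of the random cluster model (illustrative sketch).

    edges: list of (u, v) pairs on vertices 0..n_vertices-1, assumed to form a tree.
    theta: dict mapping each edge (u, v) to its open (non-mutation) probability.
    Returns a list `color` in which color[v] identifies the open cluster of v.
    """
    rng = rng or random.Random()
    parent = list(range(n_vertices))  # union-find forest over the vertices

    def find(v):
        while parent[v] != v:
            parent[v] = parent[parent[v]]  # path halving
            v = parent[v]
        return v

    for (u, v) in edges:
        if rng.random() < theta[(u, v)]:  # edge is open: merge the two clusters
            parent[find(u)] = find(v)

    # The cluster representative serves as the color, so distinct clusters
    # automatically receive distinct colors.
    return [find(v) for v in range(n_vertices)]
```

Using the cluster representative as the color avoids drawing from an infinite color set; only the partition of the vertices into clusters matters for the arguments that follow.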
Galton-Watson
Galton-Watson Theorem For the random cluster model on a rooted binary tree: If $\theta(e) > \tfrac12 + \epsilon$ for all e, then for all $v \in T$, with probability at least $s(\epsilon) = 2\epsilon / (\tfrac12 + \epsilon)^2$, there exists $u \in \partial T$ (below v) with $\sigma(v) = \sigma(u)$. If $\theta(e) < \tfrac12 - \epsilon$ for all e, then the probability that such $u \in \partial T$ exists is at most $3 (1 - 2\epsilon)^{d(v, \partial T)}$.
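Both halves of the theorem can be checked numerically in the simplest case where every edge is open with the same probability p. The sketch below is an illustration, not the slides' proof; the function names are hypothetical. It iterates the standard recursion $s_{d+1} = 1 - (1 - p s_d)^2$ for the probability that an open path from a vertex reaches depth d in a rooted binary tree, and compares it with the closed form $2\epsilon/(\tfrac12+\epsilon)^2$ at $p = \tfrac12 + \epsilon$.

```python
def survival_formula(eps):
    """Closed form 2*eps / (1/2 + eps)^2 for the limiting survival probability."""
    p = 0.5 + eps
    return 2 * eps / p**2

def survival_by_depth(p, depth):
    """Probability that an open path reaches depth `depth` below a vertex
    in a rooted binary tree whose edges are open independently w.p. p."""
    s = 1.0  # depth 0: the vertex itself is already at the target level
    for _ in range(depth):
        # Each of the two children fails independently with probability 1 - p*s.
        s = 1.0 - (1.0 - p * s) ** 2
    return s
```

For p above 1/2 the iterates decrease to the closed form; for p below 1/2 they are bounded by $(2p)^d$, which matches the $(1 - 2\epsilon)^d$ decay in the theorem up to the constant factor.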
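Reconstruction on random clusters For the random cluster model on a rooted binary tree: If $\theta(e) > \tfrac12 + \epsilon$ for all e, then for all $v \in T$, we may reconstruct $\sigma(v)$ with probability $\geq (\tfrac12 + \epsilon)^2 s(\epsilon)^2$ from $\partial T$ (below v). Proof: if both edges from v to its children are open and the open cluster of each child reaches $\partial T$, then two leaves below different children of v share a color, and that color is $\sigma(v)$. If $\theta(e) < \tfrac12 - \epsilon$ for all e, then the probability of reconstructing $\sigma(v)$ is $\leq 3 (1 - 2\epsilon)^{d(v, \partial T)}$. Proof: True even given more info (open/closed edges).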
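A minimal sketch of the reconstruction rule: look for a color that appears among the leaves below both children of v. The function names are hypothetical, and the bound computed by `success_lower_bound` assumes the slide's expression reads $(\tfrac12+\epsilon)^2 s(\epsilon)^2$ with $s(\epsilon) = 2\epsilon/(\tfrac12+\epsilon)^2$.

```python
def reconstruct_root_color(left_leaf_colors, right_leaf_colors):
    """Return a color seen below both children of v, or None if there is none.

    Any cluster with leaves below both children must contain v itself,
    so a shared color (when one exists) is the color of v.
    """
    common = set(left_leaf_colors) & set(right_leaf_colors)
    return next(iter(common)) if common else None

def success_lower_bound(eps):
    """Assumed reading of the slide's bound: (1/2 + eps)^2 * s(eps)^2."""
    p = 0.5 + eps
    s = 2 * eps / p**2
    return p**2 * s**2
```

Since at most one cluster contains v, the intersection has at most one element, so the choice made by `next(iter(common))` is unambiguous.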
Phylogeny from log characters for R.C. Th1 [M-Steel, 2003]: Suppose that T is an X-tree on n leaves and for all e, $\tfrac12 + \epsilon < \theta(e) < 1 - \epsilon$. Then $k = (2 \log n - \log \delta)/165 = O(\log n - \log \delta)$ characters suffice to reconstruct the topology with probability $\geq 1 - \delta$. [Figure: colors of leaves.] Definition: A cherry is a pair of leaves at distance 2. Fact: Every X-tree has at least one cherry.
Testing cherries If x, y is a cherry then there exist no character $\sigma$ and leaves $x', y' \in \partial T - \{x, y\}$ s.t. $\sigma(x) = \sigma(x') \neq \sigma(y) = \sigma(y')$. If x, y is not a cherry then for each character $\sigma$, the probability that $\exists\, x', y' \in \partial T - \{x, y\}$ s.t. $\sigma(x) = \sigma(x') \neq \sigma(y) = \sigma(y')$ is at least $r = s^2/16$, where $s(\epsilon) = 2\epsilon / (\tfrac12 + \epsilon)^2$. Repeating for k characters, we may find all cherries with error probability bounded by $n^2 (1 - r)^k$. [Figures: leaves x, y, x′, y′ in the cherry and non-cherry cases.]
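The cherry test and the resulting count of characters can be sketched as follows. This is an illustration under assumptions: each character is represented as a dict from leaf name to color, the function names are hypothetical, and `characters_needed` simply solves $n^2 (1 - r)^k \leq \delta$ for the smallest integer k, with r and $\delta$ supplied by the caller.

```python
import math

def passes_cherry_test(x, y, leaves, characters):
    """Return False if some character witnesses that {x, y} is not a cherry.

    A witness is a character sigma and leaves x', y' outside {x, y} with
    sigma(x) = sigma(x') != sigma(y) = sigma(y').
    """
    others = [z for z in leaves if z not in (x, y)]
    for sigma in characters:
        if sigma[x] == sigma[y]:
            continue  # this character cannot separate x from y
        x_match = any(sigma[z] == sigma[x] for z in others)
        y_match = any(sigma[z] == sigma[y] for z in others)
        if x_match and y_match:
            return False
    return True

def characters_needed(n, r, delta):
    """Smallest k with n^2 * (1 - r)^k <= delta (union bound over leaf pairs)."""
    return math.ceil((2 * math.log(n) - math.log(delta)) / -math.log1p(-r))
```

For example, on the four-leaf tree with cherries {a, c} and {b, d}, the character `{"a": 1, "c": 1, "b": 2, "d": 2}` rejects the pair (a, b) but leaves (a, c) standing.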
From cherries to trees We wish to continue by replacing each cherry (u, v) by the vertex w at distance 1 from both u and v. Problem: We may not know what the color of w is. But: for each character $\sigma$, with probability at least $(\tfrac12 + \epsilon)^2 s(\epsilon)^2$ we can reconstruct $\sigma(w)$. Now we can repeat. [Figure: cherry u, v with common neighbor w, and further leaves x, y.]
Poly. lower bound for R.C. Phylogeny Th1 [M-Steel, 2003]: Suppose that $n = 3 \times 2^q$ and T is a uniformly chosen (q+1)-level 3-regular X-tree. For all e, $\theta(e) < \theta$, and $\theta < 1/2$. Then in order to reconstruct the tree with probability $\geq 0.1$, the number of characters must satisfy $k \geq (2\theta)^{-q+1}/100 = \Omega(n^{-(\log_2 \theta + 1)})$. Proof: It suffices to prove the same bound given the topology of the bottom q − L levels and the status of the edges there.
Poly. lower bound for R.C. Phylogeny Proof: [Figure: the unknown top L levels, marked “?” and labeled X, above the known bottom q − L levels, repeated for the k characters.] If for all k characters the random cluster “dies” in the bottom q − L levels, then X is independent of the data. This happens with probability $\geq 1 - k\, 2^L (2\theta)^{q-L}$.
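The arithmetic of the last bound is easy to tabulate. The helper below is a hypothetical illustration that evaluates $1 - k\, 2^L (2\theta)^{q-L}$ as written on the slide, where $\theta < 1/2$ is the uniform upper bound on the edge probabilities; it makes no claim beyond that formula.

```python
def independence_lower_bound(k, L, q, theta):
    """Evaluate 1 - k * 2^L * (2*theta)^(q - L), the slide's lower bound on the
    probability that every cluster dies inside the bottom q - L levels
    for all k characters."""
    return 1 - k * 2**L * (2 * theta) ** (q - L)
```

With $2\theta < 1$ the term $(2\theta)^{q-L}$ decays geometrically in q − L, so the bound stays close to 1 unless k grows like $(2\theta)^{-q}$, which is polynomial in n.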
Phylogeny: Conjectures and results
Statistical physics | Phylogeny
Binary tree in ordered phase | conj. k = O(log n)
Binary tree unordered | conj. k = poly(n)
Percolation, critical $\theta = 1/2$ | Random cluster [M-Steel 2003]
Ising model, critical $\theta$: $2\theta^2 = 1$ | CFN [M-2003]
Sub-critical representation | High mutation [M-2003]
Problems: How general? What is the critical point? (extremality vs. spectral)