From Branching processes to Phylogeny: A history of reproduction

Elchanan Mossel, U.C. Berkeley 9/22/2018

2 Trees in genetics
Darwin + today: tree of species. Today: tree of fathers. Today: tree of mothers. Today: descendants for a few generations, if there are no crossovers (the graph has no cycles).
[Figure: family tree with the names Noah, Shem, Japheth, Ham, Put, Cush, Kannan, Mizraim]

3 Information on the tree
Tree of species: characteristics. Won’t discuss. Tree of mothers: mitochondria. Tree of fathers: Y chromosome. No crossover: full DNA.
[Figure: species tree with leaf labels “No brain, can’t move”, “Stupid, flies”, “Stupid, walks”, “Stupid, swims”, “Too smart, barely moves”; family tree of Noah, Shem, Japheth, Ham, Put, Cush, Kannan, Mizraim with DNA strings such as acctga, acctaa, agctga at the vertices]

4 Genetic mutation models
How do DNA and other genetic data mutate along each edge of the tree? Actual mutation is complicated (for Y: “indels, snips, micro satellites, mini satellites”). A simplified model: split the sequence into subsequences; all subsequences have the same mutation rate. That’s the standard model in theoretical phylogeny.

5 Formal model – finite state space
Finite set A of information values. Tree T = (V, E) rooted at r. Each vertex v in V has information σv in A. Each edge e = (v, u), where v is the parent of u, has a mutation matrix Me of size |A| by |A|: Me(i, j) = P[σu = j | σv = i]. A sample of the process is (σv), v in T. For each sample we know σv for v in B(T), where B(T) is the boundary of the tree; |B(T)| = n. We are given k independent samples σ^1_B(T), …, σ^k_B(T).

6 Formal model – infinite state space
Infinite set A, assuming no back-mutations (“homoplasy-free models”). Defined on an un-rooted tree T = (V, E). Edge e has a (non-mutation) parameter θ(e). Sample: perform percolation, with edge e open with probability θ(e). All the vertices v in the same open cluster have the same color σv. Different clusters get different colors. We are given k independent samples σ^1_B(T), …, σ^k_B(T).
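
The percolation sampling step can be sketched in a few lines. This is only an illustration under stated assumptions: the four-leaf tree, the vertex names and the value θ(e) = 0.7 are hypothetical, and each cluster's "color" is simply taken to be a representative vertex of that cluster.

```python
import random

def sample_homoplasy_free(vertices, theta, rng):
    """One sample of the homoplasy-free model on an un-rooted tree.

    theta: dict mapping each edge (a, b) to its open probability theta(e).
    Returns a dict vertex -> color, where the color of an open cluster
    is a representative vertex of that cluster (an illustrative choice).
    """
    parent = {v: v for v in vertices}  # union-find forest over vertices

    def find(v):
        while parent[v] != v:
            parent[v] = parent[parent[v]]  # path halving
            v = parent[v]
        return v

    for (a, b), th in theta.items():
        if rng.random() < th:          # edge is open: merge the two clusters
            parent[find(a)] = find(b)
    return {v: find(v) for v in vertices}

# Hypothetical tree: leaves 1, 2 attached to x; leaves 3, 4 attached to y.
vertices = [1, 2, 3, 4, "x", "y"]
edges = [(1, "x"), (2, "x"), ("x", "y"), (3, "y"), (4, "y")]
theta = {e: 0.7 for e in edges}
rng = random.Random(0)
for _ in range(3):
    colors = sample_homoplasy_free(vertices, theta, rng)
    print([colors[leaf] for leaf in (1, 2, 3, 4)])  # colors seen at the boundary
```

Only the colors at the leaves would be observed; the internal colors are computed here just because the cluster structure makes them available.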

7 Reconstructing the topology
Given 1B(T),…, kB(T), want to reconstruct the topology, i.e., the structure of the underlying tree on the set of labeled leaves. Formally, we want to find for all u and v in B(T) their graph-metric distance d(u, v). Assume all internal degrees are at least 3. Assume B(T) consists only of leaves – vertices of degree 1. u u Me’ v Me’ Me’’ Me’’ w w 9/22/2018

8 Summary: Conjectures and results
Statistical physics | Phylogeny
Binary tree in ordered phase | conj. k = O(log n)
Binary tree, unordered | conj. k = poly(n)
Percolation, critical θ = 1/2 | Homoplasy-free
Ising model, critical θ: 2θ² = 1 | CFN
Sub-critical random cluster | High mutation
Problems: How general? What is the critical point? (extremality vs. spectral)

9 Homoplasy free models and percolation
Th1 [M2002]: Suppose that n = 3·2^q and T is a uniformly chosen (q+1)-level 3-regular tree, and that for all e, θ(e) < θ, with θ < 1/2. Then in order to reconstruct the topology with probability δ = 0.1, at least k = Ω(θ^{−q}) = n^{Ω(−log θ)} samples are needed.
Th2 [M2002]: Suppose that T has n leaves and for all e, ½ + ε < θ(e) < 1 − ε. Then k = O(log n − log δ) samples suffice to reconstruct the topology with probability 1 − δ.

10 The CFN model
Cavender-Farris-Neyman model: 2 data types, 1 and −1. Mutation along edge e: with probability θ(e), copy the data from the parent; otherwise, choose 1 or −1 with probability ½ each, independently of everything else.
Thm [CFN]: Suppose that for all e, 1 − ε > θ(e) > ε > 0. Then, given k samples of the process at the n leaves, it is possible to reconstruct the underlying topology with probability 1 − δ if k = n^{Ω(−log ε)}.
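
A minimal sketch of the CFN mutation rule, to make the copy/resample step concrete. The tree, the vertex names and θ(e) = 0.9 are hypothetical, and the root is assumed to be a uniform ±1 spin (the slide does not fix the root distribution).

```python
import random

def sample_cfn(children, root, theta, rng):
    """Draw one CFN sample on a rooted tree.

    children: dict mapping a vertex to the list of its children.
    theta: dict mapping each (parent, child) edge to the copy probability.
    Returns a dict vertex -> spin in {+1, -1}.
    """
    sigma = {root: rng.choice((1, -1))}  # assumption: uniform root spin
    stack = [root]
    while stack:
        v = stack.pop()
        for u in children.get(v, []):
            if rng.random() < theta[(v, u)]:
                sigma[u] = sigma[v]             # copy the parent's value
            else:
                sigma[u] = rng.choice((1, -1))  # fresh uniform value
            stack.append(u)
    return sigma

# Hypothetical 4-leaf tree: r -> a, b; a -> 1, 2; b -> 3, 4.
children = {"r": ["a", "b"], "a": [1, 2], "b": [3, 4]}
theta = {(v, u): 0.9 for v, cs in children.items() for u in cs}
leaves = [1, 2, 3, 4]
rng = random.Random(0)
samples = [sample_cfn(children, "r", theta, rng) for _ in range(5)]
for s in samples:
    print([s[x] for x in leaves])  # only the leaf values are observed
```

In the reconstruction problem only the leaf rows printed above are available; the tree and the internal spins are hidden.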

11 Phase transition for the CFN model
Th1 [M2002]: Suppose that n = 3·2^q and T is a uniformly chosen (q+1)-level 3-regular tree, and that for all e, θ(e) < θ, with 2θ² < 1. Then in order to reconstruct the topology with probability δ > 0.1, at least k = Ω(2^{−q} θ^{−2q}) = n^{Ω(−log θ)} samples are needed.
Th2 [M2002]: Suppose that T is a balanced tree on n leaves and for all e, θmin < θ(e) < θmax, with 2θmin² > 1 and θmax < 1. Then k = O(log n − log δ) samples suffice to reconstruct the topology with probability 1 − δ.

12 Metric spaces on trees
Let D be a positive function on the edges E. Define D(u, v) as the sum of D(e) over the edges e on the path from u to v.
Claim: Given D(v, u) for all v and u in B(T), it is possible to reconstruct the topology of T.
Proof: It suffices to find d(u, v) for all u, v in B(T), where d is the graph-metric distance. d(u1, u2) = 2 iff for all w1 and w2, D’(u1, u2, w1, w2) = D(u1, w1) + D(u2, w2) − D(u1, u2) − D(w1, w2) is non-negative (“four-point condition”).
[Figure: two four-leaf configurations on u1, u2, w1, w2]

13 Metric spaces on trees
Similarly, d(u1, u2) = 4 iff for all w1 and w2 s.t. d(ui, wj) = 4 for i = 1, 2 and j = 1, 2, it holds that D’(u1, u2, w1, w2) = 0.
Remark 1: If D’(u1, u2, w1, w2) > 0, then |D’(u1, u2, w1, w2)| ≥ 2 min_e D(e). Therefore, it suffices to know D with accuracy min_e D(e)/4.
Remark 2: For a balanced tree, in order to reconstruct the underlying topology up to l levels from the boundary, it suffices to know D(u, v) for all u and v s.t. d(u, v) ≤ 2l + 2 (with accuracy min_e D(e)/4).
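
The test "d(u1, u2) = 2 iff D’ is non-negative for all w1, w2" can be tried on a small example. The sketch below assumes an exactly known tree metric; the five-leaf tree, its edge weights and the helper names are hypothetical, and the tolerance only absorbs floating-point error.

```python
from itertools import combinations

def path_metric(edges, leaves):
    """D(u, v) for all leaf pairs: sum of edge weights along the tree path."""
    adj = {}
    for (a, b), w in edges.items():
        adj.setdefault(a, []).append((b, w))
        adj.setdefault(b, []).append((a, w))
    D = {}
    for s in leaves:
        dist = {s: 0.0}
        stack = [s]
        while stack:  # depth-first traversal from leaf s
            x = stack.pop()
            for y, w in adj[x]:
                if y not in dist:
                    dist[y] = dist[x] + w
                    stack.append(y)
        for t in leaves:
            D[(s, t)] = dist[t]
    return D

def d_prime(D, u1, u2, w1, w2):
    return D[(u1, w1)] + D[(u2, w2)] - D[(u1, u2)] - D[(w1, w2)]

def is_cherry(D, leaves, u1, u2, tol=1e-9):
    """True iff D' >= 0 for every ordered pair (w1, w2) of other leaves."""
    others = [x for x in leaves if x not in (u1, u2)]
    return all(d_prime(D, u1, u2, w1, w2) >= -tol
               for w1 in others for w2 in others if w1 != w2)

# Hypothetical tree: a, b hang off x; c off y; d, e off z; path x - y - z.
edges = {("a", "x"): 1.0, ("b", "x"): 2.0, ("x", "y"): 0.5, ("c", "y"): 1.5,
         ("y", "z"): 0.7, ("d", "z"): 1.0, ("e", "z"): 3.0}
leaves = ["a", "b", "c", "d", "e"]
D = path_metric(edges, leaves)
cherries = [p for p in combinations(leaves, 2) if is_cherry(D, leaves, *p)]
print(cherries)
```

With estimated rather than exact distances, the tolerance would have to be chosen in line with Remark 1 (accuracy min_e D(e)/4).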

14 Proof of CFN theorem
Define D(e) = −log θ(e). Then D(u, v) = −log(Cov(v, u)), where Cov(v, u) = E[σv σu]. Estimate Cov(v, u) by Cor(v, u), where Cor(v, u) = (1/k) Σ_{t=1..k} σ^t_v σ^t_u. Need D with accuracy m = min D(e)/4 = c, or Cor = (1 ± c)Cov. Cor(v, u) is an average of k i.i.d. ±1 variables with expected value Cov(v, u). Cov(v, u) may be as small as ε^{2 depth(T)} = n^{−O(−log ε)}. Given k = n^{Ω(−log ε)} samples, it is possible to estimate D and therefore reconstruct T with high probability.
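
As a small numerical illustration of the estimator, the sketch below draws two leaves u, v that share a hidden parent under the CFN rule and compares the empirical correlation with θ². The setup (a two-edge path, θ = 0.8, k = 20000, the helper names) is hypothetical and chosen only to make the numbers easy to check.

```python
import math
import random

def copy_with_noise(spin, theta, rng):
    """CFN edge: keep the spin with probability theta, else resample uniformly."""
    return spin if rng.random() < theta else rng.choice((1, -1))

def empirical_correlation(samples, u, v):
    """Cor(v, u): average of sigma_u * sigma_v over the k samples."""
    return sum(s[u] * s[v] for s in samples) / len(samples)

def estimated_distance(samples, u, v):
    """Plug-in estimate of D(u, v) = -log Cov(v, u); infinite if Cor <= 0."""
    cor = empirical_correlation(samples, u, v)
    return -math.log(cor) if cor > 0 else math.inf

rng = random.Random(0)
theta = 0.8
samples = []
for _ in range(20000):
    w = rng.choice((1, -1))  # hidden common parent of u and v
    samples.append({"u": copy_with_noise(w, theta, rng),
                    "v": copy_with_noise(w, theta, rng)})

print(empirical_correlation(samples, "u", "v"))  # should be near theta**2 = 0.64
print(estimated_distance(samples, "u", "v"))     # should be near -2*log(0.8)
```

When Cov(v, u) is tiny, as for far-apart leaves in a deep tree, the same estimator needs many more samples before the logarithm is meaningful, which is the source of the polynomial sample bound.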

15 Extensions
Steel 94: Trick to extend to general Me, provided that det(Me) ∉ [−1, −1 + ε] ∪ [−ε, ε] ∪ [1 − ε, 1], using D(e) = −log(|det(Me)|) (more or less).
ESSW: If the tree has small depth, then k is smaller. For balanced trees: in order to reconstruct up to l levels from the boundary, it suffices to have k = Θ(log(n)). Proof: Cov(v, u) for u and v with d(u, v) ≤ 2l + 2 is at least ε^{2l+2}.

16 What’s next?
It’s important to minimize the number of samples. Do we need k = n^{Ω(1)}, or is k = polylog(n) enough? Since the number of trees on n leaves is exponential in n log n, and each sample consists of n bits, we need at least Ω(log(n)) samples.
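
The counting argument can be written out as follows. This is a sketch under one added assumption: T ranges over un-rooted binary trees with n labeled leaves, of which there are (2n−5)!!; the slide itself only states that the count is exponential in n log n.

```latex
% Assumption: T is an un-rooted binary tree on n labeled leaves,
% so the number of candidate topologies is (2n-5)!! = 1 \cdot 3 \cdot 5 \cdots (2n-5).
\begin{align*}
\log_2 \#\{T\} &= \log_2 (2n-5)!! = \Theta(n \log n), \\
\#\{\text{possible data sets}\} &\le 2^{kn}
  \quad \text{($k$ samples of $n$ bits each)}, \\
2^{kn} \ge \#\{T\} &\implies k \ge \frac{\log_2 (2n-5)!!}{n} = \Omega(\log n).
\end{align*}
```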

17 The Ising model on the binary tree
The (free) Ising-Gibbs measure on the binary tree: Set σr, the root spin, to be +/− with probability ½. For all pairs (parent, child) = (v, w), set σw = σv with probability θ; otherwise σw = +/− with probability ½. Different perspective: the topology is known and we look at a single sample.
[Figure: binary tree with +/− spins at the vertices]

18 The Ising model on the binary tree
Studied in statistical physics [Spitzer 75, Higuchi 77, Bleher-Ruiz-Zagrebnov 95, Evans-Kenyon-Peres-Schulman 2000, M 98]. Interesting phenomenon: double phase transition (different from the Ising model in Z^d). When 2θ ≤ 1, unique Gibbs measure. When 2θ² ≤ 1, free measure is extremal. In other words,

19 The Ising model on the binary tree
From BRZ or EKPS. Mutual information: I(σr, σ∂) = H(σ∂) + H(σr) − H(σr, σ∂).
Temp | θ | σr given σ∂ ≡ 1 | Uniq | I(σr, σ∂) | Free measure
high | < 1/2 | unbiased | yes | → 0 | extremal
med. | (1/2, 1/√2) | biased | no | → 0 | extremal
low | > 1/√2 | biased | no | inf > 0 | non-ext.
Remark: the 2θ² = 1 phase transition is also the transition for the mixing time of Glauber dynamics for the Ising model on the tree (with Kenyon and Peres).

20 Lower bound on number of samples
Th1 [M2002]: Suppose that n = 3·2^q and T is a uniformly chosen (q+1)-level 3-regular tree, and that for all e, θ(e) < θ, with 2θ² < 1. Then in order to reconstruct the topology with probability δ > 0.1, at least k = Ω(2^{−q} θ^{−2q}) = n^{Ω(−log θ)} samples are needed.
The proof of the lower bound uses information inequalities.
Lemma 1 [EKPS]: For the single-sample process on the binary tree of q levels, I(σr, σ∂) ≤ (2θ²)^q.

21 Lower bound on number of samples
Lemma 2 [Fano’s inequality]: X → Y. We want to reconstruct the random variable X given the random variable Y; X takes m values. Let pe be the error probability. Then 1 + pe log2(m) > H(X|Y).
Lemma 3 [“Data Processing Lemma”]: X → Y → Z. If X and Z are independent given Y, then I(X, Z) ≤ min { I(X, Y), I(Y, Z) }.
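
The Data Processing Lemma can be checked numerically on a toy Markov chain. The example below is an illustration only: X is a uniform bit, Y flips X with probability p, Z flips Y with probability q; the flip probabilities and function names are hypothetical.

```python
import math

def mutual_information(joint):
    """I(A; B) in bits from a dict {(a, b): probability}."""
    pa, pb = {}, {}
    for (a, b), p in joint.items():
        pa[a] = pa.get(a, 0.0) + p
        pb[b] = pb.get(b, 0.0) + p
    return sum(p * math.log2(p / (pa[a] * pb[b]))
               for (a, b), p in joint.items() if p > 0)

def chain_joint(p, q):
    """Joint law of (X, Y, Z) for the chain X -> Y -> Z of bit-flip channels."""
    joint = {}
    for x in (0, 1):
        for y in (0, 1):
            for z in (0, 1):
                p_y = (1 - p) if y == x else p   # P[Y = y | X = x]
                p_z = (1 - q) if z == y else q   # P[Z = z | Y = y]
                joint[(x, y, z)] = 0.5 * p_y * p_z
    return joint

def marginal(joint3, i, j):
    """Two-coordinate marginal of a three-coordinate joint law."""
    out = {}
    for key, p in joint3.items():
        pair = (key[i], key[j])
        out[pair] = out.get(pair, 0.0) + p
    return out

joint = chain_joint(p=0.1, q=0.2)
i_xy = mutual_information(marginal(joint, 0, 1))
i_yz = mutual_information(marginal(joint, 1, 2))
i_xz = mutual_information(marginal(joint, 0, 2))
print(i_xy, i_yz, i_xz)  # i_xz should not exceed either of the other two
```

In the lower-bound proof the same inequality is applied with X the unknown top of the tree, Y the spins at an intermediate level and Z the observed boundary data.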

22 Lower bound on number of samples
Assume that the topology of the bottom q − l levels is known, i.e., we know d(u, v) if d(u, v) ≤ 2(q − l). Then the conditional distribution of the topology T of the first l levels is uniform. Let σ^t_l, for t = 1, …, k, be the data at level l for sample t. Recall that we are given the Data = (σ^t_∂), t = 1, …, k. By “Data Processing”, [equation not recovered].
[Figure: the top l levels of the tree, unknown (“?”), labeled X; the k samples at level l labeled Y; the known bottom q − l levels with the k boundary samples labeled Z]

23 Lower bound on number of samples
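By independence, [equation not recovered].
After some manipulations and by the EKPS Lemma, [equation not recovered].
So [equation not recovered].
Let [equation not recovered].
Since T is chosen uniformly, [equation not recovered].
Now, by Fano’s Lemma, the probability of error in reconstructing the topology, pe, satisfies [equation not recovered].
In order to get pe < 1 − δ, need [equation not recovered].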

24 Upper bound on number of samples
Th2 [M2002]: Suppose that T is a balanced tree on n leaves and for all e, θmin < θ(e) < θmax, with 2θmin² > 1 and θmax < 1. Then k = O(log n − log δ) samples suffice to reconstruct the topology with probability 1 − δ. This proves a conjecture of M. Steel.

25 Algorithmic aspects of phase transition
Higuchi [77]: For the Ising model, when 2θ² > 1, and for all q-level binary trees, E[σr Maj(σ∂)] > ε > 0, where Maj is the majority function (ε is independent of q). Looks good for phylogeny, because we can apply Maj even when we do not know the topology. But it doesn’t work when θ is non-constant. All edges in the blue area have θ1. All edges in the black area have θ2. θ1 < θ2, and θ2 is close to 1. Maj(σ∂) is very close to Maj of the black tree. Maj of the black tree is very close to σv. σv and σr are weakly correlated.
[Figure: tree with root r and vertex v; blue area near the root, black subtree below v]

26 Algorithmic aspects of phase transition
Mossel [98]: For the Ising model, when 2θ² > 1, and for all q-level binary trees, E[σr Rec-Maj_l(σ∂)] > ε > 0, where Rec-Maj_l is the recursive majority function of l levels (ε is independent of q; l depends on θ). Looks bad for phylogeny, as we need to know the tree topology. But the main lemma of Mossel [98] is extendable to non-constant θ.
[Figure: Rec-Maj_l for l = 1]
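
To make the difference between Maj and Rec-Maj_l concrete, here is a minimal sketch for leaves of a complete binary tree listed left to right. Two choices are assumptions rather than anything stated on the slides: ties are broken by a fair coin, and the number of leaves is taken to be a power of 2^l so that the l-level blocks line up.

```python
import random

def majority(spins, rng):
    """Majority of +/-1 spins; assumption: ties are broken by a fair coin."""
    total = sum(spins)
    if total == 0:
        return rng.choice((1, -1))
    return 1 if total > 0 else -1

def recursive_majority(leaf_spins, l, rng):
    """Rec-Maj_l: repeatedly replace each block of 2**l consecutive spins
    (the leaves of one l-level subtree) by its majority, up to the root."""
    block = 2 ** l
    level = list(leaf_spins)
    while len(level) > 1:
        if len(level) % block != 0:
            raise ValueError("number of leaves must be a power of 2**l")
        level = [majority(level[i:i + block], rng)
                 for i in range(0, len(level), block)]
    return level[0]

rng = random.Random(0)
leaf_spins = [1, 1, 1, -1,  1, 1, 1, -1,  1, 1, 1, -1,  -1, -1, -1, -1]
print(majority(leaf_spins, rng))               # plain majority of all 16 leaves
print(recursive_majority(leaf_spins, 2, rng))  # majority of 4 block majorities
```

Note that `recursive_majority` needs the left-to-right leaf order, i.e. the topology, whereas `majority` does not; this is the trade-off the two slides describe.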

27 Main Lemma [M2001]
Lemma: Suppose that 2θmin² > 1. Then there exist an l and an ε > 0 such that the CFN model on the binary tree of l levels with θ(e) ≥ θmin for all e not adjacent to ∂T, and θ(e) ≥ ε θmin for all e adjacent to ∂T, satisfies E[σr Maj(σ∂)] ≥ ε. Roughly: given data of “quality ε”, we can reconstruct the root with “quality ε”.

28 Reconstructing the topology [M2001]
The algorithm: Repeat the following. (1) Reconstruct the topology up to l levels from the boundary using the 4-point method. (2) For each sample, reconstruct the data l levels from the boundary using the majority algorithm. Recall: reconstruction near the boundary takes O(log n) samples. By the main lemma, the quality stays above ε. Remark: The same algorithm gives (almost) tight upper bounds also when 2θmin² < 1.

29 Remarks and open problems
For the CFN model, the algorithm is very nice: polynomial time; adaptive (no need to know θmin and θmax in advance); nearly optimal. Main problem: extending the main lemma to non-balanced trees and other mutation models (reconstructing the local topology still works). Secondary problem: extending the lower bounds to other models.

30 Proving main Lemma
Need to estimate E[σr Maj(σ∂)]. The estimate has two parts. Case 1: For all e adjacent to ∂T, θ(e) is small. Here we use a perturbation argument, i.e., estimate partial derivatives of E[σr Maj(σ∂)] with respect to various variables (using something like the Russo formula). Case 2: Some e adjacent to ∂T has large θ(e). Use percolation theory arguments. Both cases use isoperimetric estimates for the discrete cube.

