1
From Branching processes to Phylogeny: A history of reproduction
Elchanan Mossel, U.C. Berkeley
2
Trees in genetics
Darwin + today: tree of species.
Today: tree of fathers. Today: tree of mothers. Today: descendants for a few generations, if there are no crossovers (the graph has no cycles).
[Figure: genealogy of Noah's descendants: Shem, Japheth, Ham; Put, Cush, Kannan, Mizraim.]
3
Information on the tree
Tree of species: characteristics. Won't discuss. Tree of mothers: mitochondria. Tree of fathers: Y chromosome. No crossover: full DNA.
[Figure: tree of species annotated with characteristics ("no brain, can't move", "stupid, flies", "stupid, walks", "stupid, swims", "too smart, barely moves"); genealogy of Noah annotated with DNA sequences (acctga, acctaa, agctga, ...).]
4
Genetic mutation models
How do the DNA and other genetic data mutate along each edge of the tree? Actual mutation is complicated (for Y: indels, SNPs, microsatellites, minisatellites). A simplified model: split the sequence into subsequences; all subsequences have the same mutation rate. That is the standard model in theoretical phylogeny.
5
Formal model – finite state space
Finite set A of information values. Tree T=(V,E) rooted at r. Each vertex v in V has information σv in A. Each edge e=(v,u), where v is the parent of u, has a mutation matrix Me of size |A| x |A|, with Me(i,j) = P[σu = j | σv = i]. A sample of the process is (σv)v in T. For each sample, we know σv for v in B(T), where B(T) is the boundary of the tree; |B(T)| = n. We are given k independent samples σ1B(T),…,σkB(T).
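As a concrete illustration, here is a minimal sampler for this process (not from the talk; the tree representation, names, and use of NumPy are choices of this sketch):

```python
import numpy as np

def sample_tree_process(children, root, M, root_dist, rng):
    """Draw one sample (sigma_v) of the mutation process on a rooted tree.

    children:  dict vertex -> list of child vertices
    M:         dict edge (v, u) -> |A| x |A| row-stochastic matrix with
               M[(v, u)][i, j] = P[sigma_u = j | sigma_v = i]
    root_dist: distribution of the root state over A = {0, ..., |A|-1}
    """
    sigma = {root: rng.choice(len(root_dist), p=root_dist)}
    stack = [root]
    while stack:                              # walk the tree top-down
        v = stack.pop()
        for u in children.get(v, []):
            Me = M[(v, u)]                    # mutation matrix of edge e = (v, u)
            sigma[u] = rng.choice(Me.shape[1], p=Me[sigma[v]])
            stack.append(u)
    return sigma

# k independent samples, restricted to the boundary B(T):
# [{v: s[v] for v in boundary}
#  for s in (sample_tree_process(children, root, M, root_dist,
#                                np.random.default_rng()) for _ in range(k))]
```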
6
Formal model – infinite state space
Infinite set A, assuming no back-mutations: "homoplasy-free models". Defined on an unrooted tree T=(V,E). Edge e has a (non-mutation) parameter θ(e). Sample: perform percolation, where edge e is open with probability θ(e). All the vertices v in the same open cluster get the same color σv; different clusters get different colors. We are given k independent samples σ1B(T),…,σkB(T).
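A sketch of one sample of this percolation process (my representation, with union-find cluster representatives standing in for the fresh colors; rng is any object with a random() method, e.g. random.Random()):

```python
def sample_homoplasy_free(edges, theta, rng):
    """One sample of the homoplasy-free model on an unrooted tree.

    edges: list of (u, v) vertex pairs
    theta: dict edge -> non-mutation probability theta(e)
    Each edge is open independently with probability theta(e); all vertices
    in one open cluster share a color, and distinct clusters differ.
    """
    parent = {}

    def find(x):                              # union-find with path halving
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for u, v in edges:                        # register every vertex
        find(u)
        find(v)
    for u, v in edges:
        if rng.random() < theta[(u, v)]:      # open edge: no mutation across it
            parent[find(u)] = find(v)
    # each cluster representative plays the role of that cluster's color
    return {x: find(x) for x in parent}
```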
7
Reconstructing the topology
Given σ1B(T),…,σkB(T), we want to reconstruct the topology, i.e., the structure of the underlying tree on the set of labeled leaves. Formally, we want to find, for all u and v in B(T), their graph-metric distance d(u, v). Assume all internal degrees are at least 3. Assume B(T) consists only of leaves, i.e., vertices of degree 1.
[Figure: two trees on leaves u, v, w with edge mutation matrices Me', Me''.]
8
Summary: Conjectures and results
Statistical physics              | Phylogeny
Binary tree in ordered phase     | conjecture: k = O(log n)
Binary tree unordered            | conjecture: k = poly(n)
Percolation, critical θ = 1/2    | homoplasy-free models
Ising model, critical 2θ² = 1    | CFN
Sub-critical random cluster      | high mutation
Problems: How general is this? What is the critical point? (extremality vs. spectral)
9
Homoplasy free models and percolation
Th1 [M2002]: Suppose that n = 3·2^q and T is a uniformly chosen (q+1)-level 3-regular tree, and for all e, θ(e) ≤ θ with θ < 1/2. Then in order to reconstruct the topology with probability ≥ 0.1, at least k = Ω(θ^(−q)) = n^Ω(−log θ) samples are needed.
Th2 [M2002]: Suppose that T has n leaves and for all e, 1/2 + ε < θ(e) < 1 − ε. Then k = O(log n − log δ) samples suffice to reconstruct the topology with probability 1 − δ.
10
The CFN model (Cavender-Farris-Neyman)
2 data types: +1 and −1. Mutation along edge e: with probability θ(e), copy the data from the parent; otherwise, choose +1/−1 with probability 1/2 each, independently of everything else.
Thm [CFN]: Suppose that for all e, 1 − ε > θ(e) > ε > 0. Then given k samples of the process at the n leaves, it is possible to reconstruct the underlying topology with probability 1 − δ, if k = n^Ω(−log ε).
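The CFN model is just the two-state special case of the finite-state model above; a minimal sampler (a sketch, with my representation choices):

```python
import random

def sample_cfn(children, root, theta, rng=random):
    """One CFN sample: spins in {+1, -1}; along edge e = (v, u) the child
    copies its parent with probability theta[e], otherwise it is re-randomized
    uniformly, so E[sigma_v * sigma_u] = theta[e] across a single edge."""
    sigma = {root: rng.choice([1, -1])}
    stack = [root]
    while stack:
        v = stack.pop()
        for u in children.get(v, []):
            copied = rng.random() < theta[(v, u)]
            sigma[u] = sigma[v] if copied else rng.choice([1, -1])
            stack.append(u)
    return sigma
```

Since the per-edge correlation is θ(e) and fresh coin flips are independent, correlations multiply along paths: Cov(σu, σv) is the product of θ(e) over the path from u to v. This is exactly what makes the metric D(e) = −log θ(e) of the later slides additive.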
11
Phase transition for the CFN model
Th1 [M2002]: Suppose that n = 3·2^q and T is a uniformly chosen (q+1)-level 3-regular tree, and for all e, θ(e) ≤ θ with 2θ² < 1. Then in order to reconstruct the topology with probability > 0.1, at least k = Ω(2^(−q)·θ^(−2q)) = n^Ω(−log θ) samples are needed.
Th2 [M2002]: Suppose that T is a balanced tree on n leaves, and for all e, θmin < θ(e) < θmax with 2θmin² > 1 and θmax < 1. Then k = O(log n − log δ) samples suffice to reconstruct the topology with probability 1 − δ.
12
Metric spaces on trees
Let D be a positive function on the edges E. Define D(v,u) to be the sum of D(e) over the edges e on the path from v to u.
Claim: Given D(v,u) for all v and u in B(T), it is possible to reconstruct the topology of T.
Proof: It suffices to find d(u,v) for all u, v in B(T), where d is the graph-metric distance. d(u1,u2) = 2 iff for all w1 and w2 it holds that D'(u1,u2,w1,w2) = D(u1,w1) + D(u2,w2) − D(u1,u2) − D(w1,w2) is non-negative ("four-point condition").
[Figure: the two possible pairings of the leaves u1, u2, w1, w2.]
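The four-point test written out (a sketch; D is assumed to be given as a symmetric nested dict of pairwise leaf distances):

```python
def d_prime(D, u1, u2, w1, w2):
    """D'(u1,u2,w1,w2) = D(u1,w1) + D(u2,w2) - D(u1,u2) - D(w1,w2)."""
    return D[u1][w1] + D[u2][w2] - D[u1][u2] - D[w1][w2]

def is_cherry(D, u1, u2, leaves, tol=0.0):
    """d(u1,u2) = 2 iff D' >= 0 for every other pair w1, w2.  With a noisy
    estimate of D, take tol = min_e D(e)/4, as in Remark 1 below."""
    others = [w for w in leaves if w not in (u1, u2)]
    return all(d_prime(D, u1, u2, w1, w2) >= -tol
               for w1 in others for w2 in others if w1 != w2)
```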
13
Metric spaces on trees
Similarly, d(u1,u2) = 4 iff for all w1 and w2 s.t. d(ui,wj) ≥ 4 for i=1,2 and j=1,2, it holds that D'(u1,u2,w1,w2) = 0.
Remark 1: If D'(u1,u2,w1,w2) > 0, then |D'(u1,u2,w1,w2)| ≥ 2·min_e D(e). Therefore, it suffices to know D with accuracy min_e D(e)/4.
Remark 2: For a balanced tree, in order to reconstruct the underlying topology up to l levels from the boundary, it suffices to know D(u,v) for all u and v s.t. d(u,v) ≤ 2l + 2 (with accuracy min_e D(e)/4).
14
Proof of CFN theorem
Define D(e) = −log θ(e). Then D(u,v) = −log(Cov(σv,σu)), where Cov(σv,σu) = E[σv·σu]. Estimate Cov(σv,σu) by the empirical correlation Cor(σv,σu) = (1/k)·Σ_t σv^t·σu^t. We need D with accuracy m = min_e D(e)/4 = Ω(ε); equivalently, Cor = (1 ± c)·Cov for a small constant c. Cor(σv,σu) is an average of k i.i.d. ±1 variables with expected value Cov(σv,σu). Cov(σv,σu) may be as small as ε^(2·depth(T)) = n^(−O(−log ε)). Given k = n^Ω(−log ε) samples, it is possible to estimate D and therefore reconstruct T with high probability.
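The distance estimator in code (a sketch assuming the k samples are packed into a ±1 NumPy array):

```python
import numpy as np

def estimate_D(samples):
    """Estimate D(u,v) = -log Cov(sigma_v, sigma_u) from k CFN samples.

    samples: (k, n) array of +/-1 leaf values, one row per sample.
    Cor(v,u) = (1/k) sum_t sigma_v^t sigma_u^t is an average of k i.i.d.
    variables bounded by +/-1 with mean Cov(sigma_v, sigma_u).
    """
    k = samples.shape[0]
    cor = samples.T @ samples / k        # n x n matrix of Cor(v, u)
    cor = np.clip(cor, 1e-12, 1.0)       # guard the log against sampling noise
    return -np.log(cor)
```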
15
Extensions
Steel 94: a trick extends this to general Me, provided that det(Me) is not in [−1, −1+ε] ∪ [−ε, ε] ∪ [1−ε, 1], using D(e) = −log|det(Me)| (more or less).
ESSW: If the tree has small depth, then k is smaller. For balanced trees: in order to reconstruct up to l levels from the boundary, it suffices to have k = Θ(log n).
Proof: Cov(σv,σu) for u and v with d(u,v) ≤ 2l + 2 is at least ε^(2l+2).
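A sketch of the log-det distance (simplified: the full Steel estimator also corrects for marginal state frequencies, which is the "more or less" above):

```python
import numpy as np

def logdet_distance(seq_u, seq_v, num_states):
    """-log |det F| for the empirical joint frequency matrix F of two
    aligned sequences, where F[i, j] = fraction of sites in state (i, j)."""
    F = np.zeros((num_states, num_states))
    for a, b in zip(seq_u, seq_v):
        F[a, b] += 1.0
    F /= len(seq_u)
    return -np.log(abs(np.linalg.det(F)) + 1e-300)  # guard against det = 0
```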
16
What's next?
It's important to minimize the number of samples. Do we need k = n^Ω(1), or does k = polylog(n) suffice? Since the number of trees on n leaves is exponential in n·log n, and each sample consists of n bits, we need at least Ω(log n) samples.
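The counting argument made explicit: the number of unrooted binary topologies on n labeled leaves is (2n−5)!! = 2^Θ(n log n); with a binary alphabet, each sample reveals at most n bits, so reconstruction requires k·n ≥ log2((2n−5)!!) = Θ(n·log n), i.e., k = Ω(log n).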
17
The Ising model on the binary tree
The (free) Ising Gibbs measure on the binary tree: Set σr, the root spin, to ± with probability 1/2 each. For each (parent, child) pair (v, w), set σw = σv with probability θ; otherwise, set σw = ± with probability 1/2, independently of everything else.
Different perspective: the topology is known, and we look at a single sample.
[Figure: binary tree with ± spins at the vertices.]
18
The Ising model on the binary tree
Studied in statistical physics [Spitzer 75, Higuchi 77, Bleher-Ruiz-Zagrebnov 95, Evans-Kenyon-Peres-Schulman 2000, M 98]. Interesting phenomenon: a double phase transition (different from the Ising model on Z^d). When 2θ ≤ 1, there is a unique Gibbs measure. When 2θ² ≤ 1, the free measure is extremal. In other words, I(σr, σ∂) → 0 as the number of levels grows.
19
The Ising model on the binary tree
From BRZ or EKPS. Mutual information: I(σr, σ∂) = H(σ∂) + H(σr) − H(σr, σ∂).

Temp  | θ               | σr given σ∂ ≡ 1 | unique Gibbs? | I(σr, σ∂) | free measure
high  | θ < 1/2         | unbiased        | yes           | → 0       | extremal
med.  | 1/2 < θ < 1/√2  | biased          | no            | → 0       | extremal
low   | θ > 1/√2        | biased          | no            | inf > 0   | non-extremal

Remark: the 2θ² = 1 phase transition is also the transition for the mixing time of Glauber dynamics for the Ising model on the tree (with Kenyon and Peres).
20
Lower bound on number of samples
Th1 [M2002]: Suppose that n = 3·2^q and T is a uniformly chosen (q+1)-level 3-regular tree, and for all e, θ(e) ≤ θ with 2θ² < 1. Then in order to reconstruct the topology with probability > 0.1, at least k = Ω(2^(−q)·θ^(−2q)) = n^Ω(−log θ) samples are needed.
The proof of the lower bound uses information inequalities.
Lemma 1 [EKPS]: For the single-sample process on the binary tree of q levels, I(σr, σ∂) ≤ (2θ²)^q.
21
Lower bound on number of samples
Lemma 2 [Fano's inequality]: X → Y. We want to reconstruct the random variable X given the random variable Y, where X takes m values. Let pe be the error probability. Then 1 + pe·log2(m) ≥ H(X|Y).
Lemma 3 ["Data Processing Lemma"]: X → Y → Z. If X and Z are independent given Y, then I(X, Z) ≤ min{ I(X, Y), I(Y, Z) }.
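A small numeric sanity check of Fano's bound (my example, not from the talk): for the symmetric channel where Y equals X with probability 1 − p and is otherwise uniform over the remaining m − 1 values, MAP decoding errs with probability pe = p, and H(X|Y) has a closed form:

```python
import numpy as np

def fano_slack(m, p):
    """1 + pe*log2(m) - H(X|Y) for the m-ary symmetric channel; Fano's
    inequality says this is >= 0.  Here pe = p (MAP decoding) and
    H(X|Y) = H(p) + p*log2(m-1) exactly."""
    h = -p * np.log2(p) - (1 - p) * np.log2(1 - p)   # binary entropy H(p)
    return 1 + p * np.log2(m) - (h + p * np.log2(m - 1))

print(fano_slack(m=8, p=0.3))   # ~0.18, non-negative as promised
```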
22
Lower bound on number of samples
Assume that the topology of the bottom q − l levels is known, i.e., we know d(u, v) whenever d(u, v) ≤ 2(q − l). Then the conditional distribution of the topology T of the top l levels is uniform. Let σl^t, for t = 1,…,k, be the data at level l for sample t. Recall that we are given Data = (σ∂^t). By "Data Processing", I(T, Data) ≤ I(T, (σl^t)).
[Figure: the unknown top l levels (X) above the known bottom q − l levels; the level-l data Y and boundary data Z are each repeated k times.]
23
Lower bound on number of samples
By independence, I(T, (σl^t)) ≤ Σ_t I(T, σl^t) = k·I(T, σl).
After some manipulations and by the EKPS Lemma, I(T, σl) ≤ 2^l·(2θ²)^(q−l).
So I(T, Data) ≤ k·2^l·(2θ²)^(q−l).
Let m be the number of possible topologies of the top l levels. Since T is chosen uniformly, H(T) = log2(m).
Now H(T | Data) ≥ H(T) − I(T, Data) ≥ log2(m) − k·2^l·(2θ²)^(q−l).
By Fano's Lemma, the probability of error in reconstructing the topology, pe, satisfies 1 + pe·log2(m) ≥ log2(m) − k·2^l·(2θ²)^(q−l).
In order to get pe < 1 − δ, we need k = Ω((2θ²)^(l−q)) = Ω(2^(−q)·θ^(−2q)) for constant l.
24
Upper bound on number of samples
Th2 [M2002]: Suppose that T is a balanced tree on n leaves, and for all e, θmin < θ(e) < θmax with 2θmin² > 1 and θmax < 1. Then k = O(log n − log δ) samples suffice to reconstruct the topology with probability 1 − δ.
This proves a conjecture of M. Steel.
25
Algorithmic aspects of phase transition
Higuchi [77]: For the Ising model, when 2θ² > 1, on all q-level binary trees E[σr·Maj(σ∂)] > ε > 0, where Maj is the majority function (ε is independent of q).
This looks good for phylogeny, because we can apply Maj even when we do not know the topology. But it does not work when θ is non-constant: all edges in the blue area have θ1, all edges in the black area have θ2, with θ1 < θ2 and θ2 close to 1. Then Maj(σ∂) is very close to Maj of the black tree; Maj of the black tree is very close to σv; and σv and σr are weakly correlated.
[Figure: tree with root r, internal vertex v, a blue region near the root and a black region below v.]
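A Monte Carlo check of the Higuchi-type bound in the constant-θ case (my sketch, not code from the talk):

```python
import numpy as np

rng = np.random.default_rng(0)

def cfn_leaves(q, theta, root):
    """Leaf spins of a q-level binary CFN tree, all edges with parameter theta."""
    spins = np.array([root])
    for _ in range(q):
        spins = np.repeat(spins, 2)                 # every vertex gets 2 children
        redraw = rng.random(spins.size) >= theta    # edges that forget the parent
        spins[redraw] = rng.choice([1, -1], size=redraw.sum())
    return spins

def maj_quality(q, theta, trials=5000):
    """Monte Carlo estimate of E[sigma_r * Maj(sigma_boundary)]."""
    total = 0
    for _ in range(trials):
        root = rng.choice([1, -1])
        s = np.sign(cfn_leaves(q, theta, root).sum())
        total += root * (s if s != 0 else rng.choice([1, -1]))  # random tie-break
    return total / trials

# with 2*theta^2 > 1, the correlation stays bounded away from 0 as q grows
print(maj_quality(6, 0.75), maj_quality(10, 0.75))
```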
26
Algorithmic aspects of phase transition
Mossel [98]: For the Ising model, when 2θ² > 1, on all q-level binary trees E[σr·Rec-Majl(σ∂)] > ε > 0, where Rec-Majl is the recursive majority function of l levels (ε is independent of q; l depends on θ).
[Figure: Rec-Majl for l = 1.]
This looks bad for phylogeny, as we need to know the tree topology. But the main lemma of Mossel [98] is extendable to non-constant θ.
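Rec-Majl in code (an illustrative sketch: the leaves are assumed to be ordered by the tree, and uniform tie-breaking is a convention of this sketch):

```python
import random

def maj(block, rng=random):
    """Majority of a block of +/-1 spins, ties broken uniformly at random."""
    s = sum(block)
    return rng.choice([1, -1]) if s == 0 else (1 if s > 0 else -1)

def rec_maj(spins, l, rng=random):
    """Recursive majority of l levels: repeatedly replace each block of 2**l
    consecutive +/-1 spins by its majority until one root estimate is left."""
    block = 2 ** l                # requires l >= 1
    while len(spins) > 1:
        spins = [maj(spins[i:i + block], rng)
                 for i in range(0, len(spins), block)]
    return spins[0]
```

Unlike plain Maj, the blocks must follow the tree structure, which is exactly why Rec-Maj needs the topology to be known.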
27
Main Lemma [M2001]
Lemma: Suppose that 2θmin² > 1. Then there exist l and β > 0 such that the CFN model on the binary tree of l levels with
θ(e) ≥ θmin for all e not adjacent to ∂T,
θ(e) ≥ β·θmin for all e adjacent to ∂T,
satisfies E[σr·Maj(σ∂)] ≥ β.
Roughly: given data of "quality β", we can reconstruct the root with "quality β".
28
Reconstructing the topology [M2001]
The algorithm repeats the following two steps:
1. Reconstruct the topology up to l levels from the boundary, using the 4-point method.
2. For each sample, reconstruct the data l levels from the boundary, using the majority algorithm.
Recall: reconstruction near the boundary takes O(log n) samples. By the main lemma, the quality stays above β.
Remark: The same algorithm gives (almost) tight upper bounds also when 2θmin² < 1.
29
Remarks and open problems
For the CFN model, the algorithm is very nice: polynomial time; adaptive (no need to know θmin and θmax in advance); nearly optimal.
Main problem: extending the main lemma to non-balanced trees and other mutation models (reconstructing the local topology still works).
Secondary problem: extending the lower bounds to other models.
30
Proving the main Lemma
We need to estimate E[σr·Maj(σ∂)]. The estimate has two parts:
Case 1: For all e adjacent to ∂T, θ(e) is small. Here we use a perturbation argument, i.e., we estimate partial derivatives of E[σr·Maj(σ∂)] with respect to the various variables (using something like Russo's formula).
Case 2: Some e adjacent to ∂T has large θ(e). Here we use percolation-theory arguments.
Both cases use isoperimetric estimates for the discrete cube.