Incomplete Lineage Sorting: Consistent Phylogeny Estimation From Multiple Loci & a couple of unrelated observations Elchanan Mossel, UC Berkeley Joint.

1 Incomplete Lineage Sorting: Consistent Phylogeny Estimation From Multiple Loci & a couple of unrelated observations Elchanan Mossel, UC Berkeley Joint work with: Sebastien Roch, Microsoft Research At Newton Institute Dec 07

2 Lecture Plan A simple observation about gene trees and population trees. A comment on: “optimal” and “absolutely converging” tree reconstruction. A comment on: “Generic models”. A comment on: “Network Reconstruction”. Disclaimer: Last talk, so a bit philosophical (but would be happy to provide hard technical proofs).

3 Gene Trees and Population Trees Main goal in phylogenetics: Recovering species/population histories. Data: Current genes. Issue: In recent populations, gene trees may differ from population trees. Model for evolution of trees in populations: Coalescence: Fixed-size population N. Each individual chooses a random parent in the previous generation. # generations = N × branch-length. Main Question: How to reconstruct population trees from gene trees?
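The coalescence model on this slide can be illustrated with a small simulation. This is only a sketch under assumptions not in the talk: a haploid population, two sampled lineages, and hypothetical function names. Each lineage picks a uniformly random parent every generation, so the two lineages merge with probability 1/N per generation and the mean waiting time is about N generations.

```python
import random

def pair_coalescence_generations(pop_size, rng):
    """Generations back until two distinct lineages pick the same parent."""
    a, b = 0, 1  # labels of two distinct individuals in the current generation
    generations = 0
    while a != b:
        # Each lineage independently chooses a uniformly random parent.
        a = rng.randrange(pop_size)
        b = rng.randrange(pop_size)
        generations += 1
    return generations

def mean_coalescence_time(pop_size, trials, seed=0):
    """Average coalescence time of a pair over independent simulated trials."""
    rng = random.Random(seed)
    total = sum(pair_coalescence_generations(pop_size, rng) for _ in range(trials))
    return total / trials
```

For example, `mean_coalescence_time(20, 20000)` should come out close to 20, which matches the scaling # generations = N × branch-length with one unit of branch length.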

4 Gene Trees: The Engineering Approach Two common “engineering” approaches: Approach 1: Assume all genes come from a single tree. Kubatko-Degnan: Inconsistent. Approach 2: Build a tree for each gene on its own. Take the majority tree. Degnan-Rosenberg: Inconsistent. Q: What should be done instead?

5 Gene Trees: A Rigorous Approach M-Roch: A consistent estimator of the molecular distance $d(P_1,P_2)$ between two populations is: $D(P_1,P_2) = \min\{ d_g(P_1,P_2) : g \in \mathrm{Genes} \}$ ⇒ distances between populations are identifiable ⇒ tree is identifiable. Under standard coalescence assumptions, get good rate: $P(\text{topology error}) \le (\#\,\text{pops}) \times \exp(-c \cdot \#\,\text{genes})$, c = shortest branch length. Estimator can be “plugged in” to any distance-based method for reconstructing trees. In M-Roch, use NJ, but it works similarly for: Short-quartets (ESSW), Distorted metrics and forests (M), etc.
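The estimator D(P_1,P_2) above is just a minimum over genes, which a few lines of code make concrete. This is a sketch only: the input layout (one dictionary of pairwise distances per gene) and the function name are assumptions for illustration, not anything specified in the talk.

```python
def population_distance_estimates(gene_distance_tables):
    """For every pair of populations, take the minimum distance over all genes.

    gene_distance_tables: list with one dict per gene, mapping a pair of
    population names (p, q) to the distance d_g(p, q) measured on that gene.
    """
    estimates = {}
    for table in gene_distance_tables:
        for pair, dist in table.items():
            key = tuple(sorted(pair))  # treat (p, q) and (q, p) as the same pair
            if key not in estimates or dist < estimates[key]:
                estimates[key] = dist
    return estimates

# Toy numbers, invented for illustration only.
genes = [
    {("A", "B"): 0.30, ("A", "C"): 0.55, ("B", "C"): 0.60},
    {("B", "A"): 0.22, ("A", "C"): 0.70, ("B", "C"): 0.52},
]
estimates = population_distance_estimates(genes)
```

The resulting table can then be handed to any distance-based tree builder, such as NJ as mentioned on the slide.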

6 Comments on Absolute Convergence Algorithmic paradigm: Want to reconstruct tree on n species using sequence length L and running time T. “Absolute Convergence”: L = poly(n); T = poly(n). Q: Is this the best we can do?

7 Resolution of Steel’s conjecture [M’04] [Daskalakis-M-Roch’06] Short branches: seq. length $L = c \log n$. Long branches: seq. length $L = n^C$. Ancestral reconstruction; phylogenetic reconstruction. n = # species. Short branches := all branches < $\ell_c$. Long branches := all branches > $\ell_c$. $\ell_c$ depends on the mutation model but not on the tree, tree size etc.

8 The algorithmic challenge Conj: For short branches, if data is generated from the model: ML identifies the correct tree using L = O(log n) samples (best bound known is L = exp(O(n))). Conclusion: In order to “beat” ML, need algorithms with L = O(log n). Challenge: The constant in O is important! Challenge: Deal with short/long branches (contract edges; output forest). Challenge: General mutation models (not just CFN, JC). Comment: Rigorous methods have running time guarantees. Comment: For L = poly(n), know how to deal with all challenges: ESSW, M’07 (forests – long edges), Gronau et al. (short edges).

9 On generic parameters From Rhodes’ talk: “Generic models are easier to identify”. Typically – generic parameters. How about generic trees?

10 Mixtures and Phenomena in High Dims The Geometry of High Dimensions: “Almost every collection of k vectors is almost orthogonal in high enough dimension n”. M-Roch (in preparation): For every k, as $n \to \infty$ the probability that a mixture of k trees on n leaves is identifiable goes to 1. Holds for most reasonable measures on the space of trees and most mutation models. Basic idea: In generic situations can (almost) cluster samples according to trees. Gives an efficient algorithm. Similar results hold for rates across sites.

11 A Comment on Dynamic Programming Q (Zhang): Given a tree, is it possible to find the most informative k species? In terms of Parsimony? In terms of ML? Note: If we know the Parsimony/ML score for the left/right sub-trees, we know it for the root. Q: Can we use dynamic programming? A: Yes – but with the right “data structure”. Information per node: Discrete version of the set of achievable distributions. Called “Density Evolution” in coding theory / spin-glass theory. Additive error = 1/poly(n).

12 Hardness of Distinguishing Network Models with Hidden Nodes Basic question: Is it possible to recover a network G from observations at a subset of the nodes? Easier question: Suppose we observe $X_1,\dots,X_r$. Is it possible to determine if they come from nodes S in $G_1$ or nodes T in $G_2$? Problem: It may be that the two distributions are the same. Assume: The two distributions are different (large total variation distance). Q: Assuming the two distributions are different, how hard is it to tell if the data is coming from $G_1$ or $G_2$? Related question: What is a computational model of a biologist?

13 The distinguishing problem for Trees Q: Assuming the two distributions are different, how hard is it to tell if the data is coming from $T_1$ or $T_2$? Note: For trees the problem is easy: Perform a likelihood test. Easy to do efficiently (peeling, pruning, dynamic programming). # samples needed: poly(n).
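The likelihood test mentioned above can be sketched with the standard pruning (peeling) recursion. The choices here are illustrative assumptions rather than the talk's setup: a two-state symmetric (CFN-style) model with a uniform root, binary trees written as nested tuples `(left, p_left, right, p_right)` where `p` is the flip probability on the edge to that child, and hypothetical function names.

```python
import math

def partial_likelihoods(tree, site):
    """Return [P(leaf data below | node state 0), P(leaf data below | node state 1)]."""
    if isinstance(tree, str):  # a leaf, identified by its name
        return [1.0, 0.0] if site[tree] == 0 else [0.0, 1.0]
    left, p_left, right, p_right = tree
    lik_left = partial_likelihoods(left, site)
    lik_right = partial_likelihoods(right, site)
    result = []
    for state in (0, 1):
        # Sum over the child's state: it keeps the parent's state or flips.
        via_left = (1 - p_left) * lik_left[state] + p_left * lik_left[1 - state]
        via_right = (1 - p_right) * lik_right[state] + p_right * lik_right[1 - state]
        result.append(via_left * via_right)
    return result

def log_likelihood(tree, sites):
    """Log-likelihood of independent sites, with a uniform root state."""
    total = 0.0
    for site in sites:
        l0, l1 = partial_likelihoods(tree, site)
        total += math.log(0.5 * l0 + 0.5 * l1)
    return total

def more_likely_tree(tree_1, tree_2, sites):
    """Likelihood test: report which candidate tree explains the samples better."""
    return 1 if log_likelihood(tree_1, sites) >= log_likelihood(tree_2, sites) else 2
```

Each site costs time linear in the number of nodes, which is why the test is efficient for trees.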

14 Two Models of a Biologist The Computationally Limited Biologist: Cannot solve hard computational problems; in particular, cannot sample from general G-distributions. The Computationally Unlimited Biologist: Can sample from any distribution. Related to the following problem: Can nature solve computationally hard problems? From Shapiro at Weizmann

15 Hardness Results The Computationally Limited Biologist (Bogdanov-M): The distinguishing problem can be solved efficiently iff NP = RP. The Computationally Unlimited Biologist (Bogdanov-M): The problem is at least zero-knowledge hard. Zero-Knowledge Problem: Can we decide if samples from a computationally efficient distribution are coming from the uniform distribution? Related to cryptography.

16 Reconstructing Networks Motivation: abundance of stochastic networks in biology, social networks, neuroscience etc. Network defines a distribution as follows: G = (V,E) = graph on [n] = {1,2,…,n}. Distribution defined on $A^V$, where A is some finite set. To each clique C in G, associate a function $\psi_C : A^C \to \mathbb{R}_+$ and: $P[\sigma] = \prod_C \psi_C(\sigma_C)$ (up to normalization). Called Markov Random Field, Factorized Distribution etc. Directed models also common. Markov Property: If S separates A from B then $\sigma_A$ and $\sigma_B$ are conditionally independent given $\sigma_S$.
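For a tiny graph the factorized distribution can be computed by brute force, which makes the definition and the Markov property easy to check by hand. This is a sketch under assumptions: the function name, the dictionary layout for clique potentials, and the example potential are all made up for illustration, and the normalizing constant is computed explicitly by enumeration.

```python
import itertools

def mrf_distribution(num_nodes, alphabet, clique_potentials):
    """Enumerate all assignments and normalize the product of clique potentials.

    clique_potentials: dict mapping a tuple of node indices (a clique C) to a
    function psi_C that takes the tuple of values on C and returns a weight >= 0.
    """
    weights = {}
    for sigma in itertools.product(alphabet, repeat=num_nodes):
        weight = 1.0
        for clique, psi in clique_potentials.items():
            weight *= psi(tuple(sigma[v] for v in clique))
        weights[sigma] = weight
    normalizer = sum(weights.values())
    return {sigma: w / normalizer for sigma, w in weights.items()}

def agree(values):
    """Example edge potential (an assumption): favors equal endpoint values."""
    return 2.0 if values[0] == values[1] else 1.0

# Path 0 - 1 - 2: node 1 separates node 0 from node 2.
path_dist = mrf_distribution(3, (0, 1), {(0, 1): agree, (1, 2): agree})
```

On this path, conditioning on the middle node makes the two end nodes independent, which is the Markov property with S = {1}. The enumeration is exponential in n, so it is only meant for checking small examples.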

17 Reconstructing Networks. Task 1: Given samples of $\sigma$, find G. Task 2: Given samples of $\sigma$ restricted to a set S, find G. Will consider the problem when n is large and the maximum degree d is small. (Note that the specification of the model is of size max(n, exp(max |C|)).)

18 Reconstructing Networks – A Trivial Algorithm Lower bound (Bresler-M-Sly): In order to recover G of max-degree d, need at least $c\,d \log n$ samples. Proof follows by “counting # of networks”. Upper bound (Bresler-M-Sly): If the distribution is “non-degenerate”, $c\,d \log n$ samples suffice. Trivial Algorithm: For each $v \in V$: Enumerate over candidate neighborhoods N(v). For each $w \in V$, check if $\sigma_v$ is independent of $\sigma_w$ given $\sigma_{N(v)}$. Non-Degeneracy: For every v and every $w \in N(v)$ there exist two assignments $\sigma_1$ and $\sigma_2$ to N(v) that differ at w and: $d_{TV}(P(\sigma_v \mid \sigma_1), P(\sigma_v \mid \sigma_2)) \ge \epsilon$. For the soft-core model it suffices to have, for all $\psi = \psi_{u,v}$: $\max_{a,b,c,d} |\psi(c,a) - \psi(d,a) + \psi(c,b) - \psi(d,b)| > \epsilon$. Running time = $O(n^{d+1} \log n)$.
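A rough sketch of the trivial algorithm, for intuition only. Several details are assumptions rather than the algorithm as analyzed by Bresler-M-Sly: conditional independence is tested by comparing empirical conditional distributions in total variation against a hypothetical `threshold`, candidate sets are tried in order of increasing size, and the first set that passes is returned. Samples are tuples indexed by node.

```python
from collections import Counter
from itertools import combinations

def conditional_dists(samples, v, given):
    """Empirical distribution of sigma_v for each observed assignment to `given`."""
    counts = {}
    for s in samples:
        key = tuple(s[u] for u in given)
        counts.setdefault(key, Counter())[s[v]] += 1
    return {key: {a: n / sum(c.values()) for a, n in c.items()}
            for key, c in counts.items()}

def tv_distance(p, q):
    """Total variation distance between two distributions given as dicts."""
    return 0.5 * sum(abs(p.get(a, 0.0) - q.get(a, 0.0)) for a in set(p) | set(q))

def looks_independent(samples, v, w, cond, threshold):
    """True if changing sigma_w never moves P(sigma_v | sigma_cond, sigma_w) much."""
    by_cond = {}
    for key, dist in conditional_dists(samples, v, cond + (w,)).items():
        by_cond.setdefault(key[:-1], []).append(dist)  # group by the values on cond
    return all(tv_distance(p, q) <= threshold
               for group in by_cond.values()
               for p, q in combinations(group, 2))

def estimate_neighborhood(samples, v, max_degree, threshold):
    """Smallest candidate set that makes sigma_v look independent of all other nodes."""
    others = [u for u in range(len(samples[0])) if u != v]
    for size in range(max_degree + 1):
        for cand in combinations(others, size):
            rest = [w for w in others if w not in cand]
            if all(looks_independent(samples, v, w, cand, threshold) for w in rest):
                return set(cand)
    return None  # no candidate of size <= max_degree passed the test
```

The loop visits on the order of n^d candidate sets per node, which is where the polynomial dependence on n with exponent growing in d comes from. With real samples the threshold would have to be chosen relative to ε and the sample size.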

19 A Trivial Algorithm – Related Result Trivial Algorithm: For each $v \in V$: Enumerate over N(v). For each $w \in V$, check if $\sigma_v$ is independent of $\sigma_w$ given $\sigma_{N(v)}$. Related work: Algorithm was suggested before. Abbeel, D. Koller, A. Ng: without restrictions, learn a model whose KL distance from the generating model is small (no guarantee of obtaining the true model; in order to get O(1) KL distance need poly samples). M. J. Wainwright, P. Ravikumar, J. D: Use $L_1$ regularization to get the true model for Ising models, sampling complexity $O(d^5 \log n)$ – no running time bounds. Other related work: assuming special form of potentials.

20 Variants of the Trivial Algorithm If the graph has exponential decay of correlations, $\mathrm{Corr}(\sigma_u, \sigma_v) \le \exp(-c\, d(u,v))$, it suffices to enumerate over N(v) among w correlated with v. Running time: $O(n^2 \log n + n f(d))$. Missing nodes: Suppose G is triangle-free; then a variant of the algorithm can find one hidden node. Idea (with M. Biskup’s help): Run the algorithm as if the node is not hidden. Noise: The algorithm tolerates small amounts of noise (statistical robustness). Q: What about higher amounts of noise? (From Bresler-M-Sly)

21 Higher Noise & Non-Identifiable Example Bresler-M-Sly: Example of non-identifiability. Consider $G_1$ = path of length 2, $G_2$ = triangle + noise. Assume Ising model with random interactions and random noise. Then with constant probability, cannot distinguish between the models. Ising: $P[\sigma] = \prod_{(u,v) \in E} \exp(\beta\, \sigma(u) \sigma(v))$ (up to normalization). Intuitive reason: dimension of distribution is 3 in both cases. (Figure legend: hidden nodes, observed nodes.)

22 Thanks !!

23 Sebastien Roch, Costis Daskalakis, Andrej Bogdanov

24 Thanks !! Fascinating workshop: Principal Organiser: Professor Mike Steel (University of Canterbury, NZ). Organisers: Professor Vincent Moulton (University of East Anglia) and Dr Katharina Huber (University of East Anglia). Sponsored by: Allan Wilson Centre for Molecular Ecology and Evolution. As part of a great program: Organisers: Professor V Moulton (East Anglia), Professor M Steel (Canterbury) and Professor D Huson (Tübingen)

