Phylogenetic trees: What to look for and where? Lessons from Statistical Physics. Elchanan Mossel, U.C. Berkeley and Microsoft (mossel@stat.berkeley.edu).


Phylogenetic trees: What to look for and where? Lessons from Statistical Physics
Elchanan Mossel, U.C. Berkeley and Microsoft Research
mossel@stat.berkeley.edu, www.stat.berkeley.edu/~mossel/
11/22/2018

Statistical physics
Statistical physics is a subfield of mathematical physics that studies complex systems with simple microscopic interactions.
The Ising model on a graph $G = (V, E)$ is a probability measure (“Gibbs distribution”) on the space of configurations $\sigma : V \to \{-1, 1\}$ such that $P[\sigma]$ is given by:
$P[\sigma] = \exp\left(\sum_{(v,w) \in E} \sigma(v)\sigma(w) / T\right) / Z = \exp\left(\beta \sum_{(v,w) \in E} \sigma(v)\sigma(w)\right) / Z$, where $\beta = 1/T$.
Or, $\mathrm{Weight}(\sigma) \sim \exp\left(2\beta \cdot \#\{u \sim v : \sigma(u) = \sigma(v)\}\right)$.
Traditionally studied on cubes in $\mathbb{Z}^d$.
[Figure: the Ising model on a 200 x 200 grid]
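To make the definition concrete, here is a minimal sketch that enumerates the Gibbs distribution by brute force on a tiny graph. The function names and the 4-cycle example are illustrative assumptions, not part of the talk; the enumeration takes $2^{|V|}$ steps, so it is only usable on very small graphs.

```python
import itertools
import math

def ising_distribution(vertices, edges, beta):
    """Return {spin tuple: probability} for the Ising model on a small graph."""
    weights = {}
    for spins in itertools.product((-1, 1), repeat=len(vertices)):
        sigma = dict(zip(vertices, spins))
        # Sum of sigma(v) * sigma(w) over all edges (v, w).
        interaction = sum(sigma[v] * sigma[w] for v, w in edges)
        weights[spins] = math.exp(beta * interaction)
    z = sum(weights.values())  # partition function Z
    return {spins: w / z for spins, w in weights.items()}

def agreeing_edges(vertices, edges, spins):
    """Count edges whose two endpoints carry the same spin."""
    sigma = dict(zip(vertices, spins))
    return sum(1 for v, w in edges if sigma[v] == sigma[w])

# Hypothetical example: a 4-cycle at beta = 0.5.
vertices = [0, 1, 2, 3]
edges = [(0, 1), (1, 2), (2, 3), (3, 0)]
beta = 0.5
dist = ising_distribution(vertices, edges, beta)

# Two configurations whose agreeing-edge counts differ by 4.
all_plus = (1, 1, 1, 1)
alternating = (1, -1, 1, -1)
ratio = dist[all_plus] / dist[alternating]
```

The ratio of two probabilities depends only on the difference in the number of agreeing edges, which is the content of the "Weight" formulation: here the all-plus configuration has 4 agreeing edges and the alternating one has none, so the ratio is $\exp(2\beta \cdot 4)$.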

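Statistical physics - intuition
The Ising model on the $n \times n$ grid is given by:
$P[\sigma] = \exp\left(\sum_{(v,w) \in E} \sigma(v)\sigma(w) / T\right) / Z = \exp\left(\beta \sum_{(v,w) \in E} \sigma(v)\sigma(w)\right) / Z$
We expect that:
$T$ small, $\beta$ large $\Rightarrow$ strong correlations: $\mathrm{Corr}(\text{boundary}, 0) > \epsilon > 0$ for all $n$.
$T$ large, $\beta$ small $\Rightarrow$ weak correlations: $\mathrm{Corr}(\text{boundary}, 0) \to 0$ as $n \to \infty$.
Onsager (1944) proved this, with critical value $\beta = \beta_c = \ln(1 + \sqrt{2})/2$.
For most other graphs, we know very little.
[Figure: a box of side 2n with its boundary; the Ising model on a 200 x 200 grid at $\beta = \beta_c$]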
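As a quick numerical check of the critical value quoted from Onsager, the sketch below evaluates $\beta_c = \ln(1 + \sqrt{2})/2$ and the corresponding temperature $T_c = 1/\beta_c$. The variable names are illustrative.

```python
import math

# Critical inverse temperature of the 2D Ising model (Onsager's value).
beta_c = math.log(1 + math.sqrt(2)) / 2

# Corresponding critical temperature, using beta = 1/T.
t_c = 1 / beta_c

# Equivalent characterization: 2 * beta_c = asinh(1), so sinh(2 * beta_c) = 1.
check = math.sinh(2 * beta_c)

print(f"beta_c = {beta_c:.6f}, T_c = {t_c:.6f}, sinh(2 beta_c) = {check:.6f}")
```

Numerically, $\beta_c \approx 0.4407$ and $T_c \approx 2.269$.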

Statistical physics on trees
The Ising model on a tree $T = (V, E)$ is given by:
$P[\sigma] = \exp\left(\sum_{(v,w) \in E} \beta(v,w)\, \sigma(v)\sigma(w)\right) / Z$
It is equivalent to the following model:
Let $r$ be a root (chosen arbitrarily).
Let $\sigma(r) = \pm 1$ with probability ½ each, and for each edge $(u,v)$ directed away from the root, let:
$\sigma(v) = \sigma(u)$ with probability $\theta(u,v)$.
$\sigma(v)$ is an independent $\pm 1$ otherwise.
$\theta(u,v) = \left(e^{\beta(u,v)} - e^{-\beta(u,v)}\right) / \left(e^{\beta(u,v)} + e^{-\beta(u,v)}\right)$
[Figure: a tree whose vertices are labeled + and -]
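The equivalence can be checked on a single edge: under the Ising measure the two endpoints agree with probability $e^{\beta}/(e^{\beta} + e^{-\beta})$, and under the broadcast description they agree with probability $\theta + (1-\theta)/2$. The sketch below verifies that these match when $\theta = \tanh \beta$ and simulates the broadcast process. The tree encoding (a dict of child lists) and all names are assumptions made for illustration.

```python
import math
import random

def copy_probability(beta):
    """theta = (e^beta - e^-beta) / (e^beta + e^-beta) = tanh(beta)."""
    return math.tanh(beta)

def broadcast(children, root, theta, rng):
    """Sample spins on a rooted tree.

    children: dict mapping a vertex to the list of its children.
    theta: dict mapping a directed edge (u, v) to its copy probability.
    """
    sigma = {root: rng.choice((-1, 1))}
    stack = [root]
    while stack:
        u = stack.pop()
        for v in children.get(u, []):
            if rng.random() < theta[(u, v)]:
                sigma[v] = sigma[u]              # copy the parent's spin
            else:
                sigma[v] = rng.choice((-1, 1))   # fresh independent spin
            stack.append(v)
    return sigma

beta = 0.5
theta_uv = copy_probability(beta)
p_agree_broadcast = theta_uv + (1 - theta_uv) / 2
p_agree_ising = math.exp(beta) / (math.exp(beta) + math.exp(-beta))

# Monte Carlo check on a single edge r -> v (seeded for reproducibility).
rng = random.Random(1)
children = {"r": ["v"]}
samples = [broadcast(children, "r", {("r", "v"): theta_uv}, rng)
           for _ in range(20000)]
freq_agree = sum(s["r"] == s["v"] for s in samples) / len(samples)
```

Because the tree has no cycles, matching the agreement probability on every edge is enough for the two descriptions to define the same measure.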

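Ising model on binary trees
[Phase diagram by temperature: low, intermediate, high]
Low temperature: $2\theta(e)^2 > 1$ for all $e$. “Non-extremality”: bias, and bias under a “typical” boundary.
Intermediate temperature: $2\theta(e) > 1$ and $2\theta(e)^2 \le 1$ for all $e$. “Extremality”: bias, but no bias under a “typical” boundary.
High temperature: $2\theta(e) \le 1$ for all $e$. Unique Gibbs measure: no bias.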
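The two thresholds in the diagram sit at $\theta = 1/2$ and $\theta = 1/\sqrt{2}$. A minimal sketch of the classification, assuming the same $\theta$ on every edge (the function name and regime labels are illustrative):

```python
import math

def regime(theta):
    """Classify a binary tree with the same copy probability theta on every edge."""
    if 2 * theta <= 1:
        return "unique Gibbs measure"   # high temperature
    if 2 * theta ** 2 <= 1:
        return "extremality"            # intermediate temperature
    return "non-extremality"            # low temperature

# Threshold values of theta separating the three regimes.
uniqueness_threshold = 0.5
extremality_threshold = 1 / math.sqrt(2)

for theta in (0.4, 0.6, 0.8):
    print(theta, regime(theta))
```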

Statistical physics on trees: History
Uniqueness was studied by Bethe (1930s).
The extremality phase was studied more recently: Spitzer 75, Higuchi 77, Bleher-Ruiz-Zagrebnov 95, Evans-Kenyon-Peres-Schulman 2000, Ioffe 99, M 98, Haggstrom-M 2000, Kenyon-M-Peres 2001, Martinelli-Sinclair-Weitz 2003, Martin 2003.
Many problems are still open.
Extremality has rich connections with:
Noisy computation/communication [von-Neumann53, Evans-Schulman00, …]
Mixing of Markov chains [Berger-Kenyon-Mossel-Peres01, Martinelli-Sinclair-Weitz05]
Spin glasses and random SAT problems [Parisi, Mezard, Montanari; Mezard-Montanari06]

Phylogeny
“Phylogeny is the true evolutionary relationships between groups of living things”
[Figure: family tree with Noah at the root; Shem, Japheth and Ham below; then Cush, Kannan and Mizraim]

History of Phylogeny
Intuitively: “animal kingdom” or “plant kingdom.”
More scientifically: morphology, fossils, etc.
Darwin …
But: Is a human more like a great ape or like a chimpanzee?
[Figure: organisms labeled “No brain, can’t move”; “Stupid, walks”; “Stupid, swims”; “Stupid, flies”; “Too smart, barely moves”]

Molecular Phylogeny
Molecular phylogeny: based on DNA, RNA or protein sequences of organisms.
Mutation mechanisms: substitutions, transpositions, insertions, deletions, etc.
We will only consider substitutions and assume sequences are aligned.
[Figure: the family tree annotated with sequences. Noah: acctga. Shem, Japheth, Ham: acctaa, acctga, acctga. Put, Cush, Kannan, Mizraim: acctga, acctga, agctga, acctga.]

Simplifying assumptions: models
Assumption: Letters of sequences (“characters”) evolve independently and identically.
CFN model: the first stochastic model, invented by Cavender, Farris and Neyman (70s):
Let $\sigma(r) = \pm 1$ with probability ½ each, and for each edge $(u,v)$ directed away from the root, let:
$\sigma(v) = \sigma(u)$ with probability $\theta(u,v)$.
$\sigma(v)$ is an independent $\pm 1$ otherwise.
This is exactly the Ising model on the evolutionary tree!
Dictionary: {C,T} = + (pyrimidine group), {A,G} = - (purine group).
Some results can be generalized to other models.
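A minimal simulation sketch of the CFN model, under illustrative assumptions: a four-leaf tree with internal vertices u and v, rooted at u, and $\theta = 0.8$ on every edge. It samples $k$ independent characters and estimates the correlation between pairs of leaves. A standard consequence of the model, not stated on the slide, is that $E[\sigma(x)\sigma(y)]$ equals the product of $\theta(e)$ over the edges on the path from $x$ to $y$.

```python
import random

def cfn_characters(children, root, theta, k, rng):
    """Sample k i.i.d. CFN characters; returns {vertex: list of k spins}."""
    seqs = {}
    for _ in range(k):
        sigma = {root: rng.choice((-1, 1))}
        stack = [root]
        while stack:
            u = stack.pop()
            for v in children.get(u, []):
                if rng.random() < theta[(u, v)]:
                    sigma[v] = sigma[u]              # no substitution signal lost
                else:
                    sigma[v] = rng.choice((-1, 1))   # state is re-randomized
                stack.append(v)
        for v, s in sigma.items():
            seqs.setdefault(v, []).append(s)
    return seqs

def empirical_correlation(x, y):
    """Average of x_i * y_i over the k characters."""
    return sum(a * b for a, b in zip(x, y)) / len(x)

# Hypothetical tree: leaves a, b hang off u; leaves c, d hang off v; edge u-v.
children = {"u": ["a", "b", "v"], "v": ["c", "d"]}
theta = {(p, c): 0.8 for p, cs in children.items() for c in cs}

rng = random.Random(0)
seqs = cfn_characters(children, "u", theta, 20000, rng)
corr_ab = empirical_correlation(seqs["a"], seqs["b"])  # path a-u-b: 2 edges
corr_ac = empirical_correlation(seqs["a"], seqs["c"])  # path a-u-v-c: 3 edges
```

With these parameters the estimates should be close to $0.8^2 = 0.64$ and $0.8^3 = 0.512$ respectively, which shows how correlations between leaves decay with their distance in the tree.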

Simplifying assumptions: trees
Assumption 1: Evolution is on a tree.
Assumption 2: Trees are binary: all internal degrees are 3.
Given a set of species (labeled vertices) $X$, an $X$-tree is a tree which has $X$ as the set of leaves.
Two $X$-trees $T_1$ and $T_2$ are identical if there is a graph isomorphism between $T_1$ and $T_2$ that is the identity map on $X$.
Most results extend to trees all of whose internal degrees are at least 3.
[Figure: example trees on leaves a, b, c, d with internal vertices u, v, w and edge labels $M_{e'}$, $M_{e''}$; the three leaf arrangements shown are (d, a, c, b), (d, a, b, c) and (c, a, b, d)]
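To give a sense of the size of the search space, the sketch below counts distinct binary $X$-trees on $n$ labeled leaves. It uses the standard count $(2n-5)!!$, which is not stated on the slide: a binary tree on $m-1$ leaves has $2m-5$ edges, and leaf number $m$ can be attached to any one of them. The function name is illustrative.

```python
def num_binary_x_trees(n):
    """Number of unrooted binary trees with n labeled leaves, for n >= 3."""
    if n < 3:
        raise ValueError("need at least 3 leaves")
    count = 1  # exactly one tree (a star) on 3 leaves
    for m in range(4, n + 1):
        # The tree on m - 1 leaves has 2(m - 1) - 3 = 2m - 5 edges,
        # and the new leaf can be attached to any of them.
        count *= 2 * m - 5
    return count

for n in (4, 5, 10, 20):
    print(n, num_binary_x_trees(n))
```

For four leaves a, b, c, d this gives 3 trees, one for each way of pairing up the leaves. The count grows faster than exponentially in $n$, so exhaustive search over trees is infeasible beyond small $n$.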

The Phylogenetic Challenge
[Figure labels: time; evolutionary model; genetic sequences; contemporary genetic sequences]
How can the phylogenetic tree be reconstructed from genetic data of contemporary species?

Phylogeny
The tree is unknown.
We are given sequences at the leaves of the tree.
We want to reconstruct the (unrooted) tree.
How “hard” is this as a function of:
n = “size of tree” = number of leaves,
k = length of sequences?


n and k
Length of sequence: we want to know k = number of characters needed to reconstruct the tree with n = number of leaves.
Erdős-Steel-Székely-Warnow 96: If $\epsilon < \theta(e) < 1 - \epsilon$ for all $e$, the tree can be recovered from sequences of length $k = n^c$, in polynomial time.
Question: How about shorter sequences?
Previously, the best lower bound on sequence length was $k = \Omega(\log n)$.
However, in practice:
Long sequences are sometimes hard to find.
Short sequences often suffice.

Lesson 1: Phylogenetic lower bound for forgetful trees
Th [M2004; Trans AMS]: If $2\theta^2(e) < 1$ for all $e$, then we show a lower bound on sequence length of $k = n^c$, where $c > 0$ is a function of $\theta = \max_e \theta(e)$ and $c \to 1$ as $\theta \to 0$.
Th [M2003; JCB]: Similar theorem for general mutation models if mutation rates are high.
Proofs are easy.

Poly. lower bound for Phylogeny
“Proof by coupling”:
[Figure: the unknown top L levels of the tree (X = T, marked “?”) above the known bottom q-L levels, each shown for k characters]
If for all k characters we can couple the bottom q-L levels, then X is independent of the data.
By forgetfulness of the tree, if $k < n^c$, X is independent of the data with high probability.
A similar idea can be used to test trees (M+Riesenfeld).

Lesson 2: Recent history is easy
In the proof of the lower bound, the “deep convergences” were hard to reconstruct.
Theorem [M04]: If $\epsilon < \theta(e) < 1 - \epsilon$ for all $e$, then “most of the tree” can be reconstructed from sequences of length $k = O(\log n)$.
“Most of the tree” := a forest F such that the true tree is obtained from F by adding $o(n)$ edges.
Results were refined, with experiments, in [Daskalakis-Hill-Jaffe-Mihaescu-Mossel-Rao].
The proof is *not* easy: it is based on Distorted Metrics.

Lesson 3: Species that remember their past can reconstruct their history.
Thm [Daskalakis-Mossel-Roch; To appear STOC06]: If $2\theta^2(e) > 1$ for all $e$, then the tree can be recovered with high probability from sequences of length $k = O(\log n)$.
Solves M. Steel’s “Favourite conjecture”.
Builds on: [M2004; Trans AMS].
Hard proof: mixes probability, algorithms and statistical physics.

Proof Sketch: Logarithmic reconstruction
Two parts of the proof:
I. Statistical / algorithmic.
II. Probability / statistical physics.
Part I: by the forest result, we may recover a forest containing 90% of the edges of the tree from $O(\log n)$ samples. This step does not use the condition $2\theta^2 > 1$.

Logarithmic Reconstruction
II. Here we use the condition that $2\theta^2 > 1$ in order to estimate the characters at the inner nodes of the forest.
“Like” I.

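Ising model on binary trees
[Phase diagram by temperature, annotated with the sequence length k needed for reconstruction]
Low temperature: $2\theta(e)^2 > 1$ for all $e$. “Non-extremality”: bias, and bias under a “typical” boundary. $k = O(\log n)$.
Intermediate temperature: $2\theta(e) > 1$ and $2\theta(e)^2 \le 1$ for all $e$. “Extremality”: bias, but no bias under a “typical” boundary. $k = \Omega(n^c)$.
High temperature: $2\theta(e) \le 1$ for all $e$. Unique Gibbs measure: no bias. $k = \Omega(n^c)$.
In all regimes: most of the tree from $k = O(\log n)$.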

Many more challenges to come …
We know very little …
We don’t understand the methods used in practice:
Maximum Likelihood (NP-hard on arbitrary data; [Chor-Tuller05; Roch05]).
Markov Chain Monte Carlo (can be exponentially slow on mixtures; M-Vigoda05).
In what sense is Parsimony = Maximum Likelihood? (2 conjectures by Steel)
Other mutation models: rates across sites, gene order, etc.
Plus all the problems on Gibbs measures on trees.