1
Copyright N. Friedman, M. Ninio, I. Pe’er, and T. Pupko. 2001. RECOMB, April 2001.
Structural EM for Phylogenetic Inference
Nir Friedman, Computer Science & Engineering, Hebrew University
Matan Ninio, Computer Science & Engineering, Hebrew University
Itsik Pe’er, Computer Science, Tel-Aviv University
Tal Pupko, Inst. of Statistical Mathematics, Tokyo
2
Introduction
- Phylogenetic inference: “reconstruction of the tree of evolution based on DNA/protein sequences of current-day species”
- Maximum likelihood inference
  - Model evolution as a stochastic process
  - Use the likelihood of the observed sequences to evaluate different trees
- Computational task
  - Construct the maximum likelihood tree
- We describe a new procedure that uses a variant of EM to efficiently learn better trees
3
Evolution of a Character over Time
Probability that character a changes to b over time t: $P_{a \to b}(t)$
Assumptions:
- Lack of memory: $P_{a \to c}(t + t') = \sum_b P_{a \to b}(t)\, P_{b \to c}(t')$
- Reversibility: there exist stationary probabilities $\{P_a\}$ such that $P_a\, P_{a \to b}(t) = P_b\, P_{b \to a}(t)$
[Figure: the four nucleotides A, G, T, C]
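To make the two assumptions concrete, here is a minimal sketch that checks them numerically for one specific substitution model. The Jukes-Cantor model for DNA is an assumption chosen for illustration (the slides do not name a model here), and the function names are hypothetical.

```python
import math

ALPHABET = "ACGT"
STATIONARY = {a: 0.25 for a in ALPHABET}  # uniform stationary probabilities P_a


def jc_prob(a, b, t):
    """Jukes-Cantor P_{a->b}(t); t is in expected substitutions per site."""
    p_diff = -0.25 * math.expm1(-4.0 * t / 3.0)  # 0.25 * (1 - exp(-4t/3))
    return 1.0 - 3.0 * p_diff if a == b else p_diff


def lack_of_memory_gap(t1, t2):
    """Largest deviation from P_{a->c}(t1+t2) = sum_b P_{a->b}(t1) P_{b->c}(t2)."""
    return max(
        abs(
            jc_prob(a, c, t1 + t2)
            - sum(jc_prob(a, b, t1) * jc_prob(b, c, t2) for b in ALPHABET)
        )
        for a in ALPHABET
        for c in ALPHABET
    )


def reversibility_gap(t):
    """Largest deviation from P_a P_{a->b}(t) = P_b P_{b->a}(t)."""
    return max(
        abs(STATIONARY[a] * jc_prob(a, b, t) - STATIONARY[b] * jc_prob(b, a, t))
        for a in ALPHABET
        for b in ALPHABET
    )


print(lack_of_memory_gap(0.3, 0.7), reversibility_gap(0.5))
```

Both gaps should be zero up to floating-point rounding for this model; a model that violates either assumption would show a visibly nonzero gap.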
4
Phylogenetic Tree
- Topology: bifurcating
  - Leaves: 1…N
  - Internal nodes: N+1…2N-2
  - Lengths: $t = \{t_{i,j}\}$ for each branch (i,j)
- Phylogenetic tree = (Topology, Lengths)
[Figure: example tree with a leaf, a branch, and an internal node labeled]
5
Joint Distribution
Random variables $X_1, \ldots, X_{2N-2}$ for all nodes.
$x_{[1 \ldots N]}$: observed values of $X_{[1 \ldots N]}$.
Joint distribution: [equation not captured in this transcript]
Marginal distribution: $P(x_{[1 \ldots N]} \mid T,t) = \sum_{x_{N+1}} \cdots \sum_{x_{2N-2}} P(x_{[1 \ldots 2N-2]} \mid T,t)$
- Computation of the marginal distribution: by dynamic programming
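The dynamic-programming computation of the marginal can be sketched with the standard pruning recursion (commonly attributed to Felsenstein). This is an illustration under stated assumptions, not the authors' code: it assumes a Jukes-Cantor model, roots the tree at an arbitrary internal node, and assumes the joint factorizes as the stationary probability at the root times the transition probabilities along the branches. The tree, branch lengths, and function names are hypothetical.

```python
import math

ALPHABET = "ACGT"


def jc_prob(a, b, t):
    """Jukes-Cantor P_{a->b}(t) (assumed model, for illustration only)."""
    p_diff = -0.25 * math.expm1(-4.0 * t / 3.0)
    return 1.0 - 3.0 * p_diff if a == b else p_diff


def conditional(node, children, leaf_chars):
    """Return {a: P(observed leaves below node | X_node = a)}."""
    if node in leaf_chars:
        return {a: 1.0 if a == leaf_chars[node] else 0.0 for a in ALPHABET}
    result = {a: 1.0 for a in ALPHABET}
    for child, length in children[node]:
        below = conditional(child, children, leaf_chars)
        for a in ALPHABET:
            # Sum out the child's character, weighted by the branch transition.
            result[a] *= sum(jc_prob(a, b, length) * below[b] for b in ALPHABET)
    return result


def marginal(root, children, leaf_chars):
    """P(x_[1..N] | T, t): sum over the root character with stationary weights."""
    up = conditional(root, children, leaf_chars)
    return sum(0.25 * up[a] for a in ALPHABET)


# Hypothetical 4-taxon tree: leaves 1..4, internal nodes 5 and 6 (N+1 .. 2N-2).
ROOTED_AT_5 = {5: [(1, 0.1), (2, 0.2), (6, 0.3)], 6: [(3, 0.1), (4, 0.4)]}
ROOTED_AT_6 = {6: [(3, 0.1), (4, 0.4), (5, 0.3)], 5: [(1, 0.1), (2, 0.2)]}
observed = {1: "A", 2: "A", 3: "G", 4: "T"}

p5 = marginal(5, ROOTED_AT_5, observed)
p6 = marginal(6, ROOTED_AT_6, observed)
print(p5, p6)
```

Because the assumed model is reversible, the two rootings should give the same marginal, which is why the choice of root is arbitrary in this sketch.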
6
Maximum Likelihood Reconstruction
Observed data (D): N sequences of length M
Each position: an independent sample from the marginal distribution over the N current-day taxa
Likelihood: given a tree (T,t): $l(T,t:D) = \sum_{m=1}^{M} \log P(x_{[1 \ldots N]}[m] \mid T,t)$
Goal: find a tree (T,t) that maximizes $l(T,t:D)$.
7
Current Approaches
- Perform search over possible topologies
[Figure: candidate topologies $T_1, T_2, T_3, T_4, \ldots, T_n$; each is evaluated by parametric optimization (EM) over its parameter space, which contains local maxima]
8
Computational Problem
- Such procedures are computationally expensive!
- Computing the optimal parameters for each candidate requires a non-trivial optimization step.
- Non-negligible computation is spent on every candidate, even a low-scoring one.
- In practice, such learning procedures can only consider small sets of candidate structures.
9
Structural EM
Idea: use the parameters found for the current topology to help evaluate new topologies.
Outline: perform search in (T, t) space.
- Use EM-like iterations:
  - E-step: use the current solution to compute expected sufficient statistics for all topologies
  - M-step: select a new topology based on these expected sufficient statistics
10
The Complete-Data Scenario
Suppose we observe H, the ancestral sequences.
Define: [equation not captured in this transcript]
Find: the topology T that maximizes [equation not captured in this transcript]
- $S_{i,j}$ is a matrix of the number of co-occurrences of each pair (a,b) in the taxa i, j
- F is a linear function of $S_{i,j}$
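As a rough illustration of the co-occurrence statistics, the sketch below counts $S_{i,j}(a,b)$ for two aligned sequences and evaluates one score that is linear in those counts. The particular score, $\sum_{a,b} S_{i,j}(a,b)\,[\log P_{a \to b}(t) - \log P_b]$, is an assumption (a common decomposition for reversible models); the slide's own definition of F was not captured. The Jukes-Cantor model and all function names are likewise hypothetical.

```python
import math
from collections import Counter


def cooccurrence_counts(seq_i, seq_j):
    """S_{i,j}(a, b): number of positions where taxon i shows a and taxon j shows b."""
    if len(seq_i) != len(seq_j):
        raise ValueError("sequences must be aligned to the same length")
    return Counter(zip(seq_i, seq_j))


def jc_prob(a, b, t):
    """Jukes-Cantor P_{a->b}(t) (assumed model, for illustration only)."""
    p_diff = -0.25 * math.expm1(-4.0 * t / 3.0)
    return 1.0 - 3.0 * p_diff if a == b else p_diff


def local_score(counts, t):
    """Assumed score: sum_{a,b} S(a,b) * (log P_{a->b}(t) - log P_b), with t > 0."""
    return sum(
        n * (math.log(jc_prob(a, b, t)) - math.log(0.25))
        for (a, b), n in counts.items()
    )


counts = cooccurrence_counts("ACGTAC", "ACGTTC")
print(counts[("C", "C")], counts[("A", "T")], local_score(counts, 0.2))
```

Linearity in the counts is what lets the same score be evaluated on expected counts when the ancestral sequences are not observed.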
11
Expected Likelihood
Start with a tree $(T^0, t^0)$
- Compute the expected statistics: $E[S_{i,j} \mid D, T^0, t^0]$
Formal justification:
- Define: $Q(T,t) = E[\log P(D,H \mid T,t) \mid D, T^0, t^0]$
Theorem: $l(T,t:D) - l(T^0,t^0:D) \ge Q(T,t) - Q(T^0,t^0)$
Consequence: improvement in expected score ⇒ improvement in likelihood
12
Proof
Theorem: $l(T,t:D) - l(T^0,t^0:D) \ge Q(T,t) - Q(T^0,t^0)$
- Simple application of Jensen’s inequality
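The Jensen step can be spelled out as follows. This is the standard EM argument, written under the assumption that $Q(T,t)$ denotes the expected complete-data log-likelihood $\sum_H P(H \mid D,T^0,t^0) \log P(D,H \mid T,t)$; the symbol $\mathcal{H}_0$ is introduced here only for brevity.

```latex
\begin{align*}
l(T,t:D)
  &= \log \sum_{H} P(D,H \mid T,t) \\
  &= \log \sum_{H} P(H \mid D,T^0,t^0)\,
       \frac{P(D,H \mid T,t)}{P(H \mid D,T^0,t^0)} \\
  &\ge \sum_{H} P(H \mid D,T^0,t^0)\,
       \log \frac{P(D,H \mid T,t)}{P(H \mid D,T^0,t^0)}
       && \text{(Jensen, concavity of $\log$)} \\
  &= Q(T,t) + \mathcal{H}_0,
  \qquad
  \mathcal{H}_0 = -\sum_{H} P(H \mid D,T^0,t^0) \log P(H \mid D,T^0,t^0).
\end{align*}
% At (T,t) = (T^0,t^0) the ratio inside the log equals P(D | T^0,t^0) for
% every H, so Jensen's inequality holds with equality:
\begin{align*}
l(T^0,t^0:D) &= Q(T^0,t^0) + \mathcal{H}_0 .
\end{align*}
% Subtracting the second line from the first bound cancels H_0:
\begin{align*}
l(T,t:D) - l(T^0,t^0:D) &\ge Q(T,t) - Q(T^0,t^0).
\end{align*}
```

The term $\mathcal{H}_0$ does not depend on $(T,t)$, which is why any increase in $Q$ carries over to the likelihood.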
13
Algorithm Outline
Steps shown: Compute, Weights [step equations not captured in this transcript]
Original tree $(T^0, t^0)$
Unlike standard EM for trees, we compute all possible pairwise statistics.
Time: $O(N^2 M)$
14
Algorithm Outline
Steps shown: Compute, Weights, Find [step equations not captured in this transcript]
Pairwise weights: this stage also computes the branch length for each pair (i,j).
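One way to picture how a weight and a branch length come out of the same computation: if the weight of a pair is taken to be the best achievable local score over all branch lengths, the maximizing length is a by-product. The sketch below assumes a Jukes-Cantor model, where only the (possibly fractional, expected) counts of matching and mismatching positions matter and the optimum has a closed form. The score definition, the model, the `max_length` cap, and the function names are all assumptions for illustration; the slide's weight formula was not captured.

```python
import math


def local_score(n_same, n_diff, t):
    """Assumed score: sum of counts times (log P_{a->b}(t) - log P_b) under Jukes-Cantor."""
    p_diff = -0.25 * math.expm1(-4.0 * t / 3.0)
    score = n_same * (math.log1p(-3.0 * p_diff) - math.log(0.25))
    if n_diff > 0.0:
        log_diff = math.log(p_diff) if p_diff > 0.0 else float("-inf")
        score += n_diff * (log_diff - math.log(0.25))
    return score


def best_branch_length(n_same, n_diff, max_length=10.0):
    """Closed-form maximizer of local_score over t, capped at max_length."""
    total = n_same + n_diff
    if total <= 0.0:
        raise ValueError("need at least one (expected) observation")
    p = n_diff / total  # fraction of mismatching positions
    if p <= 0.0:
        return 0.0
    if p >= 0.75:
        return max_length  # saturated: the optimum is at infinity
    return min(max_length, -0.75 * math.log1p(-4.0 * p / 3.0))


def pairwise_weight(n_same, n_diff):
    """Return (weight, branch length) for one pair of nodes."""
    t = best_branch_length(n_same, n_diff)
    return local_score(n_same, n_diff, t), t


weight, length = pairwise_weight(90.0, 10.0)
print(weight, length)
```

For richer substitution models there is generally no closed form, and the branch length would be found by a one-dimensional numerical search instead.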
15
Algorithm Outline
Steps shown: Compute, Weights, Find, Construct bifurcation $T^1$ [step equations not captured in this transcript]
Max. Spanning Tree: fast greedy procedure to find the tree.
By construction: $Q(T',t') \ge Q(T^0,t^0)$
Thus, $l(T',t') \ge l(T^0,t^0)$
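A maximum spanning tree over pairwise weights can be found greedily. The sketch below uses Kruskal's algorithm with a union-find structure; the slides only say "fast greedy procedure", so the choice of Kruskal's algorithm, the function name, and the example weights are assumptions.

```python
def max_spanning_tree(n_nodes, weights):
    """Kruskal's algorithm; weights maps (i, j) -> w_{i,j}. Returns the chosen edges."""
    parent = list(range(n_nodes))

    def find(x):
        # Find the component representative, compressing the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    tree = []
    # Visit edges from heaviest to lightest; keep an edge if it joins two components.
    for (i, j), _ in sorted(weights.items(), key=lambda kv: kv[1], reverse=True):
        root_i, root_j = find(i), find(j)
        if root_i != root_j:
            parent[root_i] = root_j
            tree.append((i, j))
    return tree


# Hypothetical weights on a complete graph over four nodes.
example_weights = {
    (0, 1): 5.0, (0, 2): 1.0, (0, 3): 0.5,
    (1, 2): 4.0, (1, 3): 2.0, (2, 3): 3.0,
}
edges = max_spanning_tree(4, example_weights)
print(edges, sum(example_weights[e] for e in edges))
```

With all pairwise weights as input, sorting dominates the cost, so the procedure is fast compared with re-optimizing parameters for each candidate topology.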
16
Algorithm Outline
Steps shown: Compute, Weights, Find, Construct bifurcation $T^1$ [step equations not captured in this transcript]
Fix Tree:
- Remove redundant nodes
- Add nodes to break large degree
This operation preserves likelihood: $l(T^1,t') = l(T',t') \ge l(T^0,t^0)$
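The slide does not spell out the individual fix-up operations, so the following is only one plausible reading, sketched under stated assumptions: hidden nodes of degree 1 are dropped, hidden nodes of degree 2 are bypassed by adding the two branch lengths (justified by lack of memory), and any node of too-high degree is split using new hidden nodes joined by zero-length branches. The adjacency-dict representation and the function name are hypothetical.

```python
def fix_tree(adj, leaves):
    """adj: {node: {neighbor: length}} describing a tree; leaves: observed taxa.

    Returns a new adjacency dict in which every leaf has degree 1 and every
    hidden node has degree 3.
    """
    adj = {v: dict(nbrs) for v, nbrs in adj.items()}  # work on a copy
    next_id = max(adj) + 1

    def connect(u, v, length):
        adj[u][v] = length
        adj[v][u] = length

    def disconnect(u, v):
        del adj[u][v]
        del adj[v][u]

    def new_hidden_node():
        nonlocal next_id
        node = next_id
        next_id += 1
        adj[node] = {}
        return node

    # 1. An observed taxon used as an interior node hangs off a new hidden
    #    node through a zero-length branch.
    for leaf in leaves:
        if len(adj[leaf]) > 1:
            hub = new_hidden_node()
            for nbr, length in list(adj[leaf].items()):
                disconnect(leaf, nbr)
                connect(hub, nbr, length)
            connect(hub, leaf, 0.0)

    # 2. Remove redundant hidden nodes (degree 1 or 2) until none remain.
    changed = True
    while changed:
        changed = False
        for v in list(adj):
            if v in leaves:
                continue
            nbrs = list(adj[v].items())
            if len(nbrs) == 1:
                disconnect(v, nbrs[0][0])
                del adj[v]
                changed = True
            elif len(nbrs) == 2:
                (a, len_a), (b, len_b) = nbrs
                disconnect(v, a)
                disconnect(v, b)
                del adj[v]
                connect(a, b, len_a + len_b)
                changed = True

    # 3. Break nodes of degree > 3 by moving two neighbors onto a new hidden node.
    for v in list(adj):
        while len(adj[v]) > 3:
            (a, len_a), (b, len_b) = list(adj[v].items())[:2]
            hub = new_hidden_node()
            disconnect(v, a)
            disconnect(v, b)
            connect(hub, a, len_a)
            connect(hub, b, len_b)
            connect(hub, v, 0.0)
    return adj


star = {
    0: {4: 0.1}, 1: {4: 0.2}, 2: {4: 0.3}, 3: {4: 0.4},
    4: {0: 0.1, 1: 0.2, 2: 0.3, 3: 0.4, 5: 0.5},
    5: {4: 0.5},  # hidden node with a single neighbor: redundant
}
fixed = fix_tree(star, {0, 1, 2, 3})
print(sorted(fixed))
```

A zero-length branch forces both endpoints to carry the same character, which is why such splits should leave the likelihood unchanged under the usual models.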
17
Algorithm Outline
Steps shown: Compute, Weights, Find, Construct bifurcation $T^1$ [step equations not captured in this transcript]
New Tree
Thm: $l(T^1,t^1) \ge l(T^0,t^0)$
These steps are then repeated until convergence.
18
Evaluation
- Comparison to MOLPHY (PROTML)
Evaluation on:
- Synthetic data sets
  - Sampled from a tree we generated
  - Allows us to control the number of taxa and the number of positions
  - Can compare to the “true” generating tree
- Real-life data
19
Number of Positions (48 taxa)
[Figure: two panels plotting log-probability (per position) relative to the original model against the number of positions (10, 100, 1000). Legend: SEMPHY, MOLPHY, Original, Original (no training). One panel is titled “Performance on test data relative to original model”.]
20
Number of Taxa (100 pos)
[Figure: two panels plotting log-probability (per position) relative to the original model against the number of taxa (0 to 100). Legend: SEMPHY, MOLPHY, Original, Original (no training). One panel is titled “Performance on test data relative to original model”.]
21
Run times
[Figure: time in seconds (0 to 2000) against the number of taxa (0 to 100) for SEMPHY and MOLPHY]
22
Real life data
                     Lysozyme     Mitochondrial
# taxa               43           34
# pos                122          3,578
MOLPHY likelihood    -2,916.2     -74,227.9
SEMPHY likelihood    -2,892.1     -70,533.5
Diff per position    0.19         1.03
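The last row appears to be the gap between the two reported likelihood values divided by the number of positions. A quick check of that reading, using only the numbers on the slide (the function name is hypothetical):

```python
def diff_per_position(molphy, semphy, n_positions):
    """Improvement of SEMPHY over MOLPHY, averaged over alignment positions."""
    return (semphy - molphy) / n_positions


lysozyme = diff_per_position(-2916.2, -2892.1, 122)
mitochondrial = diff_per_position(-74227.9, -70533.5, 3578)
print(round(lysozyme, 4), round(mitochondrial, 4))
```

This gives roughly 0.198 and 1.03. The slide's 0.19 for lysozyme is consistent with this reading if the value was truncated rather than rounded.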
23
Discussion
- New algorithmic approach for optimizing the likelihood of models
- SEMPHY: an implementation for protein sequences
  - Incorporates standard models for $P_{a \to b}(t)$
- Early results show that it outperforms current programs for ML reconstruction
  - In terms of running time and solution quality
Work in progress
- Escaping “local” maxima
- More elaborate models of evolution
  - Variable rate
  - Co-evolution
24
Preliminary Results: Annealed Structural EM
[Figure: log-likelihood relative to the optimized original (-0.12 to 0.02) against the number of taxa (10 to 100) for SEMPHY, Anneal SEMPHY, and Original]
25
THE END