Selecting Genomes for Reconstruction of Ancestral Genomes Louxin Zhang Department of Mathematics National University of Singapore.

Presentation transcript:

Selecting Genomes for Reconstruction of Ancestral Genomes Louxin Zhang Department of Mathematics National University of Singapore

Boreoeutherian Ancestor

The Genome Selection for Reconstruction problem Instance: A phylogeny P of a set of genomes, an integer k, and a reconstruction method T (say, parsimony). Solution: The k genomes in the phylogeny that give the highest accuracy in reconstructing the ancestral genome at the root of the phylogeny using method T.

Two reasons (1) It is often impossible to sequence all descendant genomes below an ancestor. (2) More taxa do not necessarily give a higher accuracy for the reconstruction of ancestral character states in general (examples will be given below).

Outline Introduction to reconstruction accuracy analysis More genomes are not necessarily better for reconstruction Greedy algorithms for genome selection Joint work with G. Li, J. Ma and M. Steel

1. Reconstruction and its Accuracy There are different methods for reconstructing the ancestral character states: parsimony; maximum likelihood methods (Koshi & Goldstein’96, Yang et al’95); Bayesian methods (Yang et al’95). In this work, we study the problem with Fitch parsimony and maximum likelihood in the Jukes-Cantor evolutionary model.

Jukes-Cantor Model Characters evolve by a symmetric, reversible Markov process. Probability of a substitution change of any sort is the same on a branch. For simplicity, we assume there are two states 0 and 1.
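As a minimal sketch of the two-state symmetric model described above, each branch can be summarized by a single conservation probability p (the probability that the state is unchanged along the branch). The function names `transition_matrix`, `compose`, and `matmul` are illustrative and do not come from the slides.

```python
def transition_matrix(p):
    """2x2 transition matrix of a branch with conservation probability p."""
    return [[p, 1 - p], [1 - p, p]]


def compose(p1, p2):
    """Conservation probability of two consecutive branches.

    The end state equals the start state if both branches conserve the
    state or both branches flip it.
    """
    return p1 * p2 + (1 - p1) * (1 - p2)


def matmul(a, b):
    """Plain 2x2 matrix product, used only to cross-check compose()."""
    return [[sum(a[i][k] * b[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]


# Two branches with p = 0.9 and p = 0.7 behave like one branch with p = 0.66.
two_step = matmul(transition_matrix(0.9), transition_matrix(0.7))
```

Composing branches this way is what turns a root-to-leaf path into a single effective branch, which is the quantity used when a reconstruction relies on one genome only.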

Reconstruction Accuracy In the symmetric Jukes-Cantor model, the reconstruction accuracy of a method is independent of the prior distribution of the states at the root.

D denotes a state configuration at the leaf nodes: it has one state for each leaf. For n leaves there are $2^n$ state configurations, since there are 2 possible states at each leaf. I(0, D, K) is 1 if the method K reconstructs state 0 from D and 0 otherwise. Pr(D|0) is the probability that 0 at the root evolves into D.

The reconstruction accuracy is the sum of the generating probabilities of the state configurations that allow the true state 0 to be recovered by the method K:
$$A_K = \sum_{D} \Pr(D \mid 0)\, I(0, D, K)$$
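The definition can be evaluated directly on small trees by enumerating every leaf configuration. The sketch below assumes a hypothetical tree encoding (a leaf is a string name, an internal node is a list of `(child, p)` pairs with p the conservation probability on the branch to that child) and treats the reconstruction method K as any function from a configuration to a state.

```python
from itertools import product


def leaves(tree):
    """Leaf names in left-to-right order."""
    if isinstance(tree, str):
        return [tree]
    return [name for child, _ in tree for name in leaves(child)]


def prob_config(tree, state, config):
    """Pr(D | state): probability that `state` at this node evolves into D."""
    if isinstance(tree, str):
        return 1.0 if config[tree] == state else 0.0
    total = 1.0
    for child, p in tree:
        # The child keeps the parent's state with probability p, flips with 1 - p.
        total *= (p * prob_config(child, state, config)
                  + (1 - p) * prob_config(child, 1 - state, config))
    return total


def accuracy(tree, method):
    """Sum of Pr(D | 0) over all configurations D from which `method` returns 0."""
    names = leaves(tree)
    acc = 0.0
    for states in product((0, 1), repeat=len(names)):
        config = dict(zip(names, states))
        if method(config) == 0:  # this is the indicator I(0, D, K)
            acc += prob_config(tree, 0, config)
    return acc


# Toy method (an assumption for illustration): copy the state of leaf Y.
toy_tree = [("Y", 0.9), ("Z", 0.6)]
copy_y = lambda config: config["Y"]
```

With this toy method the accuracy is simply the conservation probability on the branch to Y. The enumeration is exponential in the number of leaves, so it is only a reference implementation for small examples.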

Previous Analysis Works Simulation study (Martins’99, Mooers’04, Salisbury & Kim’01, Zhang & Nei’97, Yang et al’95); Theoretical study (Mossel’01, Lucena and Haussler’05, Maddison’95)

Fitch Method Given a state configuration of the leaves, the Fitch method reconstructs a subset of states at each internal node recursively, from the leaves to the root. [Figure: example tree with nodes A, B, and C, leaf states 0, 1, 0, and reconstructed sets {0, 1} and {0}.]
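A minimal sketch of the bottom-up pass for binary trees follows. It uses the standard Fitch rule (take the intersection of the two child sets if it is nonempty, otherwise their union). The tree encoding (nested pairs of leaf names) and the three-leaf example are assumptions made for illustration, not a transcription of the figure.

```python
def fitch(tree, config):
    """Return the Fitch state set at the root of `tree`.

    `tree` is either a leaf name (str) or a pair (left, right) of subtrees;
    `config` maps each leaf name to its observed state.
    """
    if isinstance(tree, str):
        return frozenset({config[tree]})
    left, right = (fitch(child, config) for child in tree)
    common = left & right
    # Intersection if the children agree on some state, otherwise union.
    return common if common else left | right


# Leaves x = 0 and y = 1 give {0, 1} at their parent; with z = 0 the root is {0}.
example = (("x", "y"), "z")
root_set = fitch(example, {"x": 0, "y": 1, "z": 0})
```

A root set with a single state is an unambiguous reconstruction; a root set of {0, 1} leaves the root state undecided.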

Calculating the Reconstruction Accuracy of Fitch Method The unambiguous reconstruction accuracy: $P_{\text{Accuracy}} = P[\{1\} \mid 1] = P[\{0\} \mid 0]$, where $P[\{1\} \mid 1]$ is the probability that the Fitch method outputs the true state at the root. $P[\{0\} \mid 1]$, $P[\{1\} \mid 1]$, and $P[\{0, 1\} \mid 1]$ can be calculated by a dynamic programming approach (Maddison, 1995).
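The following is a sketch of one way to organize such a dynamic program for binary trees with two states; it is written in the spirit of the cited approach and is not claimed to match Maddison's formulation. The tree encoding (a leaf is a string, an internal node is a pair of `(child, p)` tuples) and the function names are assumptions.

```python
from itertools import product

S0, S1, S01 = frozenset({0}), frozenset({1}), frozenset({0, 1})


def root_set_dist(tree):
    """Return d with d[s][X] = P[Fitch set at this node is X | true state s]."""
    if isinstance(tree, str):
        return {s: {S0: float(s == 0), S1: float(s == 1), S01: 0.0}
                for s in (0, 1)}
    (left, p_l), (right, p_r) = tree
    d_l, d_r = root_set_dist(left), root_set_dist(right)
    out = {}
    for s in (0, 1):
        # Distribution of each child's Fitch set given the parent state s.
        m_l = {X: p_l * d_l[s][X] + (1 - p_l) * d_l[1 - s][X] for X in (S0, S1, S01)}
        m_r = {X: p_r * d_r[s][X] + (1 - p_r) * d_r[1 - s][X] for X in (S0, S1, S01)}
        acc = {S0: 0.0, S1: 0.0, S01: 0.0}
        for X, Y in product(m_l, m_r):
            acc[(X & Y) or (X | Y)] += m_l[X] * m_r[Y]  # Fitch rule
        out[s] = acc
    return out


def ambiguous_accuracy(tree):
    """P[{0}|0] + 1/2 P[{0,1}|0]: a tie at the root is resolved by a fair coin."""
    d = root_set_dist(tree)
    return d[0][S0] + 0.5 * d[0][S01]


cherry = (("a", 0.9), ("b", 0.9))
unbalanced = (("Y", 0.9), ((("z1", 0.8), ("z2", 0.8)), 0.7))
```

Each node is visited once, so the cost is linear in the number of nodes, in contrast to enumerating all leaf configurations.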

Outline Introduction to reconstruction accuracy analysis More genomes are not necessarily better for reconstruction accuracy Greedy algorithms for genome selection

2. Reconstruction accuracy is not a monotone function of the size of taxon sampling Unbalanced tree: there is a large clade with a long stem and a short single sister lineage. Such a phylogeny is used when both the fossil record and data from extant species are used for reconstruction (Finarelli and Flynn, 2006).

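Theorem 1: $A_{\text{parsimony}} < p_1$ if $\tfrac{1}{2} < p_2 \le p_1$.
$p_1$ is the conservation probability on AY; $p_2$ is the conservation probability on AZ.
[Figure: root A with children Y and Z; Y has state 0 and the Fitch set at Z is {0} or {0, 1}.]
Proof.
$$P_A[\{0\} \mid 0] = \Pr_{AY}[0 \to 0] \times \bigl( \Pr_{AZ}[0 \to 0]\, P_Z[\{0\} \text{ or } \{0,1\} \mid 0] + \Pr_{AZ}[0 \to 1]\, P_Z[\{0\} \text{ or } \{0,1\} \mid 1] \bigr)$$
$$= p_1 \bigl( p_2 (1 - P_Z[\{1\} \mid 0]) + (1 - p_2)(1 - P_Z[\{1\} \mid 1]) \bigr)$$
$$= p_1 \bigl( 1 - p_2 P_Z[\{1\} \mid 0] - (1 - p_2) P_Z[\{1\} \mid 1] \bigr)$$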

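$p_1$ is the conservation probability on AY; $p_2$ is the conservation probability on AZ.
[Figure: root A with children Y and Z; the Fitch set at A is {0, 1} when Y has state 0 and the set at Z is {1}, or when Y has state 1 and the set at Z is {0}.]
$$P_A[\{0,1\} \mid 0] = [\,p_1 p_2 + (1 - p_1)(1 - p_2)\,] \times P_Z[\{1\} \mid 0] + [\,p_1 (1 - p_2) + p_2 (1 - p_1)\,] \times P_Z[\{0\} \mid 0]$$
$$A_{\text{parsimony}} = P_A[\{0\} \mid 0] + \tfrac{1}{2} P_A[\{0,1\} \mid 0] = p_1 + \tfrac{1}{2}(1 - p_1 - p_2)\, P_Z[\{1\} \mid 0] + \tfrac{1}{2}(p_2 - p_1)\, P_Z[\{0\} \mid 0] < p_1 \quad \text{when } \tfrac{1}{2} < p_2 \le p_1$$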
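The algebra in the proof can be checked numerically. The sketch below evaluates the two component probabilities and the closed form separately; the function names and the sample values (a subtree Z that is a cherry with conservation probability 0.8 on each branch, so that P_Z[{1}|0] = 0.04 and P_Z[{0}|0] = 0.64) are assumptions chosen for illustration.

```python
def p_unambiguous(p1, p2, a, b):
    """P_A[{0}|0], with a = P_Z[{1}|0] and b = P_Z[{0}|0] = P_Z[{1}|1]."""
    return p1 * (1 - p2 * a - (1 - p2) * b)


def p_tie(p1, p2, a, b):
    """P_A[{0,1}|0]: Y and the Fitch set at Z disagree."""
    return ((p1 * p2 + (1 - p1) * (1 - p2)) * a
            + (p1 * (1 - p2) + p2 * (1 - p1)) * b)


def closed_form(p1, p2, a, b):
    """p1 + 1/2 (1 - p1 - p2) a + 1/2 (p2 - p1) b."""
    return p1 + 0.5 * (1 - p1 - p2) * a + 0.5 * (p2 - p1) * b


p1, p2 = 0.9, 0.7
a, b = 0.2 ** 2, 0.8 ** 2  # cherry below Z with conservation probability 0.8
direct = p_unambiguous(p1, p2, a, b) + 0.5 * p_tie(p1, p2, a, b)
```

For these values the parsimony accuracy with the whole tree is about 0.824, which is below the accuracy p1 = 0.9 obtained from leaf Y alone.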

The reconstruction accuracy on comb-shaped trees in the limit case

Theorem 2: $A_{\text{ML}} = p_1$ if $\tfrac{1}{2} < p_2 \le p_1$.
$p_1$ is the conservation probability on AY; $p_2$ is the conservation probability on AZ.
[Figure: root A with children Y and Z; Y has state 0, the leaves below Z have configuration $D_Z$, and the state at A (0 or 1?) is to be reconstructed.]
$D_Z$: a state configuration below Z. $\Pr_A(0D_Z \mid s)$: the probability that s at A evolves into the state configuration $0D_Z$, s = 0, 1.
Marginal ML method:
$$\Pr_A(0D_Z \mid 0) = p_1 \times [\,p_2 \Pr_Z(D_Z \mid 0) + (1 - p_2) \Pr_Z(D_Z \mid 1)\,]$$
$$\Pr_A(0D_Z \mid 1) = (1 - p_1) \times [\,(1 - p_2) \Pr_Z(D_Z \mid 0) + p_2 \Pr_Z(D_Z \mid 1)\,]$$
$$\Pr_A(0D_Z \mid 0) - \Pr_A(0D_Z \mid 1) = (p_1 + p_2 - 1) \Pr_Z(D_Z \mid 0) + (p_1 - p_2) \Pr_Z(D_Z \mid 1) > 0$$
The marginal ML outputs 0 at A iff the state at Y is 0.
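A brute-force check of this statement on a small instance is sketched below. The tree encoding (a leaf is a string, an internal node is a list of `(child, p)` pairs), the choice of a two-leaf subtree below Z, and the even split of ties are assumptions; ties do not occur for the parameters used here.

```python
from itertools import product


def prob_config(tree, state, config):
    """Pr(D | state) for the leaves below `tree` under the symmetric model."""
    if isinstance(tree, str):
        return 1.0 if config[tree] == state else 0.0
    total = 1.0
    for child, p in tree:
        total *= (p * prob_config(child, state, config)
                  + (1 - p) * prob_config(child, 1 - state, config))
    return total


def marginal_ml_accuracy(tree, names):
    """Probability that the root state maximizing Pr(D | s) is the true state 0."""
    acc = 0.0
    for states in product((0, 1), repeat=len(names)):
        config = dict(zip(names, states))
        like0 = prob_config(tree, 0, config)
        like1 = prob_config(tree, 1, config)
        if like0 > like1:
            acc += like0
        elif like0 == like1:
            acc += 0.5 * like0  # assumption: ties are broken by a fair coin
    return acc


p1, p2, q = 0.9, 0.7, 0.8
tree = [("Y", p1), ([("z1", q), ("z2", q)], p2)]
result = marginal_ml_accuracy(tree, ["Y", "z1", "z2"])
```

For these parameters the accuracy equals p1, so the two extra leaves below Z do not change what marginal ML reconstructs at A.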

Another Example showing the Non-monotone Property of Reconstruction Accuracy

Simulation Experiment setup Yule birth-death model Conservation probability along branches: 0.5~1 Count the number of random trees in which the ambiguous accuracy of using a single (longest or shortest) path is better than that from the full phylogeny

Simulation results: Counting the bad trees +: using the shortest path

Comparison of Parsimony, Joint ML and Marginal ML 500 random trees with 12 leaves generated: Yule birth-death model branch length is uniform from 0 to 1 MML outperforms JML, MP. In 80% of instances, MML is strictly better than JML In 99% of instances, JML is strictly better than MP.

Outline Introduction to reconstruction accuracy analysis More genomes are not necessarily better for reconstruction accuracy Greedy algorithms for genome selection problem

Genome selection for reconstruction: the problem Instance: A phylogeny P over n genomes, an integer k, and a reconstruction method T. Question: Find k genomes that allow the ancestral genome at the root of P to be reconstructed with the maximum accuracy using method T.

Our approaches The genome selection problem is unlikely to be polynomial-time solvable (no hardness proof yet). As a result, we propose two greedy algorithms for the problem: a forward greedy algorithm and a backward greedy algorithm.

Forward Greedy Algorithm (S is the set of selected genomes)
1. Set S ← ∅;
2. For i = 1, 2, · · ·, k do
2.1) for each genome g not in S, compute the accuracy A(g) of the reconstruction by applying method T to S ∪ {g};
2.2) add g with the max accuracy A(g) to S;
3. Output S
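The forward steps can be sketched as follows. The `accuracy` argument is a hypothetical caller-supplied function that maps a set of genomes to the reconstruction accuracy of method T on that set; the additive toy accuracy and the genome names are placeholders for illustration only.

```python
def forward_greedy(genomes, k, accuracy):
    """Select k genomes by repeatedly adding the one that maximizes accuracy."""
    if k > len(genomes):
        raise ValueError("k exceeds the number of genomes")
    selected = []
    for _ in range(k):
        candidates = [g for g in genomes if g not in selected]
        # Step 2.1 and 2.2: evaluate S ∪ {g} for every candidate, keep the best.
        best = max(candidates, key=lambda g: accuracy(frozenset(selected) | {g}))
        selected.append(best)
    return selected


# Toy accuracy (an assumption): each genome contributes a fixed amount.
weights = {"human": 0.30, "mouse": 0.20, "dog": 0.25, "cow": 0.10}
toy_accuracy = lambda s: sum(weights[g] for g in s)
picked = forward_greedy(list(weights), 2, toy_accuracy)
```

Each round evaluates the accuracy once per remaining genome, so the procedure makes on the order of n·k accuracy evaluations. Ties are broken by the order of `genomes`.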

Backward Greedy Algorithm
1. Let S contain all the given genomes;
2. For i = 1, 2, · · ·, n − k do
2.1) for each genome g in S, compute the accuracy A’(g) of the reconstruction by applying T to S − {g};
2.2) remove from S the genome g whose A’(g) is the max over all g’s;
3. Output S
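The backward steps can be sketched in the same style. As before, `accuracy` is a hypothetical caller-supplied function from a set of genomes to the reconstruction accuracy of method T, and the additive toy accuracy is only a placeholder.

```python
def backward_greedy(genomes, k, accuracy):
    """Start from all genomes and drop one at a time until k remain."""
    selected = list(genomes)
    while len(selected) > k:
        # Step 2.1 and 2.2: drop the genome whose removal leaves the most
        # accurate remaining set S − {g}.
        drop = max(selected, key=lambda g: accuracy(frozenset(selected) - {g}))
        selected.remove(drop)
    return selected


# Toy accuracy (an assumption): each genome contributes a fixed amount.
weights = {"human": 0.30, "mouse": 0.20, "dog": 0.25, "cow": 0.10}
toy_accuracy = lambda s: sum(weights[g] for g in s)
kept = backward_greedy(list(weights), 2, toy_accuracy)
```

On an additive toy accuracy the forward and backward procedures agree; on real accuracy functions, which need not be monotone in the selected set, the two can return different selections.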

Validation test – Trees with the same height Experiment setup Random trees with N (9, or 16) leaves generated by program Evolver in PAML with the following parameters: Birth rate=10; Death rate=5; Sampling fraction=1. Tree height = 0.1, 0.2, 0.5, 1, 2, or 5.

Performance of the selection method for reconstruction with Parsimony

Performance of the selection method for reconstruction with Marginal Maximum Likelihood

Performance of the selection method for reconstruction with Joint Maximum Likelihood

Marginal Maximum Likelihood

Parsimony Method

Concluding remarks Reconstruction accuracy is not monotonically increasing with the taxon sampling size in unbalanced trees for the parsimony method, which is another kind of “inconsistency”. 1. One implication of this observation is that the parsimony and ML methods might not exploit the full power of incorporating the fossil record into current data; hence, modifications might be needed. 2. Caution should be used in drawing conclusions when testing hypotheses with ancestral state reconstruction.

3. Is the reconstruction accuracy function monotone in ultrametric phylogenies? It seems to be true when the number of taxa is large. Consider the complete binary tree: when the conservation probability on each branch is less than 7/8, (the ambiguous reconstruction accuracy) = (the accuracy of using just one taxon) = 1/2 in the limit case. (A formula exists; see Steel’89.)

Concluding remarks We formulate the genome selection for reconstruction problem. Two greedy algorithms are proposed for the problem. The validation test shows that the reconstruction accuracy obtained with the genomes selected by the greedy algorithms is comparable to the maximum reconstruction accuracy.

Thank You!

A Biological Example Boreoeutherian ancestor From Encode project 4 states at leaf nodes Expected accuracy at the root node

A Biological Example – Results The backward algorithm always performs similarly to the exhaustive search. With 8 leaf nodes, the accuracy from the backward algorithm is 93.6%, close to the accuracy of 94.6% with the full phylogeny.

Outline Introduction to phylogeny reconstruction accuracy analysis More genomes are not necessarily better for reconstruction accuracy Greedy algorithms for genome selection problem Validation test Conclusion


Fitch Parsimony method Given character states at the leaf nodes, the method reconstructs a subset of states at each internal node by the following rule: [Figure: example with leaf states 0, 1, 0 and reconstructed sets {0, 1} and {0}.]

More Genomes Are Not Necessarily Better – An example with 4 leaves The complete tree

More Genomes Are Not Necessarily Better – An example with 4 leaves The unambiguous reconstruction accuracy of using one genome is $P_{\text{path}} = p^2 + (1-p)^2$; the unambiguous reconstruction accuracy of using all 4 genomes is $P_{\text{whole}} = P_{\text{path}} - 3p^2(1-p)^2$. More genomes give more noise!
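The two expressions can be checked by enumeration. The sketch below assumes the complete binary tree with 4 leaves, the same conservation probability p on every branch, and the standard Fitch rule (intersection if nonempty, otherwise union); the function names are illustrative.

```python
from itertools import product


def combine(x, y):
    """Fitch rule for two child sets: intersection if nonempty, else union."""
    return (x & y) or (x | y)


def p_path(p):
    """Accuracy of copying a single leaf two branches below the root."""
    return p ** 2 + (1 - p) ** 2


def p_whole(p):
    """Unambiguous Fitch accuracy at the root of the complete 4-leaf tree."""
    def branch(parent, child):
        return p if parent == child else 1 - p

    total = 0.0
    # y, z are the hidden states of the two internal nodes; a, b, c, d the leaves.
    for y, z, a, b, c, d in product((0, 1), repeat=6):
        prob = (branch(0, y) * branch(y, a) * branch(y, b)
                * branch(0, z) * branch(z, c) * branch(z, d))
        left = combine(frozenset({a}), frozenset({b}))
        right = combine(frozenset({c}), frozenset({d}))
        if combine(left, right) == frozenset({0}):
            total += prob
    return total


gap = p_path(0.8) - p_whole(0.8)  # equals 3 p^2 (1 - p)^2 at p = 0.8
```

At p = 0.8, for example, the single path gives 0.68 while the full tree gives 0.6032 under the unambiguous criterion.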

A Small Example with Six Leaves (The ambiguous reconstruction accuracy) < (the unambiguous accuracy on the shortest path) when 0.5 < p < 0.65.

Reconstruction accuracy on complete phylogeny in limit case When conservation rate on each branch is less than 7/8, (The ambiguous reconstruction accuracy) = (the accuracy of using just one genome ) =1/2