Tree Searching Methods Exhaustive search (exact) Branch-and-bound search (exact) Heuristic search methods (approximate) –Stepwise addition –Branch swapping.

Presentation transcript:

Tree Searching Methods
Exhaustive search (exact)
Branch-and-bound search (exact)
Heuristic search methods (approximate)
–Stepwise addition
–Branch swapping
–Star decomposition

Exhaustive Search

Searching for trees: generation of all possible trees
1. Generate all 3 trees for first 4 taxa:

Searching for trees
2. Generate all 15 trees for first 5 taxa: (likewise for each of the other two 4-taxon trees)

Searching for trees
3. Full search tree:
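The counts quoted on these slides (3 trees for 4 taxa, 15 for 5) follow from the fact that a new taxon can attach to any branch of the current unrooted tree. A minimal Python sketch of that count, assuming unrooted, fully bifurcating trees; the function name is illustrative only:

```python
def num_unrooted_trees(n_taxa: int) -> int:
    """Number of distinct unrooted, fully bifurcating trees on n labeled taxa."""
    if n_taxa < 3:
        raise ValueError("need at least 3 taxa")
    count = 1  # exactly one unrooted tree for 3 taxa
    # A tree with k-1 taxa has 2(k-1)-3 = 2k-5 branches,
    # and taxon k can be attached to any one of them.
    for k in range(4, n_taxa + 1):
        count *= 2 * k - 5
    return count


for n in (3, 4, 5, 6, 10):
    print(n, num_unrooted_trees(n))
```

The count grows so quickly that exhaustive search is only practical for roughly a dozen taxa, which is why the exact and heuristic shortcuts that follow matter.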

Searching for trees
Branch and bound algorithm: The search tree is the same as for exhaustive search, with tree lengths for a hypothetical data set shown in boldface type. If a tree lying at a node of this search tree has a length that exceeds the current upper bound on the optimal tree length (the length of the shortest full tree found so far), this path of the search tree is terminated (indicated by a cross-bar), and the algorithm backtracks and takes the next available path. Pruning is safe under parsimony because adding taxa can never make a tree shorter. When a tip of the search tree is reached (i.e., when we arrive at a tree containing the full set of taxa), the tree is either optimal (and hence retained) or suboptimal (and rejected). When all paths leading from the initial 3-taxon tree have been explored, the algorithm terminates, and all most-parsimonious trees will have been identified. Asterisks indicate points at which the current upper bound is reduced. Circled numbers represent the order in which phylogenetic trees are visited in the search tree.
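The procedure described above can be sketched in Python. This is a minimal illustration, not PAUP*'s implementation: it assumes unordered (Fitch) parsimony, represents an unrooted tree as nested tuples hung from the first taxon, and starts with an infinite bound rather than one supplied by a heuristic search. All names and the toy alignment are hypothetical.

```python
def fitch_steps(tree, site):
    """Return (state set, steps) for one site using Fitch's algorithm."""
    if not isinstance(tree, tuple):
        return {site[tree]}, 0
    left_states, left_steps = fitch_steps(tree[0], site)
    right_states, right_steps = fitch_steps(tree[1], site)
    shared = left_states & right_states
    if shared:
        return shared, left_steps + right_steps
    return left_states | right_states, left_steps + right_steps + 1


def tree_length(tree, seqs):
    """Parsimony length: Fitch steps summed over all sites."""
    n_sites = len(next(iter(seqs.values())))
    return sum(
        fitch_steps(tree, {taxon: seq[j] for taxon, seq in seqs.items()})[1]
        for j in range(n_sites)
    )


def insertions(subtree, taxon):
    """Yield every tree obtained by attaching taxon to one branch of subtree."""
    yield (subtree, taxon)  # the branch above this node
    if isinstance(subtree, tuple):
        left, right = subtree
        for bigger in insertions(left, taxon):
            yield (bigger, right)
        for bigger in insertions(right, taxon):
            yield (left, bigger)


def branch_and_bound(seqs):
    """Return (minimum length, all most-parsimonious trees)."""
    taxa = list(seqs)
    best = {"length": float("inf"), "trees": []}

    def extend(subtree, next_index):
        tree = (taxa[0], subtree)
        length = tree_length(tree, seqs)
        if length > best["length"]:
            return  # exceeds the bound: terminate this path and backtrack
        if next_index == len(taxa):
            if length < best["length"]:
                best["length"], best["trees"] = length, []  # bound is reduced
            best["trees"].append(tree)
            return
        for bigger in insertions(subtree, taxa[next_index]):
            extend(bigger, next_index + 1)

    extend((taxa[1], taxa[2]), 3)  # the single 3-taxon tree
    return best["length"], best["trees"]


# Hypothetical toy alignment, for illustration only.
toy = {"A": "AAGG", "B": "AAGC", "C": "CAGC", "D": "CTTC", "E": "CTTG"}
print(branch_and_bound(toy))
```

Ties are kept (the path is cut only when the length strictly exceeds the bound), so every most-parsimonious tree is returned.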

Stepwise Addition (in a nutshell)

Searching for trees: Stepwise addition
A greedy stepwise-addition search applied to the example used for branch-and-bound. The best 4-taxon tree is determined by evaluating the lengths of the three trees obtained by joining taxon D to tree 1 containing only the first three taxa. Taxa E and F are then connected to the five and seven possible locations, respectively, on trees 4 and 9, with only the shortest trees found during each step being used for the next step. In this example, the 233-step tree obtained is not a global optimum. Circled numbers indicate the order in which phylogenetic trees are evaluated in the stepwise-addition search.

Stepwise Addition Variants
As Is –add taxa in the order found in the matrix
Closest –add the unplaced taxon that requires the smallest increase in tree length
Furthest –add the unplaced taxon that requires the largest increase in tree length
Simple –Farris’s (1970) “simple algorithm” uses a set of pairwise reference distances
Random –a random permutation of the taxa is used to select the order

Branch swapping: Nearest Neighbor Interchange (NNI)
[Figure: trees on taxa A–E illustrating NNI rearrangements]
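As a rough illustration of NNI, the sketch below enumerates the rearrangements of a tree written as nested tuples. It assumes the unrooted tree is written as `(outgroup, subtree)` and that only `subtree` is passed in; each internal branch has four subtrees around it, and swapping subtrees across the branch gives two alternative arrangements. The function names are hypothetical.

```python
def nni_neighbors(subtree):
    """Yield every tree one NNI away, rearranging internal branches of subtree."""
    if not isinstance(subtree, tuple):
        return
    left, right = subtree
    if isinstance(left, tuple):   # internal branch between this node and `left`
        x, y = left
        yield ((right, y), x)     # swap x with `right`
        yield ((x, right), y)     # swap y with `right`
    if isinstance(right, tuple):  # internal branch between this node and `right`
        x, y = right
        yield (x, (left, y))      # swap x with `left`
        yield (y, (x, left))      # swap y with `left`
    for neighbor in nni_neighbors(left):
        yield (neighbor, right)
    for neighbor in nni_neighbors(right):
        yield (left, neighbor)


def canonical(tree):
    """Order-independent form of a tree, so equal topologies compare equal."""
    if not isinstance(tree, tuple):
        return tree
    return frozenset(canonical(child) for child in tree)


# Five taxa: the full tree is ("A", start), with A acting as the outgroup.
start = (("B", "C"), ("D", "E"))
for neighbor in nni_neighbors(start):
    print(("A", neighbor))
```

An unrooted tree with n taxa has n - 3 internal branches, so this yields 2(n - 3) neighbors: four for the five-taxon example.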

Branch swapping: Subtree Pruning and Regrafting (SPR)
[Figure: trees on taxa A–G illustrating an SPR rearrangement]

Branch swapping: Tree Bisection and Reconnection (TBR)
[Figure: trees on taxa A–G illustrating TBR rearrangements]

Reconnection limits in TBR Reconnection distances:

Reconnection limits in TBR
In PAUP*, use “ReconLim” to set the maximum reconnection distance

Star-decomposition search

Overview of maximum likelihood as used in phylogenetics
Overall goal: Find a tree topology (and associated parameter estimates) that maximizes the probability of obtaining the observed data, given a model of evolution
Likelihood(hypothesis) ∝ Prob(data | hypothesis)
Likelihood(tree,model) = k Prob(observed sequences|tree,model) [not Prob(tree | data,model)]

Computing the likelihood of a single tree
Aligned sequences, sites 1 … j … N:
(1) C…GGACA…C…GTTTA…C
(2) C…AGACA…C…CTCTA…C
(3) C…GGATA…A…GTTAA…C
(4) C…GGATA…G…CCTAG…C
[Figure: tree with tips (1)–(4) carrying the site-j states C, C, A, G and internal nodes (5) and (6)]

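Computing the likelihood of a single tree
Likelihood at site j, summing over every assignment of states to the two internal nodes (second and third arguments):
$L_j = \mathrm{Prob}(\mathrm{CCAG}, \mathrm{A}, \mathrm{A}) + \mathrm{Prob}(\mathrm{CCAG}, \mathrm{A}, \mathrm{C}) + \dots + \mathrm{Prob}(\mathrm{CCAG}, \mathrm{T}, \mathrm{T})$
But use Felsenstein (1981) pruning algorithm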
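To make the site-likelihood sum concrete, here is a hedged Python sketch that computes it two ways for the site pattern CCAG: by brute-force enumeration of the internal-node states, and by pruning (conditional likelihoods passed from the tips toward the root). It assumes the Jukes-Cantor model, equal base frequencies of 0.25, and an arbitrary branch length of 0.1 on every branch; none of these values come from the slides.

```python
import math
from itertools import product

STATES = "ACGT"


def jc_prob(i, j, t):
    """Jukes-Cantor probability of ending in state j after branch length t, from i."""
    e = math.exp(-4.0 * t / 3.0)
    return 0.25 + 0.75 * e if i == j else 0.25 - 0.25 * e


def partials(node, site):
    """Conditional likelihoods: Prob(tip data below node | node has state s)."""
    if isinstance(node, str):  # a tip: certain about the observed state
        return {s: 1.0 if s == site[node] else 0.0 for s in STATES}
    result = {s: 1.0 for s in STATES}
    for child, t in node:      # an internal node is a tuple of (child, length)
        child_partials = partials(child, site)
        for s in STATES:
            result[s] *= sum(jc_prob(s, x, t) * child_partials[x] for x in STATES)
    return result


def site_likelihood(tree, site):
    """Pruning: weight the root partials by the base frequencies."""
    root = partials(tree, site)
    return sum(0.25 * root[s] for s in STATES)


def brute_force(site, t):
    """Sum over all 16 state assignments at internal nodes (5) and (6)."""
    total = 0.0
    for x5, x6 in product(STATES, repeat=2):
        total += (0.25 * jc_prob(x6, x5, t)
                  * jc_prob(x5, site["1"], t) * jc_prob(x5, site["2"], t)
                  * jc_prob(x6, site["3"], t) * jc_prob(x6, site["4"], t))
    return total


T = 0.1  # assumed branch length (expected changes per site)
node5 = (("1", T), ("2", T))
tree = ((node5, T), ("3", T), ("4", T))  # rooted at internal node (6)
site_j = {"1": "C", "2": "C", "3": "A", "4": "G"}
print(site_likelihood(tree, site_j), brute_force(site_j, T))
```

Both routes give the same number; the difference is that the brute-force sum grows exponentially with the number of internal nodes, while pruning does a fixed amount of work per node.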

Computing the likelihood of a single tree Note: PAUP* reports -ln L, so lower -ln L implies higher likelihood

Finding the maximum-likelihood tree (in principle)
Evaluate the likelihood of each possible tree for a given collection of taxa.
Choose the tree topology which maximizes the likelihood over all possible trees.

Probability calculations require…
An explicit model of substitution that specifies change probabilities for a given branch length (“instantaneous rate matrix”):
–Jukes-Cantor
–Kimura 2-parameter
–Hasegawa-Kishino-Yano (HKY)
–Felsenstein 1981, 1984
–General time-reversible
An estimate of optimal branch lengths in units of expected amount of change (= rate x time)

For example: Jukes-Cantor (1969); Kimura (1980) “2-parameter”; Hasegawa-Kishino-Yano (1985); General Time-Reversible

E.g., transition probabilities for HKY and F84:

A Family of Reversible Substitution Models

The Relevance of Branch Lengths
[Figure: two trees, each with tip states CCAAAAAAAA]

When does maximum likelihood work better than parsimony?
When you’re in the “Felsenstein Zone” (Felsenstein, 1978)
[Figure: four-taxon tree on A, B, C, D]

In the Felsenstein Zone
Substitution rates: [rate matrix over A, C, G, T; values not recoverable]
Base frequencies: A=0.1 C=0.2 G=0.3 T=0.4
[Figure: four-taxon tree on A, B, C, D]

In the Felsenstein Zone
[Plot: proportion correct vs. sequence length, for parsimony and ML-GTR]

The long-branch attraction (LBA) problem
Site patterns on the true phylogeny of 1, 2, 3 and 4, classified by the states at taxa 1 and 4 (all other states shown are A):
I: 1 = A, 4 = A. Uninformative (constant); zero changes required on any tree
II: 1 = A, 4 = G. Uninformative; one change required on any tree
III: 1 = C, 4 = G. Uninformative; two changes required on any tree
IV: 1 = G, 4 = G. Misinformative; two changes required on true tree

The long-branch attraction (LBA) problem
[Figure: tree grouping taxa 1 (G) and 4 (G) apart from taxa 2 (A) and 3 (A)]
… but this tree needs only one step
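The step counts for pattern types I to IV can be checked with a small Python sketch. It assumes taxa 2 and 3 both carry state A (as in the one-step tree above) and finds the minimum number of changes on each of the three possible four-taxon trees by trying every pair of internal-node states. The function and variable names are illustrative.

```python
from itertools import product

STATES = "ACGT"

# The three unrooted four-taxon topologies, written as two pairs of sister taxa.
TOPOLOGIES = {
    "12|34": ((1, 2), (3, 4)),
    "13|24": ((1, 3), (2, 4)),
    "14|23": ((1, 4), (2, 3)),
}


def min_changes(tips, split):
    """Fewest changes needed on a four-taxon tree, by enumerating internal states."""
    (i, j), (k, l) = split
    best = None
    for x, y in product(STATES, repeat=2):  # states at the two internal nodes
        changes = ((tips[i] != x) + (tips[j] != x) + (x != y)
                   + (tips[k] != y) + (tips[l] != y))
        best = changes if best is None else min(best, changes)
    return best


# Pattern types I-IV: states at taxa 1 and 4 vary; taxa 2 and 3 assumed to be A.
PATTERNS = {
    "I": {1: "A", 2: "A", 3: "A", 4: "A"},
    "II": {1: "A", 2: "A", 3: "A", 4: "G"},
    "III": {1: "C", 2: "A", 3: "A", 4: "G"},
    "IV": {1: "G", 2: "A", 3: "A", 4: "G"},
}

for name, tips in PATTERNS.items():
    print(name, {label: min_changes(tips, split) for label, split in TOPOLOGIES.items()})
```

Only pattern IV distinguishes among the trees, and it favors the tree that groups taxa 1 and 4, which is why parallel changes on two long branches mislead parsimony.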

Concerns about statistical properties and suitability of models (assumptions)
Consistency: If an estimator converges to the true value of a parameter as the amount of data increases toward infinity, the estimator is consistent.

When do both methods fail?
When there is insufficient phylogenetic signal...

When does parsimony work “better” than maximum likelihood?
When you’re in the Inverse-Felsenstein (“Farris”) zone (Siddall, 1998)
[Figure: four-taxon tree on A, B, C, D]

Siddall (1998) parameter space
[Figure: parameter space with axes p_a and p_b, each running from 0 to 0.75 and corresponding to branches labeled a and b. Regions: “Both methods do poorly”, “Parsimony has higher accuracy than likelihood”, “Both methods do well”]

Parsimony vs. likelihood in the Inverse-Felsenstein Zone
[Plot: accuracy vs. sequence length (up to 100,000 sites) for parsimony and ML/JC; values of 15% and 67.5% (expected differences/site) are marked]

Why does parsimony do so well in the Inverse-Felsenstein zone?
[Figure: trees with tip states A and C, contrasting a true synapomorphy with apparent synapomorphies that are actually due to misinterpreted homoplasy]

Parsimony vs. likelihood in the Felsenstein Zone
[Plot: accuracy vs. sequence length (up to 100,000 sites) for parsimony and ML/JC; values of 15% and 67.5% (expected differences/site) are marked]

From the Farris Zone to the Felsenstein Zone
[Figure: a series of four-taxon trees on A, B, C, D]
External branches = 0.5 or 0.05 substitutions/site, Jukes-Cantor model of nucleotide substitution

Simulation results: Parsimony vs. Likelihood

Maximum likelihood models are oversimplifications of reality. If I assume the wrong model, won’t my results be meaningless?
Not necessarily (maximum likelihood is pretty robust)

Model used for simulation...
Substitution rates: [rate matrix over A, C, G, T; values not recoverable]
Base frequencies: A=0.1 C=0.2 G=0.3 T=0.4
[Figure: four-taxon tree on A, B, C, D]

Performance of ML when its model is violated (one example)

Among site rate heterogeneity
Proportion of invariable sites –Some sites don’t change due to strong functional or structural constraint (Hasegawa et al., 1985)
Site-specific rates –Different relative rates assumed for pre-assigned subsets of sites
Gamma-distributed rates –Rate variation assumed to follow a gamma distribution with shape parameter α
Equal rates? Example alignment:
Lemur AAGCTTCATAG TTGCATCATCCA …TTACATCATCCA
Homo AAGCTTCACCG TTGCATCATCCA …TTACATCCTCAT
Pan AAGCTTCACCG TTACGCCATCCA …TTACATCCTCAT
Goril AAGCTTCACCG TTACGCCATCCA …CCCACGGACTTA
Pongo AAGCTTCACCG TTACGCCATCCT …GCAACCACCCTC
Hylo AAGCTTTACAG TTACATTATCCG …TGCAACCGTCCT
Maca AAGCTTTTCCG TTACATTATCCG …CGCAACCATCCT

Performance of ML when its model is violated (another example)
Modeling among-site rate variation with a gamma distribution...
[Plot: frequency vs. rate for gamma distributions with α = 0.5, 2, 50 and 200]
…can also estimate a proportion of “invariable” sites (p_inv)
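For a feel of how the shape parameter changes the rate distribution, the following Python sketch evaluates a gamma density at a few rates for the four α values shown. It assumes the usual convention that the distribution is scaled to have mean 1 (rate parameter equal to α), so the variance is 1/α; the function name is illustrative.

```python
import math


def gamma_rate_density(rate: float, alpha: float) -> float:
    """Density at `rate` (> 0) of a gamma distribution with shape alpha and mean 1."""
    # Work on the log scale: alpha**alpha and Gamma(alpha) overflow for large alpha.
    log_density = (alpha * math.log(alpha)
                   + (alpha - 1.0) * math.log(rate)
                   - alpha * rate
                   - math.lgamma(alpha))
    return math.exp(log_density)


for alpha in (0.5, 2.0, 50.0, 200.0):
    row = [round(gamma_rate_density(r, alpha), 4) for r in (0.1, 0.5, 1.0, 2.0)]
    print(f"alpha={alpha}: density at rates 0.1, 0.5, 1, 2 -> {row}")
```

Small α puts most sites at very low rates with a long tail of fast sites, while large α concentrates the density near 1, which approaches the equal-rates assumption.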

Performance of ML when its model is violated (another example)

“MODERATE”–Felsenstein zone

“MODERATE”–Inverse- Felsenstein zone

Bayesian Inference in Phylogenetics
Uses Bayes formula:
$\Pr(\theta \mid D) = \dfrac{\Pr(D \mid \theta)\,\Pr(\theta)}{\Pr(D)} \propto \Pr(D \mid \theta)\,\Pr(\theta) \propto L(\theta)\,\Pr(\theta)$
Calculation involves integrating over all tree topologies and model-parameter values, subject to assumed prior distribution on parameters ($\theta$ = tree topology, branch-lengths, and substitution-model parameters)

Bayesian Inference in Phylogenetics
To approximate this posterior density (complicated multidimensional integral) we use Markov chain Monte Carlo (MCMC)
–Simulated Markov chain in which transition probabilities are assigned such that the stationary distribution of the chain is the posterior density of interest
–E.g., Metropolis-Hastings algorithm: Accept a proposed move from one state $\theta$ to another state $\theta^*$ with probability min(r,1) where
$r = \dfrac{\Pr(\theta^* \mid D)\,\Pr(\theta \mid \theta^*)}{\Pr(\theta \mid D)\,\Pr(\theta^* \mid \theta)}$
–Sample chain at regular intervals to approximate posterior distribution
MrBayes (by John Huelsenbeck and Fredrik Ronquist) is most popular Bayesian inference program

A brief intro to Markov chain Monte Carlo (MCMC)
1. Initialize the chain, e.g., by picking a random state $X_0$ (topology, branch lengths, substitution-model parameters) from the assumed prior distribution
2. For each time t, sample a new candidate state Y from some proposal distribution $q(\cdot \mid X_t)$ (e.g., change branch lengths or topology plus branch lengths). Calculate acceptance probability
3. If Y is accepted, let $X_{t+1} = Y$; otherwise let $X_{t+1} = X_t$
If the chain is run “long enough”, the stationary distribution of states in the chain will represent a good approximation to the target distribution (in this case, the Bayesian posterior)
[Figure: likelihood vs. iterations, with the initial “burn in” phase marked; the chain moves among four-taxon topologies AB|CD, AC|BD and BC|AD]
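The three steps can be sketched on a deliberately tiny state space. The Python example below assumes the only unknown is which of the three four-taxon topologies is correct, uses made-up unnormalized posterior weights (6 : 3 : 1), and proposes one of the other two topologies at random. That proposal is symmetric, so the proposal ratio in r cancels. It is a toy illustration, not how MrBayes proposes moves.

```python
import random

# Hypothetical unnormalized posterior weights, L(theta) * Pr(theta), per topology.
UNNORMALIZED_POSTERIOR = {"AB|CD": 6.0, "AC|BD": 3.0, "AD|BC": 1.0}


def metropolis_hastings(n_steps, burn_in=1000, seed=1):
    """Return the post-burn-in sampling frequency of each topology."""
    rng = random.Random(seed)
    states = list(UNNORMALIZED_POSTERIOR)
    current = rng.choice(states)               # step 1: initialize the chain
    counts = {state: 0 for state in states}
    for step in range(n_steps):
        # Step 2: propose one of the other two topologies with equal probability.
        candidate = rng.choice([s for s in states if s != current])
        # Symmetric proposal, so r reduces to the ratio of posterior weights;
        # the unknown normalizing constant Pr(D) cancels as well.
        r = UNNORMALIZED_POSTERIOR[candidate] / UNNORMALIZED_POSTERIOR[current]
        # Step 3: accept with probability min(r, 1), otherwise stay put.
        if rng.random() < min(r, 1.0):
            current = candidate
        if step >= burn_in:                    # discard the burn-in samples
            counts[current] += 1
    total = sum(counts.values())
    return {state: counts[state] / total for state in states}


freqs = metropolis_hastings(50_000)
print(freqs)
```

With enough steps the sampled frequencies should approach the normalized weights (0.6, 0.3, 0.1), even though the sampler only ever evaluates ratios.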

Model-based distances
Can also calculate pairwise distances based on these models
These distances estimate the number of substitutions per site that have accumulated since the two sequences shared a common ancestor, allowing for superimposed substitutions (“multiple hits”)
E.g.:
–Jukes-Cantor distance
–Kimura 2-parameter distance
–General maximum-likelihood distances available for other models
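As a hedged illustration of the first two distances, the Python sketch below uses the standard closed-form corrections: Jukes-Cantor from the proportion p of differing sites, and Kimura 2-parameter from the proportions of transitions (P) and transversions (Q). The two sequence fragments are the opening Lemur and Homo blocks from the example alignment, used only as sample input.

```python
import math


def p_distance(seq1: str, seq2: str) -> float:
    """Observed proportion of sites that differ between two aligned sequences."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must be aligned to the same length")
    return sum(a != b for a, b in zip(seq1, seq2)) / len(seq1)


def jukes_cantor_distance(p: float) -> float:
    """Expected substitutions per site under Jukes-Cantor, corrected for multiple hits."""
    if p >= 0.75:
        raise ValueError("distance is undefined for p >= 0.75")
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


def kimura_2p_distance(transitions: float, transversions: float) -> float:
    """Kimura 2-parameter distance from transition (P) and transversion (Q) proportions."""
    P, Q = transitions, transversions
    return -0.5 * math.log(1.0 - 2.0 * P - Q) - 0.25 * math.log(1.0 - 2.0 * Q)


lemur = "AAGCTTCATAG"
homo = "AAGCTTCACCG"
p = p_distance(lemur, homo)
print(p, jukes_cantor_distance(p))
```

The corrected distance is always at least as large as the observed proportion, and the gap widens as sequences diverge, because more of the substitutions are hidden by later ones at the same site.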

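Distance-based optimality criteria: “Additive trees”
[Figure: four-taxon tree with terminal branches a (taxon 1), b (taxon 2), d (taxon 3), e (taxon 4) and internal branch c]
$p_{12} = a + b$
$p_{13} = a + c + d$
$p_{14} = a + c + e$
$p_{23} = b + c + d$
$p_{24} = b + c + e$
$p_{34} = d + e$
$p_{ij} = d_{ij}$ for all i and j if the tree topology is correct and distances are additive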
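One consequence of these path-length equations is the four-point condition: of the three ways to pair up the taxa, the pairing that matches the tree gives the smallest sum of path lengths, and the other two sums are equal and larger by 2c. A small Python check, using arbitrary branch lengths chosen only for illustration:

```python
def path_lengths(a, b, c, d, e):
    """Tip-to-tip path lengths on the tree ((1,2),(3,4)), branches as labelled above."""
    return {
        (1, 2): a + b,
        (1, 3): a + c + d,
        (1, 4): a + c + e,
        (2, 3): b + c + d,
        (2, 4): b + c + e,
        (3, 4): d + e,
    }


def four_point_sums(p):
    """Sum of the two path lengths for each of the three possible pairings of taxa."""
    return {
        "12|34": p[(1, 2)] + p[(3, 4)],
        "13|24": p[(1, 3)] + p[(2, 4)],
        "14|23": p[(1, 4)] + p[(2, 3)],
    }


# Arbitrary illustrative branch lengths: a, b, c, d, e.
sums = four_point_sums(path_lengths(1, 2, 3, 4, 5))
print(sums)
```

With real (non-additive) distances the two larger sums will not match exactly, which is why the criteria on the next slide score how far the observed distances fall from additivity.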

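Distance-based optimality criteria: Minimum evolution and least-squares
Least-squares (LS): prefer the tree that minimizes SS, the sum of squared differences between the observed distances $d_{ij}$ and the path lengths $p_{ij}$
Minimum evolution (ME): prefer the tree that minimizes the sum of the LS branch lengths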
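In symbols, the two criteria are usually written as below. This is a sketch of the standard forms rather than a reconstruction of the slide's own notation: the weights $w_{ij}$ and the branch-length estimates $\hat{v}_k$ are symbols introduced here for illustration.

```latex
% Least-squares: weighted squared misfit between observed distances d_ij
% and path lengths p_ij on the tree (w_ij = 1 for unweighted LS,
% w_ij = 1/d_ij^2 for the Fitch-Margoliash weighting).
SS = \sum_{i<j} w_{ij}\,\bigl(d_{ij} - p_{ij}\bigr)^{2}

% Minimum evolution: total tree length, summing the least-squares
% branch-length estimates over all branches k of the tree.
ME = \sum_{k} \hat{v}_{k}
```

Both criteria fit branch lengths by least squares; they differ in what is compared across trees, the residual misfit for LS and the total length for ME.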