“Inferring Phylogenies” Joseph Felsenstein Excellent reference


Phylogenetics

What is a phylogeny?

Different Representations
Cladogram - branching pattern only
Phylogram - branch lengths are estimated and drawn proportional to the amount of change along the branch
Rooted - implies directionality of change
Unrooted - does not
How do you root a tree?

What is a phylogeny used for?

Estimate a Phylogeny
Sp1 ACCGTCTTGTTA
Sp2 AGCGTCATCAAA
Sp4 ACCGTCTTGATA
Sp5 AGCCTCTTCATA

Working Tree sp2 sp1 c2 sp3 sp5 sp4

Working Tree sp2 sp1 c2 sp3 c4 sp5 sp4

Working Tree sp2 sp1 c7 c2 sp3 c4 sp5 sp4

Working Tree sp2 sp1 c7 c2 sp3 c4 c9 sp5 sp4

Working Tree sp2 sp1 c10 c7 c2 sp3 c4 c9 sp5 sp4

Final Tree sp2 sp1 c10 c11 c2 c7 sp3 c4 c9 sp5 sp4

What optimality criteria do we use then?
Parsimony
Likelihood
Bayesian
Distance methods?

Parsimony
Why should we choose a specific grouping?
Maximum parsimony: we should accept the hypothesis that explains the data most simply and efficiently.
“Parsimony is simply the most robust criterion for choosing between competing scientific hypotheses. It is not a statement about how evolution may or may not have taken place” (1)
(1) Kitching, I. J.; Forey, P. L.; Humphries, J. & Williams, D. M. 1998. Cladistics: the theory and practice of parsimony analysis. The Systematics Association Publication No. 11.

Parsimony
An optimality criterion that chooses the topology with the fewest transformations of character states
Optimizes one component: tree topology (pattern based)
Most parsimonious tree: the one (or several) with the minimum number of evolutionary changes (smallest tree length)
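The number of changes that parsimony minimizes can be counted with Fitch's algorithm. The block below is a minimal sketch rather than the deck's own procedure: the nested-tuple tree encoding, the function names `fitch_site` and `tree_length`, and the two candidate topologies are assumptions made for illustration. The four sequences are the ones shown on the "Estimate a Phylogeny" slide.

```python
def fitch_site(tree, states):
    """Return (possible ancestral states, changes) for one character.

    `tree` is a leaf name (str) or a 2-tuple of subtrees (hypothetical encoding).
    `states` maps each leaf name to its state at this character.
    """
    if isinstance(tree, str):
        return {states[tree]}, 0
    left_set, left_changes = fitch_site(tree[0], states)
    right_set, right_changes = fitch_site(tree[1], states)
    common = left_set & right_set
    if common:
        # The children agree on at least one state: no extra change needed.
        return common, left_changes + right_changes
    # The children disagree: take the union and count one change.
    return left_set | right_set, left_changes + right_changes + 1


def tree_length(tree, alignment):
    """Sum the Fitch change counts over every character in the alignment."""
    n_sites = len(next(iter(alignment.values())))
    total = 0
    for i in range(n_sites):
        states = {name: seq[i] for name, seq in alignment.items()}
        total += fitch_site(tree, states)[1]
    return total


alignment = {
    "Sp1": "ACCGTCTTGTTA",
    "Sp2": "AGCGTCATCAAA",
    "Sp4": "ACCGTCTTGATA",
    "Sp5": "AGCCTCTTCATA",
}

# Two hypothetical topologies to compare (not taken from the slides).
tree_a = (("Sp1", "Sp4"), ("Sp2", "Sp5"))
tree_b = (("Sp1", "Sp2"), ("Sp4", "Sp5"))

print(tree_length(tree_a, alignment), tree_length(tree_b, alignment))
```

Under these assumptions `tree_a` needs 6 changes and `tree_b` needs 8, so parsimony would prefer `tree_a` of the two.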

Reconstructing trees via sequence data
[Figure: a matrix of six characters for taxa O, A, B, C and D, with the character changes mapped onto the tree: 1. T=>A; 2. G=>A; 3. T=>C; 4. A=>G; 4. A=>C; 5. A=>GAP; 6. T=>G]
Tree length = 8

Neighbor-joining Method

NJ distance matrices

Finished NJ tree
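One step of neighbor-joining can be sketched as follows. This is an illustration under stated assumptions, not the matrices from the slides: distances are raw mismatch counts between the four sequences of the "Estimate a Phylogeny" slide (a real analysis would usually use model-corrected distances), the function names are hypothetical, and ties in the Q criterion are broken by taking the first pair found.

```python
from itertools import combinations


def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    return sum(x != y for x, y in zip(a, b))


def nj_first_join(alignment):
    """Return (distances, Q values, pair to join) for the first NJ step."""
    names = list(alignment)
    n = len(names)
    d = {
        frozenset(pair): hamming(alignment[pair[0]], alignment[pair[1]])
        for pair in combinations(names, 2)
    }
    # Sum of distances from each taxon to all the others.
    row_sum = {
        i: sum(d[frozenset((i, j))] for j in names if j != i) for i in names
    }
    # NJ criterion: Q(i, j) = (n - 2) * d(i, j) - r_i - r_j; join the minimum.
    q = {
        (i, j): (n - 2) * d[frozenset((i, j))] - row_sum[i] - row_sum[j]
        for i, j in combinations(names, 2)
    }
    best = min(q, key=q.get)
    return d, q, best


alignment = {
    "Sp1": "ACCGTCTTGTTA",
    "Sp2": "AGCGTCATCAAA",
    "Sp4": "ACCGTCTTGATA",
    "Sp5": "AGCCTCTTCATA",
}

d, q, best = nj_first_join(alignment)
print(best, q[best])
```

The full method would replace the joined pair with a new node, recompute the distances to that node, and repeat until the tree is resolved.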

Models of Evolution
Pyrimidines: T, C
Purines: A, G
Transitions: changes within a class (A <-> G, C <-> T)
Transversions: changes between a purine and a pyrimidine
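A transition swaps one purine for another or one pyrimidine for another, while a transversion swaps between the two classes. The sketch below counts both kinds of difference between two aligned sequences. The function name `count_ts_tv` is hypothetical, and the code assumes the sequences contain only A, C, G and T (no gaps or ambiguity codes).

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def count_ts_tv(seq1, seq2):
    """Return (transitions, transversions) between two aligned sequences."""
    transitions = 0
    transversions = 0
    for a, b in zip(seq1, seq2):
        if a == b:
            continue
        if {a, b} <= PURINES or {a, b} <= PYRIMIDINES:
            transitions += 1  # change within one chemical class
        else:
            transversions += 1  # purine <-> pyrimidine
    return transitions, transversions


# Sp1 and Sp2 from the "Estimate a Phylogeny" slide.
print(count_ts_tv("ACCGTCTTGTTA", "AGCGTCATCAAA"))
```

Many substitution models treat the two classes of change differently, which is why the split is worth counting.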

Maximum Likelihood
Base frequencies: fA + fG + fC + fT = 1
Base exchange: fs + fv = 1
R-matrix: the six rate parameters sum to 1
Gamma shape parameter
Number of discrete gamma-distribution categories
Pinvar: fvar + finv = 1
Likelihood: $L = \prod_i l_i$, where $i$ runs over the characters

Maximum Likelihood
L = Pr(D|H)
[Figure: a tree with tip states C, G, G, A and G, internal nodes w, x, y and z, and branch lengths t1 to t8]
The likelihood is not the probability that the tree is the true tree; rather, it is the probability that the tree has given rise to the data we collected. Likelihood requires three elements. What are they? We've talked about two, the data and the tree (hypothesis); the third is the model of evolution.

ML cont.
The probability that the nucleotide at time t is i is given by [equation missing]
The probability that the nucleotide at time t is j, j ≠ i, is given by [equation missing]
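As one concrete instance of such probabilities, the sketch below assumes the Jukes-Cantor (JC69) model, in which all substitutions are equally likely. The slide does not name the model it uses, so this choice, the function name `jc69_probs`, and the parametrization by branch length `d` (expected substitutions per site) are all assumptions.

```python
import math


def jc69_probs(d):
    """Jukes-Cantor probabilities after a branch of length d.

    d is measured in expected substitutions per site.
    Returns (p_same, p_diff): the probability of ending in the starting
    nucleotide, and the probability of ending in one specific other nucleotide.
    """
    decay = math.exp(-4.0 * d / 3.0)
    p_same = 0.25 + 0.75 * decay
    p_diff = 0.25 - 0.25 * decay
    return p_same, p_diff


for d in (0.0, 0.1, 1.0, 10.0):
    p_same, p_diff = jc69_probs(d)
    # One "same" outcome plus three "different" outcomes always sum to 1.
    print(d, round(p_same, 4), round(p_diff, 4), round(p_same + 3 * p_diff, 4))
```

As the branch gets longer, both probabilities approach 1/4, meaning the end state carries no information about the start state.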

Bayes Theorem

$$\mathrm{Prob}(H \mid D) = \frac{\mathrm{Prob}(H)\,\mathrm{Prob}(D \mid H)}{\mathrm{Prob}(D)}, \qquad \mathrm{Prob}(D) = \sum_H P(H)\,P(D \mid H)$$

H = Hypothesis, D = Data
Prob(H|D): the conditional probability of H given D, the posterior probability
Prob(H): the prior probability, or marginal probability of H
Prob(D|H): the likelihood function
Prob(D): the marginal probability of D, the normalizing constant that ensures $\sum_H P(H \mid D) = 1$
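A small numerical case makes the normalizing constant concrete. The sketch below applies Bayes' theorem to three competing hypotheses. The hypothesis names, the flat prior, and the likelihood values are made up for illustration and do not come from any real data set.

```python
def posterior(priors, likelihoods):
    """Return P(H|D) for every hypothesis H, given P(H) and P(D|H)."""
    # Marginal probability of the data: sum over H of P(H) * P(D|H).
    evidence = sum(priors[h] * likelihoods[h] for h in priors)
    return {h: priors[h] * likelihoods[h] / evidence for h in priors}


# Hypothetical values: a flat prior over three trees and made-up likelihoods.
priors = {"tree1": 1 / 3, "tree2": 1 / 3, "tree3": 1 / 3}
likelihoods = {"tree1": 0.02, "tree2": 0.01, "tree3": 0.01}

post = posterior(priors, likelihoods)
print(post)
print(sum(post.values()))
```

With a flat prior the posterior is simply the likelihood rescaled to sum to 1, so `tree1` receives half of the posterior probability here.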

Take Home Message
Likelihood: represents the probability of the data given the hypothesis => difficult to interpret
Bayes approach: estimates the probability of the hypothesis given the data => estimates the probability for the hypothesis of interest

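Bayesian Inference of Phylogeny

$$f(\tau_i \mid X) = \frac{f(\tau_i)\, f(X \mid \tau_i)}{\sum_{j=1}^{B(s)} f(\tau_j)\, f(X \mid \tau_j)}$$

Calculating the posterior probability (pP) of a tree involves a summation over all possible trees and, for each tree, integration over all combinations of branch lengths (bl) and substitution-model parameter values:

$$f(\tau_i, \nu_i, \theta \mid X) = \frac{f(\tau_i, \nu_i, \theta)\, f(X \mid \tau_i, \nu_i, \theta)}{\sum_{j=1}^{B(s)} \int_{\nu} \int_{\theta} f(\tau_j, \nu_j, \theta)\, f(X \mid \tau_j, \nu_j, \theta)\, d\nu\, d\theta}$$

Inferences of any single parameter are based on the marginal distribution of the parameter:

$$f(\tau_i \mid X) = \frac{\int_{\nu} \int_{\theta} f(\tau_i, \nu_i, \theta)\, f(X \mid \tau_i, \nu_i, \theta)\, d\nu\, d\theta}{\sum_{j=1}^{B(s)} \int_{\nu} \int_{\theta} f(\tau_j, \nu_j, \theta)\, f(X \mid \tau_j, \nu_j, \theta)\, d\nu\, d\theta}$$

Here $\tau_i$ is the $i$-th tree topology, $\nu_i$ its branch lengths, $\theta$ the substitution-model parameters, $X$ the data, and $B(s)$ the number of possible trees for $s$ species.

This marginal probability distribution of the topology, for example, integrates out all the other parameters.
Advantage: the power of the analysis is focused on the parameter of interest (i.e., the topology of the tree)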

Estimating phylogenies
Exhaustive Searches
Branch and bound methods
Rise in computational time versus rise in solution space

How many topologies are there?
When we add a species to a tree, the number of ways in which we can do that is equal to the number of branches, including the branch at the bottom of the tree. There are 3 such branches in a two-species tree. Every time that we add a new species, it adds a new interior node, plus two new branches. Thus after choosing one of the 3 possible places to add the third species, the fourth can be added in any of 5 places, the fifth in any of 7, and so on.
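The counting argument above multiplies out to 3 x 5 x 7 x ... x (2n - 3) rooted bifurcating trees for n species. A minimal sketch of that product follows; the function name `count_rooted_trees` is hypothetical.

```python
def count_rooted_trees(n_species):
    """Number of rooted bifurcating trees for n labeled species.

    The k-th species (k >= 3) can be attached to any of the 2k - 3
    branches of the tree built from the first k - 1 species.
    """
    if n_species < 2:
        raise ValueError("need at least two species")
    count = 1
    for k in range(3, n_species + 1):
        count *= 2 * k - 3
    return count


for n in (2, 3, 4, 5, 10, 20):
    print(n, count_rooted_trees(n))
```

The count already exceeds 34 million at 10 species, which is why exhaustive searches stop being practical for all but small data sets.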

The Phylogenetic Problem

HIV-1 Whole Genomes
1993 - 15
2003 (JAN) - 397
The two trees represent complete HIV-1 genomes (limited to those with over 7000 bp sequenced) from the Los Alamos National Labs HIV database (http://hiv-web.lanl.gov/). The sparse tree represents the genomes sequenced in 1993 or earlier (determined by the sequence submission date to GenBank, not the publication date, since several seemed to be cobbled together from multiple sources). There were 15 genomes by 1993, mostly from subtype B with a few from subtype D, giving a final alignment of 8097 characters. The dense tree represents 397 complete HIV-1 genomes, the current complement of genomes available. The search on the Los Alamos database came up with 416 genomes, but a few were deleted during the alignment process because of stretches of questionable sequence. The final alignment length was 8583 characters. Both trees are color coded by subtype and major groups of recombinants. Both trees were constructed using Neighbor Joining, with the results of Modeltest providing the model of evolution for tree construction.

Tree Space - the final frontier

Heuristic Searches
Nearest-neighbor interchanges (NNI) - swap two adjacent branches on the tree
Subtree pruning and regrafting (SPR) - remove a branch from the tree (either an interior or an exterior branch) with a subtree attached to it. The subtree is then reinserted into the remaining tree in all possible places
Tree bisection and reconnection (TBR) - an interior branch is broken, and the two resulting fragments of the tree are considered as separate trees. All possible connections are made between a branch of one and a branch of the other
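The simplest of these rearrangements, NNI, can be sketched on a single interior branch. Around any interior branch there are four subtrees, and swapping one subtree across the branch gives two alternative arrangements. The tuple encoding and the function name `nni_neighbors` below are assumptions made for illustration; a real search would apply this at every interior branch and score each result.

```python
def nni_neighbors(tree):
    """Return the two NNI rearrangements around one interior branch.

    `tree` has the form ((a, b), (c, d)), where a, b, c and d are the four
    subtrees (or leaf names) hanging off the two ends of the branch.
    """
    (a, b), (c, d) = tree
    return [
        ((a, c), (b, d)),  # swap b and c across the branch
        ((a, d), (b, c)),  # swap b and d across the branch
    ]


start = (("sp1", "sp2"), ("sp3", "sp4"))
for neighbor in nni_neighbors(start):
    print(neighbor)
```

SPR and TBR explore larger neighborhoods than this, at the cost of evaluating many more candidate trees per step.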

Other approaches
Tree-fusing - find two near-optimal trees and exchange subgroups between the two trees
Genetic Algorithms - a simulation of evolution with a genotype that describes the tree and a fitness function that reflects the optimality of the tree
Disc Covering - upcoming paper

Phylogenetic Accuracy?
Consistency - A phylogenetic method is consistent for a given evolutionary model if the method converges on the correct tree as the data available to the method become infinite. All methods are consistent when their assumptions (explicit and implicit) are met, and all methods are inconsistent when these assumptions are violated sufficiently.
Efficiency - Statistical efficiency is a measure of how quickly a method converges on the correct solution as more data are applied to the problem. In the case of phylogenetic methods, efficiency may be measured in terms of the number of characters required to find the correct solution at a given frequency or in terms of the frequency of correct solutions at a given sample size.
Robustness - Robustness refers to the degree to which violations of assumptions will affect the performance of phylogenetic methods. All methods are based on explicit and/or implicit assumptions about the evolutionary process, and yet we know these assumptions are violated to one degree or another in real data.

How reliable is MY phylogeny?
Bootstrap Analysis
Jackknife Analysis
Posterior Probabilities (Bayesian Approaches)
Decay Indices

Bootstrap

Pseudoreplicates
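A bootstrap pseudoreplicate is built by sampling alignment columns with replacement until the replicate has as many columns as the original. The sketch below shows only that resampling step. The function name `bootstrap_replicate`, the seed, and the choice of 100 replicates are assumptions, and the sequences are the four from the "Estimate a Phylogeny" slide.

```python
import random


def bootstrap_replicate(alignment, rng):
    """Resample alignment columns with replacement, keeping the same length."""
    n_sites = len(next(iter(alignment.values())))
    columns = [rng.randrange(n_sites) for _ in range(n_sites)]
    # Every sequence uses the same sampled columns, so columns stay intact.
    return {
        name: "".join(seq[i] for i in columns)
        for name, seq in alignment.items()
    }


alignment = {
    "Sp1": "ACCGTCTTGTTA",
    "Sp2": "AGCGTCATCAAA",
    "Sp4": "ACCGTCTTGATA",
    "Sp5": "AGCCTCTTCATA",
}

rng = random.Random(0)  # fixed seed so the sketch is repeatable
replicates = [bootstrap_replicate(alignment, rng) for _ in range(100)]
print(replicates[0])
```

In a full analysis a tree would be estimated from each pseudoreplicate, and the support for a group would be the fraction of those trees that contain it.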