Phylogenetics “Inferring Phylogenies” by Joseph Felsenstein: excellent reference
What is a phylogeny?
Different Representations: Cladogram - branching pattern only. Phylogram - branch lengths are estimated and drawn proportional to the amount of change along the branch. Rooted - implies directionality of change. Unrooted - does not. How do you root a tree?
What is a phylogeny used for?
Estimate a Phylogeny Sp1 ACCGTCTTGTTA Sp2 AGCGTCATCAAA Sp4 ACCGTCTTGATA Sp5 AGCCTCTTCATA
Working Tree sp2 sp1 c2 sp3 sp5 sp4
Working Tree sp2 sp1 c2 sp3 c4 sp5 sp4
Working Tree sp2 sp1 c7 c2 sp3 c4 sp5 sp4
Working Tree sp2 sp1 c7 c2 sp3 c4 c9 sp5 sp4
Working Tree sp2 sp1 c10 c7 c2 sp3 c4 c9 sp5 sp4
Final Tree sp2 sp1 c10 c11 c2 c7 sp3 c4 c9 sp5 sp4
What optimality criteria do we use then? Parsimony, Likelihood, Bayesian, Distance methods?
Parsimony Why should we choose a specific grouping? Maximum parsimony: we should accept the hypothesis that explains the data most simply and efficiently. “Parsimony is simply the most robust criterion for choosing between competing scientific hypotheses. It is not a statement about how evolution may or may not have taken place.” [1] [1] Kitching, I. J.; Forey, P. L.; Humphries, C. J. & Williams, D. M. 1998. Cladistics: The Theory and Practice of Parsimony Analysis. The Systematics Association Publication No. 11.
Parsimony An optimality criterion that chooses the topology with the fewest character-state transformations. Optimizes one component: tree topology (pattern based). Most parsimonious tree: the one (or several) with the minimum number of evolutionary changes (smallest tree length).
Reconstructing trees via sequence data [Figure: character matrix for taxa O, A, B, C, D over characters 1-6, and a tree with the state changes mapped onto its branches: 1. T=>A, 2. G=>A, 3. T=>C, 4. A=>C, 4. A=>G, 5. A=>gap, 6. T=>G] Tree length = 8
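Counting tree length: one standard way to count the minimum number of changes on a fixed topology is Fitch's algorithm. The slides do not name a counting method, so this is a swapped-in technique, shown as a minimal sketch. It assumes rooted binary trees written as nested tuples and reuses the four sequences from the Estimate a Phylogeny slide; the function names and the three candidate topologies are hypothetical illustrations.

```python
# Fitch parsimony: minimum number of state changes on a fixed rooted binary tree.
# A tree is either a leaf name (str) or a 2-tuple of subtrees (assumed encoding).

def fitch_site(tree, states):
    """Return (candidate_state_set, change_count) for one character."""
    if isinstance(tree, str):
        return {states[tree]}, 0
    left_set, left_cost = fitch_site(tree[0], states)
    right_set, right_cost = fitch_site(tree[1], states)
    common = left_set & right_set
    if common:
        # Children agree on at least one state: no extra change needed here.
        return common, left_cost + right_cost
    # Children disagree: take the union and count one change.
    return left_set | right_set, left_cost + right_cost + 1


def tree_length(tree, alignment):
    """Sum the per-site Fitch counts over every column of the alignment."""
    n_sites = len(next(iter(alignment.values())))
    total = 0
    for i in range(n_sites):
        column = {name: seq[i] for name, seq in alignment.items()}
        total += fitch_site(tree, column)[1]
    return total


# Sequences copied from the Estimate a Phylogeny slide.
alignment = {
    "Sp1": "ACCGTCTTGTTA",
    "Sp2": "AGCGTCATCAAA",
    "Sp4": "ACCGTCTTGATA",
    "Sp5": "AGCCTCTTCATA",
}

# The three possible groupings of four taxa (hypothetical candidates).
candidates = {
    "((Sp1,Sp4),(Sp2,Sp5))": (("Sp1", "Sp4"), ("Sp2", "Sp5")),
    "((Sp1,Sp2),(Sp4,Sp5))": (("Sp1", "Sp2"), ("Sp4", "Sp5")),
    "((Sp1,Sp5),(Sp2,Sp4))": (("Sp1", "Sp5"), ("Sp2", "Sp4")),
}

lengths = {name: tree_length(t, alignment) for name, t in candidates.items()}
best = min(lengths, key=lengths.get)
for name, length in lengths.items():
    print(name, length)
print("most parsimonious:", best)
```

Under these assumptions the grouping of Sp1 with Sp4 needs the fewest changes, because only it explains the two informative columns with a single change each.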
Neighbor-joining Method
NJ distance matrices
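Building a distance matrix: neighbor joining starts from a matrix of pairwise distances. The distances used on these slides are not preserved in the text, so the sketch below assumes the simplest choice, the uncorrected p-distance (proportion of differing sites), applied to the four sequences from the Estimate a Phylogeny slide. The function names are hypothetical.

```python
from itertools import combinations

# Sequences copied from the Estimate a Phylogeny slide.
alignment = {
    "Sp1": "ACCGTCTTGTTA",
    "Sp2": "AGCGTCATCAAA",
    "Sp4": "ACCGTCTTGATA",
    "Sp5": "AGCCTCTTCATA",
}


def hamming(a, b):
    """Number of aligned positions at which two sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x != y for x, y in zip(a, b))


def p_distance(a, b):
    """Uncorrected distance: proportion of sites that differ."""
    return hamming(a, b) / len(a)


# Pairwise matrix stored as a dict keyed by unordered pairs of names.
distances = {
    frozenset((m, n)): p_distance(alignment[m], alignment[n])
    for m, n in combinations(alignment, 2)
}

for pair, d in distances.items():
    print(sorted(pair), round(d, 3))
```

A model-corrected distance could be substituted for `p_distance` without changing the rest of the sketch.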
Finished NJ tree
Models of Evolution Pyrimidines: T, C. Purines: A, G. Transitions: changes within a class (T <-> C, A <-> G). Transversions: changes between a purine and a pyrimidine.
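Classifying substitutions: the purine/pyrimidine distinction can be turned into a small classifier. A minimal sketch, assuming unambiguous A, C, G, T characters only; the function names are hypothetical, and the example pair is Sp1 and Sp2 from the Estimate a Phylogeny slide.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def classify(x, y):
    """Label a pair of bases as identical, a transition, or a transversion."""
    for base in (x, y):
        if base not in PURINES | PYRIMIDINES:
            raise ValueError(f"unsupported character: {base!r}")
    if x == y:
        return "identical"
    # A transition stays within one chemical class; a transversion crosses classes.
    same_class = {x, y} <= PURINES or {x, y} <= PYRIMIDINES
    return "transition" if same_class else "transversion"


def count_ts_tv(a, b):
    """Return (transitions, transversions) between two aligned sequences."""
    labels = [classify(x, y) for x, y in zip(a, b)]
    return labels.count("transition"), labels.count("transversion")


ts, tv = count_ts_tv("ACCGTCTTGTTA", "AGCGTCATCAAA")  # Sp1 vs Sp2
print("transitions:", ts, "transversions:", tv)
```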
Maximum Likelihood Base frequencies: fA + fG + fC + fT = 1. Base exchange: fs + fv = 1. R-matrix: the six rate parameters sum to 1. Gamma shape parameter. Number of discrete gamma-distribution categories. Pinvar: fvar + finv = 1. Likelihood: $L = \prod_i l_i$, where $i$ indexes the characters.
Maximum Likelihood L = Pr(D|H) [Figure: example tree with tip states C, G, G, A, G, internal nodes w, x, y, z, and branch lengths t1 to t8] The likelihood is not the probability that the tree is the true tree; rather, it is the probability that the tree has given rise to the data we collected. Likelihood requires three elements. What are they? We've talked about two, the data and the tree (the hypothesis); the third is the model of evolution.
ML cont. The probability that the nucleotide at time t is i is given by: The probability that the nucleotide at time t is j, j ≠ i, is given by:
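Worked example: the formulas on this slide are not preserved in the text, and the slide does not name a model. As an illustration only, the sketch below assumes the Jukes-Cantor model, in which every specific change occurs at the same rate alpha. Under that assumption, the probability that a site starting as i is still i after time t is 1/4 + (3/4) exp(-4 alpha t), and the probability that it is a particular different base j is 1/4 - (1/4) exp(-4 alpha t). The function names and the parameter values are hypothetical.

```python
import math


def jc69_same(alpha, t):
    """P(base at time t is i | it started as i) under Jukes-Cantor (assumed model)."""
    return 0.25 + 0.75 * math.exp(-4.0 * alpha * t)


def jc69_diff(alpha, t):
    """P(base at time t is one specific j != i | it started as i) under Jukes-Cantor."""
    return 0.25 - 0.25 * math.exp(-4.0 * alpha * t)


alpha = 0.1  # hypothetical rate per specific change
for t in (0.0, 1.0, 10.0, 100.0):
    same = jc69_same(alpha, t)
    diff = jc69_diff(alpha, t)
    # The four outcomes (stay as i, or become one of three other bases) sum to 1.
    print(t, round(same, 4), round(diff, 4), round(same + 3 * diff, 4))
```

As t grows, both probabilities approach 1/4, so a long branch carries almost no information about the starting base.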
Bayes Theorem: $\mathrm{Prob}(H \mid D) = \dfrac{\mathrm{Prob}(H)\,\mathrm{Prob}(D \mid H)}{\mathrm{Prob}(D)}$, where H = hypothesis and D = data. $\mathrm{Prob}(H \mid D)$: the conditional probability of H given D (posterior probability). $\mathrm{Prob}(H)$: prior or marginal probability of H. $\mathrm{Prob}(D \mid H)$: likelihood function. $\mathrm{Prob}(D)$: prior or marginal probability of D, $\sum_H P(H)\,P(D \mid H)$; the normalizing constant, which ensures $\sum_H P(H \mid D) = 1$.
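Worked example: Bayes' theorem for a small discrete set of hypotheses. A minimal sketch in which the three tree labels, the equal priors, and the likelihood values are all made-up numbers chosen for illustration; the function name is hypothetical.

```python
def posterior(priors, likelihoods):
    """Return Prob(H|D) for each hypothesis H from Prob(H) and Prob(D|H)."""
    joint = {h: priors[h] * likelihoods[h] for h in priors}
    # Prob(D): the normalizing constant, summed over all hypotheses.
    evidence = sum(joint.values())
    return {h: value / evidence for h, value in joint.items()}


# Hypothetical inputs: three candidate trees, equal priors, invented likelihoods.
priors = {"tree1": 1 / 3, "tree2": 1 / 3, "tree3": 1 / 3}
likelihoods = {"tree1": 0.008, "tree2": 0.001, "tree3": 0.001}

post = posterior(priors, likelihoods)
for tree, p in post.items():
    print(tree, round(p, 3))
```

The likelihoods themselves do not sum to 1 across trees; only after dividing by Prob(D) do the values become probabilities of the hypotheses.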
Take Home Message Likelihood: represents the probability of the data given the hypothesis => difficult to interpret. Bayes approach: estimates the probability of the hypothesis given the data => gives a probability for the hypothesis of interest.
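Bayesian Inference of Phylogeny $f(\tau_i \mid X) = \dfrac{f(\tau_i)\, f(X \mid \tau_i)}{\sum_{j=1}^{B(s)} f(\tau_j)\, f(X \mid \tau_j)}$, where $\tau_i$ is the $i$-th tree topology, $X$ is the data, and $B(s)$ is the number of possible trees for $s$ species. Calculating the posterior probability of a tree involves a summation over all possible trees and, for each tree, integration over all combinations of branch lengths ($\upsilon$) and substitution-model parameter values ($\theta$): $f(\tau_i, \upsilon_i, \theta \mid X) = \dfrac{f(\tau_i, \upsilon_i, \theta)\, f(X \mid \tau_i, \upsilon_i, \theta)}{\sum_{j=1}^{B(s)} \int_{\upsilon} \int_{\theta} f(\tau_j, \upsilon_j, \theta)\, f(X \mid \tau_j, \upsilon_j, \theta)\, d\upsilon\, d\theta}$. Inferences of any single parameter are based on the marginal distribution of the parameter: $f(\tau_i \mid X) = \dfrac{\int_{\upsilon} \int_{\theta} f(\tau_i, \upsilon_i, \theta)\, f(X \mid \tau_i, \upsilon_i, \theta)\, d\upsilon\, d\theta}{\sum_{j=1}^{B(s)} \int_{\upsilon} \int_{\theta} f(\tau_j, \upsilon_j, \theta)\, f(X \mid \tau_j, \upsilon_j, \theta)\, d\upsilon\, d\theta}$. This marginal probability distribution of the topology, for example, integrates out all the other parameters. Advantage: the power of the analysis is focused on the parameter of interest (i.e., the topology of the tree).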
Estimating phylogenies Exhaustive Searches Branch and bound methods Rise in computational time versus rise in solution space
How many topologies are there? When we add a species to a tree, the number of ways in which we can do so is equal to the number of branches, including the branch at the bottom of the tree. There are 3 such branches in a two-species tree. Every time we add a new species, it adds a new interior node plus two new branches. Thus, after choosing one of the 3 possible places to add the third species, the fourth can be added in any of 5 places, the fifth in any of 7, and so on.
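Worked example: the counting argument above gives the product 3 x 5 x 7 x ... x (2n - 3) for n species. A minimal sketch of that product for rooted, bifurcating, labeled trees; the function name is hypothetical.

```python
def rooted_topologies(n_species):
    """Count rooted bifurcating labeled trees by multiplying the insertion choices."""
    if n_species < 2:
        raise ValueError("need at least two species")
    count = 1  # exactly one rooted tree for two species
    for k in range(3, n_species + 1):
        # The k-th species can attach to any of the 2k - 3 existing branches.
        count *= 2 * k - 3
    return count


for n in (2, 3, 4, 5, 10, 20):
    print(n, rooted_topologies(n))
```

For 10 species this already gives 34,459,425 trees, which is why exhaustive searches stop being practical so quickly.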
The Phylogenetic Problem
HIV-1 Whole Genomes 1993 - 15 HIV-1 Whole Genomes 2003 (JAN) - 397 The two trees represent complete HIV-1 genomes (limited to those with over 7000 bp sequenced) from the Los Alamos National Labs HIV database (http://hiv-web.lanl.gov/ ). The sparse tree represents those genomes sequenced in 1993 or earlier (determined by the sequence submission date to GenBank, not the publication date, since several seemed to be cobbled together from multiple sources). There were 15 genomes by 1993, mostly from subtype B with a few from subtype D, and a final alignment of 8097 characters. The dense tree represents 397 complete HIV-1 genomes, the current complement of genomes available. The search of the Los Alamos database returned 416 genomes, but a few were deleted during the alignment process because of stretches of questionable sequence. The final alignment length was 8583 characters. Both trees are color coded by subtype and major groups of recombinants. Both trees were constructed using Neighbor Joining, with the results of Modeltest providing the model of evolution for tree construction.
Tree Space - the final frontier
Heuristic Searches Nearest-neighbor interchanges (NNI) - swap two adjacent branches on the tree. Subtree pruning and regrafting (SPR) - remove a branch from the tree (either an interior or an exterior branch) with a subtree attached to it; the subtree is then reinserted into the remaining tree in all possible places. Tree bisection and reconnection (TBR) - an interior branch is broken, and the two resulting fragments of the tree are considered as separate trees; all possible connections are made between a branch of one and a branch of the other.
Other approaches Tree-fusing - find two near optimal trees and exchange subgroups between the two trees Genetic Algorithms - a simulation of evolution with a genotype that describes the tree and a fitness function that reflects the optimality of the tree Disc Covering - upcoming paper
Phylogenetic Accuracy? Consistency - A phylogenetic method is consistent for a given evolutionary model if the method converges on the correct tree as the data available to the method become infinite. Efficiency - Statistical efficiency is a measure of how quickly a method converges on the correct solution as more data are applied to the problem. Robustness - Robustness refers to the degree to which violations of assumptions will affect the performance of phylogenetic methods. All methods are consistent when their assumptions (explicit and implicit) are met, and all methods are inconsistent when these assumptions are violated sufficiently. In the case of phylogenetic methods, efficiency may be measured in terms of the number of characters required to find the correct solution at a given frequency or in terms of the frequency of correct solutions at a given sample size. All methods are based on explicit and/or implicit assumptions about the evolutionary process, and yet we know these assumptions are violated to one degree or another in real data.
How reliable is MY phylogeny? Bootstrap Analysis Jackknife Analysis Posterior Probabilities (Bayesian Approaches) Decay Indices
Bootstrap
Pseudoreplicates
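Generating pseudoreplicates: in a bootstrap analysis, each pseudoreplicate is built by sampling alignment columns with replacement until the replicate is as long as the original alignment. A minimal sketch of that resampling step only, assuming the four sequences from the Estimate a Phylogeny slide; the function name, the seed, and the number of replicates are hypothetical, and the tree estimation that would follow each replicate is not shown.

```python
import random

# Sequences copied from the Estimate a Phylogeny slide.
alignment = {
    "Sp1": "ACCGTCTTGTTA",
    "Sp2": "AGCGTCATCAAA",
    "Sp4": "ACCGTCTTGATA",
    "Sp5": "AGCCTCTTCATA",
}


def bootstrap_replicate(aln, rng):
    """Resample columns with replacement; every taxon keeps the same column draw."""
    n_sites = len(next(iter(aln.values())))
    columns = [rng.randrange(n_sites) for _ in range(n_sites)]
    return {name: "".join(seq[c] for c in columns) for name, seq in aln.items()}


rng = random.Random(2003)  # fixed seed so the sketch is repeatable
replicates = [bootstrap_replicate(alignment, rng) for _ in range(3)]
for i, rep in enumerate(replicates, start=1):
    print("pseudoreplicate", i)
    for name, seq in rep.items():
        print(" ", name, seq)
```

Support for a grouping would then be the fraction of pseudoreplicates whose estimated tree contains it.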