IE68 - Biological databases Phylogenetic analysis
Phylogenetic analysis
Phylogeny: a reconstruction of the evolutionary (genealogical) history of a group of organisms, genes, or proteins from biological data
- organisms: populations, species, genera, ... => taxa => operational taxonomic units (OTUs)
- data: molecular, morphological, archaeological, ... => characters
Phylogenetic tree: the graphical reconstruction of a phylogeny
- tree structure: phylogram, cladogram
IE68 - biological databases - phylogeny
Phylogenetic tree
A tree consists of nodes connected by branches
- terminal nodes (A, B, C, D, E) => OTUs for which we have data
- root (placed by outgroup or midpoint) => ancestor of all the taxa that comprise the tree
- polytomy: a node with more than two descendant branches, e.g. (C,D,E)
- notation: ((A,B),(C,D,E))
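The bracket notation can be read mechanically. The following is a minimal sketch, assuming a bare string with taxon names, commas, and parentheses only (no branch lengths, no trailing semicolon); `parse_newick` is a hypothetical helper name, not part of any package mentioned in this lecture.

```python
def parse_newick(text):
    """Parse a bare bracket string such as '((A,B),(C,D,E))' into nested tuples."""
    pos = 0

    def parse_node():
        nonlocal pos
        if text[pos] == "(":
            pos += 1  # skip '('
            children = [parse_node()]
            while text[pos] == ",":
                pos += 1  # skip ','
                children.append(parse_node())
            pos += 1  # skip ')'
            return tuple(children)
        # Otherwise read a taxon name up to the next delimiter.
        start = pos
        while pos < len(text) and text[pos] not in ",()":
            pos += 1
        return text[start:pos]

    return parse_node()


tree = parse_newick("((A,B),(C,D,E))")
# The second child has three descendants: the polytomy (C,D,E).
```

Each tuple is an internal node; a tuple with more than two entries is a polytomy.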
Phylogenetics <> Phenetics
- Phenetics: method of grouping taxa based on overall (dis)similarity of characters => with no reference to evolution!
- Phylogenetics: method of grouping taxa based on shared derived characters (synapomorphies) or on a model of evolution
Why do we need phylogenies?
Intrinsic interest in the tree itself => tree of life, origin of organisms
Phylogenies can also be used as tools for investigating other problems
- e.g. biogeography: the phylogeny reflects the order of separation of the areas that the different taxa occupy
- e.g. forensic science
Phylogenetic analysis
Molecular phylogenetics: reconstruction of the evolutionary (genealogical) history of a group of organisms from molecular data, i.e. DNA or protein sequences
In this lecture, we focus on phylogenetic analysis of organisms based on DNA sequence data
Molecular phylogenetics: approach
Step 1: PCR with primers that target cytoplasmic DNA or nuclear loci of the taxa, followed by DNA sequence analysis
Step 2: Multiple DNA sequence alignment
Step 3: Phylogenetic analysis
PCR and DNA sequencing
Which loci? Criteria: available DNA sequence information and primers, variability, single or low copy number, orthology, neutrality, recombination, ...
Gene trees versus organismal trees: phylogenies for genes do not always match those of their corresponding organisms => analyse more than one gene
Confounding influence of gene duplication
2 types of homology: orthology (genes separated by speciation) and paralogy (genes separated by gene duplication)
Lineage sorting and coalescence
[Figure: allele lineages within a species tree]
Multiple DNA sequence alignment
Problem: alternative alignments. It is possible to align any two sequences by postulating some combination of gaps (insertions/deletions = indels) and substitutions => which alignment to choose?
Basic task of sequence alignment: find the alignment with the highest similarity, smallest distance, or lowest overall cost
Multiple DNA sequence alignment
2 sequences + scoring scheme => optimal alignment
Scoring scheme:
- scoring matrix: distance weights or similarity scores for each pair of aligned bases, e.g. transition/transversion matrix
        A  T  G  C
     A  0  5  1  5
     T  5  0  5  1
     G  1  5  0  5
     C  5  1  5  0
- gap weight, cost, or penalty
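How "2 sequences + scoring scheme => optimal alignment" works can be sketched with dynamic programming over the transition/transversion matrix above. This is only an illustration: the gap cost of 3 per gap position and the function name `min_alignment_cost` are assumptions, and real alignment programs use more elaborate gap models.

```python
# Distance weights from the transition/transversion matrix on the slide.
COST = {
    "A": {"A": 0, "T": 5, "G": 1, "C": 5},
    "T": {"A": 5, "T": 0, "G": 5, "C": 1},
    "G": {"A": 1, "T": 5, "G": 0, "C": 5},
    "C": {"A": 5, "T": 1, "G": 5, "C": 0},
}


def min_alignment_cost(a, b, gap=3):
    """Lowest total cost of a global alignment of a and b (gap cost is an assumption)."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        d[i][0] = i * gap  # a[:i] aligned against gaps only
    for j in range(1, cols):
        d[0][j] = j * gap  # b[:j] aligned against gaps only
    for i in range(1, rows):
        for j in range(1, cols):
            d[i][j] = min(
                d[i - 1][j - 1] + COST[a[i - 1]][b[j - 1]],  # align two bases
                d[i - 1][j] + gap,                           # gap in b
                d[i][j - 1] + gap,                           # gap in a
            )
    return d[-1][-1]


cost = min_alignment_cost("AGCGTCT", "AGCGTGT")
```

With these weights a single C/G transversion (cost 5) is cheaper than opening two gap positions (cost 6), so the optimal alignment keeps the sequences gap-free.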
Multiple DNA sequence alignment
Cost of an alignment: D = s + wg
- s = number of substitutions
- g = total length of gaps
- w = gap penalty = cost of a gap relative to a substitution
The gap penalty w makes implicit assumptions about how the sequences have evolved: if indels are thought to be rare, then w should be large (and vice versa)
=> have to use knowledge of biology, e.g. translation (3 bp indels, codon position), transitions <> transversions, ...
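The cost D = s + wg can be evaluated directly for a given alignment. A minimal sketch, assuming gaps are written as "-" and both rows have equal length; the two example rows and the function name are illustrative only.

```python
def alignment_cost(row_a, row_b, w):
    """Return D = s + w*g for two aligned rows (gaps written as '-')."""
    s = 0  # number of substitutions
    g = 0  # total length of gaps
    for x, y in zip(row_a, row_b):
        if x == "-" or y == "-":
            g += 1
        elif x != y:
            s += 1
    return s + w * g


# Two substitutions and one gap position.
cheap_gaps = alignment_cost("AGCGTCT", "AG-GAGT", w=2)
costly_gaps = alignment_cost("AGCGTCT", "AG-GAGT", w=5)
```

Raising w from 2 to 5 raises the cost of this alignment from 4 to 7, which is how a large w steers the choice toward alignments with few indels.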
Multiple DNA sequence alignment
Software programs: e.g. CLUSTALW (global alignment) http://www.ebi.ac.uk/clustalw/index.html
The optimal alignment is not always the true alignment => new developments: phylogenetic analysis without the multiple DNA sequence alignment step
Inferring phylogenies from DNA sequences
Sequence alignment (rows = taxa, columns = characters):
   A ..AGCGTCT..
   B ..AGCGTGT..
   C ..AG-GAGT..
=> phylogenetic methods => unrooted tree or rooted tree of A, B, and C
Phylogenetic methods
Methods based on an explicit model of evolution
- character-based: Maximum-likelihood methods
- non character-based: Pairwise distance methods
Methods not based on an explicit model of evolution
- character-based: Maximum parsimony methods
Pairwise distance methods
Example: 3 taxa, 3 sequences
1. Dissimilarity matrix: count the number of differences between all possible pairs of sequences
          1     2
     2  0.26
     3  0.20  0.33
2. Convert dissimilarity to evolutionary distance by correcting for multiple events per site according to a certain model of evolution
          1     2
     2  0.32
     3  0.23  0.44
3. Infer the tree topology from the evolutionary distances by using a clustering algorithm or an optimality criterion => tree
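The first step, the dissimilarity matrix, can be sketched as the proportion of differing sites for every pair of aligned sequences. The three short sequences below and the choice to skip gapped columns are assumptions made for the illustration; they are not the data behind the matrices above.

```python
from itertools import combinations


def p_distance(a, b):
    """Proportion of differing sites, ignoring columns where either row has a gap."""
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    return sum(x != y for x, y in pairs) / len(pairs)


def dissimilarity_matrix(alignment):
    """Map each unordered pair of taxa to its p-distance."""
    return {
        (s, t): p_distance(alignment[s], alignment[t])
        for s, t in combinations(sorted(alignment), 2)
    }


alignment = {"A": "AGCGTCT", "B": "AGCGTGT", "C": "AG-GAGT"}
matrix = dissimilarity_matrix(alignment)
```

The resulting values are uncorrected; the second step replaces each of them with a model-based distance.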
Models of sequence evolution
[Figure: expected differences (linear) versus observed differences (not linear); the gap between the two curves => correction]
Apply a substitution model that tries to estimate the correct number of substitutions
Models of sequence evolution
Distance "correction" methods: convert observed distances into measures that correspond to ACTUAL distances
- several methods have been proposed, all with different assumptions about the nature of the evolutionary process
- essentially they differ in the number of parameters they include
- we can use a general framework to show how these models are inter-related
Substitution models: general framework
e.g. Model of Jukes & Cantor (JC)
- one of the first models proposed, and perhaps the simplest model of evolution
- assumes that all four bases have equal frequency and that all substitutions are equally likely
- under this model, the distance between any two sequences is given by $d = -\frac{3}{4}\ln\left(1 - \frac{4}{3}p\right)$, where p is the proportion of nucleotides that differ between the two sequences
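The JC correction is a one-line computation. A minimal sketch; the function name and the explicit range check are additions for the illustration (the formula is undefined once p reaches 3/4).

```python
import math


def jukes_cantor(p):
    """JC distance d = -(3/4) ln(1 - (4/3) p) for an observed proportion p of differing sites."""
    if not 0 <= p < 0.75:
        raise ValueError("p must satisfy 0 <= p < 0.75")
    return -0.75 * math.log(1 - 4 * p / 3)


d_26 = jukes_cantor(0.26)
d_20 = jukes_cantor(0.20)
```

The corrected distance is always at least as large as p, and the correction grows as p increases, which reflects multiple substitutions hitting the same site.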
e.g. Kimura 2-parameter model (K2P)
- incorporates the observation that transitions accumulate more rapidly than transversions
- assumes that all four bases have equal frequencies but that there are 2 rate classes for substitutions
- under this model, the distance between any two sequences is given by $d = \frac{1}{2}\ln\frac{1}{1-2P-Q} + \frac{1}{4}\ln\frac{1}{1-2Q}$, where P and Q are the proportions of sites that differ by transitions and transversions, respectively
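For K2P, the differences first have to be split into transitions (A<>G, C<>T) and transversions. A minimal sketch, assuming ungapped aligned sequences of equal length; the helper names and the toy sequences are illustrative.

```python
import math

PURINES = {"A", "G"}


def transition_transversion_proportions(a, b):
    """Return (P, Q): proportions of sites differing by a transition or a transversion."""
    transitions = transversions = 0
    for x, y in zip(a, b):
        if x == y:
            continue
        if (x in PURINES) == (y in PURINES):
            transitions += 1    # purine<>purine or pyrimidine<>pyrimidine
        else:
            transversions += 1  # purine<>pyrimidine
    return transitions / len(a), transversions / len(a)


def k2p(P, Q):
    """K2P distance d = 1/2 ln[1/(1-2P-Q)] + 1/4 ln[1/(1-2Q)]."""
    return 0.5 * math.log(1 / (1 - 2 * P - Q)) + 0.25 * math.log(1 / (1 - 2 * Q))


P, Q = transition_transversion_proportions("AAAAAAAAAA", "GAAAAAAAAC")
d = k2p(P, Q)
```

Here one site differs by a transition (A<>G) and one by a transversion (A<>C), so P = Q = 0.1.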
Substitution models
Other models: adding more parameters
- Felsenstein model (F81): variation in base composition => base frequencies f = [A C G T] may vary
- Hasegawa-Kishino-Yano (HKY) model: unequal base frequencies, different transition and transversion rates
- General reversible model (REV): unequal base frequencies, all six pairs of substitutions have different rates
=> ideally, we want the simplest model we can get away with that still yields a reasonable estimate
Substitution models
Assumptions of these models:
- all nucleotide sites change independently
- base composition is at equilibrium
- the substitution rate is constant over time and in different lineages
- each site in a sequence is equally likely to undergo substitution
=> the last assumption can be relaxed with a gamma distribution, which has a parameter that specifies the range of rate variation among sites: model + Γ
Pairwise distance methods
Example: 3 taxa, 3 sequences
1. Dissimilarity matrix: count the number of differences between all possible pairs of sequences
          1     2
     2  0.26
     3  0.20  0.33
2. Convert dissimilarity to evolutionary distance by correcting for multiple events per site according to a certain model of evolution
          1     2
     2  0.32
     3  0.23  0.44
3. Infer the tree topology from the evolutionary distances by using a clustering algorithm => tree
Clustering methods
Clustering methods follow a set of steps (an algorithm) and arrive at a tree
- UPGMA (Unweighted Pair Group Method using Arithmetic Averages): results in a rooted and additive tree with a molecular clock
- Neighbor-joining: results in an unrooted and additive tree
- Other approaches: least-squares, Fitch, Kitsch, ...
UPGMA clustering
Distance matrix:
          A   B   C
     B    2
     C    4   4
     D    6   6   6
Step 1: A and B show the least differences (d = 2) => join A and B, each with branch length 1
Compute new distances between (AB) and the other OTUs:
     d(AB)C = (dAC + dBC) / 2 = 4
     d(AB)D = (dAD + dBD) / 2 = 6
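UPGMA clustering
Reduced distance matrix:
          AB  C
     C    4
     D    6   6
Step 2: (AB) and C are closest (d = 4) => join them at depth 2: branch length 2 for C, 1 for the (AB) node
Compute new distances between (ABC) and the other OTUs, averaging over all OTUs in the cluster:
     d(ABC)D = (2 d(AB)D + dCD) / 3 = 6 (the simple mean (d(AB)D + dCD) / 2 gives the same value here)
Step 3: join (ABC) and D at depth 3: branch length 3 for D, 1 for the (ABC) node
Final tree: (((A:1,B:1):1,C:2):1,D:3)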
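The two UPGMA slides can be reproduced with a short sketch. It assumes the distances are supplied as a dictionary keyed by unordered pairs, averages new distances weighted by cluster size, and returns the list of merges with their depths; the data structure and the function name are choices made for this illustration, not a reference implementation.

```python
def upgma(dist):
    """dist maps frozenset({x, y}) to the distance between clusters x and y.

    Returns a list of (merged_cluster, depth) in the order the joins are made.
    """
    d = dict(dist)
    size = {c: 1 for pair in d for c in pair}  # number of OTUs per cluster
    merges = []
    while len(size) > 1:
        pair = min(d, key=d.get)            # closest pair of clusters
        x, y = sorted(pair, key=str)
        new = (x, y)
        merges.append((new, d[pair] / 2))   # node depth = half the distance
        for other in size:
            if other not in pair:
                # Average over all OTUs: weight each cluster by its size.
                d[frozenset((new, other))] = (
                    size[x] * d[frozenset((x, other))]
                    + size[y] * d[frozenset((y, other))]
                ) / (size[x] + size[y])
        # Drop every distance that still refers to x or y.
        d = {p: v for p, v in d.items() if x not in p and y not in p}
        size[new] = size.pop(x) + size.pop(y)
    return merges


pairs = {("A", "B"): 2, ("A", "C"): 4, ("B", "C"): 4,
         ("A", "D"): 6, ("B", "D"): 6, ("C", "D"): 6}
merges = upgma({frozenset(p): v for p, v in pairs.items()})
```

For the example matrix, the joins happen at depths 1, 2, and 3, matching the branch lengths on the slides. Ties between equally close pairs are broken arbitrarily in this sketch.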
Clustering methods
UPGMA
- assumes additive and ultrametric distances => assumes a molecular clock
- => very sensitive to unequal rates of evolution! => check with a relative-rate test
- use other clustering methods for phylogeny, e.g. neighbor-joining
"Goodness of fit" statistics: used to select the metric tree that best accounts for the observed distances
Pairwise distance methods
Example: 3 taxa, 3 sequences
1. Dissimilarity matrix: count the number of differences between all possible pairs of sequences
          1     2
     2  0.26
     3  0.20  0.33
2. Convert dissimilarity to evolutionary distance by correcting for multiple events per site according to a certain model of evolution
          1     2
     2  0.32
     3  0.23  0.44
3. Infer the tree topology from the evolutionary distances by using an optimality criterion => tree
Minimum evolution
Distance matrix => unrooted metric trees
- each tree has a length L, which is the sum of all its branch lengths
- optimality criterion: the minimum evolution (ME) tree is the tree that minimizes L
Pairwise distance methods
Advantages
- very fast
- based on a model of evolution
Disadvantages
- sequence information is reduced to one number
- branch lengths may not be biologically interpretable
- most methods provide only one tree topology
- dependent on the model of evolution used
Character-based methods
Character-based (discrete) methods operate directly on the sequences, rather than on pairwise distances
Two major discrete methods:
- Maximum parsimony (MP): chooses the tree(s) that require the fewest evolutionary changes
- Maximum likelihood (ML): chooses the tree(s) most likely to have produced the observed data
Maximum parsimony
Maximum parsimony infers a phylogenetic tree by minimizing the total number of evolutionary steps
Principle:
- investigate all possible tree topologies
- reconstruct ancestral sequences
- choose the topology with the smallest number of steps
Maximum parsimony - principle
For 4 taxa there are 3 possible tree topologies:
- tree A: ((1,2),(3,4))
- tree B: ((1,3),(2,4))
- tree C: ((1,4),(2,3))
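Counting steps on each of the three topologies can be sketched with Fitch's small-parsimony algorithm, applied site by site with equal weights. The four-site alignment below is invented for the illustration, and rooting each unrooted topology at its central branch is a convenience that does not change the step count.

```python
def fitch(tree, states):
    """Return (possible ancestral states, steps) for one site on a nested-tuple binary tree."""
    if isinstance(tree, str):
        return {states[tree]}, 0
    left, right = tree
    left_set, left_steps = fitch(left, states)
    right_set, right_steps = fitch(right, states)
    common = left_set & right_set
    if common:
        return common, left_steps + right_steps
    # No shared state: one extra substitution is needed at this node.
    return left_set | right_set, left_steps + right_steps + 1


def tree_length(tree, alignment):
    """Total number of steps over all sites (equal weights)."""
    n_sites = len(next(iter(alignment.values())))
    return sum(
        fitch(tree, {taxon: seq[i] for taxon, seq in alignment.items()})[1]
        for i in range(n_sites)
    )


alignment = {"1": "AAGA", "2": "AACA", "3": "GAGG", "4": "GCCG"}  # hypothetical data
topologies = {
    "A": (("1", "2"), ("3", "4")),
    "B": (("1", "3"), ("2", "4")),
    "C": (("1", "4"), ("2", "3")),
}
lengths = {name: tree_length(tree, alignment) for name, tree in topologies.items()}
best = min(lengths, key=lengths.get)
```

For this toy alignment, tree A needs the fewest steps and is therefore the most parsimonious of the three.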
Maximum parsimony - generalized
In the previous example, the cost of each substitution was "one step" => equal weights
Instead, we can use different costs for different types of change (e.g. transitions vs transversions) to better match our assumptions about evolutionary processes => weighted parsimony
Parsimony variants: Dollo, Wagner, Fitch, ...
Maximum parsimony - characters
Maximum parsimony - search methods
Number of tree topologies for n sequences: $N_u = \frac{(2n-5)!}{2^{n-3}\,(n-3)!}$
e.g. 3 sequences ~ 1 tree, 4 ~ 3 trees, 5 ~ 15, 6 ~ 105, ...
=> the more sequences (~ taxa), the more trees => computationally expensive
Finding optimal trees:
- Exhaustive search: limited number of taxa (<10); finds the minimum tree among all possible trees
- Branch and bound: small number of taxa (<18); finds the minimum tree without evaluating all trees, by discarding during tree construction families of trees that cannot be shorter than the shortest tree found so far
- Heuristic search: large number of taxa
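The growth in the number of topologies is easy to tabulate from the formula. A minimal sketch; the function name is an assumption.

```python
from math import factorial


def n_unrooted_trees(n):
    """Number of unrooted bifurcating topologies: (2n-5)! / (2^(n-3) * (n-3)!)."""
    if n < 3:
        raise ValueError("the formula applies to n >= 3 sequences")
    return factorial(2 * n - 5) // (2 ** (n - 3) * factorial(n - 3))


counts = [n_unrooted_trees(n) for n in range(3, 7)]
ten_taxa = n_unrooted_trees(10)
```

With 10 taxa there are already more than two million topologies, which is why exhaustive search is limited to small datasets.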
Maximum parsimony - search methods
- Heuristic search: explores a subset of all possible trees by stepwise addition of taxa plus a rearrangement process (branch swapping); not guaranteed to find the minimal tree, because the search can end in a local optimum instead of the global optimum
Maximum parsimony - output
Consensus tree: MP can yield multiple equally parsimonious (optimal) trees => the relationships common to the optimal trees are summarized with a consensus tree
- Strict consensus: includes only splits found in all trees
- Majority-rule consensus: includes splits found in the majority of the trees (> 50%)
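Both consensus rules reduce to counting how often each split occurs. A minimal sketch, assuming every tree is already represented as a set of groupings (each grouping a frozenset of taxon names); the three example trees are invented.

```python
from collections import Counter


def strict_consensus(trees):
    """Splits present in every tree."""
    return set.intersection(*(set(tree) for tree in trees))


def majority_rule_consensus(trees, threshold=0.5):
    """Splits present in more than `threshold` of the trees."""
    counts = Counter(split for tree in trees for split in tree)
    return {split for split, n in counts.items() if n / len(trees) > threshold}


AB, ABC, ABD = frozenset("AB"), frozenset("ABC"), frozenset("ABD")
trees = [{AB, ABC}, {AB, ABD}, {AB, ABC}]  # hypothetical equally parsimonious trees
strict = strict_consensus(trees)
majority = majority_rule_consensus(trees)
```

The grouping (A,B) appears in all three trees and survives both rules; (A,B,C) appears in two of three and survives only the majority rule.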
Maximum parsimony - output
Consistency index (CI) and retention index (RI)
- measures of the parsimony fit of a character to a tree, or of the average fit of all characters to a tree
- more specifically: indices of how much homoplasy the constructed tree contains
- values range from 0 to 1; higher value => less homoplasy
Parsimony - branch support and tree stability
Bootstrap analysis
- a resampling technique used to measure sampling error; gives an idea of the reliability of branches and clusters
- original dataset => resample => construct trees => compare trees to the original tree
- >70%: quite confident of that part of the tree topology
Decay index (Bremer support)
- gives a sense of how many extra steps are required before a grouping collapses
- higher value => better branch support
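The resampling step of the bootstrap draws alignment columns with replacement. The sketch below shows that step and, as a deliberately simplified stand-in for "construct trees", scores how often a chosen pair of taxa is the uniquely closest pair in the replicates. A real analysis would rebuild a full tree per replicate; the function names, the seed, and the toy alignment are assumptions.

```python
import random


def bootstrap_replicate(alignment, rng):
    """Resample alignment columns with replacement, keeping the original length."""
    n_sites = len(next(iter(alignment.values())))
    columns = [rng.randrange(n_sites) for _ in range(n_sites)]
    return {taxon: "".join(seq[i] for i in columns) for taxon, seq in alignment.items()}


def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))


def pair_support(alignment, pair, replicates=200, seed=1):
    """Fraction of replicates in which `pair` is the uniquely closest pair of taxa."""
    rng = random.Random(seed)
    taxa = sorted(alignment)
    hits = 0
    for _ in range(replicates):
        rep = bootstrap_replicate(alignment, rng)
        dists = {
            frozenset((s, t)): hamming(rep[s], rep[t])
            for i, s in enumerate(taxa)
            for t in taxa[i + 1:]
        }
        best = min(dists.values())
        if dists[frozenset(pair)] == best and list(dists.values()).count(best) == 1:
            hits += 1
    return hits / replicates


toy = {"A": "AAAAAAAA", "B": "AAAAAAAA", "C": "GGGGGGGG"}  # hypothetical data
support = pair_support(toy, ("A", "B"))
```

In this extreme toy case A and B are identical at every site, so every replicate groups them and the support is 100%.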
Maximum parsimony
Advantages
- based on shared derived characters
- evaluates different tree topologies
- does not reduce the sequence information
Disadvantages
- computationally intensive for large datasets
- no correction for multiple mutations
- sensitive to unequal rates of evolution (long branch attraction)
Maximum-likelihood methods
Phylogenetic methods
Methods based on an explicit model of evolution
- character-based: Maximum-likelihood methods
- non character-based: Pairwise distance methods
Methods not based on an explicit model of evolution
- character-based: Maximum parsimony methods
Maximum likelihood
Statistical method
Given some data D and a hypothesis H, the likelihood of the data is $L_D = \Pr(D \mid H)$, i.e. the probability of D given H
Maximum likelihood
In the context of molecular phylogenetics:
- D is the set of sequences being compared
- H is a phylogenetic tree
We want to find the likelihood of obtaining the observed data given the tree
The tree that makes the data the most probable evolutionary outcome is the maximum likelihood estimate of the phylogeny
Maximum likelihood
In other words: which tree is most likely to have yielded these sequences (observed data) under a given model of evolution (JC, K2P, ...)?
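The smallest possible case makes the idea concrete: two sequences joined by a single branch of length d under the JC model. The sketch evaluates the log-likelihood of the data for a grid of candidate d values and keeps the best one. The site counts (100 sites, 20 differences), the grid, and the function names are assumptions chosen for the illustration.

```python
import math


def jc_log_likelihood(n_sites, n_diff, d):
    """Log Pr(D | branch length d) for two sequences under Jukes-Cantor."""
    e = math.exp(-4 * d / 3)
    p_same = 0.25 + 0.75 * e  # probability that a site shows the same base
    p_diff = 0.25 - 0.25 * e  # probability of one specific different base
    # Each site also contributes the equal base frequency 1/4.
    return (n_sites - n_diff) * math.log(0.25 * p_same) + n_diff * math.log(0.25 * p_diff)


def ml_branch_length(n_sites, n_diff, step=0.001, upper=1.0):
    """Grid search for the branch length with the highest likelihood."""
    grid = [step * k for k in range(1, int(upper / step) + 1)]
    return max(grid, key=lambda d: jc_log_likelihood(n_sites, n_diff, d))


d_hat = ml_branch_length(n_sites=100, n_diff=20)
```

The grid maximum lies close to the JC distance for p = 0.20 (about 0.23), so for two sequences the likelihood estimate and the distance correction agree. With more taxa, the same calculation is repeated over candidate trees and all of their branch lengths, which explains why the method is slow.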
Maximum likelihood
Advantages
- statistically well founded
- based on a model of evolution
- evaluates different topologies
- uses all sequence information
- often yields estimates that have lower variance than other methods
Disadvantages
- very slow (computationally intensive)
- dependent on the model of evolution used
Software programs for phylogenetic analysis
Overview: http://evolution.genetics.washington.edu/phylip/software.html
Most widely used software programs:
- PHYLIP: freely available (downloadable, or online at http://bioweb.pasteur.fr/seqanal/phylogeny/phylip-uk.html)
- PAUP: user-friendly but not freely available
Phylogenetic information on the internet
http://tolweb.org/tree/phylogeny.html
http://www.treebase.org/treebase/
....
If you need more information
Jacqueline Vander Stappen
K.U.Leuven, Laboratory of Gene Technology
Kasteelpark Arenberg 21, B-3001 Leuven
Jacqueline.vanderstappen@agr.kuleuven.ac.be