IE68 - Biological databases Phylogenetic analysis

Phylogenetic analysis
- Phylogeny: a reconstruction of the evolutionary (genealogical) history of a group of organisms, genes or proteins from biological data
- organisms: populations, species, genera, ... => taxa => operational taxonomic units (OTUs)
- data: molecular, morphological, archaeological, ... => characters
- Phylogenetic tree: the graphical reconstruction of a phylogeny
- tree structure: phylogram, cladogram
IE68 - biological databases - phylogeny

Phylogenetic tree
- A tree consists of nodes connected by branches
- Terminal nodes A, B, C, D, E => OTUs for which we have data
- Root (placed by outgroup or midpoint rooting) => ancestor of all the taxa that comprise the tree
- Polytomy: a node with more than two descendant branches, e.g. (C,D,E)
- Notation: ((A,B),(C,D,E))
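The bracket notation can be read mechanically. The following is a minimal sketch, not part of the original slides, that parses a label-only string such as ((A,B),(C,D,E)) into nested tuples and reports polytomies; the function names `parse_newick`, `leaves` and `polytomies` are illustrative, and branch lengths are assumed to be absent.

```python
def parse_newick(text):
    """Parse a label-only bracket string into nested tuples of taxon names."""
    text = text.replace(" ", "").rstrip(";")
    pos = 0

    def parse_node():
        nonlocal pos
        if text[pos] == "(":
            pos += 1  # skip "("
            children = [parse_node()]
            while text[pos] == ",":
                pos += 1  # skip ","
                children.append(parse_node())
            if text[pos] != ")":
                raise ValueError("expected ')' at position %d" % pos)
            pos += 1  # skip ")"
            return tuple(children)
        start = pos
        while pos < len(text) and text[pos] not in "(),":
            pos += 1
        return text[start:pos]

    return parse_node()


def leaves(tree):
    """List the terminal nodes (OTUs) from left to right."""
    if isinstance(tree, str):
        return [tree]
    return [name for child in tree for name in leaves(child)]


def polytomies(tree):
    """Return every internal node with more than two descendant branches."""
    if isinstance(tree, str):
        return []
    found = [tree] if len(tree) > 2 else []
    for child in tree:
        found.extend(polytomies(child))
    return found


tree = parse_newick("((A,B),(C,D,E))")
print(leaves(tree))
print(polytomies(tree))
```

Nested tuples are used only because they keep the sketch short; a real analysis would normally rely on an established tree library.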

Phylogenetics versus phenetics
- Phenetics: method of grouping taxa based on overall (dis)similarity of characters => with no reference to evolution!
- Phylogenetics: method of grouping taxa based on shared derived characters (synapomorphies) or on a model of evolution

Why do we need phylogenies?
- Intrinsic interest in the tree => tree of life, origin of organisms

Why do we need phylogenies?
- Phylogenies can also be used as tools for investigating other problems, e.g. biogeography: the phylogeny reflects the order of separation of the areas that the different taxa occupy

Why do we need phylogenies?
- Phylogenies can also be used as tools for investigating other problems, e.g. forensic science


Phylogenetic analysis
- Molecular phylogenetics: reconstruction of the evolutionary (genealogical) history of a group of organisms from molecular data, i.e. DNA or protein sequences
- In this lecture, we will focus on phylogenetic analysis of organisms based on DNA sequence data

Molecular phylogenetics: approach
- Step 1: PCR with primers that target cytoplasmic DNA or nuclear loci of taxa, followed by DNA sequence analysis
- Step 2: Multiple DNA sequence alignment
- Step 3: Phylogenetic analysis

PCR and DNA sequencing
- Which loci? DNA sequence information, primers, variability, single or low-copy, orthologous, neutral, recombination, ...
- Gene trees versus organismal trees: phylogenies for genes do not always match those of their corresponding organisms => analyse more than one gene

Confounding influence of gene duplication
- 2 types of homology: orthology (speciation) and paralogy (gene duplication)

Lineage sorting and coalescence (figure labels: species, alleles)


Multiple DNA sequence alignment
- Problem: alternative alignments. It is possible to align any two sequences by postulating some combination of gaps (insertions/deletions = indels) and substitutions => which one to choose?
- The basic task of sequence alignment is to find the alignment with the highest similarity, smallest distance, or lowest overall cost

Multiple DNA sequence alignment
2 sequences + scoring scheme => optimal alignment
Scoring scheme:
- scoring matrix: distance weights or similarity scores for each pair of aligned bases, e.g. a transition-transversion matrix:
      A  T  G  C
   A  0  5  1  5
   T  5  0  5  1
   G  1  5  0  5
   C  5  1  5  0
- gap weight, cost or penalty

Multiple DNA sequence alignment
Cost of an alignment: D = s + w*g
- s = number of substitutions, g = total length of gaps
- w = gap penalty = cost of a gap relative to a substitution
- The gap penalty w makes implicit assumptions about how the sequences have evolved: if indels are thought to be rare, then w should be large (and vice versa)
=> have to use knowledge of biology, e.g. translation (3 bp indel, position), transition versus transversion, ...
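As an illustration of the cost D = s + w*g, here is a minimal sketch that scores an alignment already written as two gapped strings. The function name `alignment_cost` and the example sequences are assumptions made for this example; a column containing a gap in either sequence is counted as one unit of gap length.

```python
def alignment_cost(seq_a, seq_b, gap_penalty):
    """Return D = s + w*g for two aligned (gapped) sequences of equal length."""
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have the same length")
    substitutions = 0  # s: columns where two different bases are aligned
    gap_length = 0     # g: columns where at least one sequence has a gap
    for a, b in zip(seq_a, seq_b):
        if a == "-" or b == "-":
            gap_length += 1
        elif a != b:
            substitutions += 1
    return substitutions + gap_penalty * gap_length


# Two substitutions and one gap position; with w = 2 the cost is 2 + 2*1.
print(alignment_cost("AGCGTCT", "AG-GAGT", gap_penalty=2))
```

Raising `gap_penalty` makes alignments with gaps more expensive, which corresponds to assuming that indels are rare.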

Multiple DNA sequence alignment
- Software programs: e.g. CLUSTALW (global alignment) http://www.ebi.ac.uk/clustalw/index.html
- The optimal alignment is not always the true alignment => new developments: phylogenetic analysis without the multiple DNA sequence alignment step


Inferring phylogenies from DNA sequences
Sequence alignment (rows = taxa, columns = characters):
   A  ..AGCGTCT..
   B  ..AGCGTGT..
   C  ..AG-GAGT..
Phylogenetic methods turn the alignment into an unrooted or a rooted tree of A, B and C

Phylogenetic methods
                                                 Character-based methods      Non character-based methods
Methods based on an explicit model of evolution  Maximum-likelihood methods   Pairwise distance methods
Methods not based on an explicit model           Maximum parsimony methods    -

Pairwise distance methods
Example: 3 taxa, 3 sequences
1. Dissimilarity matrix: count the number of differences between all possible pairs of sequences
      1     2     3
   1
   2  0.26
   3  0.20  0.33
2. Convert dissimilarity to evolutionary distance by correcting for multiple events per site according to a certain model of evolution
      1     2     3
   1
   2  0.32
   3  0.23  0.44
3. Infer the tree topology on the basis of the evolutionary distances by using a clustering algorithm or an optimality criterion => tree
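The first step, counting differences between all pairs, can be sketched as follows. This is an illustrative example rather than the lecture's own code: the names `p_distance` and `dissimilarity_matrix` are assumptions, the three sequences are the alignment fragments A, B and C shown earlier in the slides, and columns with a gap are simply skipped (one of several possible conventions).

```python
from itertools import combinations


def p_distance(seq_a, seq_b):
    """Proportion of differing sites among the columns without a gap."""
    compared = 0
    differences = 0
    for a, b in zip(seq_a, seq_b):
        if a == "-" or b == "-":
            continue  # assumption: gapped columns are ignored for this pair
        compared += 1
        if a != b:
            differences += 1
    if compared == 0:
        raise ValueError("no comparable sites")
    return differences / compared


def dissimilarity_matrix(alignment):
    """Map every pair of taxon names to its uncorrected p-distance."""
    return {
        (x, y): p_distance(alignment[x], alignment[y])
        for x, y in combinations(sorted(alignment), 2)
    }


alignment = {"A": "AGCGTCT", "B": "AGCGTGT", "C": "AG-GAGT"}
for pair, distance in dissimilarity_matrix(alignment).items():
    print(pair, round(distance, 3))
```

The resulting values are uncorrected dissimilarities; the correction of step 2 is applied to them afterwards.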

Models of sequence evolution
- expected difference (linear) ≠ observed difference (not linear) => correction
- Apply a substitution model that tries to estimate the correct number of substitutions

Models of sequence evolution
- Distance “correction” methods: convert observed distances into measures that correspond to the ACTUAL distance
- Several methods have been proposed, all with different assumptions about the nature of the evolutionary process
- Essentially they differ in the number of parameters they include
- We can use a general framework to show how these models are inter-related

Substitution models: general framework


e.g. Model of Jukes & Cantor (JC)
- One of the first proposed, and perhaps the simplest, model of evolution
- Assumes that all four bases have equal frequency and that all substitutions are equally likely
- Under this model, the distance between any two sequences is given by d = -(3/4) ln(1 - (4/3) p), where p is the proportion of nucleotides that differ between the two sequences
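The Jukes-Cantor correction is a one-line formula, sketched below. The function name `jukes_cantor_distance` is an assumption for this example; the guard reflects the fact that the logarithm is only defined for p below 3/4.

```python
import math


def jukes_cantor_distance(p):
    """Jukes-Cantor distance d = -(3/4) ln(1 - (4/3) p) for a proportion p of differing sites."""
    if not 0 <= p < 0.75:
        raise ValueError("p must satisfy 0 <= p < 0.75")
    return -0.75 * math.log(1 - 4 * p / 3)


for p in (0.05, 0.20, 0.26, 0.50):
    print(p, round(jukes_cantor_distance(p), 3))
```

For small p the corrected distance is close to p itself; as p grows, the correction becomes increasingly large, which is the non-linearity mentioned in the slides.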

e.g. Kimura 2-parameter model (K2P)
- Incorporates the observation that transitions accumulate more rapidly than transversions
- Assumes that all four bases have equal frequencies but that there are 2 rate classes for substitutions
- Under this model, the distance between any two sequences is given by d = (1/2) ln[1/(1 - 2P - Q)] + (1/4) ln[1/(1 - 2Q)], where P and Q are the proportional differences between the two sequences due to transitions and transversions, respectively
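A minimal sketch of the K2P distance follows. It first classifies each differing site as a transition (A<->G or C<->T) or a transversion, then applies the formula above. The function names and the example sequences are hypothetical, and sites containing anything other than A, C, G or T are skipped as an assumption of this sketch.

```python
import math

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def count_changes(seq_a, seq_b):
    """Return (transitions, transversions, compared sites) for two aligned sequences."""
    transitions = transversions = sites = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a not in "ACGT" or b not in "ACGT":
            continue  # skip gaps and ambiguous symbols
        sites += 1
        if a == b:
            continue
        if {a, b} <= PURINES or {a, b} <= PYRIMIDINES:
            transitions += 1
        else:
            transversions += 1
    return transitions, transversions, sites


def kimura_2p_distance(seq_a, seq_b):
    """K2P distance from the proportions P (transitions) and Q (transversions)."""
    transitions, transversions, sites = count_changes(seq_a, seq_b)
    if sites == 0:
        raise ValueError("no comparable sites")
    P = transitions / sites
    Q = transversions / sites
    if 1 - 2 * P - Q <= 0 or 1 - 2 * Q <= 0:
        raise ValueError("sequences too divergent for the K2P correction")
    return 0.5 * math.log(1 / (1 - 2 * P - Q)) + 0.25 * math.log(1 / (1 - 2 * Q))


# One transition (A/G) and one transversion (A/C) in ten sites.
print(round(kimura_2p_distance("AAAAAAAAAA", "GAAAAAAAAC"), 4))
```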

Substitution models
Other models: adding more parameters
- Felsenstein model (F81): variation in base composition => base frequencies f = [A C G T] may vary
- Hasegawa-Kishino-Yano (HKY) model: unequal base frequencies, transition/transversion
- General reversible model (REV, also known as GTR): unequal base frequencies, all six pairs of substitutions have different rates
=> ideally, we want the simplest model we can get away with that still yields a reasonable estimate

Substitution models
Assumptions of these models:
- all nucleotide sites change independently
- base composition is at equilibrium
- the substitution rate is constant over time and in different lineages
- each site in a sequence is equally likely to undergo substitution
=> to relax the last assumption, a gamma distribution is used, with a parameter that specifies the range of rate variation among sites: model + Γ

Pairwise distance methods (3 taxa, 3 sequences)
1. Dissimilarity matrix: count the number of differences between all possible pairs of sequences
2. Convert dissimilarity to evolutionary distance by correcting for multiple events per site according to a certain model of evolution
3. Infer the tree topology on the basis of the evolutionary distances by using a clustering algorithm

Clustering methods
- Clustering methods follow a set of steps (an algorithm) and arrive at a tree
- UPGMA (Unweighted Pair Group Method using Arithmetic Averages): results in a rooted and additive tree with a molecular clock
- Neighbor-joining: results in an unrooted and additive tree
- Other approaches: least-squares, Fitch, Kitsch, ...

UPGMA clustering
Distance matrix:
      A  B  C
   B  2
   C  4  4
   D  6  6  6
- A and B show the least differences (d = 2) => join A and B, each on a branch of length 1
- Compute new distances between (AB) and the other OTUs:
  d(AB)C = (dAC + dBC) / 2 = 4
  d(AB)D = (dAD + dBD) / 2 = 6

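UPGMA clustering
Reduced distance matrix:
      AB  C
   C  4
   D  6   6
- (AB) and C are now closest (d = 4) => join C to (AB) at depth 2: branches of length 1 to A and to B, 1 from the (AB) node to the (ABC) node, and 2 to C
- Compute new distances between (ABC) and the other OTUs:
  d(ABC)D = (d(AB)D + dCD) / 2 = 6
  (strictly, UPGMA weights each cluster by its number of OTUs: (2 d(AB)D + dCD) / 3, which is also 6 here)
- Join D to (ABC) at depth 3: a branch of length 3 to D and 1 from the (ABC) node to the root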
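The worked example can be reproduced with a short UPGMA sketch. This is an illustrative implementation under stated assumptions, not the lecture's code: the function name `upgma` and the data layout are invented for the example, new distances are averaged with weights equal to cluster sizes, and ties are broken by the order in which clusters were created.

```python
from itertools import combinations


def upgma(labels, pair_distances):
    """Cluster OTUs by UPGMA; return the nested-tuple tree and the list of (subtree, node depth)."""
    clusters = {i: (label, 1) for i, label in enumerate(labels)}  # id -> (subtree, size)
    index = {label: i for i, label in enumerate(labels)}
    dist = {frozenset((index[a], index[b])): d for (a, b), d in pair_distances.items()}
    merges = []
    next_id = len(labels)
    while len(clusters) > 1:
        # Pick the closest pair of active clusters.
        i, j = min(combinations(sorted(clusters), 2), key=lambda pair: dist[frozenset(pair)])
        d_ij = dist[frozenset((i, j))]
        tree_i, n_i = clusters.pop(i)
        tree_j, n_j = clusters.pop(j)
        # Size-weighted average distance from the merged cluster to every other cluster.
        for k in clusters:
            d_ik = dist[frozenset((i, k))]
            d_jk = dist[frozenset((j, k))]
            dist[frozenset((next_id, k))] = (n_i * d_ik + n_j * d_jk) / (n_i + n_j)
        clusters[next_id] = ((tree_i, tree_j), n_i + n_j)
        merges.append(((tree_i, tree_j), d_ij / 2))  # the node sits at half the distance
        next_id += 1
    tree, _ = next(iter(clusters.values()))
    return tree, merges


distances = {
    ("A", "B"): 2,
    ("A", "C"): 4, ("B", "C"): 4,
    ("A", "D"): 6, ("B", "D"): 6, ("C", "D"): 6,
}
tree, merges = upgma(["A", "B", "C", "D"], distances)
print(tree)
for subtree, depth in merges:
    print(subtree, depth)
```

With the matrix from the slides, the three joins occur at depths 1, 2 and 3, matching the branch lengths in the worked example.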

Clustering methods
- UPGMA: additive and ultrametric distances => assumes a molecular clock => very sensitive to unequal rates of evolution! => relative-rate test
- Use other clustering methods for phylogeny, e.g. Neighbor-joining
- “Goodness of fit” statistics: to select the metric tree that best accounts for the observed distances

Pairwise distance methods (3 taxa, 3 sequences)
1. Dissimilarity matrix: count the number of differences between all possible pairs of sequences
2. Convert dissimilarity to evolutionary distance by correcting for multiple events per site according to a certain model of evolution
3. Infer the tree topology on the basis of the evolutionary distances by using an optimality criterion

Minimum evolution
- Distance matrix => unrooted metric trees
- Each tree has a length L, which is the sum of all the branch lengths
- Optimality criterion: the minimum evolution (ME) tree is the tree that minimizes L

Pairwise distance methods
Advantages:
- very fast
- based on a model of evolution
Disadvantages:
- sequence information is reduced to one number
- branch lengths may not be biologically interpretable
- most methods provide only one tree topology
- dependent on the model of evolution used


Character-based methods
- Character-based (discrete) methods operate directly on sequences, rather than on pairwise distances
- Two major discrete methods:
  - Maximum parsimony (MP): chooses the tree(s) that require the fewest evolutionary changes
  - Maximum likelihood (ML): chooses the tree(s) most likely to have produced the observed data

Maximum parsimony
Maximum parsimony infers a phylogenetic tree by minimizing the total number of evolutionary steps
Principle:
- Investigate all possible tree topologies
- Reconstruct ancestral sequences
- Choose the topology with the smallest number of steps
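The principle can be sketched for four taxa, where only three unrooted topologies exist. The example below counts steps with Fitch's algorithm, which is one standard way to obtain the minimum number of changes per site and is not named on this slide; the function names and the four short sequences are hypothetical. Each unrooted tree is written as a rooted nested tuple, which does not change the step count.

```python
def fitch_site(tree, states):
    """Return (possible ancestral states, minimum steps) for one site on a binary tree."""
    if isinstance(tree, str):
        return {states[tree]}, 0
    left_set, left_steps = fitch_site(tree[0], states)
    right_set, right_steps = fitch_site(tree[1], states)
    common = left_set & right_set
    if common:
        return common, left_steps + right_steps
    # No shared state: one extra change is needed at this node.
    return left_set | right_set, left_steps + right_steps + 1


def parsimony_length(tree, alignment):
    """Sum the minimum number of steps over all sites of the alignment."""
    n_sites = len(next(iter(alignment.values())))
    total = 0
    for site in range(n_sites):
        states = {taxon: seq[site] for taxon, seq in alignment.items()}
        total += fitch_site(tree, states)[1]
    return total


# Hypothetical data for four taxa named "1" to "4".
alignment = {"1": "AGA", "2": "AGT", "3": "GCA", "4": "GCT"}
topologies = [
    (("1", "2"), ("3", "4")),
    (("1", "3"), ("2", "4")),
    (("1", "4"), ("2", "3")),
]
for topology in topologies:
    print(topology, parsimony_length(topology, alignment))
print("most parsimonious:", min(topologies, key=lambda t: parsimony_length(t, alignment)))
```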

Maximum parsimony - principle
For four taxa (1, 2, 3, 4) there are 3 possible unrooted tree topologies (trees A, B and C in the figure): ((1,2),(3,4)), ((1,3),(2,4)) and ((1,4),(2,3))


Maximum parsimony - generalized
- In the previous example, the cost of each substitution was “one step” => equal weight
- Instead, we can use different costs for different types of change (e.g. transitions vs transversions) to better match our assumptions about evolutionary processes
=> weighted parsimony according to Dollo, Wagner, Fitch, ...

Maximum parsimony - characters

Maximum parsimony – search methods
Number of unrooted tree topologies for n sequences: Nu = (2n-5)! / [2^(n-3) (n-3)!]
e.g. 3 sequences ~ 1 tree, 4 ~ 3 trees, 5 ~ 15, 6 ~ 105 => the more sequences (~ taxa), the more trees => computationally expensive
Finding optimal trees:
- Exhaustive search: limited number of taxa (<10); find the minimum tree among all possible trees
- Branch and bound: small number of taxa (<18); find the minimum tree without evaluating all trees, by discarding families of trees during tree construction that cannot be shorter than the shortest tree found so far
- Heuristic search: large number of taxa
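The growth in the number of topologies is easy to check numerically. The sketch below evaluates Nu = (2n-5)! / [2^(n-3) (n-3)!] for unrooted binary trees; the function name `unrooted_tree_count` is an assumption for this example.

```python
from math import factorial


def unrooted_tree_count(n):
    """Number of unrooted binary tree topologies for n labelled sequences (n >= 3)."""
    if n < 3:
        raise ValueError("at least 3 sequences are needed")
    return factorial(2 * n - 5) // (2 ** (n - 3) * factorial(n - 3))


for n in (3, 4, 5, 6, 10, 18):
    print(n, unrooted_tree_count(n))
```

The jump between 10 and 18 taxa illustrates why exhaustive search and branch and bound are limited to small datasets.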

Maximum parsimony – search methods
- Heuristic search: explore a subset of all possible trees, by using stepwise addition of taxa plus a rearrangement process (branch swapping), but not guaranteed to find the minimal tree (the search may end in a local optimum instead of the global optimum)

Maximum parsimony - output
Consensus tree: MP can yield multiple equally parsimonious (optimal) trees => relationships common to all the optimal trees are summarized with a consensus tree
- Strict consensus: includes splits found in all trees
- Majority-rule consensus: includes splits found in the majority of the trees (> 50%)

Maximum parsimony - output
Consistency index (CI) and retention index (RI):
- measures of the parsimony fit of a character to a tree, or of the average fit of all characters to a tree
- more specifically: an index of how much homoplasy the constructed tree has
- value from 0 to 1; higher value => less homoplasy


Parsimony – branch support and tree stability
Bootstrap analysis:
- a resampling technique used to measure sampling error
- gives an idea about the reliability of branches and clusters
- original dataset => resample => construct trees => compare trees to original trees
- >70%: quite confident of tree topology
Decay index (Bremer support):
- gives us a sense of how many steps would be required before a grouping collapses
- higher value => better branch support
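The resampling step of a bootstrap analysis can be sketched as drawing alignment columns with replacement. In the example below, `bootstrap_alignment` and `bootstrap_support` are illustrative names, and `infer_groups` is a hypothetical callback standing in for whatever tree-building method is used; it is assumed to return the set of groups (as frozensets of taxon names) found in the tree built from a resampled alignment.

```python
import random


def bootstrap_alignment(alignment, rng):
    """Resample the columns of an alignment with replacement (one pseudoreplicate)."""
    n_sites = len(next(iter(alignment.values())))
    picks = [rng.randrange(n_sites) for _ in range(n_sites)]
    return {taxon: "".join(seq[i] for i in picks) for taxon, seq in alignment.items()}


def bootstrap_support(alignment, infer_groups, group, replicates=100, seed=0):
    """Fraction of pseudoreplicates whose inferred tree contains the given group."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(replicates):
        replicate = bootstrap_alignment(alignment, rng)
        if group in infer_groups(replicate):
            hits += 1
    return hits / replicates


alignment = {"A": "AGCGTCT", "B": "AGCGTGT", "C": "AGAGAGT"}
print(bootstrap_alignment(alignment, random.Random(1)))
```

Every pseudoreplicate has the same number of sites as the original alignment; some columns appear several times and others not at all.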

Maximum parsimony
Advantages:
- based on shared derived characters
- evaluates different tree topologies
- does not reduce the information
Disadvantages:
- computationally intensive for large datasets
- no correction for multiple mutations
- sensitive to unequal rates of evolution (long branch attraction)

Maximum-likelihood methods

Maximum likelihood
- Statistical method
- Given some data D and a hypothesis H, the likelihood of the data is given by LD = Pr(D|H), which is the probability of D given H

Maximum likelihood
In the context of molecular phylogenetics:
- D is the set of sequences being compared
- H is a phylogenetic tree
- We want to find the likelihood of obtaining the observed data given the tree
- The tree that makes the data the most probable evolutionary outcome is the maximum likelihood estimate of the phylogeny

Maximum likelihood
In other words: which tree is most likely to have yielded these sequences (observed data) under a given model of evolution (JC, K2P, ...)?
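The idea of Pr(D|H) can be illustrated in the simplest possible case: two sequences joined by a single branch, where the only unknown in the hypothesis is the branch length d. The sketch below assumes the Jukes-Cantor model, under which a site stays the same with probability 1/4 + (3/4)e^(-4d/3) and changes to one specific other base with probability 1/4 - (1/4)e^(-4d/3). The function names and the grid search are assumptions made for this example; real programs search over tree topologies as well.

```python
import math


def jc_log_likelihood(n_same, n_diff, d):
    """Log-likelihood of two aligned sequences separated by branch length d under JC."""
    decay = math.exp(-4.0 * d / 3.0)
    p_same = 0.25 + 0.75 * decay  # probability that a site shows the same base
    p_diff = 0.25 - 0.25 * decay  # probability of one specific different base
    log_l = (n_same + n_diff) * math.log(0.25)  # equal base frequencies at the start
    log_l += n_same * math.log(p_same)
    if n_diff:
        log_l += n_diff * math.log(p_diff)
    return log_l


def ml_distance(n_same, n_diff, step=0.001, max_d=3.0):
    """Grid search for the branch length that maximizes the likelihood."""
    candidates = [step * k for k in range(1, int(max_d / step) + 1)]
    return max(candidates, key=lambda d: jc_log_likelihood(n_same, n_diff, d))


# 100 sites, 20 of which differ (p = 0.20).
print(round(ml_distance(80, 20), 3))
```

For this two-sequence case the maximum lies at the Jukes-Cantor distance for the same p, about 0.233, which shows how the distance correction and the likelihood approach are related.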

Maximum likelihood
Advantages:
- statistically well founded
- based on a model of evolution
- evaluates different topologies
- uses all sequence information
- often yields estimates that have lower variance than other methods
Disadvantages:
- very slow (computationally intensive)
- dependent on the model of evolution used

Software programs for phylogenetic analysis
- Overview: http://evolution.genetics.washington.edu/phylip/software.html
- Most widely used software programs:
  - PHYLIP: freely available (downloadable, or online at http://bioweb.pasteur.fr/seqanal/phylogeny/phylip-uk.html)
  - PAUP: user friendly but not freely available

Phylogenetic information on the internet
- http://tolweb.org/tree/phylogeny.html
- http://www.treebase.org/treebase/
- ....

If you need more information
Jacqueline Vander Stappen
K.U.Leuven, Laboratory of Gene Technology
Kasteelpark Arenberg 21, B-3001 Leuven
Jacqueline.vanderstappen@agr.kuleuven.ac.be