An Introduction to Phylogenetic Methods

Presentation transcript:

An Introduction to Phylogenetic Methods, Part one
Dr Laura Emery
Laura.Emery@ebi.ac.uk
www.ebi.ac.uk/training

Objectives
After this tutorial you should be able to:
- Discuss a range of methods for phylogenetic inference, their advantages, assumptions and limitations
- Implement some phylogenetic methods using publicly available software
- Appreciate some approaches for assessing branch support and selecting an appropriate substitution model

We do not have enough time to cover any of these methods in great depth; to fully understand how they work you will need to do some extra reading. The point of this tutorial is to make you aware of the most commonly available methods, their assumptions and limitations, and to direct you to further reading if you want more help.

Outline
- Alignment for phylogenetics
- Phylogenetics: The general approach
- Phylogenetic Methods (1 - simple methods)
  - Distance-based methods (UPGMA, NJ)
  - Maximum Parsimony
- Assessing Branch Support
- BREAK
- Substitution Models
  - Substitution models (JC, K2P, F81, HKY85, GTR + gamma)
  - Other substitution models
- Phylogenetic Methods (2 - statistical inference)
  - Maximum likelihood
  - Bayesian inference
- Deciding which model to use (hypothesis testing)
- Software

Alignment for phylogenetics
- Phylogenetic analyses are typically applied to alignments of sequence data
- Occasionally other data, such as morphological traits, are used (e.g. when no sequence data are available)
- Alignments must contain homologous sequences
- We assume that sites in the same column of an alignment are homologous

Alignment for phylogenetics
[Figure: Benjamin Redelings]

Columns in alignments should be homologous
[Figure: Benjamin Redelings]

Phylogenetics: The general approach
We want to find the tree that best explains our aligned sequences. To do that we need to define what "best explains" means:
- we need a model of sequence evolution
- we need a criterion (or set of criteria) to choose between alternative trees
- we then either evaluate all possible trees (NB: if N = 20, there are about 2 x 10^20 possible unrooted trees!) or take a short cut
Credit: Paul Sharp
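The size of tree space quoted above can be checked with a few lines of Python. This is a sketch, assuming strictly bifurcating unrooted trees with labelled tips, for which the count is the double factorial (2N - 5)!!. The function name is illustrative and does not come from the slides.

```python
def count_unrooted_trees(n_taxa: int) -> int:
    """Number of unrooted, strictly bifurcating trees for n_taxa labelled tips.

    Assumes the standard result (2n - 5)!! = 1 * 3 * 5 * ... * (2n - 5), n >= 3.
    """
    if n_taxa < 3:
        raise ValueError("need at least 3 taxa")
    count = 1
    # Multiply the odd numbers 3, 5, ..., 2n - 5.
    for odd in range(3, 2 * n_taxa - 4, 2):
        count *= odd
    return count


for n in (5, 10, 20):
    print(n, count_unrooted_trees(n))
```

For N = 20 the count is a little over 2 x 10^20, which is why exhaustive evaluation is not practical and short cuts are needed.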

There is only one true tree
- The true tree refers to what actually happened in the evolutionary past
- All methods attempt to reconstruct the true phylogeny
- Even the best method may not give you the true tree

As a thought experiment, think of your mum, then your mum's mum, and so on back into the evolutionary past: there is a single real history. We must be mindful of this when implementing or assessing any phylogenetic inference, which is only our estimate of what happened in the past, based on the data and our beliefs about how sequences evolve. No model is a true description of the biological complexity that actually occurred. Models are simplifications and approximations that serve as useful tools for working out what happened. For some taxa the true history may never be resolved.

Methodological approaches
- Distance matrix methods (pre-computed distances)
  - UPGMA: assumes a perfect molecular clock. Sokal & Michener (1958)
  - Minimum evolution (e.g. Neighbor-joining, NJ). Saitou & Nei (1987)
- Maximum parsimony: minimises the number of mutational steps. Fitch (1971)
- Maximum likelihood (ML): evaluates the statistical likelihood of alternative trees, based on an explicit model of substitution
- Bayesian methods: like ML but can incorporate prior knowledge

We will examine these methods in this order. We will look at UPGMA in detail because it is simple and illustrates some of the pitfalls of other methods, although it is no longer commonly used. For the other methods we will give an overview: there is not time to look at exactly how they work, but we will consider their assumptions, advantages and limitations.

What is a distance matrix? A table that indicates the number of substitutions between pairs of sequences
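As a rough illustration of such a table, the sketch below counts the observed differences between every pair of aligned sequences. The toy alignment, the function names and the choice to skip gap columns are all assumptions made for this example; real analyses usually correct these raw counts with a substitution model.

```python
from itertools import combinations


def count_differences(seq_a: str, seq_b: str) -> int:
    """Count alignment columns where two aligned sequences differ.

    Columns containing a gap ('-') in either sequence are skipped (an assumption).
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(1 for a, b in zip(seq_a, seq_b) if a != b and "-" not in (a, b))


def distance_matrix(alignment: dict) -> dict:
    """Return a symmetric table {(name_a, name_b): observed differences}."""
    dist = {}
    for name_a, name_b in combinations(alignment, 2):
        d = count_differences(alignment[name_a], alignment[name_b])
        dist[(name_a, name_b)] = d
        dist[(name_b, name_a)] = d
    return dist


# Hypothetical toy alignment, not taken from the tutorial data.
toy_alignment = {
    "human":  "ACGTACGTAC",
    "monkey": "ACGTACGTAT",
    "rice":   "ACGAACCTGT",
}
toy_distances = distance_matrix(toy_alignment)
print(toy_distances[("human", "monkey")], toy_distances[("human", "rice")])
```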

Distance Matrix Methods
1. Take the alignment
2. Compute the genetic distance for all pairwise combinations of sequences
3. Put the distances into a matrix
4. Cluster together the sequences that are most similar
5. Recalculate the matrix using a single distance for the sequences you have clustered together
6. Continue until the tree is finished
Credit: Andrew Rambaut

UPGMA Method
- Identify the pair of most closely related taxa according to the pairwise genetic distance matrix
- Cluster these together
Figures: Andrew Rambaut

UPGMA Method
- Recalculate the distance matrix (calculate the distances from the new cluster to every other sequence)
- Take the average of both distances

E.g. distance[spinach, monkey/human] = (distance[spinach, human] + distance[spinach, monkey]) / 2 = (86.3 + 90.8) / 2 = 88.55
Figures: Andrew Rambaut
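The recalculation step can be sketched in Python as follows. The two spinach distances are the ones quoted on the slide; the monkey to human distance (10.0) is a made-up placeholder, and the function names are hypothetical.

```python
def pair(x, y):
    """Order-independent key for a pair of taxa."""
    return frozenset((x, y))


def merge_by_average(dist, a, b):
    """Replace taxa a and b by one cluster, averaging their distances to the rest."""
    merged = f"{a}/{b}"
    others = set().union(*dist) - {a, b}
    # Keep every distance that does not involve a or b.
    new_dist = {p: d for p, d in dist.items() if a not in p and b not in p}
    for other in others:
        new_dist[pair(other, merged)] = (dist[pair(other, a)] + dist[pair(other, b)]) / 2
    return new_dist


dist = {
    pair("spinach", "human"): 86.3,   # from the slide
    pair("spinach", "monkey"): 90.8,  # from the slide
    pair("monkey", "human"): 10.0,    # placeholder value, NOT from the slide
}
after_merge = merge_by_average(dist, "monkey", "human")
print(round(after_merge[pair("spinach", "monkey/human")], 2))
```

The plain mean used here matches the slide's example, where both merged items are single taxa. In later rounds of UPGMA the average is weighted by the number of taxa in each cluster, which this sketch does not do.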

UPGMA Method
- Repeat the procedure until the tree is finished
- Produces a rooted phylogeny
- The distance between (spi,ric) and (mos,(mon,hum)) is 108.7
Credit: Andrew Rambaut

UPGMA Method
Assumptions:
- Strict molecular clock
- Ultrametric distance data
Advantages:
- Fast and simple
Disadvantages:
- Data are almost never ultrametric
Usage:
- Almost never used

Ultrametric = genetic distance data exactly proportional to time

Neighbour Joining Method
- An improvement over UPGMA: does not require the data to be ultrametric
- At each step, identifies the topology that gives the least total branch length
- After joining two species in a cluster, we have to compute the distance from every other sequence to the new cluster. Unlike UPGMA, this is not a simple average: the distance between the two joined taxa is subtracted before halving

distance[spinach, monkey/human] = (distance[spinach, human] + distance[spinach, monkey] - distance[monkey, human]) / 2
Figures: Olivier Gascuel
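One round of neighbour joining can be sketched as below. The pair to join is chosen with the standard Q criterion, Q(i, j) = (n - 2) d(i, j) - sum of row i - sum of row j, which the slides do not spell out and is included here as an assumption about the usual formulation. The five-taxon matrix is an illustrative textbook-style example, not the tutorial data.

```python
def pair(x, y):
    """Order-independent key for a pair of taxa."""
    return frozenset((x, y))


def nj_step(names, dist):
    """Pick the pair minimising Q, then compute distances to the new cluster."""
    n = len(names)
    row_sum = {i: sum(dist[pair(i, k)] for k in names if k != i) for i in names}
    best = None
    for idx, i in enumerate(names):
        for j in names[idx + 1:]:
            q = (n - 2) * dist[pair(i, j)] - row_sum[i] - row_sum[j]
            if best is None or q < best[0]:
                best = (q, i, j)
    q, i, j = best
    # Distance from every remaining taxon k to the cluster (i, j).
    to_cluster = {
        k: (dist[pair(i, k)] + dist[pair(j, k)] - dist[pair(i, j)]) / 2
        for k in names
        if k not in (i, j)
    }
    return (i, j), q, to_cluster


names = ["a", "b", "c", "d", "e"]
rows = [
    [0, 5, 9, 9, 8],
    [5, 0, 10, 10, 9],
    [9, 10, 0, 8, 7],
    [9, 10, 8, 0, 3],
    [8, 9, 7, 3, 0],
]
dist = {pair(names[r], names[c]): rows[r][c] for r in range(5) for c in range(r + 1, 5)}
joined, q_value, new_distances = nj_step(names, dist)
print(joined, q_value, new_distances)
```

Note that d and e have the smallest raw distance (3), yet a and b are joined first. This is the practical difference from UPGMA: the criterion accounts for how far each taxon is from everything else.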

Neighbour Joining Method
Advantages:
- Allows the use of an explicit model of evolution
- Fast and simple
- Able to deal with thousands of taxa
Disadvantages:
- Only produces one tree
- Reduces all sequence information to a single distance value
- Dependent on the evolutionary model used
Usage:
- Commonly used because it is widely available in many software packages

Maximum Parsimony
The most parsimonious tree is the tree requiring the smallest number of substitutions to explain the sequences.

[Figure: alternative trees for a single site with tip states T, C and A. One tree has length = 2 and is the most parsimonious (MP, unrooted); the others have length = 3. Unknown ancestral states are marked ? and substitutions are marked *.]

The ancestral states are not observed, so each tree is scored by the smallest number of substitutions that any assignment of ancestral states would require. In practice, all possible trees need to be evaluated.
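Scoring one site on one candidate tree can be sketched with Fitch's (1971) counting procedure. The tree encoding (nested pairs of tip names), the tip names and the example states are assumptions made for this illustration. The trees are rooted arbitrarily, which does not change the parsimony length.

```python
def fitch_length(tree, states):
    """Return (possible ancestral states, minimum substitutions) for one site.

    tree:   a tip name (str) or a pair (left_subtree, right_subtree)
    states: mapping from tip name to the observed base at this site
    """
    if isinstance(tree, str):
        return {states[tree]}, 0
    left, right = tree
    left_set, left_cost = fitch_length(left, states)
    right_set, right_cost = fitch_length(right, states)
    common = left_set & right_set
    if common:
        # The children can share a state: no extra substitution is needed.
        return common, left_cost + right_cost
    # The children disagree: one substitution is required on this part of the tree.
    return left_set | right_set, left_cost + right_cost + 1


# Hypothetical site pattern for four taxa.
site = {"t1": "A", "t2": "A", "t3": "C", "t4": "C"}
candidate_trees = [
    (("t1", "t2"), ("t3", "t4")),
    (("t1", "t3"), ("t2", "t4")),
    (("t1", "t4"), ("t2", "t3")),
]
lengths = [fitch_length(tree, site)[1] for tree in candidate_trees]
print(lengths)
```

For a full alignment the lengths would be summed over all sites, and the tree with the smallest total would be the most parsimonious.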

Maximum Parsimony
Assumptions:
- Multiple substitutions are rare
Advantages:
- Fast
Disadvantages:
- Not consistent with most models of evolution
- Can result in multiple optimal trees
Usage:
- Still used with morphological data
Figures: Andrew Rambaut

The problem of multiple substitutions
[Figure: a site with states A, G, A, A and T on a tree; hidden mutations are marked *.]
- Hidden mutations are more likely to have occurred between distantly related species
- We need an explicit model of evolution to account for these (to be covered in part two)

The further back in time you go, the more likely it is that hidden mutations, which you cannot see in the sequence data, have occurred. You will never observe 100% sequence difference: because there are only four bases, the expected maximum sequence difference is 75%, even between random sequences.
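The 75% ceiling can be illustrated with the simplest substitution model, Jukes-Cantor (JC, covered in part two). Under that model, which is an assumption of this sketch and not something derived on this slide, the expected proportion of differing sites p after d substitutions per site is p = 3/4 (1 - exp(-4d/3)).

```python
import math


def expected_observed_difference(d: float) -> float:
    """Expected proportion of differing sites under Jukes-Cantor.

    d is the true number of substitutions per site separating two sequences.
    """
    return 0.75 * (1.0 - math.exp(-4.0 * d / 3.0))


for d in (0.05, 0.5, 1.0, 5.0):
    print(d, round(expected_observed_difference(d), 3))
```

For small d the observed difference is close to the true amount of change, but as d grows the curve flattens and approaches 0.75, so raw differences increasingly underestimate how much change has occurred.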

How well supported are my branches?
A tree is a collection of hypotheses, so we assess our confidence in each of its parts (branches) independently. There are three main approaches:
- Bootstraps
- Bayesian methods (probabilistic)
- Approximate likelihood ratio test (aLRT) methods (probabilistic)

[Figure: example trees with branch support shown as probabilities (0.93, 0.81, 0.99) and as percentages (85, 63, 100).]

We will look at bootstraps in more detail.

Bootstrapping
1. Take your alignment and consider each column separately
2. Resample columns with replacement to create many dummy alignments (repeat many times)
3. Use these to draw many trees and count up the occurrences of each branch among these trees
Figures: Andrew Rambaut
Felsenstein, J. 1985. Confidence limits on phylogenies: an approach using the bootstrap. Evolution 39: 783-791.
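Steps 1 and 2 can be sketched as follows. The toy alignment, function names and fixed random seed are assumptions for illustration. Step 3, building a tree from each replicate and counting how often each branch appears, depends on the tree-building method and is not shown.

```python
import random


def bootstrap_alignment(alignment, rng):
    """Resample alignment columns with replacement to make one dummy alignment."""
    names = list(alignment)
    length = len(alignment[names[0]])
    # Draw as many column indices as the alignment has columns; repeats are allowed.
    columns = [rng.randrange(length) for _ in range(length)]
    return {name: "".join(alignment[name][c] for c in columns) for name in names}


def bootstrap_replicates(alignment, n_replicates, seed=0):
    """Create n_replicates dummy alignments using a seeded random generator."""
    rng = random.Random(seed)
    return [bootstrap_alignment(alignment, rng) for _ in range(n_replicates)]


# Hypothetical toy alignment, not the tutorial data.
toy_alignment = {
    "human":  "ACGTACGTAC",
    "monkey": "ACGTACGTAT",
    "rice":   "ACGAACCTGT",
}
replicates = bootstrap_replicates(toy_alignment, n_replicates=100)
print(replicates[0])
```

The same column indices are applied to every sequence, so each resampled column is still a real column of the original alignment.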

Issues with bootstrapping
- Sites may not evolve independently
- P values are biased (too conservative)
- Calculating bootstraps for many branches results in multiple testing
- Bootstrapping does not correct biases in phylogeny methods

Nevertheless, bootstraps perform surprisingly well.

Now it's your turn…
- Open your course manuals and begin Tutorial 1
- Also available to download from: http://www.ebi.ac.uk/training/course/scuola-di-bioinformatica-2013
- You will require the alignment file 5SrRNA.txt
- There are answers available online, but it is much better to ask for help!

Thank you! www.ebi.ac.uk Twitter: @emblebi Facebook: EMBLEBI