Intro to Phylogenetic Analysis
Slides modified by David Ardell from Caro-Beth Stewart, Paul Higgs, Joe Felsenstein and Mikael Thollesson.


C-B Stewart, NHGRI lecture, 12/5/00

What is phylogenetic analysis and why should we perform it?

Phylogenetic analysis has two major components:
1. Phylogeny inference or “tree building”: inferring the evolutionary relationships between genes or species
2. Character and rate analysis: mapping information onto trees

Common Phylogenetic Tree Terminology
- Terminal nodes (A, B, C, D, E): represent the TAXA (genes, populations, species, etc.) used to infer the phylogeny
- Internal nodes: represent hypothetical ancestors of the taxa
- Ancestral node or ROOT of the tree
- Branches or lineages
- CLADE: an ancestor together with all of its descendants

X and Y are defined to be more closely related to each other than to Z if, and only if, they share a more recent common ancestor than they do with Z.
[Figure: the same tree of taxa A, B, C, D drawn with two different tip orders]

All of these rearrangements show the same evolutionary relationships between the taxa.
[Figure: rooted tree 1a of taxa A, B, C, D redrawn with several different tip orders]

Three types of trees
All show the same branching orders between taxa.
- Cladogram: groupings only; branch lengths have no meaning
- Phylogram: groupings + distance; branch lengths show evolutionary distance
- Ultrametric tree: groupings + time; branch lengths show time
[Figure: the three tree types drawn for Taxon A, Taxon B, Taxon C, Taxon D]

Similarity vs. Evolutionary Relationship
Since taxa evolve at different rates, your closest relative could be very different from you.
[Figure: tree of Taxon A, Taxon B, Taxon C, Taxon D with unequal branch lengths; one taxon is annotated “think lamprey”]
C is closer (more similar) to A, but more closely related to B.
This is why the closest BLAST hit is not necessarily the closest relative, and why you need to make trees.

Types of Similarity
Observed similarity between two entities can be due to:
- Evolutionary relationship:
  - Shared ancestral characters (‘plesiomorphies’)
  - Shared derived characters (‘synapomorphies’)
- Homoplasy (independent evolution of the same character):
  - Convergent events, parallel events, reversals
[Figure: small trees with C, G and T states at the tips illustrating each case]

A few examples of what can be inferred from phylogenetic trees built from DNA or protein sequence data:
- Which species are the closest living relatives of modern humans?
- Did the infamous Florida Dentist infect his patients with HIV?
- What were the origins of specific transposable elements?

Which species are the closest living relatives of modern humans?
[Figure: two trees of Humans, Chimpanzees, Bonobos, Gorillas and Orangutans on a time axis in MYA, contrasting the classical view with the molecular view]

Did the Florida Dentist infect his patients with HIV?
Phylogenetic tree of HIV sequences from the DENTIST, his patients, and local HIV-infected people. From Ou et al. (1992) and Page & Holmes (1998).
[Figure: tree with tips for the DENTIST, Patients A to G, and local controls 2, 3, 9 and 35]
- Yes: the HIV sequences from these patients fall within the clade of HIV sequences found in the dentist.
- No: the remaining patients carry this label in the figure.

Uses of character mapping:
- Dating adaptive evolutionary events
- Ancestral reconstruction
- Testing biological hypotheses of correlated function or change

Ex: Where geographically was the common ancestor of African apes and humans?
- Scenario A: Africa as species fountain
- Scenario B: Eurasia as ancestral homeland
Scenario B requires four fewer dispersal events.
[Figure: the two scenarios mapped onto a tree; Eurasia = black, Africa = red, with dispersal events marked]
Modified from: Stewart, C.-B. & Disotell, T.R. (1998) Current Biology 8: R

Building Trees

                 COMPUTATIONAL METHOD
DATA TYPE        Optimality criterion                 Clustering algorithm
Characters       PARSIMONY, MAXIMUM LIKELIHOOD        (none)
Distances        MINIMUM EVOLUTION, LEAST SQUARES     UPGMA, NEIGHBOR-JOINING

Types of data

Character data (taxa by characters):
  Species A   ATGGCTATTCTTATAGTACG
  Species B   ATCGCTAGTCTTATATTACA
  Species C   TTCACTAGACCTGTGGTCCA
  Species D   TTGACCAGACCTGTGGTCCG
  Species E   TTGACCAGTTCTCTAGTTCG

Distance data: pairwise distances (dissimilarities), arranged as a matrix of Species A to E against Species A to E.
  Example 1: uncorrected “p” distance
  Example 2: Kimura 2-parameter distance
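The uncorrected “p” distance is simply the proportion of aligned sites at which two sequences differ. The following is a minimal sketch that computes it for the five sequences on this slide; the function names `p_distance` and `distance_matrix` are illustrative choices, not part of the original lecture, and the code assumes the sequences are already aligned with no gaps.

```python
from itertools import combinations

# Character data from the slide: five aligned sequences of equal length.
alignment = {
    "A": "ATGGCTATTCTTATAGTACG",
    "B": "ATCGCTAGTCTTATATTACA",
    "C": "TTCACTAGACCTGTGGTCCA",
    "D": "TTGACCAGACCTGTGGTCCG",
    "E": "TTGACCAGTTCTCTAGTTCG",
}


def p_distance(seq1, seq2):
    """Proportion of aligned sites at which the two sequences differ."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must be aligned to the same length")
    differences = sum(1 for a, b in zip(seq1, seq2) if a != b)
    return differences / len(seq1)


def distance_matrix(aln):
    """Symmetric pairwise p-distance matrix, stored as a dict keyed by name pairs."""
    names = sorted(aln)
    dist = {(name, name): 0.0 for name in names}
    for x, y in combinations(names, 2):
        d = p_distance(aln[x], aln[y])
        dist[(x, y)] = dist[(y, x)] = d
    return dist


matrix = distance_matrix(alignment)
for x in sorted(alignment):
    print(x, " ".join(f"{matrix[(x, y)]:.2f}" for y in sorted(alignment)))
```

A model-based distance such as Kimura 2-parameter would replace `p_distance` with a corrected estimate while leaving the matrix-building step unchanged.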

Parsimony
- Given two trees, the one requiring the lowest number of character changes to explain the observations is the better
  - The parsimony score for a tree is the minimum number of required changes
  - This score is frequently referred to as the number of steps or tree length

Parsimony: an example
Four aligned sequences:
  acgtatgga
  acgggtgca
  aacggtgga
  aactgtgca
[Figure: two alternative trees for the four sequences, with the inferred state (a or c) of one site shown at each node]
Total tree length: 7 (first tree)
Total tree length: 8 (second tree)
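The tree length can be computed site by site with Fitch's algorithm. The sketch below is one way to do this, not necessarily how the original slide was prepared: the taxon labels `t1` to `t4` are hypothetical (the slide's labels did not survive), sequences are assigned in the order listed, and trees are written as nested pairs of leaf names. Under those assumptions, grouping the first two sequences together gives length 7 and grouping the first with the third gives length 8, matching the two totals on the slide.

```python
def fitch_site(tree, states):
    """Fitch pass for one site on a rooted binary tree.

    `tree` is either a leaf name (str) or a pair (left, right).
    Returns (possible state set at this node, number of changes below it).
    """
    if isinstance(tree, str):
        return {states[tree]}, 0
    left_set, left_cost = fitch_site(tree[0], states)
    right_set, right_cost = fitch_site(tree[1], states)
    common = left_set & right_set
    if common:
        # The children can agree on a state: no extra change is needed here.
        return common, left_cost + right_cost
    # The children disagree: one change is required on this part of the tree.
    return left_set | right_set, left_cost + right_cost + 1


def tree_length(tree, alignment):
    """Parsimony score: the sum of minimum changes over all sites."""
    n_sites = len(next(iter(alignment.values())))
    total = 0
    for i in range(n_sites):
        states = {name: seq[i] for name, seq in alignment.items()}
        total += fitch_site(tree, states)[1]
    return total


# Hypothetical labels t1..t4 for the four sequences on the slide.
alignment = {
    "t1": "acgtatgga",
    "t2": "acgggtgca",
    "t3": "aacggtgga",
    "t4": "aactgtgca",
}
tree_a = (("t1", "t2"), ("t3", "t4"))
tree_b = (("t1", "t3"), ("t2", "t4"))

print(tree_length(tree_a, alignment), tree_length(tree_b, alignment))
```

Under the parsimony criterion, `tree_a` is preferred because it needs fewer changes.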

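Using models
Models relate the observed differences between sequences to the actual changes that occurred.
Example: Jukes-Cantor, where the substitution probability between nucleotides i and j takes one form if i = j and another if i ≠ j.
[Figure: substitution diagram and matrix over the nucleotides A, C, G, T]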
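The formulas on this slide did not survive in the transcript, so the sketch below uses the standard textbook form of the Jukes-Cantor model, which may be parameterized differently from the original slide. It assumes the branch length `d` is measured in expected substitutions per site; `jc_prob` and `jc_distance` are illustrative names.

```python
import math


def jc_prob(d, same):
    """Jukes-Cantor probability of ending in state j given start state i.

    `d` is the branch length in expected substitutions per site (an assumed
    parameterization). `same` is True for i == j and False for i != j.
    """
    decay = math.exp(-4.0 * d / 3.0)
    if same:
        return 0.25 + 0.75 * decay
    return 0.25 - 0.25 * decay


def jc_distance(p):
    """Correct an observed proportion of differences `p` for multiple hits."""
    if not 0.0 <= p < 0.75:
        raise ValueError("the Jukes-Cantor correction is defined for 0 <= p < 0.75")
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


# An observed difference of 20% corresponds to a somewhat larger number of
# actual changes per site, because some changes hit the same site twice.
print(jc_distance(0.2))
```

The corrected distance always exceeds the observed difference for p > 0, which is the gap between "observed differences" and "actual changes" that the slide points at.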

Likelihood of a one-branch tree…
30 nucleotides from ψη-globin genes of two primates on a one-edge tree:

           * *
Gorilla    GAAGTCCTTGAGAAATAAACTGCACACTGG
Orangutan  GGACTCCTTGAGAAATAAACTGCACACTGG

There are two differences (marked *) and 28 similarities.
[Figure: plot of the log-likelihood lnL against the branch length t]
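The curve of lnL against t can be reproduced numerically. This is a minimal sketch that assumes the Jukes-Cantor model with equal base frequencies and t in expected substitutions per site; the original slide may have used a different model or scale, so the numbers here need not match its plot. The grid search is a deliberately simple stand-in for a proper optimizer.

```python
import math

gorilla = "GAAGTCCTTGAGAAATAAACTGCACACTGG"
orangutan = "GGACTCCTTGAGAAATAAACTGCACACTGG"


def log_likelihood(t, seq1, seq2):
    """lnL of branch length t for two aligned sequences under Jukes-Cantor."""
    decay = math.exp(-4.0 * t / 3.0)
    p_same = 0.25 + 0.75 * decay
    p_diff = 0.25 - 0.25 * decay
    lnl = 0.0
    for a, b in zip(seq1, seq2):
        # 1/4 is the probability of the starting base at each site.
        lnl += math.log(0.25) + math.log(p_same if a == b else p_diff)
    return lnl


n_diff = sum(1 for a, b in zip(gorilla, orangutan) if a != b)

# Evaluate lnL on a grid of branch lengths and keep the best one.
grid = [i / 10000 for i in range(1, 10001)]
t_hat = max(grid, key=lambda t: log_likelihood(t, gorilla, orangutan))

print(n_diff, t_hat, log_likelihood(t_hat, gorilla, orangutan))
```

With only two sequences, the maximum of this curve coincides with the Jukes-Cantor corrected distance for 2 differences in 30 sites, which is a useful sanity check on the likelihood function.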

A recipe for phylogenetic inference
- Collect your data
- Select an optimality criterion (“which tree is better?”, tree score)
- Optional: do data transformation (“corrections”)
- Select a search strategy to find the best tree
- Find the best hypothesis according to that criterion
- Assess the variation in your data in some way

Finding the best tree
- Number of (rooted) trees
  - 3 taxa -> 3 trees
  - 4 taxa -> 15 trees
  - 10 taxa -> 34,459,425 trees
  - 25 taxa -> 1.19·10^30 trees
  - 52 taxa -> 2.75·10^80 trees
- Finding the optimal tree is an NP-complete problem
- Search strategies
  - Exact
    - Exhaustive
    - Branch and bound
  - Algorithmic
    - Greedy algorithms, a.k.a. hill-climbing (including Neighbor-joining)
  - Heuristic
    - Systematic: branch-swapping (NNI, SPR, TBR)
    - Stochastic
      - Markov Chain Monte Carlo (MCMC)
      - Genetic algorithms
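The tree counts listed above follow the standard double-factorial formulas for fully resolved trees: (2N - 3)!! rooted trees and (2N - 5)!! unrooted trees for N labeled taxa. A short sketch makes the growth easy to check; the function names are illustrative.

```python
def double_factorial(k):
    """k!! = k * (k - 2) * (k - 4) * ... down to 1 (returns 1 for k <= 1)."""
    result = 1
    while k > 1:
        result *= k
        k -= 2
    return result


def n_rooted_trees(n_taxa):
    """Number of rooted, bifurcating trees on n_taxa labeled tips."""
    if n_taxa < 2:
        raise ValueError("need at least 2 taxa")
    return double_factorial(2 * n_taxa - 3)


def n_unrooted_trees(n_taxa):
    """Number of unrooted, bifurcating trees on n_taxa labeled tips."""
    if n_taxa < 3:
        raise ValueError("need at least 3 taxa")
    return double_factorial(2 * n_taxa - 5)


for n in (3, 4, 10, 25, 52):
    print(n, n_unrooted_trees(n), n_rooted_trees(n))
```

Each unrooted tree has 2N - 3 branches and can be rooted on any of them, which is why the rooted count equals the unrooted count times 2N - 3.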

“Star-Decomposition”
- Completely unresolved or "star" phylogeny
- Partially resolved phylogeny (contains a polytomy or multifurcation)
- Fully resolved, bifurcating phylogeny (every internal node is a bifurcation)
[Figure: the three phylogenies drawn for taxa A, B, C, D, E]

There are three possible unrooted trees on four taxa (A, B, C, D).
[Figure: Tree 1, Tree 2 and Tree 3, one for each way of pairing the four taxa: AB|CD, AC|BD, AD|BC]

The number of unrooted trees increases in a greater than exponential manner with the number of taxa.
(2N - 5)!! = # unrooted trees for N taxa
[Figure: unrooted trees growing from 3 taxa (A, B, C) to 6 taxa (A to F)]

What is a “good” method?
- Efficiency: time to find a/the solution
- Power: rate of convergence / how much data are needed
- Consistency: convergence to the “correct” solution as data are added
- Robustness: performance when assumptions are violated
- Falsifiability: rejection of the model when inadequate

Performance on simulated data
[Figure: frequency of correct inference plotted against sequence length for the different methods]

+ and - of the methods

Pair-wise, NJ, distance approach
  + Fast (efficiency)
  + Models can be used to make distances (can be consistent)
  - Pairwise distances throw out information (loss of power)
  - One will get a tree, but no score to compare with other trees or hypotheses

Parsimony and tree-search
  + Philosophically appealing (Occam’s razor)
  - Can be inconsistent
  - Can be computationally slow due to a huge number of possible trees

Maximum likelihood and tree-search
  + Model-based, can be consistent, powerful, gain biological info
  - Model-based, bad when you have the wrong model
  - Computationally veeeeery slow due to heavy calculations in determining the tree score and a huge number of possible trees

The quick and dirty, pretty good tree
1. Calculate model-based pairwise distances.
2. Make a Neighbor-Joining tree.
3. Do a bootstrap.

Assessing the variation
- Jackknife: resampling without replacement
- Bootstrap: resampling with replacement
  1. Resample columns from an alignment with replacement to make a simulated sample of the same size
  2. Analyze this resampled dataset in the same way as you did the original sample
  3. Repeat this 100+ times, making 100+ bootstrap trees
  4. Summarize, for example, as a majority-rule consensus tree
  5. Clades found in more than 50% of the trees will be shown; a clade needs at least 70% to be called “weakly supported”
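Step 1 of the bootstrap, resampling alignment columns with replacement, can be sketched as follows. The toy alignment is made up for illustration (only the taxon names Aus, Beus, Ceus, Deus echo the slides), and the function names and the fixed `seed` are assumptions chosen to make runs repeatable.

```python
import random


def bootstrap_alignment(alignment, rng):
    """Draw as many columns as the alignment has, with replacement."""
    names = list(alignment)
    n_sites = len(alignment[names[0]])
    columns = [rng.randrange(n_sites) for _ in range(n_sites)]
    # The same column indices are applied to every sequence, so each
    # resampled column stays intact across taxa.
    return {name: "".join(alignment[name][c] for c in columns) for name in names}


def bootstrap_replicates(alignment, n_replicates=100, seed=0):
    """Build a list of pseudo-replicate alignments of the original size."""
    rng = random.Random(seed)
    return [bootstrap_alignment(alignment, rng) for _ in range(n_replicates)]


# Made-up toy alignment, used only to exercise the resampling.
toy_alignment = {
    "Aus": "ACGTACGT",
    "Beus": "ACGTACGA",
    "Ceus": "ACCTTCGA",
    "Deus": "TCCTTCGA",
}

replicates = bootstrap_replicates(toy_alignment, n_replicates=100, seed=1)
print(replicates[0])
```

Each replicate would then be analyzed with the same method as the original data (step 2), giving one tree per replicate.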

Bootstrap
1. Original data set with n characters (taxa Aus, Beus, Ceus, Deus); original analysis, e.g. MP, ML, NJ.
2. Draw n characters randomly with replacement. Repeat m times, giving m pseudo-replicates, each with n characters.
3. Repeat the original analysis on each of the pseudo-replicate data sets.
4. Evaluate the results from the m analyses (in the figure, one clade receives 75% support).
NB! The consensus tree is not a phylogenetic hypothesis, but a way to summarize other trees, in this case bootstrapped trees.
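Evaluating the results from the m analyses amounts to counting how often each clade appears among the replicate trees. The sketch below assumes rooted trees written as nested tuples of taxon names; the four replicate trees are invented for illustration and are not results from the slides.

```python
from collections import Counter


def clade_sets(tree):
    """Leaf sets of all internal nodes except the root (which holds every taxon)."""
    found = set()

    def leaves(node):
        if isinstance(node, str):
            return frozenset([node])
        members = frozenset().union(*(leaves(child) for child in node))
        found.add(members)
        return members

    all_taxa = leaves(tree)
    found.discard(all_taxa)
    return found


def clade_support(trees):
    """Percentage of trees in which each clade occurs."""
    counts = Counter()
    for tree in trees:
        counts.update(clade_sets(tree))
    return {clade: 100.0 * n / len(trees) for clade, n in counts.items()}


# Invented replicate trees: three agree, one differs.
replicate_trees = [
    (("Aus", "Beus"), ("Ceus", "Deus")),
    (("Beus", "Aus"), ("Deus", "Ceus")),
    (("Aus", "Beus"), ("Ceus", "Deus")),
    (("Aus", "Ceus"), ("Beus", "Deus")),
]

support = clade_support(replicate_trees)
majority = {clade: pct for clade, pct in support.items() if pct > 50.0}
print(majority)
```

Keeping only clades with more than 50% support corresponds to the majority-rule summary; as the slide notes, that summary is a description of the replicate trees, not a phylogenetic hypothesis in its own right.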

Rooting
To root a tree mentally, imagine that the tree is made of string. Grab the string at the root and tug on it until the ends of the string (the taxa) fall opposite the root.
[Figure: an unrooted tree of A, B, C, D with the root position marked, and the resulting rooted tree]
Note that in this rooted tree, taxon A is no more closely related to taxon B than it is to C or D.

Now, try it again with the root at another position.
[Figure: the same unrooted tree of A, B, C, D with the root marked on a different branch, and the resulting rooted tree]
Note that in this rooted tree, taxon A is most closely related to taxon B, and together they are equally distantly related to taxa C and D.

An unrooted, four-taxon tree can be rooted in five different places.
[Figure: the unrooted tree 1 of A, B, C, D with root positions 1 to 5, giving rooted trees 1a, 1b, 1c, 1d and 1e respectively]

There are two major ways to root trees:
- Outgroup rooting: uses taxa or sequences (the “outgroup”) known to fall outside all the others (the “ingroup”). Requires prior knowledge.
- Midpoint rooting: roots the tree at the midway point between the two most distant taxa in the tree, as determined by branch lengths. Assumes clock-like evolution.
Example: d(A,D) = 18, so the midpoint lies at 18 / 2 = 9.
[Figure: tree of A, B, C, D with branch lengths and the outgroup marked]

Each unrooted tree theoretically can be rooted anywhere along any of its branches.
(2N - 3)!! = # rooted trees for N taxa
[Figure: unrooted trees for 4, 5 and 6 taxa; the number of unrooted trees multiplied by the number of branches gives the number of rooted trees]

We have arrived at a tree: can we trust it as a good hypothesis of the phylogeny?
What can go wrong?
- Sampling error
  - Assessed by, for example, the bootstrap
- Too superficial tree search
  - Remember: finding the best tree is really hard
- Systematic error (inconsistent method)
  - Tests of the adequacy of models used
  - Premeditated use of different methods
- Reality
  - A tree may be a poor model of the real history
  - Information has been lost by subsequent evolutionary changes
- “Species” vs. “gene” trees

What is wrong with this tree?
[Figure: tree of Canis, Mus and Gadus with a support value of 100]
- Negligible (within sequence) sampling error
- Tree estimated by a consistent method

The expected tree…
[Figure: a “species” tree with “gene” trees drawn inside it, following a gene duplication]

Two copies (paralogs) are present in the genomes.
[Figure: gene tree for Canis, Mus and Gadus with the paralogous and orthologous relationships marked]

What we have studied…
[Figure: tree of Canis, Gadus and Mus built from the gene copies actually sampled]
Message: specific loss patterns of paralogs can disrupt species trees if we don’t know what is a paralog and what is an ortholog.

To conclude
- Phylogenetic inference deals with historical events and information transfer through time
- Results from phylogenetic analyses are hypotheses for further testing; the true history will remain unknown
- Inference is mathematically intricate and computationally heavy, and as a result methods for phylogenetic inference are legion
- There are several pitfalls to avoid when doing the analyses and when interpreting them
- But… ignoring the shared histories can sometimes give completely bogus results in comparative studies

Phylogenetic trees diagram the evolutionary relationships between the taxa.
Example: ((A,(B,C)),(D,E))
[Figure: the same tree drawn with tips Taxon A, Taxon B, Taxon C, Taxon E, Taxon D]
There is no meaning to the spacing between the taxa, or to the order in which they appear from top to bottom.
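The parenthesized notation above can be read mechanically. The sketch below is a minimal parser for name-only strings of this kind (no branch lengths, no quoted names, little error handling), with illustrative function names. It also shows that reordering the taxa inside the parentheses leaves the set of clades unchanged.

```python
def parse_tree(text):
    """Parse a name-only parenthesized tree into nested tuples of leaf names."""
    text = text.strip().rstrip(";")
    pos = 0

    def parse_node():
        nonlocal pos
        if text[pos] == "(":
            pos += 1  # skip "("
            children = [parse_node()]
            while text[pos] == ",":
                pos += 1  # skip ","
                children.append(parse_node())
            if text[pos] != ")":
                raise ValueError(f"expected ')' at position {pos}")
            pos += 1  # skip ")"
            return tuple(children)
        start = pos
        while pos < len(text) and text[pos] not in "(),":
            pos += 1
        return text[start:pos]

    tree = parse_node()
    if pos != len(text):
        raise ValueError("unexpected trailing characters")
    return tree


def clades(tree):
    """Set of leaf sets, one for each internal node of the tree."""
    found = set()

    def leaves(node):
        if isinstance(node, str):
            return frozenset([node])
        members = frozenset().union(*(leaves(child) for child in node))
        found.add(members)
        return members

    leaves(tree)
    return found


tree = parse_tree("((A,(B,C)),(D,E))")
reordered = parse_tree("((E,D),((C,B),A))")
print(tree)
print(clades(tree) == clades(reordered))
```

The two strings describe the same relationships because they contain exactly the same clades; only the drawing order differs.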