Molecular Evolution Distance Methods Biol. Luis Delaye Facultad de Ciencias, UNAM.



Mainly a STATISTICAL problem! [Figure: panels a and b]

a) Models of sequence evolution
b) Sequence similarity
c) Estimating the number of substitutions between two sequences
d) Phylogenetic reconstruction

Evolution at the molecular level is the substitution of one allele by another. The basic forces are: mutation, genetic drift and natural selection. [Figure: allele frequency (0 to 1) against time for Allele A, Allele B and Allele C]

By this process, a DNA sequence accumulates substitutions through time. [Figure: sequences along time t: ATCGCATCC, ATTGCGTAC, TAGCGTAGG, TAACCCATG]

In the study of molecular evolution, these changes in a DNA sequence are used both for estimating the rate of molecular evolution and for reconstructing the evolutionary history.

Models of sequence evolution

Models of DNA evolution
To study the dynamics of nucleotide substitution we must make assumptions regarding the probability (p) of substitution of one nucleotide by another at the end of time interval t. [Figure: A changing to C with probability p over time interval t]

For instance, $p_{AC}$ represents the probability that a site that started with nucleotide i (A in this case) has changed to nucleotide j (C in this case) at the end of interval t.

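Models of DNA evolution using matrix theory
\[ P_t = \begin{pmatrix} p_{AA} & p_{AC} & p_{AG} & p_{AT} \\ p_{CA} & p_{CC} & p_{CG} & p_{CT} \\ p_{GA} & p_{GC} & p_{GG} & p_{GT} \\ p_{TA} & p_{TC} & p_{TG} & p_{TT} \end{pmatrix} \]
$P_t$ = substitution probability matrix
$f = [f_A \; f_C \; f_G \; f_T]$ = base composition of sequences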

The Jukes and Cantor’s One-Parameter Model [Figure: nucleotides A, G, C and T, with every substitution between them occurring at the same rate $\alpha$]

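The Jukes and Cantor’s One-Parameter Model
\[ P_t = \begin{pmatrix} * & \alpha & \alpha & \alpha \\ \alpha & * & \alpha & \alpha \\ \alpha & \alpha & * & \alpha \\ \alpha & \alpha & \alpha & * \end{pmatrix} \]
$P_t$ = substitution probability matrix (rows and columns in the order A, C, G, T)
$f = [\tfrac{1}{4} \; \tfrac{1}{4} \; \tfrac{1}{4} \; \tfrac{1}{4}]$ = base composition of sequences
* $p_{ii} = 1 - \sum_{j \ne i} p_{ij}$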
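The matrix form can be checked numerically. The following is a minimal sketch, assuming that $\alpha$ is the per-step probability of changing to each of the other three nucleotides; the function names `jc_matrix` and `evolve_composition` and the value `alpha = 0.01` are illustrative choices, not part of the slides.

```python
def jc_matrix(alpha):
    """One-step Jukes-Cantor substitution probability matrix (order A, C, G, T)."""
    if not 0.0 <= alpha <= 1.0 / 3.0:
        raise ValueError("alpha must lie in [0, 1/3] so that every entry is a probability")
    stay = 1.0 - 3.0 * alpha  # diagonal: p_ii = 1 - sum of the off-diagonal entries
    return [[stay if i == j else alpha for j in range(4)] for i in range(4)]


def evolve_composition(f, P):
    """Row vector f times matrix P: base composition after one time step."""
    return [sum(f[i] * P[i][j] for i in range(4)) for j in range(4)]


P = jc_matrix(0.01)                       # illustrative value of alpha
f = [0.25, 0.25, 0.25, 0.25]              # equal base composition
f_next = evolve_composition(f, P)         # stays at 1/4 each under this model
only_a = evolve_composition([1.0, 0.0, 0.0, 0.0], P)  # start from a site that is A
```

Each row of `P` sums to 1, and the equal base composition is left unchanged by one step of the process, which is what the model assumes.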

The Jukes and Cantor’s One-Parameter Model
What is the probability of having an A at a site in a DNA sequence at time t = 1, if the site started with an A at time t = 0?
$p_{A(0)} = 1$, since we started with A.
$p_{A(1)} = 1 - 3\alpha$, the probability that the nucleotide has remained unchanged.

The Jukes and Cantor’s One-Parameter Model
What is the probability of having an A at a site in a DNA sequence at time t = 2? There are two scenarios (after Li, 1997):
Scenario 1 (no substitution, then no substitution): A at t = 0, A at t = 1, A at t = 2, with probability $(1 - 3\alpha)\, p_{A(1)}$.
Scenario 2 (substitution, then substitution): A at t = 0, not A at t = 1, A at t = 2, with probability $\alpha\, [1 - p_{A(1)}]$.
The probabilities of the two scenarios are added.

The Jukes and Cantor’s One-Parameter Model
What is the probability of having an A at a site in a DNA sequence at time t = 2?
\[ p_{A(2)} = (1 - 3\alpha)\, p_{A(1)} + \alpha\, [1 - p_{A(1)}] \]
The first term is the probability of no change: $p_{A(1)}$ is the probability of not having a substitution from t = 0 to t = 1, and $(1 - 3\alpha)$ is the probability of not having a substitution from t = 1 to t = 2.
The second term is the probability of reversible change: $[1 - p_{A(1)}]$ is the probability of having a substitution from A to not A from t = 0 to t = 1, and $\alpha$ is the probability of having a substitution from not A to A from t = 1 to t = 2.

The Jukes and Cantor’s One-Parameter Model
The following recurrence equation holds for any t:
\[ p_{A(t+1)} = (1 - 3\alpha)\, p_{A(t)} + \alpha\, [1 - p_{A(t)}] \]

The Jukes and Cantor’s One-Parameter Model
Rewriting this equation in terms of the amount of change:
\[ p_{A(t+1)} - p_{A(t)} = (1 - 3\alpha)\, p_{A(t)} + \alpha\, [1 - p_{A(t)}] - p_{A(t)} \]
Doing some algebra:
\[ p_{A(t+1)} - p_{A(t)} = p_{A(t)} - 3\alpha\, p_{A(t)} + \alpha\, [1 - p_{A(t)}] - p_{A(t)} \]
\[ \Delta p_{A(t)} = -3\alpha\, p_{A(t)} + \alpha\, [1 - p_{A(t)}] \]
\[ \Delta p_{A(t)} = -4\alpha\, p_{A(t)} + \alpha \]

The Jukes and Cantor’s One-Parameter Model
Rewriting this equation for a continuous time model:
\[ \frac{d\, p_{A(t)}}{dt} = -4\alpha\, p_{A(t)} + \alpha \]
The solution is given by:
\[ p_{A(t)} = \tfrac{1}{4} + \left( p_{A(0)} - \tfrac{1}{4} \right) e^{-4\alpha t} \]

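The Jukes and Cantor’s One-Parameter Model
Since we started with A, $p_{A(0)} = 1$:
\[ p_{A(t)} = \tfrac{1}{4} + \left( 1 - \tfrac{1}{4} \right) e^{-4\alpha t} = \tfrac{1}{4} + \tfrac{3}{4} e^{-4\alpha t} \]
And if we start with non-A, $p_{A(0)} = 0$:
\[ p_{A(t)} = \tfrac{1}{4} + \left( 0 - \tfrac{1}{4} \right) e^{-4\alpha t} = \tfrac{1}{4} - \tfrac{1}{4} e^{-4\alpha t} \]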

The Jukes and Cantor’s One-Parameter Model
We can write the equations in a more explicit form.
The probability of initially having A, and still having A at time t, is:
\[ p_{AA(t)} = \tfrac{1}{4} + \tfrac{3}{4} e^{-4\alpha t} \]
The probability of initially having G, and then having A at time t, is:
\[ p_{GA(t)} = \tfrac{1}{4} - \tfrac{1}{4} e^{-4\alpha t} \]

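The Jukes and Cantor’s One-Parameter Model
And since all nucleotides are equivalent under the JC model, $p_{GA(t)} = p_{CA(t)} = p_{TA(t)}$. In general:
\[ p_{ii(t)} = \tfrac{1}{4} + \tfrac{3}{4} e^{-4\alpha t} \]
\[ p_{ij(t)} = \tfrac{1}{4} - \tfrac{1}{4} e^{-4\alpha t} \quad \text{where } i \ne j \]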
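As a rough check of the derivation, the sketch below evaluates the closed-form Jukes-Cantor probabilities and compares them with the discrete recurrence $p_{A(t+1)} = (1 - 3\alpha) p_{A(t)} + \alpha [1 - p_{A(t)}]$ iterated from $p_{A(0)} = 1$. The function names and the parameter values in the usage lines are illustrative assumptions; for small $\alpha$ the two results should agree closely.

```python
import math


def jc_transition_probs(alpha, t):
    """Return (p_ii, p_ij) under the Jukes-Cantor model after time t."""
    decay = math.exp(-4.0 * alpha * t)
    p_same = 0.25 + 0.75 * decay   # probability of ending with the starting nucleotide
    p_diff = 0.25 - 0.25 * decay   # probability of ending with one specific other nucleotide
    return p_same, p_diff


def iterate_recurrence(alpha, steps):
    """Iterate the discrete recurrence for p_A, starting from p_A(0) = 1."""
    p = 1.0
    for _ in range(steps):
        p = (1.0 - 3.0 * alpha) * p + alpha * (1.0 - p)
    return p


# alpha = 5e-9 substitutions/site/year over 50 million years (illustrative values)
p_same, p_diff = jc_transition_probs(5e-9, 5e7)

# discrete recurrence against the continuous-time solution for a small alpha
discrete = iterate_recurrence(1e-6, 1000)
continuous, _ = jc_transition_probs(1e-6, 1000)
```

As t grows, both `p_same` and `p_diff` approach 1/4, which is the equilibrium frequency of each nucleotide under this model.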

$p_{A(t)}$ can also be interpreted as the frequency of A in a DNA sequence. For example, if we start with a sequence made of A's only, then $p_{A(0)} = 1$, and $p_{A(t)}$ is the expected frequency of A in the sequence at time t.

The Jukes and Cantor’s One-Parameter Model
[Figure: probability against time (million years) for $p_{ii}$ and $p_{ij}$, with the value ¼ marked on the probability axis]
Temporal changes in the probability of having a certain nucleotide at a given nucleotide site ($\alpha = 5 \times 10^{-9}$ substitutions/site/year)

Other models of sequence evolution

The Kimura two-Parameter Model [Figure: nucleotides A, G, C and T; transitions (A↔G, C↔T) occur at rate $\alpha$ and transversions at rate $\beta$]

The Kimura two-Parameter Model
[Figure: base pair differences against time since divergence (Myr), plotted separately for transitions and transversions]
Number of transitions and transversions between pairs of bovid mammal mitochondrial sequences (684 base pairs from the COII gene) against the estimated time of divergence

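The Kimura two-Parameter Model
\[ P_t = \begin{pmatrix} * & \beta & \alpha & \beta \\ \beta & * & \beta & \alpha \\ \alpha & \beta & * & \beta \\ \beta & \alpha & \beta & * \end{pmatrix} \]
$P_t$ = substitution probability matrix (rows and columns in the order A, C, G, T; $\alpha$ for transitions, $\beta$ for transversions)
$f = [\tfrac{1}{4} \; \tfrac{1}{4} \; \tfrac{1}{4} \; \tfrac{1}{4}]$ = base composition of sequences
* $p_{ii} = 1 - \sum_{j \ne i} p_{ij}$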

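The Felsenstein (1981) Model
\[ P_t = \begin{pmatrix} * & \mu\pi_C & \mu\pi_G & \mu\pi_T \\ \mu\pi_A & * & \mu\pi_G & \mu\pi_T \\ \mu\pi_A & \mu\pi_C & * & \mu\pi_T \\ \mu\pi_A & \mu\pi_C & \mu\pi_G & * \end{pmatrix} \]
$P_t$ = substitution probability matrix
$f = [\pi_A \; \pi_C \; \pi_G \; \pi_T]$ = base composition of sequences
* $p_{ii} = 1 - \sum_{j \ne i} p_{ij}$
This model assumes that there is variation in base composition.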

The Hasegawa, Kishino and Yano (1985) Model
\[ P_t = \begin{pmatrix} * & \beta\pi_C & \alpha\pi_G & \beta\pi_T \\ \beta\pi_A & * & \beta\pi_G & \alpha\pi_T \\ \alpha\pi_A & \beta\pi_C & * & \beta\pi_T \\ \beta\pi_A & \alpha\pi_C & \beta\pi_G & * \end{pmatrix} \]
$P_t$ = substitution probability matrix
$f = [\pi_A \; \pi_C \; \pi_G \; \pi_T]$ = base composition of sequences
* $p_{ii} = 1 - \sum_{j \ne i} p_{ij}$
This model assumes that there is variation in base composition and that transitions and transversions occur at different rates.

The General Reversible (REV) Model
\[ P_t = \begin{pmatrix} * & \mu\pi_C a & \mu\pi_G b & \mu\pi_T c \\ \mu\pi_A a & * & \mu\pi_G d & \mu\pi_T e \\ \mu\pi_A b & \mu\pi_C d & * & \mu\pi_T f \\ \mu\pi_A c & \mu\pi_C e & \mu\pi_G f & * \end{pmatrix} \]
$P_t$ = substitution probability matrix
$f = [\pi_A \; \pi_C \; \pi_G \; \pi_T]$ = base composition of sequences
* $p_{ii} = 1 - \sum_{j \ne i} p_{ij}$
This model assumes that there is variation in base composition and that each substitution has its own probability.

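Comparing the Models (from Page and Holmes, 1998)
Jukes-Cantor to Kimura 2 parameter: allow for $\alpha/\beta$ (transition/transversion) bias
Jukes-Cantor to Felsenstein (1981): allow for base frequency to vary
Kimura 2 parameter to Hasegawa, Kishino and Yano (1985): allow for base frequency to vary
Felsenstein (1981) to Hasegawa, Kishino and Yano (1985): allow for $\alpha/\beta$ bias
Hasegawa, Kishino and Yano (1985) to General Reversible (REV): allow all six pairs of substitutions to have different rates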

Among site rate variation

For protein coding sequences not all sites have the same probability of change (there is among site rate variation). If this effect is not taken into account, the number of substitutions per site between two sequences can be underestimated (Li and Graur, 1991).

Effect of among site rate variation in sequence divergence (Page and Holmes, 1998)
(A) Substitution rate of 0.5 % per Myr and 80 % of the sites free to vary
(B) Substitution rate of 2 % per Myr and 50 % of the sites free to vary

Gamma distribution
\[ f(r) = \frac{b^a}{\Gamma(a)}\, e^{-br}\, r^{a-1} \]
where:
\[ \Gamma(a) = \int_0^{\infty} e^{-t}\, t^{a-1}\, dt \]
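The density can be evaluated directly from this formula. This is a minimal sketch, assuming r is a relative substitution rate with r > 0 and that a (shape) and b are both positive; the function name `gamma_rate_density` is an illustrative choice.

```python
import math


def gamma_rate_density(r, a, b):
    """Gamma density f(r) = [b^a / Gamma(a)] * exp(-b r) * r^(a - 1)."""
    if r <= 0.0 or a <= 0.0 or b <= 0.0:
        raise ValueError("r, a and b must all be positive")
    return (b ** a / math.gamma(a)) * math.exp(-b * r) * r ** (a - 1.0)


# With a = b = 1 the density reduces to the exponential exp(-r)
density_at_one = gamma_rate_density(1.0, 1.0, 1.0)
```

Small values of a concentrate the density near r = 0 with a long tail (most sites change slowly, a few change fast), while large values of a make the rates more nearly equal across sites.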

The shape parameter a

Time reversibility

Time reversibility in the Jukes and Cantor’s One-Parameter Model
[Figure, built up over several slides: an ancestral A and two descendant A's, each separated from the ancestor by time t, with probability $p_{AA(t)}$ along each branch and $p_{AA(t)}^2$ for both; the same three A's are also shown in sequence at t = 0, t = 1 and t = 2]

Time reversibility in the Jukes and Cantor’s One-Parameter Model
A substitution process is said to be time reversible if the probability of starting from nucleotide i and changing to nucleotide j in a time interval t is the same as the probability of starting from j and going backward to i in the same time duration:
\[ p_{ij(t)} = p_{ji(t)} \]

Sequence similarity between two sequences

Divergence Between DNA sequences [Figure: an ancestral sequence giving rise to Sequence 1 and Sequence 2, each after time t]

The expected value of the proportion of identical nucleotides between the two sequences under study is equal to the probability, $I_{(t)}$, that the nucleotide at a given site at time t is the same in both sequences.

Sequence Similarity
[Figure, built up over several slides: an ancestral A and the nucleotide found at the same site in two descendant sequences after time t]
Both descendants still have A: probability $p_{AA(t)}^2$.
The two sequences are also identical after parallel substitutions:
Both descendants have C: probability $p_{AC(t)}^2$.
Both descendants have G: probability $p_{AG(t)}^2$.
Both descendants have T: probability $p_{AT(t)}^2$.

Sequence Similarity in the JC Model
Therefore,
\[ I_{(t)} = p_{AA(t)}^2 + p_{AT(t)}^2 + p_{AC(t)}^2 + p_{AG(t)}^2 \]
And from the JC model,
\[ I_{(t)} = \tfrac{1}{4} + \tfrac{3}{4} e^{-8\alpha t} \]
This equation also holds if the initial nucleotide was different from A, and represents the expected proportion of identical nucleotides between two sequences that diverged t time units ago.
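The step from the sum of squares to the closed form can be verified numerically. The sketch below computes $I_{(t)}$ both ways under the Jukes-Cantor probabilities $p_{ii(t)} = \tfrac{1}{4} + \tfrac{3}{4} e^{-4\alpha t}$ and $p_{ij(t)} = \tfrac{1}{4} - \tfrac{1}{4} e^{-4\alpha t}$; the function name and the parameter values are illustrative assumptions.

```python
import math


def jc_identity(alpha, t):
    """Return I(t) as (sum of squared JC probabilities, closed-form expression)."""
    decay = math.exp(-4.0 * alpha * t)
    p_same = 0.25 + 0.75 * decay
    p_diff = 0.25 - 0.25 * decay
    # both lineages keep the nucleotide, or both change to the same one of the other three
    by_sum = p_same ** 2 + 3.0 * p_diff ** 2
    closed_form = 0.25 + 0.75 * math.exp(-8.0 * alpha * t)
    return by_sum, closed_form


by_sum, closed_form = jc_identity(5e-9, 2e7)  # illustrative rate and divergence time
```

The exponent doubles from $4\alpha t$ to $8\alpha t$ because the two sequences are separated by a total time of 2t.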

Sequence similarity in the Jukes and Cantor’s One-Parameter Model
[Figure: proportion of identical nucleotides against time (million years), with the value ¼ marked on the vertical axis]
Temporal changes in the expected proportion of identical nucleotides between two sequences that diverged t years ago ($\alpha = 5 \times 10^{-9}$ substitutions/site/year)

Estimating the number of nucleotide substitutions between two sequences

Number of nucleotide substitutions between two sequences
\[ K = N / L \]
K: substitutions per nucleotide site. N: total number of substitutions. L: number of sites compared between two sequences.

A simple measure of genetic distance between two sequences is p:
\[ p = n_d / n \]
p: proportion of different sites. $n_d$: total number of differences. n: number of sites compared between two sequences.
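A minimal sketch of this measure follows, assuming the two sequences are already aligned, have the same length and contain no gaps or ambiguous characters; the function name `p_distance` is an illustrative choice.

```python
def p_distance(seq1, seq2):
    """Proportion of sites at which two aligned sequences differ: p = n_d / n."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must be aligned to the same length")
    n = len(seq1)
    if n == 0:
        raise ValueError("sequences must not be empty")
    n_d = sum(1 for x, y in zip(seq1, seq2) if x != y)  # count differing sites
    return n_d / n


identical = p_distance("ACTGAACG", "ACTGAACG")   # no differences
one_in_four = p_distance("AAAA", "AAAT")         # one difference in four sites
```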

Divergence Between DNA sequences
[Figure: an ancestral sequence and two descendant sequences (Sequence 1 and Sequence 2), each after time t, with examples of single substitutions, multiple substitutions, coincidental substitutions, parallel substitutions, convergent substitutions and back substitutions]
Although there have been 12 mutations, only 3 can be detected.

Sequence dissimilarity $D = 1 - I_{(t)}$
Due to multiple substitutions, the observed number of differences between two sequences is less than the true number of substitutions. [Figure: sequence dissimilarity (0 to 1) against time, comparing the proportion of observed differences with the proportion of actual differences]

Sequence dissimilarity $D = 1 - I_{(t)}$
Models of sequence evolution can be used to “correct” for multiple hits. [Figure: sequence dissimilarity (0 to 1) against time, showing the distance correction]

Estimating the number of nucleotide substitutions under the Jukes and Cantor’s One-Parameter Model
As we have seen, the expected proportion of identical nucleotides between two sequences that diverged t time units ago is given by:
\[ I_{(t)} = \tfrac{1}{4} + \tfrac{3}{4} e^{-8\alpha t} \]
And the probability that the two sequences are different at a site at time t is:
\[ p = 1 - I_{(t)} \]
Doing some algebra:
\[ p = 1 - \left( \tfrac{1}{4} + \tfrac{3}{4} e^{-8\alpha t} \right) \]
\[ p = \tfrac{3}{4} \left( 1 - e^{-8\alpha t} \right) \]
\[ 8\alpha t = -\ln\left( 1 - \tfrac{4}{3} p \right) \]
And since in the JC model $K = 2(3\alpha t)$ between two sequences:
\[ K = -\tfrac{3}{4} \ln\left( 1 - \tfrac{4}{3} p \right) \]
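The correction can be written as a short function. This is a sketch under the Jukes-Cantor assumptions, taking the observed proportion of different sites p as input; the function name and the round-trip values are illustrative. The formula is only defined for p below 3/4, so the sketch rejects larger values instead of returning a distance.

```python
import math


def jc_distance(p):
    """Jukes-Cantor estimate K = -(3/4) ln(1 - (4/3) p) of substitutions per site."""
    if not 0.0 <= p < 0.75:
        raise ValueError("the Jukes-Cantor correction requires 0 <= p < 3/4")
    return -0.75 * math.log(1.0 - (4.0 / 3.0) * p)


# Round trip with illustrative values: expected p for a known K = 6 * alpha * t
alpha, t = 5e-9, 1e7
expected_p = 0.75 * (1.0 - math.exp(-8.0 * alpha * t))
recovered_k = jc_distance(expected_p)   # should be close to 6 * alpha * t = 0.3
```

The corrected distance is always at least as large as p, and the gap widens as p approaches 3/4.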

Estimating the number of nucleotide substitutions under the Kimura two-Parameter Model
\[ K = \tfrac{1}{2} \ln(a) + \tfrac{1}{4} \ln(b) \]
where:
\[ a = \frac{1}{1 - 2P - Q}, \qquad b = \frac{1}{1 - 2Q} \]
and P and Q are the proportions of transitional and transversional differences between the two sequences.
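A minimal sketch of the two-parameter estimate follows, taking the proportions P and Q as already counted from an alignment; the function name `k2p_distance` is an illustrative choice. When transitions make up one third of the differences (P = p/3, Q = 2p/3), the result coincides with the one-parameter estimate $-\tfrac{3}{4} \ln(1 - \tfrac{4}{3} p)$, which gives a convenient check.

```python
import math


def k2p_distance(P, Q):
    """Kimura two-parameter estimate K = (1/2) ln(a) + (1/4) ln(b)."""
    if P < 0.0 or Q < 0.0:
        raise ValueError("P and Q must be non-negative proportions")
    if 1.0 - 2.0 * P - Q <= 0.0 or 1.0 - 2.0 * Q <= 0.0:
        raise ValueError("distance is undefined: the logarithm arguments must be positive")
    a = 1.0 / (1.0 - 2.0 * P - Q)
    b = 1.0 / (1.0 - 2.0 * Q)
    return 0.5 * math.log(a) + 0.25 * math.log(b)


p = 0.3                                     # illustrative proportion of different sites
k2p = k2p_distance(p / 3.0, 2.0 * p / 3.0)  # no transition/transversion bias
jc = -0.75 * math.log(1.0 - (4.0 / 3.0) * p)
```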

Estimating the number of amino acid substitutions using the Poisson Correction for protein sequences

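[Figure: a protein sequence, M C A N T P L …, and the probability P of each number of substitutions at a site]
\[ P_{(k)} = e^{-rt} (rt)^k / k! \]
\[ P_{(0)} = e^{-rt} \]
\[ P_{(1)} = e^{-rt}\, rt \]
\[ P_{(2)} = e^{-rt} (rt)^2 / 2! \]
\[ P_{(n)} = e^{-rt} (rt)^n / n! \]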

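Estimating the number of amino acid substitutions using the Poisson Correction for protein sequences
[Figure: ancestral sequence A giving rise to sequences 1 and 2; the probability of no substitution along each branch is $e^{-rt}$]
The probability that neither of the sequences has undergone a substitution at a site is:
\[ q = (e^{-rt})^2 = e^{-2rt} = 1 - p \]
Since $K = 2rt$, doing a little algebra:
\[ e^{-K} = 1 - p \]
\[ K = -\ln(1 - p) \]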
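The Poisson correction is a one-line computation. The sketch below assumes p is the observed proportion of differing amino acid sites between two aligned protein sequences; the function name and the round-trip value K = 0.5 are illustrative.

```python
import math


def poisson_distance(p):
    """Poisson-corrected distance K = -ln(1 - p) for protein sequences."""
    if not 0.0 <= p < 1.0:
        raise ValueError("p must satisfy 0 <= p < 1")
    return -math.log(1.0 - p)


# Round trip: the proportion of differences expected for K = 0.5, then the correction
expected_p = 1.0 - math.exp(-0.5)
recovered_k = poisson_distance(expected_p)
```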

Genetic distance using Poisson Correction

Trees

A phylogeny and the three basic kinds of tree used to depict that phylogeny (after Page and Holmes, 1998). [Figure: a phylogeny of taxa A, B and C with character changes along a time axis, drawn as a cladogram, an additive tree and an ultrametric tree]

Distance Methods for Phylogenetic Inference

Distance Matrix [Figure: a pairwise distance matrix for sequences 1 to 10]

In order for a distance measure to be used to build phylogenies, it must satisfy some basic requirements: it must be metric, and it must be additive.

Metric distances
A distance $d(a,b)$ between sequence a and sequence b is metric if:
1. $d(a,b) \ge 0$ (non-negativity)
2. $d(a,b) = d(b,a)$ (symmetry)
3. $d(a,c) \le d(a,b) + d(b,c)$ (triangle inequality)
4. $d(a,b) = 0$ if and only if $a = b$ (distinctness)

Ultrametric distances
A distance is ultrametric if:
5. $d(a,b) \le \max[d(a,c), d(b,c)]$
An ultrametric distance has the property of implying a constant evolutionary rate. [Figure: tree relating a, b and c]
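These conditions can be tested on a distance matrix. The sketch below assumes the matrix is a square list of lists indexed by taxon; the function names, the tolerance `tol` and the example matrix are illustrative assumptions. Distinctness is only checked on the diagonal here, since two different taxa may carry identical sequences. The ultrametric condition is applied in its equivalent three-point form: among the three distances of any triple, the two largest must be equal.

```python
from itertools import combinations


def is_metric(d, tol=1e-9):
    """Check non-negativity, symmetry, zero diagonal and the triangle inequality."""
    n = len(d)
    for a in range(n):
        if abs(d[a][a]) > tol:
            return False
        for b in range(n):
            if d[a][b] < -tol or abs(d[a][b] - d[b][a]) > tol:
                return False
            for c in range(n):
                if d[a][c] > d[a][b] + d[b][c] + tol:
                    return False
    return True


def is_ultrametric(d, tol=1e-9):
    """Check that in every triple the two largest distances are equal."""
    for a, b, c in combinations(range(len(d)), 3):
        _, middle, largest = sorted((d[a][b], d[a][c], d[b][c]))
        if largest - middle > tol:
            return False
    return True


# Illustrative matrix for taxa A, B, C, D that fits a clock-like tree
clock_like = [[0, 2, 4, 6],
              [2, 0, 4, 6],
              [4, 4, 0, 6],
              [6, 6, 6, 0]]
```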

Additive distances
Four point condition:
\[ d(a,b) + d(c,d) \le \max[\, d(a,c) + d(b,d),\; d(a,d) + d(b,c) \,] \]
[Figure: tree relating a, b, c and d]
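The four point condition can be checked over every quartet of taxa. Requiring the inequality for every labelling of a quartet is equivalent to requiring that, of the three pairwise sums, the two largest are equal, which is the form used in this sketch. The function name, the tolerance and the example matrices are illustrative assumptions; the first matrix was built from a tree with terminal branches A = 1, B = 2, C = 3, D = 1 and an internal branch of length 1 between (A, B) and (C, D).

```python
from itertools import combinations


def satisfies_four_point(d, tol=1e-9):
    """Check the four point condition for every quartet in distance matrix d."""
    for a, b, c, e in combinations(range(len(d)), 4):
        sums = sorted((d[a][b] + d[c][e],
                       d[a][c] + d[b][e],
                       d[a][e] + d[b][c]))
        if sums[2] - sums[1] > tol:   # the two largest sums must coincide
            return False
    return True


additive = [[0, 3, 5, 3],
            [3, 0, 6, 4],
            [5, 6, 0, 4],
            [3, 4, 4, 0]]

# Same matrix with the A-C distance perturbed, so it no longer fits any tree exactly
perturbed = [[0, 3, 5.5, 3],
             [3, 0, 6, 4],
             [5.5, 6, 0, 4],
             [3, 4, 4, 0]]
```

The additive example is not ultrametric, so it fits a tree but not a tree with a constant rate.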

An ultrametric distance matrix between four sequences (a, b, c, d) and the corresponding ultrametric tree [Figure: matrix and tree]

An additive distance matrix between four sequences (a, b, c, d) and the corresponding additive tree [Figure: matrix and tree]

Unweighted Pair-group Method using Arithmetic averages (UPGMA)
OTU   A          B          C
B     $d_{AB}$
C     $d_{AC}$   $d_{BC}$
D     $d_{AD}$   $d_{BD}$   $d_{CD}$

Unweighted Pair-group Method using Arithmetic averages (UPGMA) [Figure: A and B joined at a node, each branch of length $d_{AB}/2$]

Unweighted Pair-group Method using Arithmetic averages (UPGMA)
OTU   (AB)          C
C     $d_{(AB)C}$
D     $d_{(AB)D}$   $d_{CD}$
\[ d_{(AB)C} = (d_{AC} + d_{BC}) / 2 \]
\[ d_{(AB)D} = (d_{AD} + d_{BD}) / 2 \]

[Figure: C joined to the cluster (A, B) at a node placed at $d_{(AB)C}/2$]

Unweighted Pair-group Method using Arithmetic averages (UPGMA)
\[ d_{(ABC)D} / 2 = [\, (d_{AD} + d_{BD} + d_{CD}) / 3 \,] / 2 \]
[Figure: tree of A, B, C and D, with D joined last]

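Unweighted Pair-group Method using Arithmetic averages (UPGMA)
\[ d_{XY} = \sum_{i \in X,\, j \in Y} d_{ij} \,/\, (n_X n_Y) \]
where $n_X$ and $n_Y$ are the numbers of OTUs in clusters X and Y.
Assumes a constant molecular clock.
Estimates tree topology and branch length.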
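The clustering steps above can be sketched as follows. This is an illustrative implementation, not the one used to prepare the slides: the function name `upgma`, the nested-tuple tree representation and the example matrix are assumptions. Distances to a merged cluster are updated with a size-weighted average, which gives the same value as averaging over all pairs of OTUs as in the formula for $d_{XY}$.

```python
def upgma(labels, matrix):
    """Return (tree, root_height); a tree is a label or a pair of (subtree, branch_length)."""
    clusters = [(label, 1, 0.0) for label in labels]  # (subtree, number of OTUs, node height)
    d = [list(row) for row in matrix]
    while len(clusters) > 1:
        n = len(clusters)
        # pick the pair of clusters with the smallest distance
        i, j = min(((x, y) for x in range(n) for y in range(x + 1, n)),
                   key=lambda pair: d[pair[0]][pair[1]])
        tree_i, n_i, h_i = clusters[i]
        tree_j, n_j, h_j = clusters[j]
        height = d[i][j] / 2.0  # the new node sits at half the distance between the clusters
        merged = ((tree_i, height - h_i), (tree_j, height - h_j))
        keep = [k for k in range(n) if k not in (i, j)]
        # arithmetic average over all OTU pairs, computed as a size-weighted mean
        new_row = [(n_i * d[i][k] + n_j * d[j][k]) / (n_i + n_j) for k in keep]
        d = [[d[a][b] for b in keep] for a in keep]
        for row, value in zip(d, new_row):
            row.append(value)
        d.append(new_row + [0.0])
        clusters = [clusters[k] for k in keep] + [(merged, n_i + n_j, height)]
    return clusters[0][0], clusters[0][2]


# Illustrative clock-like matrix for OTUs A, B, C, D
tree, root_height = upgma(["A", "B", "C", "D"],
                          [[0, 2, 4, 6],
                           [2, 0, 4, 6],
                           [4, 4, 0, 6],
                           [6, 6, 6, 0]])
```

In this example A and B are joined first at height 1, C joins them at height 2, and D joins last at height 3. Because the method places every node at half the inter-cluster distance, it only recovers the correct tree when the data are close to ultrametric.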

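Minimum Evolution Method
In this method, the sum (S) of all branch length estimates is computed for all or all plausible topologies, and the topology that has the smallest S value is chosen as the best tree.
\[ S = \sum_{i}^{T} b_i \]
where $b_i$ is the estimated length of branch i and the sum runs over the T branches of the tree.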

Neighbor-Joining Method
The principle of the N-J method is to find neighbors sequentially that may minimize the total length of the tree.
This method starts with a starlike tree. [Figure: star tree with central node X]
The first step is to separate a pair of OTUs from all others. [Figure: tree with nodes X and Y]
Among all the possible pairs of OTUs, the one with the smallest sum of branch lengths is chosen.
This procedure is repeated until all interior branches are found.
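One step of this procedure can be sketched as follows. The slides do not give a formula for the selection criterion, so this sketch assumes the commonly used rate-corrected form $Q_{ij} = (n - 2)\, d_{ij} - r_i - r_j$, with $r_i = \sum_k d_{ik}$; choosing the pair with the smallest $Q_{ij}$ selects the same pair as minimizing the sum of branch lengths. The function name and the example matrix are illustrative.

```python
def nj_pick_neighbors(d):
    """Choose the pair of OTUs to join and the branch lengths to their new node."""
    n = len(d)
    if n < 3:
        raise ValueError("at least three OTUs are needed")
    r = [sum(row) for row in d]  # total distance from each OTU to all others
    best_pair, best_q = None, None
    for i in range(n):
        for j in range(i + 1, n):
            q = (n - 2) * d[i][j] - r[i] - r[j]
            if best_q is None or q < best_q:
                best_pair, best_q = (i, j), q
    i, j = best_pair
    # branch lengths from OTUs i and j to the node that joins them
    length_i = 0.5 * d[i][j] + (r[i] - r[j]) / (2.0 * (n - 2))
    length_j = d[i][j] - length_i
    return best_pair, (length_i, length_j)


# Additive matrix for OTUs A, B, C, D built from a tree with terminal branches
# A = 1, B = 2, C = 3, D = 1 and an internal branch of length 1
pair, lengths = nj_pick_neighbors([[0, 3, 5, 3],
                                   [3, 0, 6, 4],
                                   [5, 6, 0, 4],
                                   [3, 4, 4, 0]])
```

For this matrix the sketch joins A and B and recovers their branch lengths, 1 and 2, even though the distances are not ultrametric. A full implementation would then replace the joined pair by the new node, update the distances and repeat until the tree is resolved.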