Models of Sequence Evolution


JC69: Jukes-Cantor (1969) model. Assumes all bases interchange with equal probabilities and that base frequencies are equal.
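The JC69 assumptions lead to closed-form transition probabilities. The sketch below uses the standard textbook formulas, with d measured in expected substitutions per site; the function names are illustrative and do not come from the slides.

```python
import math

def jc69_p_same(d):
    """Probability that a site shows the same base after d expected substitutions per site."""
    return 0.25 + 0.75 * math.exp(-4.0 * d / 3.0)

def jc69_p_change(d):
    """Probability that a site ends in one specific different base."""
    return 0.25 - 0.25 * math.exp(-4.0 * d / 3.0)

def jc69_distance(p):
    """JC69-corrected distance from the observed proportion p of differing sites."""
    if not 0.0 <= p < 0.75:
        raise ValueError("p must lie in [0, 0.75)")
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)
```

Because the three possible changes are equally likely, the observed proportion of differences expected at distance d is 3 * jc69_p_change(d), and jc69_distance inverts that relationship.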

F81: Felsenstein (1981) model. Assumes all bases interchange with equal probabilities; base frequencies may vary.

HKY85: Hasegawa, Kishino, and Yano (1985) model. Assumes unequal transition and transversion probabilities; base frequencies may vary.

REV: General Time Reversible model. Assumes unequal probabilities for all substitution types; base frequencies may vary.
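One way to see how these models nest is to build the rate matrix from symmetric exchangeabilities and base frequencies. The sketch below uses the common parameterization q_ij = s_ij * pi_j, which is an assumption rather than something the slides spell out; the function names and the kappa parameter are illustrative.

```python
BASES = "ACGT"

def gtr_rate_matrix(exch, freqs):
    """Build Q with q_ij = s_ij * pi_j for i != j, rows summing to zero,
    scaled so the expected rate is one substitution per unit time.

    exch  : dict mapping unordered pairs such as "AC" to exchangeabilities
    freqs : dict mapping each base to its equilibrium frequency
    """
    q = [[0.0] * 4 for _ in range(4)]
    for i, a in enumerate(BASES):
        for j, b in enumerate(BASES):
            if i != j:
                s = exch["".join(sorted(a + b))]
                q[i][j] = s * freqs[b]
        q[i][i] = -sum(q[i])  # diagonal is still 0.0 when the row is summed
    rate = -sum(freqs[a] * q[i][i] for i, a in enumerate(BASES))
    return [[x / rate for x in row] for row in q]

def hky85_exchangeabilities(kappa):
    """Transitions (A<->G, C<->T) get kappa; transversions get 1."""
    return {"AC": 1.0, "AG": kappa, "AT": 1.0,
            "CG": 1.0, "CT": kappa, "GT": 1.0}
```

With kappa = 1 this reduces to F81, and with equal frequencies as well it reduces to JC69. REV corresponds to six free exchangeabilities.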

Models applied to a dataset of 13 HIV pol sequences (273 nt):

Model    # par.   ln Lik
JC69     1        -1227.45
F81      4        -1187.54
K2P      2        -1210.49
HKY85    5        -1165.70
REV      9        -1151.63
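The log likelihoods in the table can be compared with likelihood ratio tests. The sketch below takes the parameter counts and log likelihoods exactly as reported. It assumes the usual nesting of these models and a chi-squared approximation for the test statistic; the nesting list and the 5% critical values are not part of the slides.

```python
# (number of parameters, ln likelihood) as reported in the table
FITS = {
    "JC69": (1, -1227.45),
    "F81": (4, -1187.54),
    "K2P": (2, -1210.49),
    "HKY85": (5, -1165.70),
    "REV": (9, -1151.63),
}

# Upper 5% points of the chi-squared distribution for df = 1, 3, 4
CHI2_CRIT_05 = {1: 3.841, 3: 7.815, 4: 9.488}

# Assumed nesting: (simpler model, more general model)
NESTED = [("JC69", "F81"), ("JC69", "K2P"), ("F81", "HKY85"),
          ("K2P", "HKY85"), ("HKY85", "REV")]

def lrt(null, alt):
    """Return (2 * difference in ln likelihood, difference in parameter count)."""
    k0, l0 = FITS[null]
    k1, l1 = FITS[alt]
    return 2.0 * (l1 - l0), k1 - k0

for null, alt in NESTED:
    stat, df = lrt(null, alt)
    verdict = "reject" if stat > CHI2_CRIT_05[df] else "retain"
    print(f"{null} vs {alt}: 2*dlnL = {stat:.2f}, df = {df}, {verdict} {null}")
```

Under these assumptions every comparison exceeds its critical value, so each added set of parameters improves the fit significantly for this dataset.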

Other Forms of Rate Heterogeneity: variation from gene to gene; variation from site to site within a gene; synonymous vs nonsynonymous rates; spatial rate heterogeneity; variation from lineage to lineage; correlations among sets of sites.

Site-to-site rate heterogeneity

Gamma models of site-to-site heterogeneity. Hierarchical model: evolutionary rates at individual sites are drawn from a gamma distribution. Given the rate at a particular site, sequence evolution follows one of the previously discussed models. Original idea: Uzzell and Corbin (1971). First likelihood treatment: Yang (1993).

REV+G: General Time Reversible Model with Gamma Rate Heterogeneity (Yang 1993)

“All” we are doing here is integrating the likelihood function over all possible rate values, with those values being weighted according to probabilities assigned by the gamma distribution.
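Written out, the integration described here might look as follows. The notation is an assumption: x_i is the data at site i, theta collects the substitution parameters, r is the site rate, and g is a gamma density with shape alpha and mean 1.

```latex
L(\theta, \alpha) = \prod_{i=1}^{n} \int_{0}^{\infty} \Pr(x_i \mid \theta, r)\, g(r; \alpha)\, dr,
\qquad
g(r; \alpha) = \frac{\alpha^{\alpha}}{\Gamma(\alpha)}\, r^{\alpha - 1} e^{-\alpha r}
```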

Calculating the integral in the continuous case is very expensive. Yang (1994) suggested “discretizing” the gamma distribution, and using the discrete form of the likelihood function. The cost of calculation increases only linearly with the number of rate categories, N.
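A minimal sketch of the discretization, assuming equal-probability categories that are each represented by their mean rate and a gamma distribution with shape and rate both equal to alpha (mean 1). The incomplete gamma function is computed with a plain power series to keep the example self-contained; all function names are illustrative.

```python
import math

def reg_lower_gamma(a, x, terms=500):
    """Regularized lower incomplete gamma P(a, x), by its power series."""
    if x <= 0.0:
        return 0.0
    term = 1.0 / a
    total = term
    for n in range(1, terms):
        term *= x / (a + n)
        total += term
        if term < total * 1e-15:
            break
    return total * math.exp(a * math.log(x) - x - math.lgamma(a))

def gamma_quantile(p, shape, rate):
    """Quantile of Gamma(shape, rate), found by bisection on the CDF."""
    lo, hi = 0.0, 1.0
    while reg_lower_gamma(shape, rate * hi) < p:
        hi *= 2.0
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if reg_lower_gamma(shape, rate * mid) < p:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def discrete_gamma_rates(alpha, ncat):
    """Mean rate within each of ncat equal-probability categories."""
    cuts = [gamma_quantile(k / ncat, alpha, alpha) for k in range(1, ncat)]
    # P(alpha + 1, alpha * cut) evaluated at 0, the cut points, and infinity
    cum = [0.0] + [reg_lower_gamma(alpha + 1.0, alpha * c) for c in cuts] + [1.0]
    return [ncat * (cum[k + 1] - cum[k]) for k in range(ncat)]

def site_likelihood(cond_lik, rates):
    """Average a rate-conditional site likelihood over equally weighted categories."""
    return sum(cond_lik(r) for r in rates) / len(rates)
```

The site likelihood is one conditional likelihood evaluation per category, which is why the cost grows linearly with the number of categories.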

13 HIV pol sequences

4 alpha-spectrin sequences

Correlations among sites: codons; dinucleotides (secondary structure). General idea: move from 4-state nucleotide models to 16- or 64-state models.

MG94: Muse and Gaut (1994)

GY94: Goldman and Yang (1994)

This approach can be combined with any nucleotide model. Rate heterogeneity can be added in the same way as with nucleotide models. It accounts for correlations among nucleotide sites within codons, avoids the problematic notion of “degeneracy classes”, and is necessary for rigorous estimates of synonymous and nonsynonymous substitution rates.

Muse and Gaut (1994) model
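The slide's rate matrix did not survive in this transcript. As a hedged illustration, the sketch below follows the form in which the Muse and Gaut (1994) model is commonly described: single-nucleotide changes occur at a synonymous rate alpha or a nonsynonymous rate beta, weighted by the frequency of the target nucleotide, and all other changes have rate zero. The function name and argument layout are assumptions.

```python
from itertools import product

# Standard genetic code, codons enumerated with bases in the order T, C, A, G
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}
SENSE = [c for c in CODE if CODE[c] != "*"]  # sense codons only

def mg94_rate(ci, cj, alpha, beta, pi):
    """Instantaneous rate from codon ci to codon cj (sketch of MG94).

    alpha : synonymous rate
    beta  : nonsynonymous rate
    pi    : dict of nucleotide frequencies
    Changes at more than one position, or involving stop codons, get rate 0.
    """
    if CODE[ci] == "*" or CODE[cj] == "*":
        return 0.0
    diffs = [k for k in range(3) if ci[k] != cj[k]]
    if len(diffs) != 1:
        return 0.0
    target = cj[diffs[0]]
    rate = alpha if CODE[ci] == CODE[cj] else beta
    return rate * pi[target]
```

Excluding stop codons leaves a 61-state model rather than the full 64 states.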

Consider the result of allowing gamma variation in rates over codons: while the synonymous and nonsynonymous rates are allowed to have different magnitudes, the two classes of rates have the same distribution. However, synonymous rates are likely to be less variable over sites than are nonsynonymous rates.

Site-to-site rate variation: transitions and transversions with independent distributions. [Figure: substitution diagram over the nucleotides A, C, G, and T]

Muse and Gaut (1994) modification

Each site in the sequence has a (random) synonymous rate and a (random) nonsynonymous rate, drawn from some bivariate distribution f. The likelihood is again integrated with respect to f.

As before, the likelihood function is discretized for computational feasibility. The discretization process can be tricky in general. In all that follows, we assume that the synonymous and nonsynonymous rates are independent gamma random variables, which allows discretization of each axis separately.
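Under the independence assumption, the discretized site likelihood can be written as a double sum. The notation is an assumption, since the slide's equation is not preserved here: alpha_j and beta_k are the category rates for the synonymous and nonsynonymous axes, and p_j and q_k are their category probabilities.

```latex
L_i \approx \sum_{j=1}^{J} \sum_{k=1}^{K} p_j\, q_k\, \Pr(x_i \mid \alpha_j, \beta_k)
```

With equal-probability categories, p_j = 1/J and q_k = 1/K, so each site requires J times K conditional likelihood evaluations.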

Goals of molecular evolutionary analyses: understand the structure of the substitution parameters. Are parameter values (i.e., rates) equal among different branches? Is the structure of the parameters (e.g., TS/TV ratio) the same for different branches? Are the values of such parameters “related” among different genes?

Likelihood function for homologous DNA sequences A, B, and G. The likelihood depends on the collection of all parameters affecting the evolution of sequences A, B, and G, and on the collection of all data (sequences A, B, and G).

Relative Rate Tests: are evolutionary rates the same along two lineages? [Figure: tree relating sequences A, B, and G]

Versions of the relative rate test: distance based (Wu and Li 1985); for 2 clades (Li and Bousquet 1992); likelihood ratio (Muse and Weir 1992); nonparametric (Tajima 1993).

Likelihood-based RR Test. Maximize L assuming equal rates along the two lineages, then maximize L without constraints. The LRT statistic has a chi-squared distribution if rates are equal. Note: use of an outgroup ensures that the unknown divergence time is irrelevant.

Distance-based RR Tests [Figure: tree with labels G, a, b, g]

Nonparametric RR Test [Figure: sequences A, B, and G]
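A minimal sketch of a nonparametric test in the spirit of Tajima (1993), assuming three aligned sequences with G as the outgroup. It counts sites where only A differs (m1) and sites where only B differs (m2), then compares the counts with a one-degree-of-freedom chi-squared statistic. The function name and the gap handling are illustrative choices.

```python
def tajima_relative_rate(seq_a, seq_b, seq_out):
    """Return (m1, m2, chi-squared statistic with 1 df).

    m1 counts sites where A differs while B matches the outgroup;
    m2 counts sites where B differs while A matches the outgroup.
    """
    if not (len(seq_a) == len(seq_b) == len(seq_out)):
        raise ValueError("sequences must be aligned to the same length")
    m1 = m2 = 0
    for a, b, g in zip(seq_a, seq_b, seq_out):
        if "-" in (a, b, g):
            continue  # skip gapped columns
        if b == g and a != g:
            m1 += 1
        elif a == g and b != g:
            m2 += 1
    if m1 + m2 == 0:
        return m1, m2, 0.0
    return m1, m2, (m1 - m2) ** 2 / (m1 + m2)
```

Under equal rates m1 and m2 have the same expectation, so a statistic above 3.841 would reject rate equality at the 5% level.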

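[Figure: nonsynonymous (Nonsyn) and synonymous (Syn) comparisons for ndhF and rbcL]

[Figure: Locus A and Locus B trees illustrating a locus effect, a lineage effect, and a lineage × locus effect]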

Relative Ratio Tests: are “branch lengths” proportional among loci? (Muse and Gaut 1997; Muse et al. 1997; Huelsenbeck et al. 1997; Yang 1995)

Relative Ratio Test. The null hypothesis is that the relative proportions of branch lengths are the same for all loci. The proportion need not be known a priori. [Figure: two trees, each with branches labeled 1 to 4]
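For two loci, the null hypothesis can be stated compactly. The notation is an assumption: b_k^{(A)} and b_k^{(B)} are the lengths of branch k at loci A and B, m is the number of branches, and c is a free constant estimated from the data.

```latex
H_0:\; b_k^{(B)} = c\, b_k^{(A)} \quad \text{for every branch } k = 1, \dots, m
```

A likelihood ratio test then compares the fit with this constraint against the fit in which every branch length is estimated separately for each locus.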