Modelling evolution
Gil McVean, Department of Statistics


Outline
The dynamics of microevolution
Natural selection
Modelling changes in DNA
Estimation of substitution parameters

How do genes change
Occasionally, the DNA sequence changes through mutation (about 1.5 x 10^-8 per base pair per generation in humans)
Most of these mutations will be lost from the population by chance or driven out by selection
But some will increase in frequency and ultimately become fixed
[Figure: frequency (0 to 1) against time for advantageous, deleterious and neutral mutations]

The rate of fixation
Consider a neutral locus: every generation, the expected number of new mutations appearing in the population is equal to the number of chromosomes (2N) times the mutation rate (u)
It is also true that, ultimately, every chromosome in the population will be descended from a single chromosome in the current population
Because the locus is neutral, every chromosome has an equal chance of being the ultimate ancestor, so the fixation probability is 1/2N
The rate of substitution is equal to the rate of appearance of new mutations times their probability of fixation. For the neutral locus this is 2N x u x 1/2N = u, the neutral mutation rate
A mutation that does fix takes an average of 4N generations to do so, but the variance in fixation time is large
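
The argument above can be checked numerically. The following is a minimal sketch, assuming a Wright-Fisher population of N diploids (2N chromosomes) with binomial resampling each generation; the function names and the choice of model are illustrative assumptions, not taken from the slides.

```python
import random


def neutral_substitution_rate(n_diploid, u):
    """Rate of appearance of new mutations times their fixation probability."""
    new_mutations_per_generation = 2 * n_diploid * u
    fixation_probability = 1.0 / (2 * n_diploid)
    return new_mutations_per_generation * fixation_probability


def simulate_fixation_fraction(n_diploid, replicates, seed=1):
    """Fraction of new neutral mutations (one initial copy) that reach fixation.

    Assumption: Wright-Fisher resampling of 2N chromosomes each generation.
    """
    rng = random.Random(seed)
    chromosomes = 2 * n_diploid
    fixed = 0
    for _ in range(replicates):
        count = 1  # a new mutation starts as a single copy
        while 0 < count < chromosomes:
            freq = count / chromosomes
            # Each chromosome in the next generation copies a mutant with probability freq
            count = sum(rng.random() < freq for _ in range(chromosomes))
        if count == chromosomes:
            fixed += 1
    return fixed / replicates


if __name__ == "__main__":
    print(neutral_substitution_rate(1000, 1.5e-8))
    print(simulate_fixation_fraction(10, 5000))
```

With 2N = 20 chromosomes the simulated fraction should scatter around 1/20, and the substitution rate returned by the first function does not depend on N.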

Adding selection
Natural selection comes in many forms; the simplest model is to consider two alleles at a single locus
We can make use of diffusion theory to model the dynamics of microevolution with selection
Most results are due to Wright and Kimura. Kimura derived the fixation probability formula (when h is 0.5)

Genotype   aa   Aa     AA
Fitness    1    1+hs   1+s
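
The formula itself did not survive in this transcript. For reference, the standard diffusion result for a single new mutation under additive (genic) selection can be written as below. The symbol sigma for the heterozygote's selective advantage (sigma = hs, so sigma = s/2 when h = 0.5) is an assumption introduced here, and the slide may have used a different parameterisation.

```latex
% Fixation probability of a new mutation (initial frequency 1/2N), additive selection.
% sigma: selective advantage of the heterozygote (assumed symbol, sigma = hs).
P_{\mathrm{fix}} = \frac{1 - e^{-2\sigma}}{1 - e^{-4N\sigma}},
\qquad
\lim_{\sigma \to 0} P_{\mathrm{fix}} = \frac{1}{2N},
\qquad
P_{\mathrm{fix}} \approx 2\sigma \quad \text{when } 4N\sigma \gg 1 \text{ and } \sigma \ll 1 .
```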

Consequences of Kimura’s formula
The key parameter for more weakly selected mutations is the product of the population size and the selection coefficient (Ns)
For strongly favourable mutations (Ns >> 1), the fixation probability is approximately twice the selective advantage
There is a window (-1 < Ns < 1) where mutations behave in a manner that is nearly (but not quite) neutral
In smaller populations, the substitution rate of slightly deleterious mutations can increase. Perhaps this explains why human proteins evolve faster than those in rodents?
[Figure: relative fixation probability against 4Ns]
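
These consequences can be explored numerically. The sketch below evaluates the standard diffusion formula for a new additive mutation, P = (1 - exp(-2 sigma)) / (1 - exp(-4 N sigma)), where sigma is the heterozygote's selective advantage. The function names and the symbol sigma are assumptions made for illustration, and the slide's own parameterisation may differ.

```python
import math


def fixation_probability(n_diploid, sigma):
    """Fixation probability of a single new mutation under additive selection.

    sigma is the selective advantage of the heterozygote (assumed symbol).
    Very large negative values of 4*N*sigma can overflow the exponential.
    """
    if sigma == 0.0:
        return 1.0 / (2 * n_diploid)  # neutral limit
    # expm1 keeps precision when sigma is close to zero
    return math.expm1(-2.0 * sigma) / math.expm1(-4.0 * n_diploid * sigma)


def relative_fixation_probability(n_diploid, sigma):
    """Fixation probability relative to that of a neutral mutation (1/2N)."""
    return fixation_probability(n_diploid, sigma) * 2 * n_diploid


if __name__ == "__main__":
    for sigma in (-0.001, -0.0001, 0.0, 0.0001, 0.001, 0.01):
        print(sigma, relative_fixation_probability(1000, sigma))
```

Running the loop with N = 1000 shows the relative fixation probability staying near 1 inside the nearly neutral window and growing roughly like 4N sigma for strongly favoured mutations.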

Various complications
Simple models of molecular evolution are helpful, but ignore many important biological features
–Changing population sizes
–Changing selection pressures
–Geographical structure
–Interactions between mutations (epistasis)
–Interference between linked mutations

The time-scale of micro- and macro-evolution
The process of substitution is called micro-evolution
Ultimately, only fixed mutations leave traces in our genes
The accumulation of fixed differences between species leads to macro-evolution
Viewed over scales of millions of years, the process of substitution is effectively instantaneous

Modelling long-term evolution of DNA sequences
If we approximate the selection process as instantaneous, modelling DNA sequence evolution gets much easier
In particular, we can treat the evolution of a DNA sequence through time as a Markov process
Furthermore, if we assume the nucleotide (or codon) positions evolve independently, we can perform very efficient simulation and inference on the time of divergence

Modelling evolution of a single nucleotide
At a single nucleotide we only have four states (T, C, A, G). We will ignore indels
Define the 12 rates of change from each nucleotide to every other one
We may wish to deal with changes over a fixed length of time, say one million years
–The key is that we need a ‘rate’ for each possible change
[Figure: diagram of the four nucleotides T, C, A, G]

Using the model to simulate evolution
Draw the ancestor sequence from its stationary distribution
Calculate the total rate at which events occur to this sequence: the sum over nucleotides, i, of the rate at which they are substituted by other nucleotides, j
If we can treat the substitution process as a Poisson process, the time to the first substitution is exponentially distributed with rate equal to this total
The probability that a given nucleotide is chosen to be substituted by a given other nucleotide is proportional to the rate of that substitution
–This allows you to choose the substitution event
Continue for as long as you want to simulate evolution for
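
The simulation steps on this slide can be sketched as follows. This is a minimal illustration, not the author's implementation: the function names are hypothetical, and the Jukes-Cantor helper (all 12 rates equal, so the stationary distribution is uniform) is only one convenient choice of rates.

```python
import random

NUCLEOTIDES = "TCAG"


def jukes_cantor_rates(lam):
    """Illustrative choice: the same rate lam for each of the 12 possible changes."""
    return {(i, j): lam for i in NUCLEOTIDES for j in NUCLEOTIDES if i != j}


def draw_ancestor(length, stationary, rng):
    """Draw each site independently from the stationary distribution."""
    return "".join(rng.choices(list(stationary), weights=list(stationary.values()), k=length))


def simulate_evolution(sequence, rates, total_time, rng):
    """Apply substitutions one event at a time until total_time has elapsed."""
    seq = list(sequence)
    elapsed = 0.0
    while True:
        # Every possible event: (site, new nucleotide, rate of that change)
        events = [(pos, j, rates[(nuc, j)])
                  for pos, nuc in enumerate(seq)
                  for j in NUCLEOTIDES if j != nuc]
        weights = [rate for _, _, rate in events]
        total_rate = sum(weights)
        if total_rate <= 0.0:
            break
        # Waiting time to the next substitution is exponential with the total rate
        elapsed += rng.expovariate(total_rate)
        if elapsed > total_time:
            break
        # Choose the event with probability proportional to its rate
        pos, new_nuc, _ = rng.choices(events, weights=weights)[0]
        seq[pos] = new_nuc
    return "".join(seq)


if __name__ == "__main__":
    rng = random.Random(0)
    ancestor = draw_ancestor(50, {n: 0.25 for n in NUCLEOTIDES}, rng)
    print(ancestor)
    print(simulate_evolution(ancestor, jukes_cantor_rates(0.1), 1.0, rng))
```

Rebuilding the event list after every substitution keeps the sketch close to the description above; a faster version would update only the rates at the site that changed.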

An alternative approach
We have outlined a very general approach to simulating sequence evolution
For the very simple model we have discussed, we can actually make things even more efficient, which helps for inference, by taking a continuous-time limit
Rewrite the transition probabilities as an instantaneous rate matrix, Q
Each diagonal entry is minus the sum of the other entries in its row, so that the sum for each row is zero
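
As a small illustration of the rate matrix, the sketch below assembles a 4 x 4 matrix from 12 off-diagonal rates and fills in the diagonal so that each row sums to zero. The function name, the T, C, A, G ordering of rows and columns, and the example rates are assumptions made for illustration.

```python
NUCLEOTIDES = "TCAG"


def build_rate_matrix(rates):
    """Build Q as a list of rows from a dict {(from, to): rate} of the 12 rates."""
    q = []
    for row_index, i in enumerate(NUCLEOTIDES):
        row = [0.0 if i == j else rates[(i, j)] for j in NUCLEOTIDES]
        # The diagonal entry is minus the sum of the other entries in the row
        row[row_index] = -sum(row)
        q.append(row)
    return q


if __name__ == "__main__":
    # Made-up rates, purely to show the layout
    example = {(i, j): 0.1 for i in NUCLEOTIDES for j in NUCLEOTIDES if i != j}
    example[("T", "C")] = 0.4
    for row in build_rate_matrix(example):
        print(row)
```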

Simulating and performing inference
The conditional distribution of the descendant nucleotide after a period of time t is given by exp(Q x t)
So the probability of observing nucleotide N1 in species 1 and N2 in species 2 is given by the appropriate element in exp(Q x t)' F exp(Q x t), where ' denotes the transpose
F is a diagonal matrix of the nucleotide frequencies in the ancestor of the two species
If F and Q are known, it is also easy to estimate t if we assume independence between sites
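
A minimal sketch of these two quantities, using only the standard library. The matrix exponential is computed with a truncated Taylor series plus scaling and squaring, which is one common numerical approach rather than anything specified in the slides. The function names are hypothetical and the matrix rows are taken to be the starting nucleotide.

```python
import math


def mat_mul(a, b):
    """Plain matrix product for small square matrices stored as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)] for i in range(n)]


def transition_probabilities(q, t, squarings=10, terms=12):
    """Approximate exp(Q * t) by a Taylor series on Q * t / 2**squarings, then square."""
    n = len(q)
    scale = t / (2 ** squarings)
    a = [[q[i][j] * scale for j in range(n)] for i in range(n)]
    result = [[float(i == j) for j in range(n)] for i in range(n)]
    term = [row[:] for row in result]
    for k in range(1, terms + 1):
        term = [[x / k for x in row] for row in mat_mul(term, a)]
        result = [[result[i][j] + term[i][j] for j in range(n)] for i in range(n)]
    for _ in range(squarings):
        result = mat_mul(result, result)
    return result


def joint_probabilities(q, freqs, t):
    """Element (i, j): probability of nucleotide i in species 1 and j in species 2.

    Sums over the ancestral nucleotide a: freqs[a] * P[a][i] * P[a][j].
    """
    p = transition_probabilities(q, t)
    n = len(q)
    return [[sum(freqs[a] * p[a][i] * p[a][j] for a in range(n)) for j in range(n)]
            for i in range(n)]


if __name__ == "__main__":
    lam = 0.1  # illustrative rate: all 12 changes equally likely
    q = [[-3 * lam if i == j else lam for j in range(4)] for i in range(4)]
    for row in joint_probabilities(q, [0.25] * 4, 2.0):
        print(row)
```

For the equal-rates matrix in the example, the diagonal of exp(Q x t) has the closed form 1/4 + 3/4 exp(-4 lam t), which gives a convenient check on the numerical routine.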

Making it simpler
Usually, various assumptions will be made to make inference easier
First, it is assumed that the matrix Q is reversible
–This means that watching the process forwards in time is equivalent to watching it backwards in time
–Consequently, summing over ancestral states is equivalent to treating one of the two sequences as ancestral
Second, it is assumed that the process is at stationarity. This means that F is given by any row of exp(Q x t) as t → ∞ (rows, because each row of Q sums to zero)
Third, further constraints are imposed on the Q matrix
–Jukes-Cantor model: all lambdas are equal
–Kimura 2-parameter model: allows different rates for transitions and transversions
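
For concreteness, the two constrained rate matrices can be written out as follows, with rows and columns ordered T, C, A, G. The symbols alpha (rate of each transition, T<->C and A<->G) and beta (rate of each transversion) are assumed here, since the slides' own notation did not survive extraction.

```latex
% Rows and columns ordered T, C, A, G.
% alpha: rate of each transition; beta: rate of each transversion (assumed symbols).
Q_{\mathrm{K2P}} =
\begin{pmatrix}
-(\alpha + 2\beta) & \alpha & \beta & \beta \\
\alpha & -(\alpha + 2\beta) & \beta & \beta \\
\beta & \beta & -(\alpha + 2\beta) & \alpha \\
\beta & \beta & \alpha & -(\alpha + 2\beta)
\end{pmatrix},
\qquad
Q_{\mathrm{JC}} = Q_{\mathrm{K2P}} \Big|_{\alpha = \beta = \lambda}
```

Both matrices are symmetric, so the stationary distribution is uniform (1/4 for each nucleotide) and the reversibility condition pi_i q_ij = pi_j q_ji holds automatically.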

An example
Suppose we observe the following ‘aligned’ sequences
We will use the Kimura 2-parameter model to estimate the transition-transversion ratio and the divergence time; we will assume that the rate of transversion substitution is 1.5 x per site per year
Under the model, all sites are independent, so we can tally up the changes

S1: TGGCTGTGGACTAGTCAGCTGAGGGATATGCTAG
S2: CGATAATGCACCGGTCAGCTGAGAAATATGCAGG

S1 \ S2   T   C   A   G
T         5   2   2   0
C         1   4   0   0
A         0   0   4   2
G         0   1   4   8
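
The tally can be reproduced directly from the two sequences. The sketch below is illustrative (function names are hypothetical) and counts site patterns, transitions and transversions. Note that the sequences as transcribed here give 5 A/A sites and 34 sites in total, whereas the slide's table lists 4 A/A sites and sums to 33, so a character appears to have been lost in one of the two.

```python
from collections import Counter

S1 = "TGGCTGTGGACTAGTCAGCTGAGGGATATGCTAG"
S2 = "CGATAATGCACCGGTCAGCTGAGAAATATGCAGG"

# Transitions are changes within pyrimidines (T, C) or within purines (A, G)
TRANSITIONS = {("T", "C"), ("C", "T"), ("A", "G"), ("G", "A")}


def tally(s1, s2):
    """Count each (nucleotide in s1, nucleotide in s2) pattern across aligned sites."""
    if len(s1) != len(s2):
        raise ValueError("aligned sequences must have the same length")
    return Counter(zip(s1, s2))


def count_differences(s1, s2):
    """Return (number of transition differences, number of transversion differences)."""
    transitions = transversions = 0
    for a, b in zip(s1, s2):
        if a == b:
            continue
        if (a, b) in TRANSITIONS:
            transitions += 1
        else:
            transversions += 1
    return transitions, transversions


if __name__ == "__main__":
    counts = tally(S1, S2)
    for a in "TCAG":
        print(a, [counts[(a, b)] for b in "TCAG"])
    print(count_differences(S1, S2))
```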

More on inference
In calculating the likelihood we want to sum over the possible ancestral states
In effect, we have a (very simple) hidden Markov model, where the state is the ancestor and the ‘emission’ is the two daughter sequences
Because positions are independent, the likelihood is found by multiplying marginal likelihoods across sites
–In effect it is a multinomial sample
We can use maximum likelihood to provide point estimates
[Figure: hidden ancestral state A_i emitting the observed pair S_i1, S_i2]
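
To make the sum over ancestral states concrete, here is a minimal sketch under the Jukes-Cantor model (chosen for brevity, not the K2P model of the example), where x stands for the rate per change times the time along each lineage. The function names and the ternary search used for maximisation are illustrative assumptions.

```python
import math

NUCLEOTIDES = "TCAG"


def jc_transition(i, j, x):
    """Jukes-Cantor probability of going from i to j; x = rate per change * time."""
    decay = math.exp(-4.0 * x)
    return 0.25 + 0.75 * decay if i == j else 0.25 - 0.25 * decay


def site_likelihood(n1, n2, x):
    """Sum over the unobserved ancestor a of P(a) * P(a -> n1) * P(a -> n2)."""
    return sum(0.25 * jc_transition(a, n1, x) * jc_transition(a, n2, x)
               for a in NUCLEOTIDES)


def log_likelihood(s1, s2, x):
    """Sites are independent, so log-likelihoods add across the alignment."""
    return sum(math.log(site_likelihood(a, b, x)) for a, b in zip(s1, s2))


def maximum_likelihood_x(s1, s2, lo=1e-9, hi=2.0, iterations=200):
    """Ternary search for the x that maximises the log-likelihood."""
    for _ in range(iterations):
        m1 = lo + (hi - lo) / 3.0
        m2 = hi - (hi - lo) / 3.0
        if log_likelihood(s1, s2, m1) < log_likelihood(s1, s2, m2):
            lo = m1
        else:
            hi = m2
    return (lo + hi) / 2.0


if __name__ == "__main__":
    s1 = "TGGCTGTGGACTAGTCAGCTGAGGGATATGCTAG"
    s2 = "CGATAATGCACCGGTCAGCTGAGAAATATGCAGG"
    x_hat = maximum_likelihood_x(s1, s2)
    # Expected substitutions per site separating the two sequences: 2 lineages * 3 * x
    print(6.0 * x_hat)
```

Under this model the numerical maximum should agree with the closed-form Jukes-Cantor distance, -3/4 ln(1 - 4p/3), where p is the proportion of sites that differ.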

Some estimates
For the K2P model, you actually only need to count up the number of transition and transversion differences and the total sequence length
–These are sufficient statistics for the transition-transversion parameter and the divergence
There are comparable analytical expressions for estimating divergence times and model parameters for the simpler divergence models
Analytical tractability is lost as models get more complex (realistic)
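
As an illustration, the sketch below applies the standard closed-form K2P estimators to the three counts. With P and Q the proportions of sites showing transition and transversion differences, the distance is -1/2 ln(1 - 2P - Q) - 1/4 ln(1 - 2Q). The function names, the returned dictionary and the reading of the transversion rate in `divergence_time` (total transversion rate per site per year along one lineage) are assumptions made for this example.

```python
import math


def k2p_estimates(transitions, transversions, length):
    """Closed-form K2P estimates from counts of differences between two sequences."""
    p = transitions / length      # proportion of sites with a transition difference
    q = transversions / length    # proportion of sites with a transversion difference
    w1 = 1.0 - 2.0 * p - q
    w2 = 1.0 - 2.0 * q
    if w1 <= 0.0 or w2 <= 0.0:
        raise ValueError("sequences too divergent for the K2P correction")
    # Expected transition and transversion substitutions per site between the sequences
    transition_part = -0.5 * math.log(w1) + 0.25 * math.log(w2)
    transversion_part = -0.5 * math.log(w2)
    ratio = transition_part / transversion_part if transversion_part > 0 else float("inf")
    return {
        "distance": transition_part + transversion_part,
        "transition_part": transition_part,
        "transversion_part": transversion_part,
        "ratio": ratio,
    }


def divergence_time(transversion_part, transversion_rate):
    """Years since the common ancestor.

    Assumption: transversion_rate is the total transversion rate per site per year
    along one lineage, and both lineages have evolved for the same time.
    """
    return transversion_part / (2.0 * transversion_rate)


if __name__ == "__main__":
    # 9 transitions and 3 transversions over 34 sites, as counted from the example sequences
    print(k2p_estimates(9, 3, 34))
```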

Parameterisation using micro-evolutionary models
Most approaches to estimating divergence, etc. conflate the mutation process with the substitution process
However, it is perfectly possible to separate out the two processes
–For example, estimates of the mutation process can be obtained from selectively neutral genomic regions or synonymous substitutions
This allows you to estimate the selective constraint on, or advantage of, mutations
An area where this is applicable is the analysis of codon usage bias, where particular codons are favoured over others for translational efficiency
–McVean and Vieira (2001)

Making it more complex
It is easy to see how the model can be extended to more states. For example, a widely used approach to analysing coding sequences is to deal with the 64 x 64 matrix of states that are the codons (Goldman and Yang 1994)
The key point is always that parts of the sequence (nucleotides, codons) are assumed to evolve independently
This kind of approach cannot (at least in a straightforward way) deal with context-dependent substitution rates or insertions and deletions
–For example, there is a greatly elevated rate of mutation at CpG sites in vertebrates
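
To give a feel for the larger state space, the sketch below enumerates the 64 codons and, for each, the codons reachable by a single nucleotide change. It only illustrates the structure of the states, under the assumption that a codon rate matrix would assign non-zero instantaneous rates to single-nucleotide changes only. It does not reproduce the Goldman and Yang parameterisation, and the names are hypothetical.

```python
from itertools import product

NUCLEOTIDES = "TCAG"

# All 64 triplets, in T, C, A, G order at each position
CODONS = ["".join(triplet) for triplet in product(NUCLEOTIDES, repeat=3)]


def single_step_neighbours(codon):
    """Codons that differ from the given codon at exactly one position."""
    return [codon[:pos] + n + codon[pos + 1:]
            for pos in range(3)
            for n in NUCLEOTIDES
            if n != codon[pos]]


if __name__ == "__main__":
    print(len(CODONS))
    print(single_step_neighbours("ATG"))
```

Each codon has 9 single-step neighbours, so under that assumption only 64 x 9 of the off-diagonal entries in the 64 x 64 matrix would be non-zero.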

Building up to trees
Analysing more sequences means thinking about the evolutionary relationships between all of them
This can (often, but not always) be represented as a tree
As before, we can utilise HMM structures to make inference efficient
Here, the HMM algorithm used to calculate the likelihood is called peeling
[Figure: tree with leaves S1 to S5 and ancestral nodes A1,2 (above S1, S2), A1-3 (above A1,2 and S3), A4,5 (above S4, S5) and A1-5 (root)]
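
A minimal sketch of peeling for a single site on a tree of this shape, under the Jukes-Cantor model with a uniform root distribution. The tree encoding (an internal node is a tuple of (child, branch length) pairs, a leaf is a name), the branch lengths and the function names are all illustrative assumptions.

```python
import math
from itertools import product

NUCLEOTIDES = "TCAG"


def jc_transition(i, j, x):
    """Jukes-Cantor probability of going from i to j; x = rate per change * time."""
    decay = math.exp(-4.0 * x)
    return 0.25 + 0.75 * decay if i == j else 0.25 - 0.25 * decay


def partial_likelihoods(node, observed):
    """For each nucleotide at this node, P(observed leaves below the node | nucleotide)."""
    if isinstance(node, str):  # a leaf: only the observed nucleotide is possible
        return {a: 1.0 if observed[node] == a else 0.0 for a in NUCLEOTIDES}
    result = {a: 1.0 for a in NUCLEOTIDES}
    for child, branch in node:
        below = partial_likelihoods(child, observed)
        for a in NUCLEOTIDES:
            # Sum over the child's state, weighted by the transition along the branch
            result[a] *= sum(jc_transition(a, b, branch) * below[b] for b in NUCLEOTIDES)
    return result


def site_likelihood(tree, observed):
    """Combine the root's partial likelihoods with a uniform root distribution."""
    root = partial_likelihoods(tree, observed)
    return sum(0.25 * root[a] for a in NUCLEOTIDES)


# Made-up branch lengths for a five-leaf tree: ((S1, S2), S3) joined with (S4, S5)
A12 = (("S1", 0.10), ("S2", 0.10))
A123 = ((A12, 0.05), ("S3", 0.15))
A45 = (("S4", 0.10), ("S5", 0.10))
TREE = ((A123, 0.05), (A45, 0.10))
LEAVES = ["S1", "S2", "S3", "S4", "S5"]

if __name__ == "__main__":
    print(site_likelihood(TREE, dict(zip(LEAVES, "TTCAG"))))
```

Each internal node is visited once, so the cost per site grows linearly with the number of sequences instead of with the number of possible joint ancestral assignments.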