Lecture 10 – Models of DNA Sequence Evolution


Lecture goals: (1) correct for multiple substitutions when calculating pairwise genetic distances; (2) derive transformation probabilities for likelihood-based methods.

$$\mathrm{Prob}(R_r \mid t) = \pi_m \times P_{m,k}(v_{3,1}) \times P_{k,A}(v_{1,w}) \times P_{k,G}(v_{1,x}) \times P_{m,l}(v_{3,2}) \times P_{l,C}(v_{2,y}) \times P_{l,C}(v_{2,z})$$

It's the $P_{i,j}$'s that we need a substitution model to calculate. The models typically used are Markov processes. A Poisson process is a stochastic process that can be used to model events in time. The time between events is exponentially distributed, with rate $\lambda$.
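A minimal sketch of the Poisson process described above, assuming Python's standard `random` module. The function name, the rate of 0.5, the window of 10 time units, and the number of replicates are illustrative choices, not values from the lecture. Events are generated by summing exponentially distributed waiting times, and the mean count over a window should be close to rate × duration.

```python
import random

def simulate_event_times(rate, duration, rng):
    """Return the event times of a Poisson process with the given rate on [0, duration]."""
    times = []
    t = rng.expovariate(rate)          # waiting time to the first event
    while t <= duration:
        times.append(t)
        t += rng.expovariate(rate)     # exponential gap to the next event
    return times

# Hypothetical settings chosen only for illustration.
rng = random.Random(1)
rate, duration, replicates = 0.5, 10.0, 20000

counts = [len(simulate_event_times(rate, duration, rng)) for _ in range(replicates)]
mean_count = sum(counts) / replicates
print(f"mean events per window: {mean_count:.3f}; rate x duration = {rate * duration:.3f}")
```

In a substitution model the "events" are substitutions at a site, which is why branch lengths are expressed as rate × time.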

Jukes-Cantor Model
The probability of a site remaining constant is: $p_{ii}(t) = \tfrac{1}{4} + \tfrac{3}{4} e^{-4\alpha t}$
The probability of a site changing to a particular other nucleotide is: $p_{ij}(t) = \tfrac{1}{4} - \tfrac{1}{4} e^{-4\alpha t}$
$\alpha$ is the rate at which any nucleotide changes to each of the other three per unit time. Given that the state at the site is i at $t_0$, we start by estimating the probability of state i at that site at $t_1$.
$p_i(0) = 1$
$p_i(1) = 1 - 3\alpha$

Jukes-Cantor Model
Now, what's the probability of this site having state i at $t_2$? There are two ways for the site to have state i at $t_2$:
1. It was in state i at $t_1$ and has not changed since. $(1 - 3\alpha)\,p_i(1)$ = the probability of no change at the site between $t_1$ and $t_2$, $(1 - 3\alpha)$, times the probability of the site having state i at time $t_1$, $p_i(1)$.
2. It was in some other state at $t_1$ and changed to i (for a site that started as i, it has changed to something else and back again). $\alpha\,(1 - p_i(1))$ = the probability of a change to i, $\alpha$, times the probability that the site is not in state i at time $t_1$, $(1 - p_i(1))$.
Therefore, $p_i(2) = (1 - 3\alpha)\,p_i(1) + \alpha\,(1 - p_i(1))$, where $p_i(1) = 1 - 3\alpha$.

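Jukes-Cantor Model
We have a recurrence equation.
$p_i(t+1) = (1 - 3\alpha)\,p_i(t) + \alpha\,(1 - p_i(t)) = p_i(t) - 3\alpha\,p_i(t) + \alpha - \alpha\,p_i(t)$
We can calculate the change in $p_i(t)$ across time, $\Delta t$.
$p_i(t+1) - p_i(t) = -3\alpha\,p_i(t) + \alpha - \alpha\,p_i(t)$
so $\Delta p_i(t) = \alpha - 4\alpha\,p_i(t)$, and, approximating the discrete process in continuous time, $\dfrac{dp_i(t)}{dt} = \alpha - 4\alpha\,p_i(t)$.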
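As a rough numerical check, the recurrence can be iterated directly and compared with the exponential form $p_{ii}(t) = 1/4 + 3/4\,e^{-4\alpha t}$ given for the model. This is a sketch only; the function names and the value $\alpha = 0.001$ are assumptions made for illustration.

```python
import math

def iterate_recurrence(alpha, steps, p0=1.0):
    """Apply p(t+1) = (1 - 3*alpha) * p(t) + alpha * (1 - p(t)) for the given number of steps."""
    p = p0
    for _ in range(steps):
        p = (1 - 3 * alpha) * p + alpha * (1 - p)
    return p

def continuous_solution(alpha, t, p0=1.0):
    """Continuous-time form: p(t) = 1/4 + (p(0) - 1/4) * exp(-4 * alpha * t)."""
    return 0.25 + (p0 - 0.25) * math.exp(-4 * alpha * t)

alpha = 0.001  # hypothetical per-step rate; the approximation needs alpha to be small
for steps in (10, 100, 500, 5000):
    discrete = iterate_recurrence(alpha, steps)
    continuous = continuous_solution(alpha, steps)
    print(f"t={steps:5d}  recurrence={discrete:.6f}  exponential={continuous:.6f}")
```

Both versions decay from 1 toward 1/4, the equilibrium frequency of each nucleotide, and they agree closely only while $\alpha$ per step is small.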

Jukes-Cantor Model
$p_i(t) = \tfrac{1}{4} + \left(p_i(0) - \tfrac{1}{4}\right) e^{-4\alpha t}$
We have the probability that a site has a particular nucleotide after time t, given in terms of its initial state.
If the initial and final states are the same (i = j), $p_i(0) = 1$. Therefore, $p_{ii}(t) = \tfrac{1}{4} + \tfrac{3}{4} e^{-4\alpha t}$
If they differ (i ≠ j), $p_i(0) = 0$, and $p_{ij}(t) = \tfrac{1}{4} - \tfrac{1}{4} e^{-4\alpha t}$
$\alpha$ is an instantaneous rate, so we've modeled branch length (rate × time) explicitly in our expectations.
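These probabilities are what make the correction for multiple substitutions possible. The sketch below uses the standard Jukes-Cantor distance formula, which this transcript does not spell out: a site differs from its starting state with probability $3\,p_{ij}(t) = \tfrac{3}{4}(1 - e^{-4\alpha t})$, while the expected number of substitutions per site is $d = 3\alpha t$, so $d = -\tfrac{3}{4}\ln(1 - \tfrac{4}{3}p)$. The function names and numbers are illustrative assumptions, and t is taken to be the total time separating the two sequences.

```python
import math

def expected_p_distance(alpha, t):
    """Expected proportion of differing sites under JC: 3 * p_ij(t)."""
    return 0.75 * (1.0 - math.exp(-4.0 * alpha * t))

def jc_distance(p):
    """Jukes-Cantor corrected distance (expected substitutions per site) from an observed proportion p."""
    if not 0.0 <= p < 0.75:
        raise ValueError("p must lie in [0, 0.75) for the JC correction to be defined")
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)

alpha, t = 0.01, 50.0               # hypothetical rate and time
p = expected_p_distance(alpha, t)   # what we would observe on average
d = jc_distance(p)                  # recovers 3 * alpha * t
print(f"observed proportion p = {p:.4f}, corrected distance d = {d:.4f}, 3*alpha*t = {3 * alpha * t:.4f}")
```

The corrected distance exceeds the observed proportion because some sites have changed more than once.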

The JC model makes several assumptions.
1) All substitutions are equally likely; we have a single substitution type.
2) Base frequencies are assumed to be equal; each of the four nucleotides occurs at 25% of sites.
3) Each site has the same probability of experiencing a substitution as any other; we have an equal-rates model.
4) The process is constant through time.
5) Sites are independent of each other.
6) Substitution is a Markov process.

Q-matrix:

$$Q = \begin{pmatrix}
-3\alpha & \alpha & \alpha & \alpha \\
\alpha & -3\alpha & \alpha & \alpha \\
\alpha & \alpha & -3\alpha & \alpha \\
\alpha & \alpha & \alpha & -3\alpha
\end{pmatrix}$$
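A small sketch of the JC Q-matrix as a nested list, assuming rows and columns are ordered A, C, G, T. It checks two properties implied by the assumptions above: each row sums to zero, and equal base frequencies are stationary (the vector of 1/4's times Q is zero). The function name and the value of alpha are illustrative.

```python
def jc_rate_matrix(alpha):
    """Jukes-Cantor instantaneous rate matrix: alpha off the diagonal, -3*alpha on it."""
    return [[-3 * alpha if i == j else alpha for j in range(4)] for i in range(4)]

Q = jc_rate_matrix(0.1)  # hypothetical rate

# Each row of a rate matrix sums to zero.
row_sums = [sum(row) for row in Q]

# Equal base frequencies are the equilibrium: pi * Q = 0.
pi = [0.25] * 4
flux = [sum(pi[i] * Q[i][j] for i in range(4)) for j in range(4)]

print("row sums:", row_sums)
print("pi * Q:  ", flux)
```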

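Substitution types and base frequencies.
For the general case (rows and columns ordered A, C, G, T):

$$Q = \begin{pmatrix}
-\mu(a\pi_C + b\pi_G + c\pi_T) & \mu a\pi_C & \mu b\pi_G & \mu c\pi_T \\
\mu g\pi_A & -\mu(g\pi_A + d\pi_G + e\pi_T) & \mu d\pi_G & \mu e\pi_T \\
\mu h\pi_A & \mu j\pi_C & -\mu(h\pi_A + j\pi_C + f\pi_T) & \mu f\pi_T \\
\mu i\pi_A & \mu k\pi_C & \mu l\pi_G & -\mu(i\pi_A + k\pi_C + l\pi_G)
\end{pmatrix}$$

where $\mu$ = the average instantaneous substitution rate; a, b, c, …, l are relative rate parameters (one of them is set to 1); and the $\pi_i$'s are the frequencies of the base that is being substituted to. Each diagonal element is the negative sum of the other elements in its row.
Note that the relative rates are not symmetric (for example, a need not equal g), and therefore the full model is non-reversible. Setting a = g, b = h, c = i, d = j, e = k, and f = l gives a time-reversible model.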

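Substitution types and base frequencies.
General Time-Reversible Model

$$Q = \begin{pmatrix}
-\mu(a\pi_C + b\pi_G + c\pi_T) & \mu a\pi_C & \mu b\pi_G & \mu c\pi_T \\
\mu a\pi_A & -\mu(a\pi_A + d\pi_G + e\pi_T) & \mu d\pi_G & \mu e\pi_T \\
\mu b\pi_A & \mu d\pi_C & -\mu(b\pi_A + d\pi_C + f\pi_T) & \mu f\pi_T \\
\mu c\pi_A & \mu e\pi_C & \mu f\pi_G & -\mu(c\pi_A + e\pi_C + f\pi_G)
\end{pmatrix}$$

There are six relative transformation rates (one of which is set to 1). There are four base frequencies that must sum to 1. Note that this is not a symmetric matrix, but it can be decomposed into R and $\Pi$ (the diagonal matrix of base frequencies).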

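Substitution types and base frequencies.

$$R = \begin{pmatrix}
-\mu(a+b+c) & \mu a & \mu b & \mu c \\
\mu a & -\mu(a+d+e) & \mu d & \mu e \\
\mu b & \mu d & -\mu(b+d+f) & \mu f \\
\mu c & \mu e & \mu f & -\mu(c+e+f)
\end{pmatrix}
\qquad
\Pi = \begin{pmatrix}
\pi_A & 0 & 0 & 0 \\
0 & \pi_C & 0 & 0 \\
0 & 0 & \pi_G & 0 \\
0 & 0 & 0 & \pi_T
\end{pmatrix}$$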
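The decomposition can be sketched in code: the off-diagonal entries of Q are the symmetric relative rates multiplied by the frequency of the target base, and each diagonal entry is then set so that its row sums to zero, as in the GTR Q-matrix. The function name, the dictionary layout, and all numerical values below are assumptions made for illustration. The final check is detailed balance, $\pi_i q_{ij} = \pi_j q_{ji}$, which is what time-reversibility means.

```python
def gtr_rate_matrix(rates, freqs, mu=1.0):
    """Build a GTR Q-matrix (order A, C, G, T).

    rates: relative rates keyed 'a'..'f' for the pairs AC, AG, AT, CG, CT, GT.
    freqs: base frequencies [pi_A, pi_C, pi_G, pi_T], summing to 1.
    """
    a, b, c, d, e, f = (rates[key] for key in "abcdef")
    R = [[0, a, b, c],      # symmetric relative rates; diagonal filled in below
         [a, 0, d, e],
         [b, d, 0, f],
         [c, e, f, 0]]
    Q = [[mu * R[i][j] * freqs[j] for j in range(4)] for i in range(4)]
    for i in range(4):
        Q[i][i] = -sum(Q[i][j] for j in range(4) if j != i)
    return Q

# Hypothetical parameter values, with f fixed to 1 as the reference rate.
rates = {"a": 1.0, "b": 4.0, "c": 0.5, "d": 0.8, "e": 5.0, "f": 1.0}
freqs = [0.3, 0.2, 0.2, 0.3]
Q = gtr_rate_matrix(rates, freqs)

reversible = all(
    abs(freqs[i] * Q[i][j] - freqs[j] * Q[j][i]) < 1e-12
    for i in range(4) for j in range(4)
)
print("detailed balance holds:", reversible)
```

Q itself is not symmetric unless the base frequencies are equal, but the flux $\pi_i q_{ij}$ is.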

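Common Simplifications
Transition-type substitutions occur at a higher rate than transversion substitutions. The K2P model was the first to address this. So we set b = e = $\kappa$ (for transitions) and a = c = d = f = 1 (for transversions), with all $\pi_i = \tfrac{1}{4}$.
For K2P:

$$Q = \begin{pmatrix}
-\mu(\kappa+2)/4 & \mu/4 & \mu\kappa/4 & \mu/4 \\
\mu/4 & -\mu(\kappa+2)/4 & \mu/4 & \mu\kappa/4 \\
\mu\kappa/4 & \mu/4 & -\mu(\kappa+2)/4 & \mu/4 \\
\mu/4 & \mu\kappa/4 & \mu/4 & -\mu(\kappa+2)/4
\end{pmatrix}$$

where $\alpha = \mu\kappa/4$ is the transition rate and $\beta = \mu/4$ is the transversion rate. Thus, $\kappa = \alpha/\beta$ and $\mu = 4\beta$.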

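Hasegawa-Kishino-Yano (HKY) Model
For HKY:

$$Q = \begin{pmatrix}
-\mu(\kappa\pi_G + \pi_Y) & \mu\pi_C & \mu\kappa\pi_G & \mu\pi_T \\
\mu\pi_A & -\mu(\kappa\pi_T + \pi_R) & \mu\pi_G & \mu\kappa\pi_T \\
\mu\kappa\pi_A & \mu\pi_C & -\mu(\kappa\pi_A + \pi_Y) & \mu\pi_T \\
\mu\pi_A & \mu\kappa\pi_C & \mu\pi_G & -\mu(\kappa\pi_C + \pi_R)
\end{pmatrix}$$

where $\alpha = \mu\kappa$, $\beta = \mu$, $\pi_R = \pi_A + \pi_G$, and $\pi_Y = \pi_C + \pi_T$.
There are lots of other models that restrict the Q-matrix.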
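A sketch of the HKY Q-matrix in code, assuming base order A, C, G, T so that the transitions are A↔G and C↔T. With equal base frequencies the same function produces the K2P matrix, whose entries are $\mu\kappa/4$ for transitions, $\mu/4$ for transversions, and $-\mu(\kappa+2)/4$ on the diagonal. The function name and parameter values are illustrative.

```python
def hky_rate_matrix(kappa, freqs, mu=1.0):
    """Build an HKY Q-matrix (order A, C, G, T) from kappa and the base frequencies."""
    transitions = {(0, 2), (2, 0), (1, 3), (3, 1)}   # A<->G and C<->T
    Q = [[0.0] * 4 for _ in range(4)]
    for i in range(4):
        for j in range(4):
            if i != j:
                relative_rate = kappa if (i, j) in transitions else 1.0
                Q[i][j] = mu * relative_rate * freqs[j]
        Q[i][i] = -sum(Q[i])   # diagonal is still 0.0 here, so this sums the off-diagonals
    return Q

# Equal frequencies reduce HKY to K2P (hypothetical kappa = 2).
k2p = hky_rate_matrix(kappa=2.0, freqs=[0.25] * 4)

# Unequal frequencies give a general HKY matrix (hypothetical values).
hky = hky_rate_matrix(kappa=3.0, freqs=[0.1, 0.2, 0.3, 0.4], mu=2.0)

for row in k2p:
    print(["{:+.2f}".format(x) for x in row])
```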

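Some common models
There are 203 special cases of the GTR rate matrix (every way of grouping the six relative rates into classes that share a value), or 406 models if each is combined with either equal or unequal base frequencies.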
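The count of 203 can be reproduced by enumerating every way of assigning the six relative rates (a to f, for the pairs AC, AG, AT, CG, CT, GT) to classes that share a value. The sketch below writes each scheme as a six-digit code in which equal digits mean equal rates, a common convention in model-selection software. The function name is an assumption.

```python
def rate_class_schemes(n=6):
    """Yield every partition of n rate parameters as a tuple of class labels.

    Labels are assigned in order of first appearance, so each partition appears exactly once.
    """
    def extend(prefix, max_label):
        if len(prefix) == n:
            yield tuple(prefix)
            return
        # A new rate may join any existing class or open the next new one.
        for label in range(max_label + 2):
            yield from extend(prefix + [label], max(max_label, label))
    yield from extend([], -1)

schemes = list(rate_class_schemes())
print("rate schemes:", len(schemes))
print("with equal or unequal base frequencies:", 2 * len(schemes))

# Codes in the order a, b, c, d, e, f (AC, AG, AT, CG, CT, GT):
jc_like = (0, 0, 0, 0, 0, 0)    # one rate class: JC (equal frequencies)
k2p_like = (0, 1, 0, 0, 1, 0)   # transitions vs transversions: K2P or HKY
gtr_like = (0, 1, 2, 3, 4, 5)   # six separate rates: GTR
```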

Calculating Transformation Probabilities.
So the Q and R matrices we've been discussing define the instantaneous rates of substitution from one nucleotide to another. Convert the rates to probabilities by matrix exponentiation:

$$P(t) = e^{Qt}$$

[Slide equations: closed-form entries of P(t) for Jukes-Cantor and K2P]

Again, it's these $P_{ij}$ that are used in the likelihood function.
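A pure-Python sketch of the exponentiation step, using scaling and squaring with a truncated Taylor series rather than a library routine (in practice a numerical library's matrix exponential would normally be used). The result is compared with closed forms: the Jukes-Cantor expressions from earlier in the lecture, and the standard K2P expressions, which are not reproduced in this transcript and are included here as an assumption. Helper names and parameter values are illustrative.

```python
import math

def mat_mul(A, B):
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)] for i in range(n)]

def mat_exp(M, squarings=10, terms=12):
    """Approximate e^M: Taylor series of M / 2**squarings, then square the result repeatedly."""
    n = len(M)
    scale = 2 ** squarings
    S = [[M[i][j] / scale for j in range(n)] for i in range(n)]
    result = [[float(i == j) for j in range(n)] for i in range(n)]   # identity
    term = [row[:] for row in result]
    for k in range(1, terms + 1):
        term = [[x / k for x in row] for row in mat_mul(term, S)]    # S^k / k!
        result = [[result[i][j] + term[i][j] for j in range(n)] for i in range(n)]
    for _ in range(squarings):
        result = mat_mul(result, result)
    return result

def k2p_rate_matrix(alpha, beta):
    """K2P Q-matrix (order A, C, G, T): alpha for transitions, beta for transversions."""
    transitions = {(0, 2), (2, 0), (1, 3), (3, 1)}
    Q = [[0.0] * 4 for _ in range(4)]
    for i in range(4):
        for j in range(4):
            if i != j:
                Q[i][j] = alpha if (i, j) in transitions else beta
        Q[i][i] = -sum(Q[i])
    return Q

def transition_probabilities(Q, t):
    return mat_exp([[q * t for q in row] for row in Q])

# Jukes-Cantor is K2P with alpha == beta (hypothetical values).
a, t = 0.1, 2.0
P_jc = transition_probabilities(k2p_rate_matrix(a, a), t)
jc_same = 0.25 + 0.75 * math.exp(-4 * a * t)
jc_diff = 0.25 - 0.25 * math.exp(-4 * a * t)

# K2P closed forms (standard result, assumed here).
alpha, beta = 0.2, 0.05
P_k2p = transition_probabilities(k2p_rate_matrix(alpha, beta), t)
k2p_same = 0.25 + 0.25 * math.exp(-4 * beta * t) + 0.5 * math.exp(-2 * (alpha + beta) * t)
k2p_ts = 0.25 + 0.25 * math.exp(-4 * beta * t) - 0.5 * math.exp(-2 * (alpha + beta) * t)
k2p_tv = 0.25 - 0.25 * math.exp(-4 * beta * t)

print("JC  P[A][A]:", P_jc[0][0], "closed form:", jc_same)
print("K2P P[A][G]:", P_k2p[0][2], "closed form:", k2p_ts)
```

Each row of P(t) sums to 1, since a site must be in one of the four states at time t.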