Evolutionary Models CS 498 SS Saurabh Sinha

Models of nucleotide substitution The DNA that we study in bioinformatics is the end-product of evolution. Evolution is a very complicated process, but very simplified models of this process can be studied within a probabilistic framework. This allows testing of various hypotheses about the evolutionary process, from multi-species data. Source: Ewens and Grant, Chapter 14.

Diversity in a population There IS genetic variation between individuals in a population, but relatively little variation at the nucleotide level. E.g., two humans differ at the nucleotide level at roughly one in 500 to 1000 nucleotides. Roughly speaking, a single nucleotide dominates the population at any particular position in the genome.

Substitution Over long time periods, the nucleotide at a given position remains the same, but periodically this nucleotide changes (over the entire population). This is called "substitution": the replacement of the predominant nucleotide at that position with another predominant nucleotide.

Markov Chain to model substitution A Markov chain describes the substitution process at a position. The states are "a", "c", "g", "t". The chain "runs" in certain units of time, i.e., the state may change from one time point to the next. The unit of time (the difference between successive time points) may be arbitrary, e.g., generations.

Markov Chain to model substitution A symbol such as p_ag denotes the probability of a change from "a" to "g" in one unit of time. When studying two extant species, the evolutionary model has to provide the joint probability of the two species' data. Sometimes this is done by computing the probability of the ancestor, starting from one extant species, and then the probability of the other extant species, starting from the ancestor. If we want to do this, the evolutionary process (model) must be "time reversible": P(x)P(x -> y) = P(y)P(y -> x).

Jukes Cantor Model Markov chain with four states: a, c, g, t. Transition matrix P given by (rows: current nucleotide, columns: next nucleotide):

        a       g       c       t
  a   1-3α      α       α       α
  g     α     1-3α      α       α
  c     α       α     1-3α      α
  t     α       α       α     1-3α
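As a quick sanity check, the matrix above can be written out in code. This is a minimal sketch; the value α = 0.01 is an arbitrary assumption for illustration.

```python
# Jukes-Cantor transition matrix, with an assumed (illustrative) alpha = 0.01.
ALPHA = 0.01
NUCS = "agct"

# P_JC[i][j] = probability that nucleotide NUCS[i] is replaced by NUCS[j]
# in one time unit: 1 - 3*alpha on the diagonal, alpha elsewhere.
P_JC = [[1 - 3 * ALPHA if i == j else ALPHA for j in range(4)]
        for i in range(4)]

# Every row must be a probability distribution (entries summing to 1).
row_sums = [sum(row) for row in P_JC]
```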

Jukes Cantor Model  is a parameter depending on what a “time unit” means. If time unit represents more #generations,  will be larger  must be less than 1/3 though

Jukes Cantor Model Whatever the current nucleotide is, each of the other three nucleotides is equally likely to substitute for it.

Understanding the J-C Model Consider a transition matrix P and a probability vector v (a row vector). What does w = vP represent? If v is the probability distribution of the 4 nucleotides (at a position) now, then w is the probability distribution at the next time step.

Understanding the J-C model Suppose we can find a vector φ such that φP = φ. If the probability distribution is φ, it will continue to remain φ at all future times. This φ is called the stationary distribution of the Markov chain.

Understanding the J-C model Check that φ = (0.25, 0.25, 0.25, 0.25) satisfies φP = φ. Therefore, if a position evolves as per this model for long enough, it will be equally likely to have any of the 4 nucleotides! This is the very long term prediction, but can we write down the distribution at the position as a function of time (number of steps)?
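The stationarity claim is easy to verify numerically. A sketch, again assuming an arbitrary α = 0.01:

```python
# Check that the uniform distribution is stationary under Jukes-Cantor.
ALPHA = 0.01
P = [[1 - 3 * ALPHA if i == j else ALPHA for j in range(4)] for i in range(4)]

phi = [0.25, 0.25, 0.25, 0.25]

# One step of the chain: w = phi * P (row vector times matrix).
w = [sum(phi[i] * P[i][j] for i in range(4)) for j in range(4)]
```

If w comes back equal to phi, the uniform distribution is unchanged by one step of the chain, which is exactly the stationarity condition.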

Spectral Decomposition Recall that we found a φ such that φP = φ. Such a vector is called an "eigenvector" of P, and the corresponding "eigenvalue" is 1. In general, if vP = λv (for a scalar λ), then λ is called an eigenvalue, and v is a left eigenvector of P.

Spectral decomposition Similarly, if P u^T = λ u^T, then u is called a right eigenvector. In general, there may be multiple eigenvalues λ_j and their corresponding left and right eigenvectors v_j and u_j. We can write P as P = Σ_j λ_j u_j^T v_j (with the eigenvectors normalized so that v_j u_j^T = 1).

Spectral decomposition Then, for any positive integer n, it is true that P^n = Σ_j λ_j^n u_j^T v_j. Why is P^n interesting to us? Because it tells us what the probability distribution will be after n time steps. If we started with the row vector v, then vP^n will be the probability distribution after n steps.
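A sketch of computing vP^n by repeated multiplication, without any eigen-machinery (α = 0.01 and the starting distribution are arbitrary assumptions). With enough steps the distribution approaches the uniform stationary distribution:

```python
ALPHA = 0.01
P = [[1 - 3 * ALPHA if i == j else ALPHA for j in range(4)] for i in range(4)]

def step(v, P):
    """One step of the chain: row vector v times matrix P."""
    return [sum(v[i] * P[i][j] for i in range(4)) for j in range(4)]

# Start from a position known to be "a": v = (1, 0, 0, 0).
v = [1.0, 0.0, 0.0, 0.0]
for _ in range(500):   # computes v P^500
    v = step(v, P)
```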

Back to the J-C model We reasoned that φ = (.25,.25,.25,.25) is a left eigenvector for the eigenvalue 1. In fact, the J-C transition matrix has this eigenvalue and also the eigenvalue (1-4α) (with multiplicity 3), and if we do the math, the spectral decomposition gives: (P^n)_xy = 1/4 + (3/4)(1-4α)^n if x = y, and (P^n)_xy = 1/4 - (1/4)(1-4α)^n if x ≠ y.

Back to the J-C model So, if we started with (1,0,0,0), i.e., an "a", the probability that we'll see an "a" at that position after n time steps is 1/4 + (3/4)(1-4α)^n. And the probability that the "a" would have mutated to, say, "c" is 1/4 - (1/4)(1-4α)^n.

Substitution probability As a function of time n, we therefore get Pr(x -> y) = 1/4 + (3/4)(1-4α)^n if x = y, and Pr(x -> y) = 1/4 - (1/4)(1-4α)^n otherwise. As n -> ∞, we get back our (0.25, 0.25, 0.25, 0.25) calculation.
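These closed-form probabilities can be checked against brute-force matrix powering. A sketch, where α = 0.01 and n = 7 are arbitrary illustrative choices:

```python
ALPHA, N = 0.01, 7
P = [[1 - 3 * ALPHA if i == j else ALPHA for j in range(4)] for i in range(4)]

def mat_mult(A, B):
    """4x4 matrix product."""
    return [[sum(A[i][k] * B[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]

# Compute P^N by repeated multiplication.
Pn = P
for _ in range(N - 1):
    Pn = mat_mult(Pn, P)

# Closed forms from the spectral decomposition.
same = 0.25 + 0.75 * (1 - 4 * ALPHA) ** N      # Pr(x -> x)
diff = 0.25 - 0.25 * (1 - 4 * ALPHA) ** N      # Pr(x -> y), y != x
```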

More advanced models The J-C model made highly "symmetric" assumptions in its formulation of the transition matrix P. In reality, for example, "transitions" are more common than "transversions". What are these? A purine is A or G; a pyrimidine is C or T. A transition is a substitution within the same category; a transversion is a substitution across categories. Purines are similarly sized, and pyrimidines are similarly sized, and a nucleotide is more likely to be replaced by a similarly sized nucleotide. The "Kimura" model captures this transition/transversion bias.

Kimura model This of course is the transition probability matrix P of the Markov chain (rows: current nucleotide, columns: next nucleotide; α for transitions, β for transversions):

        a         g         c         t
  a  1-α-2β       α         β         β
  g     α      1-α-2β       β         β
  c     β         β      1-α-2β       α
  t     β         β         α      1-α-2β

Two parameters now, instead of one.

Kimura model Again, one of the eigenvalues is 1, and the left eigenvector corresponding to it is φ = (.25,.25,.25,.25). So again, the stationary distribution is uniform. The n-step probabilities are: P(x -> x) = 1/4 + (1/4)(1-4β)^n + (1/2)(1-2(α+β))^n, and P(x -> y) = 1/4 + (1/4)(1-4β)^n - (1/2)(1-2(α+β))^n if x is a purine and y is the other purine (and likewise for the two pyrimidines).
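The same brute-force check works for the Kimura model. A sketch, where α = 0.02, β = 0.005, and n = 6 are arbitrary assumptions, with states ordered a, g, c, t:

```python
A, B, N = 0.02, 0.005, 6   # alpha (transition), beta (transversion), steps

# Kimura transition matrix; a<->g and c<->t are transitions (prob alpha),
# all other changes are transversions (prob beta).
P = [[1 - A - 2 * B, A, B, B],
     [A, 1 - A - 2 * B, B, B],
     [B, B, 1 - A - 2 * B, A],
     [B, B, A, 1 - A - 2 * B]]

def mat_mult(M1, M2):
    return [[sum(M1[i][k] * M2[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]

Pn = P
for _ in range(N - 1):
    Pn = mat_mult(Pn, P)

# Closed forms for the n-step probabilities.
same = 0.25 + 0.25 * (1 - 4 * B) ** N + 0.5 * (1 - 2 * (A + B)) ** N
transition = 0.25 + 0.25 * (1 - 4 * B) ** N - 0.5 * (1 - 2 * (A + B)) ** N
transversion = 0.25 - 0.25 * (1 - 4 * B) ** N
```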

Even more advanced models These get to greater levels of realism. The Kimura model still has a uniform stationary distribution, which is not true of real data. One extension: the purine-to-pyrimidine substitution probability differs from the pyrimidine-to-purine substitution probability. This leads to a non-uniform stationary distribution.

Felsenstein models The transition matrix (rows: current nucleotide, columns: next nucleotide):

         a            g            c            t
  a  1-u+uφ_a       uφ_g         uφ_c         uφ_t
  g    uφ_a       1-u+uφ_g       uφ_c         uφ_t
  c    uφ_a         uφ_g       1-u+uφ_c       uφ_t
  t    uφ_a         uφ_g         uφ_c       1-u+uφ_t

The transition probability is proportional to the stationary probability of the target nucleotide. The stationary distribution is (φ_a, φ_g, φ_c, φ_t).
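A sketch verifying that (φ_a, φ_g, φ_c, φ_t) is indeed stationary for a Felsenstein-type matrix; the values of u and φ below are arbitrary assumptions for illustration:

```python
U = 0.05
phi = [0.4, 0.3, 0.2, 0.1]   # assumed stationary probs for a, g, c, t

# Felsenstein model: off-diagonal entry is u * phi[target];
# diagonal entry is 1 - u + u * phi[current].
P = [[(1 - U + U * phi[i]) if i == j else U * phi[j] for j in range(4)]
     for i in range(4)]

row_sums = [sum(row) for row in P]

# One step of the chain starting from phi: w = phi * P.
w = [sum(phi[i] * P[i][j] for i in range(4)) for j in range(4)]
```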

Reversible models Many inference procedures require that the evolutionary model be time reversible What does this mean?

Reversible Markov Chain (Source: Wikipedia) A reversible chain, run backwards, looks like time has been reversed. That is, the chain is reversible if we can find a φ such that φ_x P(x -> y) = φ_y P(y -> x) for all states x and y (the "detailed balance" condition). The models we have seen today all have this property.
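Detailed balance can be checked directly. A sketch for the Felsenstein matrix (arbitrary u and φ), where both sides of the condition equal u φ_x φ_y for x ≠ y and are therefore symmetric in x and y:

```python
U = 0.05
phi = [0.4, 0.3, 0.2, 0.1]   # assumed stationary distribution
P = [[(1 - U + U * phi[i]) if i == j else U * phi[j] for j in range(4)]
     for i in range(4)]

# Detailed balance: phi_x * P(x -> y) == phi_y * P(y -> x) for every pair.
balanced = all(abs(phi[x] * P[x][y] - phi[y] * P[y][x]) < 1e-12
               for x in range(4) for y in range(4))
```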