Dayhoff’s Markov Model of Evolution. Brands of Soup Revisited Brand A Brand B P(B|A) = 2/7 P(A|B) = 2/7.

Presentation transcript:

Dayhoff’s Markov Model of Evolution

Brands of Soup Revisited
Transition diagram: two states, Brand A and Brand B, with P(B|A) = p = 2/7 and P(A|B) = p = 2/7.
Conditional probability formulas:
P(A_k) = P(A_{k-1}) (1-p) + P(B_{k-1}) p = 5/7 P(A_{k-1}) + 2/7 P(B_{k-1})
P(B_k) = P(A_{k-1}) p + P(B_{k-1}) (1-p) = 2/7 P(A_{k-1}) + 5/7 P(B_{k-1})
Matrix representation:
[P(A_k) P(B_k)] = [P(A_{k-1}) P(B_{k-1})] [[5/7, 2/7], [2/7, 5/7]]
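The soup update rule can be sketched in a few lines of Python. This is a minimal illustration, assuming a shopper who starts on Brand A with certainty; the names `M`, `step`, `after_one`, and `after_two` are hypothetical and not from the slides. Each new probability is an inner product of the current distribution with one column of the matrix.

```python
from fractions import Fraction

p = Fraction(2, 7)

# Row-stochastic transition matrix: M[i][j] = P(next = j | current = i),
# with state 0 = Brand A and state 1 = Brand B.
M = [[1 - p, p],
     [p, 1 - p]]

def step(dist, matrix):
    """Advance a probability distribution by one step of the chain."""
    n = len(dist)
    return [sum(dist[i] * matrix[i][j] for i in range(n)) for j in range(n)]

start = [Fraction(1), Fraction(0)]   # assumed: certainly Brand A at k = 0
after_one = step(start, M)           # distribution at k = 1
after_two = step(after_one, M)       # distribution at k = 2
print(after_one, after_two)
```

Exact fractions are used here only so that the results can be compared directly with the 5/7 and 2/7 coefficients in the formulas.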

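Markov Processes Can Be Represented by Matrices
e.g., a 3-state process:
[Figure: transition diagram of a 3-state process; surviving edge labels 1/2, 1/3, 1/4]
Can be represented with this matrix:
[Figure: the corresponding 3 x 3 transition matrix]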

Each Step Involves an Inner Product

Markov Matrix Properties
Sum of probabilities in a row must be 1
No change = identity matrix
If well-behaved*, multiplying the matrix by itself many times converges to a limit
–In this limit matrix, all the elements within a column are identical (every row is the same)
–The rows of the limit matrix are the “equilibrium probabilities” for the process
*(1) Every state can transition to every other state at least indirectly, and (2) the greatest common divisor of the lengths of all cycles in the transition diagram is 1
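These properties can be checked numerically. The sketch below uses a made-up 3-state matrix (an assumption for illustration; it is not the matrix from the slide, which did not survive) and NumPy's `matrix_power`. The power 50 is an arbitrary choice that is large enough for this example.

```python
import numpy as np

# Hypothetical well-behaved 3-state chain; each row sums to 1.
M = np.array([[1/2, 1/4, 1/4],
              [1/3, 1/3, 1/3],
              [1/4, 1/4, 1/2]])

rows_ok = np.allclose(M.sum(axis=1), 1.0)   # row sums must be 1

# "No change" corresponds to the identity matrix.
identity = np.eye(3)

# Multiply the matrix by itself many times.
limit = np.linalg.matrix_power(M, 50)

# Every row of the limit is (numerically) the same vector:
# the equilibrium probabilities of the process.
equilibrium = limit[0]
print(rows_ok)
print(limit)
print(equilibrium @ M)   # one more step leaves the equilibrium unchanged
```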

Ask Mathematica! Recall m =

Margaret Dayhoff
Had a large (for 1978) database of related proteins
DAYHOFF, M. O., R. M. SCHWARTZ, and B. C. ORCUTT. A model of evolutionary change in proteins. (In M. O. DAYHOFF, ed. Atlas of protein sequence and structure. Vol. 5, Suppl. 3. National Biomedical Research Foundation, Washington, D.C.)
Asked “what is the probability that two aligned sequences are related by evolution?”

Dayhoff Model
Amino acids change over time independently of their position in a protein (simplifying assumption).
The probability of a substitution depends only on the amino acids involved and not on the prior history (Markov model).

A Sequence Alignment
>gi| |sp|P44374|RS5_HAEIN 30S ribosomal protein S5
Length = 166
Score = 263 bits (672), Expect = 1e-70
Identities = 154/166 (92%), Positives = 159/166 (95%)

Query: 1   MAHIEKQAGELQEKLIAVNRVSKTVKGGRIFSFTALTVVGDGNGRVGFGYGKAREVPAAI 60
           M++IEKQ GELQEKLIAVNRVSKTVKGGRI SFTALTVVGDGNGRVGFGYGKAREVPAAI
Sbjct: 1   MSNIEKQVGELQEKLIAVNRVSKTVKGGRIMSFTALTVVGDGNGRVGFGYGKAREVPAAI 60

Query: 61  QKAMEKARRNMINVALNNGTLQHPVKGVHTGSRVFMQPASEGTGIIAGGAMRAVLEVAGV 120
           QKAMEKARRNMINVALN GTLQHPVKGVHTGSRVFMQPASEGTGIIAGGAMRAVLEVAGV
Sbjct: 61  QKAMEKARRNMINVALNEGTLQHPVKGVHTGSRVFMQPASEGTGIIAGGAMRAVLEVAGV 120

Query: 121 HNVLAKAYGSTNPINVVRATIDGLENMNSPEMVAAKRGKSVEEILG 166
            NVL+KAYGSTNPINVVRATID L NM SPEMVAAKRGK+V+EILG
Sbjct: 121 RNVLSKAYGSTNPINVVRATIDALANMKSPEMVAAKRGKTVDEILG 166

(Example alignment from a BLAST search)

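Observed Substitution Frequencies
[Table: lower-triangular matrix of observed substitution counts between the 20 amino acids; rows A R N D C Q E G H I L K M F P S T W Y V, columns A R N D C Q E G H I L K M F P S T W Y. Surviving entries: R/A = 30, N/A = 109, N/R = 17.]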

Building a Markov Model
From the observed substitution data, Dayhoff et al. were able to estimate the joint probabilities of two amino acids substituting for each other. This yields a big, diagonally symmetric matrix of probabilities, M_ab, whose diagonal elements dominate because most aligned positions are unchanged.
But the matrix of joint probabilities, P(b∩a), does not represent a Markov process. Recall that the elements of a Markov process’ matrix are conditional probabilities, P(b|a) = P(b∩a) / P(a). P(a) is just the probability (frequency) of an amino acid, so each row a of M_ab is divided by the frequency of the corresponding amino acid. (The joint matrix is symmetric, so dividing columns instead gives the transpose; rows are used here so that each row is a conditional distribution.) The diagonal elements are now all close to 1.
Dayhoff then adjusts the small non-diagonal elements by a common factor that makes the expected number of amino acid substitutions equal to 1 in 100. The diagonal elements are then adjusted to make each row add up to 1, as required by the law of total probability.
This is the PAM1 Markov matrix (PAM = Point Accepted Mutation; 1 = 1% substitution frequency).
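The construction can be sketched on a toy example. The code below assumes a made-up 3-letter alphabet with invented joint probabilities (Dayhoff's real data cover 20 amino acids), and the function name `pam1_from_joint` is hypothetical. It uses the row convention, so entry [a, b] is P(b|a) and each row sums to 1.

```python
import numpy as np

def pam1_from_joint(joint, target_rate=0.01):
    """Turn a symmetric joint-probability matrix into a PAM1-style Markov matrix."""
    joint = np.asarray(joint, dtype=float)
    joint = joint / joint.sum()              # make sure entries sum to 1
    q = joint.sum(axis=1)                    # amino acid frequencies P(a)
    cond = joint / q[:, None]                # P(b|a): row a divided by P(a)

    off = cond - np.diag(np.diag(cond))      # keep only the non-diagonal elements
    rate = float(q @ off.sum(axis=1))        # expected substitutions per position
    pam1 = off * (target_rate / rate)        # common factor: 1 substitution in 100

    # Set the diagonal so that every row adds up to 1.
    pam1[np.diag_indices_from(pam1)] = 1.0 - pam1.sum(axis=1)
    return pam1, q

# Invented, symmetric joint probabilities for a 3-letter alphabet.
toy_joint = np.array([[0.30, 0.02, 0.01],
                      [0.02, 0.25, 0.03],
                      [0.01, 0.03, 0.33]])

pam1, q = pam1_from_joint(toy_joint)
print(pam1)
print(pam1.sum(axis=1))                      # each row sums to 1
```

Scaling only the non-diagonal elements and then resetting the diagonal keeps the relative substitution preferences while fixing the overall rate.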

Using the PAM Model
The PAM1 Markov matrix can be multiplied by itself to yield the PAM2 Markov matrix, and again to yield the PAM3 matrix, etc. PAM1 is a “unit of evolutionary distance”.
PAM250 is commonly used. Note that this does not mean 250% of the amino acids have been substituted; since a position can be substituted more than once, it is more like 80%.
The PAM Markov matrices arrived at by matrix multiplication need to be converted into the scoring matrices that one would use for BLAST or CLUSTALW.
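Repeated multiplication is a matrix power. The sketch below assumes a made-up, symmetric 3-state "PAM1" with a 1% substitution rate and uniform frequencies (not Dayhoff's 20 x 20 matrix), so the fraction of changed positions it reports is specific to this toy and will not match the roughly 80% quoted for real PAM250.

```python
import numpy as np

# Hypothetical 3-state PAM1: 1% chance of substitution per step.
pam1 = np.array([[0.990, 0.005, 0.005],
                 [0.005, 0.990, 0.005],
                 [0.005, 0.005, 0.990]])
q = np.array([1/3, 1/3, 1/3])   # assumed uniform frequencies for this toy

def pam(n, base=pam1):
    """PAMn Markov matrix: the PAM1 matrix multiplied by itself n times."""
    return np.linalg.matrix_power(base, n)

def fraction_changed(matrix, freqs=q):
    """Expected fraction of positions whose residue differs after the step."""
    return float(1.0 - freqs @ np.diag(matrix))

pam2 = pam(2)
pam250 = pam(250)
print(fraction_changed(pam1))     # 1% by construction
print(fraction_changed(pam250))   # well below 250%: repeated hits saturate
```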

Probability of an Alignment
In a random model, the two proteins x and y are independent, and the probability of their alignment is the product of the probabilities { q_a } for all the amino acids in both sequences. (Note that the { q_a } are not all the same value of 1/20.)
In a match model, the proteins have descended from a common ancestor protein and the amino acid sequences are no longer independent. In this model, the probability can be expressed with a matrix of joint probabilities {{ p_ab }}. (Note that p_ab = p_ba because neither protein is “first”.)
Dayhoff and coworkers could estimate these probabilities from the frequencies of amino acid substitutions they observed in her database of evolutionarily related proteins.
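In symbols, the two models might be written as follows. This is a sketch assuming an ungapped alignment of length n, where x_i and y_i are the residues at position i; the index i, the length n, and the subscripts "random" and "match" are notation introduced here, not taken from the slides.

```latex
P_{\text{random}}(x, y) = \prod_{i=1}^{n} q_{x_i} \, \prod_{i=1}^{n} q_{y_i}
\qquad
P_{\text{match}}(x, y) = \prod_{i=1}^{n} p_{x_i y_i}
```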

A Log-Odds Score
We are interested in the ratio of the match model probability of alignment to the random model probability, which for a single aligned pair (a, b) is p_ab / (q_a q_b).
In practice, we usually take the log of these quantities for a substitution “scoring” matrix: S(a,b) = log( p_ab / (q_a q_b) ). This changes the multiplications into additions and reduces round-off error.
S(a,b) defines the number you usually see in a substitution matrix. These numbers are usually rounded to integers to ease computation.
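A small sketch of the log-odds calculation, assuming an invented 3-letter alphabet and made-up joint probabilities. The base-2 logarithm and the factor of 2 used before rounding are illustrative choices, and `ALPHABET` and `alignment_score` are hypothetical names.

```python
import numpy as np

ALPHABET = "ABC"   # hypothetical 3-letter alphabet, not real amino acids

# Invented symmetric joint probabilities p_ab (entries sum to 1).
joint = np.array([[0.30, 0.02, 0.01],
                  [0.02, 0.25, 0.03],
                  [0.01, 0.03, 0.33]])
q = joint.sum(axis=1)                          # background frequencies q_a

# S(a, b) = log( p_ab / (q_a q_b) )
log_odds = np.log2(joint / np.outer(q, q))

# Rounded to integers for use as a scoring matrix (assumed scale factor of 2).
scores = np.rint(2 * log_odds).astype(int)

def alignment_score(x, y, matrix=log_odds):
    """Sum the per-position scores of an ungapped alignment of x and y."""
    index = {letter: i for i, letter in enumerate(ALPHABET)}
    return float(sum(matrix[index[a], index[b]] for a, b in zip(x, y)))

total = alignment_score("ABCA", "ABBA")
print(scores)
print(total)
```

Adding the per-position scores gives the log of the product of the per-position ratios, which is why the log form turns multiplications into additions.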

Questions? I will post a Mathematica notebook.