Measuring the degree of similarity: PAM and BLOSUM matrices


Measuring the degree of similarity: PAM and BLOSUM matrices
Lecture 13

Introduction
Measurement of matching; nucleic acid and amino acid substitutions; the BLOSUM matrix; the PAM matrix; appropriate use of the BLOSUM and PAM matrices; measurement of alignment gaps.

Measurement of matching
The dot plot gives a visual representation of a sequence alignment. So how do we measure the alignment? One way is to count the matches and mismatches. The Hamming distance is the number of mismatched positions between two strings of equal length.
agtc
cgta
The distance is 2 (give another example).
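As a minimal sketch of the Hamming distance described on this slide (the function name is chosen here for illustration and is not from the lecture), the mismatches can be counted position by position:

```python
def hamming_distance(seq1: str, seq2: str) -> int:
    """Count the positions at which two equal-length sequences differ."""
    if len(seq1) != len(seq2):
        raise ValueError("Hamming distance needs sequences of equal length")
    return sum(a != b for a, b in zip(seq1, seq2))


print(hamming_distance("agtc", "cgta"))  # 2, the slide's example
print(hamming_distance("agtc", "agtt"))  # 1, only the last base differs
```

The length check matters: the measure is only defined for strings of equal length, which is why a different distance is needed once insertions and deletions are allowed.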

Measurement of matching
If the sequences (strings) are not of equal length, then use the Levenshtein distance: the minimum number of edit operations (alter/insert/delete) required to turn one string into another.
ag-tcc
cgctca
What is the Levenshtein distance? But what about the biological plausibility of this approach? Strings are not the same as sequences! (hint: amino acid alignment)
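The Levenshtein distance can be computed with the standard row-by-row edit-distance recurrence. The sketch below is one common way to write it, not necessarily the method used in the lecture; the function name is an assumption made for illustration.

```python
def levenshtein_distance(source: str, target: str) -> int:
    """Minimum number of substitutions, insertions and deletions
    needed to turn `source` into `target`."""
    # previous[j] is the distance between the part of `source` handled
    # so far and the first j characters of `target`.
    previous = list(range(len(target) + 1))
    for i, s_char in enumerate(source, start=1):
        current = [i]  # turning i characters into an empty string: i deletions
        for j, t_char in enumerate(target, start=1):
            substitution = previous[j - 1] + (s_char != t_char)
            insertion = current[j - 1] + 1
            deletion = previous[j] + 1
            current.append(min(substitution, insertion, deletion))
        previous = current
    return previous[-1]


# The slide's pair with the gap removed: two substitutions and one insertion.
print(levenshtein_distance("agtcc", "cgctca"))  # 3
```

Every edit operation costs the same here, which is exactly the biological weakness the slide hints at: a real scoring scheme should not treat all substitutions as equally likely.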

Nucleic acid mutations
It is known that transitions (a<->g, c<->t) are more common than transversions (e.g. a<->c, g<->t). In sequence alignment we are trying to determine the degree of similarity, not dissimilarity, but the Hamming and Levenshtein distances measure dissimilarity. One approach would be to count the number of matches, but there is also a need to include the bias associated with the possible substitutions.

Nucleic acid scoring table
Based on known rates we could propose a simple table like the following, where:
each match scores 1000
a transition (A<->G, C<->T) scores 100
a transversion (all other substitutions) scores 10
The values correspond to the relative chances of no substitution and of each kind of substitution.

        A      G      T      C
A    1000    100     10     10
G     100   1000     10     10
T      10     10   1000    100
C      10     10    100   1000
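A small sketch of how such a table could be generated from the rule rather than typed by hand. It assumes the standard classification (A and G are purines, C and T are pyrimidines, and a substitution within one class is a transition) and reuses the slide's illustrative values of 1000, 100 and 10; the names are hypothetical.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}

# Illustrative values from the slide, not measured substitution rates.
MATCH, TRANSITION, TRANSVERSION = 1000, 100, 10


def substitution_score(x: str, y: str) -> int:
    """Score one pair of bases: match, transition or transversion."""
    x, y = x.upper(), y.upper()
    if x == y:
        return MATCH
    same_class = {x, y} <= PURINES or {x, y} <= PYRIMIDINES
    return TRANSITION if same_class else TRANSVERSION


bases = "AGTC"
table = {(x, y): substitution_score(x, y) for x in bases for y in bases}

# Print the 4 x 4 table in the same row/column order as the slide.
print("     " + "".join(f"{b:>6}" for b in bases))
for x in bases:
    print(f"{x:>5}" + "".join(f"{table[(x, y)]:>6}" for y in bases))
```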

Nucleic acid scoring table
Using this table we can calculate the similarity by scoring each position of seq 1 against seq 2.
Seq 1:    a     g     t     c
Seq 2:    c     g     t     a
Score:   10  1000  1000    10
Since the positions are, we assume, independent elements (events), we have to multiply the values to get the score: 10 x 1000 x 1000 x 10 = 100,000,000. However, because log A + log B = log(A*B), by taking the log of each value we only have to add the values: log10 of 100,000,000 is 8. What would be the table if log values were used?
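The equivalence between multiplying raw scores and adding their logs can be checked directly. This is a sketch using the slide's illustrative values, with transitions taken to be A<->G and C<->T; function names are assumptions.

```python
import math

MATCH, TRANSITION, TRANSVERSION = 1000, 100, 10  # illustrative values
TRANSITIONS = {frozenset("AG"), frozenset("CT")}


def score(x: str, y: str) -> int:
    if x == y:
        return MATCH
    return TRANSITION if frozenset((x, y)) in TRANSITIONS else TRANSVERSION


def product_score(seq1: str, seq2: str) -> int:
    """Multiply the per-position scores (positions assumed independent)."""
    return math.prod(score(a, b) for a, b in zip(seq1.upper(), seq2.upper()))


def log_score(seq1: str, seq2: str) -> float:
    """Add the log10 of each per-position score instead of multiplying."""
    return sum(math.log10(score(a, b)) for a, b in zip(seq1.upper(), seq2.upper()))


print(product_score("agtc", "cgta"))        # 100000000
print(round(log_score("agtc", "cgta"), 6))  # 8.0
```

Working in logs keeps the numbers small and turns scoring into simple addition, which is why published substitution matrices store log values.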

Nucleic acid matrix
      A   G   T   C
A     3   2   1   1
G     2   3   1   1
T     1   1   3   2
C     1   1   2   3
So in this case all we have to do is add the values. Note that this is an example to illustrate the concept; it is not an actual substitution matrix for nucleic acids (bases), which can be found on the internet. Lesk (2008, p. 255) gives an example of one. Measurement of sequence similarity plays a much greater role in assessing proteins. Why do you think the similarity of proteins is more critical than that of nucleic acids? (hint: the code and amino acid (AA) properties)

Measuring protein similarity
Deriving a matrix for proteins is more complex because:
There are 20 amino acids, so there is a much larger set of substitutions.
The amino acids have properties that affect the structure, and so the functionality, of the protein.
Therefore substitutions can be conserved or semi-conserved. Observation shows that conserved substitutions (e.g. hydrophobic <-> hydrophobic) are more common than semi-conserved ones (e.g. hydrophilic <-> hydrophobic).

PAM 1 matrix
PAM (Point Accepted Mutation) 1 corresponds to one accepted point mutation per 100 residues; in other words, a first round of divergence. Each score in the matrix depends on the expected frequency of that substitution.
Clearly A <-> A (no change) has a high score.
Hydrophobic <-> hydrophobic: V <-> A (13), while V <-> I is (57)
Hydrophilic <-> hydrophilic: K <-> T (11); K <-> R (37)
Hydrophilic <-> hydrophobic: K <-> V (1)

Dayhoff PAM 250 matrix
The most common PAM matrix is PAM 250. It represents a greater degree of evolutionary divergence and corresponds to multiplying the PAM 1 matrix by itself 250 times (matrix multiplication). To derive the values you use: observed rate of mutation / random mutation rate (based on the amino acid frequencies). In other words, this ratio is the expected value: 1 means no bias, above 1 a positive bias and below 1 a negative bias. The log of this expected value is multiplied by 10 to give the results in the table. Therefore a C <-> S value of 2 corresponds to an expected value of 1.6, i.e. the substitution occurred 1.6 times more often than if it were random: log10(1.6) = 0.2, and multiplying this by 10 gives a value of 2. The values in PAM 250 are obviously lower, but the distribution is about the same: why?
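The two steps on this slide, extrapolating PAM 1 to PAM 250 and converting a ratio into a score, can be sketched as follows. PAM 250 is obtained by repeated matrix multiplication of the PAM 1 mutation probability matrix. The 3 x 3 matrix below is a made-up toy, not real amino acid data, and all names are hypothetical.

```python
import math


def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_power(m, exponent):
    """Multiply m by itself `exponent` times, starting from the identity."""
    n = len(m)
    result = [[float(i == j) for j in range(n)] for i in range(n)]
    for _ in range(exponent):
        result = mat_mul(result, m)
    return result


def log_odds_score(odds_ratio):
    """10 * log10(observed / expected by chance), rounded to an integer."""
    return round(10 * math.log10(odds_ratio))


# Toy "PAM 1": each row gives the probabilities that one residue type
# stays the same or changes into another after one unit of divergence.
toy_pam1 = [
    [0.990, 0.008, 0.002],
    [0.008, 0.990, 0.002],
    [0.002, 0.002, 0.996],
]
toy_pam250 = mat_power(toy_pam1, 250)

print([round(p, 3) for p in toy_pam250[0]])  # first row after 250 steps
print(log_odds_score(1.6))                   # 2, the slide's worked example
```

After 250 steps the diagonal entries are much smaller than in the toy PAM 1, reflecting the greater divergence, while each row still sums to 1.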

BLOSUM 62 matrix
Another matrix, the BLOSUM matrix, used a larger data set (as there was more information available in 1992 than in 1978). Moreover, BLOSUM looked at substitutions within blocks of conserved sequences, as opposed to point mutations on individual sequences in both conserved and variable regions. [What was the logic behind excluding the variable regions?] Unlike the PAM 250 matrix, which is PAM 1 multiplied by itself 250 times, the BLOSUM 62 probabilities are derived directly from blocks sharing 62% conservation. Like the PAM matrix, it scores conserved substitutions higher than semi-conserved ones:
Hydrophobic <-> hydrophobic: V <-> A (0); V <-> I (3)
Hydrophilic <-> hydrophilic: K <-> T (-1); K <-> R (2)
Hydrophobic <-> hydrophilic: K <-> V (-2)
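To show how such scores are used, here is a sketch that scores an ungapped alignment by adding the substitution score at each position. It contains only the five BLOSUM 62 values quoted on this slide, so it is an excerpt for illustration and not the full matrix; the names are hypothetical.

```python
# Only the pairs quoted on the slide; the real BLOSUM 62 matrix is 20 x 20.
BLOSUM62_EXCERPT = {
    frozenset("VA"): 0,
    frozenset("VI"): 3,
    frozenset("KT"): -1,
    frozenset("KR"): 2,
    frozenset("KV"): -2,
}


def pair_score(x: str, y: str) -> int:
    """Look up a substitution score; the order of the residues is irrelevant.
    Raises KeyError for pairs (including identical residues) not in the excerpt."""
    return BLOSUM62_EXCERPT[frozenset((x.upper(), y.upper()))]


def alignment_score(seq1: str, seq2: str) -> int:
    """Add the substitution scores of an ungapped, equal-length alignment."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must be aligned to the same length")
    return sum(pair_score(a, b) for a, b in zip(seq1, seq2))


print(alignment_score("VKK", "IRV"))  # 3 + 2 + (-2) = 3
```

Using an unordered pair as the key reflects the symmetry of the matrix: V <-> I scores the same as I <-> V.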

PAM and BLOSUM matrices
In the PAM matrices, as the number increases so does the evolutionary distance, while it is the reverse in the BLOSUM matrices. According to Baxevanis (2005), the following pairs are equivalent and have the same most appropriate use:
PAM 250 and BLOSUM 45
PAM 160 and BLOSUM 62

PAM and BLOSUM matrices
Matrix                 Best in determining
PAM 40 / BLOSUM 90     Short, similar (conserved) alignments
PAM 250                Longer, more divergent alignments
PAM 160 / BLOSUM 80    Detecting members of protein families
BLOSUM 62              Finding all potential similarities
Adapted from Baxevanis 2005.
An excellent review of scoring matrices can be found in Henikoff and Henikoff 2000.

Measurement of alignment gaps
Gaps represent insertions and deletions. They need to be limited so that the alignment remains biologically plausible. Baxevanis (2005) suggests that no more than “one in 20 is a good rule of thumb”. Baxevanis (2005) proposed that the use of gaps in alignments is penalised; in other words, the measured similarity is reduced. The penalty associated with using gaps depends on:
Opening the gap
Extending the gap
The length of the gap
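One common way to combine these three factors is an affine gap penalty: a fixed cost for opening a gap plus a smaller cost for each position it covers. The sketch below assumes that model; the default values of 10 and 1 are placeholders, conventions differ between tools on whether the first position also pays the extension cost, and all names are hypothetical.

```python
import re


def affine_gap_penalty(length: int, gap_open: int = 10, gap_extend: int = 1) -> int:
    """Penalty for one gap: opening cost plus extension cost per gap position.
    Assumed convention: every position in the gap pays the extension cost."""
    if length <= 0:
        return 0
    return gap_open + gap_extend * length


def gap_lengths(aligned: str) -> list:
    """Lengths of each run of '-' characters in one aligned sequence."""
    return [len(run.group()) for run in re.finditer(r"-+", aligned)]


def total_gap_penalty(aligned_a: str, aligned_b: str,
                      gap_open: int = 10, gap_extend: int = 1) -> int:
    """Sum the penalties of all gaps in both rows of a pairwise alignment."""
    return sum(
        affine_gap_penalty(n, gap_open, gap_extend)
        for seq in (aligned_a, aligned_b)
        for n in gap_lengths(seq)
    )


print(gap_lengths("ag--tc-c"))                   # [2, 1]
print(total_gap_penalty("ag-tcc", "cgctca"))     # 11: one gap of length 1
```

Because opening costs more than extending, one long gap is penalised less than several short ones of the same total length, which matches the idea that a single insertion or deletion event can span several residues.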

The BLAST algorithm
The most widely used approach to determining similarity is the BLAST algorithm. Basically, the algorithm combines the dot plot idea with one of the scoring matrices, such as BLOSUM or PAM, to determine the best region of local alignment between the query sequence and the target sequences (refer to dot plot example 1 in lecture 12).

Potential exam questions
Discuss how to derive both the PAM and BLOSUM matrices and why it is necessary to use different variants of each in different types of similarity analysis.
The dot plot and the PAM and BLOSUM matrices are important tools in the measurement of amino acid sequence similarity. Discuss the best variant of each that should be used in the determination of sequence alignment similarity.
Distinguish between the two main types of scoring matrices (PAM and BLOSUM) and explain how they are used to measure the amount of similarity between two sequences.

References
Baxevanis, A.D. 2005. Bioinformatics: A Practical Guide to the Analysis of Genes and Proteins, chapter 11. Wiley.
Lesk, A. 2008. Introduction to Bioinformatics, 3rd edition. Oxford University Press.