Scoring the Alignment of Amino Acid Sequences Constructing PAM and Blosum Matrices.



Proteins are huge molecules made up of large numbers of amino acids. Proteins are usually 100 to 500 amino acids long. There are 20 different amino acids that make up proteins. The following table is quoted from page 11 of our Lab Manual:

Name            Abbr.    Linear structure formula
======================================================
Alanine         ala  a   CH3-CH(NH2)-COOH
Arginine        arg  r   HN=C(NH2)-NH-(CH2)3-CH(NH2)-COOH
Asparagine      asn  n   H2N-CO-CH2-CH(NH2)-COOH
Aspartic acid   asp  d   HOOC-CH2-CH(NH2)-COOH
Cysteine        cys  c   HS-CH2-CH(NH2)-COOH
Glutamine       gln  q   H2N-CO-(CH2)2-CH(NH2)-COOH
Glutamic acid   glu  e   HOOC-(CH2)2-CH(NH2)-COOH
Glycine         gly  g   NH2-CH2-COOH
Histidine       his  h   NH-CH=N-CH=C-CH2-CH(NH2)-COOH
Isoleucine      ile  i   CH3-CH2-CH(CH3)-CH(NH2)-COOH
Leucine         leu  l   (CH3)2-CH-CH2-CH(NH2)-COOH
Lysine          lys  k   H2N-(CH2)4-CH(NH2)-COOH
Methionine      met  m   CH3-S-(CH2)2-CH(NH2)-COOH
Phenylalanine   phe  f   Ph-CH2-CH(NH2)-COOH
Proline         pro  p   NH-(CH2)3-CH-COOH
Serine          ser  s   HO-CH2-CH(NH2)-COOH
Threonine       thr  t   CH3-CH(OH)-CH(NH2)-COOH
Tryptophan      trp  w   Ph-NH-CH=C-CH2-CH(NH2)-COOH
Tyrosine        tyr  y   HO-p-Ph-CH2-CH(NH2)-COOH
Valine          val  v   (CH3)2-CH-CH(NH2)-COOH

Constructing Probability Matrices Using a Smaller Set of AAs
Suppose we live in a world with only 3 amino acids: Alanine, Leucine, and Serine. Furthermore, suppose:
Alanine ↔ Leucine with probability 0.2
Alanine ↔ Serine with probability 0.1
Leucine ↔ Serine with probability 0.3
We will assume that these probabilities are for changes that take place during one time unit.

We can summarize these observations using the language of probability theory. We will use the notation (A|L, t) to mean: “A certain position in our sequence initially contains Leucine and at time t it contains Alanine.” Another way of saying this is, “After t time units the position contains Alanine given that it initially contained Leucine.” That is, the vertical bar means “given”: Alanine given Leucine after t time units. We then write:
Pr(A|A, 1) = 0.7   Pr(A|L, 1) = 0.2   Pr(A|S, 1) = 0.1
Pr(L|A, 1) = 0.2   Pr(L|L, 1) = 0.5   Pr(L|S, 1) = 0.3
Pr(S|A, 1) = 0.1   Pr(S|L, 1) = 0.3   Pr(S|S, 1) = 0.6
The above can be summarized in a table, called a matrix, in which the row gives the initial residue (1) and the column gives the residue one time unit later (2):
1\2   A     L     S
A     0.7   0.2   0.1
L     0.2   0.5   0.3
S     0.1   0.3   0.6

What about the probabilities two time units later? For example, what is the probability that a position that was originally Alanine is Alanine two time units later? This can happen in three ways:
A → A → A
A → L → A
A → S → A
In our original notation, we are saying:
(A|A, 2) = (A|A, 1) and (A|A, 1), or (L|A, 1) and (A|L, 1), or (S|A, 1) and (A|S, 1)
Thus, to compute the probability:
Pr(A|A, 2) = Pr(A|A, 1)Pr(A|A, 1) + Pr(L|A, 1)Pr(A|L, 1) + Pr(S|A, 1)Pr(A|S, 1)
           = 0.7*0.7 + 0.2*0.2 + 0.1*0.1 = 0.49 + 0.04 + 0.01 = 0.54
We will work out the 8 other second time unit transition probabilities in class.

After we compute all 9 of the probabilities for the transitions after 2 time units, we have the following table:
      A      L      S
A     0.54   0.27   0.19
L     0.27   0.38   0.35
S     0.19   0.35   0.46
Three multiplications and two additions were required to compute the value placed in each of its nine cells. That is, there were 27 multiplications and 18 additions required to produce the table.
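The bookkeeping for the two-time-unit table can be checked with a short Python sketch. It sums over every intermediate residue, exactly as in the Pr(A|A, 2) calculation, and counts the arithmetic operations along the way. The names `two_step_table`, `M`, and `AMINO` are illustrative and not part of the original slides.

```python
# One-step transition probabilities from the slides, stored as M[start][end].
AMINO = ["A", "L", "S"]
M = {
    "A": {"A": 0.7, "L": 0.2, "S": 0.1},
    "L": {"A": 0.2, "L": 0.5, "S": 0.3},
    "S": {"A": 0.1, "L": 0.3, "S": 0.6},
}


def two_step_table(m, states):
    """Return (table, multiplications, additions) for two time units."""
    table = {}
    mults = 0
    adds = 0
    for start in states:
        table[start] = {}
        for end in states:
            # One product per possible intermediate residue.
            terms = [m[start][mid] * m[mid][end] for mid in states]
            mults += len(terms)
            adds += len(terms) - 1
            # Round away floating-point noise for display.
            table[start][end] = round(sum(terms), 10)
    return table, mults, adds


two_step, n_mult, n_add = two_step_table(M, AMINO)
for start in AMINO:
    print(start, [two_step[start][end] for end in AMINO])
print(n_mult, "multiplications and", n_add, "additions")
```

Each row of the resulting table still sums to 1, which is a quick sanity check that the entries are probabilities.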

The Matrix Connection
Consider the matrix M that we constructed earlier when we made the table of probabilities. In matrix algebra, the product of two matrices is defined as follows: to compute the product of two matrices A and B, the value placed in row i and column j is obtained by multiplying each value in row i of A by its corresponding element in column j of B and summing the results. A translation by way of an illustration follows.

Let’s suppose we want to square M, i.e. multiply M by itself. To compute the value of the product matrix M^2 in row 2, column 3, we multiply each element in row 2 of the first matrix by its corresponding element in column 3 of the second matrix and sum the results:
0.2*0.1 + 0.5*0.3 + 0.3*0.6 = 0.02 + 0.15 + 0.18 = 0.35
But this is exactly how we calculated Pr(S|L, 2)! This agreement between M^2 and the table of transition probabilities holds for each position. It appears that matrix multiplication is exactly what we need to generate the table of transition probabilities after t time units.

Thus, if we use the rules of matrix multiplication, the table of transition probabilities after t time units is M^t, the t-th power of M. Since the rules of matrix multiplication and those for computing the transition probabilities are essentially the same, we have a marriage made by the divine. So let’s use them to our advantage.
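As a minimal sketch of this idea, the following Python code raises the toy matrix to a power by repeated multiplication. The helper names `mat_mul` and `mat_pow` are illustrative. Repeated multiplication is chosen here for clarity, not speed.

```python
def mat_mul(a, b):
    """Row-by-column product of two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_pow(m, t):
    """Return M^t for an integer t >= 1 by repeated multiplication."""
    result = m
    for _ in range(t - 1):
        result = mat_mul(result, m)
    return result


# Rows and columns are ordered A, L, S, as in the slides.
M = [[0.7, 0.2, 0.1],
     [0.2, 0.5, 0.3],
     [0.1, 0.3, 0.6]]

M2 = mat_pow(M, 2)
print(round(M2[1][2], 4))      # row 2, column 3: Pr(S|L, 2)

M250 = mat_pow(M, 250)
for row in M250:
    print([round(x, 4) for x in row])
```

For this particular toy matrix every entry of a high power approaches 1/3: after many time units the residue at a position no longer depends on where it started.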

The number, variety, and chemical properties of the amino acids make scoring a pair of amino acids a much more complicated problem than scoring a pair of nucleotides. In the late 1970s Dayhoff, Schwartz, and Orcutt decided to look at a database of similar proteins having common ancestors and obtain substitution frequency data. They looked at 71 groupings of protein data that differed by no more than 15% of their residues, i.e. were at least 85% similar. They then built phylogenetic trees in which each transition from generation to generation has as few changes as possible, given the data, in each ancestral sequence. From these trees a value is determined for the entry A_ab in a matrix giving the frequency data for each pairing.

Constructing a Parsimonious Phylogenetic Tree (taken from page 40 of Krane & Raymer)

                        ACGCTAFKI
               A -> G /           \ I -> L
             GCGCTAFKI             ACGCTAFKL
       A -> G /    \ A -> L   C -> S /    \ G -> A
     GCGCTGFKI   GCGCTLFKI    ASGCTAFKL   ACACTAFKL

Dayhoff and her team used sequences that were at least 85% similar and calculated the frequency with which each amino acid was substituted for each of the other amino acids.
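The substitution counts can be read off a tree like this one mechanically. The sketch below walks each parent-child edge of the example tree, compares the two sequences position by position, and tallies every change. Counting a -> b and b -> a together under one unordered pair is an assumption of this sketch, as are the names `EDGES` and `count_substitutions`.

```python
from collections import Counter

# (parent, child) edges of the example tree.
EDGES = [
    ("ACGCTAFKI", "GCGCTAFKI"),
    ("ACGCTAFKI", "ACGCTAFKL"),
    ("GCGCTAFKI", "GCGCTGFKI"),
    ("GCGCTAFKI", "GCGCTLFKI"),
    ("ACGCTAFKL", "ASGCTAFKL"),
    ("ACGCTAFKL", "ACACTAFKL"),
]


def count_substitutions(edges):
    """Tally residue changes along every edge, ignoring direction."""
    counts = Counter()
    for parent, child in edges:
        for old, new in zip(parent, child):
            if old != new:
                counts[tuple(sorted((old, new)))] += 1
    return counts


A = count_substitutions(EDGES)
for pair, n in sorted(A.items()):
    print(pair, n)
```

For this tree the A/G pair is counted three times (two A -> G changes and one G -> A), and the A/L, C/S, and I/L pairs once each.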

Dayhoff’s Data
NOTE: The diagonals are blank since only the changes are recorded. Also, the upper triangular half of the matrix is not shown since it is assumed that the changes α → β and β → α are symmetrical.

Calculating the Entry in the Substitution Matrix
Let
P(b|a, t) = probability that a is replaced by b in t time units, adjusted for divergence time (Dayhoff time units)
q_a q_b = probability that a would be paired with b at random = (frequency of a)(frequency of b)
s(a, b | t) = the entry at position (a, b) or (b, a) in the scoring matrix
Then
$$s(a, b \mid t) = \log \frac{q_a \, P(b \mid a, t)}{q_a \, q_b} = \log \frac{P(b \mid a, t)}{q_b}$$

The Probabilities Found By Dayhoff
The entry in cell M_ab is the probability that a would be followed by b in one Dayhoff time unit, multiplied by 100. Thus, for example, Alanine would be followed by Proline 0.22% of the time.

Note: The previous matrix is NOT the scoring matrix. It is used to derive the scoring matrix. Recall:
$$s(a, b \mid t) = \log \frac{P(b \mid a, t)}{q_b}$$
However, the probability matrix is the main tool for deriving a sensible scoring matrix. To find the probability that amino acid a will be replaced by amino acid b t time units later, we need to calculate the (a, b)-th entry of the matrix M^t. After calculating this entry, we apply the “log-odds” formula given above. The reason that the logarithm is used in the scoring formula is that it allows us, among other things, to add the scores of the aligned residues when we compute the score for an overall alignment of two sequences.
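To make the log-odds step concrete, here is a small Python sketch on the three-amino-acid toy model from earlier. It assumes the log-odds form log(P(b|a, t) / q_b), equal background frequencies of 1/3 for each residue, and base-10 logarithms. These are all choices made for this illustration, and the function names are hypothetical.

```python
import math

AMINO = "ALS"
M = [[0.7, 0.2, 0.1],
     [0.2, 0.5, 0.3],
     [0.1, 0.3, 0.6]]
Q = [1 / 3, 1 / 3, 1 / 3]   # assumed background frequencies


def mat_pow(m, t):
    """Return M^t for an integer t >= 1 by repeated multiplication."""
    n = len(m)
    result = m
    for _ in range(t - 1):
        result = [[sum(result[i][k] * m[k][j] for k in range(n))
                   for j in range(n)] for i in range(n)]
    return result


def log_odds(m, q, t):
    """Score matrix with entries log10(P(b|a, t) / q_b)."""
    mt = mat_pow(m, t)
    n = len(q)
    return [[math.log10(mt[a][b] / q[b]) for b in range(n)] for a in range(n)]


def alignment_score(x, y, scores):
    """Sum the per-column scores of an ungapped alignment of x and y."""
    return sum(scores[AMINO.index(a)][AMINO.index(b)] for a, b in zip(x, y))


S2 = log_odds(M, Q, 2)
print(round(alignment_score("ALSA", "ALLA", S2), 3))
```

Because the score is a sum of logarithms, it equals the logarithm of the product of the per-column odds ratios. That is the additivity property mentioned above.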

The matrix of scores found from the original probability matrix is called a 1 PAM matrix. PAM stands for Point Accepted Mutation or Percent Accepted Mutation. Dayhoff’s term was Accepted Point Mutation, but PAM rolls off the tongue more easily than APM. The 1 means that, given the degree of similarity between the sequences used to make up the matrix, the scores in this matrix describe one evolutionary time unit. Matrices representing longer times are called PAMt matrices and are based on M^t. The most widely used matrix is PAM250, the log-odds matrix based on M^250, the 250th power of M. This matrix shows the probability of change over a long period of time. However, for closely related sequences, say mouse and rat MSH2, a PAM10 matrix may be more appropriate.

The PAM250 Matrix
We only show the top half because the bottom half is a reflection of the top half, i.e. S_a,b = S_b,a.

Discussion of PAM
The 1 PAM matrix was derived by constructing hypothetical phylogenetic trees relating sequences in 71 families. The higher the power of the matrix, the more evolutionary time units the matrix represents. Criticism: raising M to high powers does not capture the true difference between short-time substitutions and long-time substitutions. Note that short-time substitutions are dominated by amino acid substitutions that come from a single base change in the codon triplet of an amino acid, whereas long-time substitutions show all kinds of codon changes.

BLOSUM (BLOck Substitution Matrix) Matrices

The criticism given at the end of the last discussion is that the large PAM matrices tend to minimize the effects of short-time substitutions such as L ↔ I, L ↔ V, and Y ↔ F. In 1991-1992 Henikoff and Henikoff used the BLOCKS database at the Fred Hutchinson Cancer Research Center. This database contains blocks of multiple alignments of more distantly related sequences. Such a database can be used to derive scores more directly.

Methodology
Sequences from each block were clustered. Two sequences were placed in the same cluster if their percent identity was above some level, say α%. The frequency A_ab is calculated by counting how often residue a in one cluster is aligned against residue b in another cluster. Corrections are made for clusters of differing sizes.
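The clustering step can be sketched in Python as follows. The sequences in `BLOCK` are made up for illustration. The sketch also assumes single-linkage merging, meaning a sequence joins a cluster as soon as it is at least α% identical to any one member. The function names are hypothetical.

```python
def percent_identity(x, y):
    """Percentage of aligned positions where two equal-length sequences agree."""
    matches = sum(1 for a, b in zip(x, y) if a == b)
    return 100.0 * matches / len(x)


def cluster_block(seqs, threshold):
    """Group sequences so that each one is >= threshold% identical to
    at least one other member of its cluster (single linkage)."""
    clusters = []
    for seq in seqs:
        merged = [seq]
        unlinked = []
        for cluster in clusters:
            if any(percent_identity(seq, member) >= threshold
                   for member in cluster):
                merged.extend(cluster)      # seq links this cluster in
            else:
                unlinked.append(cluster)
        clusters = unlinked + [merged]
    return clusters


# Hypothetical ungapped block of four aligned segments.
BLOCK = ["AALLSSAALL", "AALLSSAALS", "AALLSSALLS", "SSLLAASSLL"]

for cluster in cluster_block(BLOCK, threshold=85):
    print(cluster)
```

With a threshold of 85 the first three segments end up in one cluster even though the first and third are only 80% identical, because each is 90% identical to the second. The fourth segment stays on its own.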

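Calculating the Matrix Entries
Let the following be determined from the observed data:
q_a = the fraction of pairings that include an a
p_ab = the fraction of pairings of a and b
Then
$$q_a = \frac{\sum_b A_{ab}}{\sum_{c,d} A_{cd}} \quad \text{and} \quad p_{ab} = \frac{A_{ab}}{\sum_{c,d} A_{cd}}$$
The score is calculated as
$$s(a, b) = \log \frac{p_{ab}}{q_a \, q_b}$$
These values are then scaled and rounded to make calculations easier.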
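A minimal Python sketch of this calculation follows. It assumes q_a is the row sum of A divided by the grand total, p_ab is A_ab divided by the grand total, and the score is log(p_ab / (q_a q_b)). The pair counts are invented for a three-letter alphabet, and the scaling to rounded half-bit units (2 times log base 2) is an assumption of this sketch, not something stated in the slides.

```python
import math

# Hypothetical symmetric pair counts A[a][b] for a three-letter alphabet.
A = {
    "A": {"A": 30, "L": 4, "S": 6},
    "L": {"A": 4, "L": 24, "S": 2},
    "S": {"A": 6, "L": 2, "S": 22},
}


def blosum_scores(counts, scale=2.0):
    """Return (q, scores), where scores[a][b] is the rounded, scaled
    log-odds value scale * log2(p_ab / (q_a * q_b))."""
    total = sum(sum(row.values()) for row in counts.values())
    # q_a: fraction of pairings that include residue a.
    q = {a: sum(row.values()) / total for a, row in counts.items()}
    scores = {}
    for a, row in counts.items():
        scores[a] = {}
        for b, n in row.items():
            p_ab = n / total            # fraction of pairings of a with b
            scores[a][b] = round(scale * math.log2(p_ab / (q[a] * q[b])))
    return q, scores


q, S = blosum_scores(A)
for a in A:
    print(a, [S[a][b] for b in A])
```

Pairs that occur more often than chance predicts get positive scores (here the diagonal), and pairs that occur less often get negative scores.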

If we set the limit, α, to 62, then we have a BLOSUM62 matrix.

The most popular BLOSUM matrices are BLOSUM62 and BLOSUM50. BLOSUM62 is used mainly for ungapped matching. BLOSUM50 is used for alignments with gaps. Note: the lower the number, the longer the time span in evolutionary units.

Differences Between PAM and BLOSUM
PAM assumes that substitution probabilities for highly related proteins can be extrapolated to the probabilities for distantly related proteins. BLOSUM matrices are based on the observation of more distantly related protein alignments. NOTE: Both types of matrices use log-odds values in their scoring systems.