Tutorial 3 – Protein Scoring Matrices PAM & BLOSUM

Tutorial 3 – Protein Scoring Matrices PAM & BLOSUM © Eran Barash, CS, Ben Gurion University

Aligning Protein Sequences: We assume that sequence similarity implies similarity in function. Is that true? Above 30% similarity, this is generally the case. Between 20% and 30% is a rather gray area. And the converse: does similarity in function imply sequence similarity?

Aligning Protein Sequences: Proteins consist of amino acids; concretely, the 20 standard proteinogenic amino acids. The task: align two protein sequences. Can the previous alignment algorithms be used? More specifically, how do amino acids differ from one another?

Aligning Protein Sequences: A few aspects need to be considered when evaluating the probability of one amino acid mutating to another: mutational distance; chemical properties (similarity/difference); evolutionary time.

Mutational Distance: Assume we start with methionine (Met), which is encoded by a single codon: ATG. To mutate Met to threonine (Thr), which is encoded by AC[ACGT], a single point mutation (one nucleotide change) is enough: ATG → ACG. In contrast, 3 point mutations are required to mutate Met to histidine (His), which is encoded by CA[TC]. His is therefore more distant from Met than Thr is.
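The codon argument above can be checked mechanically. The following is a minimal sketch, assuming the standard genetic code and using only the three codon sets quoted on the slide; the names CODONS, hamming and mutational_distance are illustrative and not part of the original tutorial.

```python
from itertools import product

# Codon sets as quoted on the slide (standard genetic code assumed).
CODONS = {
    "Met": ["ATG"],
    "Thr": ["ACA", "ACC", "ACG", "ACT"],
    "His": ["CAT", "CAC"],
}

def hamming(a, b):
    """Number of positions at which two equal-length codons differ."""
    return sum(x != y for x, y in zip(a, b))

def mutational_distance(aa1, aa2):
    """Minimum number of point mutations turning some codon of aa1 into some codon of aa2."""
    return min(hamming(c1, c2) for c1, c2 in product(CODONS[aa1], CODONS[aa2]))

print(mutational_distance("Met", "Thr"))  # 1
print(mutational_distance("Met", "His"))  # 3
```

Taking the minimum over all codon pairs matters for amino acids with several codons: the distance is defined by the closest pair, not by an arbitrary one.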

Amino acids’ chemical properties: size, structure, polarity, charge, acidity (pKa). These properties affect mutation probabilities.

Amino acids’ chemical properties: It is reasonable to assume that mutations which change function (by changing chemical properties) are selected against, and should therefore be considered less likely.

Evolutionary time: Time is another aspect that needs attention. Does a longer time permit fewer or more mutations? How can that be included in the scoring system?

PAM Matrices: PAM stands for Point Accepted Mutation (also commonly expanded as Percent Accepted Mutations). It was the first widely used scoring scheme for amino acid alignment, devised by Margaret Oakley Dayhoff and colleagues in 1978.

PAM Matrices: This model incorporates the observation that different pairs of amino acids substitute for one another at different rates. PAM matrices are denoted PAMn, where n is the number of accepted point mutations per 100 residues. n can be higher than 100, because the same site may mutate more than once.

Constructing PAM Matrices: Definitions. The frequency of an amino acid j: $f(j) = n(j)/N$, where n(j) is the number of its appearances and N is the total sequence length (over all alignments). The mutability of an amino acid j: $m(j) = \sum_{i \neq j} A(i,j) / n(j)$, where A(i,j) is the number of observed cases in which j mutated to i. M(i,j), the probability of j mutating to i ($i \neq j$), is: $M(i,j) = \lambda\, m(j)\, A(i,j) / \sum_{k \neq j} A(k,j)$, where $\lambda$ is a constant.

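Constructing PAM Matrices: $M(j,j) = 1 - \lambda\, m(j)$ is the diagonal of the M matrix, where m(j) is the mutability of j. $\lambda$ is a parameter meant to maintain 99% conservation of amino acids (PAM1). How do we choose $\lambda$? The number of conserved amino acids is $\sum_j n(j)\, M(j,j) = \sum_j n(j)\,(1 - \lambda\, m(j))$. If we divide it by N and require it to equal 99%, we get $\sum_j f(j)\,(1 - \lambda\, m(j)) = 1 - \lambda \sum_j f(j)\, m(j) = 0.99$, where f(j) is the frequency of j. Hence $\lambda = 0.01 / \sum_j f(j)\, m(j)$.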
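The construction can be sketched in a few lines of Python. This is only an illustration under stated assumptions: it uses the usual Dayhoff-style definitions (frequency f(j) = n(j)/N, mutability m(j) = observed changes of j divided by n(j), diagonal M(j,j) = 1 - lambda * m(j)), a hypothetical 3-letter alphabet, and made-up counts. The function name pam1 and the toy data are not from the original tutorial.

```python
def pam1(A, n):
    """Build a PAM1-style mutation matrix (columns are the 'from' amino acid).

    A[i][j]: observed number of cases where j mutated to i (diagonal ignored).
    n[j]:    number of occurrences of amino acid j.
    """
    k = len(n)
    N = sum(n)
    f = [n[j] / N for j in range(k)]                     # f(j) = n(j) / N
    changes = [sum(A[i][j] for i in range(k) if i != j) for j in range(k)]
    m = [changes[j] / n[j] for j in range(k)]            # mutability m(j)
    lam = 0.01 / sum(f[j] * m[j] for j in range(k))      # forces 99% conservation
    M = [[0.0] * k for _ in range(k)]
    for j in range(k):
        for i in range(k):
            if i != j and changes[j]:
                M[i][j] = lam * m[j] * A[i][j] / changes[j]
        M[j][j] = 1.0 - lam * m[j]
    return M, f

# Hypothetical substitution counts and occurrence counts for a 3-letter alphabet.
A = [[0, 3, 1],
     [3, 0, 2],
     [1, 2, 0]]
n = [50, 30, 20]
M, f = pam1(A, n)
conserved = sum(f[j] * M[j][j] for j in range(3))
print(round(conserved, 6))  # 0.99
```

Each column of M sums to 1, so column j can be read as the distribution over what amino acid j becomes after one PAM unit of evolution.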

We’ll take Alanine (A) as an example. The alignments: ABGH ABGH ABGH ABGH ABIJ ABIJ ABGH ABIJ ACGH DBGH ADIJ CBIJ

Constructing PAM Matrices: To be able to use $\lambda$ to normalize the probabilities, all mutations need to be observed. If, for example, the same site mutated twice, we could observe at most 1 mutation. Therefore, the sequences Dayhoff used were at least 85% identical, so it is reasonable to assume that each site (amino acid) experienced at most 1 mutation.

Constructing PAM Matrices: Now, according to the Markov chain model for amino acid substitutions, $\mathrm{PAM}_1 = M$, and the PAMn matrices are $\mathrm{PAM}_n = M^n$.
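Under the Markov chain model, a PAMn mutation matrix is the n-th power of the PAM1 matrix. The sketch below illustrates this with plain Python lists; the 3-letter matrix M is a hypothetical stand-in for a real 20 x 20 PAM1 matrix, and the helper names mat_mul and pam_n are illustrative.

```python
def mat_mul(X, Y):
    """Product of two square matrices given as lists of rows."""
    k = len(X)
    return [[sum(X[i][t] * Y[t][j] for t in range(k)) for j in range(k)]
            for i in range(k)]

def pam_n(M, n):
    """Return M raised to the n-th power by repeated multiplication (n >= 1)."""
    result = M
    for _ in range(n - 1):
        result = mat_mul(result, M)
    return result

# Hypothetical PAM1-like matrix: M[i][j] is the probability of j mutating to i,
# so every column sums to 1 and the diagonal is close to 0.99.
M = [[0.990, 0.008, 0.004],
     [0.006, 0.990, 0.006],
     [0.004, 0.002, 0.990]]

P250 = pam_n(M, 250)
column_sums = [sum(P250[i][j] for i in range(3)) for j in range(3)]
```

The columns of P250 still sum to 1, while the diagonal entries are much smaller than in M: after many steps most sites are expected to have changed, which is why n can exceed 100.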

The model’s assumptions: Only substitutions are allowed; there are no indels. Sites evolve independently: a mutation at one site has no effect on another. Evolution at each site follows a Markov chain model: the next mutation (state) depends on the current state and is independent of previous mutations.

Problem: PAM matrices work quite well for closely related sequences, which have diverged over a short evolutionary time. However, they seem to lack the ability to represent more distant (divergent) sequences, on a larger evolutionary time scale.

BLOSUM (BLOcks SUbstitution Matrix): Devised by Henikoff & Henikoff in 1992.

BLOSUM (BLOcks SUbstitution Matrix): Used to score alignments of evolutionarily divergent sequences. As the name hints, the scores are derived from local “blocks” of conserved sequences. Unlike PAM, the n in BLOSUMn represents the maximal percent identity between the sequences used to build the matrix (sequences more similar than n% are clustered together), and every BLOSUM matrix is computed directly from observations rather than extrapolated from another matrix.

Constructing BLOSUM: Conserved blocks in alignments:
AABCDA...BBCDA
AABCDA.A.BBCBB
BBBCDABA.BCCAA
AAACDAC.DCBCDB
CCBADAB.DBBDCC
AAACAA...BBCCC

Constructing BLOSUM: Let’s look at the first column:
AABCDA...BBCDA
AABCDA.A.BBCBB
BBBCDABA.BCCAA
AAACDAC.DCBCDB
CCBADAB.DBBDCC
AAACAA...BBCCC
First column: A A B A C A

Constructing BLOSUM: Let’s look at the first column (A A B A C A): How many AB pairs are there? 4, since each of the four A’s pairs once with the single B.

Constructing BLOSUM: Let’s look at the first column (A A B A C A). Similarly, there are: 6 AA pairs, 4 AB pairs, 4 AC pairs, and 1 CB pair.
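The pair counts for a column can be reproduced by enumerating all unordered pairs of residues in it. A minimal sketch, using the first column from the slide; the function name count_pairs is illustrative.

```python
from collections import Counter
from itertools import combinations

def count_pairs(column):
    """Count unordered residue pairs in one alignment column.

    Each pair is stored under its alphabetically sorted name, so 'CB' and 'BC'
    are counted together as 'BC'.
    """
    return Counter("".join(sorted(pair)) for pair in combinations(column, 2))

pairs = count_pairs("AABACA")
print(pairs["AA"], pairs["AB"], pairs["AC"], pairs["BC"])  # 6 4 4 1
```

A column of 6 residues contains 6 * 5 / 2 = 15 pairs in total, which matches 6 + 4 + 4 + 1.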

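Constructing BLOSUM: We’ll define $f_{ij}^{(k)}$ as the number of occurrences of the pair ij in column k, and $f_{ij} = \sum_k f_{ij}^{(k)}$. To work with frequencies rather than sums, we’ll use the total number of pairs, $T = m \cdot s(s-1)/2$ (m is the number of columns, s the number of sequences), and define $q_{ij} = f_{ij}/T$ as ij’s frequency.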

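Constructing BLOSUM: The expected frequency of i: $p_i = q_{ii} + \frac{1}{2}\sum_{j \neq i} q_{ij}$, where $q_{ij}$ is the observed frequency of the pair ij. And the expected frequency of the pair ij (assuming independence): $e_{ij} = p_i^2$ if $i = j$, and $e_{ij} = 2\, p_i\, p_j$ if $i \neq j$.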

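Constructing BLOSUM: And finally, the score of j mutating to i is: $s_{ij} = 2 \log_2 (q_{ij} / e_{ij})$, rounded to the nearest integer, where $q_{ij}$ is the observed frequency of the pair ij and $e_{ij}$ is its expected frequency. The factor 2 expresses the score in half-bit units.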
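The whole BLOSUM computation, from pair counts to rounded scores, can be sketched as follows. This is an illustration under assumptions, not the tutorial's own code: it assumes a gap-free block, the standard Henikoff-style definitions (observed pair frequency q, residue frequency p, expected pair frequency e), a score of 2 * log2(q / e) in half-bit units, and it skips the sequence clustering step. The name blosum_scores is hypothetical, and pairs that never occur in the block get no score.

```python
from collections import Counter
from itertools import combinations
from math import log2

def blosum_scores(block):
    """Log-odds scores for a list of equal-length, gap-free sequences."""
    # f: number of occurrences of each unordered pair, summed over all columns.
    f = Counter()
    for column in zip(*block):
        for a, b in combinations(column, 2):
            f[tuple(sorted((a, b)))] += 1
    total = sum(f.values())              # m * s * (s - 1) / 2 pairs in total
    q = {pair: count / total for pair, count in f.items()}

    # p: expected frequency of each residue, derived from the pair frequencies.
    p = Counter()
    for (a, b), freq in q.items():
        if a == b:
            p[a] += freq
        else:
            p[a] += freq / 2
            p[b] += freq / 2

    # Score = 2 * log2(observed / expected), rounded to the nearest integer.
    scores = {}
    for (a, b), freq in q.items():
        expected = p[a] * p[a] if a == b else 2 * p[a] * p[b]
        scores[(a, b)] = round(2 * log2(freq / expected))
    return scores

# The first column of the example block, given as six one-residue sequences.
scores = blosum_scores(["A", "A", "B", "A", "C", "A"])
print(scores[("A", "A")], scores[("A", "B")], scores[("B", "C")])  # 0 1 1
```

For this single column, q(AA) = 6/15 and e(AA) = (4/6)^2, so the ratio is 0.9 and the score rounds to 0; for AB the ratio is 1.2, which rounds to 1.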

BLOSUM62