Tutorial 3 – Protein Scoring Matrices PAM & BLOSUM

© Eran Barash, CS, Ben Gurion University

Aligning Protein Sequences
We assume: sequence similarity ⇒ similarity in function. Is that true? Above 30% sequence similarity, this is generally the case. Between 20% and 30% is a rather gray area. Does the converse hold: similarity in function ⇒ sequence similarity?

Proteins consist of amino acids; concretely, 20 proteinogenic amino acids. The task: align two protein sequences. Can the previous alignment algorithms be used? More specifically, how do amino acids differ from one another?

A few aspects need to be considered when evaluating the probability of one amino acid mutating to another: mutational distance, chemical properties (similarity/difference), and evolutionary time.

Mutational Distance
Assume we start with Methionine (Met), which is encoded by a single codon: ATG. To mutate Met to Thr (Threonine), which is encoded by AC[ACGT], a single point mutation (one nucleotide substitution) is enough: ATG → ACG. In contrast, 3 point mutations are required to mutate Met to His (Histidine), which is encoded by CA[TC]. Thus, His is more distant from Met than Thr is.
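The codon comparison above can be checked mechanically. The following is a minimal sketch, not part of the original slides: it expands the codon sets quoted in the text by hand and takes the smallest Hamming distance between any source codon and any target codon. The names `hamming` and `mutational_distance` are illustrative choices.

```python
from itertools import product

# Codon sets taken from the text; AC[ACGT] and CA[TC] are expanded by hand.
MET = ["ATG"]
THR = ["ACA", "ACC", "ACG", "ACT"]
HIS = ["CAT", "CAC"]

def hamming(a, b):
    """Number of positions at which two equal-length codons differ."""
    return sum(x != y for x, y in zip(a, b))

def mutational_distance(codons_from, codons_to):
    """Smallest number of point mutations turning any source codon into any target codon."""
    return min(hamming(a, b) for a, b in product(codons_from, codons_to))

print(mutational_distance(MET, THR))  # 1 (ATG -> ACG)
print(mutational_distance(MET, HIS))  # 3
```

Taking the minimum over all codon pairs matters for amino acids with several codons: the distance is defined by the closest pair, not by an arbitrary one.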

Amino acids’ chemical properties
Size, structure, polarity, charge and acidity (pKa). These properties affect mutation probabilities.

Amino acids’ chemical properties
It is fairly reasonable to assume that mutations which change functionality (chemical properties) are selected against, and should therefore be considered less likely.

Evolutionary time
Time is another aspect which needs attention. Does a longer time permit fewer or more mutations? How can that be included in the scoring system?

PAM Matrices
PAM stands for Point Accepted Mutation (often also expanded as Percent Accepted Mutation). It was the first widely used scoring scheme for amino acid alignment, devised by Margaret Oakley Dayhoff and co-workers in 1978.

PAM Matrices
This model incorporated the observation that different pairs of amino acids mutate at different rates. PAM matrices are denoted PAMn, where n is the number of accepted point mutations per 100 residues (n can be higher than 100, since a site may mutate more than once).

Constructing PAM Matrices
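Definitions:
An amino acid j's frequency: $f(j) = n(j)/N$, where n(j) is the number of its appearances and N is the total sequence length (over all alignments).
An amino acid's mutability: $m(j) = \frac{1}{n(j)} \sum_{i \neq j} A(i,j)$, where A(i,j) is the number of observed cases in which j mutated to i.
M(i,j), the probability of j mutating to i ($i \neq j$), is: $M(i,j) = \lambda\, m(j)\, \frac{A(i,j)}{\sum_{k \neq j} A(k,j)}$, where $\lambda$ is a constant.

$M(j,j) = 1 - \lambda\, m(j)$ is the diagonal of the M matrix. $\lambda$ is a parameter meant to maintain 99% conservation of amino acids (PAM1). How to choose $\lambda$? The number of conserved amino acids is: $\sum_j n(j)\, M(j,j) = \sum_j n(j)\,(1 - \lambda\, m(j))$. If we divide it by N and demand that it equal 99% we get: $\sum_j f(j)\,(1 - \lambda\, m(j)) = 1 - \lambda \sum_j f(j)\, m(j) = 0.99$. And now we can get $\lambda = 0.01 / \sum_j f(j)\, m(j)$.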
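To make the construction concrete, here is a minimal sketch on a toy 3-letter alphabet. It assumes the usual Dayhoff-style definitions: frequency f(j) = n(j)/N, mutability m(j) = (sum over i ≠ j of A(i,j)) / n(j), off-diagonal entries M(i,j) = λ·m(j)·A(i,j) / (sum over k ≠ j of A(k,j)), diagonal M(j,j) = 1 − λ·m(j), and λ = 0.01 / Σ f(j)·m(j). All counts are made up for illustration and do not come from Dayhoff's data.

```python
# Hypothetical toy data over a 3-letter alphabet; the numbers are invented.
ALPHABET = ["A", "B", "C"]
n = {"A": 50, "B": 30, "C": 20}  # n(j): appearances of each residue

# A[i][j]: observed cases of j mutating to i (zero diagonal).
A = {
    "A": {"A": 0, "B": 4, "C": 2},
    "B": {"A": 4, "B": 0, "C": 1},
    "C": {"A": 2, "B": 1, "C": 0},
}

N = sum(n.values())
f = {j: n[j] / N for j in ALPHABET}  # frequency f(j)
changes = {j: sum(A[i][j] for i in ALPHABET if i != j) for j in ALPHABET}
m = {j: changes[j] / n[j] for j in ALPHABET}  # mutability m(j)

# Choose lambda so that 99% of residues are conserved (PAM1).
lam = 0.01 / sum(f[j] * m[j] for j in ALPHABET)

# M[i][j]: probability of j mutating to i; each column j sums to 1.
M = {i: {} for i in ALPHABET}
for j in ALPHABET:
    for i in ALPHABET:
        if i == j:
            M[i][j] = 1 - lam * m[j]
        else:
            M[i][j] = lam * m[j] * A[i][j] / changes[j]

conserved = sum(f[j] * M[j][j] for j in ALPHABET)
print(round(conserved, 6))  # close to 0.99 by construction
```

The final check recomputes the conserved fraction from the finished matrix, which is a quick way to confirm that λ was chosen correctly.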

Université libre de Bruxelles

We’ll take Alanine (A) as an example:
The alignments: ABGH ABGH ABGH ABGH ABIJ ABIJ ABGH ABIJ ACGH DBGH ADIJ CBIJ

To be able to use $\lambda$ to normalize the probabilities, all mutations need to be observed. If, for example, the same site mutated twice, we could account for at most 1 mutation. Therefore, Dayhoff used sequences that were at least 85% identical, so it is fairly reasonable to assume that each site (amino acid) experienced at most 1 mutation.

Now, according to the Markov Chain model for amino acid substitutions, $\mathrm{PAM}_1 = M$, and the PAMn matrices are: $\mathrm{PAM}_n = M^n$.
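Under the Markov Chain model, a PAMn matrix is obtained by multiplying the one-step matrix by itself n times. The sketch below illustrates this with plain Python lists. The 2-letter matrix is hypothetical and only serves to show the mechanics; the helpers `mat_mul` and `mat_pow` are illustrative names.

```python
def mat_mul(X, Y):
    """Product of two square matrices given as lists of rows."""
    size = len(X)
    return [[sum(X[i][k] * Y[k][j] for k in range(size)) for j in range(size)]
            for i in range(size)]

def mat_pow(M, n):
    """M multiplied by itself n times (n >= 0), starting from the identity."""
    size = len(M)
    result = [[float(i == j) for j in range(size)] for i in range(size)]
    for _ in range(n):
        result = mat_mul(result, M)
    return result

# Hypothetical 2-letter one-step matrix: column j holds the probabilities
# of residue j mutating to each residue i, so every column sums to 1.
PAM1 = [[0.99, 0.02],
        [0.01, 0.98]]

PAM250 = mat_pow(PAM1, 250)
for row in PAM250:
    print([round(x, 3) for x in row])
```

As n grows, the diagonal entries shrink and every column approaches the same limiting distribution, which is why matrices with large n describe distant relationships.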

The model’s assumptions
Only substitutions are allowed; no indels. Sites evolve independently: a mutation at one site has no effect on another. Evolution at each site occurs according to a Markov Chain model: the next mutation (state) depends on the current state and is independent of previous mutations.

Problem
PAM matrices work quite well for closely related sequences, i.e. over short evolutionary times. However, they seem to lack the ability to represent more distant/divergent sequences, on a larger evolutionary time scale.

BLOSUM (BLOcks SUbstitution Matrix)
Devised by Henikoff & Henikoff in 1992.

BLOSUM (BLOcks SUbstitution Matrix)
Used to score alignments of evolutionarily divergent sequences. As the name hints, the scores are extracted from local “blocks” of conserved sequences. Unlike PAM, the n in BLOSUMn represents the maximal similarity (percent identity) between the sequences used, and every BLOSUM matrix is computed directly from observations rather than extrapolated from a single matrix.

Constructing BLOSUM
Conserved blocks in alignments:
AABCDA...BBCDA
DABCDA.A.BBCBB
BBBCDABA.BCCAA
AAACDAC.DCBCDB
CCBADAB.DBBDCC
AAACAA...BBCCC

Constructing BLOSUM
Let’s look at the first column: A A B A C A
How many AB pairs are there? Each of the 4 A's can be paired with the single B, giving 4 AB pairs. Similarly, there are 6 AA pairs, 4 AC pairs and 1 CB pair, 15 pairs in total.
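The pair counts for a column can be reproduced by enumerating every unordered pair of residues in it. This is a minimal sketch using the column A A B A C A quoted in the text; `count_pairs` is an illustrative name, and pairs are stored with their letters sorted so that BA and AB are counted together.

```python
from collections import Counter
from itertools import combinations

def count_pairs(column):
    """Count unordered residue pairs in one alignment column."""
    return Counter("".join(sorted(pair)) for pair in combinations(column, 2))

column = "AABACA"  # the first column as listed in the text
pairs = count_pairs(column)
for pair, count in sorted(pairs.items()):
    print(pair, count)
```

A column of 6 residues always yields 6·5/2 = 15 pairs, which is a convenient sanity check on the counts.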

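Constructing BLOSUM
We’ll define $c_{ij}^{(k)}$ as the number of occurrences of the pair ij in column k, and $c_{ij} = \sum_k c_{ij}^{(k)}$. To work with frequencies rather than sums, we’ll use the total number of pairs: $T = m \cdot \frac{s(s-1)}{2}$ (m: number of columns, s: number of sequences) and define $q_{ij} = c_{ij}/T$ as ij’s frequency.

Constructing BLOSUM
The expected occurrences of i: $p_i = q_{ii} + \frac{1}{2}\sum_{j \neq i} q_{ij}$. And the expected occurrences of the pair ij (assuming independence): $e_{ij} = p_i^2$ for $i = j$ and $e_{ij} = 2\,p_i\,p_j$ for $i \neq j$.

Constructing BLOSUM
And finally, the score of j mutating to i is the log-odds ratio $s_{ij} = 2\log_2(q_{ij}/e_{ij})$ (half-bit units), rounded to the nearest integer.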
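The whole scoring procedure can be sketched in a few lines. The code below assumes the standard Henikoff-style formulas: observed pair frequency q_ij = c_ij / T, residue frequency p_i = q_ii + ½ Σ q_ij over j ≠ i, expected pair frequency e_ij = p_i² (for i = j) or 2·p_i·p_j (for i ≠ j), and score s_ij = 2·log2(q_ij / e_ij) in half-bit units. It scores only pairs that were actually observed, since the logarithm is undefined for a zero count. The function name `blosum_scores` and the single-column input are illustrative.

```python
import math
from collections import Counter, defaultdict
from itertools import combinations

def blosum_scores(columns):
    """Rounded log-odds scores (half-bit units) for every observed residue pair."""
    counts = Counter()
    for col in columns:
        counts.update("".join(sorted(p)) for p in combinations(col, 2))
    total = sum(counts.values())  # total number of pairs over all columns
    q = {pair: c / total for pair, c in counts.items()}

    # p[i]: expected occurrence of residue i among all pair members
    p = defaultdict(float)
    for (i, j), freq in q.items():
        if i == j:
            p[i] += freq
        else:
            p[i] += freq / 2
            p[j] += freq / 2

    scores = {}
    for (i, j), freq in q.items():
        expected = p[i] * p[j] if i == j else 2 * p[i] * p[j]
        scores[i + j] = round(2 * math.log2(freq / expected))
    return scores

print(blosum_scores(["AABACA"]))
```

With a single short column the scores are dominated by sampling noise; real BLOSUM matrices pool counts over many blocks and cluster similar sequences before counting.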

BLOSUM62
