1 Amino Acid Scoring Matrices Jason Davis

2 Overview
  Protein synthesis/evolution
  Computational sequence alignment
    Smith-Waterman Algorithm
    BLAST
  Amino Acid Scoring Matrices
    PAM: Point Accepted Mutations
    BLOSUM: BLOck SUbstitution Matrix
    mPAM
    Metric Conversions

3 Proteins
  3-dimensional structures
  Composed of amino acids chained together
  Can be represented as a linear (1-dimensional) sequence
  20 different amino acids exist
  Usually 100-1500 amino acids long
  Have many different shapes and functions
  Function depends on both 3D shape and amino acid sequence

4 Protein Synthesis
  DNA: strand composed of 4 different bases
    A, T, C, G
  20 amino acids: 3 bases (one codon) needed to encode each amino acid
    Degenerate coding: several codons can encode the same amino acid
  [Diagram: Signalling, Transcription/Translation, Protein]
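The codon-length argument on this slide can be checked by counting. The sketch below is only an illustration; the helper names `codons` and `min_codon_length` are made up for this example.

```python
from itertools import product

BASES = "ATCG"

def codons(length):
    """All DNA words of the given length over the four bases."""
    return ["".join(word) for word in product(BASES, repeat=length)]

def min_codon_length(n_amino_acids):
    """Smallest word length whose word count covers n_amino_acids."""
    length = 1
    while len(BASES) ** length < n_amino_acids:
        length += 1
    return length

# Two bases give 16 words (too few for 20 amino acids); three give 64.
print(len(codons(2)), len(codons(3)), min_codon_length(20))
```

Since 64 codons cover only 20 amino acids (plus stop signals), the code has to be degenerate.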

5 Protein Evolution
  Protein 'families'
    Set of homologous proteins
    Same function, different composition
    Similar structure
  Identifying families
    Pairwise sequence alignment
    Multiple sequence alignment
      NP-hard
    Other approaches
      Structural, experimental

6 Pairwise Sequence Alignment
  Input
    2 sequences p, q of lengths m, n
    20x20 Amino Acid Substitution Matrix
    Insertion (gap) cost
  Global Alignment
    Find the set of insertions such that the resulting alignment (length < m+n) is optimal w.r.t. the amino acid substitution matrix
    Difficult, less useful
  Local Alignment
    Find significant 'hotspot' in the alignment
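A minimal sketch of the global alignment problem, assuming a linear gap cost and the standard Needleman-Wunsch recurrence (the slide does not name a specific algorithm). `toy_score` is a hypothetical stand-in for a real 20x20 substitution matrix.

```python
def global_alignment_score(p, q, score, gap):
    """Best global alignment score of p and q with a linear gap penalty.

    score(a, b) plays the role of the substitution matrix; gap is the
    (negative) cost of aligning one residue against an insertion.
    """
    m, n = len(p), len(q)
    # F[i][j] = best score for aligning p[:i] with q[:j]
    F = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        F[i][0] = i * gap
    for j in range(1, n + 1):
        F[0][j] = j * gap
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            F[i][j] = max(
                F[i - 1][j - 1] + score(p[i - 1], q[j - 1]),  # substitute
                F[i - 1][j] + gap,                            # gap in q
                F[i][j - 1] + gap,                            # gap in p
            )
    return F[m][n]

def toy_score(a, b):
    """Hypothetical scoring: +2 for a match, -1 for a mismatch."""
    return 2 if a == b else -1

print(global_alignment_score("ABC", "AC", toy_score, -2))
```

The table has (m+1)(n+1) cells, each filled in constant time.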

7 Sequence Alignment Algorithms
  Dynamic Programming Approaches
    Global and Local variations
    Provably Optimal
    O(nm) space and time
    'Banded' heuristics can reduce the state space
    FSA extensions allow varying penalties for gap openings and gap extensions
  Heuristic Approaches
    BLAST, FASTA
    Sublinear time: look for statistical significance in small local alignments between sequences
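The local variation of the dynamic program can be sketched as a Smith-Waterman scoring pass. This is an illustrative version with a linear gap penalty only (no separate gap-open and gap-extend costs), and `toy_score` is a hypothetical substitute for a real substitution matrix.

```python
def smith_waterman_score(p, q, score, gap):
    """Best local alignment score of p and q (linear gap penalty)."""
    m, n = len(p), len(q)
    # H[i][j] = best score of a local alignment ending at p[i-1], q[j-1]
    H = [[0] * (n + 1) for _ in range(m + 1)]
    best = 0
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            H[i][j] = max(
                0,                                            # restart here
                H[i - 1][j - 1] + score(p[i - 1], q[j - 1]),  # substitute
                H[i - 1][j] + gap,                            # gap in q
                H[i][j - 1] + gap,                            # gap in p
            )
            best = max(best, H[i][j])
    return best

def toy_score(a, b):
    """Hypothetical scoring: +2 for a match, -1 for a mismatch."""
    return 2 if a == b else -1

# The shared segment HEAG is the only 'hotspot' in these two strings.
print(smith_waterman_score("XXHEAGYY", "ZZHEAGWW", toy_score, -2))
```

The clamp at zero only helps if mismatches score below zero, which is why local alignment needs negative entries in the scoring matrix.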

8 Substitution Matrices - PAM
  Dayhoff, Schwartz, Orcutt (1978)
  Step 1: extrapolate mutation probabilities from 1 step in evolutionary time
    Pick a set of protein families (71)
    Restrict proteins in each family to sequences with similarity above a certain threshold (>85%)
    Build a phylogenetic tree for each family
    Extrapolate frequencies $A_{ab}$ that amino acids a, b evolved from the same amino acid
      $A_{ab}$ and $A_{ba}$ assumed to be the same
    Convert frequencies to probabilities
      $p(a|b) = B_{ab} = A_{ab} / \sum_c A_{ac}$
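The last step, turning the symmetric counts $A_{ab}$ into probabilities $B_{ab}$, is a row normalization. A small sketch follows; the 3x3 counts are invented for illustration and are not Dayhoff's data.

```python
def row_normalize(A):
    """B[a][b] = A[a][b] / sum_c A[a][c], so each row of B sums to 1."""
    B = []
    for row in A:
        total = sum(row)
        B.append([count / total for count in row])
    return B

# Hypothetical symmetric counts for a 3-letter alphabet.
A = [
    [90, 6, 4],
    [6, 80, 14],
    [4, 14, 82],
]
B = row_normalize(A)
print(B[0])
```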

9 Substitution Matrices - PAM (2)
  Step 2: infer greater evolutionary times
    Dayhoff defined a PAM1 matrix to have 1% expected substitutions
      For each row, scale off-diagonals and adjust diagonals to keep the matrix row stochastic
    To infer larger evolutionary times, we can view the resulting matrix C as a 20-state Markov chain
      $C^n$ is the result of performing n steps in the Markov process
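One way to read the scaling step: pick a single factor for all off-diagonal entries so that the frequency-weighted substitution rate is 1%, then reset each diagonal so rows still sum to 1. The sketch below assumes that reading; the 3x3 matrix `B` and frequencies `f` are toy values, and `scale_to_pam1`, `mat_mul`, `mat_pow` are names invented here.

```python
def scale_to_pam1(B, f, target=0.01):
    """Rescale row-stochastic B so that sum_a f[a] * (1 - C[a][a]) == target."""
    n = len(B)
    rate = sum(f[a] * (1.0 - B[a][a]) for a in range(n))
    s = target / rate
    # Scale the off-diagonals, then set each diagonal so the row sums to 1.
    C = [[s * B[a][b] if a != b else 0.0 for b in range(n)] for a in range(n)]
    for a in range(n):
        C[a][a] = 1.0 - sum(C[a])
    return C

def mat_mul(X, Y):
    n = len(X)
    return [[sum(X[i][k] * Y[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def mat_pow(C, steps):
    """C multiplied by itself `steps` times (steps Markov transitions)."""
    n = len(C)
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for _ in range(steps):
        result = mat_mul(result, C)
    return result

B = [[0.90, 0.06, 0.04], [0.06, 0.80, 0.14], [0.04, 0.14, 0.82]]
f = [1 / 3, 1 / 3, 1 / 3]
C = scale_to_pam1(B, f)      # toy analogue of PAM1
C250 = mat_pow(C, 250)       # toy analogue of the PAM250 transition matrix
```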

10 Substitution Matrices - PAM (3)
  Create odds ratio of
    1) the event that 2 amino acids i, j evolved from the same ancestor, x
      $f_i$ = observed frequency of amino acid i
      $p(i,j \text{ have same ancestor}) = \sum_x f_x \Pr\{x \to i\} \Pr\{x \to j\} = \sum_x f_x (C^N)_{ix} (C^N)_{jx} = \sum_x (C^N)_{ix} f_x (C^N)_{jx} = \sum_x (C^N)_{ix} f_j (C^N)_{xj} = f_j (C^{2N})_{ij}$
      (replacing $f_x (C^N)_{jx}$ with $f_j (C^N)_{xj}$ assumes the chain is reversible)
    2) the event that the 2 amino acids align at random
      $p(\text{independent alignment of } i,j) = f_i f_j$
  Final log odds ratio:
    $D_{ij} = \text{average}[\log((C^N)_{ij} / f_i), \log((C^N)_{ji} / f_j)]$
    The log allows for an additive model
    Final numbers are rounded to nearest integer
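The final log-odds formula can be sketched directly from a transition matrix M (standing for $C^N$) and frequencies f. The base-10 logarithm and the scale factor of 10 are assumptions of this sketch, since the slide does not state them, and the 2x2 inputs are toy values.

```python
import math

def log_odds_matrix(M, f, scale=10.0):
    """D[i][j] = round(scale * average(log10(M[i][j]/f[i]), log10(M[j][i]/f[j])))."""
    n = len(M)
    D = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            forward = math.log10(M[i][j] / f[i])
            backward = math.log10(M[j][i] / f[j])
            # Averaging the two directions makes D symmetric.
            D[i][j] = round(scale * (forward + backward) / 2)
    return D

M = [[0.7, 0.3], [0.3, 0.7]]   # hypothetical N-step transition matrix
f = [0.5, 0.5]                 # hypothetical amino acid frequencies
print(log_odds_matrix(M, f))
```

Positive entries mark pairs seen more often than chance, negative entries pairs seen less often.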

11 PAM250
  Different values on the diagonal correspond to mutability potential

12 BLOSUM
  Henikoff & Henikoff, 1992
  Uses aligned, ungapped blocks within protein families that have similarity greater than some level L%
    $q_a = \sum_b A_{ab} / \sum_{c,d} A_{cd}$
    $p_{ab} = A_{ab} / \sum_{c,d} A_{cd}$
    $S(a,b) = \log(p_{ab} / (q_a q_b))$
    Final entries are rounded
    BLOSUM62 (L=62), BLOSUM50 (L=50)
  More direct approach, usually yields better results
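The three formulas on this slide translate almost line for line into code. In this sketch A is assumed to be a symmetric matrix of ordered pair counts, and the base-2 logarithm with a scale of 2 (half-bit units) is an assumption not stated on the slide. The counts are invented.

```python
import math

def blosum_like_scores(A, scale=2.0):
    """S[a][b] = round(scale * log2(p_ab / (q_a * q_b))) from pair counts A."""
    n = len(A)
    total = sum(sum(row) for row in A)
    q = [sum(row) / total for row in A]          # q_a: marginal frequency
    S = [[0] * n for _ in range(n)]
    for a in range(n):
        for b in range(n):
            p_ab = A[a][b] / total               # p_ab: pair frequency
            # A zero count would need smoothing; this sketch assumes none.
            S[a][b] = round(scale * math.log2(p_ab / (q[a] * q[b])))
    return S

A = [[40, 10], [10, 40]]   # hypothetical pair counts for a 2-letter alphabet
print(blosum_like_scores(A))
```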

13 Log-Odds Similarity Matrix Properties
  Negative numbers needed for Smith-Waterman local alignment algorithm
  Nice probabilistic interpretation
    Amino acid substitutions assumed independent
  Attempts to metricize these matrices
    Taylor, Jones 93: used various algebraic manipulations to arrive at a metric matrix with minimal distortion
      $D_{ij} = a - S_{ij}$
      Larger values of a yielded better metrics at the cost of high dimensionality
      Constant Shift Embedding
    Linial et al. constructed a near metric over aligned segments of length 50
      $D(u,v) = S(u,u) + S(v,v) - 2 S(u,v)$
      $10^{-7}$ error rate
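The transform $D(u,v) = S(u,u) + S(v,v) - 2 S(u,v)$ and a brute-force triangle inequality check can be sketched as below. Note the simplification: the slide applies the transform to scores of aligned segments, while this sketch applies it to the entries of a small similarity matrix. Both toy matrices are invented.

```python
from itertools import permutations

def similarity_to_distance(S):
    """D[u][v] = S[u][u] + S[v][v] - 2 * S[u][v]; D[u][u] is always 0."""
    n = len(S)
    return [[S[u][u] + S[v][v] - 2 * S[u][v] for v in range(n)]
            for u in range(n)]

def triangle_violations(D):
    """All (u, v, w) with D[u][w] > D[u][v] + D[v][w]."""
    n = len(D)
    return [(u, v, w) for u, v, w in permutations(range(n), 3)
            if D[u][w] > D[u][v] + D[v][w]]

S_ok = [[4, 1, -2], [1, 5, 0], [-2, 0, 6]]     # hypothetical similarities
S_bad = [[1, 0, -5], [0, 1, 0], [-5, 0, 1]]    # chosen to break the inequality

print(triangle_violations(similarity_to_distance(S_ok)))
print(triangle_violations(similarity_to_distance(S_bad)))
```

The transform guarantees a zero diagonal and symmetry, but not the triangle inequality, so the result is in general only a near metric.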

14 mPAM
  Metric substitution model
  Measures the expected time per 250 mutations among 100 amino acids
    Same rate as PAM250
  Exponential distribution assumed: $f(t) = 1 - e^{-\lambda t}$
  Given pairwise substitution rates p(a,b)
    Solve for $\lambda$: $f(1) = 1 - e^{-\lambda} = p(a,b)$
    Expected time t of an event occurring in an exponential distribution is $1/\lambda$
    $\text{mPAM}(a,b) = \text{round}(1/\lambda)$
  Two values needed to be adjusted to form a metric
    Rounding error?
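Solving $1 - e^{-\lambda} = p(a,b)$ gives $\lambda = -\ln(1 - p(a,b))$, so a single entry can be computed as in this sketch. The function name `mpam_entry` and the sample probabilities are illustrative, not values from the mPAM matrix.

```python
import math

def mpam_entry(p):
    """Expected waiting time round(1 / lambda), where 1 - exp(-lambda) = p."""
    if not 0.0 < p < 1.0:
        raise ValueError("p must lie strictly between 0 and 1")
    lam = -math.log(1.0 - p)
    return round(1.0 / lam)

# Rarer substitutions (smaller p) map to larger distances.
for p in (0.5, 0.1):
    print(p, mpam_entry(p))
```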

15 mPAM (2)
  Sellers' Theorem:
    If a pairwise alignment is found using a metric, the resulting alignment scores also form a metric
  Optimized for BLAST-like lookup
    Smaller alignments
  Difficult to compare with other similarity matrices
    Dynamic programming algorithms rely on negative values in the similarity matrix
    Probabilistic interpretation: larger positive alignments are statistically significant
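The theorem can be spot-checked on small inputs. The sketch below computes a minimum-cost global alignment distance using a unit substitution metric (0 for identical letters, 1 otherwise) and a gap cost of 1, which is a stand-in for mPAM, and then tests the triangle inequality on a few made-up strings. It is a numerical check, not a proof.

```python
from itertools import permutations

def alignment_distance(p, q, dist, gap):
    """Minimum total cost of a global alignment of p and q."""
    m, n = len(p), len(q)
    D = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        D[i][0] = i * gap
    for j in range(1, n + 1):
        D[0][j] = j * gap
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            D[i][j] = min(
                D[i - 1][j - 1] + dist(p[i - 1], q[j - 1]),  # substitute
                D[i - 1][j] + gap,                           # gap in q
                D[i][j - 1] + gap,                           # gap in p
            )
    return D[m][n]

def unit_dist(a, b):
    """Simplest metric on letters: 0 if equal, 1 otherwise."""
    return 0 if a == b else 1

def satisfies_triangle(words, dist, gap):
    d = lambda u, v: alignment_distance(u, v, dist, gap)
    return all(d(u, w) <= d(u, v) + d(v, w)
               for u, v, w in permutations(words, 3))

words = ["HEAG", "HAG", "PAWHEAE", "HEAGAWGHEE"]
print(satisfies_triangle(words, unit_dist, 1))
```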

16 mPAM Disadvantages
  d(x,x) = 0
    This does not capture the relative mutability among different amino acids
    PAM/BLOSUM capture this with different positive values along the diagonal
  Do amino acids substitute according to an exponential distribution?
  Amino Acid Substitution may be inherently non-metric
  Comparison to BLOSUM?

