PAM250. M. Dayhoff Scoring Matrices Point Accepted Mutations or PAM matrices Proteins with 85% identity were used -> the function is not significantly.

Slides:



Advertisements
Similar presentations
Substitution matrices
Advertisements

1 Introduction to Sequence Analysis Utah State University – Spring 2012 STAT 5570: Statistical Bioinformatics Notes 6.1.
Measuring the degree of similarity: PAM and blosum Matrix
Optimatization of a New Score Function for the Detection of Remote Homologs Kann et al.
Heuristic alignment algorithms and cost matrices
Multiple Sequence Alignment Algorithms in Computational Biology Spring 2006 Most of the slides were created by Dan Geiger and Ydo Wexler and edited by.
. Multiple Sequence Alignment Tutorial #4 © Ilan Gronau.
Scoring Matrices June 19, 2008 Learning objectives- Understand how scoring matrices are constructed. Workshop-Use different BLOSUM matrices in the Dotter.
Alignment methods and database searching April 14, 2005 Quiz#1 today Learning objectives- Finish Dotter Program analysis. Understand how to use the program.
Scoring Matrices June 22, 2006 Learning objectives- Understand how scoring matrices are constructed. Workshop-Use different BLOSUM matrices in the Dotter.
BNFO 240 Usman Roshan. Last time Traceback for alignment How to select the gap penalties? Benchmark alignments –Structural superimposition –BAliBASE.
. Multiple Sequence Alignment Tutorial #4 © Ilan Gronau.
Multiple Sequence Alignment Mult-Seq-Align allows to detect similarities which cannot be detected with Pairwise-Seq-Align methods. Detection of family.
Similar Sequence Similar Function Charles Yan Spring 2006.
Sequence Alignment III CIS 667 February 10, 2004.
Multiple Sequence Alignment Mult-Seq-Align allows to detect similarities which cannot be detected with Pairwise-Seq-Align methods. Detection of family.
Scoring matrices Identity PAM BLOSUM.
. Sequence Alignment Tutorial #3 © Ydo Wexler & Dan Geiger.
Sequence Alignments Revisited
Alignment III PAM Matrices. 2 PAM250 scoring matrix.
Bioinformatics Workshop, Fall 2003 Algorithms in Bioinformatics Lawrence D’Antonio Ramapo College of New Jersey.
Multiple sequence alignment methods 1 Corné Hoogendoorn Denis Miretskiy.
Multiple Sequence Alignment
CECS Introduction to Bioinformatics University of Louisville Spring 2003 Dr. Eric Rouchka Lecture 3: Multiple Sequence Alignment Eric C. Rouchka,
Dayhoff’s Markov Model of Evolution. Brands of Soup Revisited Brand A Brand B P(B|A) = 2/7 P(A|B) = 2/7.
. Multiple Sequence Alignment Tutorial #4 © Ilan Gronau.
CISC667, F05, Lec8, Liao CISC 667 Intro to Bioinformatics (Fall 2005) Multiple Sequence Alignment Scoring Dynamic Programming algorithms Heuristic algorithms.
Multiple Sequence Alignment S 1 = AGGTC S 2 = GTTCG S 3 = TGAAC Possible alignment A-TA-T GGGGGG G--G-- TTATTA -TA-TA CCCCCC -G--G- AG-AG- GTTGTT GTGGTG.
Multiple Sequence Alignment CSC391/691 Bioinformatics Spring 2004 Fetrow/Burg/Miller (Slides by J. Burg)
Multiple Alignment – Υλικό βασισμένο στο κεφάλαιο 14 του βιβλίου: Dan Gusfield, Algorithms on Strings, Trees and Sequences, Cambridge University Press.
Bioinformatics in Biosophy
Substitution Numbers and Scoring Matrices
CISC667, S07, Lec5, Liao CISC 667 Intro to Bioinformatics (Spring 2007) Pairwise sequence alignment Needleman-Wunsch (global alignment)
Evolution and Scoring Rules Example Score = 5 x (# matches) + (-4) x (# mismatches) + + (-7) x (total length of all gaps) Example Score = 5 x (# matches)
Amino Acid Scoring Matrices Jason Davis. Overview Protein synthesis/evolution Protein synthesis/evolution Computational sequence alignment Computational.
Alignment methods April 26, 2011 Return Quiz 1 today Return homework #4 today. Next homework due Tues, May 3 Learning objectives- Understand the Smith-Waterman.
Pairwise Sequence Alignment. The most important class of bioinformatics tools – pairwise alignment of DNA and protein seqs. alignment 1alignment 2 Seq.
Pairwise alignment of DNA/protein sequences I519 Introduction to Bioinformatics, Fall 2012.
Sequence Analysis CSC 487/687 Introduction to computing for Bioinformatics.
Comp. Genomics Recitation 3 The statistics of database searching.
Construction of Substitution Matrices
Calculating branch lengths from distances. ABC A B C----- a b c.
Bioinformatics Multiple Alignment. Overview Introduction Multiple Alignments Global multiple alignment –Introduction –Scoring –Algorithms.
Multiple alignment: Feng- Doolittle algorithm. Why multiple alignments? Alignment of more than two sequences Usually gives better information about conserved.
Sequence Alignment Csc 487/687 Computing for bioinformatics.
Chapter 3 Computational Molecular Biology Michael Smith
Using traveling salesman problem algorithms for evolutionary tree construction Chantal Korostensky and Gaston H. Gonnet Presentation by: Ben Snider.
Multiple Alignment and Phylogenetic Trees Csc 487/687 Computing for Bioinformatics.
Multiple sequence comparison (MSC) Reading: Setubal/Meidanis, 3.4 Gusfield, Algorithms on Strings, Trees and Sequences, chapter 14.
Tutorial 4 Substitution matrices and PSI-BLAST 1.
Pairwise Sequence Analysis-III
Intro to Alignment Algorithms: Global and Local Intro to Alignment Algorithms: Global and Local Algorithmic Functions of Computational Biology Professor.
COT 6930 HPC and Bioinformatics Multiple Sequence Alignment Xingquan Zhu Dept. of Computer Science and Engineering.
©CMBI 2005 Database Searching BLAST Database Searching Sequence Alignment Scoring Matrices Significance of an alignment BLAST, algorithm BLAST, parameters.
Sequence Alignment.
Construction of Substitution matrices
Step 3: Tools Database Searching
©CMBI 2005 Database Searching BLAST Database Searching Sequence Alignment Scoring Matrices Significance of an alignment BLAST, algorithm BLAST, parameters.
Multiple Sequence Alignment Vasileios Hatzivassiloglou University of Texas at Dallas.
Multiple String Comparison – The Holy Grail. Why multiple string comparison? It is the most critical cutting-edge toοl for extracting and representing.
Substitution Matrices and Alignment Statistics BMI/CS 776 Mark Craven February 2002.
Multiple sequence alignment (msa)
Tutorial 3 – Protein Scoring Matrices PAM & BLOSUM
Computational Biology Lecture #6: Matching and Alignment
Computational Biology Lecture #6: Matching and Alignment
Intro to Alignment Algorithms: Global and Local
Multiple Sequence Alignment
Computational Genomics Lecture #3a
Alignment IV BLOSUM Matrices
Presentation transcript:

PAM250

M. Dayhoff Scoring Matrices Point Accepted Mutations or PAM matrices Proteins with 85% identity were used -> the function is not significantly changed -> the mutations are “accepted” PAM units – the measure of the amount of evolutionary distance between two amino acid sequences. One PAM unit – S 1 has converted (mutated) to S 2 with an average of one accepted point-mutation event per 100 amino acids.

p a (=N a /N) – probability of occurrence of amino acid ‘a’ over a large, sufficiently varied, data set.  a p a = 1 f ab – the number of times the mutation a b was observed to occur. f a =  b  a f ab the total number of mutations in which a was involved f =  a f a - the total number of amino acid occurrences involved in mutations. 1)Probability matrix 2)Scoring matrix

M - 20x20 probability matrix M ab - the probability of amino acid ‘a’ changing into ‘b’ m a = (f a / f) * 1/(100 * p a ) relative mutability of amino acid ‘a’. It is the probability that the given amino acid will change in the evolutionary period of interest. Assumptions – (a) 1 in 100 amino acids on average is changed. (b) mutations are position independent. (c) mutations are independent on its past.

M aa =1- m a - the probablity of ‘a’ to remain unchanged. M ab = Pr(a -> b) = Pr(a -> b | a changed) Pr(a changed)= = (f ab /f a )m a Easy to see:  b M ab =1 = M aa +  b  a (f ab /f a )m a = 1- m a + m a /f a  b  a f ab = 1  a p a M aa =0.99 -> in average 1 mutation every 100 positions.

What is the probability that ‘a’ mutates into ‘b’ in two PAM units of evolution? a->c->b or a->d-> …  c M ac M cb = M 2 ab ->M 2, M 3, M 4 … M 250 … k->  M k converges to a matrix with identical rows. M k ac = p c - no matter what amino acid you start with, after a long period of evolution the resulting amino acid will be ‘c’ with probability p c.

PAM-k ab = M k ab / p b - probability that a pair ‘ab’ is a mutation as opposed to being a random occurrence (likelihood or odds ratio). M ab / p b = [(f ab /f a )m a ] / p b = (f ab /f a ) f a / (f * 100 * p a * p b ) = f ab / (f * 100 * p a * p b ) = M ba / p a The total alignment score is the product of Pam-k ab. To avoid accuracy problems: Pam-k ab = 10 log M k ab / p b -> The total alignment score is the sum of Pam-k ab. PAM-k matrix

Multiple Sequence Alignment Mult-Seq-Align allows to detect similarities which cannot be detected with Pairwise-Seq-Align methods. Detection of family characteristics. Three questions: 1.Scoring 2.Computation of Mult-Seq-Align. 3.Family representation.

Multiple Sequence Alignment

Scoring: SP (sum of pairs) SP – the sum of pairwise scores of all pairs of symbols in the column. ρ 3 (-,A,A) = (-,A)+(-,A)+(A,A) SP Total Score = Σ ρ i (-,-) = 0

Induced pairwise alignment Induced pairwise alignment or projection of a multiple alignment. a(S 1, S 2 ) a(S 2, S 3 ) a(S 1, S 3 ) (-,-) = 0 SP Total Score = Σ i<j score[ a(S i, S j ) ]

Dyn.Prog. Solution

Dynamic Programming Solution The best multiple alignment of r sequences is calculated using an r- dimensional hyper-cube The size of the hyper-cube is O( Πn i ) Time complexity O(2 r n r ) * O( computation of the ρ function ). Exact problem is NP-Complete (metrics: sum-of-pairs or evolutionary tree). more efficient solution is needed

Multiple Alignment from Pairwise Alignments ? Problem: The best pairwise alignment does not necessary lead to the best multiple alignment.

Pattern-APattern-X Pattern-APattern-X Pattern-B Pattern-XPattern-B Pattern-D S1 S3 S2 S1S2S1S3S2S3 Pattern-APattern-BPattern-D Empty Correct Solution S1S2S3 Pattern-X

Center Star Alignment S1S1 S2S2 S3S3 SkSk ScSc S k-1 S k-2 (a)Scoring scheme – distance. (b)Scoring scheme satisfies the triangle inequality: for any character a,b,c dist(a,c) ≤ dist(a,b) + dist(b,c) (in practice not all scoring matrices satisfy the triangle inequality) (c) D(S i, S j ) – score of the optimal pairwise alignment. (d) D(M) = Σ i<j a M (S i, S j ) – score of the multiple alignment M. (e) a M (S i, S j ) – pairwise alignment/score induced by M.

S1S1 S2S2 S3S3 SkSk ScSc S k-1 S k-2 The Center Star Algorithm: (a) Find S c minimizing Σ i  c D(S c, S i ). (b) Iteratively construct the multiple alignment M c : 1. M c ={S c } 2. Add the sequences in S\{S c } to M c one by one so that the induced alignment a Mc (S c, S i ) of every newly added sequence S i with S c is optimal. Add spaces, when needed, to all pre-aligned sequences. Running time: * O(n 2 ). AC-BC DCABC AC--BC DCAAB C AC--BC DCA-BC DCAAB C

D(M c ) is at most twice the score of the D(M opt ) D (M c ) / D (M opt ) ≤ 2(k-1)/k ( < 2 ) Proof: (a) a(S i, S j ) ≥ D (S i, S j ) (any induced align. is not better than optimal align.) a Mc (S c, S j ) = D (S c, S j ) (b) a Mc (S i, S j ) ≤ a Mc (S i, S c ) + a Mc (S c, S j ) = D (S i, S c ) + D (S c, S j ) (follows from the triangle inequality) (c) 2 D(M c ) = Σ i=1..k Σ j=1..k,j  i a Mc (S i, S j ) ≤ Σ i=1..k Σ j=1..k,j  i ( a Mc (S i, S c ) + a Mc (S c, S j ) )= 2(k-1) Σ j  c a Mc (S c, S j ) = 2(k-1) Σ j  c D(S c, S j )

(d) k Σ j=1..k,j  c D(S c, S j ) = Σ i=1..k Σ j=1..k,j  c D(S c, S j ) ≤ Σ i=1..k Σ j=1..k,j  i D(S i, S j ) ≤ Σ i=1..k Σ j=1..k,j  i a Mopt (S i, S j ) = 2 D(M opt ) (e) → 2 D(M c ) ≤ 2(k-1) Σ j  c D(S c, S j ) k Σ j  c D(S c, S j ) ≤ 2 D(M opt ) → D(M c )/(k-1) ≤ Σ j  c D(S c, S i ) Σ j  c D(S c, S i ) ≤ 2 D(M opt )/k → D (M c ) / D (M opt ) ≤ 2(k-1)/k