Pairwise Sequence Alignment (II) (Lecture for CS498-CXZ Algorithms in Bioinformatics) Sept. 27, 2005 ChengXiang Zhai Department of Computer Science University.


1 Pairwise Sequence Alignment (II) (Lecture for CS498-CXZ Algorithms in Bioinformatics) Sept. 27, 2005 ChengXiang Zhai Department of Computer Science University of Illinois, Urbana-Champaign Many slides are taken/adapted from http://www.bioalgorithms.info/slides.htm

2 Review: Dynamic Programming for LCS -Edit graph representation of alignment -Path = alignment -Incrementally fill in the table -Backtrack to find the best alignment

3 The LCS Recurrence Revisited
The formula can be rewritten by adding zero to the edges that come from an indel, since the penalty for indels is 0:
$$ s_{i,j} = \max \begin{cases} s_{i-1,j-1} + 1 & \text{if } v_i = w_j \quad \text{(matching score)} \\ s_{i-1,j} + 0 & \text{(insertion/deletion score)} \\ s_{i,j-1} + 0 & \text{(insertion/deletion score)} \end{cases} $$
How do we improve scoring?
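
The LCS recurrence can be sketched in a few lines of Python. This is only an illustrative table-filling version that returns the LCS length; the function name `lcs_length` is an assumption made here, not something from the lecture.

```python
def lcs_length(v, w):
    """Length of the longest common subsequence of v and w."""
    n, m = len(v), len(w)
    # s[i][j] = LCS length of the prefixes v[:i] and w[:j]
    s = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # Indel edges contribute + 0
            best = max(s[i - 1][j] + 0, s[i][j - 1] + 0)
            # The diagonal edge contributes + 1, but only on a match
            if v[i - 1] == w[j - 1]:
                best = max(best, s[i - 1][j - 1] + 1)
            s[i][j] = best
    return s[n][m]
```

Writing the indel terms as `+ 0` keeps the code visually parallel to the recurrence, which makes the later generalization to arbitrary scores easier to see.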

4 How do we improve the scoring of alignments? Can we still find an alignment efficiently?

5 Outline Improve Scoring –Scoring Matrix –Affine Gap Penalty Variants of Alignment –Global vs. Local alignment Assessing Score Significance

6 Scoring Matrices
To generalize scoring, consider a (4+1) x (4+1) scoring matrix δ for nucleotide sequences. For an amino acid sequence alignment, the scoring matrix would be (20+1) x (20+1). The extra 1 covers the score for comparison with the gap character “-”. This will simplify the scoring algorithm as follows:
$$ s_{i,j} = \max \begin{cases} s_{i-1,j-1} + \delta(v_i, w_j) \\ s_{i-1,j} + \delta(v_i, -) \\ s_{i,j-1} + \delta(-, w_j) \end{cases} $$
The same dynamic programming algorithm would still work!

7 The Global Alignment Problem
Find the best alignment between two strings under a given scoring matrix.
Input: Strings v and w and a scoring matrix δ
Output: Alignment of maximum score
Algorithm: Dynamic programming
$$ s_{i,j} = \max \begin{cases} s_{i-1,j-1} + \delta(v_i, w_j) \\ s_{i-1,j} + \delta(v_i, -) \\ s_{i,j-1} + \delta(-, w_j) \end{cases} $$
The only question left is how to define the scoring matrix…
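
A minimal sketch of global alignment with a general scoring function δ, including the backtracking step. The names `simple_delta` and `global_align`, and the default penalties (mismatch 1, indel 2), are assumptions chosen for illustration; any function that scores a pair of symbols (including the gap character "-") could be passed instead.

```python
def simple_delta(a, b, mu=1, sigma=2):
    """Hypothetical scoring: +1 match, -mu mismatch, -sigma against a gap."""
    if a == "-" or b == "-":
        return -sigma
    return 1 if a == b else -mu


def global_align(v, w, delta=simple_delta):
    """Return (score, aligned_v, aligned_w) for a best global alignment."""
    n, m = len(v), len(w)
    s = [[0] * (m + 1) for _ in range(n + 1)]
    # First column and first row: prefixes aligned entirely against gaps
    for i in range(1, n + 1):
        s[i][0] = s[i - 1][0] + delta(v[i - 1], "-")
    for j in range(1, m + 1):
        s[0][j] = s[0][j - 1] + delta("-", w[j - 1])
    # Fill the table with the three-way recurrence
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s[i][j] = max(
                s[i - 1][j - 1] + delta(v[i - 1], w[j - 1]),
                s[i - 1][j] + delta(v[i - 1], "-"),
                s[i][j - 1] + delta("-", w[j - 1]),
            )
    # Backtrack from (n, m) to (0, 0), re-deriving which edge was used
    top, bottom = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and s[i][j] == s[i - 1][j - 1] + delta(v[i - 1], w[j - 1]):
            top.append(v[i - 1])
            bottom.append(w[j - 1])
            i -= 1
            j -= 1
        elif i > 0 and s[i][j] == s[i - 1][j] + delta(v[i - 1], "-"):
            top.append(v[i - 1])
            bottom.append("-")
            i -= 1
        else:
            top.append("-")
            bottom.append(w[j - 1])
            j -= 1
    return s[n][m], "".join(reversed(top)), "".join(reversed(bottom))
```

The backtracking here recomputes the winning edge instead of storing pointers, which keeps the sketch short; ties are broken in favor of the diagonal edge, so other equally good alignments may exist.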

8 Measuring Similarity Measuring the extent of similarity between two sequences –Based on percent sequence identity –Based on conservation

9 Percent Sequence Identity
The extent to which two nucleotide or amino acid sequences are invariant.
A C C T G A G - A G
A C G T G - G C A G
70% identical (7 of 10 columns match; column 3 is a mismatch, columns 6 and 8 are indels)
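
Percent identity for the slide's example can be checked with a short sketch. It assumes the two rows are already aligned (equal length, "-" for a gap) and divides by the alignment length, which is the convention that gives the slide's 70%; other conventions for the denominator exist. The function name is hypothetical.

```python
def percent_identity(row1, row2):
    """Percent of alignment columns in which both rows hold the same residue."""
    if len(row1) != len(row2):
        raise ValueError("aligned rows must have equal length")
    # A column counts as identical only if neither symbol is a gap
    matches = sum(1 for a, b in zip(row1, row2) if a == b and a != "-")
    return 100.0 * matches / len(row1)


identity = percent_identity("ACCTGAG-AG", "ACGTG-GCAG")
```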

10 Simple Scoring When mismatches are penalized by some constant –μ, indels are penalized by some other constant –σ, and matches are rewarded with +1, the resulting score is: #matches – μ(#mismatches) – σ (#indels)
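
The simple scoring formula can be evaluated directly on a given alignment. A small sketch, assuming aligned rows of equal length with "-" as the gap character; the function name and the example values μ = 1 and σ = 2 are illustrative assumptions.

```python
def simple_score(row1, row2, mu, sigma):
    """#matches - mu * #mismatches - sigma * #indels for an existing alignment."""
    matches = mismatches = indels = 0
    for a, b in zip(row1, row2):
        if a == "-" or b == "-":
            indels += 1
        elif a == b:
            matches += 1
        else:
            mismatches += 1
    return matches - mu * mismatches - sigma * indels


# Alignment from the percent-identity slide: 7 matches, 1 mismatch, 2 indels
example_score = simple_score("ACCTGAG-AG", "ACGTG-GCAG", mu=1, sigma=2)
```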

11 Making a Better Scoring Matrix Scoring matrices are created based on biological evidence. Alignments can be thought of as two sequences that differ due to mutations in the sequence. Some of these mutations have little effect on the organism’s function, therefore some penalties, δ(v i, w j ), will be less harsh than others.

12 Scoring Matrix: Example
     A    R    N    K
A    5   -2    …    …
R    -    7    …    3
N    -    -    7    0
K    -    -    -    6
Notice that although R and K are different amino acids, they have a positive score. Why? They are both positively charged amino acids, so substituting one for the other will not greatly change the function of the protein.

13 Scoring matrices Amino acid substitution matrices –PAM –BLOSUM DNA substitution matrices –DNA: less conserved than protein sequences –Less effective to compare coding regions at nucleotide level –Simple scoring is often used

14 PAM
Point Accepted Mutation (Dayhoff et al.)
1 PAM = PAM 1 = 1% average change of all amino acid positions
After 100 PAMs of evolution, not every residue will have changed:
- some residues may have mutated several times
- some residues may have returned to their original state
- some residues may not have changed at all

15 PAM X
$\mathrm{PAM}_x = \mathrm{PAM}_1^{\,x}$, e.g., $\mathrm{PAM}_{250} = \mathrm{PAM}_1^{250}$
PAM 250 is a widely used scoring matrix:
       Ala Arg Asn Asp Cys Gln Glu Gly His Ile Leu Lys ...
         A   R   N   D   C   Q   E   G   H   I   L   K ...
Ala A   13   6   9   9   5   8   9  12   6   8   6   7 ...
Arg R    3  17   4   3   2   5   3   2   6   3   2   9
Asn N    4   4   6   7   2   5   6   4   6   3   2   5
Asp D    5   4   8  11   1   7  10   5   6   3   2   5
Cys C    2   1   1   1  52   1   1   2   2   2   1   1
Gln Q    3   5   5   6   1  10   7   3   7   2   3   5
...
Trp W    0   2   0   0   0   0   0   0   1   0   1   0
Tyr Y    1   1   2   1   3   1   1   1   3   2   2   1
Val V    7   4   4   4   4   4   4   4   5   4  15  10
Think of PAM 1 as 1-step transitions and PAM 250 as 250-step transitions.
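
The "x-step transition" view can be illustrated by raising a transition matrix to a power. The sketch below uses a toy two-letter alphabet with a 1% change per step; this toy matrix is an assumption made purely for illustration and is not real PAM data.

```python
def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    size = len(a)
    return [
        [sum(a[i][k] * b[k][j] for k in range(size)) for j in range(size)]
        for i in range(size)
    ]


def mat_pow(p, steps):
    """Multiply p by itself `steps` times, starting from the identity."""
    size = len(p)
    result = [[1.0 if i == j else 0.0 for j in range(size)] for i in range(size)]
    for _ in range(steps):
        result = mat_mul(result, p)
    return result


# Toy 1-step matrix: each letter changes with probability 0.01 per step
toy_pam1 = [[0.99, 0.01], [0.01, 0.99]]
toy_pam100 = mat_pow(toy_pam1, 100)
# toy_pam100[0][0] is the probability that a position shows its original
# letter after 100 steps (it may have changed and changed back in between)
```

In this toy model the diagonal entry after 100 steps is still above one half, which matches the point that 100 PAMs of evolution does not mean every residue has changed.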

16 BLOSUM
Blocks Substitution Matrix
Scores are derived from observations of the frequencies of substitutions in blocks of local alignments in related proteins.
The matrix name indicates evolutionary distance:
- BLOSUMx was created using sequences sharing no more than x% identity
- E.g., BLOSUM62 corresponds to 62% identity

17 The Blosum50 Scoring Matrix
$$ \mathrm{Val}(x,y) = \log \frac{p(x,y)}{p(x)\,p(y)} $$
where p(x,y) is the probability of seeing x aligned with y, and p(x) (or p(y)) is the probability of seeing x (or y) alone.
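
The log-odds formula is easy to evaluate numerically. In this sketch the probabilities are made-up numbers, the natural logarithm is an assumption (the slide does not state the base, and published matrices are additionally scaled and rounded), and the function name is hypothetical.

```python
import math


def log_odds(p_xy, p_x, p_y):
    """log( p(x,y) / (p(x) * p(y)) ), using the natural logarithm."""
    return math.log(p_xy / (p_x * p_y))


# Made-up probabilities: a pair seen aligned twice as often as chance predicts
favored = log_odds(0.02, 0.1, 0.1)
# Made-up probabilities: a pair seen aligned half as often as chance predicts
disfavored = log_odds(0.005, 0.1, 0.1)
```

A pair that co-occurs more often than independence would predict gets a positive score, and a pair that co-occurs less often gets a negative one.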

18 Deficiency in Scoring of Indels
A fixed penalty σ is given to every indel:
- -σ when there is 1 indel, -2σ for 2 consecutive indels, -3σ for 3 consecutive indels, etc.
This can be too severe a penalty for a series of 100 consecutive indels.

19 Deficiency in Scoring of Indels (cont.)
In nature, indels often come as a unit, not just one nucleotide at a time.
[Figure: two alignments that receive the same score under normal scoring; one of them is marked as more likely in nature.]

20 Accounting for Gaps
Gap: a contiguous sequence of spaces in one of the rows.
The score for a gap of length x is -(ρ + σx), where ρ > 0 is the penalty for introducing a gap. ρ will be large relative to σ because you do not want to add too much of a penalty for extending the gap.

21 Affine Gap Penalties
Gap penalties:
- -ρ-σ when there is 1 indel, -ρ-2σ when there are 2 indels, -ρ-3σ when there are 3 indels, etc.
- In general, -ρ - x·σ (one gap opening plus x gap extensions)
Somewhat reduced penalties (as compared to naïve scoring) are given to runs of horizontal and vertical edges.
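
A small numeric comparison of the two gap models. The parameter values (σ = 2 for the fixed penalty; ρ = 10 and σ = 0.5 for the affine one) are arbitrary assumptions chosen only to show the effect on a long gap.

```python
def linear_gap_score(x, sigma):
    """Fixed penalty per indel: a gap of length x scores -sigma * x."""
    return -sigma * x


def affine_gap_score(x, rho, sigma):
    """Affine penalty: a gap of length x scores -(rho + sigma * x)."""
    return -(rho + sigma * x)


long_gap_linear = linear_gap_score(100, sigma=2)
long_gap_affine = affine_gap_score(100, rho=10, sigma=0.5)
short_gap_affine = affine_gap_score(1, rho=10, sigma=0.5)
```

With a large opening penalty and a small extension penalty, a single indel is expensive while each additional position in the same gap is cheap, so 100 consecutive indels cost far less than 100 separate ones.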

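22 Affine Gap Penalty Recurrences
Three scores are kept for each cell: $s^{\downarrow}_{i,j}$ for alignments ending in a gap in w (deletion), $s^{\rightarrow}_{i,j}$ for alignments ending in a gap in v (insertion), and $s_{i,j}$ for the best alignment overall.
$$ s^{\downarrow}_{i,j} = \max \begin{cases} s^{\downarrow}_{i-1,j} - \sigma & \text{continue gap in } w \text{ (deletion)} \\ s_{i-1,j} - (\rho + \sigma) & \text{start gap in } w \text{ (deletion)} \end{cases} $$
$$ s^{\rightarrow}_{i,j} = \max \begin{cases} s^{\rightarrow}_{i,j-1} - \sigma & \text{continue gap in } v \text{ (insertion)} \\ s_{i,j-1} - (\rho + \sigma) & \text{start gap in } v \text{ (insertion)} \end{cases} $$
$$ s_{i,j} = \max \begin{cases} s_{i-1,j-1} + \delta(v_i, w_j) & \text{match or mismatch} \\ s^{\downarrow}_{i,j} & \text{end deletion} \\ s^{\rightarrow}_{i,j} & \text{end insertion} \end{cases} $$
Once again, the same dynamic programming algorithm would work!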
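
A sketch of the affine-gap recurrences using three tables, one per kind of edge. The table names `down`, `right`, and `main`, the function name, and the boundary initialization (a leading gap of length x costs ρ + σx) are assumptions made here; only the score is returned, with no backtracking.

```python
NEG_INF = float("-inf")


def affine_global_score(v, w, delta, rho, sigma):
    """Best global alignment score when a gap of length x costs rho + sigma * x."""
    n, m = len(v), len(w)
    # down[i][j]:  best score of alignments ending with a gap in w (deletion)
    # right[i][j]: best score of alignments ending with a gap in v (insertion)
    # main[i][j]:  best score overall for the prefixes v[:i] and w[:j]
    down = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    right = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    main = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    main[0][0] = 0
    for i in range(1, n + 1):
        down[i][0] = -(rho + sigma * i)
        main[i][0] = down[i][0]
    for j in range(1, m + 1):
        right[0][j] = -(rho + sigma * j)
        main[0][j] = right[0][j]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # Continue an open gap, or start a new one from the main table
            down[i][j] = max(down[i - 1][j] - sigma,
                             main[i - 1][j] - (rho + sigma))
            right[i][j] = max(right[i][j - 1] - sigma,
                              main[i][j - 1] - (rho + sigma))
            # Match/mismatch, or close a gap
            main[i][j] = max(main[i - 1][j - 1] + delta(v[i - 1], w[j - 1]),
                             down[i][j],
                             right[i][j])
    return main[n][m]


def match_mismatch(a, b):
    """Hypothetical substitution score: +1 for a match, -1 for a mismatch."""
    return 1 if a == b else -1
```

For example, with ρ = 3 and σ = 1, aligning "AAAA" with "AA" is best done with one gap of length 2 (two matches minus 3 + 2), which scores higher than two separate gaps of length 1.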

23 Local vs. Global Alignment The Global Alignment Problem tries to find the longest path between vertices (0,0) and (n,m) in the edit graph. The Local Alignment Problem tries to find the longest path among paths between arbitrary vertices (i,j) and (i’, j’) in the edit graph.

24 Local vs. Global Alignment (cont’d)
Global alignment:
--T--CC-C-AGT--TATGT-CAGGGGACACG--A-GCATGCAGA-GAC
  |  || |  ||  | | | |||    || |  | |  | ||||   |
AATTGCCGCC-GTCGT-T-TTCAG----CA-GTTATG--T-CAGAT--C
Local alignment (better for finding a conserved segment):
                  tccCAGTTATGTCAGgggacacgagcatgcagagac
                     ||||||||||||
aattgccgccgtcgttttcagCAGTTATGTCAGatc

25 Local Alignments: Why? Two genes in different species may be similar over short conserved regions and dissimilar over remaining regions. Example: –Homeobox genes have a short region called the homeodomain that is highly conserved between species. –A global alignment would not find the homeodomain because it would try to align the ENTIRE sequence

26 The Local Alignment Problem
Goal: Find the best local alignment between two strings
Input: Strings v, w and scoring matrix δ
Output: Alignment of substrings of v and w whose alignment score is maximum among all possible alignments of all possible substrings

27 Local Alignment in Edit Graph
[Figure: edit graph contrasting a global alignment path with a local alignment path.]
Compute a “mini” global alignment to get a local alignment.

28 The Problem with this Problem
The problem with this approach is its long run time, $O(n^4)$:
- There are ~$n^2$ vertices (i,j) that can serve as the start of a local alignment
- For each such vertex, computing the alignments that start there takes $O(n^2)$ time
Solution: Dynamic programming again!
Question: How do we recursively compute the best score of any local (as opposed to global) alignment for each cell in the edit graph?

29 The Local Alignment Recurrence
The largest value of $s_{i,j}$ over the whole edit graph is the score of the best local alignment. The recurrence is:
$$ s_{i,j} = \max \begin{cases} 0 \\ s_{i-1,j-1} + \delta(v_i, w_j) \\ s_{i-1,j} + \delta(v_i, -) \\ s_{i,j-1} + \delta(-, w_j) \end{cases} $$
The added 0 is the only change from the recurrence for global alignment. This is the well-known Smith-Waterman local alignment algorithm.
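
A minimal sketch of the local alignment recurrence with backtracking from the highest-scoring cell. The function name and the default scores (+1 match, -1 mismatch, -1 indel) are assumptions for illustration, standing in for a general scoring matrix δ.

```python
def local_align(v, w, match=1, mismatch=-1, indel=-1):
    """Return (score, aligned_v_substring, aligned_w_substring)."""
    n, m = len(v), len(w)
    s = [[0] * (m + 1) for _ in range(n + 1)]
    best, best_pos = 0, (0, 0)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if v[i - 1] == w[j - 1] else mismatch
            # The extra 0 lets an alignment start fresh at any cell
            s[i][j] = max(0,
                          s[i - 1][j - 1] + sub,
                          s[i - 1][j] + indel,
                          s[i][j - 1] + indel)
            if s[i][j] > best:
                best, best_pos = s[i][j], (i, j)
    # Backtrack from the best cell until a cell with score 0 is reached
    i, j = best_pos
    top, bottom = [], []
    while i > 0 and j > 0 and s[i][j] > 0:
        sub = match if v[i - 1] == w[j - 1] else mismatch
        if s[i][j] == s[i - 1][j - 1] + sub:
            top.append(v[i - 1])
            bottom.append(w[j - 1])
            i -= 1
            j -= 1
        elif s[i][j] == s[i - 1][j] + indel:
            top.append(v[i - 1])
            bottom.append("-")
            i -= 1
        else:
            top.append("-")
            bottom.append(w[j - 1])
            j -= 1
    return best, "".join(reversed(top)), "".join(reversed(bottom))
```

The whole table is filled once in O(nm) time, and the answer is read from the maximum cell instead of from the corner (n, m). If several cells tie for the maximum, this sketch reports the first one found in row order.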

30 Assessing Score Significance
In general, a larger s is more significant. The question is how large s should be.
Factors to be considered:
- Sequence length: longer sequences are expected to give higher scores
- Number of sequences in the database: the score of the best alignment is expected to be higher for a larger DB
- Evolution time: longer evolution causes more mismatches, making a lower score more significant
The challenge is how to quantify all these…

31 Two Basic Approaches
The classical approach: Extreme value distribution (EVD)
- Assume a null (random) model $M_R$ for scores
- $P(\text{Score} > s \mid M_R, a(x,y)) = ?$ where a(x,y) is the alignment of x and y
The Bayesian approach: Model comparison
- Assume two models for a(x,y): random (R) and aligned (M)
- $P(M \mid a(x,y)) / P(R \mid a(x,y)) = ?$ (log-odds score of the alignment, combined with a prior)

32 EVD of the Best Score in Ungapped Local Alignment
The number of unrelated local matches with score higher than S is approximately Poisson distributed, with mean
$$ E(S) = K\,m\,n\,e^{-\lambda S} $$
where m and n are the sequence lengths and K and λ are parameters.
The probability that there is a match of score greater than S is
$$ P(x > S) = 1 - e^{-E(S)} = 1 - \exp\left(-K\,m\,n\,e^{-\lambda S}\right) $$
K and λ can be fit using randomly generated data.
This gives a way to test statistical significance, e.g., p(x>21) = 0.01 vs. p(x>21) = 0.3.
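
A sketch of the significance calculation, assuming the standard form of the extreme value result for ungapped local alignment: the expected number of chance matches with score above S is K·m·n·e^(-λS), and the probability of at least one such match is 1 - e^(-E). The values of K and λ used below are placeholders; in practice they are fit to scores from randomly generated sequences. The function name is hypothetical.

```python
import math


def evd_p_value(score, m, n, K, lam):
    """P(at least one chance match scoring above `score`).

    m, n: sequence lengths; K, lam: fitted parameters (placeholders here).
    """
    expected = K * m * n * math.exp(-lam * score)
    # 1 - exp(-expected), computed stably when `expected` is small
    return -math.expm1(-expected)


# Placeholder parameters, for illustration only
p_low_score = evd_p_value(21, m=100, n=100, K=0.1, lam=0.3)
p_high_score = evd_p_value(50, m=100, n=100, K=0.1, lam=0.3)
```

The same score becomes less significant as m and n grow, which is how the formula accounts for sequence length.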

33 Bayesian Model Comparison
M is a model for related sequences; R is a model for unrelated (random) sequences.
Assumptions:
- Ungapped alignment, n = m
- Alignment of each pair is independent
[Equation not preserved in this transcript; its labeled terms are the score S(x,y) and the prior (subjective!).]
This partially addresses Q1 (how to design the scoring function?): BLOSUM scoring.
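
One way to see how a score and a prior combine, sketched under stated assumptions: the score S is taken to be a natural-log likelihood ratio log P(x,y|M)/P(x,y|R), and the prior probability of M is supplied by the user. By Bayes' rule the posterior is then the logistic function of S plus the log prior odds. The function name and the example numbers are illustrative assumptions, not taken from the slide.

```python
import math


def posterior_related(score, prior_related):
    """P(M | x, y) from a natural-log log-odds score and a prior P(M)."""
    log_prior_odds = math.log(prior_related / (1.0 - prior_related))
    return 1.0 / (1.0 + math.exp(-(score + log_prior_odds)))


# With an even prior, a likelihood ratio of 3 gives a posterior of 3/4
even_prior = posterior_related(math.log(3), 0.5)
# The same evidence is much less convincing under a skeptical prior
skeptical_prior = posterior_related(math.log(3), 0.01)
```

This makes the subjective role of the prior concrete: when searching a large database, where most sequences are unrelated, a small prior means a much larger score is needed before the related model becomes probable.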

34 Pairwise Alignment Summary
Given X = x_1,…,x_n and Y = y_1,…,y_m:
- Possible alignments of X and Y: A = {a_1,…,a_k}
- Model: scoring function s: A → ℝ
- Find the best alignment(s) a* (in the slide's example, S(a*) = 21)
Q1: How should we define s? (Modeling evolution)
Q2: How should we define A? (Application-specific)
Q3: How can we find a* quickly? (Dynamic programming)
Q4: Is the alignment biologically meaningful, or just the best alignment of two unrelated sequences? (Models for scores)
Q1 & Q4 are related!

35 What You Should Know Alignment Scoring Methods (Matrix & Gap) Global vs. Local alignments How the dynamic programming algorithm solves both local and global alignments with a number of scoring strategies Basic idea in assessing score significance

