Download presentation
Presentation is loading. Please wait.
Published byHenry Thomas Modified over 9 years ago
1
Dynamic Programming I Definition of Dynamic Programming
Sequence Alignment Longest Common Subsequence (LCS) Problem Global Alignment Scoring Matrices Local Alignment
2
Definition of Dynamic Programming (Planning)
Dynamic Programming: Solve the subproblems (they may overlap with each other) and save their solutions in a table, and then use the table to solve the original problem. Example: Compute Fibonacci number f(0)=0, f(1)=1, f(n)=f(n-1)+f(n-2) f(0) f(1) f(2) f(3) f(4) …… f(n) 1 2 3 Procedure FibonacciNum(n) f(0) = 0 f(1) = 1 for i = 2 to n f(i) = f(i-1) + f(i-2) Return f(n) Time complexity = O(n)
3
Sequence Alignment Example: Cystic Fibrosis
Cystic fibrosis (CF) is a chronic and frequently fatal genetic disease of the body's mucus glands (abnormally high level of mucus in glands). CF primarily affects the respiratory systems in children. Mucus is a slimy material that coats many epithelial surfaces and is secreted into fluids such as saliva
4
Cystic Fibrosis: Finding the Gene
CFTR: Cystic fibrosis transmembrane conductance regulator
5
Cystic Fibrosis: Mutation Analysis
If a high % of cystic fibrosis (CF) patients have a certain mutation in the gene and the normal patients don’t, then that could be an indicator of a mutation that is related to CF A certain mutation was found in 70% of CF patients, convincing evidence that it is a predominant genetic diagnostics marker for CF
6
Cystic Fibrosis and CFTR Gene :
7
Cystic Fibrosis and the CFTR Protein
CFTR (Cystic Fibrosis Transmembrane conductance Regulator) protein is acting in the cell membrane of epithelial cells that secrete mucus These cells line the airways of the nose, lungs, the stomach wall, etc.
8
Aligning DNA Sequences
V = ATCTGATG n = 8 4 matches mismatches insertions deletions m = 7 1 W = TGCATAC 2 match mismatch 2 V A T C G W deletion indels insertion
9
Longest Common Subsequence (LCS) – Alignment without Mismatches
Given two sequences v = v1 v2…vm and w = w1 w2…wn The LCS of v and w is a sequence of positions in v: 1 < i1 < i2 < … < it < m and a sequence of positions in w: 1 < j1 < j2 < … < jt < n such that it -th letter of v equals to jt-letter of w and t is maximal
10
LCS: Example 1 1 2 2 3 4 3 5 4 5 6 6 7 7 8 i coords: elements of v A T
1 1 2 2 3 4 3 5 4 5 6 6 7 7 8 i coords: elements of v A T -- C -- T G A T C elements of w -- T G C A T -- A -- C j coords: (0,0) (1,0) (2,1) (2,2) (3,3) (3,4) (4,5) (5,5) (6,6) (7,6) (8,7) positions in v: 2 < 3 < 4 < 6 < 8 Matches shown in red positions in w: 1 < 3 < 5 < 6 < 7
11
Edit Graph for LCS Problem
j 1 2 3 4 5 6 7 8 Every path is a common subsequence. Every diagonal edge adds an extra element to common subsequence LCS Problem: Find a path with maximum number of diagonal edges i T 1 G 2 C 3 A 4 T 5 A 6 C 7
12
Computing LCS Let vi = prefix of v of length i: v1 … vi
and wj = prefix of w of length j: w1 … wj The length of LCS(vi,wj) is computed by: si-1,j si,j = max si,j i-1,j -1 i-1,j si-1,j , if vi = wj 1 i,j -1 i,j
13
Alignments as LCS (without mismatch)
and represent indels in v and w with score 0. represent matches with score 1. The score of the alignment path is 5. 1 2 3 4 5 6 7 G A T C w v Every path in the edit graph corresponds to an alignment:
14
Alignment as LCS w v Old Alignment 0122345677 v= AT_GTTAT_
2 3 4 5 6 7 G A T C w v Old Alignment v= AT_GTTAT_ w= ATCGT_A_C New Alignment v= AT_GTTAT_ w= ATCG_TA_C
15
Alignment: Dynamic Programming
si,j = si-1, j-1+1 if vi = wj max si-1, j si, j-1 {
16
Dynamic Programming Example
1 2 3 4 5 6 7 G A T C w v Initialize 1st row and 1st column to be all zeroes. Or, to be more precise, initialize 0th row and 0th column to be all zeroes.
17
Dynamic Programming Example
1 2 3 4 5 6 7 G A T C w v Si,j = Si-1, j-1 max Si-1, j Si, j-1 { 1 1 1 1 1 1 1 value from NW +1, if vi = wj value from North (top) value from West (left) 1 1 1 1 1 1
18
Alignment: Backtracking
Arrows show where the score originated from. if from the top if from the left if vi = wj
19
Backtracking Example w v Find a match in row and column 2.
i=2, j=2,5 is a match (T). j=2, i=4,5,7 is a match (T). Since vi = wj, si,j = si-1,j-1 +1 s2,2 = [s1,1 = 1] + 1 s2,5 = [s1,4 = 1] + 1 s4,2 = [s3,1 = 1] + 1 s5,2 = [s4,1 = 1] + 1 s7,2 = [s6,1 = 1] + 1 1 2 3 4 5 6 7 G A T C w v 1 1 1 1 1 1 1 1 2 2 2 2 2 2 1 2 1 2 1 2 1 2 1 2
20
Backtracking Example w v
1 2 3 4 5 6 7 G A T C w v Continuing with the dynamic programming algorithm gives this result. 1 1 1 1 1 1 1 1 2 2 2 2 2 2 1 2 2 3 3 3 3 1 2 2 3 4 4 4 1 2 2 3 4 4 4 1 2 2 3 4 5 5 1 2 2 3 4 5 5
21
LCS Algorithm { LCS(v,w) for i 1 to n si,0 0 for j 1 to m
s0,j 0 si-1,j si,j max si,j-1 si-1,j-1 + 1, if vi = wj “ “ if si,j = si-1,j bi,j “ “ if si,j = si,j-1 “ “ if si,j = si-1,j-1 + 1 return (sn,m, b) {
22
Now What? w v LCS(v,w) created the alignment grid
1 2 3 4 5 6 7 G A T C w v LCS(v,w) created the alignment grid Now we need a way to read the best alignment of v and w Follow the arrows backwards from sink
23
Printing LCS: Backtracking
PrintLCS(b,v,i,j) if i = 0 or j = 0 return if bi,j = “ “ PrintLCS(b,v,i-1,j-1) print vi else PrintLCS(b,v,i-1,j) PrintLCS(b,v,i,j-1)
24
LCS Runtime It takes O(nm) time to fill in the nxm dynamic programming matrix. Why O(nm)? The pseudocode consists of a nested “for” loop inside of another “for” loop to set up a nxm matrix.
25
From LCS to Alignment: Change up the Scoring
The Longest Common Subsequence (LCS) problem—the simplest form of sequence alignment – allows only insertions and deletions (no mismatches). In the LCS Problem, we scored 1 for matches and 0 for indels Consider penalizing indels and mismatches with negative scores Simplest scoring schema: +1 : match premium -μ : mismatch penalty -σ : indel penalty
26
Simple Scoring When mismatches are penalized by –μ, indels are penalized by –σ, and matches are rewarded with +1, the resulting score is: #matches – μ(#mismatches) – σ (#indels)
27
The Global Alignment Problem
Find the best alignment between two strings under a given scoring schema Input : Strings v and w and a scoring schema Output : Alignment of maximum score ↑→ = -б = 1 if match = -µ if mismatch si-1,j if vi = wj si,j = max s i-1,j-1 -µ if vi ≠ wj s i-1,j - σ s i,j-1 - σ m : mismatch penalty σ : indel penalty {
28
Scoring Matrices To generalize scoring, consider a (4+1) x(4+1) scoring matrix δ. In the case of an amino acid sequence alignment, the scoring matrix would be a (20+1)x(20+1) size. The addition of 1 is to include the score for comparison of a gap character “-”. This will simplify the algorithm as follows: si-1,j-1 + δ (vi, wj) si,j = max s i-1,j + δ (vi, -) s i,j-1 + δ (-, wj) {
29
Measuring Similarity Measuring the extent of similarity between two sequences Based on percent sequence identity Based on conservation
30
Percent Sequence Identity
The extent to which two nucleotide or amino acid sequences are invariant A C C T G A G – A G A C G T G – G C A G mismatch indel 70% identical
31
Making a Scoring Matrix
Scoring matrices are created based on biological evidence. Alignments can be thought of as two sequences that differ due to mutations. Some of these mutations have little effect on the protein’s function, therefore some penalties, δ(vi , wj), will be less harsh than others.
32
Scoring Matrix: Example
K 5 -2 -1 - 7 3 6 Notice that although R and K are different amino acids, they have a positive score. Why? They are both positively charged amino acids will not greatly change function of protein. AKRANR KAAANK -1 + (-1) + (-2) = 11
33
Conservation Amino acid changes that tend to preserve the physico-chemical properties of the original residue Polar to polar aspartate glutamate Nonpolar to nonpolar alanine valine Similarly behaving residues leucine to isoleucine (side chain with the structure CH2CH(CH3)2, an essential amino acid, meaning you must obtain it from food, as opposed to being able to make it from other molecules, which your cells can do with the majority of the amino acids.)
34
Scoring matrices Amino acid substitution matrices PAM BLOSUM
DNA substitution matrices DNA is less conserved than protein sequences Less effective to compare coding regions at nucleotide level
35
PAM some residues may have mutated several times
Point Accepted Mutation (Dayhoff et al.) 1 PAM = PAM1 = 1% average change of all amino acid positions After 100 PAMs of evolution, not every residue will have changed some residues may have mutated several times some residues may have returned to their original state some residues may not changed at all
36
PAMX PAMx = PAM1x PAM250 = PAM1250
PAM250 is a widely used scoring matrix: Ala Arg Asn Asp Cys Gln Glu Gly His Ile Leu Lys ... A R N D C Q E G H I L K ... Ala A Arg R Asn N Asp D Cys C Gln Q ... Trp W Tyr Y Val V
37
BLOSUM Blocks Substitution Matrix
Scores derived from observations of the frequencies of substitutions in blocks of local alignments in related proteins Matrix name indicates evolutionary distance BLOSUM62 was created using sequences sharing no more than 62% identity
38
The Blosum50 Scoring Matrix
39
Local vs. Global Alignment
The Global Alignment Problem tries to find the longest path between vertices (0,0) and (n,m) in the edit graph. The Local Alignment Problem tries to find the longest path among paths between arbitrary vertices (i,j) and (i’, j’) in the edit graph. In the edit graph with negatively-scored edges, Local Alignmet may score higher than Global Alignment
40
Local vs. Global Alignment (cont’d)
Local Alignment—better alignment to find conserved segment --T—-CC-C-AGT—-TATGT-CAGGGGACACG—A-GCATGCAGA-GAC | || | || | | | ||| || | | | | |||| | AATTGCCGCC-GTCGT-T-TTCAG----CA-GTTATG—T-CAGAT--C tccCAGTTATGTCAGgggacacgagcatgcagagac |||||||||||| aattgccgccgtcgttttcagCAGTTATGTCAGatc
41
Local Alignment: Example
Compute a “mini” Global Alignment to get Local Local alignment Global alignment
42
Local Alignments: Why? Two genes in different species may be similar over short conserved regions and dissimilar over remaining regions. Example: Homeobox genes have a short region called the homeodomain that is highly conserved between species. A global alignment would not find the homeodomain because it would try to align the ENTIRE sequence
43
The Local Alignment Problem
Goal: Find the best local alignment between two strings Input : Strings v, w and scoring matrix δ Output : Alignment of substrings of v and w whose alignment score is maximum among all possible alignment of all possible substrings
44
Local Alignment: Free Rides
Yeah, a free ride! Vertex (0,0) The dashed edges represent the free rides from (0,0) to every other node.
45
The Local Alignment Recurrence
The largest value of si,j over the whole edit graph is the score of the best local alignment. The recurrence: Notice there is only this change from the original recurrence of a Global Alignment si,j = max si-1,j-1 + δ (vi, wj) s i-1,j + δ (vi, -) s i,j-1 + δ (-, wj) {
46
The Local Alignment Recurrence
The largest value of si,j over the whole edit graph is the score of the best local alignment. The recurrence: Power of ZERO: there is only this change from the original recurrence of a Global Alignment - since there is only one “free ride” edge entering into every vertex si,j = max si-1,j-1 + δ (vi, wj) s i-1,j + δ (vi, -) s i,j-1 + δ (-, wj) {
47
Scoring Indels: Naive Approach
A fixed penalty σ is given to every indel: -σ for 1 indel, -2σ for 2 consecutive indels -3σ for 3 consecutive indels, etc. Can be too severe penalty for a series of 100 consecutive indels
48
Affine Gap Penalties ATA__GC ATATTGC ATAG_GC AT_GTGC
In nature, a series of k indels often come as a single event rather than a series of k single nucleotide events: ATA__GC ATATTGC ATAG_GC AT_GTGC This is more likely. This is less likely. Normal scoring would give the same score for both alignments
49
Accounting for Gaps Gaps- contiguous sequence of spaces in one of the rows Score for a gap of length x is: -(ρ + σx) where ρ >0 is the penalty for introducing a gap: gap opening penalty ρ will be large relative to σ: gap extension penalty because you do not want to add too much of a penalty for extending the gap.
50
Summary of Algorithms for LCS, Global Alignment and Local Alignment
51
(1) LCS Problem from Bioinformatics: the DNA of one organism may be
S1 = ACCGGTCGAGTGCGCGGAAGCCGGCCGAAA, while the DNA of another organism may be S2 = GTCGTTCTTAATGCCGTTGCTCTGTAAA. One goal of comparing two strands of DNA is to determine how “similar” the two strands are, as some measure of how closely related the two organisms are. LCS Problem Formulization Given a sequence X = ( ), another sequence Z = ( ) is a subsequence of X if there exists a strictly increasing sequence ( ) of indices of X such that for all j = 1, 2, …k, we have Theorem Let X = ( ) and Y = ( ) be sequence, and Z = ( ) be any LCS of X and Y. Find a match Didn’t find a match
52
Let c[i,j] be the length of an LCS of the sequences
Find a match Let c[i,j] be the length of an LCS of the sequences Didn’t find a match Procedure LCS-Length(X,Y) m := length[X]; n := length[Y]; for i := 0 to m do c[i, 0] := 0 for j := 0 to n do c[0,j] := 0 for i := 1 to m do for j := 1 to n do if X[i] = Y[j] then c[i,j] := c[i-1, j-1] +1 else if c[i-1,j] > c[i,j-1] then c[i,j] := c[i-1,j] else c[i,j] := c[i,j-1] return c Procedure Print-LCS(c,X,i,j) 1 if i = 0 or j = 0 then return 2 if c[i,j] = c[i-1,j-1] + 1 then { Print-LCS(c,X,i-1,j-1) print X[i] } else if c[i-1, j] > c[i,j-1] then Print-LCS(c,X,i-1,j) else Pring-LCS(c,X,i,j-1)
53
(2) Global Alignment in Bioinformatics
Find a match (2) Global Alignment in Bioinformatics Let s[i,j] be the score of the sequences Didn’t find a match indel mismatch Procedure GAlign(X,Y) m := length[X]; n := length[Y]; for i := 0 to m do s[i, 0] := 0; for j := 0 to n do s[0,j] := 0; for i := 1 to m do for j := 1 to n do if X[i] = Y[j] then s[i,j] = s[i-1,j-1]+1 else s[i,j] = s[i-1,j-1] – mu; if s[i,j] < s[i-1,j] – sigma then s[i-1, j] = s[i-1,j – sigma; if s[i,j] < s[i,j-1] – sigma; then s[i, j] = s[i,j-1] –sigma; Procedure Print-GAlign(s,X,i,j) 1 if i = 0 or j = 0 then return 2 if s[i,j] = s[i-1,j-1] + 1 or s[I,j] = s[i-1,j-1] - mu then { Print-GAlign(c,X,i-1,j-1) print X[i] } else if s[i, j] = s[i-1,j] - sigma then Print-GAlign(s,X,i-1,j) else Print-GAlign(s,X,i,j-1)
54
(3) Local Alignment in Bioinformatics
Find a match (3) Local Alignment in Bioinformatics Let s[i,j] be the score of the sequences Didn’t find a match Procedure LAlign(X,Y) m := length[X]; n := length[Y]; for i := 0 to m do s[i, 0] := 0; for j := 0 to n do s[0,j] := 0; for i := 1 to m do for j := 1 to n do if X[i] = Y[j] then s[i,j] = s[i-1,j-1]+1 else s[i,j] = s[i-1,j-1] – mu; if s[i,j] < s[i-1,j] – sigma then s[i-1, j] = s[i-1,j – sigma; if s[i,j] < s[i,j-1] – sigma; then s[i, j] = s[i,j-1] –sigma; if s[i,j] < 0 then s[i,j] = 0; Procedure Print-LAlign(s,X,i,j) 1 Find the location (i,j) for the largest score in s; 2 if i = 0 or j = 0 or s[i,j] = 0 then return 3 if s[i,j] = s[i-1,j-1] + 1 or s[I,j] = s[i-1,j-1] - mu then { Print-LAlign(c,X,i-1,j-1) print X[i] } else if s[i, j] = s[i-1,j] - sigma then Print-LAlign(s,X,i-1,j) else Pring-LAlign(s,X,i,j-1)
55
Homework problem: Calculate the table of scores and find LCS, the global alignment, and the local alignment with the largest score, respectively, for the following DNA sequence: AATGGCCTACTG TACGGCCTAGTC where mu = -0.8, sigma = 0.5. Project (Bonus): Implement LCS (4 points), global alignment (3 points), and local alignment algorithms (3 points). Find the global alignment, and the local alignment with the largest score, respectively, for the following DNA sequence: AATGGCCTACTG TACGGCCTAGTC where mu = -0.8, sigma = 0.5. Report requirements: problem description, algorithms, time complexity analysis, code, and testing result.
Similar presentations
© 2025 SlidePlayer.com. Inc.
All rights reserved.