Chapter 6 Dynamic Programming
Algorithmic Paradigms
Greed. Build up a solution incrementally, myopically optimizing some local criterion.
Divide-and-conquer. Break up a problem into two sub-problems, solve each sub-problem independently, and combine the solutions to the sub-problems to form a solution to the original problem.
Dynamic programming. Break up a problem into a series of overlapping sub-problems, and build up solutions to larger and larger sub-problems.
Overlapping sub-problem = a sub-problem whose result can be reused several times.
Dynamic Programming History
Bellman. Pioneered the systematic study of dynamic programming in the 1950s.
Etymology.
  Dynamic programming = planning over time.
  The Secretary of Defense was hostile to mathematical research.
  Bellman sought an impressive name to avoid confrontation:
    "it's impossible to use dynamic in a pejorative sense"
    "something not even a Congressman could object to"
Reference: Bellman, R. E. Eye of the Hurricane, An Autobiography.
Dynamic Programming Applications
Areas.
  Bioinformatics.
  Control theory.
  Information theory.
  Operations research (e.g., the knapsack problem).
  Computer science: theory, graphics, AI, systems, ….
Some famous dynamic programming algorithms.
  Viterbi’s algorithm for hidden Markov models.
  Unix diff for comparing two files.
  Smith-Waterman for sequence alignment.
  Bellman-Ford for shortest path routing in networks.
  Cocke-Kasami-Younger for parsing context-free grammars.
6.4 Knapsack Problem
Knapsack Problem
Knapsack problem. Given n objects and a "knapsack":
  Item i weighs w_i > 0 kilograms and has value v_i > 0.
  The knapsack has a capacity of W kilograms.
  Goal: fill the knapsack so as to maximize total value.

  Item   Value   Weight
  1      1       1
  2      6       2
  3      18      5
  4      22      6
  5      28      7
  W = 11

Ex: { 3, 4 } has value 40.
Greedy: repeatedly add the item with maximum ratio v_i / w_i.
Ex: { 5, 2, 1 } achieves only value = 35, so greedy is not optimal.
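The greedy rule above can be checked on the slide's instance with a short sketch. The function name `greedy_by_ratio` is an assumption made for illustration, and "add" is read here as "add if the item still fits".

```python
def greedy_by_ratio(values, weights, capacity):
    """Repeatedly take the fitting item with the largest value/weight ratio."""
    order = sorted(range(len(values)),
                   key=lambda i: values[i] / weights[i], reverse=True)
    chosen, total, remaining = [], 0, capacity
    for i in order:
        if weights[i] <= remaining:      # skip items that no longer fit
            chosen.append(i + 1)         # report 1-based item numbers
            total += values[i]
            remaining -= weights[i]
    return chosen, total


# The instance from the table: five items, W = 11.
values = [1, 6, 18, 22, 28]
weights = [1, 2, 5, 6, 7]
print(greedy_by_ratio(values, weights, 11))   # ([5, 2, 1], 35)
```

Greedy reaches 35, while the subset { 3, 4 } reaches 40.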
Dynamic Programming: False Start
Def. OPT(i) = max profit subset of items 1, …, i.
Case 1: OPT does not select item i.
  OPT selects best of { 1, 2, …, i-1 }.
Case 2: OPT selects item i.
  Accepting item i does not immediately imply that we will have to reject other items.
  Without knowing what other items were selected before i, we don't even know if we have enough room for i.
Conclusion. Need more sub-problems!
Dynamic Programming: Adding a New Variable
Def. OPT(i, w) = max profit subset of items 1, …, i with weight limit w.
Case 1: OPT does not select item i.
  OPT selects best of { 1, 2, …, i-1 } using weight limit w.
Case 2: OPT selects item i.
  New weight limit = w - w_i.
  OPT selects best of { 1, 2, …, i-1 } using this new weight limit.
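The two cases combine into a single recurrence. The form below is a reconstruction from the case analysis above (the slide's own formula is not preserved in this text); the base case i = 0 and the guard w_i > w are the usual assumptions.

```latex
OPT(i, w) =
\begin{cases}
0 & \text{if } i = 0 \\
OPT(i-1,\, w) & \text{if } w_i > w \\
\max\{\, OPT(i-1,\, w),\; v_i + OPT(i-1,\, w - w_i) \,\} & \text{otherwise}
\end{cases}
```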
Knapsack Problem: Bottom-Up
Knapsack. Fill up an (n+1)-by-(W+1) array.

Input: n, W, w_1, …, w_n, v_1, …, v_n

for w = 0 to W
   M[0, w] = 0

for i = 1 to n
   for w = 0 to W
      if (w_i > w)
         M[i, w] = M[i-1, w]
      else
         M[i, w] = max { M[i-1, w], v_i + M[i-1, w-w_i] }

return M[n, W]
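A minimal Python sketch of the bottom-up procedure, assuming integer weights and 0-based Python lists for the inputs (row i of the table still refers to items 1, …, i). The name `knapsack_table` is hypothetical.

```python
def knapsack_table(values, weights, capacity):
    """Fill the (n+1) x (W+1) table M, where M[i][w] = OPT(i, w)."""
    n = len(values)
    M = [[0] * (capacity + 1) for _ in range(n + 1)]   # row 0 is all zeros
    for i in range(1, n + 1):
        v_i, w_i = values[i - 1], weights[i - 1]
        for w in range(capacity + 1):
            if w_i > w:
                M[i][w] = M[i - 1][w]                  # item i cannot fit
            else:
                M[i][w] = max(M[i - 1][w],             # skip item i
                              v_i + M[i - 1][w - w_i]) # take item i
    return M


M = knapsack_table([1, 6, 18, 22, 28], [1, 2, 5, 6, 7], 11)
print(M[5][11])   # 40
```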
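Knapsack Algorithm
Table M[i, w]: n + 1 rows (one per item prefix), W + 1 columns (weight limits w = 0, …, 11).

                      w = 0   1   2   3   4   5   6   7   8   9  10  11
{ }                       0   0   0   0   0   0   0   0   0   0   0   0
{ 1 }                     0   1   1   1   1   1   1   1   1   1   1   1
{ 1, 2 }                  0   1   6   7   7   7   7   7   7   7   7   7
{ 1, 2, 3 }               0   1   6   7   7  18  19  24  25  25  25  25
{ 1, 2, 3, 4 }            0   1   6   7   7  18  22  24  28  29  29  40
{ 1, 2, 3, 4, 5 }         0   1   6   7   7  18  22  28  29  34  35  40

OPT: { 4, 3 }, value = 22 + 18 = 40 (W = 11).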
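The optimal subset itself can be read off the filled table by walking back from M[n, W]. The sketch below is one common way to do this, not something the slide spells out; `knapsack_items` is a hypothetical name, and ties are resolved in favor of leaving an item out.

```python
def knapsack_items(values, weights, capacity):
    """Return (best value, chosen 1-based items) via table traceback."""
    n = len(values)
    M = [[0] * (capacity + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for w in range(capacity + 1):
            M[i][w] = M[i - 1][w]
            if weights[i - 1] <= w:
                M[i][w] = max(M[i][w],
                              values[i - 1] + M[i - 1][w - weights[i - 1]])
    # Walk back: item i was selected exactly when row i improved on row i-1.
    chosen, w = [], capacity
    for i in range(n, 0, -1):
        if M[i][w] != M[i - 1][w]:
            chosen.append(i)
            w -= weights[i - 1]
    return M[n][capacity], chosen


print(knapsack_items([1, 6, 18, 22, 28], [1, 2, 5, 6, 7], 11))   # (40, [4, 3])
```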
Knapsack Problem: Running Time
Running time. O(n W).
How much storage is needed? The table we used above is of size O(n W). For example, if W = 10^6 and n = 100, we need a total of 400 Mbytes (10^6 x 100 x 4 bytes). This is too high.
The space requirement can be reduced: once row j has been computed, we don’t need rows j - 1 and below. Space reduces to 2W entries = 8 Mbytes in the above case.
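The two-row idea can be sketched as follows. This is a minimal version assuming integer weights; `knapsack_value` is a hypothetical name. It returns only the optimal value, since the rows needed to trace back the chosen items are discarded.

```python
def knapsack_value(values, weights, capacity):
    """Optimal knapsack value using two rows of W + 1 entries each."""
    prev = [0] * (capacity + 1)              # row i-1 of the table
    for v_i, w_i in zip(values, weights):
        curr = prev[:]                       # default: item i not selected
        for w in range(w_i, capacity + 1):   # only limits where item i fits
            curr[w] = max(prev[w], v_i + prev[w - w_i])
        prev = curr                          # row i becomes the previous row
    return prev[capacity]


print(knapsack_value([1, 6, 18, 22, 28], [1, 2, 5, 6, 7], 11))   # 40
```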
6.6 Sequence Alignment
String Similarity
How similar are two strings?
  ocurrance
  occurrence
Key concept used by search engines to correct spelling errors. You get a response: “Did you mean xxxxx?”

  o c u r r a n c e -
  o c c u r r e n c e
  6 mismatches, 1 gap

  o c - u r r a n c e
  o c c u r r e n c e
  1 mismatch, 1 gap

  o c - u r r - a n c e
  o c c u r r e - n c e
  0 mismatches, 3 gaps
Google search for “allgorythm”
When a word is not in the dictionary, the search engine searches through all words in its database, finds the one closest to “allgorythm”, and asks whether you meant that word.
Edit Distance Applications.
Basis for Unix diff: determine how similar two files are.
Plagiarism detection: “document similarity” is a good indicator of whether a document was plagiarized.
Speech recognition: the sound pattern corresponding to a spoken word is matched against the sound patterns of various standard spoken words in a database.
Computational biology: Levenshtein introduced the concept of edit distance; Needleman-Wunsch first applied it to aligning biological sequences; our algorithm is closest in spirit to Smith-Waterman (but their algorithm is for local alignment).
Bonus application: spam filter. Compare a message with known spam messages.
Plagiarism checker – a simple test
Here is a paragraph taken from Chapter 6 of the text (the current chapter on Dynamic Programming), with some small changes made:
“we now turn to a more powerful and subtle design technique, dynamic programming. It'll be simpler to day precisely what characterizes dynamic programming after we have seen it in action, but the basic idea is drawn from the intuition behind divide-and-conquer ad is essentially the opposite of the greedy strategy: one implicitly explores the space of all possible solutions, by carefully decomposing things into a series of subproblems, and then building up correct solutions to larger and larger subproblems. In a way, we can thus view dynamic programming as operating dangerously close to the edge of brute-force search.”
Let us see whether the detector can find the original source:
Edit Distance
Edit distance. [Levenshtein 1966, Needleman-Wunsch 1970]
  Gap penalty δ; mismatch penalty α_pq.
  Cost = sum of gap and mismatch penalties.
  Presumably α_pp = 0, but this assumption is not needed (it could be a profit instead of a penalty).

  C T G A C C T A C C T
  C C T G A C T A C A T
  cost = α_TC + α_GT + α_AG + 2α_CA

  - C T G A C C T A C C T
  C C T G A C - T A C A T
  cost = 2δ + α_CA
Edit Distance
In the simplest model, which we study first, consider just two operations: insert and delete, each costing one unit.
Example: What is DIS(“abbaabab”, “bbabaaba”)?
  abbaabab -> bbaabab (delete a) -> bbabab (delete a) -> bbabaab (insert a) -> bbabaaba (insert a)
So DIS(abbaabab, bbabaaba) <= 4. Is there a way to perform this transformation with 3 or fewer edit operations?
Edit distance: Problem Structure
Def. DIS(i, j) = min edit distance between x_1 x_2 … x_i and y_1 y_2 … y_j.
Case 1: x_i = y_j. In this case, DIS(i, j) = DIS(i-1, j-1).
Case 2: x_i ≠ y_j. Now there are two choices. Either delete x_i and convert x_1 x_2 … x_{i-1} to y_1 y_2 … y_j, OR convert x_1 x_2 … x_i to y_1 y_2 … y_{j-1} and insert y_j. Hence DIS(i, j) = 1 + min { DIS(i-1, j), DIS(i, j-1) }.
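The case analysis above translates directly into a table-filling routine. The sketch below assumes the base cases DIS(i, 0) = i and DIS(0, j) = j (delete or insert every symbol); the name `indel_distance` is hypothetical.

```python
def indel_distance(x, y):
    """Edit distance with unit-cost insert and delete only."""
    m, n = len(x), len(y)
    D = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        D[i][0] = i                          # delete all of x_1..x_i
    for j in range(n + 1):
        D[0][j] = j                          # insert all of y_1..y_j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if x[i - 1] == y[j - 1]:         # Case 1: last symbols agree
                D[i][j] = D[i - 1][j - 1]
            else:                            # Case 2: delete x_i or insert y_j
                D[i][j] = 1 + min(D[i - 1][j], D[i][j - 1])
    return D[m][n]


print(indel_distance("abbaabab", "bbabaaba"))   # 4
```

On the example from the previous slide the result is 4, so no sequence of 3 or fewer insert/delete operations exists.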
Extending the model to include substitution
Now add another operation (change) to insert and delete:
  insert(i, c): insert c as the i-th symbol (cost = 1)
  delete(i): delete the i-th symbol (cost = 1)
  change(i, c, d): change the i-th symbol from c to d (cost = 1.5)
Definition of DIS(i, j):
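The slide's formula is not preserved in this text. One natural reading, offered here as an assumption, is that DIS(i, j) is the minimum of three options: delete x_i (cost 1 + DIS(i-1, j)), insert y_j (cost 1 + DIS(i, j-1)), or align x_i with y_j (cost DIS(i-1, j-1), plus 1.5 if the symbols differ). A sketch under that reading, with hypothetical names:

```python
def edit_distance(x, y, ins=1.0, dele=1.0, change=1.5):
    """Edit distance with insert, delete and change (assumed recurrence)."""
    m, n = len(x), len(y)
    D = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        D[i][0] = i * dele                   # delete x_1..x_i
    for j in range(1, n + 1):
        D[0][j] = j * ins                    # insert y_1..y_j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = 0.0 if x[i - 1] == y[j - 1] else change
            D[i][j] = min(D[i - 1][j] + dele,       # delete x_i
                          D[i][j - 1] + ins,        # insert y_j
                          D[i - 1][j - 1] + sub)    # keep or change x_i
    return D[m][n]


print(edit_distance("a", "b"))    # 1.5: one change beats delete + insert
print(edit_distance("ab", "ba"))  # 2.0: delete + insert beats two changes
```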
Alignment problem in molecular biology
Given two fragments F1 and F2 of DNA, we want to know how closely related they are. (Closeness is an indicator of how likely it is that one evolved from the other.) This kind of analysis plays a critical role in molecular biology. DNA sequences evolve by insertions, deletions and mutations. Thus the problem is almost the same as the one we studied in the previous slide.
Sequence Alignment
Goal: Given two strings X = x_1 x_2 … x_m and Y = y_1 y_2 … y_n, find an alignment of minimum cost.
Def. An alignment M is a set of ordered pairs x_i-y_j such that each item occurs in at most one pair and there are no crossings.
Def. The pairs x_i-y_j and x_i'-y_j' cross if i < i', but j > j'.
Ex: CTACCG vs. TACATG.
Sol: M = x_2-y_1, x_3-y_2, x_4-y_3, x_5-y_4, x_6-y_6.

  x_1 x_2 x_3 x_4 x_5     x_6
  C   T   A   C   C   -   G
  -   T   A   C   A   T   G
      y_1 y_2 y_3 y_4 y_5 y_6
Sequence Alignment: Problem Structure
Def. OPT(i, j) = min cost of aligning strings x_1 x_2 … x_i and y_1 y_2 … y_j.
Case 1: OPT matches x_i-y_j.
  Pay the mismatch penalty for x_i-y_j + min cost of aligning the two strings x_1 x_2 … x_{i-1} and y_1 y_2 … y_{j-1}.
Case 2a: OPT leaves x_i unmatched.
  Pay the gap penalty for x_i + min cost of aligning x_1 x_2 … x_{i-1} and y_1 y_2 … y_j.
Case 2b: OPT leaves y_j unmatched.
  Pay the gap penalty for y_j + min cost of aligning x_1 x_2 … x_i and y_1 y_2 … y_{j-1}.
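Written as one recurrence, with δ the gap penalty and α the mismatch penalty, the three cases read as follows. This is a reconstruction from the cases above; the base cases assume that aligning against an empty string costs one gap per symbol.

```latex
OPT(i, j) =
\begin{cases}
j\,\delta & \text{if } i = 0 \\
i\,\delta & \text{if } j = 0 \\
\min\{\, \alpha_{x_i y_j} + OPT(i-1,\, j-1),\;\;
         \delta + OPT(i-1,\, j),\;\;
         \delta + OPT(i,\, j-1) \,\} & \text{otherwise}
\end{cases}
```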
Sequence Alignment: Algorithm
Analysis. Θ(mn) time and space.
Computational biology: m = n = 10^5 or more is common. 10 billion ops OK (may take a few minutes), but a 10 GB array?

Sequence-Alignment(m, n, x_1 x_2 ... x_m, y_1 y_2 ... y_n, δ, α) {
   for i = 0 to m
      M[i, 0] = iδ
   for j = 0 to n
      M[0, j] = jδ

   for i = 1 to m
      for j = 1 to n
         M[i, j] = min(α[x_i, y_j] + M[i-1, j-1],
                       δ + M[i-1, j],
                       δ + M[i, j-1])
   return M[m, n]
}
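A runnable sketch of the procedure, assuming the mismatch penalty is supplied as a function `alpha(p, q)` and the gap penalty as a number `delta`. Both parameter shapes and the name `alignment_cost` are illustrative choices, and the penalty values in the usage lines are made up.

```python
def alignment_cost(x, y, delta, alpha):
    """Minimum alignment cost; M[i][j] = OPT(i, j)."""
    m, n = len(x), len(y)
    M = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        M[i][0] = i * delta                  # x_1..x_i against the empty string
    for j in range(n + 1):
        M[0][j] = j * delta                  # empty string against y_1..y_j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            M[i][j] = min(alpha(x[i - 1], y[j - 1]) + M[i - 1][j - 1],  # match
                          delta + M[i - 1][j],                          # gap for x_i
                          delta + M[i][j - 1])                          # gap for y_j
    return M[m][n]


# Hypothetical penalties: gap = 2, mismatch = 1, exact match = 0.
unit_mismatch = lambda p, q: 0 if p == q else 1
print(alignment_cost("ocurrance", "occurrence", 2, unit_mismatch))   # 3
```

With these penalties the best alignment is the one with 1 mismatch and 1 gap.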
Problem (HW #4, Problem 1)
Input: NY[j], j = 1, 2, …, n and SF[j], j = 1, 2, …, n, and moving cost k (= 10).
Output: optimum cost of operating the business for n months.
Solution idea: Define OPT(j, NY) as the optimal cost of operating the business for the first j months under the condition that the operation was located in NY in the j-th month, and similarly for OPT(j, SF). Now write a recurrence formula for OPT(j, NY) in terms of OPT(j-1, NY) and OPT(j-1, SF), and similarly for OPT(j, SF). From this it is easy to compute OPT(j, NY) and OPT(j, SF).
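One possible recurrence, stated as an assumption rather than the official solution: OPT(j, NY) = NY[j] + min(OPT(j-1, NY), k + OPT(j-1, SF)), and symmetrically for SF. The sketch also assumes the business may start in either city at no moving cost. The function name and the sample costs are hypothetical.

```python
def min_operating_cost(ny, sf, k):
    """Minimum total cost over len(ny) months with moving cost k."""
    if not ny:
        return 0
    opt_ny, opt_sf = ny[0], sf[0]            # month 1: start in either city
    for j in range(1, len(ny)):
        # Both new values are computed from the previous month's pair.
        opt_ny, opt_sf = (ny[j] + min(opt_ny, opt_sf + k),
                          sf[j] + min(opt_sf, opt_ny + k))
    return min(opt_ny, opt_sf)


# Made-up monthly costs for four months, with k = 10.
print(min_operating_cost([1, 3, 20, 30], [50, 20, 2, 4], 10))   # 20
```

Only the previous month's two values are kept, so the sketch uses O(1) extra space and O(n) time.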
6.5 RNA Secondary Structure
RNA Secondary Structure
RNA. String B = b_1 b_2 … b_n over alphabet { A, C, G, U }.
Secondary structure. RNA is single-stranded so it tends to loop back and form base pairs with itself. This structure is essential for understanding the behavior of the molecule.
Ex: GUCGAUUGAGCGAAUGUAACAACGUGGCUACGGCGAGA
[Figure: the example molecule folded into its secondary structure.]
Complementary base pairs: A-U, C-G.
RNA Secondary Structure
Secondary structure. A set of pairs S = { (b_i, b_j) } that satisfy:
  [Watson-Crick.] S is a matching and each pair in S is a Watson-Crick complement: A-U, U-A, C-G, or G-C.
  [No sharp turns.] The ends of each pair are separated by at least 4 intervening bases. If (b_i, b_j) ∈ S, then i < j - 4.
  [Non-crossing.] If (b_i, b_j) and (b_k, b_l) are two pairs in S, then we cannot have i < k < j < l.
Free energy. The usual hypothesis is that an RNA molecule will form the secondary structure with the optimum total free energy (approximated here by the number of base pairs).
Goal. Given an RNA molecule B = b_1 b_2 … b_n, find a secondary structure S that maximizes the number of base pairs.
RNA Secondary Structure: Examples
[Figure: three example structures with base pairs drawn between bases. The first is valid (ok), the second violates the no-sharp-turns condition (sharp turn), and the third violates the non-crossing condition (crossing).]
RNA Secondary Structure: Sub-problems
First attempt. OPT(j) = maximum number of base pairs in a secondary structure of the substring b_1 b_2 … b_j.
Difficulty. Matching b_t and b_n results in two sub-problems:
  Finding secondary structure in b_1 b_2 … b_{t-1}. This is OPT(t-1).
  Finding secondary structure in b_{t+1} b_{t+2} … b_{n-1}. This is not of the form OPT(j): need more sub-problems.
Dynamic Programming formulation
Notation. OPT(i, j) = maximum number of base pairs in a secondary structure of the substring b_i b_{i+1} … b_j.
Case 1. If i ≥ j - 4: OPT(i, j) = 0 by the no-sharp-turns condition.
Case 2. Base b_j is not involved in a pair: OPT(i, j) = OPT(i, j-1).
Case 3. Base b_j pairs with b_t for some i ≤ t < j - 4. The non-crossing constraint decouples the resulting sub-problems:
  OPT(i, j) = 1 + max_t { OPT(i, t-1) + OPT(t+1, j-1) }
  Take the max over t such that i ≤ t < j - 4 and b_t and b_j are Watson-Crick complements.
When i < j - 4, OPT(i, j) is the larger of the Case 2 and Case 3 values.
Dynamic Programming - algorithm
Q. What order to solve the sub-problems?
A. Do shortest intervals first.

RNA(b_1, …, b_n) {
   for k = 5, 6, …, n-1
      for i = 1, 2, …, n-k
         j = i + k
         Compute M[i, j] using recurrence
   return M[1, n]
}

Running time. O(n^3).
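A Python sketch of this procedure, filling intervals in order of increasing length k = j - i. The table is padded by one row and column on each side so that empty intervals such as M[i][i-1] read as 0; that padding and the name `rna_max_pairs` are implementation choices made here, not part of the slide.

```python
def rna_max_pairs(b):
    """Maximum number of base pairs in a secondary structure of string b."""
    n = len(b)
    complements = {("A", "U"), ("U", "A"), ("C", "G"), ("G", "C")}
    # 1-based indices; entries with i >= j - 4 stay 0 (no sharp turns).
    M = [[0] * (n + 2) for _ in range(n + 2)]
    for k in range(5, n):                    # interval length k = j - i
        for i in range(1, n - k + 1):
            j = i + k
            best = M[i][j - 1]               # Case 2: b_j is unpaired
            for t in range(i, j - 4):        # Case 3: b_j pairs with b_t
                if (b[t - 1], b[j - 1]) in complements:
                    best = max(best, 1 + M[i][t - 1] + M[t + 1][j - 1])
            M[i][j] = best
    return M[1][n]


print(rna_max_pairs("ACCGGUAGU"))   # 2
```

The three nested loops over k, i and t give the O(n^3) running time.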
