CSCI-256 Data Structures & Algorithm Analysis Lecture Note: Some slides by Kevin Wayne. Copyright © 2005 Pearson-Addison Wesley. All rights reserved. 21
Dynamic Programming Over Intervals
Write down final recurrence (DONE IN CLASS)
What order to solve the sub-problems?
–Do shortest intervals first
–Running time: $O(n^3)$
–Example: ACCGGUAGU (DONE IN CLASS)

RNA(b_1, …, b_n) {
   Initialize Opt[i, j] = 0 whenever i ≥ j-4
   for k = 5, 6, …, n-1
      for i = 1, 2, …, n-k
         j = i + k
         Compute Opt[i, j] using recurrence
   return Opt[1, n]
}
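The recurrence was written down in class and is not reproduced in these notes. As a minimal sketch, the loop order above can be filled in with the standard RNA secondary-structure recurrence, which is an assumption here: Opt[i, j] is the larger of Opt[i, j-1] and, over every t with i ≤ t < j-4 whose base pairs with b_j, the value 1 + Opt[i, t-1] + Opt[t+1, j-1]. The function name `rna_max_pairs` and the restriction to A-U and C-G pairs are illustrative choices, and indices are 0-based.

```python
def rna_max_pairs(b):
    """Maximum number of non-crossing base pairs, with no sharp turns.

    Assumed recurrence (not taken from the slide itself):
    Opt[i][j] = max(Opt[i][j-1],
                    max over t in [i, j-5] with b[t] pairing b[j] of
                    1 + Opt[i][t-1] + Opt[t+1][j-1])
    """
    n = len(b)
    if n == 0:
        return 0
    pairs = {("A", "U"), ("U", "A"), ("C", "G"), ("G", "C")}
    # Opt[i][j] = 0 whenever i >= j-4 (already the initial value).
    opt = [[0] * n for _ in range(n)]
    for k in range(5, n):              # interval length minus one, shortest first
        for i in range(0, n - k):
            j = i + k
            best = opt[i][j - 1]       # case: b[j] is left unpaired
            for t in range(i, j - 4):  # case: b[j] pairs with b[t], t < j-4
                if (b[t], b[j]) in pairs:
                    left = opt[i][t - 1] if t > i else 0
                    best = max(best, 1 + left + opt[t + 1][j - 1])
            opt[i][j] = best
    return opt[0][n - 1]


print(rna_max_pairs("ACCGGUAGU"))  # → 2
```

The three nested loops (k, i, t) are where the $O(n^3)$ running time comes from.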
String Similarity
Consider a dictionary interface or a spell checker. How similar are two strings?
–ocurrance
–occurrence

o c u r r a n c e -
o c c u r r e n c e
6 mismatches, 1 gap

o c - u r r a n c e
o c c u r r e n c e
1 mismatch, 1 gap

o c - u r r - a n c e
o c c u r r e - n c e
0 mismatches, 3 gaps
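Counting mismatches and gaps in a given alignment is mechanical once both rows are written with `-` for a gap. The helper below is a small sketch; the name `count_mismatches_and_gaps` and the convention that both rows are equal-length strings are assumptions made for illustration.

```python
def count_mismatches_and_gaps(top, bottom):
    """Return (mismatches, gaps) for two aligned rows of equal length."""
    if len(top) != len(bottom):
        raise ValueError("aligned rows must have the same length")
    mismatches = gaps = 0
    for p, q in zip(top, bottom):
        if p == "-" or q == "-":
            gaps += 1          # one symbol is aligned against a gap
        elif p != q:
            mismatches += 1    # two different symbols are aligned
    return mismatches, gaps


print(count_mismatches_and_gaps("oc-urrance", "occurrence"))    # → (1, 1)
print(count_mismatches_and_gaps("oc-urr-ance", "occurre-nce"))  # → (0, 3)
```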
String Similarity
Dictionary interfaces and spell checkers are not the most computationally intensive applications of this type of problem.
Determining similarities among strings is one of the central computational problems facing molecular biologists today.
–Strings arise very naturally in biology: an organism's genome is divided into giant linear DNA molecules known as chromosomes, and a chromosome can be thought of as an enormous linear tape containing a string over the alphabet {A, C, G, T}.
–A certain substring in the DNA of one organism may code for a certain kind of toxin. If we discover a very "similar" substring in the DNA of another organism, we might hypothesize, before running any experiments, that it codes for a similar toxin.
Edit Distance
Applications
–Basis for Unix diff
–Speech recognition
–Computational biology
Edit distance [Levenshtein 1966, Needleman-Wunsch 1970]
–Gap penalty $\delta$; mismatch penalty $\alpha_{pq}$
–Cost = sum of gap and mismatch penalties

C T G A C C T A C C T
C C T G A C T A C A T
Cost: $\alpha_{TC} + \alpha_{GT} + \alpha_{AG} + 2\alpha_{CA}$

- C T G A C C T A C C T
C C T G A C - T A C A T
Cost: $2\delta + \alpha_{CA}$
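To make "cost = sum of gap and mismatch penalties" concrete, here is a short sketch that prices a fixed alignment. The function name `alignment_cost`, the choice to pass the mismatch penalty as a function, and the numeric penalties in the usage lines (gap 2, every mismatch 3) are all hypothetical values picked for illustration, not values from the lecture.

```python
def alignment_cost(top, bottom, delta, alpha):
    """Sum gap penalties (delta) and mismatch penalties alpha(p, q)."""
    if len(top) != len(bottom):
        raise ValueError("aligned rows must have the same length")
    cost = 0
    for p, q in zip(top, bottom):
        if p == "-" and q == "-":
            raise ValueError("a column cannot be a gap in both rows")
        if p == "-" or q == "-":
            cost += delta        # gap penalty
        elif p != q:
            cost += alpha(p, q)  # mismatch penalty for the pair (p, q)
    return cost


# Hypothetical penalties: every gap costs 2, every mismatch costs 3.
flat = lambda p, q: 3
print(alignment_cost("CTGACCTACCT", "CCTGACTACAT", 2, flat))    # → 15
print(alignment_cost("-CTGACCTACCT", "CCTGAC-TACAT", 2, flat))  # → 7
```

Under these made-up penalties the alignment with two gaps is cheaper; with other penalties the ranking can flip, which is why the penalties are parameters of the problem.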
Sequence Alignment
Goal: Given two strings $X = x_1 x_2 \ldots x_m$ and $Y = y_1 y_2 \ldots y_n$, find an alignment of minimum cost.
Def: An alignment M is a set of ordered pairs $x_i$-$y_j$ such that each item occurs in at most one pair and there are no crossings. Two pairs $x_i$-$y_j$ and $x_{i'}$-$y_{j'}$ cross if $i < i'$ but $j > j'$.
Ex: CTACCG vs. TACATG (DONE IN CLASS)
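The definition can be checked directly. The sketch below assumes an alignment is given as a list of 1-based index pairs (i, j) meaning $x_i$-$y_j$; the name `is_alignment` and this representation are illustrative assumptions. It relies on the observation that, once pairs are sorted by i, there is no crossing exactly when the j values are strictly increasing.

```python
def is_alignment(pairs, m, n):
    """Check that pairs (i, j), 1-based, form a valid alignment of
    x_1..x_m with y_1..y_n: each index used at most once, no crossings."""
    xs = [i for i, _ in pairs]
    ys = [j for _, j in pairs]
    if any(i < 1 or i > m for i in xs) or any(j < 1 or j > n for j in ys):
        return False  # index out of range
    if len(set(xs)) != len(xs) or len(set(ys)) != len(ys):
        return False  # some item occurs in more than one pair
    ordered = sorted(pairs)
    # With i increasing, a crossing shows up as a j that fails to increase.
    return all(a[1] < b[1] for a, b in zip(ordered, ordered[1:]))


# One possible alignment of CTACCG and TACATG (m = n = 6).
print(is_alignment([(2, 1), (3, 2), (4, 3), (5, 4), (6, 6)], 6, 6))  # → True
print(is_alignment([(1, 2), (2, 1)], 6, 6))                          # → False
```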
Sequence Alignment: Problem Structure
In an optimal alignment M of $X = x_1 x_2 \ldots x_m$ and $Y = y_1 y_2 \ldots y_n$, either $(x_m, y_n) \in M$ or $(x_m, y_n) \notin M$. That is, either the last two symbols in the two strings are matched to each other, or they aren't.
By itself, is this fact enough to provide us with a DP solution?
–Claim: In an optimal alignment M of X and Y, if $(x_m, y_n) \notin M$, then either $x_m$ or $y_n$ is not matched in M.
–Proof (DONE IN CLASS)
Sequence Alignment: Problem Structure
In an optimal alignment M of $X = x_1 x_2 \ldots x_m$ and $Y = y_1 y_2 \ldots y_n$, at least one of the following is true:
–$(x_m, y_n) \in M$; or
–$x_m$ is not matched; or
–$y_n$ is not matched
Def: OPT(i, j) = min cost of aligning strings $x_1 x_2 \ldots x_i$ and $y_1 y_2 \ldots y_j$.
Write down final recurrence (DONE IN CLASS)
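The recurrence was written down in class. A form consistent with the three cases above and with the Sequence-Alignment pseudocode in these notes would be the following, where $\delta$ is the gap penalty and $\alpha_{x_i y_j}$ the mismatch penalty from the Edit Distance slide; treat it as a reconstruction rather than the exact board work.

```latex
\mathrm{OPT}(i, j) =
\begin{cases}
  j\,\delta & \text{if } i = 0 \\
  i\,\delta & \text{if } j = 0 \\
  \min \left\{
    \begin{array}{l}
      \alpha_{x_i y_j} + \mathrm{OPT}(i-1,\, j-1) \\
      \delta + \mathrm{OPT}(i-1,\, j) \\
      \delta + \mathrm{OPT}(i,\, j-1)
    \end{array}
  \right. & \text{otherwise}
\end{cases}
```

The three terms inside the minimum correspond, in order, to $x_i$ being matched with $y_j$, $x_i$ being unmatched, and $y_j$ being unmatched.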
Sequence Alignment: Algorithm
Analysis: $\Theta(mn)$ time and space
–English words or sentences: $m, n \le 10$
–Computational biology: m = n = 100,000. 10 billion ops OK, but 10GB array?

Sequence-Alignment(m, n, x_1 x_2 ... x_m, y_1 y_2 ... y_n, δ, α) {
   for i = 0 to m
      Opt[i, 0] = iδ
   for j = 0 to n
      Opt[0, j] = jδ
   for i = 1 to m
      for j = 1 to n
         Opt[i, j] = min(α[x_i, y_j] + Opt[i-1, j-1],
                         δ + Opt[i-1, j],
                         δ + Opt[i, j-1])
   return Opt[m, n]
}
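A direct Python transcription of the pseudocode might look like the sketch below. The function names, the decision to pass the mismatch penalty as a function, and the unit penalties in the usage lines are assumptions for illustration; with gap penalty 1 and mismatch penalty 1 the result coincides with the classic Levenshtein edit distance.

```python
def sequence_alignment(x, y, delta, alpha):
    """Minimum alignment cost of strings x and y.

    delta: gap penalty; alpha(p, q): mismatch penalty (0 when p == q).
    Uses a full (m+1) x (n+1) table, so time and space are Theta(mn).
    """
    m, n = len(x), len(y)
    opt = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        opt[i][0] = i * delta   # align x_1..x_i against the empty string
    for j in range(n + 1):
        opt[0][j] = j * delta   # align the empty string against y_1..y_j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            opt[i][j] = min(
                alpha(x[i - 1], y[j - 1]) + opt[i - 1][j - 1],  # match x_i with y_j
                delta + opt[i - 1][j],                          # x_i unmatched
                delta + opt[i][j - 1],                          # y_j unmatched
            )
    return opt[m][n]


def unit_mismatch(p, q):
    """Hypothetical penalty: 0 for equal symbols, 1 otherwise."""
    return 0 if p == q else 1


print(sequence_alignment("ocurrance", "occurrence", 1, unit_mismatch))  # → 2
```

The table has (m+1)(n+1) entries, which is where the memory concern for m = n = 100,000 comes from.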