Dynamic Programming (Edit Distance)


Edit Distance
Input:
– Two input strings S1 (of size n) and S2 (of size m)
  E.g., S1 = ATTTCTAGTGGGTAAA, S2 = ATCTAGTTTAGGGATA
Target:
– Find the smallest distance between S1 and S2
– In other words, the smallest number of edit operations needed to convert S1 into S2
Edit Operations:
– Insert (i), Delete (d), Align (a)

Example
S1: TCGACGTCA
S2: TGACGTGC
Three operations convert S1 into S2:
S1 (with G inserted): TCGACGTGCA
S2 (with gaps):       T-GACGTGC-
– Insert G (position 8)
– Delete C (position 2) and A (position 10)
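The three operations of this example can be checked mechanically. The sketch below is only an illustration and is not part of the slides: the helper names `insert_at`, `delete_at` and `example_converts`, and the fixed-size buffer, are assumptions. Positions are 1-based, as on the slide, and refer to S1 after G has been inserted.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical helpers (not from the slides); positions are 1-based. */

/* Insert character c so that it ends up at position pos of buf.
   The caller must make sure buf has room for one more character. */
void insert_at(char *buf, size_t pos, char c)
{
    size_t len = strlen(buf);
    assert(pos >= 1 && pos <= len + 1);
    /* Shift the tail, including the terminating '\0', one slot right. */
    memmove(buf + pos, buf + pos - 1, len - (pos - 1) + 1);
    buf[pos - 1] = c;
}

/* Delete the character at position pos of buf. */
void delete_at(char *buf, size_t pos)
{
    size_t len = strlen(buf);
    assert(pos >= 1 && pos <= len);
    /* Shift the tail, including the terminating '\0', one slot left. */
    memmove(buf + pos - 1, buf + pos, len - pos + 1);
}

/* Apply the slide's three operations to S1.
   Returns 1 if the result equals S2, 0 otherwise. */
int example_converts(void)
{
    char s1[32] = "TCGACGTCA";
    insert_at(s1, 8, 'G');   /* TCGACGTGCA */
    delete_at(s1, 10);       /* TCGACGTGC: the trailing A is gone */
    delete_at(s1, 2);        /* TGACGTGC:  the C at position 2 is gone */
    return strcmp(s1, "TGACGTGC") == 0;
}
```

The deletions are applied from right to left so that the position numbers on the slide stay valid while the string shrinks.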

Edit Distance
[Figure: dynamic-programming matrix. Rows are labeled with the characters of S1 = TGACGTGC and columns with the characters of S2 = TCGACGTCA (the two strings of the previous example, with their roles swapped). Row 0 and column 0 stand for the empty string.]
** Edit operations are applied to S1 to convert it into S2
Cost of insert: i
Cost of delete: d
Cost of align: a
Corner cell: 0 (S1 is empty, S2 is empty)
Row 0, column 1: i (cost of inserting T into S1 to match S2)

Edit Distance
Row 0, column 2: 2i (cost of inserting TC into S1 to match S2)

Edit Distance
Row 0 completed: 0, i, 2i, 3i, 4i, 5i, 6i, 7i, 8i, 9i

Edit Distance
Row 1, column 0: 1d (cost of deleting T from S1 to match S2)

Edit Distance
Row 2, column 0: 2d (cost of deleting TG from S1 to match S2)

Edit Distance
Column 0 completed: 0, 1d, 2d, 3d, 4d, 5d, 6d, 7d, 8d

Edit Distance
[Figure: the matrix (rows: S1, columns: S2) with row 0 filled with 0, i, 2i, ..., 9i and column 0 filled with 0, 1d, 2d, ..., 8d.]
What we did so far is called the Initialization Phase:
M[0][j] = j * cost of insert (for all j)
M[k][0] = k * cost of delete (for all k)
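The initialization phase can be sketched in C as follows. This is an illustration under assumptions that are not in the slides: the table is stored as one flat, row-major array, and the function name `init_table` and its parameters are hypothetical.

```c
#include <assert.h>
#include <stdlib.h>

/* Allocate an (n+1) x (m+1) table, stored row by row, and perform the
   initialization phase:
       M[0][j] = j * ins_cost   (for all j)
       M[k][0] = k * del_cost   (for all k)
   All other cells are set to 0 for now.
   Returns NULL if allocation fails; the caller frees the result. */
int *init_table(size_t n, size_t m, int ins_cost, int del_cost)
{
    assert(ins_cost >= 0 && del_cost >= 0);
    size_t cols = m + 1;
    int *M = calloc((n + 1) * cols, sizeof *M);
    if (M == NULL)
        return NULL;
    for (size_t j = 0; j <= m; j++)
        M[0 * cols + j] = (int)j * ins_cost;   /* row 0: insert j characters */
    for (size_t k = 0; k <= n; k++)
        M[k * cols + 0] = (int)k * del_cost;   /* column 0: delete k characters */
    return M;
}
```

With this layout, cell M[k][j] lives at index `k * (m + 1) + j`.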

Edit Distance
[Figure: the initialized matrix (rows: S1, columns: S2).]
For simplicity, let's assume the following costs:
Cost of insert (i) = 1
Cost of delete (d) = 1
Cost of align (a) = 0 if the aligned characters are the same
Cost of align (a) = 1 if the aligned characters are different

Edit Distance
[Figure: the matrix (rows: S1, columns: S2) with cells (i, j) and (n, m) marked.]
Cell (i, j): smallest cost of converting S1[1..i] to match S2[1..j]
Cell (n, m): our goal is to convert S1[1..n] to match S2[1..m]

Edit Distance
[Figure: the matrix (rows: S1, columns: S2) with cell (i, j) marked.]
M[i, j] = Min of:
– M[i-1, j-1] + cost of aligning S1[i] with S2[j]
– M[i-1, j] + cost of deleting S1[i]
– M[i, j-1] + cost of inserting S2[j] into S1

Edit Distance: Case 1
M[i-1, j-1] + cost of aligning S1[i] with S2[j]
Optimal cost of matching TGA from S1 with TCGA from S2, plus aligning C with C

Edit Distance: Case 2
M[i-1, j] + cost of deleting S1[i]
Optimal cost of matching TGA from S1 with TCGAC from S2, plus deleting C from S1

Edit Distance: Case 3
M[i, j-1] + cost of inserting S2[j] into S1
Optimal cost of matching TGAC from S1 with TCGA from S2, plus inserting C into S1
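The three cases can be written out directly. The following is a minimal sketch that assumes the unit costs from the simplifying assumption (insert = 1, delete = 1, align = 0 or 1); the names `struct cases`, `eval_cases` and `best_case` are hypothetical and do not come from the slides.

```c
#include <assert.h>

/* Costs of the three ways to reach cell (i, j). */
struct cases {
    int align;   /* Case 1: M[i-1][j-1] + cost of aligning S1[i] with S2[j] */
    int del;     /* Case 2: M[i-1][j]   + cost of deleting S1[i]            */
    int ins;     /* Case 3: M[i][j-1]   + cost of inserting S2[j] into S1   */
};

/* Evaluate the three cases from the three neighbouring cells:
   diag = M[i-1][j-1], up = M[i-1][j], left = M[i][j-1];
   c1 = S1[i], c2 = S2[j]. Unit costs are assumed. */
struct cases eval_cases(int diag, int up, int left, char c1, char c2)
{
    assert(diag >= 0 && up >= 0 && left >= 0);
    struct cases c;
    c.align = diag + (c1 == c2 ? 0 : 1);
    c.del   = up + 1;
    c.ins   = left + 1;
    return c;
}

/* M[i][j] is the smallest of the three case costs. */
int best_case(struct cases c)
{
    int best = c.align;
    if (c.del < best) best = c.del;
    if (c.ins < best) best = c.ins;
    return best;
}
```

For the first cell of the example matrix (T aligned with T, with neighbours 0, 1 and 1 from the initialization), `eval_cases(0, 1, 1, 'T', 'T')` yields 0, 2 and 2, so the cell receives 0.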

Edit Distance: Complete Example
[Figure: the matrix (rows: S1, columns: S2), filled in one cell at a time.]
M[i, j] = Min of:
– M[i-1, j-1] + cost of aligning S1[i] with S2[j]
– M[i-1, j] + cost of deleting S1[i]
– M[i, j-1] + cost of inserting S2[j] into S1
First cell, M[1, 1] = 0:
Case 1: = 0, Case 2: = 2, Case 3: = 2

Edit Distance: Complete Example
Next cell, M[1, 2] = 1:
Case 1: = 2, Case 2: = 3, Case 3: = 1

Edit Distance: Complete Example
[Figure: the matrix with further cells filled in.]

Edit Distance: Complete Example
A later cell:
Case 1: = 4, Case 2: = 6, Case 3: = 4
Two equivalent options to reach this cell (Case 1 and Case 3)

Edit Distance: Complete Example
[Figure: the completed matrix.]
Final answer, in cell M[n, m]: to convert S1 into S2 we need 3 edit operations

Summary of Steps
>> We consider all combinations (all possible alignments) (navigate the solution space)
>> We start with small sub-problems and solve them optimally (optimal sub-structure)
>> At each step, for a problem of size K, we use the results of the possible K-1 sub-problems to find the best answer (we need to keep these results, not compute them again)
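The last point, keeping results instead of computing them again, can also be illustrated with a top-down (memoized) formulation of the same recurrence. This is a sketch under assumptions: the slides fill the table bottom-up, so the recursive form, the unit costs, and the names `struct memo`, `solve` and `edit_distance_memo` are all illustrative choices.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Shared state for the recursion; all names here are illustrative. */
struct memo {
    const char *s1, *s2;
    size_t cols;   /* strlen(s2) + 1 */
    int *cost;     /* cost[i * cols + j], or -1 if not computed yet */
};

/* Smallest cost of converting the first i characters of s1 into the
   first j characters of s2, with unit costs. */
int solve(struct memo *mo, size_t i, size_t j)
{
    if (i == 0) return (int)j;   /* insert j characters */
    if (j == 0) return (int)i;   /* delete i characters */
    int *slot = &mo->cost[i * mo->cols + j];
    if (*slot >= 0)
        return *slot;            /* already solved: reuse the stored result */
    int best = solve(mo, i - 1, j - 1)
             + (mo->s1[i - 1] == mo->s2[j - 1] ? 0 : 1);   /* align  */
    int del = solve(mo, i - 1, j) + 1;                     /* delete */
    int ins = solve(mo, i, j - 1) + 1;                     /* insert */
    if (del < best) best = del;
    if (ins < best) best = ins;
    *slot = best;                /* keep the result for later calls */
    return best;
}

/* Returns the edit distance, or -1 if memory allocation fails. */
int edit_distance_memo(const char *s1, const char *s2)
{
    assert(s1 != NULL && s2 != NULL);
    size_t n = strlen(s1), m = strlen(s2);
    size_t cells = (n + 1) * (m + 1);
    struct memo mo = { s1, s2, m + 1, NULL };
    mo.cost = malloc(cells * sizeof *mo.cost);
    if (mo.cost == NULL)
        return -1;
    for (size_t k = 0; k < cells; k++)
        mo.cost[k] = -1;         /* mark every sub-problem as unsolved */
    int result = solve(&mo, n, m);
    free(mo.cost);
    return result;
}
```

Without the stored results the recursion would solve the same (i, j) sub-problem many times; with them, each sub-problem is solved at most once.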

Edit Distance: Algorithm
// S1 of size n, S2 of size m (C strings are 0-indexed, so S1[x-1] is the x-th character)
int matrix[n+1][m+1];
// Initialization step
for (x = 0; x <= n; x++) matrix[x][0] = x;
for (y = 1; y <= m; y++) matrix[0][y] = y;
for (x = 1; x <= n; x++)
    for (y = 1; y <= m; y++)
        if (S1[x-1] == S2[y-1])
            // If matching, then go diagonal with 0 additional cost
            matrix[x][y] = matrix[x-1][y-1];
        else
            // Otherwise consider all three options (align with cost 1, insert, delete) and take the least
            matrix[x][y] = 1 + min(matrix[x-1][y-1], min(matrix[x][y-1], matrix[x-1][y]));
return matrix[n][m];
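The algorithm above can be turned into a complete C function. The version below is a sketch, not the author's code: it keeps the insert, delete and mismatch costs as parameters (the i, d and a of the earlier slides), and the names `struct costs` and `edit_distance`, as well as the flat-array layout, are assumptions made for illustration.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Per-operation costs; the struct and its field names are illustrative. */
struct costs {
    int ins;        /* i: cost of an insert */
    int del;        /* d: cost of a delete  */
    int mismatch;   /* a: cost of aligning two different characters
                          (aligning equal characters costs 0) */
};

/* Returns the edit distance from s1 to s2, or -1 if allocation fails. */
int edit_distance(const char *s1, const char *s2, struct costs c)
{
    assert(s1 != NULL && s2 != NULL);
    size_t n = strlen(s1), m = strlen(s2), cols = m + 1;
    int *M = malloc((n + 1) * cols * sizeof *M);
    if (M == NULL)
        return -1;

    /* Initialization step: first row, then first column of each row. */
    for (size_t j = 0; j <= m; j++)
        M[j] = (int)j * c.ins;
    for (size_t i = 1; i <= n; i++) {
        M[i * cols] = (int)i * c.del;
        for (size_t j = 1; j <= m; j++) {
            /* C strings are 0-based, so S1[i] is s1[i - 1]. */
            int best = M[(i - 1) * cols + (j - 1)]
                     + (s1[i - 1] == s2[j - 1] ? 0 : c.mismatch);
            int del = M[(i - 1) * cols + j] + c.del;
            int ins = M[i * cols + (j - 1)] + c.ins;
            if (del < best) best = del;
            if (ins < best) best = ins;
            M[i * cols + j] = best;
        }
    }
    int result = M[n * cols + m];
    free(M);
    return result;
}
```

With unit costs `{1, 1, 1}` and the strings of the worked example, the function returns 3. Setting `mismatch` to `ins + del` makes a mismatched align no cheaper than a delete followed by an insert.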

Edit Distance: Algorithm Analysis
>> We compute (n * m) cells
>> For each cell we compare at most 3 surrounding cells
Time complexity is O(nm)
Space complexity is also O(nm)

How to Backtrack
Keep extra information with each cell c:
– From where did you arrive at c (diagonal, left, or top)?
We now know that the cost is 3. What are the operations, and in what order?
In dynamic programming in general, to backtrack you may need to record which optimal sub-problem you used at each step.

Backtrack
[Figure: the completed matrix (rows: S1, columns: S2) with the backtracking path marked. A diagonal step means align, a step from the left means insert, a step from the top means delete.]
Operations on S1:
– Original S1: TGACGTGC
– Insert C (position 2)
– Delete G (position 7)
– Insert A (position 9)
S1 after the operations: TCGACGTCA
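Backtracking with stored directions can be sketched as below. This is an illustration under assumptions: unit costs, a second array `from` that records for each cell whether it was reached by align ('a'), delete ('d') or insert ('i'), and ties broken in the order align, delete, insert. The function name `edit_script` and the output format are hypothetical.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Fill the table, remember for every cell from where it was reached,
   then walk back from M[n][m] to M[0][0]. One letter per step is written
   to ops ('a' align, 'd' delete, 'i' insert), in left-to-right order.
   ops must have room for strlen(s1) + strlen(s2) + 1 characters.
   Returns the edit distance, or -1 if allocation fails. */
int edit_script(const char *s1, const char *s2, char *ops)
{
    assert(s1 != NULL && s2 != NULL && ops != NULL);
    size_t n = strlen(s1), m = strlen(s2), cols = m + 1;
    int *M = malloc((n + 1) * cols * sizeof *M);
    char *from = malloc((n + 1) * cols);
    if (M == NULL || from == NULL) {
        free(M);
        free(from);
        return -1;
    }

    M[0] = 0;
    from[0] = '\0';
    for (size_t j = 1; j <= m; j++) { M[j] = (int)j; from[j] = 'i'; }
    for (size_t i = 1; i <= n; i++) {
        M[i * cols] = (int)i;
        from[i * cols] = 'd';
        for (size_t j = 1; j <= m; j++) {
            int best = M[(i - 1) * cols + j - 1] + (s1[i - 1] != s2[j - 1]);
            char dir = 'a';                               /* diagonal */
            int del = M[(i - 1) * cols + j] + 1;
            int ins = M[i * cols + j - 1] + 1;
            if (del < best) { best = del; dir = 'd'; }    /* top  */
            if (ins < best) { best = ins; dir = 'i'; }    /* left */
            M[i * cols + j] = best;
            from[i * cols + j] = dir;
        }
    }

    /* Backtrack along the stored directions. */
    size_t i = n, j = m, len = 0;
    while (i > 0 || j > 0) {
        char dir = from[i * cols + j];
        ops[len++] = dir;
        if (dir == 'a') { i--; j--; }
        else if (dir == 'd') { i--; }
        else { j--; }
    }
    /* The walk produced the operations last-to-first; reverse them. */
    for (size_t k = 0; k < len / 2; k++) {
        char t = ops[k];
        ops[k] = ops[len - 1 - k];
        ops[len - 1 - k] = t;
    }
    ops[len] = '\0';

    int result = M[n * cols + m];
    free(M);
    free(from);
    return result;
}
```

An 'a' step covers both aligning equal characters (cost 0) and aligning different ones (cost 1). When several optimal paths exist, the operations reported depend on the tie-breaking order, so they may differ from the path drawn on the slide while having the same total cost.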