1 Theory I Algorithm Design and Analysis (11 - Edit distance and approximate string matching) Prof. Dr. Th. Ottmann.

Slides:



Advertisements
Similar presentations
Longest Common Subsequence
Advertisements

Dynamic Programming Nithya Tarek. Dynamic Programming Dynamic programming solves problems by combining the solutions to sub problems. Paradigms: Divide.
Introduction to Bioinformatics Algorithms Divide & Conquer Algorithms.
Outline The power of DNA Sequence Comparison The Change Problem
Inexact Matching of Strings General Problem –Input Strings S and T –Questions How distant is S from T? How similar is S to T? Solution Technique –Dynamic.
1 Prof. Dr. Th. Ottmann Theory I Algorithm Design and Analysis (12 - Text search: suffix trees)
Refining Edits and Alignments Υλικό βασισμένο στο κεφάλαιο 12 του βιβλίου: Dan Gusfield, Algorithms on Strings, Trees and Sequences, Cambridge University.
Comp 122, Fall 2004 Dynamic Programming. dynprog - 2 Lin / Devi Comp 122, Spring 2004 Longest Common Subsequence  Problem: Given 2 sequences, X =  x.
All Pairs Shortest Paths and Floyd-Warshall Algorithm CLRS 25.2
Dynamic Programming Optimization Problems Dynamic Programming Paradigm
Dynamic Programming Technique. D.P.2 The term Dynamic Programming comes from Control Theory, not computer science. Programming refers to the use of tables.
UNIVERSITY OF SOUTH CAROLINA College of Engineering & Information Technology Bioinformatics Algorithms and Data Structures Chapter 11 sections4-7 Lecturer:
Introduction to Bioinformatics Algorithms Dynamic Programming: Edit Distance.
Dynamic Programming Solving Optimization Problems.
Inexact Matching General Problem –Input Strings S and T –Questions How distant is S from T? How similar is S to T? Solution Technique –Dynamic programming.
Sequence Alignment Bioinformatics. Sequence Comparison Problem: Given two sequences S & T, are S and T similar? Need to establish some notion of similarity.
Distance Functions for Sequence Data and Time Series
Dynamic Programming Optimization Problems Dynamic Programming Paradigm
Introduction To Bioinformatics Tutorial 2. Local Alignment Tutorial 2.
Dynamic Programming1. 2 Outline and Reading Matrix Chain-Product (§5.3.1) The General Technique (§5.3.2) 0-1 Knapsack Problem (§5.3.3)
7 -1 Chapter 7 Dynamic Programming Fibonacci Sequence Fibonacci sequence: 0, 1, 1, 2, 3, 5, 8, 13, 21, … F i = i if i  1 F i = F i-1 + F i-2 if.
UNIVERSITY OF SOUTH CAROLINA College of Engineering & Information Technology Bioinformatics Algorithms and Data Structures Chapter 11: Core String Edits.
. Multiple Sequence Alignment Tutorial #4 © Ilan Gronau.
Class 2: Basic Sequence Alignment
15-853:Algorithms in the Real World
Dynamic Programming Introduction to Algorithms Dynamic Programming CSE 680 Prof. Roger Crawfis.
Dynamic Programming – Part 2 Introduction to Algorithms Dynamic Programming – Part 2 CSE 680 Prof. Roger Crawfis.
Sequence Alignment.
Space-Efficient Sequence Alignment Space-Efficient Sequence Alignment Bioinformatics 202 University of California, San Diego Lecture Notes No. 7 Dr. Pavel.
Qualitative approximation to Dynamic Time Warping similarity between time series data Blaž Strle, Martin Možina, Ivan Bratko Faculty of Computer and Information.
CSCE350 Algorithms and Data Structure Lecture 17 Jianjun Hu Department of Computer Science and Engineering University of South Carolina
Sequence Analysis CSC 487/687 Introduction to computing for Bioinformatics.
Resources: Problems in Evaluating Grammatical Error Detection Systems, Chodorow et al. Helping Our Own: The HOO 2011 Pilot Shared Task, Dale and Kilgarriff.
Minimum Edit Distance Definition of Minimum Edit Distance.
1 Approximate Algorithms (chap. 35) Motivation: –Many problems are NP-complete, so unlikely find efficient algorithms –Three ways to get around: If input.
1 Sequence Alignment Input: two sequences over the same alphabet Output: an alignment of the two sequences Example: u GCGCATGGATTGAGCGA u TGCGCCATTGATGACCA.
CS 3343: Analysis of Algorithms Lecture 18: More Examples on Dynamic Programming.
Dynamic Programming Min Edit Distance Longest Increasing Subsequence Climbing Stairs Minimum Path Sum.
Optimization Problems In which a set of choices must be made in order to arrive at an optimal (min/max) solution, subject to some constraints. (There may.
COSC 3101NJ. Elder Announcements Midterms are marked Assignment 2: –Still analyzing.
Chapter 7 Dynamic Programming 7.1 Introduction 7.2 The Longest Common Subsequence Problem 7.3 Matrix Chain Multiplication 7.4 The dynamic Programming Paradigm.
Dynamic Programming (Edit Distance). Edit Distance Input: – Two input strings S1 (of size n) and S2 (of size m) E.g., S1 = ATTTCTAGTGGGTAAA S2 = ATCTAGTTTAGGGATA.
TU/e Algorithms (2IL15) – Lecture 4 1 DYNAMIC PROGRAMMING II
Example 2 You are traveling by a canoe down a river and there are n trading posts along the way. Before starting your journey, you are given for each 1
Dynamic Programming for the Edit Distance Problem.
CSCI2950-C Genomes, Networks, and Cancer
All-pairs Shortest paths Transitive Closure
Definition of Minimum Edit Distance
Optimization problems such as
CS 3343: Analysis of Algorithms
Approximate Algorithms (chap. 35)
Distance Functions for Sequence Data and Time Series
CSCE 411 Design and Analysis of Algorithms
Single-Source All-Destinations Shortest Paths With Negative Costs
Sequence Alignment Using Dynamic Programming
Computational Biology Lecture #6: Matching and Alignment
Single-Source All-Destinations Shortest Paths With Negative Costs
Computational Biology Lecture #6: Matching and Alignment
Intro to Alignment Algorithms: Global and Local
Approximate Matching of Run-Length Compressed Strings
Problem Solving 4.
CSE 589 Applied Algorithms Spring 1999
Lecture 8. Paradigm #6 Dynamic Programming
Trevor Brown DC 2338, Office hour M3-4pm
Dynamic Programming-- Longest Common Subsequence
Bioinformatics Algorithms and Data Structures
Longest Common Subsequence
Dynamic Programming II DP over Intervals
Advanced Analysis of Algorithms
Presentation transcript:

1 Theory I Algorithm Design and Analysis (11 - Edit distance and approximate string matching) Prof. Dr. Th. Ottmann

2 Dynamic programming Algorithm design technique often used for optimization problems Generally usable for recursive approaches if the same partial solutions are required more than once Approach: store partial results in a table Advantage: improvement of complexity, often polynomial instead of exponential

3 Two different approaches Bottom-up: + controlled efficient table management, saves time + special optimized order of computation, saves space - requires extensive recoding of the original program -possible computation of unnecessary values Top-down: (Note-pad method) + original program changed only marginally or not at all + computes only those values that are actually required -separate table management takes additional time -table size often not optimal

4 Problem: similarity of strings Edit distance For two given A and B, compute, as efficiently as possible, the edit distance D(A,B) and a minimal sequence of edit operations which transforms A into B. i n f o r m a t i k - i n t e r p o l - a t i o n

5 Problem: similarity of strings Approximate string matching For a given text T, a pattern P, and a distance d, find all substrings P´ in T with D(P,P´)  d Sequence alignment Find optimal alignments of DNA sequences G A G C A - C T T G G A T T C T C G G C A C G T G G

6 Edit distance Given: two strings A = a 1 a a m and B = b 1 b 2... b n Wanted: minimal cost D(A,B) for a sequence of edit operations to transform A into B. Edit operations: 1. Replace one character in A by a character from B 2. Delete one character from A 3. Insert one character from B

7 Edit distance Cost model: We assume the triangle inequality holds for c: c(a,c)  c(a,b) + c(b,c)  Each character is changed at most once

8 Edit distance Trace as representation of edit sequences A = b a a c a a b c B = a b a c b c a c or using indels A = - b a a c a - a b c B = a b a - c b c a - c Edit distance (cost): 5 Division of an optimal trace results in two optimal sub-traces  dynamic programming can be used

9 Computation of the edit distance Let A i = a 1...a i and B j = b 1....b j D i,j = D(A i,B j ) A B

10 Computation of the edit distance Three possibilities of ending a trace: 1. a m is replaced by b n : D m,n = D m-1,n-1 + c(a m, b n ) 2. a m is deleted: D m,n = D m-1,n b n is inserted: D m,n = D m,n-1 + 1

11 Computation of the edit distance Recurrence relation, if m,n  1:  Computation of all D i,j is required, 0  i  m, 0  j  n. D i-1,j-1 D i-1,j D i,j D i,j-1 +d+d +1

12 Recurrence relation for the edit distance Base cases: D 0,0 = D( ,  ) = 0 D 0,j = D( , B j ) = j D i,0 = D(A i,  ) = i Recurrence equation:

13 Order of computation for the edit distance b 1 b 2 b 3 b b n a1a1 amam D i-1,j D i,j D i-1,j-1 D i,j-1 a2a2

14 Algorithm for the edit distance Algorithm edit_distance Input: two strings A = a a m and B = b 1... b n Output: the matrix D = (D ij ) 1 D[0,0] := 0 2 for i := 1 to m do D[i,0] = i 3 for j := 1 to n do D[0,j] = j 4 for i := 1 to m do 5 for j := 1 to n do 6D[i,j] := min( D[i - 1,j] + 1, 7 D[i,j - 1] + 1, 8 D[i –1, j – 1] + c(a i,b j ))

15 Example abac b1 a2 a3 c4

16 Computation of the edit operations Algorithm edit_operations (i,j) Input: matrix D (computed) 1 if i = 0 and j = 0 then return 2 if i  0 and D[i,j] = D[i – 1, j] then „delete a[i]“ 4 edit_operations (i – 1, j) 5 else if j  0 and D[i,j] = D[i, j – 1] then „insert b[j]“ 7 edit_operations (i, j – 1) 8 else /* D[i,j] = D[i – 1, j – 1 ] + c(a[i], b[j]) */ 9 „replace a[i] by b[j] “ 10 edit_operations (i – 1, j – 1) Initial call: edit_operations(m,n)

17 Trace graph of the edit operations B = a b a c A = b a a c

18 Sub-graph of the edit operations Trace graph: Overview of all possible traces for the transformation of A into B, directed edges from vertex (i, j) to (i + 1, j), (i, j + 1) and (i + 1, j + 1). Weights of the edges represent the edit costs. Costs are monotonic increasing along an optimal path. Each path with monotonic increasing cost from the upper left corner to the lower right corner represents an optimal trace.

19 Approximate string matching Given: two strings P = p 1 p 2... p m (pattern) and T = t 1 t 2... t n (text) Wanted: an interval [j´, j], 1  j´  j  n, such that the substring T j´, j = t j´... t j of T is the one with the greatest similarity to pattern P, i.e. for all other intervals [k´, k], 1  k´  k  n: D(P,T j´, j )  D(P, T k´, k ) T P j

20 Approximate string matching Naïve approach: for all 1  j´  j  n do compute D(P,T j´, j ) choose minimum

21 Approximate string matching Consider a related problem: T j i E(i, j) P For each text position j and each pattern position i compute the edit distance of the substring T j´,j of T ending at j which has the greatest similarity to P i.

22 Approximate string matching Method: for all 1  j  n do compute j´ such that D(P,T j´, j ) is minimal For 1  i  m and 0  j  n let: Optimal trace: P i = b a a c a a b c T j´, j = b a c b c a c

23 Approximate string matching Recurrence relation: Remark: j´ can be completely different for E i-1, j-1, E i – 1,j and E i, j – 1. A subtrace of an optimal trace is an optimal subtrace.

24 Approximate string matching Base cases: E 0,0 = E( ,  ) = 0 E i,0 = E(P j,  ) = i but E 0,j = E( ,T j ) = 0 Observation: The optimal edit sequence from P to T j´, j does not start with an insertion of t j´.

25 Approximate string matching T = a b b d a d c b c P = a d b b c Dependency graph

26 Approximate string matching Theorem If there is a path from E 0, j´- 1 to E i, j in the dependency graph, then T j´, j is a substring of T ending in j with the greatest similarity to P i and D(P i, T j´,j ) = E i, j