1 String Edit Distance Matching Problem With Moves. Graham Cormode, S. Muthukrishnan. November 2001.

2 Pattern Matching. Text T of length n, pattern P of length m. Our goal: find "good" matches of P in T, as measured by some edit distance function d(·,·). For every i = 1, 2, …, n find: D[i] = the distance from P to the closest substring of T starting at position i.

3 Pattern Matching example. T = "assasin", P = "ssi". D[1] = 2: e.g. align "_ssi" against "assa" (one deletion, one replacement). D[3] = 1: align "s_si" against "sasi" (one insertion). Naively, computing all the D[i] takes O(mn³) time.
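As a concrete baseline, the naive approach can be sketched in Python (a minimal sketch with my own function names; for simplicity it uses the classic edit distance without substring moves, and `naive_D` returns a 0-indexed list, so the slide's D[1] is `D[0]` here):

```python
def edit_distance(x, y):
    """Classic Levenshtein DP: insertions, deletions, replacements (no moves)."""
    m, n = len(x), len(y)
    prev = list(range(n + 1))
    for i in range(1, m + 1):
        cur = [i] + [0] * n
        for j in range(1, n + 1):
            cur[j] = min(prev[j] + 1,        # delete x[i-1]
                         cur[j - 1] + 1,     # insert y[j-1]
                         prev[j - 1] + (x[i - 1] != y[j - 1]))  # replace
        prev = cur
    return prev[n]

def naive_D(T, P):
    """D[i] = distance from P to the closest substring of T starting at i."""
    n = len(T)
    return [min(edit_distance(P, T[i:j]) for j in range(i, n + 1))
            for i in range(n)]
```

On the slide's example, `naive_D("assasin", "ssi")` yields 2 at position 0 and 1 at position 2, matching D[1] = 2 and D[3] = 1 above. Each `edit_distance` call is O(nm) and there are O(n²) substrings, hence the O(mn³) total.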

4 The main idea. d(X,Y) = the smallest number of operations needed to turn X into Y, where the operations are insertion, deletion, or replacement of a character, and a move of a substring. The idea is to approximate each D[i] up to a factor of O(log n log*n).

5 General algorithm. Embed the string edit distance into the L1 vector distance, up to an O(log n log*n) factor: compute the vector with a single pass over the string. Find the vector representation of O(n) substrings. Do all this in O(n log n) time. (A deterministic algorithm with an approximate result.)

6 Edit Sensitive Parsing (ESP) for the embedding. We want to parse the string so that edit operations have only a limited effect on the parsing (an edit operation on character i changes the parsing only in the "neighborhood" of i). For example: "abcbbabcbdfgj" vs. "xyzbbabcbdfgj" parse as "a bc bb abc b dfgj" vs. "xy z bb abc b dfgj".

7 Edit Sensitive Parsing (ESP) for the embedding. In practice, we find landmarks in the string, based only on their local neighborhood, and parse around them. Local maxima are good landmarks, e.g. in "abcegiklmrtabc", but they may be far apart in large alphabets, so we first reduce the alphabet.

8 Edit Sensitive Parsing (ESP): choosing the landmarks. We use a technique called alphabet reduction. Write each character in binary, then compute Label(A[i]) = 2l + bit(l, A[i]), where l is the location of the first bit that differs between A[i] and A[i-1], and bit(l, A[i]) is the value of bit l in A[i]. Example text: c a b a g e f a c e d.

9 Edit Sensitive Parsing (ESP): alphabet reduction (cont.) In each iteration the alphabet is reduced from size |Σ| to 2⌈log|Σ|⌉. After log*|Σ| iterations we get |Σ| ≤ 6. Then we reduce from {0,…,5} down to {0,1,2}, ensuring no adjacent pair is identical: each of the values 3, 4, 5 in turn is replaced by the smallest value in {0,1,2} not used by a neighbor.
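The two reduction steps above can be sketched in Python (my own function names; the leftmost character has no left neighbor, and this sketch simply drops it — the full construction needs a boundary convention there):

```python
def reduce_once(labels):
    """One alphabet-reduction pass. Each position (except the dropped leftmost)
    gets the new label 2*l + bit, where l is the index of the lowest bit in
    which its label differs from its left neighbor's, and bit is that bit's
    value in the current label."""
    out = []
    for prev, cur in zip(labels, labels[1:]):
        diff = prev ^ cur                    # adjacent labels differ, so diff != 0
        l = (diff & -diff).bit_length() - 1  # index of the lowest differing bit
        out.append(2 * l + ((cur >> l) & 1))
    return out

def to_three(labels):
    """Reduce a labelling over {0..5} to {0,1,2}, keeping adjacent labels
    distinct: replace each value 3, 4, 5 in turn by the smallest value in
    {0,1,2} not used by either neighbor."""
    labels = list(labels)
    for v in (3, 4, 5):
        for i, x in enumerate(labels):
            if x == v:
                taken = {labels[i - 1] if i > 0 else -1,
                         labels[i + 1] if i + 1 < len(labels) else -1}
                labels[i] = min(c for c in (0, 1, 2) if c not in taken)
    return labels
```

For example, `reduce_once([2, 0, 1, 6, 4, 0, 2])` gives `[2, 1, 0, 2, 4, 3]`, and `to_three` of that yields labels over {0,1,2} with no two adjacent labels equal.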

10 Edit Sensitive Parsing (ESP): alphabet reduction (cont.) Properties of the final labels: 1) The final alphabet is {0,1,2}. 2) No adjacent pair is identical. 3) It takes log*|Σ| iterations. 4) Each label depends on O(log*|Σ|) characters to its left.

11 Edit Sensitive Parsing (ESP). So how do we choose the landmarks? For repeats, parse in a regular way: aaaaaaa -> (aaa)(aa)(aa). For varying substrings, use alphabet reduction and then mark landmarks as follows:

12 Edit Sensitive Parsing (ESP). So how do we choose the landmarks? Consider the final labels, and mark any character that is a local maximum (greater than both left and right neighbors). Example text: c a b a g e f a c e d.

13 Edit Sensitive Parsing (ESP). So how do we choose the landmarks? Consider the final labels, and mark any character that is a local maximum (greater than both left and right neighbors). Example text: c a b a g e f a c e d. Then mark any local minimum that is not adjacent to a marked character.

14 Edit Sensitive Parsing (ESP). So how do we choose the landmarks? Consider the final labels, and mark any character that is a local maximum (greater than both left and right neighbors). Example text: c a b a g e f a c e d. Then mark any local minimum that is not adjacent to a marked character. Clearly, the distance between marked labels is 2 or 3.
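The marking rule can be sketched as follows (a hypothetical `mark_landmarks`, assuming the final labels are given as a list over {0,1,2}; the boundary positions are left unmarked in this sketch):

```python
def mark_landmarks(final):
    """Landmark selection over the final {0,1,2} labels: first mark every
    local maximum (strictly greater than both neighbors), then mark every
    local minimum not adjacent to an already-marked position.
    Returns the marked positions, 0-indexed."""
    n = len(final)
    marked = [False] * n
    for i in range(1, n - 1):
        if final[i] > final[i - 1] and final[i] > final[i + 1]:
            marked[i] = True
    for i in range(1, n - 1):
        if (final[i] < final[i - 1] and final[i] < final[i + 1]
                and not (marked[i - 1] or marked[i + 1])):
            marked[i] = True
    return [i for i, m in enumerate(marked) if m]
```

For instance, on the final labels [2, 1, 0, 2, 1, 0, 1] the maximum at position 3 is marked first, then the minimum at position 5; the gap between marks is 2, consistent with the 2-or-3 claim above.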

15 Edit Sensitive Parsing (ESP): what did we achieve so far? By now the whole string has been arranged into pairs and triples. The important outcome is that two strings with small edit distance are parsed into "very close" arrangements (the parsing of each character depends only on an O(log*n) neighborhood).

16 Edit Sensitive Parsing (ESP): constructing the ESP tree. We now re-label each pair or triple. This can be done by building a dictionary (Karp-Miller-Rosenberg) or by hashing (Karp-Rabin, 1987): Hash(w[0…m-1]) = (integer value of w[0…m-1]) mod q, for some large prime q.
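A minimal sketch of such a hash (the base and modulus here are illustrative choices, not taken from the slides):

```python
def kr_hash(w, base=256, q=(1 << 61) - 1):
    """Karp-Rabin style fingerprint: the string read as a number in the given
    base, reduced mod a large prime q. Equal strings always hash equal;
    distinct strings collide only with small probability."""
    h = 0
    for ch in w:
        h = (h * base + ord(ch)) % q
    return h
```

In the ESP construction, each pair or triple of the parsing is hashed this way and the hash value becomes the node's label at the next level, so identical substrings receive identical labels.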

17 Edit Sensitive Parsing (ESP): constructing the ESP tree. In O(n log n) construction time we get a 2-3 tree:

18 How do we represent a 2-3 tree as a vector? The vector is the frequency of occurrence of each (level,label) pair: (0,a) (0,b) (0,c) (0,d) (0,e) (0,f) (0,g) (0,_) (1,2) (1,3) (1,6) (1,7) (1,8) (1,10) (1,12) (1,14) (1,16) (1,20) (1,21) (2,5) (2,7) (2,10) (2,13) (2,17) (2,20) (3,3) (3,15) (3,23) (4,10)
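Given the multisets of (level, label) pairs of two trees, the L1 distance between their vectors can be computed with a `Counter` (a sketch; the node lists are assumed to be given, e.g. by a tree traversal):

```python
from collections import Counter

def l1_distance(nodes_x, nodes_y):
    """V(X) counts how often each (level, label) pair occurs among the nodes
    of the ESP tree of X; the embedding distance is the L1 difference of the
    two count vectors (missing keys count as 0)."""
    vx, vy = Counter(nodes_x), Counter(nodes_y)
    return sum(abs(vx[k] - vy[k]) for k in vx.keys() | vy.keys())
```

For example, `l1_distance([(0, 'a'), (0, 'b'), (1, 2)], [(0, 'a'), (1, 3)])` is 3: the pairs (0,'b'), (1,2) and (1,3) each contribute 1.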

19 Proof of correctness. Theorem: (1/2)·d(X,Y) ≤ ||V(X) − V(Y)||₁ ≤ O(log n log*n)·d(X,Y)

20 Upper bound proof: ||V(X) − V(Y)||₁ ≤ O(log n log*n)·d(X,Y). Insert/change/delete a character: at most log*n + c nodes change per tree level. Move a substring: the only changes are at the fringes, that is, 4(log*n + c) nodes change per tree level. Since there are log n levels in the tree, we conclude: each operation changes V by O(log n log*n), and there are d(X,Y) operations.

21 Lower bound proof: d(X,Y) ≤ 2·||V(X) − V(Y)||₁. The idea is to transform X into Y using at most 2·||V(X) − V(Y)||₁ operations. We want to keep hold of large pieces of the string that are common to both X and Y, so we go through and protect enough pieces of X that will be needed in Y.

22 Lower bound proof (cont.) We avoid making any changes in the protected pieces. At the first level of the tree we add or remove characters as needed (if a character appears in Y but not in X, we add it to the end of X), so we get ||V_0(X) − V_0(Y)||₁ = 0. We proceed inductively up the tree: to make any node at level i, we need to move at most 2 nodes from level i−1 (we know that on level i−1 we have enough nodes, i.e. ||V_{i−1}(X) − V_{i−1}(Y)||₁ = 0).

23 Lower bound proof: example. The ESP trees of Y and X, side by side. Mark the protected pieces. Counter = 0.

24 Lower bound proof: example. Remove and add characters as needed. Counter = 1 (a deletion); Counter = 2 (an insertion).

25 Lower bound proof: example. Move to level 2, moving nodes in level 1 as needed. Counter = 3 (1 move).

26 Lower bound proof: example. Move to level 3, moving nodes in level 2 as needed (one node does not need to move). Counter = 5 (2 moves).

27 Lower bound proof: example. Counter = 5 (2 moves). That's it!

28 Lower bound proof (example with d(X,Y) = 1). Did we achieve what we wanted? 2·||V(Y) − V(X)||₁ = 2·(#red nodes + #green nodes) = 18, and we certainly created X from Y in fewer than 18 moves.

29 Application to string matching. To find D[i], we would need to compare P against all possible substrings of T; we can reduce this to O(n) of them. Since we need at least |r−m| operations to make T[1,r] the same length as P, we have d(T[1,m], P) ≤ d(T[1,m], T[1,r]) + d(T[1,r], P) ≤ |r−m| + d(T[1,r], P) ≤ 2·d(T[1,r], P). So we only need to consider the O(n) substrings of length m, and we still get a 2-approximation of the optimal matching!

30 Application to string matching – Final algorithm. Create ESP trees for T and P. Find ||V(T[1:m]) − V(P)||₁. Iteratively compute D[i] ≈ ||V(T[i:i+m−1]) − V(P)||₁. Overall we get O(n log n) time for the whole algorithm, and compute every D[i] up to a factor of O(log n log*n).
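The final loop can be sketched as follows (`esp_vector` is a hypothetical stand-in for the ESP embedding, a callable returning a Counter of (level, label) pairs; the real algorithm maintains the window's vector incrementally instead of re-embedding every window, which is what makes the O(n log n) bound possible):

```python
from collections import Counter

def approx_D(T, P, esp_vector):
    """Approximate D[i] by the L1 distance between the vector of the length-m
    window of T starting at i and the vector of P."""
    m = len(P)
    vp = esp_vector(P)
    D = []
    for i in range(len(T) - m + 1):
        vt = esp_vector(T[i:i + m])
        D.append(sum(abs(vt[k] - vp[k]) for k in vt.keys() | vp.keys()))
    return D
```

For illustration, plugging in `Counter` itself as a degenerate embedding (counting characters only, no tree levels), every length-3 window of "assasin" comes out at L1 distance 2 from "ssi".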

31 Application to string matching – Final algorithm example:
