Fast Algorithm for String Matching with k Mismatches, by Amihood Amir, Moshe Lewenstein, and Ely Porat. Journal of Algorithms, to appear, 2003/2004.

Presentation transcript:

Speakers: R 李宜益, R 何明彥, R 余宗恩
Advisor: Prof. 呂學一

General Case
Speaker: R 李宜益

Outline
- Introduction
- Problem Definition and Preliminaries
- Large and Small Alphabets
- General Alphabets

Introduction
- Two types of matching problems
  - Generalized matching problem
  - Approximate matching problem
- Previous research
  - Landau and Vishkin: O(nk)
  - Abrahamson: O(n√(m log m))

Introduction
- Complexity: O(n√(k log k))
- Contribution:
  - The fastest known algorithm for string matching with k mismatches.
  - Identifying and exploiting a technique that has been implicitly used in some recent papers: counting.

Problem Definition and Preliminaries
- Let a, b ∈ Σ. Define neq(a, b) = 1 if a ≠ b, and 0 otherwise.
- Let X = x_0 … x_{n-1} and Y = y_0 … y_{n-1} be two strings over alphabet Σ. Then the Hamming distance between X and Y, ham(X, Y), is defined as ham(X, Y) = Σ_{i=0}^{n-1} neq(x_i, y_i).

Problem Definition and Preliminaries
The String Matching with k Mismatches problem is defined as follows:
- Input: text T = t_0 … t_{n-1}, pattern P = p_0 … p_{m-1}, where t_i, p_j ∈ Σ, i = 0, …, n-1; j = 0, …, m-1, and a natural number k.
- Output: all pairs (i, ham(P, T^(i))), where i is a text location for which ham(P, T^(i)) ≤ k, and T^(i) = t_i t_{i+1} … t_{i+m-1}.
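As a reference point, the problem statement can be turned directly into a quadratic-time baseline. This is only a minimal sketch for checking faster methods against; the function names `ham` and `k_mismatch_naive` are illustrative and do not come from the slides.

```python
def ham(x, y):
    """Hamming distance between two equal-length strings."""
    if len(x) != len(y):
        raise ValueError("strings must have equal length")
    return sum(1 for a, b in zip(x, y) if a != b)


def k_mismatch_naive(text, pattern, k):
    """Return (i, ham(P, T^(i))) for every location i with at most k mismatches.

    Runs in O(nm) time: every window of the text is compared in full.
    """
    n, m = len(text), len(pattern)
    result = []
    for i in range(n - m + 1):
        d = ham(pattern, text[i:i + m])
        if d <= k:
            result.append((i, d))
    return result
```

Locations are 0-based, following the indexing t_0 … t_{n-1} of the problem definition.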

Large and Small Alphabets
- Large alphabets: the number of distinct symbols in the pattern exceeds 2k.
- Small alphabets: the number of distinct symbols in the pattern is less than […].

Large Alphabets (1)
Two stages:
- Marking stage: identify the potential starts of the pattern.
- Verification stage: verify which of the potential candidates is indeed a pattern occurrence.

Large Alphabets (2)
The Marking Stage
- Let {a_1, …, a_2k} be 2k different alphabet symbols appearing in the pattern, and let i_j be the smallest index in the pattern where a_j appears, j = 1, …, 2k.
(Figure: the symbols a_1, …, a_2k in the text, aligned with their first occurrences i_j in the pattern.)

Large Alphabets (3)
- M.1. For every text symbol t_i: if t_i = a_j, then mark text location i − i_j.
- M.2. Discard every text location that has fewer than k marks.
- Time: O(n)

Example
- Text: aabcabc
- Pattern: abc
- k: 1
(Figure: number of marks at each text location.)
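The marking stage can be sketched in a few lines. This is an illustrative version under stated assumptions: the function name `marking_stage` is hypothetical, the 2k symbols are simply the first 2k distinct symbols met in the pattern, and marks that fall outside the valid range of start locations are ignored.

```python
from collections import Counter


def marking_stage(text, pattern, k):
    """Return (candidates, marks): start locations holding at least k marks."""
    # first_index[a] is i_j, the smallest pattern index where symbol a appears.
    first_index = {}
    for idx, sym in enumerate(pattern):
        first_index.setdefault(sym, idx)
    if len(first_index) < 2 * k:
        raise ValueError("need at least 2k distinct pattern symbols")
    # Assumption: keep the first 2k distinct symbols (any 2k would do).
    chosen = dict(list(first_index.items())[:2 * k])

    marks = Counter()
    last_start = len(text) - len(pattern)
    for i, sym in enumerate(text):
        if sym in chosen:
            start = i - chosen[sym]          # step M.1: mark location i - i_j
            if 0 <= start <= last_start:
                marks[start] += 1
    # Step M.2: keep only locations with at least k marks.
    candidates = sorted(loc for loc, c in marks.items() if c >= k)
    return candidates, marks
```

On the slide's example (text aabcabc, pattern abc, k = 1) the chosen symbols are a and b, and the surviving candidates are locations 0, 1 and 4 (0-based). Location 0 is a false candidate with 2 mismatches, which is why a verification stage follows.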

Lemma 1
After the marking stage, there are at most n/k undiscarded locations.
Proof: each text symbol contributes at most one mark, so there are at most n marks in total, and every undiscarded location holds at least k of them.

Verification Stage
- Use a suffix tree and lowest common ancestor (LCA) queries to check whether a candidate location is a match with at most k mismatches.
- This takes O(k) time for each candidate.
- Total time: O((n/k) · k) = O(n)
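The O(k) verification works by jumping from mismatch to mismatch, with each jump answered by a longest-common-extension query. The sketch below shows that control flow only. As a loudly stated substitution, `lce_naive` compares characters one by one instead of using a suffix tree with LCA queries, so it does not achieve the O(k) bound; both function names are illustrative.

```python
def lce_naive(text, i, pattern, j):
    """Length of the longest common prefix of text[i:] and pattern[j:].

    Stand-in for the constant-time suffix tree + LCA query; this version
    takes time linear in the answer.
    """
    length = 0
    while (i + length < len(text) and j + length < len(pattern)
           and text[i + length] == pattern[j + length]):
        length += 1
    return length


def verify_candidate(text, pattern, start, k):
    """Return the mismatch count at `start` if it is at most k, else None."""
    m = len(pattern)
    if start < 0 or start + m > len(text):
        return None
    mismatches = 0
    j = 0                                   # current position in the pattern
    while j < m:
        j += lce_naive(text, start + j, pattern, j)   # jump over a matching run
        if j >= m:
            break
        mismatches += 1                     # position j is a mismatch
        if mismatches > k:
            return None                     # give up after k + 1 mismatches
        j += 1                              # step past the mismatch
    return mismatches
```

At most k + 1 extension queries are issued per candidate, which is where the O(k) per-candidate cost comes from once each query is constant time.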

Small Alphabets
- Use convolutions, as introduced by Fischer and Paterson.
- Define: for a string S = s_0 … s_{n-1}, S^R is the reverse of the string, s_{n-1} … s_0.

Example
(Figure: a text T, a pattern P, and K.)

Time complexity
- Each multiplication takes O(n log m) time using the FFT.
- We do |Σ| multiplications.
- The problem can be solved in O(|Σ| n log m) time.
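The per-symbol convolution idea can be sketched with a small hand-written FFT so that the example needs no external library. For each symbol, the 0/1 indicator vector of the text is multiplied with the indicator vector of the reversed pattern P^R, and the products are summed to count matches. The names `fft`, `multiply` and `mismatches_by_convolution` are illustrative, and this version multiplies over the whole text at once, so each product costs O(n log n) rather than O(n log m).

```python
import cmath


def fft(a, invert=False):
    """Recursive radix-2 FFT; len(a) must be a power of two."""
    n = len(a)
    if n == 1:
        return list(a)
    even = fft(a[0::2], invert)
    odd = fft(a[1::2], invert)
    sign = 1 if invert else -1
    out = [0j] * n
    for t in range(n // 2):
        w = cmath.exp(sign * 2j * cmath.pi * t / n) * odd[t]
        out[t] = even[t] + w
        out[t + n // 2] = even[t] - w
    return out


def multiply(x, y):
    """Polynomial product of two integer sequences via FFT."""
    size = 1
    while size < len(x) + len(y) - 1:
        size *= 2
    fx = fft([complex(v) for v in x] + [0j] * (size - len(x)))
    fy = fft([complex(v) for v in y] + [0j] * (size - len(y)))
    prod = fft([p * q for p, q in zip(fx, fy)], invert=True)
    # The inverse transform above is unnormalized, so divide by size here.
    return [round(v.real / size) for v in prod[:len(x) + len(y) - 1]]


def mismatches_by_convolution(text, pattern):
    """Mismatch count at every text location, one multiplication per symbol."""
    n, m = len(text), len(pattern)
    matches = [0] * (n - m + 1)
    for sym in set(pattern):
        t_ind = [1 if c == sym else 0 for c in text]
        p_rev = [1 if c == sym else 0 for c in reversed(pattern)]  # P^R
        conv = multiply(t_ind, p_rev)
        for i in range(n - m + 1):
            # conv[i + m - 1] counts positions j with t_{i+j} = p_j = sym.
            matches[i] += conv[i + m - 1]
    return [m - c for c in matches]
```

Reversing the pattern turns the sliding dot product into an ordinary polynomial product, which is what makes the FFT applicable.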

General Alphabets
- Cases in which the size of the pattern alphabet is between 2[…] and 2k.
- Definition: a symbol that appears in the pattern at least 2[…] times is called frequent. A symbol that is not frequent is called rare.

Many Frequent Symbols
- More than […] frequent symbols.
- Lemma 2: Let […] be frequent symbols. Then there exist in the text at most […] locations where there is a pattern occurrence with no more than k errors.
- Proof: Choose 2[…] occurrences of every frequent symbol in the pattern and call them relevant occurrences. The total number of marks is at most […]. There are at most […].

Finding the Potential Locations
- Example: k = 2
(Figure: the frequent symbols highlighted in the strings aabcabc and aabcabd.)

Finding the Potential Locations
- A don't-care symbol matches any symbol.
- Using the "less-than matching with don't cares" problem proposed by Amir et al., this can be done in O( ).
(Figure: the strings aabcabc and aabcabd with don't cares.)

The Verification Stage
- By Lemma 2, we have at most […] candidates.
- Use a suffix tree and lowest common ancestor (LCA) queries to check whether a candidate location is a match with at most k mismatches.
- This takes O(k) time for each candidate.
- Total time: O(n + ) = O( )

Few Frequent Symbols
- Use the convolutions described in "Small Alphabets" to deal with the frequent symbols; this takes O( ) time.
- Then replace all frequent symbols in P by don't cares.
- Case 1: the remaining symbols, counting all their occurrences, number fewer than 2k.
- Case 2: the remaining symbols, counting all their occurrences, number at least 2k.

Case 1
- Using the algorithm of "Pattern Matching with Swaps" by Amir et al., this can be done in O( ).
- Total time complexity: O( )

Case 2
- Choose any 2k symbols.
- The # of chosen symbols does not exceed […].
- Use the previous method, "finding the potential positions".
- We have at most O( ) potential positions, and verifying each location takes O(k).
- Total time complexity: O( )

Introduction to Break
Speaker: R 何明彥

Outline
- Assumption
- Periodicity
- Break
- Counting Argument
- P has 2k disjoint k-breaks
- P has 2k disjoint l-breaks
- Local matches


Assumption (1/2)
- Text T: |T| = n
- Pattern P: |P| = m
- Assume n = 2m
(Figure: T and P.)

Assumption (2/2)
- Therefore, split the text into overlapping substrings of length 2m, one starting every m positions.
- Every pattern occurrence appears in some substring.
- An algorithm running in time […] on 2m-length substrings of the text yields an algorithm of […].
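The reduction to texts of length 2m can be sketched as follows. The assumption made here is that chunks start at every multiple of m, so an occurrence starting at i lies entirely inside the chunk starting at ⌊i/m⌋·m. All names are illustrative, and `naive_solver` is only a placeholder for whatever algorithm handles a single chunk.

```python
def overlapping_chunks(text, m):
    """Yield (offset, chunk): chunks of length <= 2m starting every m positions."""
    n = len(text)
    for offset in range(0, max(n - m + 1, 1), m):
        yield offset, text[offset:offset + 2 * m]


def naive_solver(chunk, pattern, k):
    """Placeholder per-chunk solver: (location, mismatches) pairs with <= k mismatches."""
    m = len(pattern)
    out = []
    for i in range(len(chunk) - m + 1):
        d = sum(a != b for a, b in zip(chunk[i:i + m], pattern))
        if d <= k:
            out.append((i, d))
    return out


def occurrences_via_chunks(text, pattern, k, solver=naive_solver):
    """Run the solver on each chunk and merge results, shifted by the chunk offset."""
    found = {}
    for offset, chunk in overlapping_chunks(text, len(pattern)):
        for i, d in solver(chunk, pattern, k):
            found[offset + i] = d            # a dict removes duplicates at chunk borders
    return sorted(found.items())
```

There are about n/m chunks, so a per-chunk cost that depends only on m and k is multiplied by n/m overall.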


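Periodicity (1/2)
Def:
- A string S[1..n] is periodic if there is an index i, 1 < i ≤ n/2 + 1, such that S[j] = S[i+j−1] for all 1 ≤ j ≤ n−i+1.
- Equivalently, S is periodic if S = u^j v, where j ≥ 2 and v is a prefix of u; otherwise S is aperiodic.
- Ex: ABCABCAB is periodic; ABCDABC is aperiodic.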
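One way to test this definition is through the smallest period of the string, computed here with the KMP failure function. A string is then periodic exactly when its smallest period is at most half its length. This is a sketch; the function names are illustrative.

```python
def smallest_period(s):
    """Smallest p >= 1 such that s[i] == s[i + p] for every valid i."""
    n = len(s)
    if n == 0:
        return 0
    fail = [0] * n            # fail[i]: longest proper border of s[:i + 1]
    j = 0
    for i in range(1, n):
        while j > 0 and s[i] != s[j]:
            j = fail[j - 1]
        if s[i] == s[j]:
            j += 1
        fail[i] = j
    return n - fail[-1]


def is_periodic(s):
    """True if s = u^j v with j >= 2 and v a prefix of u."""
    return len(s) >= 2 and 2 * smallest_period(s) <= len(s)
```

On the slide's examples, ABCABCAB has smallest period 3 (at most 8/2), while ABCDABC has smallest period 4 (more than 7/2).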

Periodicity (2/2)
- If P is periodic with a short period, it is quite simple to come up with a quick algorithm for string matching with k mismatches.
(Figure: T and P.)


Break (1/4)
Def:
- A break of a string S is an aperiodic substring of S.
- An l-break is a break of length l.
(Figure: a string divided into periodic stretches and aperiodic l-breaks.)
- A large number of breaks is useful for a fast algorithm for string matching with k mismatches.

Break (2/4)
Lemma 3: Let P be a pattern with 2k disjoint l-breaks and let T be a text. In each match of P in T, at least k of the l-breaks match exactly.

Break (3/4)
Proof:
- There are at most k mismatches in a match, and P has 2k disjoint l-breaks.
- Since at most k of the breaks do not match exactly, at least k must match exactly.

Break (4/4)
Lemma 4: Let P be an m-length pattern with fewer than 2k l-breaks, and let T be a text of length 2m. Then all matches of P in T lie in a substring of T that has at most O(k) l-breaks.
Proved in Section 6 of Cole and Hariharan, "Approximate string matching: a simple faster algorithm".


Counting Arguments (1/3)
Theorem 1: Let P be a pattern with 2k disjoint k-breaks. Then every k contiguous locations in T contain at most 4 matches of the pattern.

Counting Arguments (2/3)
Proof:
(Figure: the k-break ABCDABC of P aligned against the text ABCDABCDABC within k contiguous locations.)

Counting Arguments (3/3)
- For k contiguous locations in T, the overall number of exact matches of the k-breaks is at most 4k.
- This means that at most 4 locations have k k-breaks with an exact match in their respective locations.
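The arithmetic behind the bound of 4 can be written out step by step. This is a sketch of the reasoning: the symbols occ_W(B), for the number of exact occurrences of a break B that can mark a location in a window W of k contiguous locations, and marks(W) are notation introduced here, not taken from the slides.

```latex
\begin{align*}
\text{$B$ aperiodic, } |B| = k
  &\;\Rightarrow\; \text{two occurrences of $B$ in $T$ start more than $k/2$ apart} \\
  &\;\Rightarrow\; \mathrm{occ}_W(B) \le 2 \\
\mathrm{marks}(W) &\le \sum_{j=1}^{2k} \mathrm{occ}_W(B_j) \le 2 \cdot 2k = 4k \\
\#\{\text{matches in } W\} &\le \frac{\mathrm{marks}(W)}{k} \le \frac{4k}{k} = 4
\end{align*}
```

The first implication holds because two occurrences at distance d ≤ k/2 would give B a period of length d, making it periodic. The last line uses Lemma 3: every match collects at least k marks.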


P has 2k disjoint k-breaks (1/4)
Corollary 1: If P has 2k disjoint k-breaks, then there are at most O(n/k) matches of P in T. These matches can be found in O(n+m) time.

P has 2k disjoint k-breaks (2/4)
Proof:
- From Theorem 1, there are at most O(n/k) matches of P in T. Therefore, if we knew these locations in advance, verification would take O(k) per location.
- Next we describe a method of finding the candidate locations in O(n) time.

P has 2k disjoint k-breaks (3/4)
1. Find all exact matches of all breaks in the text.
2. For every such match, mark all text locations for a pattern occurrence appropriate for this break.
3. Discard every text location that has fewer than k marks.
Claims:
1) There are O(n) exact matches of breaks, and they can be found in linear time.
2) There is a total of O(n) marks.

P has 2k disjoint k-breaks (4/4)
1) Each distinct k-break can appear at most 2n/k times in the text, and there are 2k k-breaks, so the # of all k-break occurrences in the text does not exceed 4n. The total length of all k-breaks is ≤ m. All exact matches of all k-breaks in the text can be found in O(n+m) time.
2) There are l distinct breaks, appearing a_1, …, a_l times respectively in the pattern. The total # of appearances of each distinct k-break in the text does not exceed 2n/k, so the total # of marks is at most (a_1 + … + a_l) · 2n/k = 2k · 2n/k = 4n.


P has 2k disjoint l-breaks (1/7)
- The pattern does not always contain 2k k-breaks. Nevertheless, there may be an l such that there are 2k l-breaks.
- Finding all matches as in Corollary 1 may then be too costly.
- To circumvent this problem, rather than searching for all matches, we need a way to seek local matches.

P has 2k disjoint l-breaks (2/7)
Lemma 5: Let P be a pattern with 2k disjoint l-breaks and let T be a text of size n. We can preprocess T in O(n) time such that, given l contiguous text locations, we can identify the at most 4 locations where P matches in O(k log k) time.

P has 2k disjoint l-breaks (3/7)
Proof:
- S = {B_1, …, B_2k}: the set of 2k l-breaks of P.
- S' = {B'_1, …, B'_f}: the maximal subset of distinct l-breaks of S.
- S' can be found in time […] by constructing a trie of the strings in S.

P has 2k disjoint l-breaks (4/7)
- Since each break in S' is distinct, the overall number of exact matches of l-breaks of S' in T is bounded by n.
- These exact matches can be found in […].

P has 2k disjoint l-breaks (5/7)
- A[1..n]: an array of length n, corresponding to the n locations of the text.
- A[i] is the index of the l-break of S' that matches at location i of T.
- Partition: k pieces of size k.

P has 2k disjoint l-breaks (6/7)
- For each piece of size k and each break B'_j in S', create a balanced binary search tree whose leaves correspond to the locations containing j in this piece.
- The # of trees is […].

P has 2k disjoint l-breaks (7/7)
- The size of each tree is O(1) + O(# of leaves).
- The leaves of all trees correspond to all exact matches of the l-breaks of S' in T.
- Since there are at most n such exact matches, the overall size of the trees is O(n). The trees can be constructed in O(n) time.


Local matches O(k log k) (1/3)
- Lemma 3 shows that a match of the pattern dictates that k of the pattern's l-breaks, B_1, …, B_k, match exactly in T at the appropriate shift.
- Since at most 2 exact matches of B_i can appear in l contiguous locations, we can find them by using exactly one of the BSTs.

Local matches O(k log k) (2/3)
- The BST is a balanced binary search tree; therefore finding the 2 exact matches takes O(log k) time.
- Finding the exact matches of all B_i and marking potential matches takes O(k log k) time, with at most 4k marks.

Local matches O(k log k) (3/3)
- Since only 4 out of the l locations for potential matches can have k marks, the pattern can match at no more than 4 locations.
- These 4 potential locations can be verified for a match in O(k) time.
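The local-match query can be sketched with sorted occurrence lists and binary search. Two substitutions are made and should be read as assumptions: Python's `bisect` on a sorted list plays the role of the balanced binary search trees, and the occurrence lists are built with repeated `str.find`, which is not the linear-time preprocessing of Lemma 5. The names and the `(pattern_offset, break_string)` representation of breaks are hypothetical.

```python
from bisect import bisect_left
from collections import Counter, defaultdict


def index_breaks(text, breaks):
    """Map each distinct break string to the sorted list of its exact occurrences."""
    occ = defaultdict(list)
    for b in set(s for _, s in breaks):
        pos = text.find(b)
        while pos != -1:
            occ[b].append(pos)
            pos = text.find(b, pos + 1)
    return occ


def local_candidates(occ, breaks, lo, width, k):
    """Pattern start locations in [lo, lo + width) that receive at least k marks."""
    marks = Counter()
    for offset, b in breaks:
        positions = occ.get(b, [])
        # Binary search for the first occurrence that can mark a start >= lo.
        first = bisect_left(positions, lo + offset)
        for pos in positions[first:]:
            if pos >= lo + offset + width:
                break
            marks[pos - offset] += 1         # this occurrence votes for one start
    return sorted(c for c, cnt in marks.items() if cnt >= k)
```

For example, with pattern abcd split into the 2-breaks ab (offset 0) and cd (offset 2) and k = 1, the text xxabcdxabzd yields candidate 2 for the window starting at 2 and candidate 7 for the window starting at 6. Each surviving candidate still needs the O(k) verification.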

Improved Algorithm
Speaker: R 余宗恩

Schema of the paper
- Discuss the frequency of symbols; divide the string matching problem into marking and verification stages: O( )
- Faster algorithm for small k
- Concept of "break" handling
- The algorithm: improve to […]

The Algorithm
- Goal: an efficient algorithm for the general case.
- Time complexity: O( )
- Additional techniques required:
  - l-boundary
  - Dominating pattern
  - Overlapping

Chance to Improve
- Can we find an optimal length of break? l-boundary.
- Can we make use of the repetition? Dominating pattern.

l-boundary
- An appropriate length l, such that:
  - # of (l−1)-breaks ≥ 2k
  - # of l-breaks ≤ 2k
- Ex. aadccbcbcacd
  - # of 2-breaks = 3
  - # of 3-breaks = 2

Property of l-boundary
- By Theorem 1: P has 2k l-breaks => at most n/k non-discarded locations.
- # of (l−1)-breaks ≥ 2k => at most O(n/k) non-discarded locations.

Property of l-boundary
- By Lemma 4: if P has ≤ 2k l-breaks, then all matches lie in a substring of T, and that substring has O(k) l-breaks.
- P has ≤ 2k l-breaks => both P and T have O(k) l-breaks.

How to find the l-boundary?
- 1 ≤ l ≤ k
- Use binary search: O(log k) steps.
- Complexity: O(m log k)

Improve by Periodicity
- Idea: if there are many occurrences of a repetition, we can make use of the earlier matching results.
- T: abcabcabda
- P: ab
- The # of mismatches at locations 1, 2, 3 equals that at locations 4, 5, 6 respectively.
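The reuse idea on this slide can be sketched as follows: inside a stretch of the text that repeats with period length |w|, the window at location i is identical to the window |w| positions earlier, so its mismatch count can be copied instead of recomputed. The parameter `periodic_until` (the end of the periodic prefix of the text) and the function names are assumptions made for this illustration.

```python
def count_at(text, pattern, i):
    """Mismatches between pattern and text[i:i + m], computed directly."""
    return sum(a != b for a, b in zip(text[i:i + len(pattern)], pattern))


def counts_with_reuse(text, pattern, w_len, periodic_until):
    """Mismatch counts per location, copying results inside the periodic prefix.

    Assumes text[:periodic_until] has period w_len.
    """
    m = len(pattern)
    counts = []
    for i in range(len(text) - m + 1):
        if i >= w_len and i + m <= periodic_until:
            counts.append(counts[i - w_len])   # same window as w_len positions earlier
        else:
            counts.append(count_at(text, pattern, i))
    return counts


# Slide example: the prefix abcabcab (length 8) has period abc (length 3).
counts = counts_with_reuse("abcabcabda", "ab", 3, 8)
```

With 0-based locations, `counts[0:3]` and `counts[3:6]` agree, matching the slide's statement about locations 1, 2, 3 and 4, 5, 6.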

Definition
- w: a string of length at most l/2
- w*: the infinite string w w w …
- w*_{2l}: the prefix of w* of length 2l
- A string s of length l has period w => s is a substring of w*_{2l}.
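This definition translates almost directly into code. A minimal sketch, with illustrative function names:

```python
def periodic_prefix(w, length):
    """Prefix of the infinite string w* with the given length."""
    reps = length // len(w) + 1
    return (w * reps)[:length]


def has_period(s, w):
    """True if s (of length l) is a substring of the 2l-length prefix of w*.

    Requires |w| <= l/2, as in the definition.
    """
    l = len(s)
    if not w or 2 * len(w) > l:
        return False
    return s in periodic_prefix(w, 2 * l)
```

Taking a prefix of length 2l is enough because any rotation of the period can be found starting within the first |w| ≤ l/2 positions, leaving room for all l characters of s.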

Definition (cont'd)
- l-segment: divide the pattern and the text into strings of length l.
- Bad l-segment: an l-segment that
  - does not have period w, or
  - has period w but intersects a break.

Definition (cont'd)
- Pattern P is a dominating pattern => P has at most 4k l-segments that do not have period w.
- Text location i is overlapping:
(Figure: P aligned at text location i.)

Algorithm
- Case 1: P is a dominating pattern.
- Case 2: P is not a dominating pattern.

P is a dominating pattern
- Case 1: Text location i is overlapping.
(Figure: P aligned at text location i.)

P is a dominating pattern
- Case 2: Text location i is not overlapping.
(Figure: P aligned at a non-overlapping location.)

- A bad l-segment in P won't match a bad l-segment in T.
- A bad l-segment in T won't match a bad l-segment in P.
- # of mismatches at location i = # of mismatches at location i − |w| (assuming the location is not overlapping).

Algorithm
(Figure: example with the strings ABABABABCDAAABAB and ABABCDAB.)

Algorithm
1. Find all matches of P in T at overlapping locations.
2. For each bad l-segment B, do pattern matching with mismatches, with pattern B and text w*_{2l}.
3. Do pattern matching with mismatches, with pattern w and text w*_{2l}.
4. Compute the # of mismatches of P at the first |w| locations of T, using steps 2 and 3.
5. i <- […]
6. While the end of the text is not reached:
   6a. If i is not an overlapping location:
       6aa. # of mismatches at location i <- # of mismatches at location i − |w|
       6ab. i <- i + 1
   6b. Else, if j is the next non-overlapping location:
       6ba. For each bad l-segment participating in an overlap in the overlapping locations i to j, update the # of mismatches it accrues in the next locations.
       6bb. i <- j
Time: O( ), O(n)

Complexity analysis
Lemma 6: P has at most 8k bad l-segments.
1. l-segments that do not have period w: ≤ 4k
2. l-segments that have period w but intersect breaks: ≤ 4k
Total: ≤ 8k

Complexity Analysis
- Total: O(n + m log k + )
- O(n): segmentation
- O(m log k): finding the l-boundary
- O( ): finding the # of mismatches

P is a non-dominating pattern
- Algorithm 1: adjust the algorithm used above. O( )
- Algorithm 2: find candidates more effectively. O( )

Algorithm 1
- Find a substring S of P such that S has at most 2k bad l-segments.
- Do pattern matching with k mismatches, with pattern S and text T.
- Verify the locations that have at most k mismatches.
- Complexity: O( )

Algorithm 2
Idea: if location i is a match, then
- at most k l-segments of S are matched against l-segments of T that do not have period w;
- at most k l-segments of T are matched against l-segments of S that do not have period w.
The locations satisfying both conditions can be found using convolution: O(n log m) = O(n log k).

Complexity analysis
- Total: O( )
- Convolution: O( )
- Verification: O( )