Backward Nondeterministic DAWG Matching Algorithm


Backward Nondeterministic DAWG Matching Algorithm
A Bit-parallel Approach to Suffix Automata: Fast Extended String Matching, Navarro, G. and Raffinot, M., Lecture Notes in Computer Science, Vol. 1448, 1998, pp. 14-33
Advisor: Prof. R. C. T. Lee
Speaker: L. C. Chen

Problem Definition:
Input: A text T and a pattern P.
Output: All the positions in T where P occurs.

This algorithm uses Rule 1, the Suffix to Prefix Rule: for a window to have any chance of matching the pattern, some suffix of the window must be equal to a prefix of the pattern.

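Find the longest suffix U of the window which is equal to some prefix of P. Then slide the pattern to the right until its prefix U is aligned with that suffix of the window, that is, shift by m - |U|, where m is the length of P.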

Example
T = GCATCGACAGACTATACAGTACG
P =    GACGGATCA
The window is “TCGACAGAC”. Since the longest suffix of the window which is equal to a prefix of P is “GAC”, we slide the window by 9 - 3 = 6.
T = GCATCGACAGACTATACAGTACG
P =          GACGGATCA
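The shift in this example can be checked with a minimal sketch. It compares suffixes and prefixes directly rather than using the bit-parallel method introduced later, and the function name longest_suffix_prefix is chosen here for illustration only. Only suffixes shorter than the pattern are considered, since a full match of the window would be reported as an occurrence instead.

```python
def longest_suffix_prefix(window: str, pattern: str) -> str:
    """Return the longest proper suffix of `window` that equals a prefix of `pattern`.

    Naive check, for illustration only: try the longest candidate first.
    """
    longest = min(len(window), len(pattern) - 1)
    for k in range(longest, 0, -1):
        if window[len(window) - k:] == pattern[:k]:
            return pattern[:k]
    return ""


text = "GCATCGACAGACTATACAGTACG"
pattern = "GACGGATCA"
start = 3  # 0-based start of the window "TCGACAGAC"
window = text[start:start + len(pattern)]

u = longest_suffix_prefix(window, pattern)
shift = len(pattern) - len(u)  # m - |U|
print(window, u, shift)  # prints: TCGACAGAC GAC 6
```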

We now give an example to show how this algorithm finds the longest suffix of the window which is equal to a prefix of P.

Example:
Text : ABDDCCDBADEGGGGJJ
Pattern : BADADCEAD
The window is “BDDCCDBAD”. We want to find the longest suffix of “BDDCCDBAD” which is also a prefix of the pattern. We read the window from right to left. Positions in the pattern are counted from 1.

First, we read “D”. We find all the substrings “D” in the pattern: they are at positions 3, 5 and 9.

We read the next character “A”. We check whether the character to the left of each substring “D” is “A” or not. All three are, so we find all the substrings “AD” in the pattern: they start at positions 2, 4 and 8.

We read the next character “B”. We check whether the character to the left of each substring “AD” is “B” or not. Only the “AD” at position 2 is preceded by “B”, so the substring “BAD” is in the pattern, starting at position 1. Note that “BAD” is also a prefix of P.

We read the next character “D”. There is no character to the left of the substring “BAD” in the pattern, so “DBAD” is not a substring of the pattern. We report that “BAD” is the longest suffix of “BDDCCDBAD” which is equal to a prefix of P.
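The right-to-left scan above can be sketched by keeping an explicit set of positions where the part of the window read so far occurs in the pattern. This is an illustrative sketch, not the bit-parallel implementation; the function name scan_window_backward is hypothetical and positions are 0-based in the code.

```python
def scan_window_backward(window: str, pattern: str) -> str:
    """Longest suffix of `window` that is also a prefix of `pattern`.

    `starts` holds every 0-based index p such that the suffix read so far
    occurs in `pattern` starting at p.
    """
    m = len(pattern)
    starts = set(range(m + 1))  # the empty suffix "occurs" everywhere
    best = 0                    # length of the longest suffix that is a prefix
    for k, ch in enumerate(reversed(window), start=1):
        # Extend every occurrence one character to the left.
        starts = {p - 1 for p in starts if p >= 1 and pattern[p - 1] == ch}
        if not starts:
            break               # the suffix read so far is not in the pattern
        if 0 in starts:
            best = k            # the suffix read so far is a prefix of the pattern
    return window[len(window) - best:]


print(scan_window_backward("BDDCCDBAD", "BADADCEAD"))        # prints: BAD
print(repr(scan_window_backward("BDDCCDDAD", "ACDADCEAD")))  # prints: ''
```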

Another example:
Text : ABDDCCDDADEGGGGJJ
Pattern : ACDADCEAD
The window is “BDDCCDDAD”. We want to find the longest suffix of “BDDCCDDAD” which is also a prefix of the pattern. Positions in the pattern are counted from 1.

First, we find all the substrings “D” in the pattern: they are at positions 3, 5 and 9.

Then we find all the substrings “AD” in the pattern. The “D” at position 3 is preceded by “C” (mismatch); the other two are preceded by “A”, so “AD” starts at positions 4 and 8.

Then we find all the substrings “DAD” in the pattern. The “AD” at position 8 is preceded by “E” (mismatch); the “AD” at position 4 is preceded by “D”, so “DAD” starts at position 3.

Then we look for the substrings “DDAD” in the pattern. The “DAD” at position 3 is preceded by “C” (mismatch), so there is no substring “DDAD” in the pattern. None of the substrings found (“D”, “AD”, “DAD”) is a prefix of P, so no suffix of “BDDCCDDAD” is equal to a prefix of P.

The idea explained above is the main idea of this algorithm. Next, we use the bit-parallel method to implement it.

We use bits to store the positions of each character in P.
Example: P: CABBCAD
For character “A”, we store A: 0100010
For character “B”, we store B: 0011000
For character “C”, we store C: 1000100
For character “D”, we store D: 0000001
For the characters that do not exist in P, we store *: 0000000
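The table of bit vectors can be built as in the following sketch. It assumes each vector is stored as an integer whose most significant bit corresponds to the first character of P, so that the binary string matches the layout used above. The name build_masks is hypothetical.

```python
def build_masks(pattern: str) -> dict:
    """Map each character of `pattern` to a bit vector of its positions.

    The leftmost character of the pattern corresponds to the leftmost
    (most significant) bit of an m-bit integer.
    """
    m = len(pattern)
    masks = {}
    for i, ch in enumerate(pattern):
        masks[ch] = masks.get(ch, 0) | (1 << (m - 1 - i))
    return masks


pattern = "CABBCAD"
masks = build_masks(pattern)
for ch in "ABCD":
    print(ch, format(masks[ch], "07b"))
# Characters that do not occur in the pattern get the all-zero vector.
print("*", format(masks.get("E", 0), "07b"))
```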

Here, we explain how to use bit-parallelism to find the substrings of the pattern which are equal to a suffix of the window.
Text: ABCABCABA, Σ={A,B,C,D}
Pattern: CABBCAD
A: 0100010
B: 0011000
C: 1000100
D: 0000001
other: 0000000
We use a state mask D, initially 1111111, to record where the part of the window read so far occurs in the pattern. The state mask D is not the bit vector of the character “D”. “<<1” means left shift by one bit: the rightmost bit is filled with 0 and only 7 bits are kept. The window is “ABCABCA” and we read it from right to left.

We read “A”.
D: 1111111
And A: 0100010
Result: 0100010
Where there is a “1”, there is a substring “A” in the pattern. We set D = 0100010<<1 = 1000100.

We read “C”.
D: 1000100
And C: 1000100
Result: 1000100
The leftmost bit is 1, so we know “CA” is a suffix of the window which is equal to a prefix of the pattern. We set D = 1000100<<1 = 0001000.

We read “B”.
D: 0001000
And B: 0011000
Result: 0001000
We know “BCA” is a substring of the pattern. The leftmost bit is 0, so “BCA” is not a prefix of the pattern. We set D = 0001000<<1 = 0010000.

We read “A”.
D: 0010000
And A: 0100010
Result: 0000000
There is no substring “ABCA” in the pattern, so we stop.

“CA” is the last suffix read for which the leftmost bit was 1. Therefore “CA” is the longest suffix of the window which is equal to a prefix of the pattern.
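The steps above can be reproduced with a short sketch that records the state mask after each AND. It is a sketch under the same conventions as the slides (leftmost bit = first pattern position, mask truncated to m bits after each shift); the name scan_window_bits and the returned trace are additions made here for illustration.

```python
def scan_window_bits(window: str, pattern: str):
    """Bit-parallel backward scan of one window.

    Returns (best, trace): `best` is the length of the longest suffix of the
    window that equals a prefix of the pattern, and `trace` lists the state
    mask after each AND as a binary string.
    """
    m = len(pattern)
    full = (1 << m) - 1   # m ones
    top = 1 << (m - 1)    # leftmost bit: the occurrence starts at position 1
    masks = {}
    for i, ch in enumerate(pattern):
        masks[ch] = masks.get(ch, 0) | (1 << (m - 1 - i))

    state = full
    best = 0
    trace = []
    for k, ch in enumerate(reversed(window), start=1):
        state &= masks.get(ch, 0)
        trace.append(format(state, f"0{m}b"))
        if state == 0:
            break                       # suffix read so far is not in the pattern
        if state & top:
            best = k                    # suffix read so far is a prefix
        state = (state << 1) & full     # "<<1", keeping only m bits
    return best, trace


best, trace = scan_window_bits("ABCABCA", "CABBCAD")
print(best, trace)
```

With best = 2 the window would be shifted by 7 - 2 = 5 positions.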

We take another example:
Text: ABCABCCBA, Σ={A,B,C,D}
Pattern: ACBCCBD

First, we build:
A: 1000000
B: 0010010
C: 0101100
D: 0000001
others: 0000000
The state mask starts as D: 1111111. The window is “ABCABCC” and we read it from right to left.

We read “C”.
D: 1111111
And C: 0101100
Result: 0101100
Where there is a “1”, there is a substring “C” in the pattern. We set D = 0101100<<1 = 1011000.

We read “C”.
D: 1011000
And C: 0101100
Result: 0001000
Where there is a “1”, there is a substring “CC” in the pattern. We set D = 0001000<<1 = 0010000.

We read “B”.
D: 0010000
And B: 0010010
Result: 0010000
Where there is a “1”, there is a substring “BCC” in the pattern. We set D = 0010000<<1 = 0100000.

We read “A”.
D: 0100000
And A: 1000000
Result: 0000000
There is no substring “ABCC” in the pattern. The leftmost bit was never 1, so no suffix of the window is equal to a prefix of the pattern.

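Time Complexity: If the length of the text is n and the length of the pattern is m, the time complexity of this algorithm is O(mn) in the worst case, because there are at most n - m + 1 windows and each window is scanned in at most m steps. This assumes that m does not exceed the machine word length, so that each bit-parallel step takes constant time.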

Reference
[BG92] A new approach to text searching, Baeza-Yates, R. and Gonnet, G. H., Communications of the ACM, Vol. 35, 1992, pp.74-82.
[BEH89] Average sizes of suffix trees and dawgs, Blumer, A., Ehrenfeucht, A. and Haussler, D., Discrete Applied Mathematics, Vol. 24, 1989, pp.37-45.
[BM77] A fast string searching algorithm, Boyer, R. S. and Moore, J. S., Communications of the ACM, Vol. 20, 1977, pp.762-772.
[GM98] A Bit-Parallel Approach to Suffix Automata: Fast Extended String Matching, Navarro, G. and Raffinot, M., In Proceedings of the 9th Annual Symposium on Combinatorial Pattern Matching, Lecture Notes in Computer Science 1448, Springer-Verlag, Berlin, 1998, pp.14-33.