Parallel String Matching Algorithm(s) Using Associative Processors Original work by Mary Esenwein and Dr. Johnnie Baker Presented by Shannon Steinfadt April 18, 2007
2 String Matching Problem / Aka. pattern matching or string searching / Useful in many applications such as text editing and information retrieval, DNA analysis, Homeland Security
3 What are we doing? / Given a pattern and some text, find out if the pattern is IN the text / Is pattern AB in the text ABAA? If so, where?
4 What’s the notation? / P is a pattern string of length m / T is a text string of length n, usually n ≥ m
5 Goal of String Matching / To find all occurrences of a pattern string in the text string / Locate all positions i in T such that T[i+j-1] = P[j] for all j, 1 ≤ j ≤ m / Why use P[j]? How does it relate to T[i+j-1]?
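The definition on this slide can be checked directly with a naive sequential matcher. The sketch below is only an illustration of the index arithmetic, not one of the ASC algorithms; the function name `naive_matches` is hypothetical. It keeps the slide's 1-based positions i and j and converts to Python's 0-based indexing inside the comparison.

```python
def naive_matches(text, pattern):
    """Return all 1-based positions i with T[i+j-1] == P[j] for 1 <= j <= m."""
    n, m = len(text), len(pattern)
    positions = []
    for i in range(1, n - m + 2):  # candidate start positions 1 .. n-m+1
        # Python strings are 0-based: T[i+j-1] is text[i+j-2], P[j] is pattern[j-1].
        if all(text[i + j - 2] == pattern[j - 1] for j in range(1, m + 1)):
            positions.append(i)
    return positions


print(naive_matches("ABAA", "AB"))          # [1]
print(naive_matches("ABBBABBBABA", "BBA"))  # [3, 7]
```

P[j] walks along the pattern while T[i+j-1] walks along the text in step with it, starting from the candidate position i.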
6 Pattern Variations / An exact pattern / A “Don’t Care” character (*) in pattern / Flexibility in matching / * indicates character(s) of the text that are irrelevant to the matching process
7 Characteristics of the General “Don’t Care” Character (*) / Single character of text / Multiple consecutive text characters / No characters / Combination of above three / Example: Pattern AB*CD could match ABBCD, ABBBBBCD, or ABCD (* is null)
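One way to experiment with these semantics is to translate the pattern into a regular expression, where each "*" becomes ".*" (zero or more arbitrary characters). This is a swapped-in sequential technique using Python's standard `re` module, not the ASC approach from the talk; the helper name `vldc_to_regex` is hypothetical.

```python
import re


def vldc_to_regex(pattern):
    """Compile a pattern with variable-length don't cares into a regular expression."""
    # Escape each literal segment, then join the segments with ".*".
    literal_segments = [re.escape(seg) for seg in pattern.split("*")]
    return re.compile(".*".join(literal_segments), re.DOTALL)


rx = vldc_to_regex("AB*CD")
for candidate in ["ABBCD", "ABBBBBCD", "ABCD", "ABD"]:
    # fullmatch requires the whole candidate string to fit the pattern.
    print(candidate, bool(rx.fullmatch(candidate)))
```

The first three candidates are the ones from the slide and are accepted; "ABD" is rejected because the literal segment CD is missing.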
8 String Matching using ASC / Three parallel algorithms using associative computing (using 1-D mesh) / String matching for exact match / String matching with fixed length “don’t care” / i.e., exactly 1 character / String matching with variable length “don’t care” / a “don’t care” can have any length or be null
9 ASC Exact Match Algorithm
for (j = patt_length - 1; j >= 0; j--) {
    Responders are text[$] == patt_string[j] and counter[$] == patt_counter;
    Responders add 1 to counter[$] and store result in counter[$] of preceding cell;
    patt_counter++;
}
/* When pattern has been processed */
Responders are counter[$] == patt_length;
Responders set match[$] = 1 in next cell;
Pattern: BBA   Text: ABBBABBBABA
m = pattern length, n = text length, j = pattern index, i = text index
patt_counter = 0, patt_length = 3

Text[$]  Match[$]  Counter[$]
@        0         0
A        0         0
B        0         0
B        0         0
B        0         0
A        0         0
B        0         0
B        0         0
B        0         0
A        0         0
B        0         0
A        0         0
Final State of Exact Match Algorithm
Pattern: BBA   Text: ABBBABBBABA
m = pattern length, n = text length, j = pattern index, i = text index

Text[$]  Match[$]  Counter[$]
@        0         1
A        0         0
B        0         3
B        1         2
B        0         1
A        0         0
B        0         3
B        1         2
B        0         1
A        0         2
B        0         1
A        0         0
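The final state above can be reproduced with a small sequential simulation of the exact match pseudocode. This is a sketch under stated assumptions: the cells are modelled as Python lists, cell 0 holds the "@" filler shown in the trace, and the name `asc_exact_match` is hypothetical. Each pass collects all responders before writing any counter, to mimic the lock-step behaviour of the associative processor.

```python
def asc_exact_match(text, pattern):
    """Sequentially simulate the ASC exact-match loop; returns (match, counter) per cell."""
    cells = "@" + text                 # cell 0 is the filler cell shown as '@' in the trace
    counter = [0] * len(cells)
    match = [0] * len(cells)
    patt_counter = 0
    for j in range(len(pattern) - 1, -1, -1):
        # Select every responder first, then write, to mimic one parallel step.
        responders = [c for c in range(len(cells))
                      if cells[c] == pattern[j] and counter[c] == patt_counter]
        for c in responders:
            if c > 0:                  # cell 0 has no preceding cell
                counter[c - 1] = patt_counter + 1
        patt_counter += 1
    for c in range(len(cells) - 1):
        if counter[c] == len(pattern):
            match[c + 1] = 1           # the match is flagged in the next cell
    return match, counter


match, counter = asc_exact_match("ABBBABBBABA", "BBA")
print(counter)  # [1, 0, 3, 2, 1, 0, 3, 2, 1, 2, 1, 0]
print(match)    # [0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0]
```

The loop body runs once per pattern character, which is where the O(m) running time of the ASC algorithm comes from; the list comprehension stands in for a constant-time associative search.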
13 Algorithm for unit length "don't cares" using ASC
for (j = patt_length - 1; j >= 0; j--) {
    if (pattern[j] == '*')
        Responders are counter[$] == patt_counter;
    else // pattern[j] is not the “don’t care” character
        Responders are text[$] == pattern[j] and counter[$] == patt_counter;
    If no Responders are detected, exit;
    Responders add 1 to counter[$] and store result in counter[$] of preceding cell;
    patt_counter++;
}
/* When pattern has been processed */
Responders are counter[$] == patt_length;
Responders set match[$] = 1 in next cell;
14 ASC Exact Match Algorithm (again)
for (j = patt_length - 1; j >= 0; j--) {
    Responders are text[$] == patt_string[j] and counter[$] == patt_counter;
    Responders add 1 to counter[$] and store result in counter[$] of preceding cell;
    patt_counter++;
}
/* When pattern has been processed */
Responders are counter[$] == patt_length;
Responders set match[$] = 1 in next cell;
Pattern: B*A   Text: ABBBABBBABA
m = pattern length, n = text length, j = pattern index, i = text index
patt_counter = 0, patt_length = 3

Text[$]  Match[$]  Counter[$]
@        0         0
A        0         0
B        0         0
B        0         0
B        0         0
A        0         0
B        0         0
B        0         0
B        0         0
A        0         0
B        0         0
A        0         0
Final State for Pattern B*A
Pattern: B*A   Text: ABBBABBBABA
m = pattern length, n = text length, j = pattern index, i = text index

Text[$]  Match[$]  Counter[$]
@        0         1
A        0         0
B        0         3
B        1         2
B        0         1
A        0         0
B        0         3
B        1         2
B        0         1
A        0         2
B        0         1
A        0         0
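The unit length "don't care" pseudocode can be simulated the same way. The sketch below assumes the same list-based cell model with an "@" filler in cell 0; the name `asc_unit_dont_care` is hypothetical. It returns the indices of the cells whose match flag would be set, and includes the early exit taken when a pass finds no responders.

```python
def asc_unit_dont_care(text, pattern, dont_care="*"):
    """Simulate the unit-length don't-care loop; returns cells where match[$] would be 1."""
    cells = "@" + text                 # cell 0 is the '@' filler cell
    counter = [0] * len(cells)
    patt_counter = 0
    for j in range(len(pattern) - 1, -1, -1):
        # A don't care drops the text comparison; only the counter test remains.
        responders = [c for c in range(len(cells))
                      if counter[c] == patt_counter
                      and (pattern[j] == dont_care or cells[c] == pattern[j])]
        if not responders:
            return []                  # early exit: no partial match can be extended
        for c in responders:
            if c > 0:                  # cell 0 has no preceding cell
                counter[c - 1] = patt_counter + 1
        patt_counter += 1
    # A cell holding counter == pattern length flags the match in the next cell.
    return [c + 1 for c in range(len(cells) - 1) if counter[c] == len(pattern)]


print(asc_unit_dont_care("ABBBABBBABA", "B*A"))  # [3, 7]
print(asc_unit_dont_care("ABBBABBBABA", "A*B"))  # [1, 5]
```

For this text, B*A flags the same cells as the exact pattern BBA, which is why the two final states coincide.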
18 VLDC Algorithm (added) / Works on each “segment” of the pattern broken up by the * character / AB*BB*A has three segments (AB, BB, A) / Consecutive * characters are not necessary and not allowed / This VLDC algorithm is unique / Provides information to find all continuation points of all matches following each “*”
19 VLDC ALGORITHM USING ASC
int patt_length = m;
int maxcell = n + 2;
/* Special handling for '*' at end of pattern */
if (pattern[m-1] == '*') {
    Responders are cell index > 1;
    Responders set segment$[0] = 1;
    patt_counter = 1;
    k = 1; /* Reset initial segment index */
}
while ((patt_length -= patt_counter) > 0 && maxcell > 0) {
    patt_counter = 0;
    for (I = patt_length - 1; I >= 0 && pattern[I] != '*'; I--) {
        Responders are text$ == pattern[I] and counter$ == patt_counter and cell index < maxcell;
        Responders add 1 to counter$ and store result in counter$ of preceding cell;
        patt_counter++;
    }
    Responders are counter$ == patt_counter;
20 VLDC continued
    Responders set segment$[k] = patt_counter in next cell;
    Responders are segment$[k] > 0;
    maxcell = maximum cell index value of Responders;
    else if no Responders, maxcell = 0;
    All cells become Responders and set counter$ = 0;
    patt_counter++;
    k++;
}
/* When pattern has been processed */
Responders are segment$[--k] > 0;
Responders set match$ = 1;
/* Special handling for '*' at start of pattern */
if (pattern[0] == '*') {
    Responders are cell index 1;
    Responders set match$ = 1;
}
Pattern: AB*BB*A Text: ABBBABBBABA
After third pattern segment in VLDC
[Trace table over the text cells. Columns: T$, M$, C$, S0$, S1$, S2$, Responder$. Scalars: Maxcell, Patt_counter.]
Pattern: AB*BB*A Text: ABBBABBBABA
After second pattern segment in VLDC
[Trace table over the text cells. Columns: T$, M$, Counter$, S0$, S1$, S2$, Responder$. Scalars: Maxcell, Patt_counter.]
Maxcell is used to keep pattern segments in order, i.e. AB occurs before BB.
Pattern: AB*BB*A Text: ABBBABBBABA
After first pattern segment in VLDC
[Trace table over the text cells. Columns: T$, M$, Counter$, S0$, S1$, S2$, Responder$. Scalars: Maxcell, Patt_counter.]
Maxcell is used to keep pattern segments in order, i.e. AB occurs before BB.
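Pattern: AB*BB*A Text: ABBBABBBABA
Final State in VLDC
[Trace table over the text cells. Columns: T$, M$, Counter$, S0$, S1$, S2$, Responder$. Scalars: Maxcell, Patt_counter. The two rows that remain legible are both A cells with M$ = 1, Counter$ = 0, S0$ = 1, S1$ = 0, S2$ = 2, Responder$ = Y.]
Maxcell is used to keep pattern segments in order, i.e. AB occurs before BB.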
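The VLDC trace can be reproduced with a sequential simulation of the pseudocode on slides 19 and 20. This is a sketch under several assumptions: cells are numbered from 1 with the "@" filler in cell 1 (so the text occupies cells 2 to n+1, consistent with maxcell = n + 2), patt_counter and k start at 0, and the special case for a "*" at the start of the pattern is left out because the slide does not show its comparison clearly. The names `asc_vldc` and `nonzero` are hypothetical.

```python
def asc_vldc(text, pattern):
    """Simulate the VLDC pseudocode; returns (match, segment) indexed by cell number."""
    m, n = len(pattern), len(text)
    cells = " @" + text               # index 0 unused, cell 1 is '@', text in cells 2..n+1
    size = n + 2
    counter = [0] * size
    segment = {}                      # segment[k][c] mirrors segment$[k] in cell c
    patt_length, maxcell = m, n + 2
    patt_counter, k = 0, 0            # assumed initial values
    if pattern[m - 1] == "*":         # special handling for '*' at end of pattern
        segment[0] = [1 if c > 1 else 0 for c in range(size)]
        patt_counter, k = 1, 1
    while True:
        patt_length -= patt_counter
        if patt_length <= 0 or maxcell <= 0:
            break
        patt_counter = 0
        i = patt_length - 1
        while i >= 0 and pattern[i] != "*":
            responders = [c for c in range(1, size)
                          if cells[c] == pattern[i]
                          and counter[c] == patt_counter and c < maxcell]
            for c in responders:
                if c > 1:             # cell 1 has no preceding cell
                    counter[c - 1] = patt_counter + 1
            patt_counter += 1
            i -= 1
        seg = [0] * size
        for c in range(1, size - 1):
            if counter[c] == patt_counter:
                seg[c + 1] = patt_counter   # segment length stored in the next cell
        segment[k] = seg
        hits = [c for c in range(1, size) if seg[c] > 0]
        maxcell = max(hits) if hits else 0
        counter = [0] * size
        patt_counter += 1             # also skips the '*' in front of this segment
        k += 1
    k -= 1
    match = [1 if v > 0 else 0 for v in segment[k]]
    return match, segment


def nonzero(values):
    """Cell indices holding a non-zero value."""
    return [c for c, v in enumerate(values) if v > 0]


match, segment = asc_vldc("ABBBABBBABA", "AB*BB*A")
print(nonzero(match))        # [2, 6]
print(nonzero(segment[0]))   # [2, 6, 10, 12]
print(nonzero(segment[1]))   # [3, 4, 7, 8]
print(nonzero(segment[2]))   # [2, 6]
```

Under these assumptions Maxcell takes the values 13, 12, 8 and 6 in turn, and the two cells with M$ = 1 carry S0$ = 1 and S2$ = 2, in agreement with the legible parts of the trace.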
25 Finding All Continuation Points / Match starts where M$ = 1 / Match to any pattern segment begins where S$[x] == segment length / i.e. where any S$[x] > 0 / A match continues at any entry of S$[x-1] that is > 0 and whose cell/PE index is >= (cell/PE index of the S$[x] entry + its segment size)
Pattern: AB*BB*A Text: ABBBABBBABA
Using the Final State in VLDC
[Trace table over the text cells A B B B A B B B A B A. Columns: T$, M$, C$, S0$, S1$, S2$.]
Start with index 2, where there’s a match (M$ = 1).
Work from S2$ down and left: count down 2 values and move into S1$, count down 2 values and move to S0$.
That produces:
    2 4 6    ABBBA
Any index >= 4 in S1[$] whose value is > 0 will also produce a correct match:
    2 7 10   ABBBABBBA
    2 8 10   ABBBABBBA
Some of the additional matches are:
    2 4 10   ABBBABBBA
    2 4 12   ABBBABBBABA
    2 8 12   ABBBABBBABA
    6 8 10   ABBBA
    6 8 12   ABBBABA
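The continuation rule can be turned into a short enumeration over the segment fields. In the sketch below, the S0$, S1$ and S2$ values are entered by hand as dictionaries from cell index to segment length; they are an assumption obtained by tracing the VLDC pseudocode for this pattern and text, and the function name `continuation_chains` is hypothetical.

```python
def continuation_chains(segments):
    """List every chain of cell indices, one per pattern segment, in pattern order.

    segments[x] maps cell index -> segment length for S$[x]; the highest x
    belongs to the first segment of the pattern, S$[0] to the last one.
    """
    def extend(x, min_index):
        for cell in sorted(segments[x]):
            if cell < min_index:
                continue              # would overlap the previous segment
            if x == 0:
                yield (cell,)
            else:
                # The next segment must start at or after cell + segment length.
                for rest in extend(x - 1, cell + segments[x][cell]):
                    yield (cell,) + rest
    return list(extend(len(segments) - 1, 0))


text = "ABBBABBBABA"
s0 = {2: 1, 6: 1, 10: 1, 12: 1}       # segment A
s1 = {3: 2, 4: 2, 7: 2, 8: 2}         # segment BB
s2 = {2: 2, 6: 2}                     # segment AB, also the cells with M$ = 1
for chain in continuation_chains([s0, s1, s2]):
    start, last = chain[0], chain[-1]
    # Cell c holds text character c - 2 (0-based) because cell 1 is the filler.
    print(chain, text[start - 2:last + s0[last] - 2])
```

With these values the enumeration yields nine chains: the eight listed on the slide plus 2 7 12.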
27 Existing Algorithms / Sequential Algorithms / Naïve algorithm: O(mn) / Knuth, Morris, & Pratt, or Boyer-Moore: O(m+n) / Parallel Algorithms / A PRAM exact string matching: O(n) / On a reconfigurable mesh: O(1) on n(n-m+1) PEs / On a SIMD hypercube (limited to {0,1}): O(lg n) on n/lg n PEs / On a neural network: O(1) on nm PEs / ASC algorithms: O(m) time on O(n) PEs
28 Question to consider / The “don’t care” character allows non-matching for an arbitrary length. This is discussed on slide 13. Instead, consider “*” to allow a non-match for two characters and make necessary changes in trace in Slide