Multiple Pattern Matching Revisited


1 Multiple Pattern Matching Revisited
Robert Susik (1), Szymon Grabowski (1), Kimmo Fredriksson (2)
(1) Lodz University of Technology, Institute of Applied Computer Science, Łódź, Poland
(2) University of Eastern Finland, School of Computing, Kuopio, Finland
PSC, Prague, Sept. 2014

2 Multiple pattern matching
The problem: report all positions i, 1 ≤ i ≤ n, of the text T[1..n] at which one of the r patterns P[1..m] occurs; the text and the patterns are over a common integer alphabet of size σ. Usage: antivirus scanning, intrusion detection, web searches, etc.
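For concreteness, here is a naive, quadratic-time reference solution to the stated problem (a minimal sketch; the text and the two example patterns are chosen here for illustration, not taken from the slides):

```cpp
// Naive multiple pattern matching: report every position i of T at which
// one of the r patterns occurs. Quadratic in the worst case; for reference only.
#include <iostream>
#include <string>
#include <vector>

int main() {
    std::string T = "gcatcgcagagat";                          // example text
    std::vector<std::string> patterns = {"gcaga", "agag"};    // r = 2 example patterns
    for (size_t i = 0; i < T.size(); ++i)
        for (const std::string& P : patterns)
            if (T.compare(i, P.size(), P) == 0)               // P occurs at position i
                std::cout << "pattern \"" << P << "\" at " << i << "\n";
}
```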

3 Related work
- Aho-Corasick (1975): works in linear time
- Commentz-Walter (1979): based on the Boyer-Moore (BM) algorithm, a suffix-based approach
- Fredriksson and Grabowski (2009): an average-optimal filtering variant of the classic AC algorithm
- Wu and Manber (1994): based on backward matching over a sliding text window
[Figures omitted: Aho-Corasick trie for he, she, his and hers; Commentz-Walter trie for he, she, his and hers; the Wu-Manber / Boyer-Moore approach. Images taken from: S.M. Vidanagamachchi, S.D. Dewasurendra, R.G. Ragel, M. Niranjan, "Commentz-Walter: Any Better than Aho-Corasick for Peptide Identification?", November 2012; Koloud Al-Khamaiseh, Int. Journal of Engineering Research and Applications, July 2014.]

4 Related work
- DAWG-match (Crochemore et al., 1999) and MultiBDM (Crochemore & Rytter, 1994): based on backward matching, linear in the worst case, complex
- Multi-BNDM (Navarro & Raffinot, 1998): a simplified bit-parallel version
- Set Backward Oracle Matching (Allauzen & Raffinot, 1999): similar to the above but simpler, and very efficient in practice
- Succinct Backward DAWG Matching (Fredriksson, 2003): practical for huge pattern sets thanks to a succinct index
- Faro & Külekci (2012): use of SSE technology, e.g. the wsfp (word-size fingerprint) operation to identify text blocks that may contain a matching pattern
- Salmela et al. (2006): tried an approach similar to ours (not very successful for short patterns in their tests)

5 Shift-Or (Baeza-Yates & Gonnet, 1992)
Shift-Or simulates a non-deterministic finite automaton (NFA) with bit-parallelism.
Bit-parallelism:
- frequently used in stringology when the results of single operations are booleans or small integers
- many (even w, the computer word size) operations can be performed in parallel
- reinvented several times, but BY-G (1992) is the best known

6 Shift-Or – at work
Example: T = gcatcgcagagat, P = gcaga
Preprocessing: B[] – one bit-vector per alphabet symbol (m bits each, σ·m bits in total); B[c] has a 0 at position j iff P[j] = c, positions written left to right for 0..m–1:
B[g] = 01101, B[c] = 10111, B[a] = 11010
Search:
V := ~0; i := 0
while i < n do
    V := (V << 1) | B[T[i]]
    if the (m–1)-th bit of V is 0 then report a match at position i
    i := i + 1
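A minimal, self-contained C++ sketch of the Shift-Or routine above (my own illustrative code, assuming m ≤ 64 so that the pattern fits in one machine word):

```cpp
#include <cstdint>
#include <iostream>
#include <string>
#include <vector>

// Shift-Or: B[c] has a 0-bit at position j iff P[j] == c; state V keeps a 0-bit
// at position j iff P[0..j] matches the text suffix ending at the current position.
std::vector<size_t> shift_or(const std::string& T, const std::string& P) {
    const size_t m = P.size();
    std::vector<size_t> matches;
    uint64_t B[256];
    for (int c = 0; c < 256; ++c) B[c] = ~0ULL;               // preprocessing
    for (size_t j = 0; j < m; ++j) B[(unsigned char)P[j]] &= ~(1ULL << j);
    uint64_t V = ~0ULL;                                       // search
    for (size_t i = 0; i < T.size(); ++i) {
        V = (V << 1) | B[(unsigned char)T[i]];
        if ((V & (1ULL << (m - 1))) == 0)                     // (m-1)-th bit is 0: match
            matches.push_back(i - m + 1);                     // 0-based start position
    }
    return matches;
}

int main() {
    for (size_t pos : shift_or("gcatcgcagagat", "gcaga"))
        std::cout << "match at " << pos << "\n";              // prints 5
}
```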

7 Shift-Or
Pros:
- fast: O(nm/w) time in the worst case
- when m = O(w), it is linear time
Cons:
- the average case is the same as the worst case, but faster methods are possible

8 Average Optimal Shift-Or (AOSO) (Fredriksson & Grabowski, 2005, 2009)
Motivation: improve the average case of Shift-Or
Idea:
- sample T every k symbols: T' = t0, tk, t2k, ...
- match k subpatterns of P: P0, ..., Pk–1, each sampled in the same way as T, starting from offsets 0, 1, ..., k–1
- when some subpattern matches, verify whether there is a true match in T

9–14 AOSO – example
T = gcatcgcagagat, P = gcagag, k = 2
Sampled text: T' = gaccggt; subpatterns: P0 = gaa, P1 = cgg
Processing: T' = g.a.c.c.g.g.t is scanned for occurrences of P0 = gaa and P1 = cgg.
At most positions no subpattern matches; when P1 = cgg matches T'[3..5], the candidate is verified against T and the verification succeeds: P occurs in T (starting at text position 5, counting from 0).
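The following C++ sketch illustrates only the sampling-and-verification idea of AOSO; for clarity it searches each subpattern in the sampled text with std::string::find instead of the bit-parallel machinery AOSO actually uses, and it assumes k ≤ m:

```cpp
#include <iostream>
#include <string>
#include <vector>

std::vector<size_t> aoso_like(const std::string& T, const std::string& P, size_t k) {
    const size_t n = T.size(), m = P.size();
    std::vector<size_t> matches;
    std::string Ts;                                     // sampled text T' = T[0], T[k], T[2k], ...
    for (size_t i = 0; i < n; i += k) Ts += T[i];
    for (size_t j = 0; j < k; ++j) {
        std::string Pj;                                 // subpattern P_j = P[j], P[j+k], ...
        for (size_t i = j; i < m; i += k) Pj += P[i];
        // Every occurrence of P_j in T' is only a candidate; verify it in T.
        for (size_t pos = Ts.find(Pj); pos != std::string::npos; pos = Ts.find(Pj, pos + 1)) {
            if (pos * k < j) continue;                  // P_j starts at offset j of P,
            size_t start = pos * k - j;                 // so P would start here in T
            if (start + m <= n && T.compare(start, m, P) == 0)
                matches.push_back(start);               // true match confirmed
        }
    }
    return matches;
}

int main() {
    for (size_t pos : aoso_like("gcatcgcagagat", "gcagag", 2))
        std::cout << "verified match at " << pos << "\n";   // prints 5
}
```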

15 AOSO
Pros:
- faster than Shift-Or: O(n log_σ(m) / m) time in the average case
Cons:
- needs verification to exclude false matches; not a big problem in practice

16 Multi-pattern AOSO (MAOSO)
Idea:
- merge the r input patterns into one superimposed pattern
- search for just the superimposed pattern, then exclude false matches by verification
Example (r = 2): P0 = ATGG, P1 = ACTA; merged: P* = [A][TC][GT][GA]

17 MAOSO – some details
- Set the bit-vectors (in the Shift-Or manner) whenever any of the symbols at a given position of the superimposed pattern is present (see the sketch below)
- Use AOSO on the superimposed pattern
- Problem: if r is large and (especially) σ is small, there are a lot of verifications
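A minimal sketch of building such superimposed Shift-Or bit-vectors (illustrative code, not the authors' implementation; it assumes equal-length patterns with m ≤ 64):

```cpp
#include <cstdint>
#include <string>
#include <vector>

// B[c] gets a 0-bit at position j if ANY pattern has symbol c at position j, so a
// Shift-Or/AOSO automaton driven by these masks accepts the superimposed pattern;
// reported positions are only candidates and must be verified against the r patterns.
std::vector<uint64_t> superimposed_masks(const std::vector<std::string>& patterns) {
    std::vector<uint64_t> B(256, ~0ULL);
    for (const std::string& P : patterns)
        for (size_t j = 0; j < P.size(); ++j)
            B[(unsigned char)P[j]] &= ~(1ULL << j);     // clear bit j for every pattern's symbol
    return B;
}
// Example: for {"ATGG", "ACTA"} the masks encode P* = [A][TC][GT][GA].
```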

18 Q-grams
Idea: group q successive text characters into supersymbols. New alphabet size: σ^q. Enlarging the alphabet may reduce the number of comparisons between the text and the pattern (an encoding sketch follows).
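A small sketch of such a supersymbol encoding (the overlapping, per-position encoding and all names are illustrative choices of mine; it assumes the supersymbol range σ'^q fits in 32 bits):

```cpp
#include <cstdint>
#include <vector>

// Pack q consecutive symbols over a reduced alphabet of size sigma_p into one
// integer code, i.e. treat each q-gram as a base-sigma_p number.
std::vector<uint32_t> qgram_encode(const std::vector<uint8_t>& mapped_text,
                                   unsigned q, unsigned sigma_p) {
    std::vector<uint32_t> out;
    if (mapped_text.size() < q) return out;
    for (size_t i = 0; i + q <= mapped_text.size(); ++i) {
        uint32_t code = 0;
        for (unsigned j = 0; j < q; ++j)
            code = code * sigma_p + mapped_text[i + j];
        out.push_back(code);                            // one supersymbol per text position
    }
    return out;
}
```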

19 Alphabet mapping
Map a large alphabet of σ symbols to a smaller alphabet of σ' symbols. We achieve this with a bin-packing method (a greedy sketch is given below).
Symbol frequencies: 'E': 27, 'T': 15, 'A': 10, 'C': 8, 'D': 7, 'B': 5, 'G': 2
New alphabet (σ' = 4), bins of mapped symbols:
Bin 0: 'E'
Bin 1: 'T'
Bin 2: 'A', 'C'
Bin 3: 'D', 'B', 'G'
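A minimal sketch of one plausible greedy binning: sort symbols by decreasing frequency, then assign each to the currently lightest bin. The slides only say the binning is greedy after a frequency sort, so the exact rule (and hence the resulting bins) may differ from the table above:

```cpp
#include <algorithm>
#include <map>
#include <utility>
#include <vector>

// Map each original symbol to one of num_bins codes, trying to balance total frequency.
std::map<char, int> bin_alphabet(const std::map<char, long>& freq, int num_bins) {
    std::vector<std::pair<char, long>> symbols(freq.begin(), freq.end());
    std::sort(symbols.begin(), symbols.end(),
              [](const auto& a, const auto& b) { return a.second > b.second; });
    std::vector<long> load(num_bins, 0);
    std::map<char, int> mapping;
    for (const auto& [sym, f] : symbols) {
        int best = (int)(std::min_element(load.begin(), load.end()) - load.begin());
        mapping[sym] = best;                            // least-loaded bin gets the symbol
        load[best] += f;
    }
    return mapping;                                     // symbol -> code in the reduced alphabet
}
```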

20 Multi AOSO on q-Grams (MAG)
- The super-alphabet reduces the number of verifications: the probability that a sampled q-gram matches is p = O(qr/σ^q), so the verification probability is O(p^⌊m/(kq)⌋) and the verification cost is O(rqm).
- The q-gram based search makes bigger steps (of length q); in other words, the text effectively becomes shorter (n/q).
- FAOSO runs in O(n/k · ⌈(m/q)/w⌉) time in our case, where w is the number of bits in a computer word (typically 64).
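As a rough numerical illustration (the parameter values are chosen here purely for illustration, not taken from the paper): with σ = 20, q = 2, r = 10, m = 16 and k = 2,

```latex
p \approx \frac{qr}{\sigma^{q}} = \frac{2 \cdot 10}{20^{2}} = 0.05,
\qquad
p^{\lfloor m/(kq) \rfloor} = 0.05^{\lfloor 16/4 \rfloor} = 0.05^{4} = 6.25 \times 10^{-6},
```

i.e. only a handful of verifications per million sampled positions.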

21 Simple Multi AOSO on q-Grams (SMAG)
- A simpler version of the method above: the whole text is encoded before the actual search starts, which makes the search itself more streamlined.
- The total complexity is Ω(n), the time needed to encode the text.
- Slightly faster search, but a much longer preprocessing phase.
- Possibly useful if the text is searched many times within a short period and there is space to store it in encoded form (a preprocessing sketch follows).
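A short preprocessing sketch in this spirit; it reuses the hypothetical bin_alphabet() and qgram_encode() helpers from the earlier sketches, so it is illustrative rather than standalone:

```cpp
#include <cstdint>
#include <map>
#include <string>
#include <vector>

// SMAG-style preprocessing: map the whole text to the reduced alphabet and
// q-gram-encode it once, so that repeated searches can skip this O(n) step.
std::vector<uint32_t> smag_preprocess(const std::string& text,
                                      const std::map<char, long>& freq,
                                      unsigned q, int sigma_p) {
    std::map<char, int> mapping = bin_alphabet(freq, sigma_p);      // reduced alphabet
    std::vector<uint8_t> mapped;
    mapped.reserve(text.size());
    for (char c : text) mapped.push_back((uint8_t)mapping[c]);      // O(n) encoding pass
    return qgram_encode(mapped, q, (unsigned)sigma_p);              // supersymbol text
}
```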

22 Experimental results
Hardware: Intel Core i CPU, 128KB L1, 512KB L2 and 3MB L3 cache, 4GB of 1333MHz DDR3 RAM
Compiler: g++ with -O3 optimization
OS: 64-bit Ubuntu
Text: taken from the Pizza & Chili Corpus, 200MB per dataset
Tests: all source codes were obtained from their authors and compiled on the same test machine (some of them cannot handle long patterns, i.e. m = 64)

23 Experimental results, varying r

24 Experimental results, varying m

25 Experimental results, varying q

26 Conclusions
- Our work can be seen as a new and quite successful combination of known building blocks.
- The presented algorithm, MAG, usually outperforms its competitors on the three test datasets (english, proteins and dna).
- One of the key ideas was alphabet quantization (binning), which is performed in a greedy manner after sorting the original alphabet by frequency.

27 Future work
- Different alphabet mapping techniques could improve efficiency.
- Is it possible to choose the algorithm's parameters so as to reach average optimality (for m = O(w))?
- SSE instructions seem to offer great opportunities, especially for bit-parallel algorithms.
- Dense codes (e.g., ETDC) for words or q-grams not only compress the data (texts), but also enable faster pattern searches (our preliminary results are rather promising).

