String Matching Algorithm Overview & Analysis By cyclone @ NSlab, RIIT Sep. 6 2008
Structure: Algorithm overview; Performance experiments; Solution and future work; Bibliography and resources
Algorithm Overview
Definition: Given an alphabet S, a pattern P of length m, and a text T of length n, determine whether P occurs in T, or find the position(s) at which P matches a substring of T, where usually m << n. The problem depends on the kind of pattern P: a string gives exact string matching; a string with errors gives approximate string matching; a regular expression gives regular expression matching.
Categories. Category 1: single string matching algorithms, multiple string matching algorithms. Category 2: prefix based algorithms, suffix based algorithms, factor based algorithms. Category 3: automaton based algorithms, trie based algorithms, table based algorithms. ……
Prefix based algorithms: The search is done forward in the search window. Search for the longest prefix of the window that is also a prefix of the pattern(s). All the characters are read.
Suffix based algorithms: The search is done backwards along the search window. Search for the longest suffix of the window that is also a suffix of the pattern(s). Not all the characters are read, which leads to sublinear average-case algorithms.
Factor based algorithms: The search is done backwards along the search window. Search for the longest suffix of the window that is also a factor of the pattern(s). Not all the characters are read. This requires recognizing the set of factors of the pattern(s), which is quite complex.
Roadmap
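KMP. MP: uses the border, i.e. the longest prefix v of the pattern that is also a suffix of the already matched portion u of the text. KMP: uses the longest border that also satisfies the extra condition "C(u+1) not equal to C(v+1)", i.e. the character following the border differs from the character that just mismatched.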
Knuth-Morris-Pratt: The preprocessing phase takes O(m) space and time. The searching phase takes O(n+m) time, independent of the alphabet size.
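The two phases can be sketched as follows. This is a minimal illustration, not the presenter's code: it uses the plain border table (the MP variant) and omits the stricter KMP condition on the next character. The function names build_failure and kmp_search are made up for this example.

```python
def build_failure(pattern):
    """Preprocessing, O(m): fail[i] is the length of the longest proper
    border (prefix that is also a suffix) of pattern[:i + 1]."""
    fail = [0] * len(pattern)
    k = 0
    for i in range(1, len(pattern)):
        while k > 0 and pattern[i] != pattern[k]:
            k = fail[k - 1]          # fall back to the next shorter border
        if pattern[i] == pattern[k]:
            k += 1
        fail[i] = k
    return fail


def kmp_search(text, pattern):
    """Searching, O(n): return the start positions of all occurrences."""
    if not pattern:
        return []
    fail = build_failure(pattern)
    matches = []
    k = 0                            # number of pattern characters matched
    for i, ch in enumerate(text):
        while k > 0 and ch != pattern[k]:
            k = fail[k - 1]
        if ch == pattern[k]:
            k += 1
        if k == len(pattern):
            matches.append(i - k + 1)
            k = fail[k - 1]          # keep going to find overlapping matches
    return matches
```

For example, kmp_search("abababa", "aba") returns [0, 2, 4]; every text character is read exactly once, which is the prefix based behaviour described earlier.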
Aho-Corasick. Finite automata: a finite set of states Q, among which one is Initial and some are Terminal. Transitions between states are labeled by characters or by ε, as decided by a transition function F. An automaton is written (S,Q,I,T,F). NFA & DFA.
Aho-Corasick: NFA. AC automaton (above: pattern set {ATATATA, TATAT, ACGATAT}). It extends the concept of the border in KMP to searching a pattern set. The NFA has three main functions: the Goto function (solid transitions), the Failure function (dashed transitions), and the Output function (double-circle states).
Aho-Corasick: NFA. The search stage is simple. Start from state 0 and read the characters one after another. If the current state has a Goto transition for the character being read, current ← Goto(current). Otherwise follow the Failure transitions: while current has no relevant Goto transition, current ← Failure(current); then current ← Goto(current). State 0 loops to itself on characters with no Goto transition. Finally, check whether the current state has an Output and report it.
Aho-Corasick: DFA. Preprocessing: convert the NFA to a DFA by traversing the NFA and precomputing all failure paths, so that every input character finds a Goto transition. Search: no need to travel back along the failure path. This is a trade-off between storage and search speed.
Aho-Corasick. Pros: searching time complexity is O(n), independent of the pattern set size. Cons: when the pattern set size increases, the memory needed increases drastically. Changes in cache and memory access time will compromise the time performance.
AC-Modified. AC-sparse and AC-banded: sparse vector storing method for transitions. State and path compression by Tuck. Character index AC by Jianming. Others……
Boyer-Moore: How to safely shift the search window without missing a possible match. Two major heuristics: Good Suffix (two shift values: s1, s2) and Bad Character (one shift value: s3). In the searching stage, the shift value is calculated as max(min(s1,s2),s3).
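The three functions and the search loop can be sketched like this. It is a minimal illustration under assumptions: states are list indices, Goto is a list of dicts, and the names build_ac and ac_search are invented for this example.

```python
from collections import deque


def build_ac(patterns):
    """Build the Goto, Failure and Output tables for a pattern set."""
    goto = [{}]      # goto[state][char] -> next state
    fail = [0]       # failure link of each state
    output = [[]]    # patterns reported when a state is entered
    for pat in patterns:
        state = 0
        for ch in pat:
            if ch not in goto[state]:
                goto.append({})
                fail.append(0)
                output.append([])
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        output[state].append(pat)
    # Breadth-first order: a failure link only points to a shallower state.
    queue = deque(goto[0].values())
    while queue:
        state = queue.popleft()
        for ch, nxt in goto[state].items():
            queue.append(nxt)
            f = fail[state]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            output[nxt] = output[nxt] + output[fail[nxt]]
    return goto, fail, output


def ac_search(text, patterns):
    """Return (start position, pattern) for every occurrence in text."""
    goto, fail, output = build_ac(patterns)
    matches = []
    state = 0
    for i, ch in enumerate(text):
        while state and ch not in goto[state]:
            state = fail[state]             # follow the dashed transitions
        state = goto[state].get(ch, 0)      # state 0 loops on unknown characters
        for pat in output[state]:
            matches.append((i - len(pat) + 1, pat))
    return matches
```

Matches are reported in order of their end position, so sorting the result may be convenient when comparing against another matcher.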
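A sketch of the NFA-to-DFA conversion, assuming the alphabet is passed explicitly and contains every pattern character. The names build_ac_dfa and dfa_search are hypothetical. Each state gets a full transition row, which is where the extra storage goes.

```python
from collections import deque


def build_ac_dfa(patterns, alphabet):
    """Return (delta, output); delta[state][ch] is defined for every ch
    in alphabet, so the search never walks a failure path."""
    goto = [{}]
    output = [[]]
    for pat in patterns:
        state = 0
        for ch in pat:
            if ch not in goto[state]:
                goto.append({})
                output.append([])
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        output[state].append(pat)

    fail = [0] * len(goto)
    delta = [dict() for _ in goto]
    for ch in alphabet:                      # missing root transitions loop to 0
        delta[0][ch] = goto[0].get(ch, 0)
    queue = deque(goto[0].values())
    while queue:
        state = queue.popleft()
        for ch in alphabet:
            if ch in goto[state]:
                nxt = goto[state][ch]
                fail[nxt] = delta[fail[state]][ch]
                output[nxt] = output[nxt] + output[fail[nxt]]
                delta[state][ch] = nxt
                queue.append(nxt)
            else:
                # Precompute what the failure walk would have reached.
                delta[state][ch] = delta[fail[state]][ch]
    return delta, output


def dfa_search(text, delta, output):
    """One table lookup per character; characters outside the alphabet
    cannot belong to any pattern, so they reset the state to 0."""
    matches = []
    state = 0
    for i, ch in enumerate(text):
        state = delta[state].get(ch, 0)
        for pat in output[state]:
            matches.append((i - len(pat) + 1, pat))
    return matches
```

With a 256-character alphabet every state stores 256 transitions, which illustrates why memory grows so quickly with the pattern set size.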
Good suffix heuristic
Bad Character heuristic
Boyer-Moore. Pros: excellent time performance. Worst case: O(mn). Average case: sublinear. Best case: O(n/m), e.g. a^(m-1)b in b^n. Fast when the alphabet size is large, which is common in NIMS. Cons: calculating the shift value in both heuristics is somewhat complex.
BM-Horspool: For a large alphabet, the bad character heuristic almost always gives the larger shift value. New bad character heuristic: only consider the last character in the window.
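A minimal Horspool sketch, where the shift depends only on the last character of the window. The function name horspool_search is made up for this example, and the window comparison uses a slice for brevity rather than an explicit right-to-left loop.

```python
def horspool_search(text, pattern):
    """Return the start positions of all occurrences of pattern in text."""
    m, n = len(pattern), len(text)
    if m == 0 or m > n:
        return []
    # Shift of a character: distance from its last occurrence in
    # pattern[:-1] to the end of the pattern. Absent characters shift by m.
    shift = {}
    for i, ch in enumerate(pattern[:-1]):
        shift[ch] = m - 1 - i
    matches = []
    pos = 0
    while pos <= n - m:
        if text[pos:pos + m] == pattern:
            matches.append(pos)
        # Only the last character of the window decides the shift.
        pos += shift.get(text[pos + m - 1], m)
    return matches
```

When the last window character does not occur in the pattern at all, the window jumps by the full pattern length m, which is the source of the O(n/m) best case.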
Roadmap
Commentz-Walter: Natural extension of BM. Uses the reversed trie of the pattern set. Trie: a set of nodes connected by unidirectional links, where each link has a label. It can represent a set of strings.
Set BM-Horspool: Natural extension of BM-Horspool. Uses the reversed trie of the pattern set.
Wu-Manber: AC_BM is complex in its shift value calculation. SBMH performs badly when the pattern set is large: the probability that a character appears in some pattern is high, so the average shift value is comparatively small. So WM extends the "bad character" of SBMH to a "bad character block".
Wu-Manber: Use a hash table called SHIFT to store the shift values of character blocks. Use a HASH table to link the patterns that have the same last character block. Use a PREFIX table to discriminate between the patterns linked to the same HASH entry. The SHIFT table and the HASH table share the same hash function.
Wu-Manber: SHIFT table. Block: Bl. Size of block: B. Hash function: h(). Minimum pattern length: lmin. The SHIFT(j) entry stores the minimum shift value over all blocks with h(Bl)=j. The shift value of a block is calculated as follows. If Bl does not appear in any pattern, its shift value is lmin-B+1. If Bl appears in some patterns, take its rightmost occurrence and let j be the offset of Bl in that pattern; its shift value is lmin-B-j.
Wu-Manber: searching stage. Read a character block and hash it. Look up the relevant SHIFT entry to get the value shift. If shift>0, move the window forward by shift. If shift=0, find the relevant HASH entry and verify all the possibly matching patterns linked to this entry, with the help of the PREFIX table. Whether or not a match is found, then move the window forward by 1.
Wu-Manber. Pros: excellent average time performance; the hash function avoids unnecessary character comparisons. Cons: bad worst-case performance, e.g. {baa, caa, daa} in a^n; the shift value is limited by lmin.
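A small sketch of a reversed trie, assuming nested dicts as nodes and the key None as an end-of-pattern marker. Both choices, and the names build_reversed_trie and patterns_ending_at, are illustrative only. Reading the text backwards from a position follows the trie from the last character of each pattern towards its first.

```python
def build_reversed_trie(patterns):
    """Insert every pattern right to left into a trie of nested dicts."""
    root = {}
    for pat in patterns:
        node = root
        for ch in reversed(pat):
            node = node.setdefault(ch, {})
        node[None] = pat          # None never collides with a text character
    return root


def patterns_ending_at(trie, text, end):
    """Read text backwards from index end; return the patterns that end there."""
    found = []
    node = trie
    i = end
    while i >= 0 and text[i] in node:
        node = node[text[i]]
        if None in node:
            found.append(node[None])
        i -= 1
    return found
```

This shows only the trie and the backward scan; the shift computation that makes Commentz-Walter skip characters is not included.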
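The table construction and the searching stage can be sketched together. This is a simplified illustration under stated assumptions: Python dicts keyed by the block itself stand in for the hash function h(), the PREFIX table is reduced to comparing the first B characters of each candidate, offsets j are counted from 0, and the names wm_build and wm_search are invented here.

```python
def wm_build(patterns, B=2):
    """Build the SHIFT and HASH tables from the first lmin characters
    of every pattern."""
    lmin = min(len(p) for p in patterns)
    assert lmin >= B, "every pattern must hold at least one block"
    default = lmin - B + 1               # block appears in no pattern
    shift = {}
    hash_table = {}
    for pat in patterns:
        head = pat[:lmin]
        for j in range(lmin - B + 1):
            block = head[j:j + B]
            # lmin - B - j for a block at offset j; keep the minimum.
            shift[block] = min(shift.get(block, default), lmin - B - j)
        hash_table.setdefault(head[lmin - B:], []).append(pat)
    return lmin, default, shift, hash_table


def wm_search(text, patterns, B=2):
    """Return (start position, pattern) for every occurrence in text."""
    lmin, default, shift, hash_table = wm_build(patterns, B)
    matches = []
    pos = 0                              # start of the window of length lmin
    while pos <= len(text) - lmin:
        block = text[pos + lmin - B:pos + lmin]
        s = shift.get(block, default)
        if s > 0:
            pos += s
            continue
        prefix = text[pos:pos + B]       # stand-in for the PREFIX table
        for pat in hash_table.get(block, []):
            if pat[:B] == prefix and text.startswith(pat, pos):
                matches.append((pos, pat))
        pos += 1                         # move by 1 whether or not it matched
    return matches
```

With the patterns announce, annual and annually, lmin is 6, so no shift can exceed 5. On the slide's worst-case input ({baa, caa, daa} in a text of a's) every window has shift 0 and the search advances one character at a time.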
Roadmap
Backward Dawg Matching (BDM): Read the window backwards and check whether the string u read so far is a factor of the pattern. If not, we can safely shift the pattern to the beginning of u. A suffix automaton is used to recognize all the factors of the reversed pattern.
Backward Dawg Matching (BDM)
BOM: There is no need to confirm that u is a factor of the pattern. Knowing that au is not a factor of the pattern is enough. BOM uses a lightweight data structure called the Factor Oracle to replace the suffix automaton.
BOM: Factor Oracle of the reversed pattern {announce}. It is simpler than a suffix automaton, but it also accepts some "error factors", i.e. strings that are not real factors (e.g. cnna in the picture).
Set BDM: Natural extension of BDM. Cons: the suffix automaton consumes lots of memory, and the construction of the automaton is complex.
Set BOM: Overcomes the shortcomings of the suffix automaton by using a factor oracle. Take the lmin-length prefix of every pattern, reverse them, and build the Factor Oracle. Example: {announce, annual, annually}
Set BOM: searching stage. Read the characters backwards. If the factor recognition stops, we can shift the search window. If we reach the beginning of the window, we first verify the lmin-length path in the SBOM against the characters in the window. If that verification passes, we go on to verify the whole pattern. If it fails, move the window forward by one character.
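A sketch of the standard factor oracle construction and a single-pattern BOM search built on it. The names build_factor_oracle, oracle_accepts and bom_search are assumptions of this example, and the search always advances by one after a full read of the window, as a simplification.

```python
def build_factor_oracle(word):
    """Return the transition table of the factor oracle of word
    (states 0..len(word), one dict of transitions per state)."""
    delta = [{} for _ in range(len(word) + 1)]
    supply = [-1] * (len(word) + 1)          # -1 marks "no supply state"
    for i, ch in enumerate(word, start=1):
        delta[i - 1][ch] = i                 # internal transition
        k = supply[i - 1]
        while k > -1 and ch not in delta[k]:
            delta[k][ch] = i                 # external transition
            k = supply[k]
        supply[i] = 0 if k == -1 else delta[k][ch]
    return delta


def oracle_accepts(delta, s):
    """True if s can be read from state 0 (every state is accepting)."""
    state = 0
    for ch in s:
        if ch not in delta[state]:
            return False
        state = delta[state][ch]
    return True


def bom_search(text, pattern):
    """Return the start positions of all occurrences of pattern in text."""
    m = len(pattern)
    if m == 0:
        return []
    delta = build_factor_oracle(pattern[::-1])
    matches = []
    pos = 0
    while pos <= len(text) - m:
        state, j = 0, m
        while j > 0 and state is not None:   # read the window backwards
            state = delta[state].get(text[pos + j - 1])
            j -= 1
        if state is not None:                # whole window was read
            if text[pos:pos + m] == pattern:
                matches.append(pos)
            pos += 1
        else:                                # recognition stopped at text[pos + j]
            pos += j + 1
    return matches
```

Building the oracle for the reversed pattern of announce and feeding it cnna reproduces the slide's example: the string is accepted although it is not a factor of the reversed pattern. The oracle never rejects a real factor, so the shifts remain safe.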
Roadmap
AC, WM & SBOM 1000 random patterns, 10MB random text
Performance Experiments
Performance Test. Algorithms: AC, AC_BM, WM, HybridWM, SBOM, RSI. Test environment: Processor: Intel Centrino Duo, 1.83GHz. Cache: 32KB L1 instruction, 32KB L1 data, 2048KB shared L2 cache. DRAM: 1.5GB DDR2, 667MHz. OS: Windows XP SP2.
Test 1-Random Scenario. Alphabet size: 256. Random text: 32MB, with manually set matches of 10%. Random pattern set with a special length distribution (pattern length from 4 to 100; about 80% of the patterns are of length 8 to 16). Pattern set size: from 50 to 5000.
Searching Performance
Analysis: WM is the most efficient algorithm under such scenarios (long random patterns, low matching rate, low memory requirement). AC performance does not decline greatly compared with the others. This matches the theoretical analysis.
Memory
Analysis: WM consumes much less memory than the other algorithms (hash tables). AC and AC_BM consume lots of memory (automaton data structure). SBOM is in the middle (lightweight Factor Oracle).
Test 2 Snort Pattern set (1785 patterns, average length 16)
Test 2-Real Scenario. MIT DARPA IDS data set: LLS-DOS1.0, LLS-DOS2.02, winNT-inside
Data set   Size (MB)   Match Times                                               Match Rate (%)
                       P1         P2        P3         P4        ALL
LLDOS1     116         41953565   1161663   42008979   1106249   43115228        55%
LLDOS2     63          22396353   814154    22433249   777258    23210507        59%
inside     226         72021566   3986124   72169387   3838303   76007690        63%
Results: Full Pattern Set
Algorithm   Throughput (Mbps)              MEM (MB)
            LLDOS 1   LLDOS 2   inside
AC          376.48    377.76    387.28     29.39
AC_BM       154.96    153.52    150.48     20.18
HybridWM    133.12    135.2     141.68     0.42
Results: P1 (plen<4) and P2 (plen>=4)
P1 (plen<4)
Algorithm   Throughput (Mbps)              MEM (MB)
            LLDOS 1   LLDOS 2   inside
AC          420       420.4     448.72     0.17
AC_BM       212       204.8     214        0.2
SBOM        127.52    128.56    140.24     0.08
P2 (plen>=4)
Algorithm   Throughput (Mbps)              MEM (MB)
            LLDOS 1   LLDOS 2   inside
AC          782.88    758.8     752        29.18
AC_BM       281.12    277.92    270.48     20.04
WM          420.56    417.68    404.8      0.42
RSI         371.04    378.72    346.72     6.45
SBOM        181.44    181.6     178.72     6.82
Results: P3 (plen<6) and P4 (plen>=6)
P3 (plen<6)
Algorithm   Throughput (Mbps)              MEM (MB)
            LLDOS 1   LLDOS 2   inside
AC          404.32    412       429.92     0.91
AC_BM       184.56    183.36    183.44     0.85
SBOM        56        56.96     61.28      0.25
P4 (plen>=6)
Algorithm   Throughput (Mbps)              MEM (MB)
            LLDOS 1   LLDOS 2   inside
AC          756.48    758       755.6      28.44
AC_BM       335.52    328.32    322.8      19.53
WM          644.88    626.64    578.88     0.41
RSI         552.72    533.12    505.52     5.35
SBOM        386.48    383.2     377.12     9.21
Analysis: AC performance does not decline greatly from P1 to P3. WM performance increases greatly from P2 to P4, so the dividing pattern length could be larger than 6. AC has higher searching performance, even compared with WM on P2 and P4. Many of the patterns in Snort share the same prefixes, which is bad for WM, especially when the matching rate is high.
Test 3-AC & WM Hybrid. MIT DARPA IDS data set: LLS-DOS2.02 (match rate: 59%); mixed with background flows: LLDOS30 (match rate: 34%), LLDOS10 (match rate: 18%). Flexible pattern division: division line from 4 to 12.
Results
Columns: Divide length; MEM (KB); Throughput (Mbps) on LLDOS2 (59%), LLDOS30 (34%), LLDOS10 (18%); AC, WM, total
4 127 426 424 212 535 637 291 671 1007 402
6 633 418 416 640 248 521 984 341 642 1633 461
8 1280 412 406 892 280 511 1368 372 623 2241 488
10 2441 1320 316 502 1909 398 612 2924 506
11 3080 397 384 296 497 2101 611 3246 514
12 3922 391 1496 302 501 2322 609 3551 520
Solution and Future Work
Alternatives: AC only; AC & WM hybrid
Future Work. AC (memory compression): Automata: NFA, banded, sparse, or other ideas? Pattern set: subset division? …… WM: Same-prefix problem: dynamic cut? Worst-case problem: matches signal? Performance improvement: intelligent verification?
Bibliography: All the papers involved in this presentation have been uploaded to NSlab server 20, categorized by prefix, suffix and factor (done). There is also a document named current for new papers that appeared in recent high-ranked conferences such as INFOCOM, SIGCOMM, USENIX Security, CPM and so on (in progress). \\166.111.137.20\venus\文献 资源\Zongwei\string matching algorithm
Other Useful Resources. Book. Websites: Pattern Matching Pointer, http://www.cs.ucr.edu/~stelo/pattern.html, maintained by associate professor Stefano Lonardi of UC Riverside. EXACT STRING MATCHING ALGORITHMS, http://www-igm.univ-mlv.fr/~lecroq/string/, with descriptions, complexity analysis and C source code of many single string matching algorithms.