String Matching Algorithm


String Matching Algorithm Overview & Analysis
By cyclone @ NSlab, RIIT
Sep. 6, 2008

Structure
- Algorithm overview
- Performance experiments
- Solution and future work
- Bibliography and resources

Algorithm Overview

Definition
Given an alphabet Σ, a pattern P of length m, and a text T of length n, decide whether P occurs in T, or report the position(s) at which P matches a substring of T. Usually m << n.
The kind of pattern P determines the problem:
- String: exact string matching
- String with errors: approximate string matching
- Regular expression: regular expression matching
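To make the exact-matching case of the definition concrete, here is a minimal Python sketch of a brute-force scan that reports every position where P matches a substring of T. The name `find_all` is an illustrative assumption, not something taken from the slides. It costs O(mn) comparisons in the worst case, which is the baseline the algorithms in this talk try to beat.

```python
def find_all(pattern, text):
    """Return every 0-based position where pattern occurs in text."""
    m, n = len(pattern), len(text)
    # Try each of the n - m + 1 alignments and compare the whole window.
    return [i for i in range(n - m + 1) if text[i:i + m] == pattern]


if __name__ == "__main__":
    print(find_all("ATA", "ATATATA"))
```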

Categories
Category 1
- Single string matching algorithms
- Multiple string matching algorithms
Category 2
- Prefix based algorithms
- Suffix based algorithms
- Factor based algorithms
Category 3
- Automaton based algorithms
- Trie based algorithms
- Table based algorithms
- ……

Prefix based algorithms
- The search is done forward in the search window
- Search for the longest prefix of the window which is also a prefix of the pattern(s)
- All the characters are read

Suffix based algorithms
- The search is done backwards along the search window
- Search for the longest suffix of the window which is also a suffix of the pattern(s)
- Not all the characters are read, which leads to algorithms that are sublinear in the average case

Factor based algorithms
- The search is done backwards along the search window
- Search for the longest suffix of the window which is also a factor of the pattern(s)
- Not all the characters are read
- Requires recognizing the set of factors of the pattern(s), which is quite complex

Roadmap

KMP
- MP: uses the border, i.e. the longest prefix v of the pattern which is also a suffix of u, the portion of the text matched so far
- KMP: uses the longest border that also satisfies "C(u+1) not equal to C(v+1)"

Knuth-Morris-Pratt
- Preprocessing phase in O(m) space and time complexity
- Searching phase in O(n+m) time complexity (independent of the alphabet size)
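The two phases can be sketched in Python as follows. This is a minimal sketch under stated assumptions: the names `border_table` and `kmp_search` are illustrative, and the table stores plain borders (the MP variant), without KMP's extra condition that the next characters differ. The complexity bounds on the slide hold for both variants.

```python
def border_table(pattern):
    """border[i] = length of the longest border of pattern[:i + 1]."""
    border = [0] * len(pattern)
    k = 0  # length of the border currently being extended
    for i in range(1, len(pattern)):
        while k > 0 and pattern[i] != pattern[k]:
            k = border[k - 1]  # fall back to the next shorter border
        if pattern[i] == pattern[k]:
            k += 1
        border[i] = k
    return border


def kmp_search(pattern, text):
    """Return all 0-based match positions, reading each text character once."""
    if not pattern:
        return []
    border = border_table(pattern)
    matches = []
    k = 0  # number of pattern characters currently matched
    for i, ch in enumerate(text):
        while k > 0 and ch != pattern[k]:
            k = border[k - 1]
        if ch == pattern[k]:
            k += 1
        if k == len(pattern):
            matches.append(i - k + 1)
            k = border[k - 1]  # keep going to find overlapping matches
    return matches


if __name__ == "__main__":
    print(border_table("ATATATA"))
    print(kmp_search("TATAT", "ACGATATATAT"))
```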

Aho-Corasick
Automaton (finite)
- A finite set of states Q, one of which is initial and some of which are terminal
- Transitions between states are labeled by characters of Σ or by ε, as decided by a transition function F
- Written (Σ, Q, I, T, F)
- NFA & DFA

Aho-Corasick: NFA
AC automaton (pattern set {ATATATA, TATAT, ACGATAT})
- Extends the concept of the border in KMP to searching a pattern set
- NFA
Three main functions
- Goto function (real transition)
- Failure function (dashed transition)
- Output function (double-circle state)

Aho-Corasick: NFA
The search stage is simple
- Start from state 0
- Read the characters one after another
- If the current state has a Goto transition for the character read: current ← Goto(current)
- Otherwise follow the Failure transitions: while current has no relevant Goto transition, current ← Failure(current); then current ← Goto(current)
- Check whether the current state has an Output and report it
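The Goto, Failure and Output functions and the search loop can be sketched in Python as below, using the slide's pattern set. This is an illustrative sketch, not the presenter's implementation: the names `build_automaton` and `ac_search` and the list-of-dicts state layout are assumptions made for the example.

```python
from collections import deque


def build_automaton(patterns):
    """Build Goto (trie), Failure and Output for a pattern set."""
    goto, fail, output = [{}], [0], [[]]
    for p in patterns:
        state = 0
        for ch in p:
            if ch not in goto[state]:
                goto.append({})
                fail.append(0)
                output.append([])
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        output[state].append(p)
    # Breadth-first pass: failure links of depth-1 states stay at state 0.
    queue = deque(goto[0].values())
    while queue:
        state = queue.popleft()
        for ch, nxt in goto[state].items():
            queue.append(nxt)
            f = fail[state]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            # Inherit the outputs reachable through the failure link.
            output[nxt] = output[nxt] + output[fail[nxt]]
    return goto, fail, output


def ac_search(patterns, text):
    """Return (start position, pattern) for every occurrence in text."""
    goto, fail, output = build_automaton(patterns)
    state, hits = 0, []
    for i, ch in enumerate(text):
        while state and ch not in goto[state]:
            state = fail[state]  # dashed (Failure) transition
        state = goto[state].get(ch, 0)  # real (Goto) transition
        for p in output[state]:
            hits.append((i - len(p) + 1, p))
    return hits


if __name__ == "__main__":
    print(ac_search(["ATATATA", "TATAT", "ACGATAT"], "ACGATATATAT"))
```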

Aho-Corasick: DFA
Preprocessing: conversion from NFA to DFA
- Traverse the NFA and precompute all the failure paths
- Every character read then finds a Goto transition
Search
- No need to travel back along the failure paths
Trade-off between storage and search speed
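A rough Python sketch of the conversion: the failure paths are resolved once during preprocessing, so every state ends up with one transition per alphabet character and the search does a single table lookup per character. The names `build_dfa` and `dfa_search` and the explicit `alphabet` argument are assumptions for the example, and every pattern and text character is assumed to belong to `alphabet`. The storage cost is visible directly: each state holds `len(alphabet)` entries.

```python
from collections import deque


def build_dfa(patterns, alphabet):
    """Return (delta, out): a complete transition table and output sets."""
    alphabet = sorted(set(alphabet))
    delta, out = [{}], [set()]
    for p in patterns:  # build the trie first
        s = 0
        for ch in p:
            if ch not in delta[s]:
                delta.append({})
                out.append(set())
                delta[s][ch] = len(delta) - 1
            s = delta[s][ch]
        out[s].add(p)
    fail = [0] * len(delta)
    queue = deque()
    for ch in alphabet:  # complete the initial state
        if ch in delta[0]:
            queue.append(delta[0][ch])
        else:
            delta[0][ch] = 0
    while queue:  # breadth-first, so shallower states are complete
        s = queue.popleft()
        for ch in alphabet:
            if ch in delta[s]:  # real trie edge
                t = delta[s][ch]
                fail[t] = delta[fail[s]][ch]
                out[t] |= out[fail[t]]
                queue.append(t)
            else:  # precomputed failure path
                delta[s][ch] = delta[fail[s]][ch]
    return delta, out


def dfa_search(patterns, text, alphabet):
    """Return (start position, pattern) pairs with one lookup per character."""
    delta, out = build_dfa(patterns, alphabet)
    state, hits = 0, []
    for i, ch in enumerate(text):
        state = delta[state][ch]
        for p in out[state]:
            hits.append((i - len(p) + 1, p))
    return hits


if __name__ == "__main__":
    print(sorted(dfa_search(["ATATATA", "TATAT", "ACGATAT"], "ACGATATATAT", "ACGT")))
```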

Aho-Corasick
Pros:
- Searching time complexity is O(n), independent of the pattern set size
Cons:
- As the pattern set grows, the memory needed increases drastically
- Cache behavior and memory access time then compromise the time performance

AC-Modified
- AC-sparse and AC-banded: sparse vector storing method for transitions
- State and path compression by Tuck
- Character index AC by Jianming
- Others……

Boyer-Moore
How to shift the search window safely without missing a possible match
Two major heuristics
- Good suffix (two shift values: s1, s2)
- Bad character (one shift value: s3)
In the searching stage, the shift value is calculated as max(min(s1, s2), s3).

Good suffix heuristic

Bad Character heuristic

Boyer-Moore
Pros:
- Excellent time performance
  Worst case: O(mn)
  Average case: sublinear
  Best case: O(n/m), e.g. a^(m-1)b in b^n
- Fast when the alphabet size is large, which is common in NIMS
Cons:
- Calculating the shift values in both heuristics is somewhat complex

BM-Horspool
- For a large alphabet, the bad character heuristic usually gives the bigger shift value
- New bad character heuristic: only consider the last character in the window
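A minimal Python sketch of the Horspool idea: the shift depends only on the text character under the last position of the window. The function name `horspool_search` and the dictionary-based shift table are assumptions for the example; a production version would typically use an array indexed by character code.

```python
def horspool_search(pattern, text):
    """Return all 0-based match positions using only a bad-character shift."""
    m, n = len(pattern), len(text)
    if m == 0 or m > n:
        return []
    # Distance from the rightmost occurrence of each character (excluding the
    # last pattern position) to the end of the pattern.
    shift = {ch: m - 1 - i for i, ch in enumerate(pattern[:-1])}
    matches = []
    pos = 0
    while pos <= n - m:
        if text[pos:pos + m] == pattern:
            matches.append(pos)
        # Characters that do not occur in the pattern allow a full shift of m.
        pos += shift.get(text[pos + m - 1], m)
    return matches


if __name__ == "__main__":
    print(horspool_search("announce", "we announce an announcement"))
```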

Roadmap

Commentz-Walter
- Natural extension of BM
- Uses a reversed trie of the pattern set
- Trie: a set of nodes joined by unidirectional links, each link carrying a label. It can represent a set of strings
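As an illustration of the reversed trie only (not of the full Commentz-Walter shift computation), here is a small Python sketch that stores each pattern from its last character to its first, so that a window can be read backwards against the whole set at once. The nested-dict layout, the name `build_reversed_trie` and the `END` marker key are assumptions made for the example.

```python
END = ""  # marker key; cannot collide with a one-character label


def build_reversed_trie(patterns):
    """Build a trie over the reversed patterns as nested dictionaries."""
    root = {}
    for p in patterns:
        node = root
        for ch in reversed(p):
            node = node.setdefault(ch, {})  # follow or create the labeled link
        node[END] = p  # remember which pattern ends at this node
    return root


def read_backwards(trie, text, end):
    """Read text backwards from index end and return the patterns ending there."""
    node, found = trie, []
    i = end
    while i >= 0 and text[i] in node:
        node = node[text[i]]
        if END in node:
            found.append(node[END])
        i -= 1
    return found


if __name__ == "__main__":
    trie = build_reversed_trie(["baa", "caa", "daa"])
    print(read_backwards(trie, "xxcaa", 4))
```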

Set BM-Horspool
- Natural extension of BM-Horspool
- Uses a reversed trie of the pattern set

Wu-Manber
- AC_BM is complex in its shift value calculation
- SBMH performs badly when the pattern set is large: the probability that a character appears in some pattern is high, so the average shift value is comparatively small
- So WM extends the "bad character" of SBMH to a "bad character block"

Wu-Manber
- Uses a hash table called SHIFT to store the shift values of character blocks
- Uses a HASH table to link the patterns that have the same last character block
- Uses a PREFIX table to discriminate between the patterns linked to the same HASH entry
- The SHIFT table and the HASH table share the same hash function

Wu-Manber SHIFT
- Block: Bl
- Size of block: B
- Hash function: h()
- Minimum pattern length: lmin
The SHIFT(j) entry stores the minimum shift value over all blocks Bl with h(Bl) = j.
The shift value of a block is calculated as follows:
- If Bl does not appear in any pattern, its shift value is lmin-B+1
- If Bl appears in some patterns, take its rightmost occurrence; if the offset of Bl in that pattern is j, its shift value is lmin-B-j

Wu-Manber
Searching stage
- Read a character block and hash it
- Find the relevant SHIFT entry; call its value shift
- If shift > 0, move the window forward by shift
- If shift = 0, find the relevant HASH entry and verify all the candidate patterns linked to this entry, with the help of the PREFIX table. Whether or not a match is found, move the window forward by 1
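The SHIFT rule and the searching stage can be sketched together in Python. This is a simplified sketch under stated assumptions: Python dictionaries keyed by the block itself stand in for the hash function h(), the PREFIX table is omitted (candidates are verified by direct comparison), and the names `wu_manber_search`, `shift` and `buckets` are illustrative. Offsets are 0-based and only the first lmin characters of each pattern feed the tables.

```python
def wu_manber_search(patterns, text, B=2):
    """Return (start position, pattern) for every occurrence in text."""
    lmin = min(len(p) for p in patterns)
    if lmin < B:
        raise ValueError("every pattern must be at least B characters long")
    default_shift = lmin - B + 1  # block does not appear in any pattern
    shift = {}    # plays the role of the SHIFT table
    buckets = {}  # plays the role of the HASH table
    for p in patterns:
        for j in range(lmin - B + 1):
            block = p[j:j + B]
            # Keep the minimum, i.e. the rightmost occurrence over all patterns.
            shift[block] = min(shift.get(block, default_shift), lmin - B - j)
        buckets.setdefault(p[lmin - B:lmin], []).append(p)
    hits = []
    pos = 0  # start of the current window of width lmin
    while pos + lmin <= len(text):
        block = text[pos + lmin - B:pos + lmin]  # last block of the window
        s = shift.get(block, default_shift)
        if s > 0:
            pos += s
            continue
        for p in buckets.get(block, ()):  # shift == 0: verify candidates
            if text.startswith(p, pos):
                hits.append((pos, p))
        pos += 1  # move by 1 whether or not something matched
    return hits


if __name__ == "__main__":
    print(wu_manber_search(["announce", "annual", "annually"],
                           "they announce it annually"))
```

With the pattern set {baa, caa, daa} and a text consisting only of the letter a, every window ends in the block "aa" with shift 0, so each position triggers a verification and the window advances by 1 every time.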

Wu-Manber
Pros:
- Excellent average time performance
  Hash function
  Avoids unnecessary character comparisons
Cons:
- Bad worst-case performance, e.g. {baa, caa, daa} in a^n
- The shift value is limited by lmin

Roadmap

Backward Dawg Matching (BDM)
- Read the window backwards and check whether the string u read so far is a factor of the pattern
- If it is not, we can safely shift the pattern to the beginning of u
- Uses a suffix automaton to recognize all the factors of the reversed pattern

Backward Dawg Matching (BDM)

BOM
- There is no need to confirm that u is a factor of the pattern; knowing that au is not a factor of the pattern is enough
- BOM uses a lightweight data structure called the Factor Oracle in place of the suffix automaton

BOM
Factor Oracle of the reversed pattern {announce}
- Simpler than a suffix automaton
- But it also accepts some "error factors" that are not real factors (e.g. cnna in the picture)
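The following Python sketch builds a Factor Oracle with the usual online construction (one state per character plus supply links) and uses it for a single-pattern BOM search. The names `factor_oracle`, `accepts` and `bom_search` are assumptions for the example. It reproduces the slide's point: the oracle of the reversed pattern accepts cnna although cnna is not a factor of "ecnuonna".

```python
def factor_oracle(word):
    """Return the transition table (a list of dicts) of the oracle of word."""
    trans = [{} for _ in range(len(word) + 1)]
    supply = [-1] * (len(word) + 1)
    for i, ch in enumerate(word):
        trans[i][ch] = i + 1  # internal transition to the new state
        k = supply[i]
        while k > -1 and ch not in trans[k]:
            trans[k][ch] = i + 1  # external transition
            k = supply[k]
        supply[i + 1] = 0 if k == -1 else trans[k][ch]
    return trans


def accepts(trans, s):
    """True if the oracle can read s starting from the initial state."""
    state = 0
    for ch in s:
        if ch not in trans[state]:
            return False
        state = trans[state][ch]
    return True


def bom_search(pattern, text):
    """Return all 0-based match positions of pattern in text."""
    m, n = len(pattern), len(text)
    if m == 0 or m > n:
        return []
    trans = factor_oracle(pattern[::-1])
    hits, pos = [], 0
    while pos <= n - m:
        state, j = 0, m
        while j > 0 and state is not None:  # read the window backwards
            state = trans[state].get(text[pos + j - 1])
            j -= 1
        if state is not None:
            # The only string of length m the oracle accepts is the
            # reversed pattern itself, so this is a match (and j == 0).
            hits.append(pos)
        pos += j + 1  # restart just after the character that failed
    return hits


if __name__ == "__main__":
    oracle = factor_oracle("announce"[::-1])
    print(accepts(oracle, "cnna"), "cnna" in "announce"[::-1])
    print(bom_search("announce", "we announce an announcement"))
```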

Set BDM
- Natural extension of BDM
Cons:
- The suffix automaton consumes lots of memory
- The construction of the automaton is complex

Set BOM
- Overcomes the shortcomings of the suffix automaton by using a Factor Oracle
- Take the prefix of length lmin of every pattern, reverse these prefixes, and build the Factor Oracle
- Example: {announce, annual, annually}

Set BOM
Searching stage
- Read the characters backwards
- If the factor recognition stops, we can shift the search window
- If we reach the beginning of the window, we first verify the lmin path in SBOM against the characters in the window
  If this verification passes, we go on to verify the whole pattern
  If it fails, move the window forward by one character

Roadmap

AC, WM & SBOM 1000 random patterns, 10MB random text

Performance Experiments

Performance Test
Algorithms
- AC, AC_BM, WM, HybridWM, SBOM, RSI
Test environment
- Processor: Intel Centrino Duo, 1.83 GHz
- Cache: 32 KB L1 instruction, 32 KB L1 data, 2048 KB shared L2
- DRAM: 1.5 GB DDR2, 667 MHz
- OS: Windows XP SP2

Test 1: Random Scenario
- Alphabet size: 256
- Random text of 32 MB with manually set matches of 10%
- Random pattern set with a special length distribution (pattern lengths from 4 to 100; about 80% of the patterns have length 8 to 16)
- Pattern set size from 50 to 5000

Searching Performance

Analysis
WM is the most efficient algorithm under such scenarios
- Long random patterns
- Low matching rate
- Low memory requirement
AC performance does not decline much compared with the others
- This matches the theoretical analysis

Memory

Analysis
- WM consumes much less memory than the other algorithms (hash tables)
- AC and AC_BM consume lots of memory (automaton data structure)
- SBOM is in the middle (lightweight Factor Oracle)

Test 2 Snort Pattern set (1785 patterns, average length 16)

Test 2: Real Scenario
MIT DARPA IDS data set: LLS-DOS1.0, LLS-DOS2.02, winNT-inside

Data set | Size (MB) | Match times P1 | Match times P2 | Match times P3 | Match times P4 | Match times ALL | Match rate (%)
LLDOS1 | 116 | 41953565 | 1161663 | 42008979 | 1106249 | 43115228 | 55%
LLDOS2 | 63 | 22396353 | 814154 | 22433249 | 777258 | 23210507 | 59%
inside | 226 | 72021566 | 3986124 | 72169387 | 3838303 | 76007690 | 63%

Results
Full pattern set

Algorithm | Throughput (Mbps) LLDOS 1 | Throughput (Mbps) LLDOS 2 | Throughput (Mbps) inside | MEM (MB)
AC | 376.48 | 377.76 | 387.28 | 29.39
AC_BM | 154.96 | 153.52 | 150.48 | 20.18
HybridWM | 133.12 | 135.2 | 141.68 | 0.42

Results
P1 (plen<4)

Algorithm | Throughput (Mbps) LLDOS 1 | Throughput (Mbps) LLDOS 2 | Throughput (Mbps) inside | MEM (MB)
AC | 420 | 420.4 | 448.72 | 0.17
AC_BM | 212 | 204.8 | 214 | 0.2
SBOM | 127.52 | 128.56 | 140.24 | 0.08

P2 (plen>=4)

Algorithm | Throughput (Mbps) LLDOS 1 | Throughput (Mbps) LLDOS 2 | Throughput (Mbps) inside | MEM (MB)
AC | 782.88 | 758.8 | 752 | 29.18
AC_BM | 281.12 | 277.92 | 270.48 | 20.04
WM | 420.56 | 417.68 | 404.8 | 0.42
RSI | 371.04 | 378.72 | 346.72 | 6.45
SBOM | 181.44 | 181.6 | 178.72 | 6.82

Results
P3 (plen<6)

Algorithm | Throughput (Mbps) LLDOS 1 | Throughput (Mbps) LLDOS 2 | Throughput (Mbps) inside | MEM (MB)
AC | 404.32 | 412 | 429.92 | 0.91
AC_BM | 184.56 | 183.36 | 183.44 | 0.85
SBOM | 56 | 56.96 | 61.28 | 0.25

P4 (plen>=6)

Algorithm | Throughput (Mbps) LLDOS 1 | Throughput (Mbps) LLDOS 2 | Throughput (Mbps) inside | MEM (MB)
AC | 756.48 | 758 | 755.6 | 28.44
AC_BM | 335.52 | 328.32 | 322.8 | 19.53
WM | 644.88 | 626.64 | 578.88 | 0.41
RSI | 552.72 | 533.12 | 505.52 | 5.35
SBOM | 386.48 | 383.2 | 377.12 | 9.21

Analysis
- AC performance does not decline much from P1 to P3
- WM performance increases greatly from P2 to P4
  The dividing pattern length could be larger than 6
- AC has the higher searching performance, even compared with WM in P2 and P4
  Many of the patterns in Snort share the same prefixes, which is bad for WM, especially when the matching rate is high

Test 3: AC & WM Hybrid
MIT DARPA IDS data set
- LLS-DOS2.02 (match rate: 59%)
- Mixed with background flows
  LLDOS30 (match rate: 34%)
  LLDOS10 (match rate: 18%)
Flexible pattern division
- Division line from 4 to 12

Results
Divide length | MEM (KB) | Throughput (Mbps): LLDOS2 (59%), LLDOS (34%), (18%) | AC, WM, total
4 | 127 426 424 212 535 637 291 671 1007 402
6 | 633 418 416 640 248 521 984 341 642 1633 461
8 | 1280 412 406 892 280 511 1368 372 623 2241 488
10 | 2441 1320 316 502 1909 398 612 2924 506
11 | 3080 397 384 296 497 2101 611 3246 514
12 | 3922 391 1496 302 501 2322 609 3551 520

Solution and Future Work

Alternatives
- AC only
- AC & WM hybrid

Future Work
AC (memory compression)
- Automata: NFA, banded, sparse, or another idea?
- Pattern set: subset division?
- ……
WM
- Same-prefix problem: dynamic cut?
- Worst-case problem: match signal?
- Performance improvement: intelligent verification?

Bibliography
- All the papers involved in this presentation have been uploaded to NSlab server 20
- Categorized by prefix, suffix and factor (done)
- There is also a document named "current" for new papers that appeared in recent high-ranked conferences such as INFOCOM, SIGCOMM, USENIX Security, CPM and so on (in progress)
- \\166.111.137.20\venus\文献资源\Zongwei\string matching algorithm

Other Useful Resources
Book
Websites
- Pattern Matching Pointer: http://www.cs.ucr.edu/~stelo/pattern.html
  Maintained by associate professor Stefano Lonardi of UC Riverside
- EXACT STRING MATCHING ALGORITHMS: http://www-igm.univ-mlv.fr/~lecroq/string/
  Description, complexity analysis, and C source code of many single string matching algorithms