1 Efficient String Matching : An Aid to Bibliographic Search Alfred V. Aho and Margaret J. Corasick Bell Laboratories.

Presentation transcript:

1 Efficient String Matching : An Aid to Bibliographic Search Alfred V. Aho and Margaret J. Corasick Bell Laboratories

2 Virus Definition
Each virus has its own characteristic signature.
- Example in ClamAV: _0017_0001_000=21b c999cd218bd6b9030 0b440cd218b4c198b541bb80157cd21b43ecd2132ed
  - _0017_0001_000 is the virus index.
  - Hex(21)=Dec(33)='!'
- Match the signature against the data to detect the virus.

3 Regular Expression
Use a regular expression (RE) to describe the signature.
- ? can be any one char: W32.Hybris.C (Clam)=4000?????????????83??????75f2e9????ffff
- * can be any chars (including no char): Oror-fam (Clam)=495243* * f5455*4b617a61*536e f
- {n1-n2} means there are between n1 and n2 chars between two parts: Worm.Bagle.AG-empty (Clam)=6e74656e742d a c f6 e2f6f d d3b{40-130}2d2d2d2d2d2d2d2d

4 Introduction
Locate all occurrences of any of a finite number of keywords in a string of text. The approach consists of two parts:
- constructing a finite state pattern matching machine from the keywords;
- using the pattern matching machine to process the text string in a single pass.

5 Pattern Matching Machine(1)
Our problem is to locate and identify all substrings of x which are keywords in K.
- K: let K = {y_1, y_2, …, y_k} be a finite set of strings, which we shall call keywords.
- x: an arbitrary string, which we shall call the text string.
The behavior of the pattern matching machine is dictated by three functions: a goto function g, a failure function f, and an output function output.

6 Pattern Matching Machine(2)
- g(s,a) = s' or fail: maps a pair consisting of a state and an input symbol into a state or the message fail.
- f(s) = s': maps a state into a state, and is consulted whenever the goto function reports fail.
- output(s) = keywords: associates a set of keywords (possibly empty) with every state.

7 Pattern Matching Machine Example with keywords {he,she,his,hers}

8

9 Start state is state 0. Let s be the current state and a the current symbol of the input string x.
Operating cycle:
- If g(s,a)=s', the machine makes a goto transition: it enters state s', and the next symbol of x becomes the current input symbol.
- If g(s,a)=fail, the machine makes a failure transition: if f(s)=s', it repeats the cycle with s' as the current state and a as the current input symbol.

10 Example
Text: u s h e r s
State:
In state 4, since g(4,e)=5, the machine enters state 5, finds the keywords "she" and "he" ending at position four of the text string, and emits output(5).

11 Example Cont’d In state 5 on input symbol r, the machine makes two state transitions in its operating cycle. Since g(5,r)=fail, M enters state 2=f(5). Then since g(2,r)=8, M enters state 8 and advances to the next input symbol. No output is generated in this operating cycle.

12 Algorithm 1. Pattern matching machine.
Input. A text string x = a_1 a_2 … a_n, where each a_i is an input symbol, and a pattern matching machine M with goto function g, failure function f, and output function output, as described above.
Output. Locations at which keywords occur in x.
Method.
    begin
        state ← 0
        for i ← 1 until n do
            begin
                while g(state, a_i) = fail do state ← f(state)
                state ← g(state, a_i)
                if output(state) ≠ empty then
                    begin
                        print i
                        print output(state)
                    end
            end
    end
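Algorithm 1 can be sketched in Python as follows. This is a minimal illustration, not the authors' code: the tables are hand-written for the running keyword set {he, she, his, hers}, the state numbers are assumed to match the slides' example (g(4,e)=5, f(5)=2, g(2,r)=8), and the names GOTO, FAILURE, OUTPUT, g and search are made up for this sketch.

```python
# Hand-written tables for {he, she, his, hers}. State numbering is assumed
# to follow the slides' running example; all names here are illustrative.
FAIL = None  # stands in for the "fail" message

GOTO = {
    (0, "h"): 1, (0, "s"): 3,
    (1, "e"): 2, (1, "i"): 6,
    (2, "r"): 8,
    (3, "h"): 4,
    (4, "e"): 5,
    (6, "s"): 7,
    (8, "s"): 9,
}
FAILURE = {1: 0, 2: 0, 3: 0, 4: 1, 5: 2, 6: 0, 7: 3, 8: 0, 9: 3}
OUTPUT = {2: {"he"}, 5: {"she", "he"}, 7: {"his"}, 9: {"hers"}}


def g(state, symbol):
    """Goto function; state 0 loops on symbols that begin no keyword."""
    nxt = GOTO.get((state, symbol), FAIL)
    if nxt is FAIL and state == 0:
        return 0
    return nxt


def search(text):
    """Algorithm 1: list of (end position, keywords), positions 1-based."""
    hits = []
    state = 0
    for i, symbol in enumerate(text, start=1):
        # Follow failure links until a goto transition is defined.
        while g(state, symbol) is FAIL:
            state = FAILURE[state]
        state = g(state, symbol)
        if OUTPUT.get(state):
            hits.append((i, sorted(OUTPUT[state])))
    return hits


print(search("ushers"))
```

On the text "ushers" the sketch reports "he" and "she" ending at position 4 and "hers" ending at position 6, which agrees with the worked example on the preceding slides.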

13 Construction of the functions
The construction has two parts:
- First: determine the states and the goto function.
- Second: compute the failure function.
- The computation of the output function starts in the first part and is completed in the second.

14 Construction of Goto function
Construct a goto graph like the one on the next page. New vertices and edges are added to the graph one keyword at a time, starting at the start state. Add new edges only when necessary. Finally, add a loop from state 0 to state 0 on every input symbol that does not begin a keyword.

15 Construction of Goto function with keywords {he,she,his,hers}

16 Algorithm 2. Construction of the goto function.
Input. Set of keywords K = {y_1, y_2, …, y_k}.
Output. Goto function g and a partially computed output function output.
Method. We assume output(s) is empty when state s is first created, and g(s, a) = fail if a is undefined or if g(s, a) has not yet been defined. The procedure enter(y) inserts into the goto graph a path that spells out y.
    begin
        newstate ← 0
        for i ← 1 until k do enter(y_i)
        for all a such that g(0, a) = fail do g(0, a) ← 0
    end

17 procedure enter(a_1 a_2 … a_m):
    begin
        state ← 0; j ← 1
        while g(state, a_j) ≠ fail do
            begin
                state ← g(state, a_j)
                j ← j + 1
            end
        for p ← j until m do
            begin
                newstate ← newstate + 1
                g(state, a_p) ← newstate
                state ← newstate
            end
        output(state) ← {a_1 a_2 … a_m}
    end
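A possible Python rendering of Algorithm 2 and the procedure enter is given below. It is a sketch under stated assumptions: the list-of-dicts representation and the name build_goto are choices made here, a missing dictionary key plays the role of fail, and the loop on state 0 is left implicit (callers treat a missing symbol at state 0 as a transition to 0). An explicit bound check on j is also added, which the pseudocode leaves implicit.

```python
def build_goto(keywords):
    """Algorithm 2 (sketch): build the goto graph and partial output sets."""
    goto = [{}]        # goto[state][symbol] -> next state; missing key = fail
    output = [set()]   # output[state] -> keywords spelled out exactly at state

    def enter(keyword):
        state, j = 0, 0
        # Follow the existing path as long as it matches the keyword.
        # The bound check on j is an addition for safety.
        while j < len(keyword) and keyword[j] in goto[state]:
            state = goto[state][keyword[j]]
            j += 1
        # Create new states only for the unmatched remainder.
        for symbol in keyword[j:]:
            goto.append({})
            output.append(set())
            goto[state][symbol] = len(goto) - 1
            state = len(goto) - 1
        output[state].add(keyword)

    for keyword in keywords:
        enter(keyword)
    return goto, output


goto, output = build_goto(["he", "she", "his", "hers"])
print(len(goto), goto[0], output[5])
```

Entering the keywords in the order he, she, his, hers creates the states in the same order as the slides' example, so state 5 spells "she" and state 9 spells "hers". At this stage output(5) holds only "she"; "he" is added later by the failure-function construction.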

18 Construction of Failure function Depth of s : the length of the shortest path from the start state to state s. The states of depth d can be determined from the states of depth d-1. Make f(s)=0 for all states s of depth 1.

19 Construction of Failure function Cont'd
To compute the failure function for the states of depth d, consider each state r of depth d-1:
1. If g(r,a)=fail for all a, do nothing.
2. Otherwise, for each a such that g(r,a)=s, do the following:
   a. Set state=f(r).
   b. Execute state ← f(state) zero or more times, until a value for state is obtained such that g(state,a) ≠ fail.
   c. Set f(s)=g(state,a).

20 Algorithm 3. Construction of the failure function.
Input. Goto function g and output function output from Algorithm 2.
Output. Failure function f and output function output.
Method.
    begin
        queue ← empty
        for each a such that g(0, a) = s ≠ 0 do
            begin
                queue ← queue ∪ {s}
                f(s) ← 0
            end

21 Algorithm 3 (continued)
        while queue ≠ empty do
            begin
                let r be the next state in queue
                queue ← queue - {r}
                for each a such that g(r, a) = s ≠ fail do
                    begin
                        queue ← queue ∪ {s}
                        state ← f(r)
                        while g(state, a) = fail do state ← f(state)
                        f(s) ← g(state, a)
                        output(s) ← output(s) ∪ output(f(s))
                    end
            end
    end
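The breadth-first computation of Algorithm 3 might look like this in Python. The helper build_trie is only a compact stand-in for Algorithm 2 so that the example runs on its own; both function names and the list-based representation are assumptions of this sketch, and a missing dictionary key again plays the role of fail.

```python
from collections import deque


def build_trie(keywords):
    """Compact stand-in for Algorithm 2 (goto graph and partial outputs)."""
    goto, output = [{}], [set()]
    for keyword in keywords:
        state = 0
        for symbol in keyword:
            if symbol not in goto[state]:
                goto.append({})
                output.append(set())
                goto[state][symbol] = len(goto) - 1
            state = goto[state][symbol]
        output[state].add(keyword)
    return goto, output


def build_failure(goto, output):
    """Algorithm 3 (sketch): compute f breadth-first and complete output."""
    failure = [0] * len(goto)
    queue = deque(goto[0].values())   # depth-1 states keep f(s) = 0
    while queue:
        r = queue.popleft()
        for symbol, s in goto[r].items():
            queue.append(s)
            state = failure[r]
            # g(0, a) never fails, so the walk stops at state 0 at the latest.
            while state != 0 and symbol not in goto[state]:
                state = failure[state]
            failure[s] = goto[state].get(symbol, 0)
            # Merge in the keywords recognised at the failure state.
            output[s] |= output[failure[s]]
    return failure


goto, output = build_trie(["he", "she", "his", "hers"])
failure = build_failure(goto, output)
print(failure, output[5])
```

For the running example this yields f(4)=1, f(5)=2 and f(7)=f(9)=3, and output(5) grows from {she} to {she, he}. Python sets are used for the output sets here for brevity; the union is therefore not constant time, unlike the linked-list representation mentioned in the complexity discussion.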

22 About construction
When we determine f(s)=s', we merge the outputs of state s with the outputs of state s'. The failure function can cause unnecessary transitions: if the keyword "his" were not present, the machine could go directly from state 4 to state 0, skipping an unnecessary intermediate transition to state 1. To avoid such transitions, we can use a deterministic finite automaton, which is discussed later.

23 Properties of Algorithms 1, 2, 3
Lemma 1: Suppose that in the goto graph state s is represented by the string u and state t is represented by the string v. Then f(s) = t iff v is the longest proper suffix of u that is also a prefix of some keyword.
Proof:
- Suppose u = a_1 a_2 … a_j, and a_1 a_2 … a_{j-1} represents state r. Let r_1, r_2, …, r_n be the sequence of states such that:
  1. r_1 = f(r);
  2. r_{i+1} = f(r_i);
  3. g(r_i, a_j) = fail for 1 ≤ i < n;
  4. g(r_n, a_j) = t.
- Suppose v_i represents state r_i. Then v_1 is the longest proper suffix of a_1 a_2 … a_{j-1} that is a prefix of some keyword; v_2 is the longest proper suffix of v_1 that is a prefix of some keyword, and so on.
- Thus v_n is the longest suffix of a_1 a_2 … a_{j-1} such that v_n a_j is a prefix of some keyword.

24 Properties of Algorithms 1, 2, 3
Lemma 2: The set output(s) contains y if and only if y is a keyword that is a suffix of the string representing state s.
Proof (by induction on the depth of s, where u is the string representing s):
- Consider a string y in output(s).
- If y is added to output(s) by Algorithm 2, then y = u and y is a keyword.
- If y is added to output(s) by Algorithm 3, then y is in output(f(s)). Conversely, if y is a keyword that is a proper suffix of u, then from the inductive hypothesis and Lemma 1 we know output(f(s)) contains y.

25 Properties of Algorithms 1, 2, 3
Lemma 3: After the jth operating cycle, Algorithm 1 will be in state s iff s is represented by the longest suffix of a_1 a_2 … a_j that is a prefix of some keyword.
- Proof: Similar to Lemma 1.
THEOREM 1: Algorithms 2 and 3 produce valid goto, failure, and output functions.
- Proof: By Lemmas 2 and 3.

26 Time Complexity of Algorithms 1, 2, and 3
THEOREM 2: Using the goto, failure and output functions created by Algorithms 2 and 3, Algorithm 1 makes fewer than 2n state transitions in processing a text string of length n.
- From a state s of depth d, Algorithm 1 makes at most d failure transitions in one operating cycle.
- Each goto transition increases the depth by at most one and each failure transition decreases it by at least one, so the total number of failure transitions is at least one less than the total number of goto transitions.
- In processing an input of length n, Algorithm 1 makes exactly n goto transitions. Therefore the total number of state transitions is less than 2n.
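The bound of Theorem 2 can be observed empirically by counting transitions. The sketch below is illustrative only: build_machine is a compact stand-in for Algorithms 2 and 3 (output sets omitted, since only transitions are counted), and count_transitions is a hypothetical helper name.

```python
from collections import deque


def build_machine(keywords):
    """Compact stand-in for Algorithms 2 and 3: goto trie and failure links."""
    goto, failure = [{}], [0]
    for keyword in keywords:
        state = 0
        for symbol in keyword:
            if symbol not in goto[state]:
                goto.append({})
                failure.append(0)
                goto[state][symbol] = len(goto) - 1
            state = goto[state][symbol]
    queue = deque(goto[0].values())
    while queue:
        r = queue.popleft()
        for symbol, s in goto[r].items():
            queue.append(s)
            state = failure[r]
            while state != 0 and symbol not in goto[state]:
                state = failure[state]
            failure[s] = goto[state].get(symbol, 0)
    return goto, failure


def count_transitions(keywords, text):
    """Run the matching loop, returning (goto moves, failure moves)."""
    goto, failure = build_machine(keywords)
    state = goto_moves = failure_moves = 0
    for symbol in text:
        while state != 0 and symbol not in goto[state]:
            state = failure[state]
            failure_moves += 1
        state = goto[state].get(symbol, 0)   # the loop on state 0 counts as a goto
        goto_moves += 1
    return goto_moves, failure_moves


print(count_transitions(["he", "she", "his", "hers"], "ushers"))
```

For the text "ushers" the count is six goto transitions and one failure transition (from state 5 to state 2), seven in total against the bound 2n = 12.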

27 Time Complexity of Algorithms 1, 2, and 3
THEOREM 3: Algorithm 2 requires time linearly proportional to the sum of the lengths of the keywords.
Proof:
- Straightforward.
THEOREM 4: Algorithm 3 can be implemented to run in time proportional to the sum of the lengths of the keywords.
Proof:
- The total number of executions of state ← f(state) is bounded by the sum of the lengths of the keywords.
- Using linked lists to represent the output set of a state, we can execute the statement output(s) ← output(s) ∪ output(f(s)) in constant time.



30 Eliminating Failure Transitions
In Algorithm 1, replace the goto function with a next move function δ such that, for each state s and input symbol a, δ(s, a) is a state. By using the next move function δ, we can dispense with all failure transitions and make exactly one state transition per input character.

31 Algorithm 4. Construction of a deterministic finite automaton.
Input. Goto function g from Algorithm 2 and failure function f from Algorithm 3.
Output. Next move function δ.
Method.
    begin
        queue ← empty
        for each symbol a do
            begin
                δ(0, a) ← g(0, a)
                if g(0, a) ≠ 0 then queue ← queue ∪ {g(0, a)}
            end
        while queue ≠ empty do
            begin
                let r be the next state in queue
                queue ← queue - {r}
                for each symbol a do
                    if g(r, a) = s ≠ fail then
                        begin
                            queue ← queue ∪ {s}
                            δ(r, a) ← s
                        end
                    else δ(r, a) ← δ(f(r), a)
            end
    end
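A Python sketch of Algorithm 4 follows. To keep the example self-contained it departs from the slide in one respect, which is a shortcut of this sketch and not part of Algorithm 4 as stated: instead of taking f from Algorithm 3 as input, it derives f(s) = δ(f(r), a) during the same breadth-first pass. The name build_dfa and the explicit alphabet parameter are assumptions; the alphabet must contain every symbol used in the keywords.

```python
from collections import deque


def build_dfa(keywords, alphabet):
    """Algorithm 4 (sketch): next move function delta[state][symbol]."""
    # Goto graph (compact stand-in for Algorithm 2); missing key = fail.
    goto = [{}]
    for keyword in keywords:
        state = 0
        for symbol in keyword:
            if symbol not in goto[state]:
                goto.append({})
                goto[state][symbol] = len(goto) - 1
            state = goto[state][symbol]

    symbols = sorted(set(alphabet))
    failure = [0] * len(goto)
    delta = [{} for _ in goto]
    queue = deque()
    for a in symbols:
        delta[0][a] = goto[0].get(a, 0)      # state 0 loops instead of failing
        if delta[0][a] != 0:
            queue.append(delta[0][a])
    while queue:
        r = queue.popleft()
        for a in symbols:
            if a in goto[r]:
                s = goto[r][a]
                queue.append(s)
                delta[r][a] = s
                # Shortcut (not on the slide): f(s) equals delta(f(r), a).
                failure[s] = delta[failure[r]][a]
            else:
                delta[r][a] = delta[failure[r]][a]
    return delta


delta = build_dfa(["he", "she", "his", "hers"], "ehirsu")
state = 0
for symbol in "ushers":
    state = delta[state][symbol]   # exactly one transition per character
print(state, delta[4])
```

Because states are processed in breadth-first order, δ(f(r), ·) is already complete when r is dequeued, so each table row is filled in a single pass. For the running example, the row for state 4 matches the next move table of Fig. 3: e leads to 5, i to 6, h to 1, s to 3, and every other symbol to 0.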

32 Fig. 3. Next move function. A dot stands for any input symbol not listed for that state.
state(s)     input symbol    next state
0            h               1
             s               3
             .               0
1            e               2
             i               6
             h               1
             s               3
             .               0
3, 7, 9      h               4
             s               3
             .               0
2, 5         r               8
             h               1
             s               3
             .               0
6            s               7
             h               1
             .               0
4            e               5
             i               6
             h               1
             s               3
             .               0
8            s               9
             h               1
             .               0

33 Conclusion
The approach is attractive for large numbers of keywords, since all keywords can be matched simultaneously in one pass.
Using the next move function:
- can potentially reduce the number of state transitions by 50%, at the cost of more memory;
- in practice, the machine spends most of its time in state 0, from which there are no failure transitions.
