Construction of Aho Corasick automaton in Linear time for Integer Alphabets Shiri Dori & Gad M. Landau University of Haifa.


1 Construction of Aho Corasick automaton in Linear time for Integer Alphabets Shiri Dori & Gad M. Landau University of Haifa

2 Overview
Classic Aho Corasick
Our algorithm
  – Goto Function
  – Failure Function
  – Combining the two
Queries in O(m log|Σ|)

3 Set Pattern Matching Problem
Find the patterns P = {P_1, P_2, ..., P_q} in a text T
Aho and Corasick solved it in '75
Generalized version of KMP
Uses a state machine

4 Aho Corasick - Example
P = {her, iris, he, is}
[Figure: trie of the patterns, with nodes ε; h, i; he, ir, is; her, iri; iris]

5 Aho Corasick - Example
P = {her, iris, he, is}
Travel along the Goto function, which is a trie of all patterns
If stuck, travel along a KMP-style Failure link
[Figure: trie of the patterns, with nodes ε; h, i; he, ir, is; her, iri; iris]

6 Aho Corasick - Example
P = {her, iris, he, is}
Travel along the Goto function, which is a trie of all patterns
If stuck, travel along a KMP-style Failure link
When a pattern is found, output it
[Figure: trie of the patterns, with nodes ε; h, i; he, ir, is; her, iri; iris]

7 Aho Corasick Definitions
Goto function: a trie of the patterns
Failure function: for each label, the longest proper suffix that is a prefix of a pattern
  – As in KMP, but a prefix of any pattern qualifies
Output function: the patterns ending at this label
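
The three definitions above can be sketched in a few lines. This is a minimal illustration of the classic construction (dictionary-based trie, breadth-first failure links), not the linear-time integer-alphabet construction the talk goes on to present. The names `build_automaton` and `search` are chosen here for illustration only.

```python
from collections import deque

def build_automaton(patterns):
    """Classic Aho-Corasick: trie (goto), BFS failure links, output sets."""
    goto = [{}]      # goto[state][char] -> next state
    fail = [0]       # failure link of each state (root is state 0)
    out = [set()]    # patterns that end at each state
    for p in patterns:
        s = 0
        for ch in p:
            if ch not in goto[s]:
                goto.append({})
                fail.append(0)
                out.append(set())
                goto[s][ch] = len(goto) - 1
            s = goto[s][ch]
        out[s].add(p)
    queue = deque(goto[0].values())   # depth-1 states fail to the root
    while queue:
        s = queue.popleft()
        for ch, t in goto[s].items():
            f = fail[s]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[t] = goto[f].get(ch, 0)
            out[t] |= out[fail[t]]    # inherit patterns ending at the failure state
            queue.append(t)
    return goto, fail, out

def search(text, automaton):
    """Return (start index, pattern) for every occurrence in text."""
    goto, fail, out = automaton
    s, hits = 0, []
    for i, ch in enumerate(text):
        while s and ch not in goto[s]:
            s = fail[s]               # stuck: follow failure links
        s = goto[s].get(ch, 0)
        for p in out[s]:
            hits.append((i - len(p) + 1, p))
    return hits
```

With P = {her, iris, he, is}, searching the text "irisher" reports iris at position 0, is at position 2, and both he and her at position 4.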

8 Classic Aho Corasick – Analysis
Construction in O(n), where n is the cumulative pattern length
Queries answered in O(m + k), for a text of length m with k pattern occurrences
... For constant alphabets only!
For integer alphabets, |Σ| = O(n^c), the bounds change depending on the branching method
  – List, Array or Search Tree
Recent developments suggest we can do better!
  – Farach-97; Kärkkäinen & Sanders-03;
  – Ko & Aluru-03; Kim, Sim, Park & Park-03

9 Our Results
Our algorithm achieves better results:
Construction in O(n) time, O(n) space
Query in O(m log|Σ|)
Works for integer alphabets, |Σ| = O(n^c)

10 Algorithm: Goto Function
Sort the patterns in time linear in their total length
  – By building the suffix array of S_p = $P_1$P_2$...$P_q$ and ignoring the non-pattern suffixes
  – Or by a two-pass radix sort, O(D + |Σ|) = O(n)
      – Paige & Tarjan, '87; Andersson & Nilsson, '94
Now create the trie in lexicographic order
Keep a list of children at each node; append each new node to the end of the list
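
A minimal sketch of the trie-building step described above, assuming the patterns are already sorted. Python's `sorted()` is used only as a stand-in: it is a comparison sort, not the linear-time suffix-array or radix sort the slide refers to. The name `build_sorted_trie` and the node layout are illustrative assumptions.

```python
def build_sorted_trie(patterns):
    """Build a trie whose child lists stay sorted by inserting the patterns
    in lexicographic order and only ever appending at a list's tail."""
    # Stand-in for the linear-time sort described on the slide.
    ordered = sorted(set(patterns))
    labels = [""]     # labels[v] = string spelled from the root to v (for readability)
    children = [[]]   # children[v] = list of (char, child id), in char order
    for p in ordered:
        v = 0
        for ch in p:
            kids = children[v]
            # Patterns arrive sorted, so an existing edge for ch can only be
            # the most recently appended one: check the tail, nothing else.
            if kids and kids[-1][0] == ch:
                v = kids[-1][1]
            else:
                labels.append(labels[v] + ch)
                children.append([])
                kids.append((ch, len(labels) - 1))
                v = len(labels) - 1
    return labels, children
```

Because only the tail of each list is inspected, every character of every pattern costs constant time once the patterns are sorted.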

11 Example – Goto Function
Sorting Patterns
P = {the, than, this, then}
P' = {than, the, then, this}

12 Example – Goto Function
P' = {than, the, then, this}
[Figure: the trie grows as than, the, then, this are inserted in order; final nodes ε, t, th, tha, the, than, then, thi, this]

13 Example – Goto Function
P' = {than, the, then, this}
Sorted list, keep the tail
[Figure: the trie with nodes ε, t, th, tha, the, than, then, thi, this; node th holds the sorted child list a, e, i]

14 Algorithm: Failure Function
We need to construct Failure links on the trie
The original algorithm did so by traversing the trie
We found a deep connection between:
  – Failure function of the patterns, and
  – Suffix Tree of the reversed patterns
      – Or Enhanced Suffix Array
      – Abouelhoda, Kurtz & Ohlebusch-04; Kim, Jeon & Park-04
We'll "learn by example"...

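15 Example – Failure Function
P = {he, her, iris, is}
P^R = {eh, reh, siri, si}
Failure function: "iris" → "is"
The reverses: "siri", "si"
"si" is a prefix of "siri" (and is marked with a $, so "is" is a prefix of a pattern)
[Figure: the trie (nodes ε; h, i; he, ir, is; her, iri; iris) beside the suffix tree of the reversed patterns, whose nodes, with the original string in parentheses, are ε, r (r), eh (he), reh (her), h (h), i (i), iri (iri), ri (ir), si (is), siri (iris); $ marks appear on the nodes; si (is) and siri (iris) are highlighted]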

16 Understanding Failure Function
Failure function is defined as: the largest suffix which is a prefix of any pattern
Reverse: "largest suffix" → "largest prefix"
  – Any prefix of a label will be its ancestor in the ST
  – Largest means nearest
"prefix of pattern" → "suffix of reversed pattern"
  – It will be a node in the ST, marked by a $
So: the closest ancestor which is marked by a $
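
The equivalence stated above can be checked by brute force on small inputs. The sketch below uses no suffix tree at all. It compares the direct definition with its reversed restatement, and the helper names are hypothetical.

```python
def prefixes_of(patterns):
    """All non-empty prefixes of all patterns, i.e. the trie node labels."""
    return {p[:i] for p in patterns for i in range(1, len(p) + 1)}

def failure_by_definition(label, prefixes):
    """Longest proper suffix of label that is a prefix of some pattern."""
    for start in range(1, len(label)):
        if label[start:] in prefixes:
            return label[start:]
    return ""   # no such suffix: the failure link goes to the root

def failure_by_reversal(label, prefixes):
    """Same answer on reversed strings: the longest proper prefix of
    reversed(label) whose reversal is a pattern prefix ('$-marked')."""
    marked = {p[::-1] for p in prefixes}
    rev = label[::-1]
    for end in range(len(rev) - 1, 0, -1):
        if rev[:end] in marked:
            return rev[:end][::-1]
    return ""
```

For P = {he, her, iris, is}, both functions map "iris" to "is" and "iri" to "i", and they agree on every other trie label as well.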

17 Algorithm: Failure Function
We found a deep connection between:
  – Failure function of the patterns, and
  – Suffix Tree of the reversed patterns
We define S_p = $P_1$P_2$...$P_q$
We define T^R to be the suffix tree of (S_p)^R
T^R can be built in linear time
  – Can use the Enhanced Suffix Array, E^R, instead
  – Note: T^R is a Generalized Suffix Tree
How will we link the trie and T^R?

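18 Example – 1-to-1 Mapping
Note: "r" doesn't get a link since it's not marked by a $
[Figure: the trie (nodes ε; h, i; he, ir, is; her, iri; iris) linked node by node to the suffix tree of the reversed patterns, whose nodes, with the original string in parentheses, are ε, r (r), eh (he), reh (her), h (h), i (i), iri (iri), ri (ir), si (is), siri (iris); $ marks appear on the nodes]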

19 Example – 1-to-1 Mapping
[Figure: the same trie and suffix tree as on the previous slide, highlighting the pairs si (is) / siri (iris) and i (i) / iri (iri)]

20 Algorithm: Review
Build Goto function (trie)
  – Sort patterns
  – Construct trie
Build Failure function
  – Construct T^R
  – Compute the proper ancestor for $-marked nodes
Combine information
  – Through the mapping, create Failure links on the trie

21 Adjustment for Integer Alphabet
We used recent developments (SA, ST)
Constructed Goto using a suffix array
Found a connection between the Failure function and suffix trees
Thus, reduced the construction to O(n)
Yet we manage to keep queries at O(m log|Σ|)
Again - how?

22 Queries in O(m log|Σ|)
We've built the trie in O(n)
  – But we have a sorted list
  – Search is compromised
Our simple solution…

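23 Example – Goto Function
P' = {than, the, then, this}
Array can be searched in log(#children)
[Figure: the trie with nodes ε, t, th, tha, the, than, then, thi, this; the sorted child list a, e, i of node th is converted to the array [a, e, i]]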

24 Queries in O(m log|Σ|)
Once the trie is complete
  – Convert the list in each node to an array
  – The array's size is known; O(n) space overall
  – Binary search can now be employed
Reduces the time spent in each node to log(#children) = O(log|Σ|)
Can be applied to a Suffix Tree built from Suffix Array + LCP
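
The list-to-array conversion can be sketched as follows, assuming each node's children are stored as a sorted list of (char, child id) pairs. That layout and the names `freeze_children` and `goto` are assumptions made for illustration; the slides do not prescribe a representation.

```python
from bisect import bisect_left

def freeze_children(children):
    """Turn each node's sorted (char, child) list into two parallel tuples.
    The total size equals the number of trie edges, so space stays O(n)."""
    return [(tuple(c for c, _ in kids), tuple(n for _, n in kids))
            for kids in children]

def goto(frozen, node, ch):
    """Binary-search the node's sorted characters; return child id or None."""
    chars, nodes = frozen[node]
    i = bisect_left(chars, ch)
    if i < len(chars) and chars[i] == ch:
        return nodes[i]
    return None
```

Each lookup costs O(log(#children)), which is at most O(log|Σ|), in line with the per-node bound on the slide.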

25 The End Thanks!

26 Algorithm: Combining the two
Build a 1-to-1 mapping between the $-marked nodes in T^R and the trie nodes
We compute the mapping through the string:
  – For each char in S_p, we keep its Goto node
  – For each suffix tree node, we know what indices it represents (in (S_p)^R, and so in S_p)
Now, build Failure links atop the trie
  – Like we saw in the example

27 Algorithm: Failure Function
For each node, find its "proper ancestor"
  – Closest ancestor marked with a $
  – Found with a simple preorder traversal
The properties of T^R ensure that...
  – For each failure link v_1 → v_2
  – And their corresponding nodes, u_1 and u_2
  – u_2 = proper ancestor of u_1
If we link the trie and T^R, we find the Failure!
How will we link them?
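
The proper-ancestor computation can be illustrated on a simplified stand-in for T^R. The sketch below builds a plain, uncompressed trie of the reversed pattern prefixes rather than a suffix tree, so it is not linear-time. It only shows how one preorder traversal yields the closest $-marked ancestor. The function name and data layout are hypothetical.

```python
def failure_via_reversed_trie(patterns):
    """Map each trie label to its failure label via a trie of reversed labels."""
    prefixes = {p[:i] for p in patterns for i in range(1, len(p) + 1)}
    # Stand-in for T^R: a plain trie holding the reverse of every pattern prefix.
    kids = [{}]        # kids[u][char] -> child node
    marked = [True]    # the root stands for the empty label
    for w in prefixes:
        u = 0
        for ch in reversed(w):
            if ch not in kids[u]:
                kids.append({})
                marked.append(False)
                kids[u][ch] = len(kids) - 1
            u = kids[u][ch]
        marked[u] = True   # '$-marked': the reverse of a pattern prefix ends here
    failure = {}
    # Preorder traversal carrying the label of the closest marked proper ancestor.
    stack = [(0, "", "")]  # (node, reversed label so far, proper ancestor's label)
    while stack:
        u, rev, anc = stack.pop()
        if marked[u] and u != 0:
            failure[rev[::-1]] = anc[::-1]
        nearest = rev if marked[u] else anc
        for ch, child in kids[u].items():
            stack.append((child, rev + ch, nearest))
    return failure
```

On P = {her, their, eye, iris, he, is}, the resulting links include iris → is, the → he, he → e, eye → e and their → ir.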

28 Example - automaton - Goto
P = {her, their, eye, iris, he, is}
Travel along the Goto function, which is a trie of all patterns
[Figure: trie of the patterns, with nodes ε; e, h, i, t; ey, he, ir, is, th; eye, her, iri, the; iris, thei; their]

29 Example - T^R
P = {her, their, eye, iris, he, is}
[Figure: T^R for these patterns; its nodes, with the original string in parentheses, are ε, e (e), r (r), t (t), ye (ey), eh (he), reh (her), eht (the), eye (eye), h (h), ht (th), i (i), iri (iri), ri (ir), rieht (their), si (is), siri (iris), ieht (thei); $ marks appear on the nodes]

30 Example - T^R and Failure
P = {her, their, eye, iris, he, is}
Failure links shown: iris → is, the → he → e, eye → e, their → ir
[Figure: the trie (nodes ε; e, h, i, t; ey, he, ir, th, is; eye, her, iri, the; iris, thei; their) beside T^R, whose nodes, with the original string in parentheses, are ε, e (e), r (r), t (t), ye (ey), eh (he), reh (her), eht (the), eye (eye), h (h), ht (th), i (i), iri (iri), ri (ir), rieht (their), si (is), siri (iris), ieht (thei); $ marks appear on the nodes]

31 T^R - Reversed Suffix Tree
We defined S_p = $P_1$P_2$...$P_q$
We define T^R to be the suffix tree of (S_p)^R
This tree has interesting properties:
  – Each trie node v is represented by exactly one T^R node u, so that Label(v) = Label(u)^R
  – In T^R, a node's label is a prefix of its child's label; read in the original direction, it is a suffix
  – A $-marked node in T^R means that the original label is a prefix of a pattern

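32 Example - T^R
P = {her, iris, he, is}
[Figure: T^R for these patterns; its nodes, with the original string in parentheses, are ε, e (e), r (r), eh (he), reh (her), h (h), i (i), iri (iri), ri (ir), si (is), siri (iris); $ marks appear on the nodes]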

33 Example - T^R and Failure
P = {her, their, eye, iris, he, is}
We took: Failure of "their" → "ir" (from "iris")
  – Largest suffix which is a prefix of a pattern
Their reversed strings are "rieht" and "ri"
  – The suffix is now a prefix... so it is an ancestor in a suffix tree!
To be a prefix of a pattern p_x, it should be a suffix of the reversed pattern (p_x)^R
  – So it will be in the suffix tree, and end with a $

