1
Construction of Aho-Corasick Automaton in Linear Time for Integer Alphabets
Shiri Dori & Gad M. Landau, University of Haifa
2
Overview
Classic Aho-Corasick
Our algorithm
– Goto function
– Failure function
– Combining the two
Queries in O(m log|Σ|)
3
Set Pattern Matching Problem
Find all occurrences of the patterns P = {P_1, P_2, ..., P_q} in a text T
Aho and Corasick solved it in '75
Generalized version of KMP
Uses a state machine
4
Aho-Corasick - Example
P = {her, iris, he, is}
[Figure: trie of the patterns, with nodes h, i, he, ir, is, her, iri, iris]
5
Aho-Corasick - Example
P = {her, iris, he, is}
[Figure: trie of the patterns, with nodes h, i, he, ir, is, her, iri, iris]
Travel along the Goto function, which is a trie of all patterns
If stuck, travel along a KMP-style Failure link
6
Aho-Corasick - Example
P = {her, iris, he, is}
[Figure: trie of the patterns, with nodes h, i, he, ir, is, her, iri, iris]
Travel along the Goto function, which is a trie of all patterns
If stuck, travel along a KMP-style Failure link
When a pattern is found, output it
7
Aho-Corasick Definitions
Goto function: a trie of the patterns
Failure function: for each label, the longest proper suffix that is a prefix of a pattern
– As in KMP, but a prefix of any pattern qualifies
Output function: the patterns ending at this label
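The three functions can be sketched in a few lines of Python. This is the classic construction (a breadth-first pass over the trie), not the linear-time integer-alphabet construction the talk proposes. The names `build_automaton` and `search` are illustrative, and the dictionary-based branching assumes hashable characters.

```python
from collections import deque

def build_automaton(patterns):
    """Classic Aho-Corasick: Goto (trie), Failure and Output functions."""
    goto = [{}]   # goto[state][char] -> next state; state 0 is the root
    fail = [0]    # fail[state] -> state of the longest proper suffix in the trie
    out = [[]]    # out[state] -> patterns that end at this state
    for p in patterns:
        s = 0
        for ch in p:
            if ch not in goto[s]:
                goto.append({})
                fail.append(0)
                out.append([])
                goto[s][ch] = len(goto) - 1
            s = goto[s][ch]
        out[s].append(p)
    # Breadth-first traversal: depth-1 states keep fail = 0.
    queue = deque(goto[0].values())
    while queue:
        s = queue.popleft()
        for ch, t in goto[s].items():
            f = fail[s]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[t] = goto[f].get(ch, 0)
            out[t] = out[t] + out[fail[t]]  # inherit shorter matches
            queue.append(t)
    return goto, fail, out

def search(text, goto, fail, out):
    """Return (start_index, pattern) for every occurrence in text."""
    s, hits = 0, []
    for i, ch in enumerate(text):
        while s and ch not in goto[s]:
            s = fail[s]          # stuck: follow Failure links
        s = goto[s].get(ch, 0)   # travel along the Goto function
        for p in out[s]:
            hits.append((i - len(p) + 1, p))
    return hits
```

With P = {her, iris, he, is}, searching the text "hiris" reports "iris" at index 1 and "is" at index 3.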
8
Classic Aho-Corasick – Analysis
Constructed in O(n) time (n = cumulative pattern length)
Answers queries in O(m + k) (m = text length, k = number of occurrences)...
For constant alphabets only!
For integer alphabets, |Σ| = O(n^c), the bounds change depending on the branching method
– List, array or search tree
Recent developments inspire better results!
– Farach-97; Kärkkäinen & Sanders-03;
– Ko & Aluru-03; Kim, Sim, Park & Park-03
9
Our Results
Our algorithm achieves better results:
Construction in O(n) time, O(n) space
Query in O(m log|Σ|)
Works for integer alphabets, |Σ| = O(n^c)
10
Algorithm: Goto Function
Sort the patterns in time linear in their total length
– By building the suffix array of S_p = $P_1$P_2$...$P_q$ and ignoring the non-pattern suffixes
– Or by a two-pass radix sort, O(D + Σ) = O(n) (Paige & Tarjan, '87; Andersson & Nilsson, '94)
Now create the trie in lexicographic order
Hold a list of sons at each node; insert each new node at the end of the list
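A minimal sketch of the trie step, assuming the patterns arrive already sorted. Because of the sorted order, a new pattern can only continue along the most recently added child of each node, so only the tail of each child list is ever inspected. `Node` and `build_trie_sorted` are hypothetical names. In the usage below Python's built-in `sorted` stands in for the linear-time sort on the slide and is not itself linear.

```python
class Node:
    def __init__(self):
        self.chars = []        # edge characters, in increasing order
        self.children = []     # children[i] is reached by chars[i]
        self.is_pattern = False

def build_trie_sorted(sorted_patterns):
    """Build the Goto trie from lexicographically sorted patterns."""
    root = Node()
    for p in sorted_patterns:
        node = root
        for ch in p:
            if node.chars and node.chars[-1] == ch:
                # Shared prefix with the previous pattern: follow the tail.
                node = node.children[-1]
            else:
                # Sorted input guarantees ch is larger than every existing
                # edge character here, so appending keeps the list sorted.
                child = Node()
                node.chars.append(ch)
                node.children.append(child)
                node = child
        node.is_pattern = True
    return root

root = build_trie_sorted(sorted(["the", "than", "this", "then"]))
th = root.children[0].children[0]
print(th.chars)  # children of node "th", in sorted order
```

Each character of each pattern costs one comparison against the tail, so the loop does O(n) work regardless of alphabet size.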
11
Example – Goto Function
Sorting patterns:
P = {the, than, this, then}
P' = {than, the, then, this}
12
Example – Goto Function
P' = {than, the, then, this}
[Figure: the trie after inserting than, the, then, this in sorted order; nodes t, th, tha, the, than, then, thi, this]
13
Example – Goto Function
P' = {than, the, then, this}
[Figure: the same trie; node th holds its children a, e, i in a sorted list and keeps the tail]
14
Algorithm: Failure Function
We need to construct Failure links on the trie
The original algorithm does this by traversing the trie
We found a deep connection between:
– the Failure function of the patterns, and
– the Suffix Tree of the reversed patterns
Or an Enhanced Suffix Array (Abouelhoda, Kurtz & Ohlebusch-04; Kim, Jeon & Park-04)
We'll “learn by example”...
15
Example – Failure Function
P = {he, her, iris, is}
P^R = {eh, reh, siri, si}
Failure function: “iris” → “is”
The reverses: “siri”, “si”
“si” is a prefix of “siri” (and is followed by a $, so “is” is a prefix of a pattern)
[Figure: the trie (nodes h, i, he, ir, is, her, iri, iris) next to the suffix tree of the reversed patterns, whose nodes are r (r), eh (he), reh (her), h (h), i (i), iri (iri), ri (ir), si (is), siri (iris), original labels in parentheses, with $ marks; si (is) and siri (iris) are highlighted]
16
Understanding Failure Function
Failure function is defined as: the longest proper suffix which is a prefix of any pattern
Reverse the strings: “longest suffix” → “longest prefix”
– Any prefix of a label will be its ancestor in the ST
– Longest means nearest
“prefix of a pattern” → “suffix of a pattern”
– It will be a node in the ST, marked by a $
So: the closest ancestor which is marked by a $
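The equivalence can be checked by brute force on small inputs. The sketch below is quadratic and uses no suffix tree. It only compares the definition (longest proper suffix that is a prefix of a pattern) with the reversed formulation (longest proper prefix of the reversed label that is itself a reversed trie label, which is what a $-marked ancestor represents). All function names are illustrative.

```python
def prefixes(patterns):
    """All non-empty prefixes of the patterns, i.e. the trie labels."""
    return {p[:i] for p in patterns for i in range(1, len(p) + 1)}

def failure_by_definition(label, labels):
    # Try proper suffixes from longest to shortest.
    for k in range(1, len(label)):
        if label[k:] in labels:
            return label[k:]
    return ""  # no proper suffix qualifies: the link goes to the root

def failure_by_reversal(label, labels):
    marked = {x[::-1] for x in labels}  # strings a $-marked node can carry
    rev = label[::-1]
    # Try proper prefixes of the reversed label from longest to shortest;
    # the longest one corresponds to the nearest marked ancestor.
    for k in range(len(rev) - 1, 0, -1):
        if rev[:k] in marked:
            return rev[:k][::-1]
    return ""

labels = prefixes(["her", "their", "eye", "iris", "he", "is"])
for label in sorted(labels):
    assert failure_by_definition(label, labels) == failure_by_reversal(label, labels)
print(failure_by_definition("their", labels))  # ir
```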
17
Algorithm: Failure Function
We found a deep connection between:
– the Failure function of the patterns, and
– the Suffix Tree of the reversed patterns
We define S_p = $P_1$P_2$...$P_q$
We define T^R to be the suffix tree of (S_p)^R
T^R can be built in linear time
– Can use an Enhanced Suffix Array, E^R, instead
– Note: T^R is a Generalized Suffix Tree
How will we link the trie and T^R?
18
Example – 1-to-1 Mapping
[Figure: the T^R nodes r (r), eh (he), reh (her), h (h), i (i), iri (iri), ri (ir), si (is), siri (iris), with $ marks, mapped to the trie nodes h, i, he, ir, is, her, iri, iris]
Note: “r” doesn't get a link since it's not marked by a $
19
Example – 1-to-1 Mapping
[Figure: the same T^R and trie; the pairs si (is), siri (iris) and i (i), iri (iri) are highlighted, each a $-marked node with its $-marked ancestor]
20
Algorithm: Review
Build the Goto function (trie)
– Sort the patterns
– Construct the trie
Build the Failure function
– Construct T^R
– Compute the proper ancestor for $-marked nodes
Combine the information
– Through the mapping, create Failure links on the trie
21
Adjustment for Integer Alphabet
We used recent developments (SA, ST)
Constructed Goto using a suffix array
Found a connection between the Failure function and suffix trees
Thus, reduced the construction to O(n)
Yet we manage to keep queries at O(m log|Σ|)
Again - how?
22
Queries in O(m log|Σ|)
We've built the trie in O(n)
– But each node holds its children in a sorted list
– Searching a list is sequential, so search is compromised
Our simple solution…
23
Example – Goto Function
P' = {than, the, then, this}
[Figure: the trie with nodes t, th, tha, the, than, then, thi, this; the children a, e, i of node th are moved from a list into an array]
An array can be searched in log(#children)
24
Queries in O(m log|Σ|)
Once the trie is complete
– Convert the list in each node to an array
– Each array's size is known; O(n) space overall
– Binary search can now be employed
Reduces the time spent in each node to log(#children) = O(log|Σ|)
Can be applied to a Suffix Tree built from Suffix Array + LCP
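The conversion can be sketched as follows. Python lists are already arrays, so the child list is modelled here as an explicit singly linked list with a tail pointer. `ListNode`, `ArrayNode`, `to_arrays` and `goto` are hypothetical names. The recursion in `to_arrays` is as deep as the longest pattern, which is acceptable for a sketch but would need an explicit stack for very long patterns.

```python
from bisect import bisect_left

class ListNode:
    """Trie node whose children sit in a singly linked list."""
    def __init__(self):
        self.head = None  # first [char, child, next] cell
        self.tail = None  # last cell, so appending takes O(1)

    def append(self, ch, child):
        cell = [ch, child, None]
        if self.tail is None:
            self.head = cell
        else:
            self.tail[2] = cell
        self.tail = cell

class ArrayNode:
    """Trie node with its children in parallel sorted arrays."""
    def __init__(self, chars, children):
        self.chars = chars
        self.children = children

def to_arrays(node):
    """Copy each linked child list into arrays; one pass over the trie."""
    chars, kids = [], []
    cell = node.head
    while cell is not None:
        chars.append(cell[0])
        kids.append(to_arrays(cell[1]))
        cell = cell[2]
    return ArrayNode(chars, kids)

def goto(node, ch):
    """Binary search among the children: O(log #children) per step."""
    i = bisect_left(node.chars, ch)
    if i < len(node.chars) and node.chars[i] == ch:
        return node.children[i]
    return None  # no Goto edge; the caller would follow a Failure link

th = ListNode()
for ch in "aei":          # children arrive in sorted order
    th.append(ch, ListNode())
arr = to_arrays(th)
print(goto(arr, "e") is arr.children[1])  # True
```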
25
The End Thanks!
26
Algorithm: Combining the two
Build a 1-to-1 mapping between $-marked nodes in T^R and trie nodes
We compute the mapping through the string:
– For each char in S_p, we keep its Goto node
– For each suffix tree node, we know what indices it represents (in (S_p)^R, and so in S_p)
Now, build Failure links atop the trie
– Like we saw in the example
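One way to read "mapping through the string" is sketched below, under the assumption that no pattern contains the separator `$`. While the trie is built, the Goto node reached at every position of S_p is recorded. A T^R label that starts at index j of (S_p)^R starts at the last character of the corresponding trie label, which sits at index |S_p| - 1 - j of S_p. The function names are illustrative, and the `labels` list is kept only so the result can be inspected (it costs more than linear space).

```python
def goto_node_per_char(patterns):
    """Build the trie and record the Goto node of every char in S_p."""
    s_p = "$" + "$".join(patterns) + "$"
    children = [{}]   # children[node][char] -> node id; node 0 is the root
    labels = [""]     # labels[node], for inspection only
    node_at = [None] * len(s_p)
    node = 0
    for pos, ch in enumerate(s_p):
        if ch == "$":
            node = 0  # the next pattern starts again at the root
            continue
        if ch not in children[node]:
            children.append({})
            labels.append(labels[node] + ch)
            children[node][ch] = len(children) - 1
        node = children[node][ch]
        node_at[pos] = node
    return s_p, node_at, labels

def trie_node_for(j, s_p, node_at):
    """Trie node for the $-terminated label starting at index j of reversed S_p."""
    return node_at[len(s_p) - 1 - j]

s_p, node_at, labels = goto_node_per_char(["her", "iris", "he", "is"])
rev = s_p[::-1]               # "$si$eh$siri$reh$"
j = rev.index("siri")         # where the T^R label "siri" starts
print(labels[trie_node_for(j, s_p, node_at)])  # iris
```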
27
Algorithm: Failure Function
For each node, find its “proper ancestor”
– The closest ancestor marked with a $
– Found with a simple preorder traversal
The properties of T^R ensure that...
– For each failure link v_1 → v_2
– And their corresponding nodes, u_1 and u_2
– u_2 = proper ancestor of u_1
If we link the trie and T^R, we find the Failure function!
How will we link them?
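The preorder pass can be sketched on any rooted tree. The tree below is a hand-built toy that mirrors the node labels of the {her, iris, he, is} example. Its exact shape is an assumption made for illustration, not the output of a suffix tree construction, and `nearest_marked_ancestors` is a hypothetical name.

```python
def nearest_marked_ancestors(children, marked, root):
    """Map every node to its closest $-marked proper ancestor (or None)."""
    result = {}
    stack = [(root, None)]  # (node, nearest marked proper ancestor so far)
    while stack:
        node, ancestor = stack.pop()
        result[node] = ancestor
        # What the children see: this node if it is marked, else what we saw.
        below = node if node in marked else ancestor
        for child in reversed(children.get(node, [])):
            stack.append((child, below))
    return result

# Toy stand-in for T^R: each node's label is a prefix of its children's labels.
children = {
    "": ["e", "h", "i", "r", "si"],
    "e": ["eh"],
    "i": ["iri"],
    "r": ["reh", "ri"],
    "si": ["siri"],
}
# $-marked nodes: reversed labels that are prefixes of a pattern.
marked = {"eh", "reh", "h", "i", "iri", "ri", "si", "siri"}

proper = nearest_marked_ancestors(children, marked, "")
print(proper["siri"])  # si, i.e. the Failure link iris -> is
```

Each node is pushed and popped once, so the pass is linear in the size of the tree.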
28
Example - automaton - Goto
P = {her, their, eye, iris, he, is}
Travel along the Goto function, which is a trie of all patterns
[Figure: trie with nodes e, h, i, t, ey, he, ir, th, is, eye, her, iri, the, iris, thei, their]
29
Example - T^R
P = {her, their, eye, iris, he, is}
[Figure: T^R with nodes e (e), r (r), t (t), ye (ey), eh (he), reh (her), eht (the), eye (eye), h (h), ht (th), i (i), iri (iri), ri (ir), rieht (their), si (is), siri (iris), ieht (thei); original labels in parentheses, with $ marks]
30
Example - T^R and Failure
P = {her, their, eye, iris, he, is}
[Figure: T^R with nodes e (e), r (r), t (t), ye (ey), eh (he), reh (her), eht (the), eye (eye), h (h), ht (th), i (i), iri (iri), ri (ir), rieht (their), si (is), siri (iris), ieht (thei), with $ marks, next to the trie with nodes e, h, i, t, ey, he, ir, th, is, eye, her, iri, the, iris, thei, their; Failure links shown include iris → is, the → he, eye → e, their → ir]
31
T^R - Reversed Suffix Tree
We defined S_p = $P_1$P_2$...$P_q$
We define T^R to be the suffix tree of (S_p)^R
This tree has interesting properties:
– Each trie node v is represented by exactly one T^R node u, so that Label(v) = Label(u)^R
– In T^R, a node's label is a prefix of its child's label; in the trie, it is a suffix of the original
– A $-marked node in T^R means that the original label is a prefix of a pattern
32
Example - T^R
P = {her, iris, he, is}
[Figure: T^R with nodes e (e), r (r), eh (he), reh (her), h (h), i (i), iri (iri), ri (ir), si (is), siri (iris); original labels in parentheses, with $ marks]
33
Example - T^R and Failure
P = {her, their, eye, iris, he, is}
We took: Failure of “their” → “ir” (from “iris”)
– The longest suffix which is a prefix of a pattern
Their reversed strings are “rieht”, “ri”
– Now it is a prefix... so it is an ancestor in a suffix tree!
To be a prefix of a pattern p_x, it should be a suffix of the reversed pattern (p_x)^R
– So it will be in the suffix tree, and end with a $