Construction of Aho Corasick automaton in Linear time for Integer Alphabets Shiri Dori & Gad M. Landau University of Haifa.



Overview
Classic Aho Corasick
Our algorithm
  – Goto Function
  – Failure Function
  – Combining the two
Queries in O(m log|Σ|)

Set Pattern Matching Problem
Find all occurrences of the patterns P = {P_1, P_2, ..., P_q} in a text T
Aho and Corasick solved it in '75
Generalized version of KMP
Uses a state machine

Aho Corasick - Example
P = {her, iris, he, is}
[Figure: trie of the patterns with nodes ε, h, i, he, ir, is, her, iri, iris]
Travel along the Goto function, which is a trie of all patterns
If stuck, travel along a KMP-style Failure link
When a pattern is found, output it

Aho Corasick Definitions
Goto function: a trie of the patterns
Failure function: for each label, the largest proper suffix which is a prefix of a pattern
  – As in KMP, but a prefix of any pattern qualifies
Output function: the patterns ending at this label
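The three functions defined above can be sketched in Python as follows. This is a minimal illustration of the classic construction, not the linear-time algorithm of the talk: it assumes hash-based branching (a `dict` per state), and the names `build_automaton` and `search` are hypothetical.

```python
from collections import deque

def build_automaton(patterns):
    """Classic Aho-Corasick tables (sketch; dict branching is an assumption)."""
    goto = [{}]   # goto[s][ch] -> next state; state 0 is the root
    fail = [0]    # fail[s] -> state of the largest proper suffix that is a pattern prefix
    out = [[]]    # out[s] -> patterns ending at state s
    for p in patterns:
        s = 0
        for ch in p:
            if ch not in goto[s]:
                goto.append({}); fail.append(0); out.append([])
                goto[s][ch] = len(goto) - 1
            s = goto[s][ch]
        out[s].append(p)
    # Breadth-first traversal: shallower states get their failure links first.
    queue = deque(goto[0].values())
    while queue:
        s = queue.popleft()
        for ch, t in goto[s].items():
            f = fail[s]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[t] = goto[f].get(ch, 0)
            out[t] = out[t] + out[fail[t]]  # inherit patterns ending at the failure state
            queue.append(t)
    return goto, fail, out

def search(text, automaton):
    """Return (start_index, pattern) for every occurrence in text."""
    goto, fail, out = automaton
    s, hits = 0, []
    for i, ch in enumerate(text):
        while s and ch not in goto[s]:
            s = fail[s]               # stuck: follow failure links
        s = goto[s].get(ch, 0)
        for p in out[s]:
            hits.append((i - len(p) + 1, p))
    return hits
```

For example, `search("heriris", build_automaton(["her", "iris", "he", "is"]))` reports `he` and `her` at position 0, `iris` at position 3 and `is` at position 5.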

Classic Aho Corasick – Analysis
Constructed in O(n) time (n = cumulative pattern length)
Answers queries in O(m + k) (m = text length, k = number of occurrences)...
For constant alphabets only!
For integer alphabets, |Σ| = O(n^c), the algorithm changes depending on the branching method
  – List, Array or Search Tree
Recent developments suggest we can do better!
  – Farach-97; Kärkkäinen & Sanders-03;
  – Ko & Aluru-03; Kim, Sim, Park & Park-03

Our Results
Our algorithm achieves better results:
Construction in O(n) time, O(n) space
Query in O(m log|Σ|)
Works for integer alphabets, |Σ| = O(n^c)

Algorithm: Goto Function
Sort the patterns in time linear in their total length
  – By building the suffix array of S_p = $P_1$P_2$...$P_q$, and just ignoring the non-pattern suffixes
  – Or by two-pass radix sort, O(D + |Σ|) = O(n)
      · Paige & Tarjan, '87; Andersson & Nilsson, '94
Now create the trie in lexicographic order
Hold a list of sons in each node; insert each new node at the end of the list
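The trie-building step can be sketched as below. Because the patterns arrive in lexicographic order, the child that matches the next character, if it exists, is always the last one in the parent's list, so only the tail needs checking. This is a hedged illustration: Python's built-in `sorted` stands in for the linear-time sort, and `build_trie_sorted` and the `label` array (kept only for display) are assumptions of this sketch.

```python
def build_trie_sorted(patterns):
    """Build the Goto trie from sorted patterns, appending children at the list tail."""
    children = [[]]   # children[v] = list of (char, child_id), in sorted order
    label = [""]      # label[v] = string spelled from the root (for display only)
    # Stand-in for the linear-time sort (suffix array or radix sort in the slides).
    for p in sorted(set(patterns)):
        v = 0
        for ch in p:
            kids = children[v]
            if kids and kids[-1][0] == ch:
                v = kids[-1][1]              # reuse the tail child
            else:
                children.append([])
                label.append(label[v] + ch)
                kids.append((ch, len(children) - 1))  # new node goes to the end
                v = len(children) - 1
    return children, label
```

With P = {the, than, this, then}, the node labelled `th` ends up with the child list a, e, i without any search inside the list.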

Example – Goto Function
Sorting patterns:
P = {the, than, this, then}
P' = {than, the, then, this}

Example – Goto Function
P' = {than, the, then, this}
Insertion order: than, the, then, this (one pattern per step)
[Figure: the trie grows to the nodes ε, t, th, tha, the, than, then, thi, this]

Example – Goto Function
P' = {than, the, then, this}
[Figure: trie with nodes ε, t, th, tha, the, than, then, thi, this]
Node th holds the sorted list of children a, e, i; keep the tail

Algorithm: Failure Function
We need to construct Failure links on the trie
The original algorithm included traversing the trie
We found a deep connection between:
  – The Failure function of the patterns, and
  – The Suffix Tree of the reversed patterns
      · Or Enhanced Suffix Array
      · Abouelhoda, Kurtz & Ohlebusch-04; Kim, Jeon & Park-04
We'll "learn by example"...

Example – Failure Function
P = {he, her, iris, is}
P^R = {eh, reh, siri, si}
Failure function: "iris" → "is"
The reverses: "siri", "si"
"si" is a prefix of "siri" (and is followed by a $, so "is" is a prefix of a pattern)
[Figure: trie with nodes ε, h, i, he, ir, is, her, iri, iris, next to the suffix tree of the reversed patterns with nodes ε, r (r), eh (he), reh (her), h (h), i (i), iri (iri), ri (ir), si (is), siri (iris); original labels in parentheses, $ marks on edges; si (is) and siri (iris) highlighted]

Understanding Failure Function
Failure function is defined as: the largest suffix which is a prefix of any pattern
Reverse: "largest suffix" → "largest prefix"
  – Any prefix of a label will be its ancestor in the ST
  – Largest means nearest
"prefix of pattern" → "suffix of reversed pattern"
  – It will be a node in the ST, marked by a $
So: the closest ancestor which is marked by $
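The equivalence above can be checked by brute force. The sketch below computes the failure label once directly from the definition and once on reversed strings, where "largest proper suffix" becomes "largest proper prefix" and "prefix of a pattern" becomes "reversed pattern prefix". Both function names are hypothetical, and the quadratic set construction is for illustration only.

```python
def failure_by_definition(label, patterns):
    """Largest proper suffix of label that is a prefix of some pattern."""
    prefixes = {p[:i] for p in patterns for i in range(len(p) + 1)}
    for start in range(1, len(label) + 1):      # longest proper suffix first
        if label[start:] in prefixes:
            return label[start:]
    return ""

def failure_by_reversal(label, patterns):
    """Same value, computed on reversed strings as in the suffix-tree view."""
    # Reversed pattern prefixes play the role of the $-marked nodes.
    marked = {p[:i][::-1] for p in patterns for i in range(len(p) + 1)}
    rev = label[::-1]
    for end in range(len(rev) - 1, -1, -1):     # longest proper prefix first
        if rev[:end] in marked:
            return rev[:end][::-1]
    return ""
```

For P = {her, their, eye, iris, he, is}, both functions map `their` to `ir` and `iris` to `is`.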

Algorithm: Failure Function
We found a deep connection between:
  – The Failure function of the patterns, and
  – The Suffix Tree of the reversed patterns
We define S_p = $P_1$P_2$...$P_q$
We define T^R to be the suffix tree of (S_p)^R
T^R can be built in linear time
  – Can use the Enhanced Suffix Array, E^R, instead
  – Note: T^R is a Generalized Suffix Tree
How will we link the trie and T^R?

Example – 1-to-1 Mapping
[Figure: T^R with nodes ε, r (r), eh (he), reh (her), h (h), i (i), iri (iri), ri (ir), si (is), siri (iris), with $ marks on edges, mapped onto the trie nodes ε, h, i, he, ir, is, her, iri, iris]
Note: "r" doesn't get a link since it's not marked by a $

Example – 1-to-1 Mapping
[Figure: the same T^R and trie; the nodes si (is), siri (iris) and i (i), iri (iri) are highlighted]

Algorithm: Review
Build Goto function (trie)
  – Sort patterns
  – Construct trie
Build Failure function
  – Construct T^R
  – Compute proper ancestor for $-marked nodes
Combine information
  – Through mapping, create Failure links on trie

Adjustment for Integer Alphabet
We used recent developments (SA, ST)
Constructed Goto: using a suffix array
Found a connection between the Failure function and suffix trees
Thus, reduced the construction to O(n)
Yet, we manage to keep queries at O(m log|Σ|)
Again - how?

Queries in O(m log|Σ|)
We've built the trie in O(n)
  – But each node holds its children in a sorted list
  – Search is compromised: a list can only be scanned
Our simple solution…

Example – Goto Function
P' = {than, the, then, this}
[Figure: trie with nodes ε, t, th, tha, the, than, then, thi, this; the child list a, e, i of node th is converted to the array [a, e, i]]
An array can be searched in log(#children)

Queries in O(m log|Σ|)
Once the trie is complete
  – Convert the list in each node to an array
  – Each array's size is known; O(n) space overall
  – Binary search can now be employed
Reduces the time spent in each node to log(#children) = O(log|Σ|)
Can be applied to a Suffix Tree built from Suffix Array + LCP
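The list-to-array conversion and the binary search can be sketched as below. Python lists are already arrays, so the "conversion" is only modelled here by freezing each child list into parallel tuples once the trie is complete. All names (`build_sorted_trie`, `freeze`, `goto`, `contains`) are hypothetical, and `sorted` again stands in for the linear-time sort.

```python
from bisect import bisect_left

def build_sorted_trie(patterns):
    """Trie with per-node child lists kept in sorted order (tail insertion)."""
    children, terminal = [[]], [False]
    for p in sorted(set(patterns)):             # stand-in for the linear-time sort
        v = 0
        for ch in p:
            kids = children[v]
            if kids and kids[-1][0] == ch:
                v = kids[-1][1]
            else:
                children.append([]); terminal.append(False)
                kids.append((ch, len(children) - 1))
                v = len(children) - 1
        terminal[v] = True
    return children, terminal

def freeze(children):
    """Convert each finished child list into fixed-size arrays (tuples)."""
    keys = [tuple(ch for ch, _ in kids) for kids in children]
    targets = [tuple(t for _, t in kids) for kids in children]
    return keys, targets

def goto(keys, targets, v, ch):
    """Binary search among the children of v: O(log #children)."""
    i = bisect_left(keys[v], ch)
    if i < len(keys[v]) and keys[v][i] == ch:
        return targets[v][i]
    return None

def contains(word, keys, targets, terminal):
    v = 0
    for ch in word:
        v = goto(keys, targets, v, ch)
        if v is None:
            return False
    return terminal[v]
```

Each step of a query then costs one binary search in the current node, which gives the O(log|Σ|) factor per text character.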

The End Thanks!

Algorithm: Combining the two
Build a 1-to-1 mapping between $-marked nodes in T^R and trie nodes
We compute the mapping through the string:
  – For each char in S_p, we keep its Goto node
  – For each suffix tree node, we know what indices it represents (in (S_p)^R, and so in S_p)
Now, build Failure links atop the trie
  – Like we saw in the example

Algorithm: Failure Function
For each node, find its "proper ancestor"
  – Closest ancestor marked with a $
  – Found with a simple preorder traversal
The properties of T^R ensure that...
  – For each failure link v_1 → v_2
  – And their corresponding nodes, u_1 and u_2
  – u_2 = proper ancestor of u_1
If we link the trie and T^R, we find the Failure!
How will we link them?
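The proper-ancestor computation can be illustrated with a stand-in for T^R. The sketch below does not build a suffix tree: it assumes a plain, uncompressed trie over the reversed labels of all Goto-trie nodes (which can take quadratic space, unlike the real T^R), marks the nodes that correspond to pattern prefixes, and carries the nearest marked ancestor down a preorder traversal. The function name `failure_links` is hypothetical.

```python
def failure_links(patterns):
    """Map each non-root Goto-trie label to the label its Failure link points at."""
    # Labels of the Goto trie = all prefixes of the patterns (including "").
    labels = sorted({p[:i] for p in patterns for i in range(len(p) + 1)})
    # Stand-in for T^R: a trie over the reversed labels (assumption of this sketch).
    kids = [{}]
    marked = [None]   # marked[u] = original label if u plays the role of a $-marked node
    for x in labels:
        u = 0
        for ch in reversed(x):
            if ch not in kids[u]:
                kids.append({}); marked.append(None)
                kids[u][ch] = len(kids) - 1
            u = kids[u][ch]
        marked[u] = x
    # Preorder traversal: each node is handled before its children and passes
    # down the closest marked ancestor seen so far.
    fail = {}
    stack = [(0, None)]
    while stack:
        u, anc = stack.pop()
        if marked[u] is not None:
            if anc is not None:
                fail[marked[u]] = anc   # proper ancestor found: this is the Failure link
            anc = marked[u]
        for child in kids[u].values():
            stack.append((child, anc))
    return fail
```

For P = {her, their, eye, iris, he, is} this yields, among others, `their` → `ir`, `iris` → `is` and `the` → `he`; the unmarked node `r` contributes no link.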

Example - automaton - Goto
P = {her, their, eye, iris, he, is}
Travel along the Goto function, which is a trie of all patterns
[Figure: trie with nodes ε, e, h, i, t, ey, he, ir, is, th, eye, her, iri, the, iris, thei, their]

Example - T^R
P = {her, their, eye, iris, he, is}
[Figure: T^R with nodes ε, e (e), r (r), t (t), ye (ey), eh (he), reh (her), eht (the), eye (eye), h (h), ht (th), i (i), iri (iri), ri (ir), rieht (their), si (is), siri (iris), ieht (thei); original labels in parentheses, $ marks on edges]

Example - T^R and Failure
P = {her, their, eye, iris, he, is}
[Figure: T^R with nodes ε, e (e), r (r), t (t), ye (ey), eh (he), reh (her), eht (the), eye (eye), h (h), ht (th), i (i), iri (iri), ri (ir), rieht (their), si (is), siri (iris), ieht (thei), next to the trie with nodes ε, e, h, i, t, ey, he, ir, th, is, eye, her, iri, the, iris, thei, their]
Failure links shown: iris → is; the → he → e; eye → e; their → ir → ε

T^R - Reversed Suffix Tree
We defined S_p = $P_1$P_2$...$P_q$
We define T^R to be the suffix tree of (S_p)^R
This tree has interesting properties:
  – Each trie node v is represented by exactly one T^R node u, so that Label(v) = Label(u)^R
  – In T^R, a node's label is a prefix of its child's label; in the trie, it is a suffix of the original
  – A $-marked node in T^R means that the original label is a prefix of a pattern

Example - T^R
P = {her, iris, he, is}
[Figure: T^R with nodes ε, e (e), r (r), eh (he), reh (her), h (h), i (i), iri (iri), ri (ir), si (is), siri (iris); original labels in parentheses, $ marks on edges]

Example - T^R and Failure
P = {her, their, eye, iris, he, is}
We took: Failure of "their" → "ir" (from "iris")
  – Largest suffix, which is a prefix of a pattern
Their reverse strings are "rieht", "ri"
  – Now a prefix... so its ancestor in a suffix tree!
To be a prefix of a pattern p_x, it should be a suffix of the reverse pattern (p_x)^R
  – So it will be in the suffix tree, and end with a $