On-line Linear-time Construction of Word Suffix Trees Shunsuke Inenaga (Japan Society for the Promotion of Science & Kyushu University) Masayuki Takeda.


On-line Linear-time Construction of Word Suffix Trees Shunsuke Inenaga (Japan Society for the Promotion of Science & Kyushu University) Masayuki Takeda (Kyushu University & JST)

Pattern Searching Problem Given: text T in Σ* and pattern P in Σ*. Find: an occurrence of P in T. Σ: alphabet. Σ*: set of strings. Using an indexing structure for T, we can solve the above problem in O(|P|) time.

Suffix Trie A trie representing all suffixes of T. T = aacb$. Suffixes: aacb$, acb$, cb$, b$, $. [Figure: suffix trie of T]
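The definition above can be sketched in a few lines. This is a naive quadratic-time illustration, not the on-line algorithm discussed later in the talk; the function names `build_suffix_trie` and `contains` and the nested-dict representation are assumptions made for this example.

```python
def build_suffix_trie(text):
    """Insert every suffix of `text` into a trie of nested dicts (naive, O(|T|^2))."""
    root = {}
    for i in range(len(text)):
        node = root
        for ch in text[i:]:
            # Follow the edge labelled `ch`, creating it if it does not exist yet.
            node = node.setdefault(ch, {})
    return root


def contains(trie, pattern):
    """Return True if `pattern` occurs in the indexed text; takes O(|P|) steps."""
    node = trie
    for ch in pattern:
        if ch not in node:
            return False
        node = node[ch]
    return True


trie = build_suffix_trie("aacb$")
print(contains(trie, "acb"))  # True
print(contains(trie, "ca"))   # False
```

Each query walks at most |P| edges from the root, which is where the O(|P|) search time comes from.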

Unwanted Matching [Figure: pattern "pace runner" matching inside text "space runner"]

Introducing Word Separator # #: word separator, a special symbol not in Σ. D = Σ*#: dictionary of words. Text T: an element of D+ (T is a sequence T1 T2 … Tk of k words in D). e.g., T = This#is#a#pen#, Σ = {A,…,z}, D = {..., This#, ..., a#, ..., is#, ..., pen#, ...}

Word-level Pattern Searching Problem Given: text T in D + and pattern P in D + Find: an occurrence of P in T which immediately follows # e.g. The#space#runner#is#not#your#good#pace#runner#


Word Suffix Trie A trie representing the suffixes of T which immediately follow # (and T itself). T = aa#b#. Of the suffixes aa#b#, a#b#, #b#, b#, #, only aa#b# and b# are represented. [Figure: word suffix trie of T]
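A minimal sketch of this definition, again built naively rather than on-line. The helper names `word_suffixes`, `build_word_suffix_trie` and `occurs_at_word_boundary` are hypothetical, and the trie is a plain nested dict chosen only for illustration.

```python
def word_suffixes(text, sep="#"):
    """Return T itself plus every non-empty suffix that immediately follows `sep`."""
    starts = [0] + [i + 1 for i, ch in enumerate(text)
                    if ch == sep and i + 1 < len(text)]
    return [text[i:] for i in starts]


def build_word_suffix_trie(text, sep="#"):
    """Insert only the word-level suffixes of `text` into a nested-dict trie."""
    root = {}
    for suffix in word_suffixes(text, sep):
        node = root
        for ch in suffix:
            node = node.setdefault(ch, {})
    return root


def occurs_at_word_boundary(trie, pattern):
    """Return True if `pattern` occurs starting at the beginning of some word."""
    node = trie
    for ch in pattern:
        if ch not in node:
            return False
        node = node[ch]
    return True


print(word_suffixes("aa#b#"))  # ['aa#b#', 'b#']

# "pace#runner#" only occurs inside "space#runner#", so it is not reported.
t1 = build_word_suffix_trie("The#space#runner#")
print(occurs_at_word_boundary(t1, "pace#runner#"))  # False

# Here the same pattern also starts right after a separator.
t2 = build_word_suffix_trie("The#space#runner#is#not#your#good#pace#runner#")
print(occurs_at_word_boundary(t2, "pace#runner#"))  # True
```

Because only suffixes that begin a word are inserted, the unwanted match inside "space" is never reachable from the root.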

Comparison T = aa#b# [Figure: suffix trie (left) and word suffix trie (right) of T]
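To make the size difference concrete, the following sketch counts the non-root nodes of both tries for T = aa#b#. The helpers `build_trie` and `count_nodes` are names assumed for this example, and both tries are built naively from explicit suffix lists.

```python
def build_trie(strings):
    """Build a nested-dict trie containing every string in `strings`."""
    root = {}
    for s in strings:
        node = root
        for ch in s:
            node = node.setdefault(ch, {})
    return root


def count_nodes(node):
    """Count all nodes below `node` (the root itself is not counted)."""
    return sum(1 + count_nodes(child) for child in node.values())


text = "aa#b#"
all_suffixes = [text[i:] for i in range(len(text))]
# T itself plus the suffixes that start right after a separator.
word_level = [text] + [text[i + 1:] for i, ch in enumerate(text[:-1]) if ch == "#"]

print(count_nodes(build_trie(all_suffixes)))  # 13
print(count_nodes(build_trie(word_level)))    # 7
```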

Construction Suffix Trie: Ukkonen’s on-line algorithm (1995). Word Suffix Trie: we modify Ukkonen’s algorithm by (1) using the minimum DFA accepting the dictionary D, and (2) redefining suffix links.

Minimum DFA The minimum DFA accepting D = Σ*# clearly requires constant space (for fixed Σ). We replace the root node of the suffix trie with the final state of the DFA. [Figure: DFA with a loop labelled Σ on the initial state and a transition labelled # to the final state]
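A sketch of such a DFA, written as a partial automaton with two states (the dead state is left implicit, which is an assumption of this example rather than something the slide specifies). The function name `accepts_word` and the state names are hypothetical.

```python
def accepts_word(s, alphabet, sep="#"):
    """Simulate a two-state partial DFA for D = Sigma* followed by one separator."""
    INITIAL, FINAL = 0, 1
    state = INITIAL
    for ch in s:
        if state == INITIAL and ch in alphabet:
            state = INITIAL      # loop on any symbol of Sigma
        elif state == INITIAL and ch == sep:
            state = FINAL        # the separator ends the word
        else:
            return False         # no transition defined: reject
    return state == FINAL


sigma = {"a", "b"}
print(accepts_word("aa#", sigma))   # True
print(accepts_word("#", sigma))     # True
print(accepts_word("aa", sigma))    # False
print(accepts_word("a#b#", sigma))  # False
```

The number of states does not depend on the text, which matches the constant-space remark for a fixed alphabet.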

Suffix Links T = aa#b# [Figure: word suffix trie of T with the DFA attached (edges labelled # and a,b) and suffix links drawn]

Suffix Links For any node s with path(s) = u, we can write u = xy such that x ∈ D*, and y is a proper prefix of some string in D. Note that the choice of x and y is unique for each u = path(s).

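Suffix Links Examples on the word suffix trie of T = aa#b#: for the node s with path(s) = aa, x = ε and y = aa; for the node s with path(s) = aa#b#, x = aa#b# and y = ε. (x ∈ D* and y is a proper prefix of some string in D.) [Figure: word suffix trie with the two nodes highlighted]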

Suffix Links [cont.] If x ∈ D+, suflink(s) = r such that path(r) = x’y, and x = hx’ for some h ∈ D. Otherwise (if x = ε), suflink(s) = initial state of DFA.

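Suffix Links Examples on the word suffix trie of T = aa#b#: for path(s) = aa, x = ε and y = aa, so suflink(s) = initial state of DFA; for path(s) = aa#b, x = aa# and y = b, so suflink(s) is the node with path b. (x ∈ D* and y is a proper prefix of some string in D.) [Figure: word suffix trie with these suffix links drawn]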
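The decomposition u = xy and the redefined suffix link can be checked on strings alone. The sketch below works on path labels rather than trie nodes, which is a simplification assumed for this example; `decompose` and `suffix_link_target` are hypothetical names, and `None` stands in for the initial state of the DFA.

```python
def decompose(u, sep="#"):
    """Split u into (x, y): x is the longest prefix ending with `sep`, y is the rest."""
    cut = u.rfind(sep) + 1   # rfind returns -1 when there is no separator, so cut = 0
    return u[:cut], u[cut:]


def suffix_link_target(u, sep="#"):
    """Return the path label x'y of suflink(s) for path(s) = u.

    Returns None when x is empty, meaning the link goes to the DFA's initial state.
    """
    x, y = decompose(u, sep)
    if not x:
        return None
    first_word_end = x.index(sep) + 1   # x = h x' where h is the first word of x
    return x[first_word_end:] + y


print(decompose("aa"))              # ('', 'aa')
print(decompose("aa#b#"))           # ('aa#b#', '')
print(suffix_link_target("aa"))     # None
print(suffix_link_target("aa#b"))   # b
print(suffix_link_target("aa#b#"))  # b#
```

In effect the link drops one whole word from the front of the path, where an ordinary suffix link would drop a single character.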

Suffix Links [cont.] T = aa#b# [Figure: suffix links in the suffix trie (left) and in the word suffix trie (right)]

On-line Construction T = aa#b# [Figure: step-by-step on-line construction of the suffix trie (left) and the word suffix trie (right), one character of T at a time]

Pseudo Code Just change here!! [Figure: pseudo code of the construction algorithm with the modified part highlighted, next to the suffix trie and the word suffix trie of T = aa#b#]

Like Dress-up Doll

Like Dress-up Doll [cont.] Hair is different. Body is the same!! Different looking!!!! [Figure: word suffix trie and suffix trie of T = aa#b#, differing only at the root (DFA vs. ordinary root)]

Drawback of Word Suffix Trie Word suffix tries require O(k|T|) space. Andersson et al. introduced word suffix trees which can be implemented in O(k) space.

Construction of Word Suffix Trees Algorithm by Andersson et al. (1996): for text T = T1 T2 … Tk, constructs word suffix trees in O(|T|) expected time with O(k) space. Our algorithm: simulates the on-line word suffix trie algorithm on word suffix trees; runs in O(|T|) time in the worst case, with O(k) space.

Suffix Tree T = aa#b# [Figure: suffix tree of T]

Word Suffix Tree T = aa#b# [Figure: word suffix tree of T]

Normal and Word Suffix Trees T = aa#b# [Figure: suffix tree (left) and word suffix tree (right) of T]

Construction Algorithm Just change here!!

Conclusions We first proposed an on-line word suffix trie construction algorithm. The keys to the algorithm are the minimum DFA accepting D and the redefined suffix links. Further, we introduced an on-line algorithm to build word suffix trees that works with O(k) space and in O(|T|) time in the worst case.

Further Work We like this “dress-up doll” strategy :) Hair is different Body is the same!! Different looking!!!!

Further Work “Sparse Directed Acyclic Word Graphs” by Shunsuke Inenaga and Masayuki Takeda Accepted to SPIRE’06 “Sparse Compact Directed Acyclic Word Graphs” by Shunsuke Inenaga and Masayuki Takeda Accepted to PSC’06