Sparse Compact Directed Acyclic Word Graphs

Slides:



Advertisements
Similar presentations
Recognising Languages We will tackle the problem of defining languages by considering how we could recognise them. Problem: Is there a method of recognising.
Advertisements

Succinct Representations of Dynamic Strings Meng He and J. Ian Munro University of Waterloo.
Michael Alves, Patrick Dugan, Robert Daniels, Carlos Vicuna
Finite Automata CPSC 388 Ellen Walker Hiram College.
On-line Linear-time Construction of Word Suffix Trees Shunsuke Inenaga (Japan Society for the Promotion of Science & Kyushu University) Masayuki Takeda.
Trie/Suffix Trie/Suffix Tree. Trie A trie (from retrieval), is a multi-way tree structure useful for storing strings over an alphabet. It has been used.
Bar Ilan University And Georgia Tech Artistic Consultant: Aviya Amir.
A New Compressed Suffix Tree Supporting Fast Search and its Construction Algorithm Using Optimal Working Space Dong Kyue Kim 1 andHeejin Park 2 1 School.
Two implementation issues Alphabet size Generalizing to multiple strings.
Suffix Trees Specialized form of keyword trees New ideas –preprocess text T, not pattern P O(m) preprocess time O(n+k) search time –k is number of occurrences.
15-853Page : Algorithms in the Real World Suffix Trees.
OUTLINE Suffix trees Suffix arrays Suffix trees Indexing techniques are used to locate highest – scoring alignments. One method of indexing uses the.
296.3: Algorithms in the Real World
© 2004 Goodrich, Tamassia Tries1. © 2004 Goodrich, Tamassia Tries2 Preprocessing Strings Preprocessing the pattern speeds up pattern matching queries.
1 Prof. Dr. Th. Ottmann Theory I Algorithm Design and Analysis (12 - Text search: suffix trees)
Suffix Trees Suffix trees Linearized suffix trees Virtual suffix trees Suffix arrays Enhanced suffix arrays Suffix cactus, suffix vectors, …
21/05/2015Applied Algorithmics - week51 Off-line text search (indexing)  Off-line text search refers to the situation in which a preprocessed digital.
Tries Standard Tries Compressed Tries Suffix Tries.
Combinatorial Pattern Matching CS 466 Saurabh Sinha.
Tries Search for ‘bell’ O(n) by KMP algorithm O(dm) in a trie Tries
Alon Efrat Computer Science Department University of Arizona Suffix Trees.
Advanced Algorithm Design and Analysis (Lecture 4) SW5 fall 2004 Simonas Šaltenis E1-215b
Suffix Trees String … any sequence of characters. Substring of string S … string composed of characters i through j, i ate is.
Chapter 3 The Greedy Method 3.
Modern Information Retrieval
Goodrich, Tamassia String Processing1 Pattern Matching.
3 -1 Chapter 3 The Greedy Method 3 -2 The greedy method Suppose that a problem can be solved by a sequence of decisions. The greedy method has that each.
Full-Text Indexing via Burrows-Wheeler Transform Wing-Kai Hon Oct 18, 2006.
E.G.M. PetrakisTries1  Trees of order >= 2  Variable length keys  The decision on what path to follow is taken based on potion of the key  Static environment,
Indexed Search Tree (Trie) Fawzi Emad Chau-Wen Tseng Department of Computer Science University of Maryland, College Park.
Phylogenetic Networks of SNPs with Constrained Recombination D. Gusfield, S. Eddhu, C. Langley.
Data Structures & Algorithms Radix Search Richard Newman based on slides by S. Sahni and book by R. Sedgewick.
Building Suffix Trees in O(m) time Weiner had first linear time algorithm in 1973 McCreight developed a more space efficient algorithm in 1976 Ukkonen.
WMES3103 : INFORMATION RETRIEVAL INDEXING AND SEARCHING.
Survey: String Matching with k Mismatches Moshe Lewenstein Bar Ilan University.
Important Problem Types and Fundamental Data Structures
Hideo Bannai, Shunsuke Inenaga, Masayuki Takeda Kyushu University, Japan SPIRE Cartagena, Colombia.
Mike 66 Sept Succinct Data Structures: Techniques and Lower Bounds Ian Munro University of Waterloo Joint work with/ work of Arash Farzan, Alex Golynski,
© The McGraw-Hill Companies, Inc., Chapter 3 The Greedy Method.
Introduction n – length of text, m – length of search pattern string Generally suffix tree construction takes O(n) time, O(n) space and searching takes.
Computing Left-Right Maximal Generic Words Takaaki Nishimoto, Yuto Nakashima, Shunsuke Inenaga, Hideo Bannai, Masayuki Takeda Kyushu University, Japan.
String Matching with k Mismatches Moshe Lewenstein Bar Ilan University Modified by Ariel Rosenfeld.
Multiple Pattern Matching in LZW Compressed Text Takuya KIDA Masayuki TAKEDA Ayumi SHINOHARA Masamichi MIYAZAKI Setsuo ARIKAWA Department of Informatics.
Improved string matching with k mismatches (The Kangaroo Method) Galil, R. Giancarlo SIGACT News, Vol. 17, No. 4, 1986, pp. 52–54 Original: Moshe Lewenstein.
Succinct Data Structures Ian Munro University of Waterloo Joint work with David Benoit, Andrej Brodnik, D, Clark, F. Fich, M. He, J. Horton, A. López-Ortiz,
Tries1. 2 Outline and Reading Standard tries (§9.2.1) Compressed tries (§9.2.2) Suffix tries (§9.2.3)
Suffix trees. Trie A tree representing a set of strings. a b c e e f d b f e g { aeef ad bbfe bbfg c }
A Unifying Framework for Compressed Pattern Matching Takuya Kida, Masayuki Takeda, Ayumi Shinohara, Yusuke Shibata, Setsuo Arikawa Department of Informatics,
Semi-dynamic compact index for short patterns and succinct van Emde Boas tree 1 Yoshiaki Matsuoka 1, Tomohiro I 2, Shunsuke Inenaga 1, Hideo Bannai 1,
1 Ternary Directed Acyclic Word Graphs (TDAWG) Satoru Miyamoto, Shunsuke Inenaga, Masayuki Takeda and Ayumi Shinohara Present by Peera Liewlom (The Last.
Keisuke Goto, Hideo Bannai, Shunsuke Inenaga, Masayuki Takeda
Everything is String. Closed Factorization Golnaz Badkobeh 1, Hideo Bannai 2, Keisuke Goto 2, Tomohiro I 2, Costas S. Iliopoulos 3, Shunsuke Inenaga 2,
Strings Basic data type in computational biology A string is an ordered succession of characters or symbols from a finite set called an alphabet Sequence.
Huffman code and Lossless Decomposition Prof. Sin-Min Lee Department of Computer Science.
Contents What is a trie? When to use tries
Suffix Tree 6 Mar MinKoo Seo. Contents  Basic Text Searching  Introduction to Suffix Tree  Suffix Trees and Exact Matching  Longest Common Substring.
Advanced Data Structures Lecture 8 Mingmin Xie. Agenda Overview Trie Suffix Tree Suffix Array, LCP Construction Applications.
Computing smallest and largest repetition factorization in O(n log n) time Hiroe Inoue, Yoshiaki Matsuoka, Yuto Nakashima, Shunsuke Inenaga, Hideo Bannai,
Efficient Computation of Substring Equivalence Classes with Suffix Arrays this study is a joint work with Shunsuke Inenaga, Hideo Bannai and Masayuki Takeda.
Generic Trees—Trie, Compressed Trie, Suffix Trie (with Analysi
Tries 4/16/2018 8:59 AM Presentation for use with the textbook Data Structures and Algorithms in Java, 6th edition, by M. T. Goodrich, R. Tamassia, and.
15-853:Algorithms in the Real World
Tries 07/28/16 11:04 Text Compression
Tries 5/27/2018 3:08 AM Tries Tries.
Tries 9/14/ :13 AM Presentation for use with the textbook Data Structures and Algorithms in Java, 6th edition, by M. T. Goodrich, R. Tamassia, and.
Suffix trees.
Reachability on Suffix Tree Graphs
Tries 2/23/2019 8:29 AM Tries 2/23/2019 8:29 AM Tries.
Tries 2/27/2019 5:37 PM Tries Tries.
String Matching with k Mismatches
Presentation transcript:

Sparse Compact Directed Acyclic Word Graphs Shunsuke Inenaga (Japan Society for the Promotion of Science & Kyushu University) Masayuki Takeda (Kyushu University & Japan Science and Technology Agency)

Traditional Pattern Matching Problem Given:text T in S* and pattern P in S* Return:whether or not P appears in T S: alphabet (set of characters) S*: set of strings A text indexing structure for T enables you to solve the above problem in O(m) time (for fixed S). m: the length of P

Suffix Trie A trie representing all suffixes of T $ T = aacb$ a b c

Unwanted Matching pattern s pace runner text

Unwanted Matching pattern s pace runner text

Unwanted Matching pattern s pace runner text

Introducing Word Separator # # : word separator - special symbol not in S D = S* # : dictionary of words Text T : an element of D+ (T is a sequence T1T2…Tk of k words in D) e.g., T = This#is#a#pen# S = {A,…,z} D = {...,This#,...,a#,...is#,...pen#,...}

Word-level Pattern Matching Problem Given: text T in D+ and pattern P in D+ Return: whether or not P appears at the beginning of any word in T e.g. T = The#space#runner#is#not#your#good#pace#runner# P = pace#runner#

Word-level Pattern Matching Problem Given: text T in D+ and pattern P in D+ Return: whether or not P appears at the beginning of any word in T e.g. T = The#space#runner#is#not#your#good#pace#runner# P = pace#runner#

Word Suffix Trie A trie representing the suffixes of T which begin at a word. T = aa#b# a b aa#b# a#b# #b# b# # a # # b #

Normal and Word Suffix Tries T = aa#b# a a b b # a # a # # b # # b # b b # # # Normal Suffix Trie Word Suffix Trie

Normal and Word Suffix Trees T = aa#b# a b b # a # # a a b # # # # b b b # # # Normal Suffix Tree Word Suffix Tree

Sizes of Word Suffix Tries and Trees For text T = T1T2…Tk of length n, the word suffix trie of T requires O(nk) space, but the word suffix tree of T requires O(k) space!! because the word suffix tree has only k leaves and has only branching internal nodes.

Construction of Word Suffix Trees Algorithm by Andersson et al.(1996) for text T = T1T2…Tk of length n, constructs word suffix trees in O(n) expected time with O(k) space. Our algorithm (CPM’06) builds word suffix trees in O(n) time in the worst case, with O(k) space.

Our Construction Algorithm We modify Ukkonen’s on-line normal suffix tree construction algorithm by using minimum DFA accepting dictionary D We replace the root node of the suffix tree with the final state of the DFA.

Minimum DFA The minimum DFA accepting D = S* # clearly requires constant space (for fixed S). S #

On-line Construction of Word Suffix Trees T = aa#b# a,b # a

On-line Construction of Word Suffix Trees T = aa#b# a,b # a

On-line Construction of Word Suffix Trees T = aa#b# a,b # a a

On-line Construction of Word Suffix Trees T = aa#b# a,b # a a

On-line Construction of Word Suffix Trees T = aa#b# a,b # a a #

On-line Construction of Word Suffix Trees T = aa#b# a,b # a a #

On-line Construction of Word Suffix Trees T = aa#b# a,b # b a a # b

On-line Construction of Word Suffix Trees T = aa#b# a,b # b a a # b

On-line Construction of Word Suffix Trees T = aa#b# a,b # b a # a # b #

Pseudo-Code Just change here

Like Dress-up Doll...

Like Dress-up Doll... [cont.] a # b a,b S,# Different looking!!!! Just change hair!! Body is the same!!

Compact Directed Acyclic Word Graphs T = aa#b# a a # b b # # a b # # minimization # b b # # Compact Directed Acyclic Word Graph (CDAWG) Suffix Tree

Sparse CDAWGs T = aa#b# a # b b a # a # b # Word Suffix Tree minimization # Sparse Compact Directed Acyclic Word Graph (SCDAWG) Word Suffix Tree

Sparse CDAWGs [cont.] T = a#b#a#bab# a # b a # b b # a # a a b a # b # minimization a # a a b a # b # # b b # a a b b # # Word Suffix Tree SCDAWG

SCDAWG Construction SCDAWGs can be constructed by minimizing word suffix trees in O(k) time. using Revuz’s DAG minimization algorithm (1992)

SCDAWG Construction [cont.] Question : Direct construction for SCDAWGs? Answer : YES! Using minimal DFA accepting dictionary D, we can directly build SCDAWGs in O(n) time and O(k) space. We modify the CDAWG on-line construction algorithm (Inenaga et al. 05) by using the above DFA.

Pseudo-Code Different looking!!!! Just change here Body is the same!! # b Different looking!!!! a,b S,# Just change here Body is the same!!

Some Events Basically on-line construction of SCDAWGs is similar to that of word suffix trees. Except for the two following unique events: Edge merging Node splitting

Edge Merging a,b T = a#b#a#bab#bc... # a # b b # # a a # # b b

Edge Merging a,b T = a#b#a#bab#bc... # a # b b # # a a # # b b a a

Edge Merging a,b T = a#b#a#bab#bc... # a # b b # # a a a # # b b a a

Edge Merging a,b T = a#b#a#bab#bc... # a # b b # a # a b a

Node Splitting a,b T = a#b#a#bab#bc... # a # b b # a a # b b a # b #

Node Splitting a,b T = a#b#a#bab#bc... # a # b b # a a # b b # a b b #

Node Splitting a,b T = a#b#a#bab#bc... # a # b b # # a a a # b # a b #

Conclusion We introduced new text indexing structure sparse compact directed acyclic word graphs (SCDAWGs) for word-level pattern matching. We presented an on-line algorithm to construct SCDAWGs directly, in O(n) time with O(k) space. The key is the use of minimum DFA accepting dictionary D.

Related Work “Sparse Directed Acyclic Word Graphs” by Shunsuke Inenaga and Masayuki Takeda Accepted to SPIRE’06