Semi-dynamic compact index for short patterns and succinct van Emde Boas tree
Yoshiaki Matsuoka¹, Tomohiro I², Shunsuke Inenaga¹, Hideo Bannai¹, Masayuki Takeda¹ (¹Kyushu University, ²TU Dortmund)




Overview
- There exist many space-efficient indices (e.g. the FM-index [Ferragina & Manzini, 2000]), but most of them are static.
- Some (e.g. the Dynamic FM-index [Salson et al., 2010]) are dynamic but consume more space than their static counterparts.
- We propose a self-index for searching patterns of limited length, which:
  - is theoretically and practically efficient in terms of construction, updates (appending characters to the end of the text) and searches,
  - is compact, i.e., requires only O(n log σ) bits of space, where n is the text size and σ is the alphabet size, and
  - can be constructed in an online manner.


Problem
- Preprocess: text T of length n over an alphabet of size σ.
- Query: pattern P of length at most r.
- Answer: all occurrences of P in T.
- Example: T = abbbabaaabaaabbaaaabaa. If P = baa, then we output {5, 9, 14, 19} (in any order).
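As a point of reference, the query above can be answered by a plain linear scan (a minimal sketch for illustration only, not part of the proposed index):

```python
def occurrences(T, P):
    """Report all starting positions of P in T by brute force."""
    m = len(P)
    return [i for i in range(len(T) - m + 1) if T[i:i + m] == P]

# The slide's example:
print(occurrences("abbbabaaabaaabbaaaabaa", "baa"))  # [5, 9, 14, 19]
```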

A naïve algorithm
- Since we would like to search for any pattern of length at most r, a naïve solution would be to store all occurrences of all r-grams in T.
- This naïve algorithm requires at least n log n bits.
- Example (r = 3, T = abbbabaaabaaabbaaaabaa):

  r-gram | Occurrences
  aaa    | 6, 10, 15, 16
  aab    | 7, 11, 17
  aba    | 4, 8, 18
  ...    | ...
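The naïve r-gram table can be sketched as follows (a toy version of the baseline, kept only as a point of comparison):

```python
from collections import defaultdict

def rgram_index(T, r):
    """Map every r-gram of T to the list of its occurrences, in order."""
    index = defaultdict(list)
    for i in range(len(T) - r + 1):
        index[T[i:i + r]].append(i)
    return index

idx = rgram_index("abbbabaaabaaabbaaaabaa", 3)
print(idx["aaa"])  # [6, 10, 15, 16]
print(idx["baa"])  # [5, 9, 14, 19]
```

Storing every position explicitly is what costs at least n log n bits; the sampling idea on the next slide reduces exactly this.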

Sampling of q-grams
- To reduce the space, we only store the beginning positions divisible by some k (> 1).
- We also sample longer substrings (of length q = r + k − 1) so that occurrences of substrings of length at most r are not missed.
- Example (r = 3, k = 4, q = 6, T = abbbabaaabaaabbaaaabaa):

  q-gram | Occurrences at positions divisible by k
  aaabaa | 16
  abaaab | 4, 8
  abbaaa | 12
  abbbab | 0

Sampling of q-grams
- For any pattern P of length at most r, if w is a sampled q-gram at position x in T and P has an occurrence in w with relative position d (i.e., w[d.. d+|P|−1] = P), then x + d is an occurrence of P in T.
- Example (P = baa, r = 3, k = 4, q = 6, T = abbbabaaabaaabbaaaabaa): the sampled q-grams give occurrences at 4+1, 8+1, 12+2 and 16+3, i.e., {5, 9, 14, 19}.
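The two sampling slides can be sketched together: index only the q-grams starting at multiples of k, then recover each occurrence of P as x + d. This is an illustrative reimplementation (the real index does not scan all sampled q-grams per query):

```python
from collections import defaultdict

def sampled_index(T, r, k):
    """Index the q-grams (q = r + k - 1) starting at positions divisible by k."""
    q = r + k - 1
    index = defaultdict(list)
    for x in range(0, len(T) - q + 1, k):
        index[T[x:x + q]].append(x)
    return index, q

def search(index, q, P):
    """Recover every occurrence of P (|P| <= r) from the sampled q-grams."""
    occs = set()
    for w, positions in index.items():
        for d in range(q - len(P) + 1):
            if w[d:d + len(P)] == P:
                occs.update(x + d for x in positions)
    return sorted(occs)

idx, q = sampled_index("abbbabaaabaaabbaaaabaa", r=3, k=4)
print(search(idx, q, "baa"))  # [5, 9, 14, 19]
```

Because q = r + k − 1, every occurrence of a pattern of length at most r fits entirely inside the sampled window that starts at the preceding multiple of k (ignoring boundary effects at the very end of the text).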

Set of q-grams Q_{P,d}
- Let Q_{P,d} be the set of (not only sampled but) all q-grams w in T where P has an occurrence in w with relative position d, i.e., w[d.. d+|P|−1] = P.
- For example, consider T = abbbabaaabaaabbaaaabaa. With k = 4, q = 6 and P = baa,
  Q_{P,0} = { baaaab, baaaba, baaabb },
  Q_{P,1} = { abaaab, bbaaaa },
  Q_{P,2} = { aabaaa, abbaaa, babaaa }, and
  Q_{P,3} = { aaabaa, aabbaa, bbabaa }.

Set of q-grams Q_{P,d}
- Observation:
  - Q_{P,0} ∪ Q_{P,1} ∪ … ∪ Q_{P,k−1} contains all sampled q-grams which contain P (with its offset).
  - |Q_{P,d}| ≤ #occ for any 0 ≤ d < k.
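The sets Q_{P,d} in the example can be enumerated by brute force for checking (a validation sketch; the index never materializes them this way):

```python
def q_set(T, q, P, d):
    """All q-grams w of T with w[d .. d+|P|-1] = P (brute force)."""
    result = set()
    for i in range(len(T) - q + 1):
        w = T[i:i + q]
        if w[d:d + len(P)] == P:
            result.add(w)
    return result

T = "abbbabaaabaaabbaaaabaa"
for d in range(4):
    print(d, sorted(q_set(T, 6, "baa", d)))
# 0 ['baaaab', 'baaaba', 'baaabb']
# 1 ['abaaab', 'bbaaaa']
# 2 ['aabaaa', 'abbaaa', 'babaaa']
# 3 ['aaabaa', 'aabbaa', 'bbabaa']
```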

Basic strategy of our search algorithm
- To compute all occurrences of P in T, we incrementally compute Q_{P,0}, Q_{P,1}, …, Q_{P,k−1} and output occurrences of P whenever we encounter sampled q-grams in each Q_{P,d}.

q-gram transition graph
- To compute Q_{P,1}, …, Q_{P,k−1}, we consider a directed graph G = (Σ^q, E), which we call a q-gram transition graph. A q-gram transition graph is a subgraph of the de Bruijn graph of T such that the indegree of each vertex is at most 1.

q-gram transition graph
[Figure: the q-gram transition graph over the q-grams of T = abbbabaaabaaabbaaaabaa (r = 3, k = 4, q = 6). Note: we limit the indegree to at most 1, so one of the candidate edges is not constructed.]

q-gram transition graph
[Figure: the same graph with the positions of the sampled q-grams (0, 4, 8, 12, 16) attached to their vertices.]

Computing Q_{P,0}, …, Q_{P,k−1}
[Figure, shown over several animation steps: for P = baa (r = 3, k = 4, q = 6, T = abbbabaaabaaabbaaaabaa), starting from Q_{P,0} = {baaaab, baaaba, baaabb} and following graph edges yields Q_{P,1}, Q_{P,2} and Q_{P,3} in turn; the sampled q-grams encountered report occurrences. One candidate edge does not exist, therefore abaaab is enumerated only once.]

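The enumeration walked through above can be sketched end to end in Python. This is an illustrative reimplementation under stated assumptions: edges are kept first-come when enforcing indegree at most 1, Q_{P,0} is found by brute force here (the index uses the bit array B instead), and dictionaries stand in for the succinct tables:

```python
from collections import defaultdict

def build_graph(T, q):
    """q-gram transition graph: edge u -> v with v = c + u[:-1] for
    consecutive q-grams of T, keeping at most one incoming edge per vertex."""
    out = defaultdict(set)   # u -> set of characters c (one per outgoing edge)
    has_incoming = set()
    for j in range(len(T) - q):
        u, v = T[j + 1:j + 1 + q], T[j:j + q]
        if v not in has_incoming:
            out[u].add(v[0])
            has_incoming.add(v)
    return out

def q_levels(T, P, q, k, out):
    """Q_{P,0}, ..., Q_{P,k-1}, computed by following graph edges."""
    Q = {T[i:i + q] for i in range(len(T) - q + 1) if T[i:i + q].startswith(P)}
    levels = [Q]
    for _ in range(k - 1):
        Q = {c + w[:-1] for w in Q for c in out[w]}   # shift P one step right
        levels.append(Q)
    return levels

T = "abbbabaaabaaabbaaaabaa"
out = build_graph(T, 6)
levels = q_levels(T, "baa", 6, 4, out)
for d, Q in enumerate(levels):
    print(d, sorted(Q))
# 0 ['baaaab', 'baaaba', 'baaabb']
# 1 ['abaaab', 'bbaaaa']
# 2 ['aabaaa', 'abbaaa', 'babaaa']
# 3 ['aaabaa', 'aabbaa', 'bbabaa']
```

Any single incoming edge per vertex suffices: if P occurs at offset d+1 in v, then every u with u[0..q−2] = v[1..q−1] contains P at offset d, so v is reached from whichever u the construction kept.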


Computing Q_{P,0}
- Given pattern P, first we need to compute the source Q_{P,0} of the q-gram transition graph, i.e., all q-grams in T which begin with P.
- Consider all q-grams in lexicographical order. For any w ∈ Σ^q (not necessarily appearing in T), we denote by ⟨w⟩ the lexicographical rank of w.
- For any pattern P, there exists a single range [sp(P), ep(P)] such that a q-gram w begins with P iff sp(P) ≤ ⟨w⟩ ≤ ep(P). This range can be computed easily.
- Example (σ = 2, q = 6): the q-grams that begin with baa occupy the range sp(baa) = 32 to ep(baa) = 39:

  w      | ⟨w⟩
  aaaaaa | 0
  aaaaab | 1
  aaaaba | 2
  ...    | ...
  abbbbb | 31
  baaaaa | 32
  baaaab | 33
  baaaba | 34
  baaabb | 35
  ...    | ...
  baabbb | 39
  ...    | ...
  bbbbbb | 63

Computing Q_{P,0}
- Consider a bit array B of size σ^q such that B[⟨w⟩] = 1 iff w appears in T. Then w ∈ Q_{P,0} iff B[⟨w⟩] = 1 and sp(P) ≤ ⟨w⟩ ≤ ep(P).
- Hence we need to output all w such that B[⟨w⟩] = 1 and sp(P) ≤ ⟨w⟩ ≤ ep(P).
- In the running example, within [sp(baa), ep(baa)] = [32, 39] the positions with B = 1 are ⟨baaaab⟩ = 33, ⟨baaaba⟩ = 34 and ⟨baaabb⟩ = 35, giving Q_{P,0} = {baaaab, baaaba, baaabb}.
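The rank computation and the range [sp(P), ep(P)] can be sketched as follows, assuming the fixed character order a < b (any fixed order works); the explicit list for B is a toy stand-in for the succinct bit array:

```python
SIG = {"a": 0, "b": 1}   # assumed character order
SIGMA = len(SIG)

def rank(w):
    """Lexicographic rank of w among all strings over SIG of length |w|."""
    r = 0
    for c in w:
        r = r * SIGMA + SIG[c]
    return r

def sp_ep(P, q):
    """Rank range [sp(P), ep(P)] of the q-grams beginning with P."""
    base = SIGMA ** (q - len(P))
    return rank(P) * base, (rank(P) + 1) * base - 1

def source_set(T, P, q):
    """Q_{P,0} via the bit array B indexed by ranks of q-grams of T."""
    B = [0] * (SIGMA ** q)
    for i in range(len(T) - q + 1):
        B[rank(T[i:i + q])] = 1
    sp, ep = sp_ep(P, q)
    return sp, ep, [x for x in range(sp, ep + 1) if B[x]]

print(sp_ep("baa", 6))                                 # (32, 39)
print(source_set("abbbabaaabaaabbaaaabaa", "baa", 6))  # (32, 39, [33, 34, 35])
```

Ranks 33, 34 and 35 are exactly ⟨baaaab⟩, ⟨baaaba⟩ and ⟨baaabb⟩, matching Q_{P,0} in the example.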


Summary of our index
- We need to store:
  a) the q-gram transition graph,
  b) the bit array B[0.. σ^q − 1] for computing Q_{P,0}, and
  c) the positions of sampled q-grams.
- We can represent
  a) in O(σ^q log σ) bits,
  b) in σ^q + O(σ^q / ω) bits, and
  c) in (n / k + σ^q) log(n / k) bits.
- We can search any pattern in O(k × #occ + log_σ n) time.
  (n: length of T; σ: alphabet size; q: length of sampled substrings; k: sampling distance; ω: machine word size.)
I will explain these next.


Representation of (a)
- Since the q-gram transition graph is a subgraph of the de Bruijn graph, for each node u it is enough to store the character c such that v = c · u[0..q−2] if an edge (u, v) exists.
- Since the number of vertices is σ^q and the indegree of each vertex is at most 1, the number of edges is at most σ^q. We can represent this graph in O(σ^q log σ) bits by using some tables.


Representation of (b)
- Data structure (b) must output all w such that B[⟨w⟩] = 1 and sp(P) ≤ ⟨w⟩ ≤ ep(P).
- Using a fast successor data structure over the 1-bits of B, we can compute all such q-grams w.
- We need a dynamic successor data structure to support online updates to T.
- We could use a van Emde Boas tree, but it requires Θ(σ^q) words = Θ(σ^q ω) bits. We want to reduce the space.

Representation of (b)
- We present a succinct variant of the van Emde Boas tree.
- We divide B into blocks of size ω^h, where ω is the machine word size and h (> 1) is some constant integer.
- We maintain an ω-ary tree of height h (bottom tree) for each block, and a van Emde Boas tree (top tree) over the bottom trees.
[Figure: B split into blocks of size ω^h, an ω-ary tree of height h over each block, and a van Emde Boas tree over the block roots.]

Representation of (b): bottom tree
- Each bottom tree is a complete ω-ary tree.
- Each node has a bit array A of length ω such that A[j] = 1 iff the subtree of the j-th child of the node contains a 1.
[Figure: a block of size ω^h with the per-node bit arrays A.]

Representation of (b)
- Data structure (b) can be represented in σ^q + o(σ^q) bits: the bottom trees require σ^q + O(σ^q / ω) = σ^q + o(σ^q) bits and the top tree requires O(σ^q / ω^{h−1}) = o(σ^q) bits, assuming the machine word size ω = Θ(log n).
- Updates of a single bit in B and successor queries can be done in O(h + log log σ^q) = O(log log σ^q) time; if σ^q ≤ n, this is O(log log n) time.
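The bottom-tree mechanism (ω-bit summary bitmaps at every level; a successor query scans the current word, walks up while the word is exhausted, then walks back down following lowest set bits) can be sketched as a toy structure. Assumptions: Python integers stand in for machine words, plain lists replace the succinct layout, and the van Emde Boas top tree is omitted, so every level is a bitmap:

```python
class BitTree:
    """Multi-level bitmap successor structure: bit j of a node is 1 iff
    the j-th child subtree contains a set bit (cf. the bottom trees)."""

    def __init__(self, universe, w=64):
        self.w = w
        self.levels = []
        size = universe
        while True:
            self.levels.append([0] * ((size + w - 1) // w))
            if size <= w:
                break
            size = (size + w - 1) // w

    @staticmethod
    def _lsb(m):
        return (m & -m).bit_length() - 1   # index of lowest set bit

    def insert(self, x):
        for level in self.levels:
            word, bit = divmod(x, self.w)
            level[word] |= 1 << bit
            x = word

    def successor(self, x):
        """Smallest set position >= x, or None if there is none."""
        pos = x
        for d, level in enumerate(self.levels):
            word, bit = divmod(pos, self.w)
            if word >= len(level):
                return None
            m = level[word] >> bit
            if m:                                 # answer lies under this word
                pos = word * self.w + bit + self._lsb(m)
                for dd in range(d - 1, -1, -1):   # descend to level 0
                    pos = pos * self.w + self._lsb(self.levels[dd][pos])
                return pos
            pos = word + 1                        # word exhausted: go one level up
        return None

t = BitTree(64, w=4)   # tiny word size so all levels are exercised
for v in (3, 10, 33, 34, 35, 60):
    t.insert(v)
print(t.successor(0), t.successor(11), t.successor(36), t.successor(61))
# 3 33 60 None
```

With word size ω and height h the walk touches O(h) words per query, matching the O(h + log log σ^q) bound once the top tree handles the remaining levels.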

Complexities
- We represent each q-gram by an integer, and we do not store the original text T.
- We assume that σ = polylog(n), k ≥ 1, q = k + r − 1 and q ≤ log_σ n − log_σ log_σ n.
- If we choose k = Θ(log_σ n), then the space complexity is O(n log σ) bits, and hence our index is compact.

  Construction time | O(n)
  Searching time    | O(k × #occ + log_σ n)
  Space (in bits)   | (n / k + σ^q) log(n / k) + o(n)
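As a back-of-envelope check of the constraint q = k + r − 1 ≤ log_σ n − log_σ log_σ n, one can compute the largest admissible pattern length r for given n, σ and k (a sketch; the choice k = 3 below is an assumption for illustration, not a value taken from the paper):

```python
import math

def max_pattern_length(n, sigma, k):
    """Largest r with q = k + r - 1 <= log_sigma(n) - log_sigma(log_sigma(n))."""
    log_s = lambda x: math.log(x) / math.log(sigma)
    q_max = math.floor(log_s(n) - log_s(log_s(n)))
    return q_max - k + 1

# Human DNA scale (sigma = 4, n ~ 10^9): with k = 3 this gives r = 10,
# consistent with the "about 10" figure quoted in the conclusion.
print(max_pattern_length(10**9, 4, 3))  # 10
```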


Experimental results of construction
[Figure: construction time (seconds) vs. text size n (megabytes).]
Our index is the fastest to construct.


Experimental results of searching
[Figure: average search time over 100 patterns of length 6 (seconds) vs. text size n (megabytes).]
Ours is the fastest compact/compressed index to search.


Experimental results of memory usage
[Figure: memory usage (megabytes) vs. text size n (megabytes).]
Ours is much more space-efficient than the Dynamic FM-index.

Conclusion
- We proposed a q-gram based self-index for searching patterns of limited length. Our self-index:
  - is theoretically and practically efficient in terms of construction, updates (appending characters to the end of the text) and searches,
  - is compact, i.e., requires only O(n log σ) bits of space, where n is the text size and σ is the alphabet size, and
  - can be constructed in an online manner.
- When the text is a human DNA sequence (i.e., σ = 4 and n ≈ 10^9), the practical limit of the pattern length is about 10 for our index.
- Open question: can we further reduce the space complexity?