1 A Lempel-Ziv text index on secondary storage
Diego Arroyuelo and Gonzalo Navarro
Combinatorial Pattern Matching 2007
2 Introduction
The full-text searching problem: to find all the occ occurrences of a pattern P[1..m] in a text T[1..u] (both over an alphabet of size σ)
We are interested in indexed text searching: an index on T allows us to find the pattern occurrences quickly
[Figure: text T, pattern P, index]
In our work the index:
- replaces the text (self-indexing)
- is compressed (LZ) (compression + search)
3 Applications and goals
Main applications of text searching:
- Computational Biology (DNA and protein sequences)
- Oriental language texts (Japanese, Chinese, Korean, etc.)
- “Natural language” texts (English, Spanish, etc.)
- Music (MIDI pitch sequences)
- Program code
- Etc.
Compressed self-indexes:
- Reduce the space requirement (not storing the text + compressing)
- Are useful in cases where accessing the text is expensive (for example, web search engines)
4 Motivations
The use of a compressed self-index may totally remove the need to use the disk
However…
- Huge texts
- Sequential text searching + compression
- Compressed self-indexes improve disk performance
  - More disk accesses but smaller seek time
5 Motivations
By reducing the space of the index we aim at:
- Saving disk space (important for storage media of limited size)
- Reducing the seek time when searching (because the index is smaller)
6 Model of computation
We assume a model of computation where:
- A disk page of size B can be transferred to main memory in a single disk access
- We can hold a constant number of disk pages in main memory
- We count every disk access
- The text is static
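To make this cost model concrete, here is a minimal sketch of a disk that counts page transfers. The class `PagedDisk`, the helper `scan`, the sample data, and the page size of 8,192 items are all hypothetical choices made for illustration; they are not taken from the original work.

```python
class PagedDisk:
    """Hypothetical disk model: data is read one page of B items at a time."""

    def __init__(self, data, page_size):
        self.data = data
        self.B = page_size
        self.accesses = 0          # every page transfer is counted

    def read_page(self, page_no):
        self.accesses += 1
        return self.data[page_no * self.B:(page_no + 1) * self.B]


def scan(disk, start, length):
    """Read `length` (> 0) consecutive items starting at position `start`."""
    out = []
    first, last = start // disk.B, (start + length - 1) // disk.B
    for p in range(first, last + 1):
        page = disk.read_page(p)
        # Clip the requested range to the part that falls inside page p.
        lo = max(start, p * disk.B) - p * disk.B
        hi = min(start + length, (p + 1) * disk.B) - p * disk.B
        out.extend(page[lo:hi])
    return out


disk = PagedDisk(list(range(20000)), page_size=8192)
items = scan(disk, 100, 1000)      # 1,000 contiguous items inside one page
sequential_cost = disk.accesses

disk.accesses = 0
for pos in range(0, 20000, 20):    # 1,000 scattered items, one access each
    disk.read_page(pos // disk.B)  # no caching is assumed between reads
random_cost = disk.accesses
```

Under these assumptions the contiguous read costs a single access, while the scattered reads cost one access per item. This is why layouts that turn random accesses into sequential scans matter in this model.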
7 Related Works
- String B-trees [FG, JACM 1999]: 3–4 times text size
- Compact Pat Trees [CM, SODA 1996]: 5–6 times text size
- Compressed Suffix Arrays [MNS, ISAAC 2003]:
  - About 0.25–0.5 times text size
  - 2(1 + m · log_B u) accesses for counting
  - O(log u) extra accesses per occurrence!
Can we define a small and efficient index on secondary storage?
8 Searching LZ78 compressed texts: the LZ-index
[Figure: LZTrie and RevTrie]
Different types of occurrences…
LZ78 parses the text into phrases
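As a rough illustration of how LZ78 parses a text into phrases, the sketch below builds the phrase list together with a dictionary-based trie. The function name `lz78_parse` and the dictionary representation are assumptions made for this example; the actual LZ-index uses a compact trie encoding, not Python dictionaries.

```python
def lz78_parse(text):
    """Split `text` into LZ78 phrases.

    Each new phrase is the longest previously seen phrase plus one symbol.
    Returns the phrase list (phrase 0 is the empty phrase) and a trie given
    as {node id: {symbol: child id}}, where node ids equal phrase ids.
    """
    trie = {0: {}}
    phrases = [""]
    node, start = 0, 0
    for i, c in enumerate(text):
        if c in trie[node]:
            node = trie[node][c]          # keep extending the current phrase
        else:
            new_id = len(phrases)
            trie[node][c] = new_id        # new trie leaf for the new phrase
            trie[new_id] = {}
            phrases.append(text[start:i + 1])
            node, start = 0, i + 1
    if start < len(text):
        # Leftover suffix: it repeats an earlier phrase, so no new trie node.
        phrases.append(text[start:])
    return phrases, trie


phrases, trie = lz78_parse("abababababab")
```

For this input the phrases are a, b, ab, aba, ba, bab. Every phrase extends an earlier phrase by one symbol, which is what makes the phrase set prefix-closed and representable as a trie. A common convention, assumed here rather than shown, is to append a unique terminator to the text so that the last phrase is never a repeat.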
9 Occurrences of Type 1
Occurrences contained in a single phrase
[Figure: LZTrie, with the subtrees containing occurrences of type 1]
Shortest possible LZ78 phrases containing P
By LZ78, P is a suffix of such phrases
10 Occurrences of Type 1
Occurrences contained in a single phrase
As P is a suffix of such phrases, P^r is a prefix of the corresponding reverse phrases
We need the Reverse Trie (RevTrie) to solve this problem
[Figure: P^r in RevTrie; P in LZTrie]
navigation between tries!
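The relation between suffixes and reverse prefixes can be checked with a brute-force sketch over a small phrase list. The list below is the LZ78 parse of the sample text "abababababab"; the function names are hypothetical, and the linear scans stand in for the trie searches that the real index performs on RevTrie and LZTrie.

```python
# LZ78 phrases of the sample text "abababababab" (phrase 0 is empty).
PHRASES = ["", "a", "b", "ab", "aba", "ba", "bab"]


def phrases_ending_with(phrases, pattern):
    """Phrases having P as a suffix, found as reverse phrases with prefix P^r."""
    rev_pattern = pattern[::-1]
    return [i for i, ph in enumerate(phrases)
            if ph[::-1].startswith(rev_pattern)]


def type1_phrases(phrases, pattern):
    """Phrases containing P: those ending with P plus their trie descendants."""
    heads = set(phrases_ending_with(phrases, pattern))
    ids = {ph: i for i, ph in enumerate(phrases)}   # assumes distinct phrases
    result = []
    for i, ph in enumerate(phrases):
        # Every prefix of an LZ78 phrase is itself a phrase (an ancestor in
        # LZTrie), so ph contains P iff one of its prefixes ends with P.
        if any(ids[ph[:j]] in heads
               for j in range(len(pattern), len(ph) + 1)):
            result.append(i)
    return result


heads = phrases_ending_with(PHRASES, "ab")
containing = type1_phrases(PHRASES, "ab")
```

Here `heads` identifies the roots of the LZTrie subtrees, and `containing` lists every phrase in those subtrees. Going from a RevTrie node to the matching LZTrie node is the navigation step the slide refers to.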
11 Occurrences of Type 2
Occurrences spanning two consecutive phrases
- Phrases ending with P1 (phrase k-1)
- Phrases starting with P2 (phrase k)
[Figure: P1^r in RevTrie, P2 in LZTrie, phrases k-1 and k, RNode, Node]
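A brute-force sketch of the type 2 search: try every split of P into a nonempty P1 and P2, and look for consecutive phrases k-1 and k such that phrase k-1 ends with P1 and phrase k starts with P2. The function name and the sample phrase list (the LZ78 parse of "abababababab") are assumptions for illustration; the real index replaces the inner scan with searches for P1^r in RevTrie and P2 in LZTrie.

```python
# LZ78 phrases of the sample text "abababababab" (phrase 0 is empty).
PHRASES = ["", "a", "b", "ab", "aba", "ba", "bab"]


def type2_occurrences(phrases, pattern):
    """Return (k-1, k, split) triples for occurrences spanning two phrases."""
    occ = []
    for split in range(1, len(pattern)):
        p1, p2 = pattern[:split], pattern[split:]
        for k in range(2, len(phrases)):
            if phrases[k - 1].endswith(p1) and phrases[k].startswith(p2):
                occ.append((k - 1, k, split))
    return occ


def text_position(phrases, k_prev, split):
    """Text position where the occurrence starts inside phrase k_prev."""
    end_of_prev = sum(len(ph) for ph in phrases[:k_prev + 1])
    return end_of_prev - split


occ = type2_occurrences(PHRASES, "bab")
positions = [text_position(PHRASES, k_prev, s) for k_prev, _, s in occ]
```

Occurrences that lie inside a single phrase, or that span three or more phrases, are not reported by this sketch; they fall under the other occurrence types.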
12 Occurrences of Type 3
Occurrences spanning more than two consecutive phrases
- O(m^2) occurrences of type 3 in the worst case
- O(m^2) random accesses in the worst case
13 The LZ-index
A compressed full-text self-index based on the LZTrie [Navarro, JDA 2004]
Four data structures compose the LZ-index:
- LZTrie: the trie formed by all the LZ78 phrases B_0, …, B_n
- RevTrie: the trie formed by all the reverse LZ78 phrases B_0^r, …, B_n^r
- Node: a mapping from phrase identifiers to their node in LZTrie
- RNode: a mapping from phrase identifiers to their node in RevTrie
Overall: the LZ-index requires 4n log n (1 + o(1)) = 4uH_k + o(u log σ) bits, for k = o(log u)
We don’t need to store the text!
14 The LZ-index on secondary storage
The LZ-index was originally designed for main memory
- It has a non-regular pattern of access to the index components
We define a version of the LZ-index for secondary storage
We divide the problem as follows:
- Solving the basic trie operations
- Reducing the navigation between structures
15 Solving the basic trie operations
We cut the tries into disjoint blocks of size at most B, using the Clark and Munro strategy
- Every block stores a subtree of the whole trie
- We arrange these blocks in a tree by adding inter-block pointers
We are able to compute parent(x), child(x, a), depth(x), subtreesize(x), preorder(x), and ancestor(x, y) with one extra disk access in the worst case
16 Reducing the navigation between structures
Occurrences contained in a single phrase
[Figure: P^r in RevTrie; P in LZTrie]
We avoid random accesses to report only one occurrence
We would need a data structure able to find all these subtrees without random accesses
For counting...
17 Reducing the navigation between structures
Occurrences spanning two consecutive phrases
[Figure: P1^r in RevTrie, P2 in LZTrie, nodes y and y’, phrases k-1 and k, LR mapping]
18 Reducing the navigation between structures
We add some redundancy to reduce the number of accesses between index components
- Many random accesses now become a single access + sequential scanning (please read the paper for other technical details)
- The overall space requirement is 8uH_k + o(u log σ) bits, for any k = o(log u)
- The space can be dropped to 6uH_k + o(u log σ) bits if we only need to count pattern occurrences
19 Experimental results
We indexed:
- XML file from the Pizza&Chili Corpus (200 megabytes) (http://pizzachili.dcc.uchile.cl)
We searched for 5,000 random patterns
- count and locate queries
We assume a disk page of 32 kilobytes (i.e., 8,192 integers of 32 bits)
20 Experimental results
We compared against:
- Suffix Arrays for secondary storage: the two-level hierarchy of [BYBZ, 1996]
- String B-trees: we use the model provided in [FG, 1996]
- Compact Pat Trees (CPT) [CM, 1996]
21 Experimental results (count)
[Plot comparing LZ-index, String B-trees, Suffix Array, and CPT]
3.3 times smaller than String B-trees
22 Experimental results (count)
[Plot comparing LZ-index, String B-trees, Suffix Array, and CPT]
23 Experimental results (locate)
[Plot comparing LZ-index, String B-trees, Suffix Array, and CPT]
2.6 times smaller than String B-trees
Average number of accesses to report the first occurrence:
- LZ-index: 11
- String B-trees: 12
24 Experimental results (locate)
[Plot comparing LZ-index, String B-trees, Suffix Array, and CPT]
25 Conclusions
- The LZ-index can be adapted to work on secondary storage
  - Requiring up to 8uH_k + o(u log σ) bits, for any k = o(log u)
- Our index is significantly smaller than any other practical secondary-memory data structure
- The LZ-index requires more disk accesses
  - But a smaller index would have a smaller seek time
26 Future work
- We assumed a constant main-memory space, but…
- To implement our index in a real practical setting
- Handling dynamism (String B-trees require 13.5 times the text size!)
- Direct construction on secondary storage
  - adapting [AN, ISAAC 2005] to work on disk
27 Questions?
Contact: darroyue@dcc.uchile.cl, gnavarro@dcc.uchile.cl
28 Thanks!
Contact: darroyue@dcc.uchile.cl, gnavarro@dcc.uchile.cl