1 The SBC-Tree: An Index for Run-Length Compressed Sequences
Mohamed El-tabakh (1), Wing-Kia Hon (2), Rahul Shah (3), Walid Aref (1), Jeffrey Vitter (1)
(1) Department of Computer Science, Purdue University
(2) Department of Computer Science, National Tsing Hua University
(3) Department of Computer Science, Louisiana State University

2 Outline
- Introduction
- Related Work
- SBC-Tree Structure
- SBC-Tree Operations
- Theoretical and Experimental Analysis
- Summary

3 Introduction: Why Compression?
- We deal with massive amounts of data (scientific databases, …)
- Text and sequence formats are very common
- Compression techniques gain significant importance because they achieve:
  - Significant storage reduction
  - Reduced buffer requirements
  - Reduced number of I/Os
- Result: enhanced overall system performance

4 Introduction: Objective
- Current databases do not support data compression: they operate over the raw data, i.e., they store, index, and search the decompressed sequences
- Objective: compress the sequences, then store, index, and search the compressed sequences
- The main challenge is how to operate on the compressed data without decompressing it
- This is more challenging for external-memory processing

5 Processing Compressed Sequences: Related Work (1)
1. Processing compressed data in main memory
- A. Amir and G. Benson. Efficient two-dimensional compressed matching. In DCC, 1992.
- A. Amir, G. Benson, and M. Farach. Let sleeping files lie: pattern matching in Z-compressed files. In SODA, 1994.
- A. Apostolico, G. M. Landau, and S. Skiena. Matching for run-length encoded strings. Journal of Complexity, 1999.
- T. Bell, M. Powell, A. Mukherjee, and D. Adjeroh. Searching BWT compressed text with the Boyer-Moore algorithm and binary search. In DCC, 2002.
- V. Freschi and A. Bogliolo. Longest common subsequence between run-length-encoded strings: a new algorithm with improved parallelism. Information Processing Letters, 2004.
Summary:
- Searching compressed data is addressed in main memory
- Problems studied: substring matching, longest common subsequence, edit distance

6 Processing Compressed Sequences: Related Work (2)
2. Processing compressed data in DBMSs
- M. Stonebraker, D. Abadi, A. Batkin, X. Chen, M. Cherniack, M. Ferreira, E. Lau, A. Lin, S. Madden, E. O'Neil, P. O'Neil, A. Rasin, N. Tran, and S. Zdonik. C-store: A column oriented dbms. In VLDB, 2005.
- D. Abadi, S. Madden. Compression in Column Oriented Databases. In SIGMOD, 2006.
Summary:
- Run-length encoding of a column in a database table: the value 20 repeated 100 times is stored as (20, 100)
- Operations such as SUM can be applied directly over the compressed data
- More complex operations have not been addressed yet:
  - Indexing RLE-compressed sequences
  - Substring searching
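
To illustrate how an aggregate such as SUM can run directly over a run-length encoded column, here is a minimal Python sketch. The `(value, count)` pair layout follows the (20, 100) example on the slide; the function name `rle_sum` is an assumption made for illustration and is not taken from C-Store.

```python
def rle_sum(runs):
    """Sum a column stored as (value, count) runs without expanding it."""
    # Each run contributes value * count, so the cost is one step per run,
    # not one step per original row.
    return sum(value * count for value, count in runs)


# The slide's example: the value 20 repeated 100 times.
column = [(20, 100)]
total = rle_sum(column)

# Cross-check against the decompressed column.
decompressed = [value for value, count in column for _ in range(count)]
assert total == sum(decompressed)
```

The work is proportional to the number of runs rather than the number of rows, which is the saving the slide refers to.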

7 What is SBC-Tree?
- SBC-Tree: String B-tree for Compressed sequences
- An index for Run-Length Encoding (RLE) compressed sequences
- Supports prefix, range, and substring matching
- Optimal theoretical bounds, relative to the size of the compressed sequences, for:
  - External memory space complexity
  - Search I/O requirements

8 SBC-Tree: An Index for RLE Compressed Sequences
Run-Length Encoding (RLE):
- Replaces tandem repeated characters with the character and its frequency
- Effective with small alphabets
Example:
  S = LLLLLLLLLLEEEEEELLLLEEEHHHHHHHHHHHHHHHHHH
  RLE(S) = L10 E6 L4 E3 H18 (each item, such as L10, is an RLE-char)
- S has 41 suffixes
- RLE(S) has 5 RLE-suffixes:
    L10 E6 L4 E3 H18
    E6 L4 E3 H18
    L4 E3 H18
    E3 H18
    H18
Approach:
1. Store the compressed sequences
2. Index the RLE-suffixes
3. Perform substring operations efficiently
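
The encoding and the RLE-suffixes above can be reproduced with a few lines of Python. This is only a sketch: the helper names `rle_encode`, `rle_suffixes`, and `show` are assumptions introduced here, and runs are represented as `(char, count)` tuples.

```python
from itertools import groupby


def rle_encode(text):
    """Collapse each run of a repeated character into a (char, count) pair."""
    return [(ch, sum(1 for _ in group)) for ch, group in groupby(text)]


def rle_suffixes(runs):
    """Return one suffix per RLE-char, i.e. one per run boundary."""
    return [runs[i:] for i in range(len(runs))]


def show(runs):
    """Format runs in the slide's notation, e.g. 'L10 E6'."""
    return " ".join(f"{ch}{count}" for ch, count in runs)


S = "L" * 10 + "E" * 6 + "L" * 4 + "E" * 3 + "H" * 18
runs = rle_encode(S)
for suffix in rle_suffixes(runs):
    print(show(suffix))
```

The uncompressed string has one suffix per character (41), while the RLE form has one suffix per run (5).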

9 SBC-tree Structure
Two-level index structure:
- String B-tree: indexes the RLE-suffixes; a numeric tag is assigned to each suffix
- Two-dimensional index (e.g., R-tree): built on top of the leaves of the String B-tree
[Figure: a two-dimensional index (e.g., R-tree) over the axes "Tags" and "Preceding character", built above the String B-tree and its root]

10 String B-tree Overview [P. Ferragina and R. Grossi, Journal of the ACM, 1999]
  S = LLLLLLLLLLEEEEEELLLLEEEHHHHHHHHHHHHHHHHHH
1. Generate all suffixes of S
2. Insert the suffixes into the String B-tree (ordered alphabetically)
3. Store logical pointers, e.g., (S,21), (S,12), (S,11), (S,13), instead of the key sequences
4. Several optimizations achieve optimal theoretical bounds, relative to the size of the raw (decompressed) sequences, for:
   - External memory space complexity
   - Search I/O requirements
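
The idea of storing logical pointers instead of the suffix strings can be sketched in memory as follows. This is not a String B-tree: a plain sorted list of offsets (a suffix array) stands in for the tree, offsets are 0-based here, and the names `build_suffix_pointers` and `find_occurrences` are assumptions made for illustration.

```python
from bisect import bisect_left, bisect_right


def build_suffix_pointers(text):
    """Return suffix start offsets, ordered alphabetically by suffix.

    Only the offsets are kept; the suffix text is read from `text` on demand.
    """
    return sorted(range(len(text)), key=lambda i: text[i:])


def find_occurrences(text, pointers, pattern):
    """Return all offsets where `pattern` occurs, via binary search."""
    m = len(pattern)
    # Dereference each pointer to its first m characters. A real index would
    # only dereference the pointers visited by the binary search.
    prefixes = [text[i:i + m] for i in pointers]
    lo = bisect_left(prefixes, pattern)
    hi = bisect_right(prefixes, pattern)
    return sorted(pointers[lo:hi])


S = "L" * 10 + "E" * 6 + "L" * 4 + "E" * 3 + "H" * 18
pointers = build_suffix_pointers(S)
print(find_occurrences(S, pointers, "LE"))  # [9, 19]
```

Because the suffixes are sorted, every suffix starting with the pattern sits in one contiguous block, which is what makes the search cost predictable.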

11 String B-tree over RLE-suffixes
- The String B-tree CANNOT be used directly to index RLE-suffixes
- The RLE-suffixes are a subset of the total suffixes, so only that subset is indexed
RLE-suffixes of L10 E6 L4 E3 H18:
    L10 E6 L4 E3 H18
    E6 L4 E3 H18
    L4 E3 H18
    E3 H18
    H18
Searches:
- Searching for “L10 E6 L3” → Found
- Searching for “L5 E6 L3” → Not Found (the match is implicit in L10 E6 L4 E3 H18)
- Searching for “E3 L4” → Not Found (the match is implicit in E6 L4 E3 H18)
[Figure: String B-tree over the five RLE-suffixes, showing their sort order and logical pointers such as (S,6), (S,8), (S,4), (S,1), (S,10)]

12 SBC-Tree over RLE-suffixes
Query Pattern Mapping Rule: a substring query pattern $P = x_1^{f_1} x_2^{f_2} \dots x_n^{f_n}$ is mapped into $P' = x_1^{f_1+} x_2^{f_2} \dots x_n^{f_n}$, where $f_1+$ stands for a frequency of $f_1$ or more.
RLE-suffixes:
    L10 E6 L4 E3 H18
    E6 L4 E3 H18
    L4 E3 H18
    E3 H18
    H18
- Searching for “L5 E6 L3” → (L5+ E6 L3) → Found
- Searching for “E3 L4” → (E3+ L4) → Found
Challenge: the answer set is no longer consecutive in the index tree → unbounded number of I/Os to answer a query
[Figure: for the query L5+ E6 L3, the matching entries L5 E6 L3 and L6 E6 L3 are separated in the index order by L5 H2 and L5 K10 L3, which are not part of the query answer]
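
The effect of the mapping rule can be checked with a small linear-scan matcher over the RLE-suffixes. This is a sketch for illustration only, with no index involved; the function names and the `relax_first` flag are assumptions introduced here. With `relax_first=False` the first run must match exactly (plain prefix matching over RLE-suffixes); with `relax_first=True` the first run may be longer than requested, as in the mapped pattern.

```python
def is_prefix(suffix, query, relax_first):
    """Check whether `query` matches at the start of an RLE-suffix.

    Runs are (char, count) pairs. The last query run may be shorter than the
    suffix run (prefix semantics); interior runs must match exactly.
    """
    if len(query) > len(suffix):
        return False
    last = len(query) - 1
    for j, ((qc, qf), (sc, sf)) in enumerate(zip(query, suffix)):
        if qc != sc:
            return False
        if j == last or (j == 0 and relax_first):
            if sf < qf:          # needs at least qf repetitions
                return False
        elif sf != qf:           # needs exactly qf repetitions
            return False
    return True


def find(runs, query, relax_first):
    """Return the indices of the runs at which `query` matches."""
    return [i for i in range(len(runs))
            if is_prefix(runs[i:], query, relax_first)]


S = [("L", 10), ("E", 6), ("L", 4), ("E", 3), ("H", 18)]
q1 = [("L", 5), ("E", 6), ("L", 3)]
q2 = [("E", 3), ("L", 4)]
print(find(S, q1, relax_first=False), find(S, q1, relax_first=True))
print(find(S, q2, relax_first=False), find(S, q2, relax_first=True))
```

Both queries are missed by exact matching on the first run and found once the first run is relaxed to "at least $f_1$".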

13 SBC-tree: Insertion Procedure
Given an RLE sequence $S = \Omega^{1} x_1^{f_1} x_2^{f_2} \dots x_n^{f_n}$:
1. Insert S as the first suffix into the SBC-tree first level
2. For each $1 \le i \le n$, insert the RLE-suffix $x_i^{1} x_{i+1}^{f_{i+1}} \dots x_n^{f_n}$ into the SBC-tree first level and assign it a position tag $T$ (the tag assignment problem)
3. Insert the point $(T, f_i)$ into the SBC-tree second level

14 SBC-tree: Substring Searching
Given a query $Q = y_1^{f_1} y_2^{f_2} \dots y_m^{f_m}$:
1. Map Q into $Q' = y_1^{f_1+} y_2^{f_2} \dots y_m^{f_m}$
2. Search the String B-tree for $Q'' = y_1^{1} y_2^{f_2} \dots y_m^{f_m}$; this returns (min_tag, max_tag) as a contiguous range
3. Search the SBC-tree second level for suffixes with tags in that range and frequency $\ge f_1$
[Figure: the String B-tree yields the range Min_tag to Max_tag; in the two-dimensional index (axes: suffix tag, preceding RLE-char), the answer set is the part of that range at or above $f_1$]
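
The two-step search can be sketched in memory as below. This is a rough stand-in under stated assumptions, not the SBC-tree itself: a sorted Python list replaces the String B-tree, a linear filter over the tag range replaces the two-dimensional index, keys are decompressed only to define the sort order, the leading $\Omega$ entry of the insertion procedure is omitted, and the names `build_index` and `search` are hypothetical.

```python
from bisect import bisect_left


def build_index(runs):
    """Index each RLE-suffix with its first run length set to 1.

    Returns entries (key, f_i, i) in sorted order; an entry's position in the
    list plays the role of its tag, and f_i is the second coordinate.
    """
    entries = []
    for i, (ch, f) in enumerate(runs):
        # Key for x_i^1 x_{i+1}^{f_{i+1}} ... x_n^{f_n} (decompressed form).
        key = ch + "".join(c * n for c, n in runs[i + 1:])
        entries.append((key, f, i))
    entries.sort()
    return entries


def search(entries, query):
    """Return the run indices at which the RLE query occurs as a substring."""
    (y1, f1), rest = query[0], query[1:]
    # Q'': first run length replaced by 1.
    prefix = y1 + "".join(c * n for c, n in rest)
    keys = [key for key, _, _ in entries]
    # Step 1: contiguous tag range [lo, hi) of keys starting with Q''.
    lo = bisect_left(keys, prefix)
    hi = bisect_left(keys, prefix + "\U0010FFFF")
    # Step 2: within that range, keep entries whose frequency is >= f1.
    return sorted(i for _, f, i in entries[lo:hi] if f >= f1)


S = [("L", 10), ("E", 6), ("L", 4), ("E", 3), ("H", 18)]
index = build_index(S)
print(search(index, [("L", 5), ("E", 6), ("L", 3)]))  # [0]
print(search(index, [("E", 3), ("L", 4)]))            # [1]
```

Setting the first run length to 1 is what keeps all candidate suffixes contiguous in the sorted order; the frequency condition is then a separate filter, which the slides delegate to the second-level index.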

15 SBC-Tree: Example
  P = A5 E3 B4
  P’ = A5+ E3 B4
  P’’ = A1 E3 B4

16 SBC-Tree Variants
3-sided structure [L. Arge, V. Samoladas, J. Vitter, PODS99]
- External memory structure based on the priority search tree and the B-tree
- Answers 3-sided range queries in 2D space
- Provides optimal worst-case theoretical bounds for:
  - External memory space complexity
  - Insertion and deletion
  - 3-sided range queries
R-tree
- Available in all DBMSs
- Provides good performance in practice
- Does not have worst-case theoretical bounds for searching
One-Level SBC-tree
- Removes the second-level structure
- Disadvantage: queries may scan many tuples outside the answer set

17 SBC-tree Theoretical Bounds
- Optimal external-memory space complexity: $O(N/B)$
- Optimal substring, prefix, and range searching in $O(\log_B N + (|p| + T)/B)$ I/O operations
- Insertion and deletion in $O(m \log_B (N + m))$ amortized I/O operations

Parameter   Definition
B           Disk page size
N           Total length of the RLE-compressed sequences
T           Query output size
|p|         Length of the RLE-compressed query pattern
m           Length of the RLE-compressed sequence to be inserted or deleted

18 SBC-tree Implementation
- The SBC-tree (R-tree variant) is implemented inside PostgreSQL
- Query operators:
  - ^^ (substring search)
  - @@ (prefix search)
  - == > (range search)
Example:
    CREATE TABLE sequences (id INT, RLE_seq VARCHAR);
    CREATE INDEX ON sequences USING sbctree (RLE_seq);
    SELECT id FROM sequences WHERE RLE_seq ^^ 'A5H7N2';
Here ^^ is the substring searching operator.

19 SBC-tree Performance Analysis: Storage Requirements
- Compares SBC-tree performance relative to the String B-tree over uncompressed sequences
- Up to an order of magnitude saving in storage
Datasets:
- SwissProt (protein secondary structure), alphabet size = 3
- WalMart (sales profile time series), alphabet size = 5
- Temperature (time series of sensor readings), alphabet size = 52

20 SBC-tree Performance Analysis: Insertion
- Compares SBC-tree performance relative to the String B-tree over uncompressed sequences
- Around 30% saving in insertion
Datasets:
- SwissProt (protein secondary structure), alphabet size = 3
- WalMart (sales profile time series), alphabet size = 5
- Temperature (time series of sensor readings), alphabet size = 52

21 SBC-tree Performance Analysis: Searching
- Compares SBC-tree performance relative to the String B-tree over uncompressed sequences
- Retains the optimal search performance (only the query answer is retrieved)
- Some additional overhead because of the two-level structure
Datasets:
- SwissProt (protein secondary structure), alphabet size = 3
- WalMart (sales profile time series), alphabet size = 5
- Temperature (time series of sensor readings), alphabet size = 52

22 Summary
- Addressed the challenge of storing and operating on compressed data inside DBMSs without decompression
- Introduced the SBC-tree as an index for Run-Length Encoded (RLE) compressed sequences
- The SBC-tree has optimal theoretical bounds for:
  - External memory space complexity
  - Search I/O requirements
- Implemented inside PostgreSQL

23 Thank you
Mohamed Eltabakh (meltabak@cs.purdue.edu)

