The SBC-Tree: An Index for Run-Length Compressed Sequences
Mohamed Eltabakh (1), Wing-Kai Hon (2), Rahul Shah (3), Walid Aref (1), Jeffrey Vitter (1)
(1) Department of Computer Science, Purdue University
(2) Department of Computer Science, National Tsing Hua University
(3) Department of Computer Science, Louisiana State University

Outline
  Introduction
  Related Work
  SBC-Tree Structure
  SBC-Tree Operations
  Theoretical and Experimental Analysis
  Summary

Introduction: Why Compression?
  We deal with massive amounts of data (scientific databases, …).
  Text and sequence formats are very common.
  Compression techniques are gaining importance because they achieve:
    Significant storage reduction
    Reduced buffer requirements
    Fewer I/Os
  >>> Improved overall system performance

Introduction: Objective
  Current databases do not support data compression: they operate over the raw data, that is, they store, index, and search the decompressed sequences.
  Objective: compress the sequences, then store, index, and search the compressed sequences.
  The main challenge is how to operate on the compressed data without decompressing it.
  This is more challenging for external-memory processing.

Processing Compressed Sequences: Related Work (1)
1. Processing compressed data in main memory
  A. Amir and G. Benson. Efficient two-dimensional compressed matching. In DCC.
  A. Amir, G. Benson, and M. Farach. Let sleeping files lie: pattern matching in Z-compressed files. In SODA.
  A. Apostolico, G. M. Landau, and S. Skiena. Matching for run-length encoded strings. Journal of Complexity, 1999.
  T. Bell, M. Powell, A. Mukherjee, and D. Adjeroh. Searching BWT compressed text with the Boyer-Moore algorithm and binary search. In DCC.
  V. Freschi and A. Bogliolo. Longest common subsequence between run-length-encoded strings: a new algorithm with improved parallelism. Information Processing Letters.
  Searching compressed data is addressed in main memory: substring matching, longest common subsequence, edit distance.

Processing Compressed Sequences: Related Work (2)
2. Processing compressed data in DBMSs
  M. Stonebraker, D. Abadi, A. Batkin, X. Chen, M. Cherniack, M. Ferreira, E. Lau, A. Lin, S. Madden, E. O'Neil, P. O'Neil, A. Rasin, N. Tran, and S. Zdonik. C-Store: A column-oriented DBMS. In VLDB.
  D. Abadi and S. Madden. Compression in Column Oriented Databases. In SIGMOD.
  [Figure: a column in a database table compressed with run-length encoding; the example pair shown is (20, 100).]
  Operations such as SUM can be applied directly over the compressed data.
  More complex operations have not been addressed yet:
    Indexing RLE-compressed sequences
    Substring searching

What is the SBC-Tree?
  SBC-Tree (String B-tree for Compressed sequences)
  An index for Run-Length Encoding (RLE) compressed sequences
  Supports prefix, range, and substring matching
  Optimal theoretical bounds for:
    External-memory space complexity
    Search I/O requirements
  >> Relative to the size of the compressed sequences

SBC-Tree: An Index for RLE Compressed Sequences
  Run-Length Encoding (RLE): replace each run of consecutively repeated characters with the character and its run length. Effective with small alphabets.
  S = LLLLLLLLLLEEEEEELLLLEEEHHHHHHHHHHHHHHHHHH
  RLE(S) = L10 E6 L4 E3 H18 (each element, such as L10, is an RLE-char)
  >> S has 41 suffixes
  >> RLE(S) has 5 RLE-suffixes:
    L10 E6 L4 E3 H18
    E6 L4 E3 H18
    L4 E3 H18
    E3 H18
    H18
  1. Store the compressed sequences
  2. Index the RLE-suffixes
  3. Efficiently perform substring operations
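
The encoding and the RLE-suffixes on this slide can be reproduced with a short Python sketch. The function names (`rle_encode`, `rle_suffixes`, `show`) are hypothetical and chosen only for illustration; the slides do not prescribe an implementation.

```python
from itertools import groupby


def rle_encode(s):
    """Return the run-length encoding of s as a list of (char, count) pairs."""
    return [(ch, len(list(run))) for ch, run in groupby(s)]


def rle_suffixes(runs):
    """Return the suffixes of an RLE sequence that start at a run boundary."""
    return [runs[i:] for i in range(len(runs))]


def show(runs):
    """Format runs in the slide's notation, e.g. 'L10 E6'."""
    return " ".join(f"{ch}{n}" for ch, n in runs)


# The example sequence from the slide, built from its run lengths.
S = "L" * 10 + "E" * 6 + "L" * 4 + "E" * 3 + "H" * 18
runs = rle_encode(S)

print(len(S), "suffixes in S")        # one suffix per character of S
print(show(runs))                      # the RLE form of S
for suffix in rle_suffixes(runs):      # one RLE-suffix per run
    print(show(suffix))
```

The point of the comparison is the index size: the number of indexed suffixes drops from the length of the raw sequence to the number of runs.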

SBC-tree Structure
  Two-level index structure:
    String B-tree: indexes the RLE-suffixes; a numeric tag is assigned to each suffix.
    Two-dimensional index (e.g., R-tree): built on top of the leaves of the String B-tree.
  [Figure: the String B-tree (root and leaves) beneath a two-dimensional index whose axes are the suffix tags and the preceding character.]

String B-tree Overview [P. Ferragina and R. Grossi, Journal of the ACM, 1999]
  S = LLLLLLLLLLEEEEEELLLLEEEHHHHHHHHHHHHHHHHHH
  1. Generate all suffixes of S.
  2. Insert the suffixes into the String B-tree (ordered alphabetically).
  3. Store logical pointers instead of the key sequences themselves, e.g., (S,21), (S,12), (S,11), (S,13).
  4. Several optimizations achieve optimal theoretical bounds for:
    External-memory space complexity
    Search I/O requirements
  >> Relative to the size of the raw (decompressed) sequences

String B-tree over RLE-suffixes
  The String B-tree CANNOT be used directly to index RLE-suffixes, because the RLE-suffixes are only a subset of all suffixes.
  [Figure: a String B-tree over the RLE-suffixes of L10 E6 L4 E3 H18, with logical pointers (S,1), (S,4), (S,6), (S,8), (S,10).]
  RLE-suffixes:
    L10 E6 L4 E3 H18
    E6 L4 E3 H18
    L4 E3 H18
    E3 H18
    H18
  Searching for "L10 E6 L3" → Found
  Searching for "L5 E6 L3" → Not Found, although it is implicit in L10 E6 L4 E3 H18
  Searching for "E3 L4" → Not Found, although it is implicit in E6 L4 E3 H18

SBC-Tree over RLE-suffixes
  Query Pattern Mapping Rule: a substring query pattern P = x_1 f_1 x_2 f_2 … x_n f_n is mapped into P' = x_1 f_1+ x_2 f_2 … x_n f_n, where f_1+ matches any frequency of at least f_1.
  RLE-suffixes:
    L10 E6 L4 E3 H18
    E6 L4 E3 H18
    L4 E3 H18
    E3 H18
    H18
  Searching for "L5 E6 L3" → (L5+ E6 L3) → Found
  Searching for "E3 L4" → (E3+ L4) → Found
  Challenge: the answer set is no longer consecutive in the index tree → unbounded number of I/Os to answer a query.
  [Figure: for the query L5+ E6 L3, entries such as L5 H2 and L5 K10 L3, which are not part of the query answer, fall between matching entries such as L5 E6 L3 and L6 E6 L3.]
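
The mapping rule can be checked against the slide's examples with a brute-force scan over the runs. This is only a sketch of the matching semantics, not of the index: the function names are hypothetical, and treating the last pattern run as "at least" (so that L3 matches inside L4) is an assumption inferred from the slide's examples.

```python
def matches_at(runs, i, pattern):
    """True if the RLE pattern occurs with its first run inside runs[i].

    First run: same character, frequency >= pattern frequency (the f_1+ rule).
    Middle runs: must match exactly.
    Last run: same character, frequency >= pattern frequency (assumption).
    """
    m = len(pattern)
    if i + m > len(runs):
        return False
    for j, (ch, f) in enumerate(pattern):
        run_ch, run_f = runs[i + j]
        if run_ch != ch:
            return False
        if j == 0 or j == m - 1:
            if run_f < f:
                return False
        elif run_f != f:
            return False
    return True


def find_occurrences(runs, pattern):
    """Return the run indexes at which the mapped pattern matches."""
    return [i for i in range(len(runs)) if matches_at(runs, i, pattern)]


runs = [("L", 10), ("E", 6), ("L", 4), ("E", 3), ("H", 18)]
print(find_occurrences(runs, [("L", 5), ("E", 6), ("L", 3)]))  # L5+ E6 L3
print(find_occurrences(runs, [("E", 3), ("L", 4)]))            # E3+ L4
```

The scan touches every run, which is exactly what an index is meant to avoid; it serves only to make the rule concrete.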

SBC-tree: Insertion Procedure
  Given an RLE sequence S = Ω1 x_1 f_1 x_2 f_2 … x_n f_n:
  1. Insert S as the first suffix into the SBC-tree first level.
  2. For each 1 ≤ i ≤ n, insert the RLE-suffix x_i 1 x_{i+1} f_{i+1} … x_n f_n into the SBC-tree first level, and assign it a position tag T (the tag assignment problem).
  3. Insert the point (T, f_i) into the SBC-tree second level.
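
A minimal in-memory sketch of steps 2 and 3, under several stated assumptions: the sentinel entry Ω1 is omitted, keys are ordered by their decompressed text, and tags are simply ranks in that sorted order (a static stand-in for the tag assignment problem, which the real structure must solve under updates). All names are hypothetical.

```python
def expand(runs):
    """Decompress a list of (char, count) runs into a plain string."""
    return "".join(ch * n for ch, n in runs)


def sbc_entries(runs):
    """Return (tag, key, point) triples for an RLE sequence.

    key   : the RLE-suffix with its first frequency replaced by 1
    tag   : rank of the key in sorted order (assumption: static ranks)
    point : (tag, f_i), the entry destined for the second-level index
    """
    keyed = []
    for i, (ch, f) in enumerate(runs):
        key = [(ch, 1)] + runs[i + 1:]
        keyed.append((expand(key), key, f))
    keyed.sort(key=lambda entry: entry[0])   # order by decompressed text
    return [(tag, key, (tag, f)) for tag, (_, key, f) in enumerate(keyed)]


runs = [("L", 10), ("E", 6), ("L", 4), ("E", 3), ("H", 18)]
for tag, key, point in sbc_entries(runs):
    print(tag, " ".join(f"{c}{n}" for c, n in key), point)
```

Replacing the first frequency with 1 is what lets all suffixes that start with the same character and continue identically sit next to each other in the first level, while the dropped frequency f_i survives as the second coordinate of the point.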

SBC-tree: Substring Searching
  Given a query Q = y_1 f_1 y_2 f_2 … y_m f_m:
  1. Map Q into Q' = y_1 f_1+ y_2 f_2 … y_m f_m.
  2. Search the String B-tree for Q'' = y_1 1 y_2 f_2 … y_m f_m. This returns (min_tag, max_tag) as a contiguous range.
  3. Search the SBC-tree second level for suffixes in that tag range with frequency >= f_1.
  [Figure: the String B-tree yields the range min_tag to max_tag on the suffix-tag axis of the two-dimensional index; the answer set consists of the points in that range whose preceding RLE-char frequency is at least f_1.]
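
The three search steps can be sketched in memory as follows. This is an illustration under assumptions, not the authors' implementation: a sorted Python list stands in for the String B-tree, keys are compared by their decompressed text, tags are list positions, and a linear filter over the tag range stands in for the two-dimensional index. All names are hypothetical.

```python
from bisect import bisect_left


def expand(runs):
    """Decompress a list of (char, count) runs into a plain string."""
    return "".join(ch * n for ch, n in runs)


def build_index(runs):
    """Sorted list of (key_text, f_i, run_index); list position acts as the tag."""
    entries = []
    for i, (ch, f) in enumerate(runs):
        key_text = expand([(ch, 1)] + runs[i + 1:])
        entries.append((key_text, f, i))
    entries.sort()
    return entries


def substring_search(index, query):
    """Return the run indexes at which the RLE query occurs."""
    (y1, f1), rest = query[0], list(query[1:])
    q2 = expand([(y1, 1)] + rest)            # Q'': first frequency set to 1
    keys = [entry[0] for entry in index]
    min_tag = bisect_left(keys, q2)          # first key >= Q''
    max_tag = min_tag
    while max_tag < len(keys) and keys[max_tag].startswith(q2):
        max_tag += 1                         # extend over keys prefixed by Q''
    # Stand-in for the second-level query: tag in range and frequency >= f1.
    return sorted(i for _, f, i in index[min_tag:max_tag] if f >= f1)


runs = [("L", 10), ("E", 6), ("L", 4), ("E", 3), ("H", 18)]
index = build_index(runs)
print(substring_search(index, [("L", 5), ("E", 6), ("L", 3)]))
print(substring_search(index, [("E", 3), ("L", 4)]))
print(substring_search(index, [("L", 5)]))
```

In the last query both L runs fall inside the tag range, and only the frequency condition separates the answer from the non-answer, which is the role the second level plays.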

SBC-Tree: Example
  P = A5 E3 B4
  P' = A5+ E3 B4
  P'' = A1 E3 B4

SBC-Tree Variants
  3-sided structure [L. Arge, V. Samoladas, J. Vitter, PODS 1999]
    External-memory structure based on the priority search tree and the B-tree
    Answers 3-sided range queries in 2D space
    Provides optimal worst-case theoretical bounds for:
      External-memory space complexity
      Insertion and deletion
      3-sided range queries
  R-tree
    Available in all DBMSs
    Provides good performance in practice
    Does not have worst-case theoretical bounds for searching
  One-Level SBC-tree
    Removes the second-level structure
    Disadvantage: queries scan many tuples outside the answer set

SBC-tree Theoretical Bounds
  Optimal external-memory space complexity: O(N/B)
  Optimal substring, prefix, and range searching in O(log_B N + (|p| + T)/B) I/O operations
  Insertion and deletion in O(m log_B (N + m)) amortized I/O operations

  Parameter   Definition
  B           Disk page size
  N           Total length of the RLE-compressed sequences
  T           Query output size
  |p|         Length of the RLE-compressed query pattern
  m           Length of the RLE-compressed sequence to be inserted or deleted
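
To get a feel for the bounds, the expressions inside the O(...) can be evaluated for sample parameters. The values below are hypothetical, and the results ignore the constant factors that the asymptotic notation hides, so they indicate growth rather than actual I/O counts.

```python
import math


def search_io_estimate(N, B, p_len, T):
    """Evaluate log_B N + (|p| + T) / B, the term inside the search bound."""
    return math.log(N, B) + (p_len + T) / B


def update_io_estimate(N, B, m):
    """Evaluate m * log_B (N + m), the term inside the update bound."""
    return m * math.log(N + m, B)


# Hypothetical parameters: 10^6 RLE-chars indexed, 1000 entries per page,
# a query pattern of 10 RLE-chars returning 100 results.
print(search_io_estimate(N=10**6, B=1000, p_len=10, T=100))
# Inserting a sequence of 50 RLE-chars into the same index.
print(update_io_estimate(N=10**6, B=1000, m=50))
```

Because N counts RLE-chars rather than raw characters, a sequence that compresses well enters these expressions with a much smaller N than it would for an index over the uncompressed data.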

SBC-tree Implementation
  The SBC-tree (R-tree variant) is implemented inside PostgreSQL.
  Query operators: ^^ (substring search); prefix search; == and > (range search)
    CREATE TABLE sequences (id INT, RLE_seq VARCHAR);
    CREATE INDEX ON sequences USING sbctree (RLE_seq);
    SELECT id FROM sequences WHERE RLE_seq ^^ 'A5H7N2';  -- ^^ is the substring-searching operator

SBC-tree Performance Analysis: Storage Requirements
  Up to an order of magnitude saving in storage
  SBC-tree performance is compared with that of the String B-tree over the uncompressed sequences.
  Datasets:
    SwissProt (protein secondary structure), alphabet size = 3
    WalMart (sales profile time series), alphabet size = 5
    Temperature (time series of sensor readings), alphabet size = 52

SBC-tree Performance Analysis: Insertion
  Around 30% saving in insertion
  SBC-tree performance is compared with that of the String B-tree over the uncompressed sequences.
  Datasets:
    SwissProt (protein secondary structure), alphabet size = 3
    WalMart (sales profile time series), alphabet size = 5
    Temperature (time series of sensor readings), alphabet size = 52

SBC-tree Performance Analysis: Searching
  Retains the optimal search performance (only the query answer is retrieved)
  Some additional overhead because of the two-level structure
  SBC-tree performance is compared with that of the String B-tree over the uncompressed sequences.
  Datasets:
    SwissProt (protein secondary structure), alphabet size = 3
    WalMart (sales profile time series), alphabet size = 5
    Temperature (time series of sensor readings), alphabet size = 52

Summary
  Addressed the challenge of storing and operating on compressed data inside DBMSs without decompression
  Introduced the SBC-tree as an index for Run-Length Encoded (RLE) compressed sequences
  The SBC-tree has optimal theoretical bounds for:
    External-memory space complexity
    Search I/O requirements
  Implemented inside PostgreSQL

Thank you
Mohamed Eltabakh