1 Query Optimization In Compressed Database Systems Zhiyuan Chen and Johannes Gehrke Cornell University Flip Korn AT&T Labs.

2 Why Compression?
CPU speed outpaces disk speed exponentially!
  – x10 / decade (bandwidth), x100 / decade (latency)
Trade CPU for I/O: improve query performance
  + Save bandwidth for sequential I/O
  + Improve buffer pool hit ratio
  - Pay decompression cost
Environment
  – Decision support queries
  – Lossless compression

3 Issues
Database compression methods
Efficient query processing

4 Database Compression Methods
General-purpose compression
  – Only compression ratio matters
  – Large decompression unit (whole file)
Database compression
  – Both compression ratio and decompression cost matter
  – Small decompression unit (attribute or tuple)
Our setting: a single attribute can be decompressed on its own

5 Efficient Query Processing
Compared to uncompressed DB
  – When to decompress
  – Assumption: no compression in query processing
Our story
  – Different strategies of when to decompress
  – None of them is always optimal
  – Combined optimization problem: query plan + decompression placement
  – Solutions
  – Experiments

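6 Different Decompression Strategies
[Figure: three plans for the join of R and S on R.A = S.B, drawn across the memory/disk boundary]
Eager: D(R), D(S); all attributes uncompressed
Lazy: D(R.A), D(S.B); only A and B uncompressed
Transient: join evaluated as d(R.A) = d(S.B); all attributes stay compressed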
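The three strategies can be illustrated with a toy nested-loops join. This is only a sketch under stated assumptions: `zlib` stands in for the attribute-level compression method (the slides do not name one), and the relations `R` and `S`, the extra attributes `X` and `Y`, and all function names are hypothetical.

```python
import zlib

def c(text):
    """Compress one attribute value (zlib is an assumed stand-in)."""
    return zlib.compress(text.encode())

def d(blob):
    """Decompress one attribute value."""
    return zlib.decompress(blob).decode()

# Hypothetical compressed relations; every attribute is stored compressed.
R = [{"A": c("12 Elm St"), "X": c("r1")}, {"A": c("9 Oak Ave"), "X": c("r2")}]
S = [{"B": c("12 Elm St"), "Y": c("s1")}]

def eager_join(R, S):
    # Eager: decompress whole tuples up front, then join uncompressed data.
    R = [{k: d(v) for k, v in r.items()} for r in R]
    S = [{k: d(v) for k, v in s.items()} for s in S]
    return [{**r, **s} for r in R for s in S if r["A"] == s["B"]]

def lazy_join(R, S):
    # Lazy: explicitly decompress only the join attributes; they stay
    # uncompressed in the output, everything else stays compressed.
    R = [{**r, "A": d(r["A"])} for r in R]
    S = [{**s, "B": d(s["B"])} for s in S]
    return [{**r, **s} for r in R for s in S if r["A"] == s["B"]]

def transient_join(R, S):
    # Transient: decompress only for the comparison; output stays compressed.
    return [{**r, **s} for r in R for s in S if d(r["A"]) == d(s["B"])]

eager_out = eager_join(R, S)
lazy_out = lazy_join(R, S)
transient_out = transient_join(R, S)
```

In this sketch the transient join calls `d` inside the nested loop, so each join value is decompressed once per comparison, while the lazy join decompresses each join value exactly once.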

7 Which Strategy Is Optimal?
Lazy vs. eager
  – Lazy is always better
Transient vs. lazy
  – Transient: more I/O savings
  – Lazy: lower decompression cost
In practice
  – Numerical attributes: transient is always better
  – String attributes: no clear winner
      Expensive to decompress
      High I/O savings if compressed

8 An Example With TPC-H Data
  SELECT S_NAME, S_ADDRESS, C_NAME, C_PHONE
  FROM Supplier, Customer
  WHERE S_ADDRESS = C_ADDRESS
  ORDER BY S_NAME, C_NAME
[Plan diagram: Supplier and Customer joined on S_A = C_A, followed by Sort(S_N, C_N)]

9 Transient vs. Lazy
Join plan             Sort plan               Attributes compressed
Lazy BNL (2 s)        Lazy sort (7 s)         1
Lazy BNL (2 s)        Transient sort (3 s)    3
Transient BNL (42 s)  Transient sort (0.5 s)  All
An optimization problem!
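Adding up the per-operator timings makes the trade-off concrete. The sketch below assumes each join timing on the slide pairs with the sort timing listed next to it and that the plan cost is simply their sum; the dictionary keys and function name are illustrative.

```python
# Per-operator timings in seconds, read off the slide: (join, sort).
plans = {
    "lazy BNL + lazy sort": (2.0, 7.0),
    "lazy BNL + transient sort": (2.0, 3.0),
    "transient BNL + transient sort": (42.0, 0.5),
}

def total_cost(join_seconds, sort_seconds):
    """Assumed cost model: plan cost is the sum of its operator timings."""
    return join_seconds + sort_seconds

totals = {name: total_cost(*ops) for name, ops in plans.items()}
best = min(totals, key=totals.get)

for name, seconds in totals.items():
    print(f"{name}: {seconds:.1f} s")
print("cheapest:", best)
```

Under that assumption the mixed plan wins: neither the all-lazy nor the all-transient plan is cheapest, which is why the placement of decompression has to be optimized.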

10 Interactions With Traditional Optimization
Optimal plan returned by System R is no longer optimal!
Algorithm: run System R, then decide when to decompress.
Join plan             Sort plan               Attributes compressed
Lazy BNL (2 s)        Transient sort (3 s)    3
Transient SM (2.5 s)  Transient sort (0.5 s)  All
[Figure annotation: "Pruned by System R"]

11 Compression Aware Optimization
Given a query and a compressed DB: find the optimal query plan
New operators
  – Explicit decompression operators
  – Transient versions of existing relational operators
Search space: O(n^m) factor over old search space
  – n is the depth of the plan
  – m is the number of attributes
  – Each attribute explicitly decompressed at most once
  – For each attribute, n places to decompress explicitly
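The n^m factor can be checked by brute-force enumeration: with n candidate levels per attribute and m attributes, there are n^m ways to place the explicit decompression operators on one plan. A minimal sketch, with a hypothetical function name and plan levels numbered from 0:

```python
from itertools import product

def decompression_placements(attributes, depth):
    """Yield one dict per placement, mapping each attribute to the plan
    level (0 .. depth-1) at which it is explicitly decompressed."""
    for levels in product(range(depth), repeat=len(attributes)):
        yield dict(zip(attributes, levels))

# Two attributes on a plan of depth 3.
placements = list(decompression_placements(["S_A", "C_A"], depth=3))
print(len(placements), "placements for n=3, m=2")
```

Each placement extends a single regular plan, so the optimizer's search space grows by this factor for every plan it would otherwise consider.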

12 Dynamic Programming - OPT
Extend System R optimizer
  – Bottom up, one minimal plan per interesting property
  – What attributes remain compressed as a new property
Blowup reduced from n^m to 2^m
[Figure: two plans over Customer and Supplier kept side by side]
  Lazy BNL (2 s), property: S_A, C_A uncompressed
  Transient SM join (2.5 s), property: all compressed
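The pruning rule can be sketched as a plan table keyed by the set of attributes that are still compressed. This is an illustration, not the Predator implementation: the attribute list, the helper names, and the reuse of the 42 s transient BNL timing from the earlier example are assumptions, and real interesting properties such as sort order are left out.

```python
from itertools import combinations

# Attributes touched by the example query (abbreviations as on the slides).
ATTRS = ["S_N", "S_A", "C_N", "C_A", "C_PHONE"]

def all_properties(attributes):
    """Every subset of attributes that may still be compressed: 2**m subsets."""
    attrs = sorted(attributes)
    return [frozenset(combo)
            for size in range(len(attrs) + 1)
            for combo in combinations(attrs, size)]

def keep_cheapest(table, still_compressed, cost, plan):
    """Keep one minimal-cost plan per 'still compressed' property."""
    key = frozenset(still_compressed)
    if key not in table or cost < table[key][0]:
        table[key] = (cost, plan)

table = {}
keep_cheapest(table, set(ATTRS) - {"S_A", "C_A"}, 2.0, "lazy BNL")
keep_cheapest(table, set(ATTRS), 42.0, "transient BNL")
keep_cheapest(table, set(ATTRS), 2.5, "transient SM join")

for prop, (cost, plan) in sorted(table.items(), key=lambda kv: kv[1][0]):
    print(f"{plan}: {cost} s, compressed = {sorted(prop)}")
```

Because the two surviving plans have different properties, the transient sort-merge join is not pruned by the cheaper lazy BNL join; only the transient BNL join, which shares its property, is discarded.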

13 Min-K Heuristic Algorithm
Store plans for k rather than 2^m properties
  – The k properties whose plans are cheapest
Storage blowup reduced from 2^m to k
Time: still exponential blowup in the worst case
[Figure: join on S_A, C_A over inputs (S_A, …) and (C_A, …)]
Stored plans:
  Lazy: S_A, C_A
  Transient: S_A, C_A
  Lazy: S_A, transient: C_A
  Transient: S_A, lazy: C_A
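The storage side of Min-K amounts to truncating the plan table to its k cheapest entries. A minimal sketch follows; the function name is hypothetical, and the costs are invented purely for illustration, since the slide gives none for these four plans.

```python
import heapq

def min_k(table, k):
    """Return a new table holding only the k properties with the cheapest plans.
    `table` maps property -> (cost, plan description)."""
    cheapest = heapq.nsmallest(k, table.items(), key=lambda item: item[1][0])
    return dict(cheapest)

# Property = join attributes still compressed after the join.
# Costs below are made up for the example.
candidates = {
    frozenset(): (2.0, "lazy: S_A, C_A"),
    frozenset({"S_A", "C_A"}): (42.0, "transient: S_A, C_A"),
    frozenset({"C_A"}): (20.0, "lazy: S_A, transient: C_A"),
    frozenset({"S_A"}): (25.0, "transient: S_A, lazy: C_A"),
}

stored = min_k(candidates, k=2)
for prop, (cost, plan) in stored.items():
    print(f"{plan}: {cost} s")
```

Only the stored table shrinks here; all candidate plans are still generated before truncation, which is consistent with the slide's remark that time can remain exponential in the worst case.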

14 Min-K Heuristics (2)
If transient decompression is bad for one join attribute, it is often bad for the other
  – BNL join: both S_A and C_A decompressed N^2 times
Only consider two cases
Time blowup is 2k
[Figure: join on S_A, C_A over inputs (S_A, …) and (C_A, …)]
Stored plans:
  Lazy: S_A, C_A
  Transient: S_A, C_A

15 Experiments
Setup
  – Modify Predator query engine & optimizer
  – Algorithms: Uncompressed, Eager, Lazy, Transient-Only, Two-Step, OPT, Min-1, Min-2
  – 100 MB TPC-H data
  – 50% compression ratio
  – Pentium III 550 MHz, varying buffer pool size

16 Experimental Setup (2)
Randomly add join conditions on string attributes
Divide queries into workloads
  – Number of string join conditions, number of join tables
Metrics: for algorithm X
  – Average relative cost: Average(cost of plan returned by X / cost of OPT plan)
  – Average blowup factor: Average(# plans searched by X / # plans searched by System R)
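Both metrics are per-query ratios averaged over a workload. The sketch below spells that out; the function names and the sample numbers are made up for illustration and are not measurements from the experiments.

```python
from statistics import mean

def average_relative_cost(costs_x, costs_opt):
    """Mean over queries of (cost of X's plan / cost of the optimal plan)."""
    return mean(x / opt for x, opt in zip(costs_x, costs_opt))

def average_blowup_factor(plans_x, plans_system_r):
    """Mean over queries of (# plans searched by X / # plans searched by System R)."""
    return mean(x / base for x, base in zip(plans_x, plans_system_r))

# Hypothetical two-query workload.
relative_cost = average_relative_cost([6.0, 3.0], [3.0, 3.0])
blowup = average_blowup_factor([200, 300], [100, 100])
print(relative_cost, blowup)
```

A relative cost of 1 means the algorithm always finds the optimal plan; a blowup factor of 1 means it searches no more plans than System R.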

17 Average Relative Cost Queries with 3-4 join tables, buffer pool 10% of compressed DB

18 Distribution of Query Performance Percentage of Good plans (cost within twice of OPT) for all queries

19 Optimization Cost Queries with 3-4 join tables

20 Related Work
How to compress
  – Roth&Horn93, Iyer&Wilhite94, Goldstein98
How to query
  – Graefe&Shapiro91, Westmann00, Greer99
Query optimization
  – Compressed MOLAP aggregates: Li99
  – Compressed bitmap indices: Amer-Yahia&Johnson00
  – Expensive predicates: Chaudhuri&Shim99, Hellerstein93

21 Conclusions & Future Work
Novel optimization problem
  – Search for regular query plan + when to decompress
  – Separate search is sub-optimal
  – OPT and Min-K heuristic
  – Up to an order of magnitude improvement in experiments
Future work
  – Caching decompressed values
  – Updates

22 Search Space
n^m blowup over old space
  - n: depth of plan
  - m: number of attributes
[Figure: plan with input S_A, …, join S_A = C_A, then Sort(S_A); 3 places to place D(S_A), giving 3 extended plans (3 is the depth)]
  – Before D(S_A): convert to transient (transient join)
  – After D(S_A): as it is (regular sort)

23 Relative-Cost - Varying Buffer Pool Size Queries with 3-4 join tables, 2 additional string joins

24 Relative Performance (2) Queries with more than 5 join tables