1 Query Optimization In Compressed Database Systems Zhiyuan Chen and Johannes Gehrke Cornell University Flip Korn AT&T Labs.

1 Query Optimization In Compressed Database Systems
Zhiyuan Chen and Johannes Gehrke, Cornell University
Flip Korn, AT&T Labs

2 Why Compression?
CPU speed outpaces disk speed exponentially!
  – x10 / decade (bandwidth), x100 / decade (latency)
Trade CPU for I/O: improve query performance
  + Save bandwidth for sequential I/O
  + Improve buffer pool hit ratio
  - Pay decompression cost
Environment
  – Decision support queries
  – Lossless compression

3 Issues
Database compression methods
Efficient query processing

4 Database Compression Methods
General-purpose compression
  – Only compression ratio matters
  – Large decompression unit (whole file)
Database compression
  – Both compression ratio and decompression cost matter
  – Small decompression unit (attribute or tuple)
Our setting: a single attribute can be decompressed on its own

5 Efficient Query Processing
Compared to uncompressed DB
  – When to decompress
  – Assumption: no compression in query processing
Our story
  – Different strategies of when to decompress
  – None of them is always optimal
  – Combined optimization problem: query plan + decompression placement
  – Solutions
  – Experiments

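6 Different Decompression Strategies
[Figure: three plans for joining R and S on R.A = S.B, drawn across the disk/memory boundary]
Eager: D(R), D(S) below the join on R.A = S.B; all attributes uncompressed
Lazy: D(R.A), D(S.B) below the join on R.A = S.B; only A and B uncompressed
Transient: join on d(R.A) = d(S.B); all attributes stay compressed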

7 Which Strategy Is Optimal?
Lazy vs. eager
  – Lazy is always better
Transient vs. lazy
  – Transient: more I/O savings
  – Lazy: lower decompression cost
In practice
  – Numerical attributes: transient is always better
  – String attributes: no clear winner
      Expensive to decompress
      High I/O savings if compressed
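
The trade-off between lazy and transient decompression can be sketched on a toy nested-loop join. This is only an illustration under stated assumptions: zlib stands in for the (unspecified) attribute-level compressor, rows are hypothetical (compressed key, payload) pairs, and the function names are made up for this example. Lazy decompresses each join key once and carries the uncompressed value forward; transient decompresses only for the comparison, so the output keeps the compressed key.

```python
import zlib


def compress(value: str) -> bytes:
    """Stand-in for a lossless per-attribute compressor (zlib is an assumption)."""
    return zlib.compress(value.encode("utf-8"))


def decompress(blob: bytes) -> str:
    return zlib.decompress(blob).decode("utf-8")


def lazy_nested_loop_join(r_rows, s_rows):
    """Decompress each join key once, before the join; keys stay uncompressed."""
    r = [(decompress(key), payload) for key, payload in r_rows]
    s = [(decompress(key), payload) for key, payload in s_rows]
    calls = len(r) + len(s)  # one decompression per input row
    out = [(rk, rp, sp) for rk, rp in r for sk, sp in s if rk == sk]
    return out, calls


def transient_nested_loop_join(r_rows, s_rows):
    """Decompress keys only for each comparison; output keys stay compressed."""
    calls = 0
    out = []
    for rk, rp in r_rows:
        for sk, sp in s_rows:
            calls += 2  # both keys are decompressed for every pair
            if decompress(rk) == decompress(sk):
                out.append((rk, rp, sp))
    return out, calls


# Hypothetical toy data: (address, name) pairs with a compressed address.
suppliers = [(compress(a), n) for a, n in
             [("1 Main St", "s1"), ("9 Oak Ave", "s2"), ("4 Elm Rd", "s3")]]
customers = [(compress(a), n) for a, n in
             [("9 Oak Ave", "c1"), ("7 Pine Ln", "c2")]]

lazy_out, lazy_calls = lazy_nested_loop_join(suppliers, customers)
transient_out, transient_calls = transient_nested_loop_join(suppliers, customers)
```

With 3 supplier rows and 2 customer rows, the lazy join decompresses 5 keys while the transient join decompresses 12, but the transient result is smaller because its key is still compressed.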

8 An Example With TPCH Data
Select S_NAME, S_ADDRESS, C_NAME, C_PHONE
From Supplier, Customer
Where S_ADDRESS = C_ADDRESS
Order by S_NAME, C_NAME
[Plan diagram: Supplier joined with Customer on S_A = C_A, then Sort(S_N, C_N)]

9 Transient vs. Lazy
Lazy BNL (2 s) + lazy sort (7 s): 1 attribute compressed
Lazy BNL (2 s) + transient sort (3 s): 3 attributes compressed
Transient BNL (42 s) + transient sort (0.5 s): all attributes compressed
An optimization problem!

10 Interactions With Traditional Optimization
Optimal plan returned by System R is no longer optimal!
Algorithm: run System R, then decide when to decompress.
  – Lazy BNL (2 s) + transient sort (3 s): 3 attributes compressed
  – Transient SM (2.5 s) + transient sort (0.5 s): all attributes compressed (pruned by System R)
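
The timings quoted on slides 9 and 10 can be added up to see why fixing the join first loses the best plan. This is a small worked calculation, assuming the cost of a plan is simply join time plus sort time; the dictionary layout is illustrative only.

```python
# Timings in seconds as quoted on the slides; summing join and sort time
# to get a plan cost is an assumption made for this illustration.
plans = {
    ("lazy BNL", "lazy sort"): (2.0, 7.0),
    ("lazy BNL", "transient sort"): (2.0, 3.0),
    ("transient BNL", "transient sort"): (42.0, 0.5),
    ("transient SM", "transient sort"): (2.5, 0.5),
}

totals = {plan: join + sort for plan, (join, sort) in plans.items()}

# Cheapest plan when join and decompression placement are chosen together.
best_overall = min(totals, key=totals.get)

# Join operator with the lowest join time, looked at in isolation.
cheapest_join = min(plans, key=lambda plan: plans[plan][0])[0]
```

The join that is cheapest in isolation is the lazy BNL (2 s), yet the cheapest complete plan uses the transient SM join (2.5 s + 0.5 s = 3 s, against 5 s for lazy BNL with a transient sort).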

11 Compression Aware Optimization
Given a query and a compressed DB: find the optimal query plan
New operators
  – Explicit decompression operators
  – Transient versions of existing relational operators
Search space: O(n^m) factor over old search space
  – n is the depth of the plan
  – m is the number of attributes
  – Each attribute explicitly decompressed at most once
  – For each attribute, n places to decompress explicitly
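
The n^m factor follows from giving each of the m attributes an independent choice among n plan levels for its explicit decompression operator. A minimal sketch that enumerates those choices, assuming levels are numbered 0 to n-1 and using a hypothetical function name:

```python
from itertools import product


def decompression_placements(depth, attributes):
    """Enumerate every assignment of a plan level to each attribute.

    Each attribute gets exactly one of `depth` levels, so the number of
    assignments is depth ** len(attributes).
    """
    attributes = list(attributes)
    return [dict(zip(attributes, levels))
            for levels in product(range(depth), repeat=len(attributes))]


placements = decompression_placements(3, ["S_A", "C_A"])
```

For a plan of depth 3 and the two join attributes S_A and C_A this yields 3^2 = 9 placements for a single base plan.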

12 Dynamic Programming - OPT
Extend System R optimizer
  – Bottom up, one minimal plan per interesting property
  – What attributes remain compressed as a new property
Blowup reduced from n^m to 2^m
[Example: two plans for joining Customer and Supplier, kept under different properties]
Lazy BNL (2 s), property: S_A, C_A uncompressed
Transient SM join (2.5 s), property: all compressed
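
The 2^m bound comes from the property itself: it is the set of attributes that remain compressed, and m attributes have 2^m subsets. A rough sketch of the pruning step, assuming candidate plans are hypothetical (name, cost, compressed attributes) triples; this is not the Predator implementation, only an illustration of "one minimal plan per property".

```python
from itertools import combinations


def compression_properties(attributes):
    """All subsets of attributes that could remain compressed (2**m of them)."""
    attrs = sorted(attributes)
    return [frozenset(subset)
            for size in range(len(attrs) + 1)
            for subset in combinations(attrs, size)]


def keep_cheapest_per_property(candidates):
    """Keep only the cheapest plan for each set of still-compressed attributes."""
    best = {}
    for name, cost, compressed in candidates:
        key = frozenset(compressed)
        if key not in best or cost < best[key][1]:
            best[key] = (name, cost)
    return best


# Join costs in seconds taken from the slides; the triples are illustrative.
candidates = [
    ("lazy BNL", 2.0, set()),                       # S_A, C_A uncompressed
    ("transient BNL", 42.0, {"S_A", "C_A"}),        # all compressed
    ("transient SM", 2.5, {"S_A", "C_A"}),          # all compressed
]
best = keep_cheapest_per_property(candidates)
```

The lazy BNL and the transient SM join both survive because they carry different properties, while the 42 s transient BNL is pruned by the cheaper plan with the same property.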

13 Min-K Heuristic Algorithm
Store plans for k rather than 2^m properties
  – The k properties whose plans are cheapest
Storage blowup reduced from 2^m to k
Time: still exponential blowup in the worst case
[Example: join on S_A, C_A over inputs S_A,… and C_A,…]
Stored plans:
  – Lazy: S_A, C_A
  – Transient: S_A, C_A
  – Lazy: S_A, transient: C_A
  – Transient: S_A, lazy: C_A
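
The storage rule of Min-K can be sketched as a filter over the per-property table: keep the k entries whose plans are cheapest and drop the rest. The table layout (property mapped to a (plan name, cost) pair) and the function name are assumptions for illustration.

```python
import heapq


def min_k(best_per_property, k):
    """Keep the k properties whose stored plans have the lowest cost."""
    kept = heapq.nsmallest(k, best_per_property.items(),
                           key=lambda item: item[1][1])
    return dict(kept)


# Illustrative table; the two costs are the join timings quoted on the slides.
table = {
    frozenset(): ("lazy BNL", 2.0),
    frozenset({"S_A", "C_A"}): ("transient SM", 2.5),
}
min_1 = min_k(table, 1)
min_2 = min_k(table, 2)
```

With k = 1 only the lazy BNL entry is stored, so later operators can no longer build on the all-compressed join result; with k = 2 both entries are kept.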

14 Min-K Heuristics (2)
If transient decompression is bad for one join attribute, it is often bad for the other
  – BNL join: both S_A and C_A decompressed N^2 times
Only consider two cases; time blowup is 2k
[Example: join on S_A, C_A over inputs S_A,… and C_A,…]
Stored plans:
  – Lazy: S_A, C_A
  – Transient: S_A, C_A

15 Experiments
Setup
  – Modify Predator query engine & optimizer
  – Algorithms: Uncompressed, Eager, Lazy, Transient-Only, Two-Step, OPT, Min-1, Min-2
  – 100 MB TPCH data
  – 50% compression ratio
  – Pentium III 550 MHz, vary buffer pool size

16 Experimental Setup (2)
Randomly add join conditions on string attributes
Divide queries into workloads
  – Number of string join conditions, number of join tables
Metrics: for algorithm X
  – Average relative cost: Average(cost of plan returned by X / cost of OPT plan)
  – Average blowup factor: Average(# plans searched by X / # plans searched by System R)
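
Both metrics are per-query ratios averaged over a workload. A minimal sketch, assuming the per-query measurements arrive as parallel lists; the function names and the sample numbers are hypothetical, not results from the experiments.

```python
def _average_ratio(numerators, denominators):
    """Mean of element-wise ratios over a non-empty workload."""
    if len(numerators) != len(denominators) or not numerators:
        raise ValueError("need two non-empty lists of equal length")
    ratios = [n / d for n, d in zip(numerators, denominators)]
    return sum(ratios) / len(ratios)


def average_relative_cost(plan_costs, opt_costs):
    """Average(cost of plan returned by X / cost of OPT plan)."""
    return _average_ratio(plan_costs, opt_costs)


def average_blowup_factor(plans_searched, plans_system_r):
    """Average(# plans searched by X / # plans searched by System R)."""
    return _average_ratio(plans_searched, plans_system_r)


# Made-up numbers for two queries, only to show the calculation.
relative_cost = average_relative_cost([3.0, 6.0], [3.0, 3.0])
blowup = average_blowup_factor([20, 30], [10, 10])
```

A relative cost of 1.0 means algorithm X always returned a plan as cheap as OPT; a blowup factor of 1.0 means it searched no more plans than System R.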

17 Average Relative Cost
Queries with 3-4 join tables, buffer pool 10% of compressed DB

18 Distribution of Query Performance
Percentage of good plans (cost within twice that of OPT) for all queries
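
The "good plan" percentage can be computed directly from per-query costs. A small sketch, assuming "within twice" means at most 2x the OPT cost (the inclusive bound is an assumption) and using made-up numbers:

```python
def good_plan_percentage(plan_costs, opt_costs, factor=2.0):
    """Percentage of queries whose plan costs at most `factor` times OPT."""
    if len(plan_costs) != len(opt_costs) or not plan_costs:
        raise ValueError("need two non-empty lists of equal length")
    good = sum(1 for cost, opt in zip(plan_costs, opt_costs)
               if cost <= factor * opt)
    return 100.0 * good / len(plan_costs)


# Hypothetical costs for four queries, only to show the calculation.
percentage = good_plan_percentage([3.0, 5.0, 10.0, 4.0], [3.0, 3.0, 4.0, 4.0])
```

In this made-up workload three of the four plans stay within twice the OPT cost, giving 75%.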

19 Optimization Cost
Queries with 3-4 join tables

20 Related Work
How to compress
  – Roth&Horn93, Iyer&Wilhite94, Goldstein98
How to query
  – Graefe&Shapiro91, Westmann00, Greer99
Query optimization
  – Compressed MOLAP aggregates: Li99
  – Compressed bitmap indices: Amer-Yahia&Johnson00
  – Expensive predicates: Chaudhuri&Shim99, Hellerstein93

21 Conclusions & Future Work
Novel optimization problem
  – Search for regular query plan + when to decompress
  – Separate search is sub-optimal
  – OPT and Min-K heuristic
  – Up to an order of magnitude improvement in experiments
Future work
  – Caching decompressed values
  – Updates

22 Search Space
[Figure: plan over S_A, … with join S_A = C_A followed by Sort(S_A); 3 places to place D(S_A)]
3 extended plans (3 is the depth)
n^m blowup over old space
  - n: depth of plan
  - m: number of attributes
Operators before D(S_A): convert to transient (transient join)
Operators after D(S_A): left as they are (regular sort)

23 Relative Cost - Varying Buffer Pool Size
Queries with 3-4 join tables, 2 additional string joins

24 Relative Performance (2)
Queries with more than 5 join tables

