1 Query Optimization In Compressed Database Systems
Zhiyuan Chen and Johannes Gehrke, Cornell University
Flip Korn, AT&T Labs
2 Why Compression?
CPU speed outpaces disk speed exponentially!
–x10 / decade (bandwidth), x100 / decade (latency)
Trade CPU for I/O: improve query performance
+ Save bandwidth for sequential I/O
+ Improve buffer pool hit ratio
- Pay decompression cost
Environment
–Decision support queries
–Lossless compression
3 Issues
Database compression methods
Efficient query processing
4 Database Compression Methods
General-purpose compression
–Only compression ratio matters
–Large decompression unit (whole file)
Database compression
–Both compression ratio and decompression cost matter
–Small decompression unit (attribute or tuple)
Our setting: a single attribute can be decompressed on its own
5 Efficient Query Processing
Compared to uncompressed DB
–When to decompress
–Assumption: no compression in query processing
Our story
–Different strategies of when to decompress
–None of them is always optimal
–Combined optimization problem: query plan + decompression placement
–Solutions
–Experiments
6 Different Decompression Strategies
Example: join R ⋈ S on R.A = S.B (diagram with a Mem / Disk boundary)
–Eager: D(R), D(S); all attributes uncompressed
–Lazy: D(R.A), D(S.B); A and B uncompressed
–Transient: join evaluated as d(R.A) = d(S.B); all attributes remain compressed
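The three strategies can be illustrated with a small sketch. Everything in it is an assumption made for illustration and not taken from the talk: the `Codec` class, the use of zlib as an attribute-level compressor, the toy relations, and the plain nested-loops join. The counter only tracks how often a value is decompressed, which is the cost side of the trade-off; I/O savings are not modeled.

```python
import zlib

class Codec:
    """Hypothetical attribute-level codec that counts decompression calls."""
    def __init__(self):
        self.calls = 0

    def compress(self, value: str) -> bytes:
        return zlib.compress(value.encode())

    def d(self, blob: bytes) -> str:
        self.calls += 1
        return zlib.decompress(blob).decode()

def nl_join(R, S, a, b, key=lambda v: v):
    """Nested-loops join on R.a = S.b; `key` is applied on every comparison."""
    return [{**r, **s} for r in R for s in S if key(r[a]) == key(s[b])]

def eager(codec, R, S):
    # Decompress every attribute of every tuple before any operator runs.
    R = [{k: codec.d(v) for k, v in t.items()} for t in R]
    S = [{k: codec.d(v) for k, v in t.items()} for t in S]
    return nl_join(R, S, "A", "B")

def lazy(codec, R, S):
    # Explicitly decompress only the join attributes; they stay uncompressed.
    R = [{**t, "A": codec.d(t["A"])} for t in R]
    S = [{**t, "B": codec.d(t["B"])} for t in S]
    return nl_join(R, S, "A", "B")

def transient(codec, R, S):
    # Decompress inside the comparison only; output tuples stay compressed.
    return nl_join(R, S, "A", "B", key=codec.d)

def demo():
    out = {}
    for name, strategy in [("eager", eager), ("lazy", lazy), ("transient", transient)]:
        codec = Codec()
        R = [{"A": codec.compress(a), "X": codec.compress(x)}
             for a, x in [("k1", "r1"), ("k2", "r2"), ("k3", "r3")]]
        S = [{"B": codec.compress(b), "Y": codec.compress(y)}
             for b, y in [("k1", "s1"), ("k3", "s3")]]
        rows = strategy(codec, R, S)
        out[name] = (len(rows), codec.calls)  # (result size, decompression calls)
    return out
```

With 3 tuples in R and 2 in S, `demo()` reports 10 decompression calls for eager, 5 for lazy, and 12 for transient. The transient count grows with the number of comparisons, which is why a transient nested-loops join can be expensive even though its output stays fully compressed.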
7 Which Strategy Is Optimal?
Lazy vs. eager
–Lazy is always better
Transient vs. lazy
–Transient: more I/O savings
–Lazy: lower decompression cost
In practice
–Numerical attributes: transient is always better
–String attributes: no clear winner (expensive to decompress, but high I/O savings if compressed)
8 An Example With TPC-H Data
Select S_NAME, S_ADDRESS, C_NAME, C_PHONE
From Supplier, Customer
Where S_ADDRESS = C_ADDRESS
Order by S_NAME, C_NAME
Plan: join Supplier and Customer on S_A = C_A, then Sort(S_N, C_N)
9 Transient vs. Lazy
–Lazy BNL (2s) + lazy sort (7s): 1 attribute compressed
–Lazy BNL (2s) + transient sort (3s): 3 attributes compressed
–Transient BNL (42s) + transient sort (0.5s): all attributes compressed
An optimization problem!
10 Interactions With Traditional Optimization
Algorithm: run System R, then decide when to decompress.
The plan System R returns as optimal is no longer optimal!
–Lazy BNL (2s) + transient sort (3s): 3 attributes compressed
–Transient SM (2.5s) + transient sort (0.5s): all attributes compressed (pruned by System R)
11 Compression Aware Optimization
Given a query and a compressed DB: find the optimal query plan
New operators
–Explicit decompression operators
–Transient versions of existing relational operators
Search space: O(n^m) factor over old search space
–n is the depth of the plan
–m is the number of attributes
–Each attribute explicitly decompressed at most once
–For each attribute, n places to decompress explicitly
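The n^m factor can be made concrete by enumerating placements. The sketch below assumes, as a simplification, that each of the m attributes is explicitly decompressed at exactly one of the n plan levels; the function name and the level numbering are hypothetical.

```python
from itertools import product

def decompression_placements(attributes, depth):
    """Yield every mapping attribute -> plan level (0..depth-1) at which
    that attribute is explicitly decompressed (illustrative assumption)."""
    for levels in product(range(depth), repeat=len(attributes)):
        yield dict(zip(attributes, levels))

# Two attributes in a plan of depth 3 give 3^2 = 9 placements.
placements = list(decompression_placements(["S_A", "C_A"], 3))
```

Each placement multiplies with every plan in the traditional search space, which is why enumerating them naively is impractical for more than a few attributes.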
12 Dynamic Programming - OPT
Extend System R optimizer
–Bottom up, one minimal plan per interesting property
–What attributes remain compressed as a new property
Blowup reduced from n^m to 2^m
Example plans kept for the join of Customer and Supplier:
–Lazy BNL (2s), property: S_A, C_A uncompressed
–Transient SM join (2.5s), property: all compressed
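The idea of treating "which attributes remain compressed" as a plan property can be sketched on a fixed operator pipeline. This is a toy model and not the OPT algorithm from the talk: the operator order is given, the three cost tables (`explicit_cost`, `transient_cost`, `carry_cost`) are hypothetical, and join-order enumeration is left out. What it does show is that keeping one cheapest plan per set of compressed attributes bounds the number of stored states by 2^m.

```python
def opt_decompression(operators, attrs, explicit_cost, transient_cost, carry_cost):
    """Bottom-up DP over a fixed pipeline.

    operators: list of sets, the attributes each operator touches (bottom-up).
    explicit_cost[a]: one-time cost of explicitly decompressing a.
    transient_cost[a]: cost each time an operator touches a while compressed.
    carry_cost[a]: per-operator cost of carrying a uncompressed (larger tuples).
    Returns (best cost, decisions, number of properties kept at the end).
    """
    all_attrs = frozenset(attrs)
    # property (set of still-compressed attributes) -> (cost, decisions)
    best = {all_attrs: (0.0, [])}
    for step, used in enumerate(operators):
        nxt = {}
        for compressed, (cost, decisions) in best.items():
            candidates = sorted(used & compressed)
            # Try every subset of the touched, still-compressed attributes
            # as the set to decompress explicitly before this operator.
            for mask in range(1 << len(candidates)):
                dec = frozenset(a for j, a in enumerate(candidates) if mask >> j & 1)
                remaining = compressed - dec
                c = cost
                c += sum(explicit_cost[a] for a in dec)
                c += sum(transient_cost[a] for a in used & remaining)
                c += sum(carry_cost[a] for a in all_attrs - remaining)
                # Keep only the cheapest plan per property.
                if remaining not in nxt or c < nxt[remaining][0]:
                    nxt[remaining] = (c, decisions + [(step, sorted(dec))])
        best = nxt
    cost, decisions = min(best.values(), key=lambda entry: entry[0])
    return cost, decisions, len(best)

# Hypothetical costs: transient use of the join attributes is expensive,
# transient use of the sort attribute is cheap.
attrs = ["S_A", "C_A", "S_N"]
ops = [{"S_A", "C_A"}, {"S_N"}]  # join, then sort
explicit = {a: 1.0 for a in attrs}
transient = {"S_A": 20.0, "C_A": 20.0, "S_N": 0.5}
carry = {a: 1.0 for a in attrs}
cost, decisions, kept = opt_decompression(ops, attrs, explicit, transient, carry)
```

Under these made-up costs the cheapest choice is to decompress S_A and C_A before the join and leave S_N compressed through the sort, which mirrors the lazy-join, transient-sort combination in the example; `kept` is at most 2^3 = 8.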
13 Min-K Heuristic Algorithm
Store plans for k rather than 2^m properties
–The k properties whose plans are cheapest
Storage blowup reduced from 2^m to k
Time: still exponential blowup in the worst case
Example: join on S_A, C_A with inputs S_A,… and C_A,…
Stored plans:
–Lazy: S_A, C_A
–Transient: S_A, C_A
–Lazy: S_A, transient: C_A
–Transient: S_A, lazy: C_A
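The pruning step of Min-K can be sketched as keeping the k cheapest entries of a property table. The table layout (property mapped to a cost and a plan label) and the cost values are assumptions for illustration; `keep_min_k` is a hypothetical helper name.

```python
import heapq

def keep_min_k(plans, k):
    """plans: dict mapping property (frozenset of compressed attributes)
    to (cost, plan description). Return only the k cheapest entries."""
    return dict(heapq.nsmallest(k, plans.items(), key=lambda kv: kv[1][0]))

# Hypothetical costs for the four ways to handle the join on S_A, C_A.
plans = {
    frozenset(): (4.0, "lazy: S_A, C_A"),
    frozenset({"S_A", "C_A"}): (40.0, "transient: S_A, C_A"),
    frozenset({"C_A"}): (22.0, "lazy: S_A, transient: C_A"),
    frozenset({"S_A"}): (22.0, "transient: S_A, lazy: C_A"),
}
min_1 = keep_min_k(plans, 1)
min_2 = keep_min_k(plans, 2)
```

Here `min_1` keeps only the fully lazy plan. The heuristic trades optimality for storage: a property that is pruned early cannot contribute to a cheaper plan higher up.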
14 Min-K Heuristics (2)
If transient decompression is bad for one join attribute, it is often bad for the other
–BNL join: both S_A and C_A decompressed N^2 times
Only consider two cases
Time blowup is 2k
Example: join on S_A, C_A with inputs S_A,… and C_A,…
Stored plans:
–Lazy: S_A, C_A
–Transient: S_A, C_A
15 Experiments Setup
–Modify Predator query engine & optimizer
–Algorithms: Uncompressed, Eager, Lazy, Transient-Only, Two-Step, OPT, Min-1, Min-2
–100 MB TPC-H data
–50% compression ratio
–Pentium III 550 MHz, vary buffer pool size
16 Experimental Setup (2)
Randomly add join conditions on string attributes
Divide queries into workloads
–Number of string join conditions, number of join tables
Metrics: for algorithm X
–Average relative-cost: Average(cost of plan returned by X / cost of opt plan)
–Average blowup factor: Average(# plans searched by X / # plans by System R)
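The two metrics are plain averages of per-query ratios. A minimal sketch, assuming the per-query costs and plan counts are available as parallel lists (the function names and sample numbers are hypothetical):

```python
from statistics import mean

def average_relative_cost(costs_x, costs_opt):
    """Mean over queries of (cost of plan returned by X / cost of optimal plan)."""
    return mean(x / opt for x, opt in zip(costs_x, costs_opt))

def average_blowup_factor(plans_x, plans_system_r):
    """Mean over queries of (# plans searched by X / # plans searched by System R)."""
    return mean(x / base for x, base in zip(plans_x, plans_system_r))

# Made-up numbers for two queries.
rel = average_relative_cost([2.0, 6.0], [1.0, 2.0])
blowup = average_blowup_factor([30, 10], [10, 10])
```

A relative cost of 1.0 means X always found the optimal plan; a blowup factor of 1.0 means X searched no more plans than System R.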
17 Average Relative Cost
Queries with 3-4 join tables, buffer pool 10% of compressed DB
18 Distribution of Query Performance
Percentage of good plans (cost within twice that of OPT) for all queries
19 Optimization Cost
Queries with 3-4 join tables
20 Related Work
How to compress
–Roth&Horn93, Iyer&Wilhite94, Goldstein98
How to query
–Graefe&Shapiro91, Westmann00, Greer99
Query optimization
–Compressed MOLAP aggregates: Li99
–Compressed bitmap indices: Amer-Yahia&Johnson00
–Expensive predicates: Chaudhuri&Shim99, Hellerstein93
21 Conclusions & Future Work
Novel optimization problem
–Search for regular query plan + when to decompress
–Separate search is suboptimal
–OPT and Min-K heuristic
–Up to an order of magnitude improvement in experiments
Future work
–Caching decompressed values
–Updates
22 Search Space
Example plan: input S_A, …; join S_A = C_A; Sort(S_A)
3 places to place D(S_A), giving 3 extended plans (3 is the depth)
–Operators before D(S_A): converted to transient (e.g. transient join)
–Operators after D(S_A): left as they are (e.g. regular sort)
n^m blowup over old space
–n: depth of plan
–m: number of attributes
23 Relative-Cost - Varying Buffer Pool Size
Queries with 3-4 join tables, 2 additional string joins
24 Relative Performance (2)
Queries with more than 5 join tables