1 Fast Computation of Sparse Datacubes Vicky :: Cao Hui Ping Sherman :: Chow Sze Ming CTH :: Chong Tsz Ho Ronald :: Woo Lok Yan Ken :: Yiu Man Lung
2 Content Introduction. Existing Methods. Proposed Method: Partitioned-Cube, Memory-Cube. Experiment. Conclusion.
3 Introduction Datacube queries compute aggregates over database relations at a variety of granularities. Cube by: Product, Country, Date. Aggregation function: Sum(Sales).
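To make the Cube by / Sum(Sales) example concrete, here is a minimal sketch of a naive datacube computation: one group-by per subset of the CUBE BY attributes. The function name `naive_cube`, the sample rows, and the `"ALL"` marker are assumptions for illustration; this is not the method the slides propose.

```python
from collections import defaultdict
from itertools import combinations

ALL = "ALL"  # marker for an aggregated-away attribute (assumed not to be a real value)

def naive_cube(rows, dims, measure):
    """Compute SUM(measure) for every subset of dims, i.e. 2^k group-bys."""
    result = {}
    for r in range(len(dims), -1, -1):
        for group in combinations(dims, r):
            totals = defaultdict(int)
            for row in rows:
                # Attributes outside the current grouping are replaced by ALL.
                key = tuple(row[d] if d in group else ALL for d in dims)
                totals[key] += row[measure]
            result.update(totals)
    return result

# Hypothetical sample data.
rows = [
    {"Product": "pen", "Country": "US", "Date": "2001-01-01", "Sales": 5},
    {"Product": "pen", "Country": "HK", "Date": "2001-01-01", "Sales": 7},
    {"Product": "ink", "Country": "US", "Date": "2001-01-02", "Sales": 3},
]
cube = naive_cube(rows, ["Product", "Country", "Date"], "Sales")
print(cube[("pen", ALL, ALL)])   # total Sales for product "pen"
print(cube[(ALL, ALL, ALL)])     # grand total
```

Each of the 2^k group-bys rescans the input here, which is the kind of cost the methods on the following slides try to avoid.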
4 Sparseness A relation is sparse when its cardinality is a small fraction of the size of the cross product of the attribute domains. Sparse relations are of particular interest, and efficient datacube computation over them is important.
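The sparseness definition can be read as a ratio. The sketch below computes it for a small hypothetical relation; the name `density` is an assumption, and the attribute domains are approximated by the values that actually occur in the data.

```python
from math import prod

def density(rows, dims):
    """Fraction of the attribute cross product that is occupied by distinct tuples."""
    distinct = {tuple(row[d] for d in dims) for row in rows}
    # Assumption: each domain is taken to be the set of observed values.
    domain_sizes = [len({row[d] for row in rows}) for d in dims]
    return len(distinct) / prod(domain_sizes)

# Hypothetical sample data: 3 distinct tuples over a 2 x 2 x 2 cross product.
rows = [
    {"Product": "pen", "Country": "US", "Date": "2001-01-01"},
    {"Product": "pen", "Country": "HK", "Date": "2001-01-01"},
    {"Product": "ink", "Country": "US", "Date": "2001-01-02"},
]
ratio = density(rows, ["Product", "Country", "Date"])
print(ratio)
```

The smaller this ratio, the sparser the relation.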
5 Problem Large domains for the CUBE BY attributes. Large number of CUBE BY attributes. Existing methods are not efficient. We need something new: Partitioned-Cube.
6 Existing Methods PIPESORT: Optimizes overall cost by evaluating each path. Poor performance when the relation is sparse. The lower bound on the number of sorts is exponential in k, the number of CUBE BY attributes. Large I/O cost for huge cuboids.
7 OVERLAP Minimizes disk accesses by overlapping cuboids. But the I/O cost is at least quadratic in k, even given memory-sized partitions. Classifies the cuboids into “Partition” and “SortRun” states. I/O depends on the partition size and the number of sorted runs.
8 Array-Based Algorithms Partition the data and store fragments in memory. Data compression may be applied. Allow direct access to the memory cells. For sparse data, array fragments may not fit into memory; a more costly data structure would then be required.
9 Partitioned-Cube Partitions the large relation into fragments that fit in memory. It follows the recursive structure of datacubes. A sub-datacube is obtained by fixing each possible value of a CUBE BY attribute.
10 Partitioned-Cube (cont.)
Algorithm Partitioned-Cube(R, {B1, …, Bm}, A, G)
R: a set of tuples
{B1, …, Bm}: CUBE BY attributes
A: attribute to be aggregated
G: aggregate function
F: finest-granularity datacube tuples
D: remaining datacube tuples
Step 1: if (R fits in memory) then return Memory-Cube(R, {B1, …, Bm}, A, G)
Step 2: scan R and partition it on an attribute Bj in {B1, …, Bm} into R1, …, Rn
Step 3: for (i = 1 to n) (Fi, Di) = Partitioned-Cube(Ri, {B1, …, Bm}, A, G)
Step 4: let F = union of the Fi's
Step 5: let (F', D') = Partitioned-Cube(F, {B1, …, Bm} without Bj, A, G)
Step 6: let D = union of F', D' and the Di's
Step 7: return (F, D)
Example relation R:
Country  Year  Sales
US       2000  10
US       2001  5
US       2000  8
US       2002  6
HK       2000  6
HK       2001  8
HK       2001  7
HK       2002  7
11 Partitioned-Cube (cont.) STEP 1: Partition the large relation into fragments that fit in memory
R:
Country  Year  Sales
US       2000  10
US       2001  5
US       2000  8
US       2002  6
HK       2000  6
HK       2001  8
HK       2001  7
HK       2002  7
R1:
Country  Year  Sales
US       2000  10
US       2001  5
US       2000  8
US       2002  6
R2:
Country  Year  Sales
HK       2000  6
HK       2001  8
HK       2001  7
HK       2002  7
12 Partitioned-Cube (cont.) STEP 2: Compute the tuples in the corresponding sub-datacube
R1:
Country  Year  Sales
US       2000  10
US       2001  5
US       2000  8
US       2002  6
F1:
Country  Year  Sales
US       2000  18
US       2001  5
US       2002  6
D1:
Country  Year  Sales
US       ALL   29
13 Partitioned-Cube (cont.) STEP 3: In the same way, compute F2 and D2
R2:
Country  Year  Sales
HK       2000  6
HK       2001  8
HK       2001  7
HK       2002  7
F2:
Country  Year  Sales
HK       2000  6
HK       2001  15
HK       2002  7
D2:
Country  Year  Sales
HK       ALL   28
14 Partitioned-Cube (cont.)
Step 4: F = union of F1 and F2
Step 5: By recursively calling this function on F, get F' and D'
F:
Country  Year  Sales
US       2000  18
US       2001  5
US       2002  6
HK       2000  6
HK       2001  15
HK       2002  7
F':
Country  Year  Sales
ALL      2000  24
ALL      2001  20
ALL      2002  13
D':
Country  Year  Sales
ALL      ALL   57
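15 Partitioned-Cube (cont.)
Step 6: D = union of F', D', D1 and D2
Step 7: return F, D
F:
Country  Year  Sales
US       2000  18
US       2001  5
US       2002  6
HK       2000  6
HK       2001  15
HK       2002  7
D consists of F', D', D1 and D2.
F':
Country  Year  Sales
ALL      2000  24
ALL      2001  20
ALL      2002  13
D':
Country  Year  Sales
ALL      ALL   57
D1:
Country  Year  Sales
US       ALL   29
D2:
Country  Year  Sales
HK       ALL   28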
16 Partitioned-Cube (cont.) Recursively execute STEP 2 if there are more than 2 attributes
R1:
Country  Year  Sales
US       2000  10
US       2001  5
US       2000  8
US       2002  6
F1:
Country  Year  Sales
US       2000  18
US       2001  5
US       2002  6
D1:
Country  Year  Sales
US       ALL   29
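The recursion walked through on the preceding slides can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the authors' implementation: "fits in memory" is simulated by a row-count `limit`, `memory_cube` is a naive stand-in for Memory-Cube, the aggregate is fixed to SUM, and the call on each partition cubes only the remaining attributes because the partitioning attribute is constant inside a partition. The sample rows follow the Country/Year/Sales example; the first US row (2000, 10) is inferred from the US total of 29.

```python
from collections import defaultdict
from itertools import combinations

ALL = "ALL"

def memory_cube(rows, dims):
    """Naive stand-in for Memory-Cube. Positions in dims are cubed; all other
    positions are carried unchanged. Returns (F, D)."""
    if not rows:
        return [], []
    width = len(rows[0]) - 1  # last field is the measure
    finest, rest = {}, {}
    for r in range(len(dims), -1, -1):
        for keep in combinations(dims, r):
            target = finest if r == len(dims) else rest
            for row in rows:
                key = tuple(ALL if (p in dims and p not in keep) else row[p]
                            for p in range(width))
                target[key] = target.get(key, 0) + row[-1]
    as_rows = lambda d: [k + (v,) for k, v in d.items()]
    return as_rows(finest), as_rows(rest)

def partitioned_cube(rows, dims, limit):
    """Sketch of Partitioned-Cube with SUM. Returns (F, D)."""
    if len(rows) <= limit or not dims:       # Step 1 (memory test is simulated)
        return memory_cube(rows, dims)
    j = dims[0]                              # arbitrary choice of partitioning attribute
    rest_dims = [p for p in dims if p != j]
    parts = defaultdict(list)                # Step 2: partition on B_j
    for row in rows:
        parts[row[j]].append(row)
    F, D = [], []
    for part in parts.values():              # Steps 3-4: cube each partition
        F_i, D_i = partitioned_cube(part, rest_dims, limit)
        F += F_i
        D += D_i
    # Step 5: cube F over the remaining attributes, with B_j set to ALL.
    projected = [row[:j] + (ALL,) + row[j + 1:] for row in F]
    F2, D2 = partitioned_cube(projected, rest_dims, limit)
    return F, D + F2 + D2                    # Steps 6-7

R = [("US", 2000, 10), ("US", 2001, 5), ("US", 2000, 8), ("US", 2002, 6),
     ("HK", 2000, 6), ("HK", 2001, 8), ("HK", 2001, 7), ("HK", 2002, 7)]
F, D = partitioned_cube(R, [0, 1], limit=4)
for row in F + D:
    print(row)
```

With `limit=4` the relation is split into the US and HK partitions, as on the slides, and the recursive call on F produces the tuples with Country = ALL.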
17 Memory-Cube Performs complex operations over each fragment independently. Minimizes the total number of paths in the search lattice. Shares the sort work. Computes the tuples in the corresponding sub-datacube. Computes the datacube tuples with the value ALL for the attributes.
18 Memory-Cube Minimize the total number of paths in the search lattice
G(1) = D → ∅
G(2) = CD → C → ∅; D
G(3) = BCD → BC → B → ∅; BD → D; CD → C
G(4) = ABCD → ABC → AB → A → ∅; ABD → AD → D; ACD → AC → C; BCD → BC → B; BD; CD
Number of paths in G(4): 6 = C(4, 2)
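The pattern from G(1) to G(4) suggests a recursive construction: prefix the new attribute to a copy of the previous paths, then move the last node of each old path to the end of its prefixed copy. The sketch below reproduces the paths listed on the slide under that reading; the function name `paths` is an assumption, and the empty string stands for the empty group-by ∅.

```python
from math import comb

def paths(attrs):
    """Build the path set G(k) for the attributes in attrs (e.g. "ABCD")."""
    g = [[""]]  # G(0): one path holding only the empty group-by
    for a in reversed(attrs):
        left = [[a + node for node in path] for path in g]  # copy prefixed with a
        right = [list(path) for path in g]                  # unprefixed copy
        for lp, rp in zip(left, right):
            lp.append(rp.pop())       # move the last node of the right path
        g = left + [p for p in right if p]  # drop right paths that became empty
    return g

g4 = paths("ABCD")
for p in g4:
    print(" -> ".join(node or "∅" for node in p))
print(len(g4), comb(4, 2))
```

Every one of the 2^k group-bys appears on exactly one path, and each step along a path drops one attribute.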
19 Memory-Cube Share Sort Work Reordering the sorting sequence can improve performance. The sort result for a shorter sort order can be reused for a longer one. E.g. S6 = CD, S3 = CAD: after sorting for S6, the entire relation does not have to be re-sorted for S3; only each block of tuples that shares a C value needs to be independently sorted in AD order.
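The CD to CAD example can be sketched as follows: once the relation is sorted on CD, each block of tuples sharing a C value is sorted on (A, D) on its own. The helper name `resort_within_blocks` and the sample tuples are assumptions for illustration.

```python
from itertools import groupby
from operator import itemgetter

def resort_within_blocks(rows, block_key, inner_key):
    """rows must already be ordered by block_key; each block is sorted by inner_key."""
    out = []
    for _, block in groupby(rows, key=block_key):
        out.extend(sorted(block, key=inner_key))
    return out

# Hypothetical tuples laid out as (A, C, D).
rows = [(2, 1, 9), (1, 2, 3), (1, 1, 7), (2, 2, 1), (1, 1, 4)]
by_cd = sorted(rows, key=itemgetter(1, 2))                     # sort order CD
by_cad = resort_within_blocks(by_cd, itemgetter(1), itemgetter(0, 2))  # sort order CAD
print(by_cad)
```

The result is the same as sorting the whole relation on (C, A, D), but each sort call only touches one block.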
20 Memory-Cube Sort the in-memory relation according to the attributes of the path. Like PIPESORT, make a single scan through the data. Aggregate all small fragments on the path. Output the datacube result by combining these small fragments.
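A single scan that aggregates every group-by on one path can be sketched like this, assuming SUM as the aggregate and a path whose group-bys are the prefixes of the sort order. The function name `aggregate_path` and the sample rows are assumptions for illustration.

```python
def aggregate_path(sorted_rows, k):
    """sorted_rows: tuples (b1, ..., bk, value) sorted on (b1, ..., bk).
    Returns {prefix_length: [(prefix, total), ...]} for every prefix of the sort order."""
    out = {n: [] for n in range(k + 1)}
    current = [None] * (k + 1)        # current[n] = [prefix, running total]
    for row in sorted_rows:
        key, value = row[:k], row[k]
        for n in range(k + 1):
            prefix = key[:n]
            if current[n] is not None and current[n][0] != prefix:
                out[n].append(tuple(current[n]))   # group finished: emit it
                current[n] = None
            if current[n] is None:
                current[n] = [prefix, 0]
            current[n][1] += value
    for n in range(k + 1):            # flush the groups still open at the end
        if current[n] is not None:
            out[n].append(tuple(current[n]))
    return out

# Hypothetical rows sorted on (Country, Year).
rows = [("HK", 2000, 6), ("HK", 2001, 8), ("HK", 2001, 7), ("US", 2000, 8)]
result = aggregate_path(rows, 2)
for n, groups in result.items():
    print(n, groups)
```

Only one running total per path level is kept in memory, so the scan needs no extra pass per group-by.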
21 Solution Analysis I/O cost is linear in k, the number of CUBE BY attributes. CPU cost (in-memory sorts) is exponential in k. CPU cost should be dominated by the I/O time.
22 Experiment CPU time vs. number of tuples. Exponential in the number of CUBE BY attributes.
23 Experiment CPU, I/O, and CPU usage % vs. number of CUBE BY attributes. CPU usage % drops for a large number of CUBE BY attributes.
24 Experiment Share sorting work. CPU time is dominated by I/O time.
25 Conclusion Partitioned-Cube computes datacubes quickly over large sparse relations. Minimizes the number of sort orders. Shows the advantages of sharing sort orders in datacube computation. First solution with linear I/O cost.
26 Reference Kenneth A. Ross, Divesh Srivastava: Fast Computation of Sparse Datacubes. VLDB 1997.
27 Q & A Section