MDL Summarization with Holes. Shaofeng Bu, Laks V.S. Lakshmanan, Raymond T. Ng. University of British Columbia, Canada.
VLDB 05, Shaofeng Bu, UBC. Introduction: Multi-dimensional OLAP queries typically produce data-intensive answers. The question is often how to express the large set of cells that satisfy the OLAP query conditions. Simple enumeration is accurate but not necessarily the most intuitive; summaries are not necessarily 100% accurate but can be more intuitive and informative. Summarized answers are easier to understand.
OLAP Data Cube Example: each dimension is associated with a hierarchical tree. [Figure: a 2-D data cube. The location axis lists New York, Vancouver, Edmonton, San Jose, San Francisco, Chicago, Minneapolis, Boston, Summit and Albany, grouped under northwest, midwest and northeast. The clothes axis lists jackets, tops, women's jeans, blouses, skirts, formal wear, men's jeans, dress pants, ties and dress skirts, grouped under women's and men's.]
OLAP Data Cube Example (definitions). Data cell: a pair (c1, c2) where c1 and c2 are leaf nodes in the axis trees, e.g. (Vancouver, ties). Data region: a pair (x1, y1) of axis-tree nodes that describes all data cells covered by those nodes, e.g. (Vancouver, ties), (Vancouver, women's), (northwest, women's). [Figure: the location-by-clothes data cube.]
OLAP Data Cube Example (query answer). Blue cells are the cells that satisfy the query conditions. How do we find a summary of the blue cells in a data cube? [Figure: the location-by-clothes data cube with the blue cells marked.]
MDL Summarization. MDL stands for Minimum Description Length. Regions are used to cover the blue cells; the length of an MDL description is the number of regions and cells it includes; the MDL problem is to find the description with the minimum length.
An Example of MDL Summarization. [Figure: the blue cells of the location-by-clothes data cube covered by regions R1 to R9.]
A Motivating Example: A New Case. Some cells are not blue any more, so regions R1, R3 and R9 (marked "?" in the figure) now contain non-blue cells. MDL summarization needs 10 regions (R2, R4 to R8, R10 to R13) and 8 single blue cells: total length = 18. [Figure: the location-by-clothes data cube with the new set of blue cells.]
Can we do better? Yes. We present a new compression approach, MDL with Holes: identify regions with blue cells, even if they contain non-blue cells; express the included blue cells by using regions with the covered non-blue cells as exceptions; the non-blue cells are called holes.
A Motivating Example: MDL with Holes. The questionable regions are kept and their non-blue cells are subtracted as holes: R1 - (Vancouver, skirts); R3 - (Vancouver, skirts); R9 - (Boston, ties) - (New York, dress skirts). R1 and R3 share a hole: R1 + R3 - (Vancouver, skirts). The other 6 regions (R2, R4, R5, R6, R7, R8) are unchanged. MDL with Holes: length = 6 + 3 + 3 = 12 (six other regions, three regions with holes, three holes). MDL approach: length is 18. [Figure: the location-by-clothes data cube with regions and holes marked.]
Problem Statement. MDL with Holes (MDLH) is the problem of finding a description with holes that has the minimum length and, equivalently, the maximum benefit. In practice, we can drill down on regions to get additional details.
Definitions: Length & Benefit. Given a set B of data cells (blue cells), an MDLH description for B is D = S - H, where S is a set of data regions, H is a set of data cells (also called holes), and D covers exactly the data cells in B.
Length: the total number of regions and cells included in the description, |D| = |S| + |H|.
Benefit: how much shorter the MDLH summary is than the enumeration of B, Benefit(D) = |B| - |D|.
Example 1: B1 = {a, b, c}, D1 = s - d, |D1| = 2, Benefit(D1) = |B1| - |D1| = 1.
Example 2: B2 = {e, g}, D2 = t - f - h, |D2| = 3, Benefit(D2) = |B2| - |D2| = -1.
[Figure: a 1-D hierarchy with leaves a to h, internal nodes s and t, and root x.]
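The two definitions can be checked mechanically. The following is a minimal sketch in Python; the function names `description_length` and `benefit` are illustrative choices for this example, not names from the talk.

```python
def description_length(regions, holes):
    """|D| = |S| + |H|: count the regions plus the holes."""
    return len(regions) + len(holes)


def benefit(blue_cells, regions, holes):
    """Benefit(D) = |B| - |D|."""
    return len(blue_cells) - description_length(regions, holes)


# Example 1 from the slide: B1 = {a, b, c}, D1 = s - d
b1, s1, h1 = {"a", "b", "c"}, ["s"], ["d"]
# Example 2 from the slide: B2 = {e, g}, D2 = t - f - h
b2, s2, h2 = {"e", "g"}, ["t"], ["f", "h"]

print(description_length(s1, h1), benefit(b1, s1, h1))  # 2 1
print(description_length(s2, h2), benefit(b2, s2, h2))  # 3 -1
```

A negative benefit, as in the second example, means that plain enumeration of the blue cells is shorter than the description with holes.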
Related Work.
The Generalized MDL Approach for Summarization, Laks V.S. Lakshmanan, Raymond T. Ng et al., VLDB 2002: reduces description length by allowing non-blue cells to be covered in the regions, so the regions are not pure.
Concise Descriptions of Subsets of Structured Sets, Alberto O. Mendelzon and Ken Q. Pu, PODS 2003: allows Cartesian products to be formed; the setting is not purely hierarchical, so the NP-completeness result is less surprising. What about the purely hierarchical case?
Intelligent Rollups in Multidimensional OLAP Data, Gayatri Sathe and Sunita Sarawagi, VLDB 2001: only reports consistent generalizations; a tuple can be generalized along a set of dimensions only if it can be generalized along all subsets of those dimensions.
Outline: Introduction to MDL with Holes (a motivating example); 1-D Case: MDLH is Tractable; 2-D Case: MDLH is NP-Hard; Heuristics (a greedy heuristic, dynamic programming, quadratic programming); Experimental Results; Summarization on Holes: An Extension; Conclusions & Contributions.
1-D Case: MDLH is Tractable. In the 1-D case, the optimal MDLH description, which has the maximum benefit, can be generated in polynomial time.
Subtree 'x': D1 = x - d - f - j, Benefit(D1) = 7 - 4 = 3; D2 = (s - d) + e + (u - j), Benefit(D2) = 7 - 5 = 2.
Subtree 'y': D3 = y - m - p - q - r, Benefit(D3) = 4 - 5 = -1; D4 = (v - m) + o, Benefit(D4) = 4 - 3 = 1.
Subtree 'z': D5 = z - d - f - j - m - p - q - r, Benefit(D5) = 11 - 8 = 3; D6 = (x - d - f - j) + (v - m + o), Benefit(D6) = 11 - 7 = 4.
[Figure: a 1-D hierarchy with leaves a to r, internal nodes s, t, u, v, w, x, y and root z.]
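A bottom-up pass over the tree is enough to reproduce the description D6 above. The sketch below is an illustration under stated assumptions, not the paper's algorithm as published: the tree shape in `CHILDREN` and the blue set in `BLUE` are inferred from the expressions on the slide (for example, s - d covering three blue cells implies that s has children a, b, c, d), and the recurrence (at each node, keep the children's descriptions unless "node minus all its non-blue leaves" is strictly shorter) is one plausible reading of "bottom-up".

```python
# Hierarchy inferred from the slide's expressions (an assumption of this sketch).
CHILDREN = {
    "z": ["x", "y"],
    "x": ["s", "t", "u"],
    "y": ["v", "w"],
    "s": list("abcd"),
    "t": list("ef"),
    "u": list("ghij"),
    "v": list("klmn"),
    "w": list("opqr"),
}
# The 11 blue leaves implied by D5 = z - d - f - j - m - p - q - r.
BLUE = set("abceghiklno")


def length(desc):
    """Each term is (region, holes); it costs 1 plus one per hole."""
    return sum(1 + len(holes) for _, holes in desc)


def summarize(node):
    """Return (description, leaves under node), built bottom-up."""
    kids = CHILDREN.get(node)
    if kids is None:  # leaf: one term if blue, nothing otherwise
        return ([(node, [])] if node in BLUE else []), [node]
    desc, leaves = [], []
    for kid in kids:
        kid_desc, kid_leaves = summarize(kid)
        desc += kid_desc
        leaves += kid_leaves
    holes = [leaf for leaf in leaves if leaf not in BLUE]
    # Use "node minus holes" only when it is strictly shorter.
    if 1 + len(holes) < length(desc):
        desc = [(node, holes)]
    return desc, leaves


best, _ = summarize("z")
print(best)  # [('x', ['d', 'f', 'j']), ('v', ['m']), ('o', [])]
print(length(best), len(BLUE) - length(best))  # 7 4
```

Each node is visited once and its leaves are scanned once per ancestor, so the running time is polynomial in the number of cells, consistent with the tractability claim for the 1-D case.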
2-D Case: Optimality is not Preserved Any More.
rows                   length  benefit
(a,8), (b,8)              5      -2
(c,8), (d,8), (e,8)       4       0
(f,8), (g,8)              3       2
columns                length  benefit
(i,1)                     3       2
(i,5)                     5      -2
(i,2), (i,3), (i,4)
(i,6), (i,7)
Optimal solution: {(c,8) + (d,8) + (e,8) + (i,2) + (i,3) + (i,4)} - {(c,2) + (c,3) + (c,4) + (d,2) + (d,3) + (d,4) + (e,2) + (e,3) + (e,4)} + (f,1) + (g,1) + (f,6) + (g,7).
Length = 19, Benefit = 9.
[Figure: a 2-D grid with rows a to g under row node i and columns 1 to 7 under column node 8.]
MDLH is NP-Hard in the 2-D Case. It is NP-hard to find the optimal MDLH description in a 2-D data cube. The proof is not trivial; details are in the paper. Reduction strategy: Clique -> Maximum Induced Subgraph in a Complete Edge-Weighted (CEW) Bipartite Graph -> MDL with Holes.
Heuristics for MDLH. Greedy: at each step, choose the row or column with the most benefit. Dynamic programming: a bottom-up method that builds the description of a region from the descriptions of its child regions. Quadratic programming: uses a quadratic function to represent the benefit of a 2-D data cube.
Example for Comparison of the Heuristics. The optimal description for this example is (e,1) - (a,1) + (e,2) - (b,2) + (e,3) - (b,3) + (d,4) + (b,5) + (e,6) + (e,8) + (a,11) - (a,8). Length = 12, Benefit = 8. [Figure: the example grid with row labels a to e.]
Heuristics: A Greedy Heuristic.
region   length  benefit  holes
(e,6)
(d,10)     2       2      (d,5)
(e,1)      2       1      (a,1)
(e,2)      2       1      (b,2)
(e,3)      2       1      (b,3)
(a,11)     2       1      (a,8)
(e,8)      2       1      (a,8)
(c,10)     3       0      (c,4), (c,5)
Description by Greedy: (e,6) + (a,11) + (e,8) - (a,8) + (d,10) - (d,5) + (a,2) + (a,3) + (b,1) + (b,5) + (c,1) + (c,2) + (c,3).
The length is 13 and the benefit is 7.
[Figure: the example grid with row labels a to e.]
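The greedy selection can be sketched in a few lines of Python. Several things here are assumptions of the sketch rather than facts from the talk: the blue cells in `BLUE_BY_ROW` are reconstructed from the descriptions on the slides (rows a to d under row node e, columns 1 to 5 under node 10 and columns 6 to 9 under node 11), the candidate regions are limited to whole columns (e, c) and row segments (r, 10) or (r, 11) as in the table above, and the gain of a candidate is taken to be its newly covered blue cells minus one for the region minus one for each new hole.

```python
ROWS = "abcd"  # leaf rows; 'e' is assumed to be their parent
COL_GROUPS = {10: range(1, 6), 11: range(6, 10)}
# Blue cells reconstructed from the slide descriptions (an assumption).
BLUE_BY_ROW = {"a": [2, 3, 6, 7, 9], "b": [1, 5, 6, 8],
               "c": [1, 2, 3, 6, 8], "d": [1, 2, 3, 4, 6, 8]}
BLUE = {(r, c) for r, cols in BLUE_BY_ROW.items() for c in cols}


def candidate_regions():
    """Whole columns (e, c) and row segments (r, 10) / (r, 11)."""
    regions = {("e", c): {(r, c) for r in ROWS} for c in range(1, 10)}
    for r in ROWS:
        for g, cols in COL_GROUPS.items():
            regions[(r, g)] = {(r, c) for c in cols}
    return regions


def greedy(blue):
    regions = candidate_regions()
    chosen, covered, holes = [], set(), set()

    def gain(cells):
        new_blue = (cells & blue) - covered
        new_holes = (cells - blue) - holes  # holes already paid for are free
        return len(new_blue) - 1 - len(new_holes)

    while regions:
        name = max(regions, key=lambda n: gain(regions[n]))
        if gain(regions[name]) <= 0:
            break  # no remaining candidate shortens the description
        cells = regions.pop(name)
        chosen.append(name)
        covered |= cells & blue
        holes |= cells - blue
    singles = sorted(blue - covered)  # leftover blue cells are listed one by one
    return chosen, sorted(holes), singles


chosen, holes, singles = greedy(BLUE)
print(chosen)  # [('e', 6), ('d', 10), ('e', 8), ('a', 11)]
print(holes)   # [('a', 8), ('d', 5)]
total = len(chosen) + len(holes) + len(singles)
print(total, len(BLUE) - total)  # 13 7
```

Under these assumptions the sketch selects the same regions, holes and single cells as the description by Greedy above, with length 13 and benefit 7.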
Greedy: Why is it not optimal? Selecting one row or column can reduce the benefit still available from other rows and columns, so the total benefit may end up lower. [Figure: the description from Greedy next to the optimal description on the example grid.]
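Heuristics: Dynamic Programming. L is the length of a region; S is the selection of rows and columns.
L      10  11  12
a       2   2   4
b       2   2   4
c       3   2   5
d       2   2   4
S      10  11  12
a      t2  g   t2
b      t2  t2  t2
c      t2  t2  t2
d      g   t2  t2
S for row e, columns 1 to 12: g, g, g, t1, t1, g, t1, g, t1, t2, t2, t2.
Examples: (a,10) is described by (a,2) + (a,3), so L(a,10) = 2 and S(a,10) = 't2'. (e,4) is described by (d,4), so L(e,4) = 1 and S(e,4) = 't1'. (d,10) is described by (d,10) - (d,5), so L(d,10) = 2 and S(d,10) = 'g'.
[Figure: the example grid with row labels a to e and the two axis trees t1 and t2.]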
Heuristics: Dynamic Programming (2). D(x1,x2) is the description for region (x1,x2).
S(e,12) = 't2': D(e,12) = D(e,10) + D(e,11).
S(e,10) = 't2': D(e,10) = D(e,1) + D(e,2) + D(e,3) + D(e,4) + D(e,5).
S(e,11) = 't2': D(e,11) = D(e,6) + D(e,7) + D(e,8) + D(e,9).
D(e,1) = (e,1) - (a,1); D(e,2) = (e,2) - (b,2); D(e,3) = (e,3) - (b,3); D(e,4) = (d,4); D(e,5) = (b,5); D(e,6) = (e,6); D(e,7) = (a,7); D(e,8) = (e,8) - (a,8); D(e,9) = (a,9).
Generated description: (e,1) - (a,1) + (e,2) - (b,2) + (e,3) - (b,3) + (d,4) + (b,5) + (e,6) + (a,7) + (e,8) - (a,8) + (a,9).
The length is 13 and the benefit is 7.
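The bottom-up construction can be sketched as a memoized recursion over pairs of axis-tree nodes. As before, this is an illustration under assumptions: the blue cells are reconstructed from the slide descriptions, and the meaning of the selection codes is inferred from the worked examples ('g' keeps the region itself minus its holes, 't1' combines the child regions along the row tree, 't2' combines them along the column tree). The name `describe` and the tie-breaking order are choices made for this sketch.

```python
from functools import lru_cache

ROW_CHILDREN = {"e": ("a", "b", "c", "d")}
COL_CHILDREN = {12: (10, 11), 10: (1, 2, 3, 4, 5), 11: (6, 7, 8, 9)}
# Blue cells reconstructed from the slide descriptions (an assumption).
BLUE_BY_ROW = {"a": [2, 3, 6, 7, 9], "b": [1, 5, 6, 8],
               "c": [1, 2, 3, 6, 8], "d": [1, 2, 3, 4, 6, 8]}
BLUE = {(r, c) for r, cols in BLUE_BY_ROW.items() for c in cols}


def leaves(node, children):
    kids = children.get(node)
    return [node] if kids is None else [x for k in kids for x in leaves(k, children)]


def combine(parts, choice):
    """Concatenate child descriptions; lengths simply add up."""
    return (sum(p[0] for p in parts), tuple(t for p in parts for t in p[1]), choice)


@lru_cache(maxsize=None)
def describe(r, c):
    """Return (L, terms, S) for region (r, c); each term is (region, holes)."""
    cells = [(x, y) for x in leaves(r, ROW_CHILDREN) for y in leaves(c, COL_CHILDREN)]
    holes = tuple(cell for cell in cells if cell not in BLUE)
    if len(holes) == len(cells):
        return 0, (), None  # no blue cell: nothing to describe
    if len(cells) == 1:
        return 1, (((r, c), ()),), "cell"
    options = []
    if r in ROW_CHILDREN:
        options.append(combine([describe(k, c) for k in ROW_CHILDREN[r]], "t1"))
    if c in COL_CHILDREN:
        options.append(combine([describe(r, k) for k in COL_CHILDREN[c]], "t2"))
    options.append((1 + len(holes), (((r, c), holes),), "g"))
    return min(options, key=lambda o: o[0])  # on ties, the earlier option wins


total, terms, choice = describe("e", 12)
print(total, len(BLUE) - total, choice)  # 13 7 t2
for region, holes in terms:
    print(region, "-", holes)
```

Because each child description is built independently, a hole such as (a,8) cannot be shared between (e,8) and (a,11), which is one way to see how this method can miss combinations of rows and columns.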
Dynamic Programming: Why is it not optimal? It misses combinations of rows and columns. [Figure: the description by Dynamic Programming next to the optimal description on the example grid.]
Heuristics: Quadratic Programming. Each row or column is represented by a variable v: v = 1 means the corresponding row/column is selected, and v = 0 means it is not selected. The objective is f = -Benefit(D), so maximizing the benefit is equivalent to minimizing the value of f. For the previous example, quadratic programming generates the optimal description; in general, optimality is not guaranteed.
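The slide does not give the quadratic function itself, so the following Python sketch shows one possible form under explicit assumptions. A cell is covered when its row segment or its column is selected, which for 0/1 variables is x + y - x*y; the product term is what makes f quadratic. The benefit is then the covered blue cells minus the covered non-blue cells (holes) minus the number of selected regions. The blue cells are reconstructed from the earlier slides, the variables are limited to row segments (r, 10), (r, 11) and whole columns, and the minimum is found by exhaustive search rather than by a quadratic programming solver.

```python
from itertools import product

ROWS = "abcd"
COL_GROUPS = {10: range(1, 6), 11: range(6, 10)}
GROUP_OF = {c: g for g, cols in COL_GROUPS.items() for c in cols}
COLS = range(1, 10)
# Blue cells reconstructed from the slide descriptions (an assumption).
BLUE_BY_ROW = {"a": [2, 3, 6, 7, 9], "b": [1, 5, 6, 8],
               "c": [1, 2, 3, 6, 8], "d": [1, 2, 3, 4, 6, 8]}
BLUE = {(r, c) for r, cols in BLUE_BY_ROW.items() for c in cols}
SEGMENTS = [(r, g) for r in ROWS for g in COL_GROUPS]


def sign(r, c):
    """+1 for a blue cell (a gain when covered), -1 for a hole (a cost)."""
    return 1 if (r, c) in BLUE else -1


def f(x, y):
    """f = -Benefit for 0/1 dicts x (row segments) and y (columns)."""
    benefit = -sum(x.values()) - sum(y.values())
    for r, g in SEGMENTS:
        for c in COL_GROUPS[g]:
            covered = x[(r, g)] + y[c] - x[(r, g)] * y[c]  # quadratic term
            benefit += sign(r, c) * covered
    return -benefit


def minimize():
    best = None
    for bits in product((0, 1), repeat=len(SEGMENTS)):
        x = dict(zip(SEGMENTS, bits))
        # For fixed x, f is linear in y, so each column is decided on its own.
        y = {}
        for c in COLS:
            coeff = 1 - sum(sign(r, c) * (1 - x[(r, GROUP_OF[c])]) for r in ROWS)
            y[c] = 1 if coeff < 0 else 0
        value = f(x, y)
        if best is None or value < best[0]:
            best = (value, x, y)
    return best


value, x, y = minimize()
print(value)                          # -8
print([s for s in SEGMENTS if x[s]])  # [('a', 11)]
print([c for c in COLS if y[c]])      # [1, 2, 3, 6, 8]
```

On this reconstructed example the minimum of f is -8, that is, a benefit of 8 and a length of 12, obtained by selecting (a,11) and columns 1, 2, 3, 6 and 8. Exhaustive search is only feasible for tiny grids; it is used here to make the objective concrete, not to suggest how the heuristic is implemented.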
Experiments. We ran a set of experiments on the TPC-H benchmark data set and compared the three MDLH heuristics with MDL and GMDL.
Experimental Results: Comparison of All Methods. Compression ratio: MDLH-Quadratic generates the most concise descriptions and serves as a yardstick of quality; MDLH-Dynamic is a very close second.
Experimental Results: Compression Ratio. The more children per parent node, the greater the benefit.
Experimental Results: Summary. Running time and scalability: MDLH-Greedy is the fastest; MDLH-Dynamic runs slower than MDLH-Greedy but is still scalable with respect to the number of cells.
Extension: Summarization on Holes. As the blue density becomes high, a large part of the MDLH description is made up of holes. Can we further reduce the total length by summarizing the holes?
MDLH description: (a,11) - {(a,6) + (a,8) + (a,9)} + (d,11) - {(d,6) + (d,7) + (d,8)} + (b,6) + (c,8). Total length is 10.
Summarization on holes: (a,6) + (a,8) + (a,9) = (a,10) - (a,7); (d,6) + (d,7) + (d,8) = (d,10) - (d,9).
After summarization on holes: (a,11) - {(a,10) - (a,7)} + (d,11) - {(d,10) - (d,9)} + (b,6) + (c,8). Total length is 8.
[Figure: the example grid with rows a to e and column nodes 10 and 11.]
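The two totals above follow from counting every region and cell once, including those nested inside a hole expression. A minimal Python sketch of that count, using a nested term representation chosen for this example (a term is a region plus a list of hole terms):

```python
def length(term):
    """A term is (region, holes); each hole is itself a term, so holes can nest."""
    _region, holes = term
    return 1 + sum(length(hole) for hole in holes)


def cell(r, c):
    """A single cell is a term without holes."""
    return ((r, c), [])


# MDLH description before summarizing the holes.
flat = [
    (("a", 11), [cell("a", 6), cell("a", 8), cell("a", 9)]),
    (("d", 11), [cell("d", 6), cell("d", 7), cell("d", 8)]),
    cell("b", 6),
    cell("c", 8),
]

# After summarization: the three holes in each row become "region minus one cell".
nested = [
    (("a", 11), [(("a", 10), [cell("a", 7)])]),
    (("d", 11), [(("d", 10), [cell("d", 9)])]),
    cell("b", 6),
    cell("c", 8),
]

print(sum(length(t) for t in flat))    # 10
print(sum(length(t) for t in nested))  # 8
```

Each group of three holes costs 3 when enumerated and 2 when written as a region minus one cell, which accounts for the drop from 10 to 8.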
Conclusions & Contributions. We presented a new method, MDLH, to compress the answers of OLAP queries; we presented a bottom-up algorithm for the 1-D case; we proved the NP-hardness of the MDLH problem; we provided three heuristics for MDLH: greedy, dynamic programming, and quadratic programming; we extended the method with summarization on holes to further reduce the total length; we ran a set of experiments on the TPC-H benchmark data to compare the heuristics.
Ongoing Work. Based on summarization of blue cells and summarization of holes, build a visualization tool with MDLH summarization that returns summarized answers to users' queries and provides drill-down operations for browsing details on blue cells and on holes. Design a k-approximation algorithm for MDLH: what is the best quality we can guarantee?