
1 Efficient Incremental Maintenance of Data Cubes 2006. 9. 15. Ki Yong Lee Software Laboratories Samsung Electronics Co., Ltd. Myoung Ho Kim Division of Computer Science Korea Advanced Institute of Science and Technology

2 Outline (K. Y. Lee and M. H. Kim, 32nd International Conference on Very Large Data Bases)
Introduction
Related work
– Incremental maintenance of aggregate views
– Incremental maintenance of data cubes
Our approach
– Key idea
– Problem formulation
– Heuristic algorithm
Performance evaluation
Conclusion

3 Data Cube
A generalized group-by operator [GBP96]
– Computes group-bys for all possible combinations of a given set of attributes
– With n dimension attributes, this yields $2^n$ group-bys
Example: SELECT a, b, SUM(m) FROM F CUBE BY a, b is equivalent to the following four group-bys:
SELECT a, b, SUM(m) FROM F GROUP BY a, b
SELECT a, '*', SUM(m) FROM F GROUP BY a
SELECT '*', b, SUM(m) FROM F GROUP BY b
SELECT '*', '*', SUM(m) FROM F (GROUP BY ∅)
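The expansion of CUBE BY into $2^n$ group-bys can be sketched in a few lines of Python. This is only an illustration: the function name `cube_by`, the in-memory row format, and the sample rows are assumptions made for the example, not part of the presentation.

```python
from collections import defaultdict
from itertools import combinations


def cube_by(rows, dims, measure):
    """Return {grouping attributes: {group key: SUM(measure)}} for every subset of dims."""
    cube = {}
    for k in range(len(dims), -1, -1):
        for attrs in combinations(dims, k):
            sums = defaultdict(int)
            for row in rows:
                # '*' marks a dimension that has been aggregated away.
                key = tuple(row[d] if d in attrs else "*" for d in dims)
                sums[key] += row[measure]
            cube[attrs] = dict(sums)
    return cube


# Hypothetical fact table F with dimensions a, b and measure m.
F = [
    {"a": 1, "b": 1, "m": 3},
    {"a": 1, "b": 2, "m": 4},
    {"a": 2, "b": 1, "m": 5},
]
cube = cube_by(F, ("a", "b"), "m")
print(len(cube))      # number of group-bys for n = 2
print(cube[("a",)])   # GROUP BY a
print(cube[()])       # grand total
```

Each key of the returned dictionary corresponds to one of the group-bys listed on the slide; the empty tuple plays the role of the group-by over no attributes.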

4 Cube Lattice
We represent a data cube as a lattice diagram [HRU96]
– Each node represents a group-by, which is called a cuboid
– Each edge (q_i, q_j) represents that q_j can be computed from q_i by aggregation
[Figure: cube lattice for the dimension attributes a, b, c, d. Rows from top to bottom: abcd; abc, abd, acd, bcd; ab, ac, ad, bc, bd, cd; a, b, c, d; ∅.]
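As a rough sketch, the lattice can be built by enumerating attribute subsets. The code below assumes single-letter attribute names, represents the empty cuboid ∅ as the empty string, and draws an edge only between cuboids that differ by exactly one attribute (the usual way such a diagram is drawn); the name `cube_lattice` is hypothetical.

```python
from itertools import combinations


def cube_lattice(dims):
    """Return (nodes, edges); an edge (qi, qj) means qj is qi with one attribute removed."""
    nodes = ["".join(c) for k in range(len(dims), -1, -1)
             for c in combinations(dims, k)]
    # Dropping one attribute d from qi gives a cuboid computable from qi by aggregation.
    edges = [(qi, qi.replace(d, "")) for qi in nodes for d in qi]
    return nodes, edges


nodes, edges = cube_lattice("abcd")
print(len(nodes), len(edges))
print([qj for qi, qj in edges if qi == "abc"])
```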

5 Maintenance of Data Cubes
A data cube is typically stored as a materialized view
How can we update a data cube efficiently when the source relations change?
[Figure: the materialized cube lattice of SELECT a, b, c, d, SUM(m) FROM F CUBE BY a, b, c, d has to be brought up to date with SELECT a, b, c, d, SUM(m) FROM F' CUBE BY a, b, c, d once the source relation F changes to F'.]

6 Related Work (1/2)
Incremental maintenance of an aggregate view
A: SELECT a, b, c, SUM(m) FROM F GROUP BY a, b, c
ΔA: SELECT a, b, c, SUM(m) FROM ΔF GROUP BY a, b, c
A' is obtained by applying ΔA to A

F
a  b  c  m
1  1  1  3
1  2  2  3
2  1  3  5
2  1  3  4

A
a  b  c  SUM(m)
1  1  1  3
1  2  2  3
2  1  3  9

ΔF
a  b  c  m
1  2  2  4
2  3  1  2
2  3  1  6

ΔA
a  b  c  SUM(m)
1  2  2  4
2  3  1  8

A'
a  b  c  SUM(m)
1  1  1  3
1  2  2  3 + 4
2  1  3  9
2  3  1  8
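The propagate and refresh steps for a single aggregate view can be sketched as follows, using the F and ΔF tuples from this slide. The helper names `group_sum` and `refresh` are hypothetical, and the sketch assumes the change consists of insertions only, which is all the slide shows.

```python
def group_sum(rows):
    """SELECT a, b, c, SUM(m) ... GROUP BY a, b, c over (a, b, c, m) tuples."""
    view = {}
    for a, b, c, m in rows:
        view[(a, b, c)] = view.get((a, b, c), 0) + m
    return view


def refresh(view, delta):
    """Apply a delta view to a materialized view (insertions only)."""
    refreshed = dict(view)
    for key, m in delta.items():
        refreshed[key] = refreshed.get(key, 0) + m
    return refreshed


F = [(1, 1, 1, 3), (1, 2, 2, 3), (2, 1, 3, 5), (2, 1, 3, 4)]
delta_F = [(1, 2, 2, 4), (2, 3, 1, 2), (2, 3, 1, 6)]

A = group_sum(F)              # materialized view
delta_A = group_sum(delta_F)  # computed from the change only
A_new = refresh(A, delta_A)
print(A_new)
```

The refreshed view equals what a full recomputation over F together with ΔF would give, but only ΔF is scanned.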

7 Related Work (2/2)
Incremental maintenance of a data cube
– Propagate stage: computes the delta cube
– Refresh stage: refreshes the data cube by the delta cube
[Figure: the delta cube (delta cuboids Δabc, Δab, Δac, Δbc, Δa, Δb, Δc, Δ∅) is computed from ΔF and applied to the original cube (abc, ab, ac, bc, a, b, c, ∅), giving the updated cube (abc', ab', ac', bc', a', b', c', ∅').]

8 Motivation
To incrementally maintain a data cube with $2^n$ cuboids, existing methods compute $2^n$ delta cuboids
– As n increases, the maintenance cost increases significantly
[Figure: the original cube lattice over a, b, c, d (16 cuboids) next to the delta cube lattice (16 delta cuboids, Δabcd down to Δ∅).]

9 Motivation (cont’d)
Each cuboid is refreshed separately in existing methods
$2^n$ delta cuboids are used
[Figure: each of the 16 cuboids is refreshed by its own delta cuboid: Δ∅ → ∅, Δa → a, Δb → b, Δc → c, Δd → d, Δab → ab, Δac → ac, Δad → ad, Δbc → bc, Δbd → bd, Δcd → cd, Δabc → abc, Δabd → abd, Δacd → acd, Δbcd → bcd, Δabcd → abcd.]

10 Key Idea
Refresh more than one cuboid by a delta cuboid
[Figure: six delta cuboids (Δabcd, Δabd, Δacd, Δbcd, Δbc, Δcd) together refresh all 16 cuboids; for example, Δabcd refreshes abcd, abc, ab, a, and ∅.]

11 Key Idea (cont’d)
Benefit
– The number of delta cuboids that need to be computed is reduced
Δ∅, Δa, Δb, Δc, Δd, Δab, Δac, Δad, Δbc, Δbd, Δcd, Δabc, Δabd, Δacd, Δbcd, Δabcd: $2^n$ = 16 delta cuboids need to be computed
Δbc, Δcd, Δabd, Δacd, Δbcd, Δabcd: only 6 delta cuboids need to be computed

12 Key Idea (cont’d)
Refreshing more than one cuboid by a delta cuboid

Δabc
a  b  c  SUM(m)
1  1  3  4
1  1  1  6
1  1  2  9
1  2  2  2
1  2  1  4
1  2  3  8

Δab
a  b  c  SUM(m)
1  1  *  19
1  2  *  14

abc
a  b  c  SUM(m)
1  1  2  3
1  1  3  4
1  1  1  2
1  2  3  5
1  2  1  2
1  2  2  6

ab
a  b  c  SUM(m)
1  1  *  9
1  2  *  7

The first tuple of Δabc, (1, 1, 3, 4), refreshes both cuboids: abc(1, 1, 3): 4 + 4 → 8 and ab(1, 1, *): 9 + 4 → 13

13 Key Idea (cont’d)
However, this method requires more access to ab
– Refreshing ab by Δab accesses ab |Δab| times
– Refreshing ab by Δabc accesses ab |Δabc| times
[Figure: the tables Δabc, Δab, abc, and ab from the previous slide, with the sizes |Δab| and |Δabc| marked.]

14 Key Idea (cont’d)
But, if Δabc is sorted by a, b and c, ab can be refreshed by Δabc without more access to ab

Δabc (sorted by a, b, c)
a  b  c  SUM(m)
1  1  1  6
1  1  2  9
1  1  3  4
1  2  1  4
1  2  2  2
1  2  3  8

ab
a  b  c  SUM(m)
1  1  *  9
1  2  *  7

The running sum of m over the Δabc tuples with (a, b) = (1, 1) is 6, 15, 19; when that run ends, ab(1, 1, *) is refreshed once: 9 + 19 → 28
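The single-pass refresh described on this slide can be sketched with `itertools.groupby`, which groups consecutive tuples and therefore relies on the delta cuboid being sorted. The function name `refresh_coarser` and the dictionary representation of cuboids are assumptions for illustration; the values are the Δabc and ab tuples shown above.

```python
from itertools import groupby


def refresh_coarser(cuboid, sorted_delta, prefix_len):
    """Refresh a cuboid keyed on the first prefix_len attributes in one pass
    over a delta cuboid that is sorted on its grouping attributes."""
    refreshed = dict(cuboid)
    for prefix, run in groupby(sorted_delta, key=lambda t: t[0][:prefix_len]):
        subtotal = sum(m for _, m in run)  # running sum over one run of equal prefixes
        # One access to the coarser cuboid per run, not one per delta tuple.
        refreshed[prefix] = refreshed.get(prefix, 0) + subtotal
    return refreshed


# Delta cuboid sorted by (a, b, c), as ((a, b, c), SUM(m)) pairs.
delta_abc = [((1, 1, 1), 6), ((1, 1, 2), 9), ((1, 1, 3), 4),
             ((1, 2, 1), 4), ((1, 2, 2), 2), ((1, 2, 3), 8)]
ab = {(1, 1): 9, (1, 2): 7}

ab_new = refresh_coarser(ab, delta_abc, prefix_len=2)
print(ab_new[(1, 1)])  # 9 + 19
```

Because equal (a, b) prefixes are adjacent in the sorted input, ab is touched |Δab| times rather than |Δabc| times.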

15 Key Idea (cont’d)
A delta cuboid can be easily sorted during its computation with little or no additional cost
– In most existing commercial relational database systems, aggregation algorithms are based on sorting [G93]
– If a group-by is computed by sorting algorithms, sorted results on the grouping attributes can be easily obtained
We assume that a delta cuboid is computed by sorting-based algorithms
– Thus, the above method can be applied to any delta cuboid with no additional sorting cost

16 Generalization of the Idea
Δ[d_1 d_2 … d_k]
– A delta cuboid sorted in the order of attributes d_1, d_2, …, d_k
The following set of cuboids can be refreshed by Δ[d_1 d_2 … d_k]
– {d_1 d_2 … d_k, d_1 d_2 … d_(k-1), …, d_1, ∅}
Δ[d_1 d_2 … d_k] → {q_1, q_2, …, q_i}
– Cuboids q_1, q_2, …, q_i are refreshed by Δ[d_1 d_2 … d_k]
Example
– Δ[abcd] → {abcd, abc, ab, a, ∅}: cuboids abcd, abc, ab, a, ∅ are refreshed by Δ[abcd]
– Δ[acdb] → {acdb, acd, ac, a, ∅}: cuboids acdb, acd, ac, a, ∅ are refreshed by Δ[acdb]
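Put differently, the cuboids that a sorted delta cuboid can refresh are exactly the prefixes of its sort order. A minimal sketch, assuming single-letter attribute names and the empty string for ∅ (the function name `refreshed_by` is hypothetical):

```python
def refreshed_by(sort_order):
    """Cuboids refreshable by a delta cuboid sorted in sort_order: all of its prefixes."""
    return [sort_order[:k] for k in range(len(sort_order), -1, -1)]


print(refreshed_by("abcd"))
print(refreshed_by("acdb"))
```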

17 Our Approach
We propose an incremental maintenance method that can maintain a data cube with $2^n$ cuboids using only a subset of the $2^n$ delta cuboids
Example for n = 3:
– $2^3$ = 8 delta cuboids: Δ[abc] → {abc}, Δ[ab] → {ab}, Δ[ca] → {ca}, Δ[bc] → {bc}, Δ[a] → {a}, Δ[b] → {b}, Δ[c] → {c}, Δ∅ → {∅}
– 4 delta cuboids: Δ[acb] → {acb, ac}, Δ[ba] → {ba, b}, Δ[cb] → {cb, c}, Δ[a] → {a, ∅}
– 3 delta cuboids: Δ[abc] → {abc, ab, a, ∅}, Δ[ca] → {ca, c}, Δ[bc] → {bc, b}

18 Cost of Computing Delta Cuboids
Different sets of delta cuboids incur different computation cost
We represent the cost of computing delta cuboids by the cost of a delta cuboid computation plan
[Figure: three delta cuboid computation plans drawn on the delta lattice (Δabc; Δab, Δbc, Δca; Δa, Δb, Δc; Δ∅), one for each set of delta cuboids: {Δ[abc], Δ[ab], Δ[ca], Δ[bc], Δ[a], Δ[b], Δ[c], Δ∅} with cost labels 16, 15, 14, 8, 7, 6, 3; {Δ[acb], Δ[ba], Δ[cb], Δ[a]} with cost labels 16, 14, 8; {Δ[abc], Δ[ca], Δ[bc]} with cost labels 15, 14.]

19 Problem Formulation
Delta cube
– ΔQ = {Δq_1, Δq_2, …, Δq_m}, where Δq_i is a delta cuboid
Refresh chain
– A sequence ⟨Δq_1, Δq_2, …, Δq_i⟩ of elements in ΔQ such that q_1 ⊃ q_2 ⊃ … ⊃ q_i
– ⟨Δq_1, Δq_2, …, Δq_i⟩ implies Δ[q_1] → {q_1, q_2, …, q_i}
– Example: ⟨Δabc, Δab, Δa⟩ implies Δ[abc] → {abc, ab, a}; ⟨Δcba, Δcb, Δc⟩ implies Δ[cba] → {cba, cb, c}

20 Problem Formulation (cont’d)
Refresh partition
– A partition of the elements of ΔQ into disjoint refresh chains
– Example: {⟨Δacb, Δac⟩, ⟨Δba, Δb⟩, ⟨Δcb, Δc⟩, ⟨Δa, Δ∅⟩} implies Δ[acb] → {acb, ac}, Δ[ba] → {ba, b}, Δ[cb] → {cb, c}, Δ[a] → {a, ∅}
– Example: {⟨Δabc, Δab, Δa, Δ∅⟩, ⟨Δca, Δc⟩, ⟨Δbc, Δb⟩} implies Δ[abc] → {abc, ab, a, ∅}, Δ[ca] → {ca, c}, Δ[bc] → {bc, b}
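The definitions of refresh chain and refresh partition can be checked mechanically. The sketch below is an illustration under stated assumptions: cuboids are strings of single-letter attributes, ∅ is the empty string, chains are listed from the largest cuboid to the smallest, and the helper names `sort_order` and `is_refresh_partition` are hypothetical.

```python
from itertools import combinations


def sort_order(chain):
    """Sort order for the first delta cuboid of a chain such that every chain
    member is a prefix of it. Raises ValueError if the cuboids are not nested."""
    order = ""
    for q in reversed(chain):  # start from the smallest cuboid
        if not set(order) <= set(q):
            raise ValueError(f"{chain} is not a chain of nested cuboids")
        order += "".join(sorted(set(q) - set(order)))
    return order


def is_refresh_partition(chains, dims):
    """True if the chains are nested, pairwise disjoint, and cover all 2^n cuboids."""
    all_cuboids = {frozenset(c) for k in range(len(dims) + 1)
                   for c in combinations(dims, k)}
    members = [frozenset(q) for chain in chains for q in chain]
    try:
        for chain in chains:
            sort_order(chain)
    except ValueError:
        return False
    return len(members) == len(set(members)) and set(members) == all_cuboids


four_chains = [["acb", "ac"], ["ba", "b"], ["cb", "c"], ["a", ""]]
three_chains = [["abc", "ab", "a", ""], ["ca", "c"], ["bc", "b"]]
print(is_refresh_partition(four_chains, "abc"), is_refresh_partition(three_chains, "abc"))
print([sort_order(chain) for chain in three_chains])
```

Both example partitions from the slide pass the check; the second needs only three sorted delta cuboids.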

21 Problem Formulation (cont’d)
Delta cuboid computation plan
– A subtree of the delta lattice including at least all of the first elements of refresh chains in a given refresh partition
– Example: [Figure: a refresh partition with three refresh chains and a corresponding delta cuboid computation plan drawn on the delta lattice (Δabc; Δab, Δbc, Δca; Δa, Δb, Δc; Δ∅), with cost labels 15 and 14.]

22 Problem Statement
Given a delta cube and its delta lattice, find a refresh partition that minimizes the cost of a delta cuboid computation plan
Delta cube: {Δabc, Δab, Δac, Δbc, Δa, Δb, Δc, Δ∅}
[Figure: from the delta cube above, find a refresh partition (here with three refresh chains) and its delta cuboid computation plan on the delta lattice, with cost labels 15 and 14.]

23 NP-Hardness of the Problem
For a given refresh partition, finding the minimum cost delta cuboid computation plan is NP-complete
– (proved in the paper)
Our problem is NP-hard
– Our problem is at least as hard as finding the minimum cost delta cuboid computation plan
– Moreover, there can be many refresh partitions for a given delta cube
Hence, we resort to heuristic approaches

24 Idea behind Our Heuristic
As the number of delta cuboids to be computed increases, the cost of a delta cuboid computation plan increases
Hence, we make the number of refresh chains in a refresh partition as small as possible
Example: {Δab} {Δa} {Δb} {Δ∅} → {Δab, Δa} {Δb} {Δ∅} → {Δab, Δa} {Δb, Δ∅}
The minimum number of refresh chains in a refresh partition for a delta cube with $2^n$ delta cuboids = $\binom{n}{\lfloor n/2 \rfloor}$ (proved in the paper)
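That $\binom{n}{\lfloor n/2 \rfloor}$ chains are enough to cover all $2^n$ cuboids can be illustrated with the classical symmetric chain decomposition of the subset lattice. Note that this is a swapped-in textbook construction, not the cost-driven heuristic of the presentation: it ignores computation cost entirely and only shows that the count is achievable. The function name `symmetric_chains` is hypothetical; cuboids are strings of single-letter attributes and ∅ is the empty string.

```python
from math import comb


def symmetric_chains(dims):
    """Partition all subsets of dims into chains, each listed from smallest to
    largest, where every member is a string prefix of the next one."""
    chains = [[""]]
    for x in dims:
        extended = []
        for chain in chains:
            # Grow the chain by appending x to its largest member.
            extended.append(chain + [chain[-1] + x])
            # Shifted copy: put x in front of every member except the largest.
            if len(chain) > 1:
                extended.append([x + q for q in chain[:-1]])
        chains = extended
    return chains


chains = symmetric_chains("abc")
print(chains)
print(len(symmetric_chains("abcd")), comb(4, 2))
```

The last member of each chain gives a sort order whose prefixes are the other members, so one sorted delta cuboid per chain suffices. For n = 3 the result coincides with the three-chain example Δ[abc], Δ[ca], Δ[bc] shown earlier.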

25 Heuristic Algorithm
Starts from the refresh partition with $2^n$ refresh chains
– Each refresh chain consists of only one delta cuboid
Repeatedly merges refresh chains until there are exactly $\binom{n}{\lfloor n/2 \rfloor}$ refresh chains in the refresh partition
– Whenever refresh chains are merged, a new delta cuboid computation plan with less cost is produced
Example for n = 3:
– Start ($2^n$ = 8 refresh chains): {Δ∅}, {Δa}, {Δb}, {Δc}, {Δab}, {Δca}, {Δbc}, {Δabc}
– Intermediate: {Δb, Δ∅}, {Δa, Δab}, {Δc}, {Δca}, {Δbc}, {Δabc}
– End ($\binom{n}{\lfloor n/2 \rfloor}$ = 3 refresh chains): {Δca, Δc}, {Δbc, Δb, Δ∅}, {Δabc, Δab, Δa}

26 Example of the Algorithm
(1) Input: the delta lattice, with levels labeled Lv(0) to Lv(3): Δabc; Δab, Δbc, Δca; Δa, Δb, Δc; Δ∅
(2) Step 1: {Δabc} {Δca} {Δb} {Δ∅} {Δa} {Δc} {Δbc} {Δab} (cost labels 16, 15, 14, 8, 7, 6, 3)
(3) Step 2: {Δabc} {Δca} {Δb, Δ∅} {Δa} {Δc} {Δbc} {Δab} (cost labels 16, 15, 14, 8, 7, 6)
(4) Step 3: {Δabc} {Δca, Δc} {Δbc, Δb, Δ∅} {Δab, Δa} (cost labels 16, 15, 14)
(5) Step 4: {Δabc, Δab, Δa} {Δca, Δc} {Δbc, Δb, Δ∅} (cost labels 15, 14)
(6) Output: {{Δca, Δc}, {Δbc, Δb, Δ∅}, {Δabc, Δab, Δa}}

27 Analysis of the Algorithm
Lemma 1: Given a delta cube with $2^n$ delta cuboids, the proposed heuristic algorithm produces a refresh partition with exactly $\binom{n}{\lfloor n/2 \rfloor}$ refresh chains
Thus, we need to compute only $\binom{n}{\lfloor n/2 \rfloor}$ delta cuboids to refresh a data cube with $2^n$ cuboids

n    $2^n$   $\binom{n}{\lfloor n/2 \rfloor}$
2    4       2
3    8       3
…    …       …
8    256     70
9    512     126
10   1024    252
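The figures in this table follow directly from the binomial coefficient and can be reproduced with Python's standard `math.comb`; the helper name `delta_cuboids_needed` is only illustrative.

```python
from math import comb


def delta_cuboids_needed(n):
    """Number of refresh chains, and hence delta cuboids, stated by Lemma 1."""
    return comb(n, n // 2)


for n in (2, 3, 8, 9, 10):
    print(n, 2 ** n, delta_cuboids_needed(n))
```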

28 Analysis of the Algorithm (cont’d)
Lemma 2: Let $T_C$ be a delta cuboid computation plan found by the proposed heuristic. Then the following is true:
$\mathrm{Cost}(T_C) < \mathrm{Cost}(T_{G/2}) < \mathrm{Cost}(T_G)$
– $G$: a delta lattice with $2^n$ delta cuboids
– $G/2$: a subgraph of $G$ such that Level(0), Level(1), …, Level($\lfloor n/2 \rfloor$) are removed from $G$
– $T_G$: the minimum spanning tree of $G$
– $T_{G/2}$: the minimum spanning tree of $G/2$
Thus, the cost of $T_C$ is bounded by the cost of $T_{G/2}$

29 Performance Evaluation (1/3)
Data warehouse environment
– Oracle9i database
– Sun Blade 1000 with UltraSparc III CPU and 512MB RAM
– TPC-H benchmark schema and data
Data cubes used in the experiments
– Defined over lineitem table in the TPC-H schema

Data Cube   Dimension Attributes
Q1          l_orderkey, l_partkey, l_suppkey
Q2          l_orderkey, l_partkey, l_suppkey, l_shipdate
Q3          l_orderkey, l_partkey, l_suppkey, l_shipdate, l_receiptdate
Measure: l_quantity

30 Performance Evaluation (2/3)
By varying the size of changes
[Figure: three panels, (a) Q1, (b) Q2, (c) Q3.]

31 Performance Evaluation (3/3)
The number of tuples generated in the experiment
[Figure: three panels, (a) Q1, (b) Q2, (c) Q3.]

32 Summary
We proposed an efficient incremental maintenance method for data cubes
The proposed method can refresh a data cube with $2^n$ cuboids using only $\binom{n}{\lfloor n/2 \rfloor}$ delta cuboids
– The cost of computing delta cuboids can be substantially reduced
We formulated the problem and developed a heuristic algorithm for this problem
We showed the efficiency of the proposed method through performance evaluation

33 References
[GBP96] J. Gray, A. Bosworth, A. Layman, and H. Pirahesh. Data Cube: A Relational Aggregation Operator Generalizing Group-By, Cross-Tab, and Sub-Totals. In Proceedings of the ICDE Conference, pp. 152-159, 1996.
[HRU96] V. Harinarayan, A. Rajaraman, and J. D. Ullman. Implementing Data Cubes Efficiently. In Proceedings of the ACM SIGMOD Conference, pp. 205-216, 1996.
[G93] G. Graefe. Query Evaluation Techniques for Large Databases. ACM Computing Surveys, 25(2): 73-169, 1993.
[FG82] L. R. Foulds and R. L. Graham. The Steiner Problem in Phylogeny is NP-Complete. Advances in Applied Mathematics, 3: 43-49, 1982.
