
1 Implementing Data Cubes Efficiently
Vicky :: Cao Hui Ping, Sherman :: Chow Sze Ming, CTH :: Chong Tsz Ho, Ronald :: Woo Lok Yan, Ken :: Yiu Man Lung


2 2 Content
Background
Introduction of the data cube
Problem definition
Lattice model
Greedy algorithm
- How does it work?
- How good is it?
- How bad can it be?
Evaluations
Conclusion

3 3 Background
DSS (Decision Support System)
- Helps a business gain competitiveness
Data warehouse
- Maintains historical information
- Uses a "data cube" to summarize results
- Identifies trends
- Performance is an issue (time and space)
- Results need to be reused (materialization of views)

4 4 Introduction of the data cube
Data cube
- Dimensionality (number of GROUP-BYs)
- Aggregated data: values in each cell
Dimension of the data cube
- Determines the level of detail of the summary
- Higher dimension → higher detail
Common operations
- Drill down: look in more detail
- Roll up: look in less detail

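5 5 What is a data cube?
[Figure: a three-dimensional sales cube with axes Date (1Qtr, 2Qtr, 3Qtr, 4Qtr), Product (TV, VCR, PC) and Country (U.S.A., Canada, Mexico), plus "sum" cells. The highlighted cell is the total annual sales of TV in the U.S.A.]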

6 6 Our problem
Physically materialize the whole data cube
- Best query response
- Heavy pre-computation, large storage space
- i.e. time efficient but space inefficient
Materialize nothing
- Worst query response
- Dynamic query evaluation, less storage space
- i.e. space efficient but time inefficient

7 7 Problem on materialized views
Materialize only part of the data cube
- Balance storage space against response time
- Which subset is best to materialize?
- Addressed in this paper

Source            Size        Time (sec)  Ratio
From cell itself  1           2.07        N/A
View (s)          10,000      2.38        0.000031
View (p,s)        800,000     20.77       0.000023
View (p,s,c)      6,000,000   226.23      0.000037

8 8 Data? View?
We use a data cube to model aggregate data.
So what do we use to model views? A lattice!

9 9 Example of lattice diagram
8 possible groupings on the dimensions
- p for Part
- s for Supplier
- c for Customer
- Number of rows of data shown next to each grouping

psc 6M
pc 6M    ps 0.8M    sc 6M
p 0.2M   s 0.01M    c 0.1M
none 1

An example of a regular lattice
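The eight groupings are simply all subsets of the three dimensions. A minimal sketch of how they could be enumerated is given below; the function name `groupings` and the `ROWS` dictionary are illustrative choices, and the row counts are the ones shown on the slide.

```python
from itertools import combinations

DIMENSIONS = "psc"  # p = Part, s = Supplier, c = Customer

# Row counts copied from the slide (M = million); "none" is the empty grouping.
ROWS = {
    "psc": 6_000_000,
    "pc": 6_000_000, "ps": 800_000, "sc": 6_000_000,
    "p": 200_000, "s": 10_000, "c": 100_000,
    "none": 1,
}

def groupings(dims):
    """Return every subset of dims, largest first, as strings such as 'ps'."""
    result = []
    for size in range(len(dims), -1, -1):
        for combo in combinations(dims, size):
            result.append("".join(combo) or "none")
    return result

views = groupings(DIMENSIONS)
for view in views:
    print(view, ROWS[view])
```

With n dimensions this yields 2^n views, which is why materializing every view quickly becomes expensive.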

10 10 ≼ operator
Suppose c ≼ d
- The view d can be used to derive the view c
- d is an ancestor of c in the lattice diagram
Imposes a partial order on the views
Usage on dimensions
- (part) ≼ (part, customer)
- (part) ⋠ (customer)
Usage within attribute values
- (year) ≼ (quarter) ≼ (month) ≼ (day)
- (year) ≼ (quarter) ≼ (week) ≼ (day)
[Figure: lattice over day, week, month, quarter and year. An example of an irregular lattice]
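For views defined purely by their grouping dimensions, the ≼ relation amounts to a subset test. The sketch below illustrates only that case (not attribute hierarchies such as day/month/year); the function name `precedes` is a hypothetical choice.

```python
def precedes(c, d):
    """True if view c ≼ view d, i.e. c's grouping attributes are a subset of d's."""
    return frozenset(c) <= frozenset(d)

part = {"part"}
customer = {"customer"}
part_customer = {"part", "customer"}

print(precedes(part, part_customer))   # (part) ≼ (part, customer)
print(precedes(part, customer))        # (part) ⋠ (customer)
print(precedes(customer, part))        # neither direction holds: a partial order
```

Because some pairs are incomparable in both directions, the relation is a partial order rather than a total one.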

11 11 Regular lattices with equal domain size
Grouping attributes: A_1, A_2, …, A_n (each with domain size r)
Attribute for aggregation: B
Efficient algorithm
- m: number of rows in the top view
- k = ⌈log_r m⌉
Strategyk, j, and nSpaceTime Space-optimalMm2 n Time-optimalk>j(2r r/(r+1) ) n k<j and k ≤ n/2mm2 n k n/2m n C j r j

12 12 The problem
The previous technique cannot be applied to irregular lattices
Irregular lattices are common in data warehouses
Optimizing the choice of views for an irregular lattice is an NP-complete problem (inefficient!)
Use a greedy algorithm, i.e. use heuristics to obtain an approximate solution

13 13 Greedy algorithm
Be as greedy as possible in each step!
Simple example: use the smallest number of coins to pay 50 cents.
Suppose we have many coins of 20 cents, 10 cents and 5 cents.

14 14 How to be greedy?
Common sense approach:
- Select the largest coin: 20 cents
- Select the largest coin again: 20 cents
- Remaining amount = 50 – 20 – 20 = 10 cents
- We cannot select the largest coin again.
- We choose the 2nd largest coin, 10 cents, instead.
Only 3 coins are needed! Optimal solution!
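The coin steps above can be sketched in a few lines. The function name `greedy_change` is an illustrative choice, and the sketch assumes an unlimited supply of each coin, as the slide does.

```python
def greedy_change(amount, coins):
    """Pay `amount` by always taking the largest coin that still fits."""
    picked = []
    for coin in sorted(coins, reverse=True):
        while amount >= coin:
            picked.append(coin)
            amount -= coin
    if amount != 0:
        raise ValueError("amount cannot be paid exactly with these coins")
    return picked

print(greedy_change(50, [20, 10, 5]))  # [20, 20, 10]
```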

15 15 Definition of "benefit of a view"
C(v) denotes the cost of view v
B(v, S) denotes the benefit of view v relative to a set of views S
For each w ≼ v
- Let u be the view of least cost in S such that w ≼ u
- B_w = max{C(u) – C(v), 0}
B(v, S) = ∑_{w ≼ v} B_w
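The definition translates almost directly into code. The sketch below is one possible reading, run on a hypothetical three-view chain (top, mid, low) that is not from the slides; the names `benefit`, `cost` and `below` are illustrative, and S is assumed to already contain the top view so that every w has some u to be derived from.

```python
def benefit(v, selected, cost, below):
    """B(v, S): total saving over all views w ≼ v if v is materialized.

    cost[x]  -- cost of answering a query from view x
    below[x] -- set of views w with w ≼ x, including x itself
    """
    total = 0
    for w in below[v]:
        # u: the cheapest already-selected view from which w can be derived
        cheapest = min(cost[u] for u in selected if w in below[u])
        total += max(cheapest - cost[v], 0)
    return total

# Hypothetical chain: top -> mid -> low
cost = {"top": 100, "mid": 50, "low": 10}
below = {"top": {"top", "mid", "low"}, "mid": {"mid", "low"}, "low": {"low"}}

print(benefit("mid", {"top"}, cost, below))          # (100 - 50) x 2 = 100
print(benefit("low", {"top", "mid"}, cost, below))   # 50 - 10 = 40
```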

16 16 Greedy algorithm
In each step
- Select the view with the most benefit
- Add it to the result
Algorithm
  S = {top view};
  for i = 1 to k {
    select view v not in S such that B(v, S) is maximized;
    S = S union {v};
  }
  return S;
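A runnable sketch of this loop follows. It is an illustration under stated assumptions rather than the authors' implementation: the lattice is a hypothetical diamond (top above left and right, both above bottom), and the names `greedy_select`, `cost` and `below` are chosen here.

```python
def greedy_select(top, k, cost, below):
    """Pick k views in addition to `top`, maximizing B(v, S) at each step."""
    def benefit(v, selected):
        total = 0
        for w in below[v]:
            cheapest = min(cost[u] for u in selected if w in below[u])
            total += max(cheapest - cost[v], 0)
        return total

    selected = [top]
    for _ in range(k):
        candidates = [v for v in cost if v not in selected]
        if not candidates:
            break
        best = max(candidates, key=lambda v: benefit(v, selected))
        selected.append(best)
    return selected

# Hypothetical diamond lattice, not taken from the slides.
cost = {"top": 100, "left": 60, "right": 30, "bottom": 5}
below = {
    "top": {"top", "left", "right", "bottom"},
    "left": {"left", "bottom"},
    "right": {"right", "bottom"},
    "bottom": {"bottom"},
}

print(greedy_select("top", 2, cost, below))  # ['top', 'right', 'left']
```

In the first step `right` wins with benefit (100 - 30) x 2 = 140; in the second step `left` (40) beats `bottom` (25), because `bottom` can already be answered cheaply from `right`.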

17 17 Selecting the first view
After selecting coins, let us return to our problem: selecting views.
We must materialize the top view
- i.e. the view grouping by all attributes
- It cannot be constructed from other views
- Materializing it avoids going back to the raw data

18 18 Selecting k more views
Space is limited! Suppose we can only select k more views.
For each view that is not yet selected, calculate the benefit of materializing it.
Pick the one with maximum benefit!
Let's set k = 2 for the examples.

19 19 Example
[Figure: lattice of views a to h with costs a 100, b 50, c 75, d 20, e 30, f 40, g 1, h 10]
E.g. the cost of constructing view b given view a is 100.
If we choose to materialize b, the new cost of constructing view b is 50.

20 20 First round
[Figure: lattice of views a to h with costs a 100, b 50, c 75, d 20, e 30, f 40, g 1, h 10]
Notice that not only b, but also d, e, g and h can be calculated from b.
So the total benefit is (100 – 50) x 5 = 250.

21 21 Continue…
Similarly, the benefit of materializing c is (100 – 75) x 5 = 125.
[Figure: the same lattice of views a to h]

View  Benefit
b     250
c     125

22 22 Not finished yet…
For e, benefit = (100 – 30) x 3 = 210.
[Figure: the same lattice of views a to h]

View  Benefit
b     250
c     125
e     210

23 23 Let's choose b!
[Figure: the same lattice of views a to h]
For d and f, benefit = (100 – 20) x 2 = 160 and (100 – 40) x 2 = 120 respectively.

View  Benefit
b     250
c     125
d     160
e     210
f     120

24 24 Next round?
It seems we should choose e, as it has the second largest benefit.
Let's see what happens in the second round.

View  Benefit
b     250
c     125
d     160
e     210
f     120

25 25 Second round!
[Figure: the same lattice of views a to h, with b now materialized]
Now, only c and f benefit if we materialize c (since e, g and h can be calculated more efficiently from b).
Benefit = (100 – 75) x 2 = 50

View  Benefit
c     50

26 26 How about choosing f?
[Figure: the same lattice of views a to h, with b now materialized]
If we choose f, we find that h can be calculated more efficiently from f than from b.
Benefit = (100 – 40) + (50 – 40) = 70

View  Benefit
c     50
f     70

27 27 Easy to work out the others
Benefit of d = (50 – 20) x 2 = 60
Benefit of e = (50 – 30) x 3 = 60
Benefit of g = 50 – 1 = 49
Benefit of h = 50 – 10 = 40
[Figure: the same lattice of views a to h]

28 28 Observation
In the first round, the benefit of choosing f (only 120) is far from the best choice (250).
But in the second round, choosing f gives the maximum benefit!

1st round  Benefit
b          250
c          125
d          160
e          210
f          120

2nd round  Benefit
c          50
d          60
e          60
f          70
g          49
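Both benefit tables can be recomputed mechanically. In the sketch below the costs are the ones on the slides, but the edges in `CHILDREN` are an assumption: they are inferred from the multipliers used in the slides (for example, b covers b, d, e, g and h) because the lattice drawing itself is not available here. All function names are illustrative.

```python
COST = {"a": 100, "b": 50, "c": 75, "d": 20, "e": 30, "f": 40, "g": 1, "h": 10}

# Assumed edges (parent -> children), inferred from the benefit counts on the slides.
CHILDREN = {"a": "bc", "b": "de", "c": "ef", "d": "g",
            "e": "gh", "f": "h", "g": "", "h": ""}

def below(v):
    """All views derivable from v, including v itself."""
    found = {v}
    for child in CHILDREN[v]:
        found |= below(child)
    return found

def benefit(v, selected):
    total = 0
    for w in below(v):
        cheapest = min(COST[u] for u in selected if w in below(u))
        total += max(cheapest - COST[v], 0)
    return total

def benefit_table(selected):
    """Benefit of every view not yet selected."""
    return {v: benefit(v, selected) for v in COST if v not in selected}

round1 = benefit_table({"a"})       # only the top view is materialized
round2 = benefit_table({"a", "b"})  # after b has been chosen

print(round1)
print(round2)
print(max(round2, key=round2.get))  # view chosen in the second round
```

Under these assumed edges the first table reproduces b 250, c 125, d 160, e 210, f 120, and the second gives f the largest benefit (70), matching the slides.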

29 29 Simple? Optimal? Trade off again! This simple algorithm is not optimal in all cases! Consider the following case…

30 30 Bad example
[Figure: top view a (cost 200) above views b (100), c (99) and d (100); below them are groups of 20 nodes, each group with total cost 1000]

31 31 Bad example
[Figure: the same lattice]
Choose c.
Benefit = (200 – 99) x (1 + 20 + 20) = 4141 = maximum

32 32 Bad example
[Figure: the same lattice]
Now choose either one of b and d (same benefit).

33 33 Bad example
[Figure: the same lattice]
How about these? Very expensive!!!

34 34 Optimal solution should be…
[Figure: the same lattice]
Only c is a little bit expensive.
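The gap between the greedy and the optimal choice can be checked numerically for k = 2. The layout below is an assumption, since the figure did not survive: four groups of 20 leaf views costing 50 each (1000 per group), with b above groups 0 and 1, c above groups 1 and 2, and d above groups 2 and 3. This is consistent with the (1 + 20 + 20) factor on the slide; all names are illustrative.

```python
from itertools import combinations

cost = {"a": 200, "b": 100, "c": 99, "d": 100}
below = {"b": {"b"}, "c": {"c"}, "d": {"d"}}

# Assumed layout: which groups of 20 leaves each middle view can answer.
covers = {"b": (0, 1), "c": (1, 2), "d": (2, 3)}
for g in range(4):
    for i in range(20):
        leaf = f"g{g}_{i}"
        cost[leaf] = 50          # 20 leaves x 50 = 1000 per group
        below[leaf] = {leaf}
        for parent, groups in covers.items():
            if g in groups:
                below[parent].add(leaf)
below["a"] = set(cost)           # everything is derivable from the top view

def total_benefit(extra):
    """Total saving, compared with materializing only a, of adding `extra`."""
    selected = ["a", *extra]
    saving = 0
    for w in cost:
        cheapest = min(cost[u] for u in selected if w in below[u])
        saving += cost["a"] - cheapest
    return saving

candidates = [v for v in cost if v != "a"]

# Greedy: best single view, then the best view to add to it.
first = max(candidates, key=lambda v: total_benefit([v]))
second = max((v for v in candidates if v != first),
             key=lambda v: total_benefit([first, v]))
greedy_total = total_benefit([first, second])

# Optimal: brute force over all pairs.
best_pair = max(combinations(candidates, 2), key=total_benefit)
optimal_total = total_benefit(best_pair)

print(first, second, greedy_total)
print(best_pair, optimal_total)
print(greedy_total / optimal_total)
```

Under this layout the greedy choice starts with c (benefit 4141, as on the slide) and ends with a total of 6241, while materializing b and d instead yields 8200, so the greedy result is roughly three quarters of the optimum.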

35 35 Some theoretical results
It can be proved that the greedy algorithm achieves at least (e – 1) / e (about 63%) of the benefit of the optimal selection.
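The guarantee can be written out as an inequality. The symbols B_greedy and B_opt are names chosen here for the total benefit of the greedy and the optimal selection, and k is the number of views selected in addition to the top view; the k-dependent form is the one given in the referenced paper.

```latex
\[
\frac{B_{\text{greedy}}}{B_{\text{opt}}}
\;\ge\; 1 - \left(\frac{k-1}{k}\right)^{k}
\;\ge\; 1 - \frac{1}{e}
\;=\; \frac{e-1}{e} \approx 0.632
\]
```

For k = 2 the first bound evaluates to 1 - (1/2)^2 = 3/4, and it decreases toward (e - 1)/e as k grows.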

36 36 Extensions (1)
Problem
- The views in a lattice are unlikely to have the same probability of being requested in a query.
Solution
- We can weight each benefit by its probability.

37 37 Extensions (2)
Problem
- Instead of asking for some fixed number (k) of views to materialize, we might instead allocate a fixed amount of space to views.
Solution
- We can consider the "benefit of each view per unit space".
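One way to read "benefit per unit space" is sketched below. Several points are assumptions made for illustration: the space a view occupies is taken to equal its cost (its row count), the budget excludes the top view, views that no longer fit are simply skipped, and the diamond lattice is hypothetical. The function name `select_within_budget` is also chosen here.

```python
def select_within_budget(top, budget, size, below):
    """Greedily add views by benefit per unit space until none fits."""
    def benefit(v, selected):
        total = 0
        for w in below[v]:
            cheapest = min(size[u] for u in selected if w in below[u])
            total += max(cheapest - size[v], 0)
        return total

    selected = [top]
    remaining = budget
    while True:
        # Candidates must still fit in the budget and save something.
        fits = [v for v in size
                if v not in selected
                and size[v] <= remaining
                and benefit(v, selected) > 0]
        if not fits:
            return selected
        best = max(fits, key=lambda v: benefit(v, selected) / size[v])
        selected.append(best)
        remaining -= size[best]

# Hypothetical diamond lattice; sizes double as query costs.
size = {"top": 100, "left": 60, "right": 30, "bottom": 5}
below = {
    "top": {"top", "left", "right", "bottom"},
    "left": {"left", "bottom"},
    "right": {"right", "bottom"},
    "bottom": {"bottom"},
}

chosen = select_within_budget("top", 40, size, below)
print(chosen)  # ['top', 'bottom', 'right']
```

Here `bottom` is picked first because its ratio (95 / 5) is far higher than that of `right` (140 / 30), even though `right` has the larger absolute benefit.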

38 38 Conclusions
Materialization of views is an essential query optimization strategy for decision-support applications.
There is good reason to materialize part of the data cube but not all of it.
A lattice framework models multidimensional analysis very well.

39 39 Conclusions (cont.)
Finding the optimal solution for an irregular lattice is NP-hard.
A greedy algorithm was introduced.
The greedy algorithm works on this lattice and picks nearly the right views to materialize.

40 40 Conclusions (the end)
There exist cases in which the greedy algorithm fails to produce the optimal solution.
But the greedy algorithm has guaranteed performance.
Extensions of the greedy algorithm.

41 41 Reference
Venky Harinarayan, Anand Rajaraman, Jeffrey D. Ullman. Implementing Data Cubes Efficiently. SIGMOD '96, pp. 205–216.

42 42 Thank you~ Q & A Section

