1 Vicky :: Cao Hui Ping Sherman :: Chow Sze Ming CTH :: Chong Tsz Ho Ronald :: Woo Lok Yan Ken :: Yiu Man Lung Implementing Data Cubes Efficiently.

Presentation transcript:

1 Vicky :: Cao Hui Ping Sherman :: Chow Sze Ming CTH :: Chong Tsz Ho Ronald :: Woo Lok Yan Ken :: Yiu Man Lung Implementing Data Cubes Efficiently

2 Content Background Introduction of Datacube Problem defined Lattice model Greedy algorithm  How to do?  How good?  How bad ? Evaluations Conclusion

3 Background DSS (Decision Support System)  Gain competitiveness for business Data warehouse  Maintain historical information  Use “ Data cube ” to summarize results  Identify trends  Performance issue (time and space)  Need to reuse result (materialization of views)

4 Introduction of datacube Datacube  Dimensionality (number of GROUP-BYs)  Aggregated data: Values in each cell  Dimension of datacube  Detail of summary  Higher Dimension  Higher detail Common operations  Drill down:Look in more detail  Roll up:Look in less detail

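5 What is a data cube? [Figure: a three-dimensional sales cube with dimensions Date (1Qtr, 2Qtr, 3Qtr, 4Qtr, sum), Product (TV, VCR, PC, sum) and Country (U.S.A, Canada, Mexico, sum); the highlighted cell is the total annual sales of TV in U.S.A.]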

6 Our problem Physically materialize the whole data cube  Best query response  Heavy pre-computing, large storage space  i.e. Time efficient but space inefficient Materialize nothing  Worst query response  Dynamic query evaluation, less storage space  i.e. Space efficient but time inefficient

7 Problem on materialized views Materialize only part of the data cube  Balance the storage space and response  What is the best subset to materialize?  Addressed in this paper Table (Source | Size | Time (sec) | Ratio): From cell itself | 1 | 2.07 | N/A; View (s) | 10,…; View (p,s) | 800,…; View (p,s,c) | 6,000,… [remaining digits and the Time and Ratio values for the last three rows did not survive extraction]

8 Data? View? We use a data cube to model aggregate data. So what do we use to model views? A lattice!

9 Example of lattice diagram 8 possible groupings on the dimensions  p for Part  s for Supplier  c for Customer  # of rows of data shown next to each grouping Lattice levels, top to bottom: psc 6M; pc 6M, ps 0.8M, sc 6M; p 0.2M, s 0.01M, c 0.1M; none 1 An example of a Regular Lattice

10 ≼ operator Suppose c ≼ d  The view d can be used to derive the view c  d is an ancestor of c in the lattice diagram Imposes a partial order on the views Usage on dimensions  (part) ≼ (part,customer)  (part) ⋠ (customer) Usage within attribute values  (year) ≼ (quarter) ≼ (month) ≼ (day)  (year) ≼ (quarter) ≼ (week) ≼ (day) [Lattice diagram with nodes day, week, month, quarter, year] An example of an Irregular Lattice
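
The eight groupings and the ≼ test on dimensions can be sketched in a few lines of Python. This is only an illustration: representing a view as a frozenset of dimension letters, and the names `all_views` and `derivable_from`, are assumptions made here and do not come from the slides.

```python
from itertools import combinations

DIMENSIONS = ("p", "s", "c")  # part, supplier, customer

def all_views(dims):
    """Enumerate every grouping (every subset of the dimensions): 2^n views."""
    return [frozenset(combo)
            for size in range(len(dims) + 1)
            for combo in combinations(dims, size)]

def derivable_from(c, d):
    """True when c ≼ d, i.e. view c can be computed from view d."""
    return c <= d  # subset test on the grouping attributes

views = all_views(DIMENSIONS)
part = frozenset("p")
customer = frozenset("c")
part_customer = frozenset("pc")

print(len(views))                           # number of groupings
print(derivable_from(part, part_customer))  # (part) ≼ (part,customer)
print(derivable_from(part, customer))       # (part) vs (customer)
```

The subset test only covers the regular case; an attribute hierarchy such as day/week/month would need an explicit edge list instead.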

11 Regular lattices with equal domain size Grouping attributes: A_1, A_2, …, A_n (domain size: r) Attribute for aggregation: B Efficient algorithm  m: # of rows in the top view  k = ⌈log_r m⌉ Table (Strategy | k, j, and n | Space | Time), rows as extracted with cell boundaries lost: Space-optimal M m2 n / Time-optimal k>j (2r r/(r+1) ) n / k<j and k ≤ n/2 m m2 n / k n/2 m n C j r j

12 The problem The previous technique cannot be applied to irregular lattices Irregular lattices are common in data warehouses Choosing the optimal views for an irregular lattice is an NP-complete problem (inefficient!) Use a Greedy Algorithm i.e. use heuristics to obtain an approximate solution

13 Greedy algorithm Being as greedy as possible in each step!! Simple example: Use the smallest number of coins to pay 50 cents Suppose we have many coins of 20 cents, 10 cents and 5 cents.

14 How to be greedy? Common sense approach:  Select the largest coin: 20 cents  Select the largest coin again: 20 cents  Remaining amount = 50 – 20 – 20 = 10 cents  We cannot select the largest coin again.  We choose the 2 nd largest coin 10 cents instead. Only 3 coins are needed! Optimal solution!
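
The coin example can be written out as a short Python sketch. The function name `greedy_coins` is made up for this illustration; the amounts and denominations are the ones on the slide.

```python
def greedy_coins(amount, denominations):
    """Pay `amount` by always taking the largest coin that still fits."""
    chosen = []
    for coin in sorted(denominations, reverse=True):
        while amount >= coin:
            chosen.append(coin)
            amount -= coin
    if amount != 0:
        # The greedy pass got stuck; it does not search for alternatives.
        raise ValueError("greedy choice could not pay the exact amount")
    return chosen

coins = greedy_coins(50, [20, 10, 5])
print(coins)  # the coins picked, largest first
```

With these denominations the greedy choice happens to be optimal. That is not true for every coin system: with hypothetical coins of 4, 3 and 1, paying 6 greedily takes three coins (4, 1, 1) although two (3, 3) would do.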

15 Definition of “ benefit of view ” C(v) denotes the cost of view v B(v,S) denotes the benefit of a view v relative to a set of views S For each w ≼ v  Let u be the view of least cost in S such that w ≼ u  B_w = max{ C(u) – C(v), 0 } B(v,S) = ∑_{w ≼ v} B_w
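
The definition translates almost directly into code. The sketch below makes two assumptions that are not stated on this slide: the cost C(v) of a view is taken to be its row count, and views are written as strings of dimension letters (with the empty string standing for "none") so that ≼ becomes a subset test. It uses a four-view chain taken from the row counts of the earlier lattice slide.

```python
def derivable_from(w, v):
    """w ≼ v for groupings written as strings of dimension letters."""
    return set(w) <= set(v)

def benefit(v, selected, cost):
    """B(v, S): total saving over all views w ≼ v if v is materialized."""
    total = 0
    for w in cost:
        if not derivable_from(w, v):
            continue  # v cannot be used to answer w
        # u: the cheapest already-selected view that can answer w.
        # The top view must be in `selected`, otherwise this min() is empty.
        cheapest = min(cost[u] for u in selected if derivable_from(w, u))
        total += max(cheapest - cost[v], 0)
    return total

# Chain "" ≼ "p" ≼ "ps" ≼ "psc", costs assumed equal to row counts.
cost = {"psc": 6_000_000, "ps": 800_000, "p": 200_000, "": 1}

b_ps = benefit("ps", {"psc"}, cost)
print(b_ps)  # saving for ps, p and none, each 6M - 0.8M
```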

16 Greedy algorithm In each step  Select the view with the most benefit  Add it to the result Algorithm S={top view}; for i=1 to k { select view v not in S such that B(v,S) is maximized S = S union {v} } return S;
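
A runnable version of this loop, applied to the part/supplier/customer lattice from the earlier slide, might look as follows. It is a sketch under assumptions: cost is taken to be the row count shown in that lattice, views are strings of dimension letters with "" for "none", and the function names are invented for the example.

```python
def derivable_from(w, v):
    """w ≼ v for groupings written as strings of dimension letters."""
    return set(w) <= set(v)

def benefit(v, selected, cost):
    """B(v, S) as defined on the previous slide."""
    total = 0
    for w in cost:
        if derivable_from(w, v):
            cheapest = min(cost[u] for u in selected if derivable_from(w, u))
            total += max(cheapest - cost[v], 0)
    return total

def greedy_select(cost, top, k):
    """Start from the top view, then add the k views of largest benefit."""
    selected = [top]
    for _ in range(k):
        candidates = [v for v in cost if v not in selected]
        best = max(candidates, key=lambda v: benefit(v, selected, cost))
        selected.append(best)
    return selected

# Row counts from the lattice slide; "" stands for the "none" grouping.
cost = {"psc": 6_000_000, "pc": 6_000_000, "ps": 800_000, "sc": 6_000_000,
        "p": 200_000, "s": 10_000, "c": 100_000, "": 1}

chosen = greedy_select(cost, top="psc", k=2)
print(chosen)
```

Under these assumptions pc and sc have zero benefit, because they are as large as the top view, which is why they are never picked.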

17 Selecting the first view After selecting coins, let us go back to our problem: selecting views. We must materialize the top view  i.e. the view grouping by all attributes  Cannot be constructed from other views  Avoid going to the raw data

18 Selecting k more views Space is limited! Suppose we can only select k more views. For each view which is not yet selected, calculate the benefit of materializing it. Pick the one with maximum benefit!!! Let’s set k = 2 for the examples.

19 Example [Lattice diagram: views a, b, c, d, e, f, g, h] E.g. the cost of constructing view b given the view a is 100. If we choose to materialize b, the new cost of constructing view b is 50.

20 First round [Lattice diagram: views a, b, c, d, e, f, g, h] Notice that not only b, but also d, e, g and h can be calculated from b. So the total benefit is (100 – 50) x 5 = 250

21 Continue… Similarly, the benefit of materializing c is (100 – 75) x 5 = 125 [Lattice diagram: views a, b, c, d, e, f, g, h] Benefit: b 250, c 125

22 Not yet finished… For e, Benefit = (100 – 30) x 3 = 210 [Lattice diagram: views a, b, c, d, e, f, g, h] Benefit: b 250, c 125, e 210

23 Let’s choose b! [Lattice diagram: views a, b, c, d, e, f, g, h] For d and f, Benefit = (100 – 20) x 2 = 160 and (100 – 40) x 2 = 120 respectively. Benefit: b 250, c 125, d 160, e 210, f 120

24 Next round? Seems we should choose e, as it has the second largest benefit. Let’s see what will happen in the second round. Benefit: b 250, c 125, d 160, e 210, f 120

25 Second round! [Lattice diagram: views a, b, c, d, e, f, g, h] Now, only c and f benefit if we materialize c (since e, g and h can be calculated more efficiently by using b). Benefit = (100 – 75) x 2 = 50 Benefit: c 50

26 How about choosing f? [Lattice diagram: views a, b, c, d, e, f, g, h] If we choose f, we find that h can be calculated more cheaply from f than from b. Benefit = (100 – 40) + (50 – 40) = 70 Benefit: c 50, f 70

27 Easy to work out others Benefit of d = (50 – 20) x 2 = 60 Benefit of e = (50 – 30) x 3 = 60 Benefit of g = 50 – 1 = 49 Benefit of h = 50 – 10 = 40 [Lattice diagram: views a, b, c, d, e, f, g, h]

28 Observation In the first round, the benefit of choosing f (only 120) is far from the best choice (250). But in the second round, choosing f gives the maximum benefit! 1st round benefit: b 250, c 125, d 160, e 210, f 120. 2nd round benefit: c 50, d 60, e 60, f 70, g 49, h 40.
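
The two rounds can be recomputed with a short Python sketch. The costs (a 100, b 50, c 75, d 20, e 30, f 40, g 1, h 10) are the ones used in the benefit calculations on the slides. The edges of the lattice are an assumption: the diagram is not part of the transcript, so the `CHILDREN` table below was chosen so that the number of views helped by each candidate matches the slides.

```python
# Assumed parent -> children edges (the diagram is missing from the transcript).
CHILDREN = {"a": ["b", "c"], "b": ["d", "e"], "c": ["e", "f"],
            "d": ["g"], "e": ["g", "h"], "f": ["h"], "g": [], "h": []}
COST = {"a": 100, "b": 50, "c": 75, "d": 20, "e": 30, "f": 40, "g": 1, "h": 10}

def descendants(v):
    """All views w with w ≼ v, including v itself."""
    seen = {v}
    stack = [v]
    while stack:
        for child in CHILDREN[stack.pop()]:
            if child not in seen:
                seen.add(child)
                stack.append(child)
    return seen

def benefit(v, selected):
    """B(v, S) for the views already in `selected`."""
    total = 0
    for w in descendants(v):
        cheapest = min(COST[u] for u in selected if w in descendants(u))
        total += max(cheapest - COST[v], 0)
    return total

round1 = {v: benefit(v, {"a"}) for v in "bcdefgh"}
round2 = {v: benefit(v, {"a", "b"}) for v in "cdefgh"}
print(round1)
print(round2)
```

Under the assumed edges, b wins the first round and f wins the second, in line with the tables above.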

29 Simple? Optimal? Trade off again! This simple algorithm is not optimal in all cases! Consider the following case…

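30 Bad example [Lattice diagram: views a, b, c, d; node counts and totals lost in extraction]

31 Bad example [Lattice diagram: views a, b, c, d; node counts and totals lost in extraction] Choose c Benefit = (200 – 99) x ( ) = 4141 = maximum

32 Bad example [Lattice diagram: views a, b, c, d; node counts and totals lost in extraction] Now choose either one of b and d (same benefit)

33 Bad example [Lattice diagram: views a, b, c, d; node counts and totals lost in extraction] How about these? Very expensive!!!

34 Optimal solution should be… [Lattice diagram: views a, b, c, d; node counts and totals lost in extraction] Only c is a little bit expensive.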

35 Some theoretical result It can be proved that the greedy algorithm obtains at least (e – 1) / e (about 63%) of the benefit of the optimal selection.
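
The 63% figure is the limiting value of a bound that depends on k. The sketch below assumes the finite-k form 1 – ((k – 1)/k)^k, which is the form usually given for this kind of greedy argument; the slide itself only states the limit, so treat the per-k values as illustrative.

```python
import math

def greedy_guarantee(k):
    """Assumed lower bound on greedy benefit / optimal benefit for k views."""
    return 1 - ((k - 1) / k) ** k

limit = (math.e - 1) / math.e  # about 0.632

for k in (1, 2, 5, 10, 100):
    print(k, round(greedy_guarantee(k), 3))
print("limit", round(limit, 3))
```

The bound decreases towards (e – 1)/e as k grows, so 63% is the worst case over all k.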

36 Extensions (1) Problem  The views in a lattice are unlikely to have the same probability of being requested in a query. Solution:  We can weight each benefit by its probability.
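
One way to read this extension is to multiply each term of the benefit sum by the probability that the corresponding view is queried. The sketch below does that on a four-view chain; the probabilities are hypothetical, cost is assumed to equal row count, and the function names are invented for the example.

```python
def derivable_from(w, v):
    """w ≼ v for groupings written as strings of dimension letters."""
    return set(w) <= set(v)

def weighted_benefit(v, selected, cost, prob):
    """Benefit of v where each helped view w counts with weight prob[w]."""
    total = 0.0
    for w in cost:
        if derivable_from(w, v):
            cheapest = min(cost[u] for u in selected if derivable_from(w, u))
            total += prob[w] * max(cheapest - cost[v], 0)
    return total

cost = {"psc": 6_000_000, "ps": 800_000, "p": 200_000, "": 1}
# Hypothetical query mix: most queries group by part only.
prob = {"psc": 0.0, "ps": 0.05, "p": 0.9, "": 0.05}
uniform = {v: 1.0 for v in cost}

for name, weights in (("unweighted", uniform), ("weighted", prob)):
    scores = {v: weighted_benefit(v, ["psc"], cost, weights) for v in ("ps", "p")}
    print(name, max(scores, key=scores.get))
```

With equal weights ps has the larger benefit; with this skewed query mix the smaller view p overtakes it.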

37 Extensions (2) Problem  Instead of asking for some fixed number (k) of views to materialize, we might instead allocate a fixed amount of space to views. Solution  We can consider the “benefit of each view per unit space”.
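
A space-bounded variant can be sketched by ranking candidates by benefit divided by size and stopping when nothing more fits. The assumptions here are loud ones: the space taken by a view is assumed to equal its row count (the same number used as its cost), the top view is not charged against the budget, and the budget of 1,000,000 rows is made up for the example.

```python
def derivable_from(w, v):
    """w ≼ v for groupings written as strings of dimension letters."""
    return set(w) <= set(v)

def benefit(v, selected, cost):
    total = 0
    for w in cost:
        if derivable_from(w, v):
            cheapest = min(cost[u] for u in selected if derivable_from(w, u))
            total += max(cheapest - cost[v], 0)
    return total

def greedy_by_space(cost, top, budget):
    """Add views by benefit per unit space until nothing useful fits."""
    selected, used = [top], 0
    while True:
        fits = [v for v in cost
                if v not in selected and used + cost[v] <= budget]
        scored = [(benefit(v, selected, cost) / cost[v], v) for v in fits]
        scored = [(ratio, v) for ratio, v in scored if ratio > 0]
        if not scored:
            return selected
        _, best = max(scored)
        selected.append(best)
        used += cost[best]

# Row counts from the lattice slide; "" stands for the "none" grouping.
cost = {"psc": 6_000_000, "pc": 6_000_000, "ps": 800_000, "sc": 6_000_000,
        "p": 200_000, "s": 10_000, "c": 100_000, "": 1}

picked = greedy_by_space(cost, top="psc", budget=1_000_000)
print(picked)
```

Ranking by benefit per unit space favours small views, so under this budget the small groupings are taken first and ps no longer fits once they are in.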

38 Conclusions Materialization of views is an essential query optimization strategy for decision- support applications. Reason to materialize some part of the data cube but not all of the cube. A lattice framework that models multidimensional analysis very well.

39 Conclusions (cont.) Finding the optimal solution in the case of an irregular lattice is NP-hard. Introduction of the greedy algorithm The greedy algorithm works on this lattice and picks nearly the right views to materialize.

40 Conclusions (the end) There exist cases in which the greedy algorithm fails to produce the optimal solution. But the greedy algorithm has guaranteed performance Extensions of the greedy algorithm.

41 Reference Venky Harinarayan, Anand Rajaraman, Jeffrey D. Ullman. Implementing Data Cubes Efficiently. SIGMOD’96:

42 Thank you~ Q & A Section