1 Dr. Panagiotis Symeonidis Data Engineering Laboratory Data Warehouse implementation: Part B.


1 Dr. Panagiotis Symeonidis, Data Engineering Laboratory, http://delab.csd.auth.gr/~symeon
Data Warehouse Implementation: Part B

2 Cuboid Materialization as an Optimization Problem
Minimize: the average time taken to evaluate a view.
Constraint: materialize a fixed number k of views.
Greedy algorithm: at each step, the best choice is made given the views selected so far. It is not guaranteed to give the optimal solution.

3 Example of a lattice of views
[Diagram: lattice with psc at the top; pc, ps, sc in the middle; p, s, c at the bottom]
p: part, s: supp, c: cust

4 The lattice of views framework
If view V2 can be answered using the results of view V1, then V2 is a descendant of V1 and V1 is an ancestor of V2 (denoted V2 ≼ V1).
E.g. (part) ≼ (part, cust)

5 Some Definitions
k is the number of views to be materialized.
C(v) is the cost of view v.
S is the set of views already selected to be materialized.
Given a view v, the benefit of selecting v for materialization is B(v, S) = C(S) - C(S ∪ {v}).
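The slide does not spell out C(S). One reading that agrees with the worked examples in this deck (an assumption, not stated on the slide) is that every view w is answered from its cheapest materialized ancestor, and that S always contains the top view. Writing C_S(w) for that per-view cost, the benefit then reduces to a sum over the descendants of v:

```latex
C_S(w) = \min_{u \in S,\; w \preceq u} C(u),
\qquad
C(S) = \sum_{w} C_S(w),
\qquad
B(v, S) = C(S) - C(S \cup \{v\})
        = \sum_{w \preceq v} \max\bigl(C_S(w) - C(v),\, 0\bigr)
```

Only views w ≼ v can change cost when v is added, and each of them gets cheaper only if C(v) is below its current cost, which is where the max with 0 comes from.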

6 Greedy Algorithm
S ← {top view};
for i = 1 to k do
    select the view v not in S such that B(v, S) is maximized;
    S ← S ∪ {v};
return S

7-18 Data Cube: greedy selection example, k = 2
View sizes (rows): psc 6M; pc 6M, ps 0.8M, sc 6M; p 0.2M, s 0.01M, c 0.1M.

Benefit of each candidate view in M rows, written as (saving per view) x (number of views):

View   1st choice          2nd choice (after ps is selected)
pc     0 x 3 = 0           0 x 2 = 0
ps     5.2 x 3 = 15.6      (selected)
sc     0 x 3 = 0           0 x 2 = 0
p      5.8 x 1 = 5.8       0.6 x 1 = 0.6
s      5.99 x 1 = 5.99     0.79 x 1 = 0.79
c      5.9 x 1 = 5.9       5.9 x 1 = 5.9

1st choice, saving per view: pc 6M - 6M = 0; ps 6M - 0.8M = 5.2M; sc 6M - 6M = 0; p 6M - 0.2M = 5.8M; s 6M - 0.01M = 5.99M; c 6M - 0.1M = 5.9M.
2nd choice, saving per view: pc 6M - 6M = 0; sc 6M - 6M = 0; p 0.8M - 0.2M = 0.6M; s 0.8M - 0.01M = 0.79M; c 6M - 0.1M = 5.9M.

The two views to be materialized are 1. ps and 2. c, so V = {ps, c}.
Gain(V ∪ {top view}, {top view}) = 15.6 + 5.9 = 21.5
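The benefit figures of this example can be reproduced with a short script. This is a minimal sketch, not the presenter's code: the function names are made up for illustration, sizes are written in thousands of rows (6M becomes 6000) to keep the arithmetic in integers, and a view is assumed to be answerable from any view whose dimensions include its own.

```python
# View sizes from the slide, in thousands of rows (6M -> 6000).
SIZE = {"psc": 6000, "pc": 6000, "ps": 800, "sc": 6000,
        "p": 200, "s": 10, "c": 100}
TOP = "psc"


def descendants(view):
    """Views answerable from `view`, itself included (assumed: dimension subsets)."""
    return [w for w in SIZE if set(w) <= set(view)]


def cost_of(w, selected):
    """Cost of answering w from its cheapest materialized ancestor."""
    return min(SIZE[u] for u in selected if set(w) <= set(u))


def benefit(v, selected):
    """B(v, S): total saving over the descendants of v if v is materialized."""
    return sum(max(cost_of(w, selected) - SIZE[v], 0) for w in descendants(v))


def greedy(k):
    """Pick k views after the top view; return (view, benefit) per step."""
    selected, gains = [TOP], []
    for _ in range(k):
        candidates = [v for v in SIZE if v not in selected]
        best = max(candidates, key=lambda v: benefit(v, selected))
        gains.append((best, benefit(best, selected)))
        selected.append(best)
    return gains


gains = greedy(2)
print(gains)                       # [('ps', 15600), ('c', 5900)]
print(sum(g for _, g in gains))    # 21500, i.e. 21.5M
```

The first pick is ps because it lowers the cost of three views (ps, p, s) at once; after that, p and s are already cheap, so c gives the largest remaining saving.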

19 2nd Example of greedy algorithm
Initially, S = {a}; k = 4 (select 3 more).
[Diagram: lattice of views with costs a 100, b 50, c 75, d 20, e 30, f 40, g 1, h 10]

20 2nd Example of greedy algorithm: first choice
b: 50 x 5 = 250
c: 25 x 5 = 125
d: 80 x 2 = 160
e: 70 x 3 = 210
f: 60 x 2 = 120
g: 99 x 1 = 99
h: 90 x 1 = 90
b has the largest benefit and is selected.

21 2nd Example of greedy algorithm: second choice
c: 25 x 2 = 50
d: 30 x 2 = 60
e: 20 x 3 = 60
f: (100 - 40) x 1 + (50 - 40) x 1 = 60 + 10 = 70
g: 49 x 1 = 49
h: 40 x 1 = 40
f has the largest benefit and is selected.

22 2nd Example of greedy algorithm: third choice
c: 25 x 1 = 25
d: 30 x 2 = 60
e: (50 - 30) x 2 + (40 - 30) x 1 = 20 x 2 + 10 x 1 = 50
g: 49 x 1 = 49
h: 30 x 1 = 30
d has the largest benefit and is selected.

23 2nd Example of greedy algorithm: result
If we materialize only a, the total cost is 8 x 100 = 800.
With b, f and d also materialized, the cost is 800 - 250 - 70 - 60 = 420.
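The three greedy steps of this example can be checked by computing the benefit directly as C(S) - C(S ∪ {v}). The sketch below is an illustration only. The lattice edges in `CHILDREN` are an assumption: the diagram did not survive in the transcript, so the edges were chosen to reproduce the view counts on the slides (for instance, b covering 5 views and e covering 3).

```python
COST = {"a": 100, "b": 50, "c": 75, "d": 20, "e": 30, "f": 40, "g": 1, "h": 10}

# ASSUMED edges (parent -> children); the original figure is not in the transcript.
CHILDREN = {"a": ["b", "c"], "b": ["d", "e"], "c": ["e", "f"],
            "d": ["g"], "e": ["g", "h"], "f": ["h"], "g": [], "h": []}


def descendants(v):
    """All views that can be answered from v, including v itself."""
    seen, stack = {v}, [v]
    while stack:
        for child in CHILDREN[stack.pop()]:
            if child not in seen:
                seen.add(child)
                stack.append(child)
    return seen


def total_cost(selected):
    """C(S): each view is answered from its cheapest materialized ancestor."""
    return sum(min(COST[u] for u in selected if w in descendants(u))
               for w in COST)


def greedy(extra):
    """Start from S = {a} and add `extra` views, one greedy step at a time."""
    selected, picks = {"a"}, []
    for _ in range(extra):
        base = total_cost(selected)
        benefit = {v: base - total_cost(selected | {v})
                   for v in sorted(COST) if v not in selected}
        best = max(benefit, key=benefit.get)
        picks.append((best, benefit[best]))
        selected.add(best)
    return picks, total_cost(selected)


picks, final_cost = greedy(3)
print(picks)        # [('b', 250), ('f', 70), ('d', 60)]
print(final_cost)   # 420
```

Recomputing C(S) from scratch at every step is wasteful on a large lattice, but it keeps the code a direct transcription of the benefit definition.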

24 Performance Study
How badly can the greedy algorithm perform?

25-31 Data Cube: performance study example, k = 2
View sizes: a 200 (top view); b 100, c 99, d 100; and four groups of 20 nodes each, p1 … p20, q1 … q20, r1 … r20, s1 … s20, every node of size 97.

Greedy selection:
1st choice: b: 41 x 100 = 4100 (saving per view 200 - 100 = 100); c: 41 x 101 = 4141 (saving per view 200 - 99 = 101); d: 41 x 100 = 4100. c is selected.
2nd choice (after c): b: 21 x 100 = 2100 (saving per view 200 - 100 = 100); d: 21 x 100 = 2100.
Greedy: V = {b, c}
Gain(V ∪ {top view}, {top view}) = 4141 + 2100 = 6241

Optimal selection:
1st choice: b: 41 x 100 = 4100.
2nd choice (after b): c: 21 x 101 + 20 x 1 = 2141; d: 41 x 100 = 4100.
Optimal: V = {b, d}
Gain(V ∪ {top view}, {top view}) = 4100 + 4100 = 8200

Greedy / Optimal = 6241 / 8200 = 0.7611
If this ratio is 1, greedy gives an optimal solution. If this ratio is close to 0, greedy may give a "bad" solution. Does this ratio have a lower bound? It has been proved that this ratio is at least 0.63 (that is, 1 - 1/e).
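The greedy-to-optimal ratio of this example can be checked by comparing the greedy choice with a brute-force search over all pairs of views. The sketch below is illustrative only. The assignment of node groups to parents (p under b, q under b and c, r under c and d, s under d) is an assumption, since the diagram is missing from the transcript; it was chosen because it reproduces the counts 41, 21 and 20 used on the slides.

```python
from itertools import combinations

TOP = "a"
COST = {"a": 200, "b": 100, "c": 99, "d": 100}
# Views that can answer each view: the view itself and the top view.
ANCESTORS = {v: {v, TOP} for v in COST}

# ASSUMED structure: which of b, c, d sit above each group of 20 nodes.
GROUP_PARENTS = {"p": ["b"], "q": ["b", "c"], "r": ["c", "d"], "s": ["d"]}
for group, parents in GROUP_PARENTS.items():
    for i in range(1, 21):
        name = f"{group}{i}"
        COST[name] = 97
        ANCESTORS[name] = {name, TOP, *parents}


def total_cost(selected):
    """C(S): each view is answered from its cheapest materialized ancestor."""
    return sum(min(COST[u] for u in ANCESTORS[w] if u in selected)
               for w in COST)


def gain(views):
    """Total saving of materializing `views` in addition to the top view."""
    return total_cost({TOP}) - total_cost({TOP, *views})


def greedy(k):
    """Add the view with the largest marginal gain, k times."""
    chosen = []
    for _ in range(k):
        candidates = [v for v in COST if v != TOP and v not in chosen]
        chosen.append(max(candidates, key=lambda v: gain(chosen + [v])))
    return chosen


greedy_gain = gain(greedy(2))
others = [v for v in COST if v != TOP]
optimal_gain = max(gain(pair) for pair in combinations(others, 2))
ratio = greedy_gain / optimal_gain
print(greedy_gain, optimal_gain, round(ratio, 4))   # 6241 8200 0.7611
```

Greedy is drawn to c because its saving per view is 1 higher, but c overlaps with both b and d, so whichever view is picked second gains only 21 views instead of 41.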

32 Indexing OLAP Data: Bitmap Index
[Figure: a relation table with a bitmap index on Region and a bitmap index on Type]

33 Determining which materialized cuboid(s) should be selected for OLAP operations
Query: find the total sales grouped by {product-category, province} with the condition "year = 2004".
Which of the following 4 materialized cuboids should be selected to process the query?
1) {year, product, city}
2) {year, product-category, country}
3) {year, product-category, province}
4) {product, province} where year = 2004

34 Solution
Let the query to be processed be on {product-category, province} with the condition "year = 2004", and let there be 4 materialized cuboids available:
1) {year, product, city}: it can be used. However, it costs the most, because product and city are at lower levels.
2) {year, product-category, country}: it cannot be used, because country is a more general concept than province.
3) {year, product-category, province}: it can be used. It could cost less than cuboid 4 if there are not many year values and there are many products for each product-category.
4) {product, province} where year = 2004: it can be used.
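The usability test applied in the solution can be written down as a small check. This is a sketch under stated assumptions: the concept hierarchies (product below product-category; city below province below country) are inferred from the slide's wording, and `can_answer` and its parameters are hypothetical names. It only decides whether a cuboid can be used, not which usable cuboid is cheapest.

```python
# Concept hierarchies, finest level first (ASSUMED from the slide's wording).
HIERARCHY = {
    "product": ["product", "product-category"],
    "location": ["city", "province", "country"],
    "time": ["year"],
}
# level name -> (dimension, depth); a smaller depth means finer granularity.
LEVEL = {lvl: (dim, i)
         for dim, levels in HIERARCHY.items()
         for i, lvl in enumerate(levels)}


def finest(cuboid, dim):
    """Depth of the finest level the cuboid keeps for `dim`, or None."""
    depths = [LEVEL[l][1] for l in cuboid if LEVEL[l][0] == dim]
    return min(depths) if depths else None


def can_answer(cuboid, built_with, group_by, filters):
    """True if the cuboid holds enough detail to answer the query.

    cuboid:     levels the cuboid is grouped by
    built_with: conditions already applied when the cuboid was materialized
    group_by:   levels the query groups by
    filters:    conditions of the query (level -> value)
    """
    # A pre-filtered cuboid only contains rows matching its own condition.
    if any(filters.get(lvl) != val for lvl, val in built_with.items()):
        return False
    needed = list(group_by) + [l for l in filters if l not in built_with]
    for lvl in needed:
        dim, depth = LEVEL[lvl]
        have = finest(cuboid, dim)
        if have is None or have > depth:   # missing or too coarse
            return False
    return True


query = ({"product-category", "province"}, {"year": 2004})
cuboids = [
    ({"year", "product", "city"}, {}),
    ({"year", "product-category", "country"}, {}),
    ({"year", "product-category", "province"}, {}),
    ({"product", "province"}, {"year": 2004}),
]
print([can_answer(c, built, *query) for c, built in cuboids])
# [True, False, True, True]
```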

35 Iceberg queries
Assume we want to find pairs of customers and items such that the customer has purchased the item more than 5 times:
    select P.custid, P.item, sum(P.qty)
    from Purchases P
    group by P.custid, P.item
    having sum(P.qty) > 5
What is the execution plan for the query? The number of groups is very large, but the answer to the query (the tip of the iceberg) is usually very small.

36 Iceberg queries
Original query:
    select P.custid, P.item, sum(P.qty)
    from Purchases P
    group by P.custid, P.item
    having sum(P.qty) > 5
Q1:
    select P.custid
    from Purchases P
    group by P.custid
    having sum(P.qty) > 5
Q2:
    select P.item
    from Purchases P
    group by P.item
    having sum(P.qty) > 5
Generate (custid, item) pairs only for custid values from Q1 and item values from Q2.
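The pruning idea can be tried out on an in-memory SQLite database. This is a minimal sketch: the rows are made up for illustration, and the pruning is only valid under the assumption that qty is never negative, so that a pair's total cannot exceed the total of its customer or of its item.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Purchases (custid TEXT, item TEXT, qty INTEGER)")
# Made-up sample rows, for illustration only.
rows = [("c1", "pen", 3), ("c1", "pen", 4), ("c1", "ink", 1),
        ("c2", "pen", 2), ("c2", "ink", 1), ("c3", "pen", 6)]
conn.executemany("INSERT INTO Purchases VALUES (?, ?, ?)", rows)

# The iceberg query as written: aggregates every (custid, item) group.
direct = conn.execute("""
    SELECT P.custid, P.item, SUM(P.qty)
    FROM Purchases P
    GROUP BY P.custid, P.item
    HAVING SUM(P.qty) > 5
    ORDER BY P.custid, P.item
""").fetchall()

# Same query, but groups are formed only for customers passing Q1
# and items passing Q2 (assumes qty >= 0).
pruned = conn.execute("""
    SELECT P.custid, P.item, SUM(P.qty)
    FROM Purchases P
    WHERE P.custid IN (SELECT custid FROM Purchases
                       GROUP BY custid HAVING SUM(qty) > 5)
      AND P.item IN (SELECT item FROM Purchases
                     GROUP BY item HAVING SUM(qty) > 5)
    GROUP BY P.custid, P.item
    HAVING SUM(P.qty) > 5
    ORDER BY P.custid, P.item
""").fetchall()

print(direct)   # [('c1', 'pen', 7), ('c3', 'pen', 6)]
print(pruned)   # same result, computed over fewer groups
```

Q1 and Q2 each scan far fewer groups than the pair query, so they act as cheap filters before the expensive (custid, item) aggregation.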

37 From On-Line Analytical Processing (OLAP) to On-Line Analytical Mining (OLAM)
Why online analytical mining?
High quality of data in data warehouses
OLAP-based exploratory data analysis
Easy selection of data mining functions

38 An OLAM System Architecture (August 25, 2015)
[Figure: four-layer OLAM architecture. Layer 1, Data Repository (databases, data warehouse; data cleaning, data integration, filtering). Layer 2, MDDB (with meta data). Layer 3, OLAP/OLAM (OLAM engine, OLAP engine). Layer 4, User Interface (mining query, mining result). Interfaces shown between the layers: Database API (filtering & integration), Data Cube API, User GUI API.]

39 Dr. Panagiotis Symeonidis, Data Engineering Laboratory, http://delab.csd.auth.gr/~symeon
Data Warehouse Implementation: Part B

