COMP5331 Data Warehouse. Prepared by Raymond Wong. Presented by Raymond Wong.


1 COMP5331 Data Warehouse. Prepared by Raymond Wong. Presented by Raymond Wong. raywong@cse

2 Data Warehouse: Also called Online Analytical Processing (OLAP). Many corporations use data warehouses for their analysis.

3 Data Warehouse: Without a data warehouse, users query the databases directly and need to wait for a long time (e.g., 1 day to 1 week). With a data warehouse, users' queries are answered from pre-computed results.

4 Advantages: Fast query response.

5 Data Warehouse (outline): Problem; NP-hardness; Algorithm; Performance Study.

6 Data Warehouse: Parts are bought from suppliers and then sold to customers at a sale price SP.
Table T:
  part  supplier  customer  SP
  p1    s1        c1        4
  p3    s1        c2        3
  p2    s3        c1        7
  …     …         …         …

7 Data Warehouse: Table T (as on slide 6) viewed as a data cube with axes part (p1, p2, p3, p4, p5), supplier (s1, s2, s3, s4) and customer (c1, c2, c3, c4). Each cell holds SP, e.g., 4 and 3.

8 Data Warehouse: Aggregate queries on table T (as on slide 6).
e.g., select part, customer, SUM(SP) from table T group by part, customer (view pc, 3 rows)
  part  customer  SUM(SP)
  p1    c1        4
  p3    c2        3
  p2    c1        7
e.g., select customer, SUM(SP) from table T group by customer (view c, 2 rows)
  customer  SUM(SP)
  c1        11
  c2        3
Other aggregates: AVG(SP), MAX(SP), MIN(SP), …
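The two group-by queries on slide 8 can be tried out with a small sketch. It assumes an in-memory SQLite database and uses only the three rows of table T that are visible on the slide; the variable names are illustrative.

```python
import sqlite3

# Toy version of table T; only the three rows visible on the slide are loaded.
rows = [
    ("p1", "s1", "c1", 4),
    ("p3", "s1", "c2", 3),
    ("p2", "s3", "c1", 7),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE T (part TEXT, supplier TEXT, customer TEXT, SP INTEGER)")
conn.executemany("INSERT INTO T VALUES (?, ?, ?, ?)", rows)

# View "pc": total sale price per (part, customer).
by_part_customer = conn.execute(
    "SELECT part, customer, SUM(SP) FROM T "
    "GROUP BY part, customer ORDER BY part, customer"
).fetchall()

# View "c": total sale price per customer.
by_customer = conn.execute(
    "SELECT customer, SUM(SP) FROM T GROUP BY customer ORDER BY customer"
).fetchall()

print(by_part_customer)
print(by_customer)
```

The second result could also be computed from the first one instead of from T, which is the idea behind answering a query from a pre-computed view.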

9 Data Warehouse: Lattice of group-by views over table T (as on slide 6), with the size of each view (M = million):
  psc 6M
  pc 4M, ps 0.8M, sc 2M
  p 0.2M, s 0.01M, c 0.1M
  none 1
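As a rough illustration of the lattice on slide 9, the sketch below enumerates all group-by views over the three dimensions and lists which views could answer each one. The sizes are copied from the slide; the helper names `view_name` and `can_answer` are made up for this example, and the rule "a view can answer any view that groups by a subset of its attributes" is the usual data-cube assumption rather than something stated on the slide.

```python
from itertools import combinations

DIMENSIONS = ("p", "s", "c")

# Sizes copied from the slide (M = million); "none" has a single row.
SIZE = {"psc": 6_000_000, "pc": 4_000_000, "ps": 800_000, "sc": 2_000_000,
        "p": 200_000, "s": 10_000, "c": 100_000, "none": 1}

def view_name(dims):
    """Name a view by its group-by attributes, in the fixed order p, s, c."""
    return "".join(d for d in DIMENSIONS if d in dims) or "none"

# Every subset of the dimensions is one group-by view: 2^3 = 8 views.
views = [frozenset(combo)
         for r in range(len(DIMENSIONS), -1, -1)
         for combo in combinations(DIMENSIONS, r)]

def can_answer(source, target):
    """Assumed rule: source can answer target if it keeps all of target's attributes."""
    return target <= source

for v in views:
    sources = [view_name(u) for u in views if can_answer(u, v)]
    print(f"{view_name(v):>4}  size={SIZE[view_name(v)]:>9,}  answerable from: {sources}")
```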

10 Data Warehouse: Suppose we materialize all views (lattice as on slide 9). This wastes a lot of space.
  Cost for accessing pc = 4M
  Cost for accessing ps = 0.8M
  Cost for accessing sc = 2M
  Cost for accessing p = 0.2M
  Cost for accessing c = 0.1M
  Cost for accessing s = 0.01M

11 Data Warehouse: Suppose we materialize the top view only (lattice as on slide 9).
  Cost for accessing pc = 6M (not 4M)
  Cost for accessing ps = 6M (not 0.8M)
  Cost for accessing sc = 6M (not 2M)
  Cost for accessing p = 6M (not 0.2M)
  Cost for accessing c = 6M (not 0.1M)
  Cost for accessing s = 6M (not 0.01M)

12 Data Warehouse: Suppose we materialize the top view and the view for “ps” only (lattice as on slide 9).
  Cost for accessing pc = 6M (still 6M)
  Cost for accessing sc = 6M (still 6M)
  Cost for accessing p = 0.8M (not 6M previously)
  Cost for accessing ps = 0.8M (not 6M previously)
  Cost for accessing c = 6M (still 6M)
  Cost for accessing s = 0.8M (not 6M previously)

13 Data Warehouse: Suppose we materialize the top view and the view for “ps” only (lattice as on slide 9).
  Cost for accessing pc = 6M (still 6M); Gain = 0
  Cost for accessing sc = 6M (still 6M); Gain = 0
  Cost for accessing p = 0.8M (not 6M previously); Gain = 5.2M
  Cost for accessing ps = 0.8M (not 6M previously); Gain = 5.2M
  Cost for accessing c = 6M (still 6M); Gain = 0
  Cost for accessing s = 0.8M (not 6M previously); Gain = 5.2M
Gain({view for “ps”, top view}, {top view}) = 5.2M x 3 = 15.6M
Selective Materialization Problem: Select a set V of k views such that Gain(V U {top view}, {top view}) is maximized.
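The cost and gain figures on slides 10 to 13 can be checked with a minimal sketch. It assumes that the cost of a view is the size of the smallest materialized view able to answer it, and, following the slides' arithmetic, it leaves the "none" view out of the gain. The function names are illustrative.

```python
# Sizes in millions of rows, taken from the slides ("none" is left out on purpose).
SIZE = {"psc": 6.0, "pc": 4.0, "ps": 0.8, "sc": 2.0, "p": 0.2, "s": 0.01, "c": 0.1}

def can_answer(source, target):
    """source can answer target if it keeps every group-by attribute of target."""
    return set(target) <= set(source)

def cost(view, materialized):
    """Assumed cost model: size of the smallest materialized view that answers `view`."""
    return min(SIZE[m] for m in materialized if can_answer(m, view))

def gain(new_set, old_set):
    """Total cost reduction over all views when going from old_set to new_set.

    The top view always costs 6M, so it contributes nothing to the sum.
    """
    return sum(cost(v, old_set) - cost(v, new_set) for v in SIZE)

top_only = {"psc"}
top_and_ps = {"psc", "ps"}
for v in ("pc", "sc", "p", "ps", "c", "s"):
    print(v, cost(v, top_only), cost(v, top_and_ps))
print("Gain:", round(gain(top_and_ps, top_only), 2))
```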

14 Data Warehouse (outline): Problem; NP-hardness; Algorithm; Performance Study.

15 NP-hardness: The Selective Materialization Problem is NP-hard.
Selective Materialization Problem: Select a set V of k views such that Gain(V U {top view}, {top view}) is maximized.

16 NP-hardness: Selective Materialization Decision Problem (SMD): Given an integer k and a real number J, is there a set V of k views such that Gain(V U {top view}, {top view}) is at least J? The Selective Materialization Decision Problem is NP-hard.
Selective Materialization Problem: Select a set V of k views such that Gain(V U {top view}, {top view}) is maximized.

17 NP-hardness: Exact Cover by 3-Sets (XC).
Instance: A set X with 3q elements, and a collection C of 3-element subsets of X.
Question: Does C contain an exact cover for X, i.e., a subcollection C' ⊆ C such that every element of X occurs in exactly one set of C'?
It is well known that this problem is NP-complete.

18 NP-hardness: Problem XC can be transformed to Problem SMD.
  Create a root node with size = 200 (at level 1)
  Create a bottom node with size = 1 (at level 4)
  For each element x in X:
    Create a node N_x with size = 50 at level 3
    Create an edge between N_x and the bottom node
  For each element a ∈ C (where a = (x, y, z)):
    Create a node N_a with size = 100 at level 2
    Create an edge between N_a and the root node
    Create an edge between N_a and N_x
    Create an edge between N_a and N_y
    Create an edge between N_a and N_z
  Set k = q
  Set J = 400q

19 NP-hardness: E.g., X = {A, B, C, D, E, F}, C = {(A, B, C), (B, C, D), (D, E, F)}.
  Level 1: root node (size 200)
  Level 2: one node for each set in C (size 100 each)
  Level 3: nodes A, B, C, D, E, F (size 50 each)
  Level 4: bottom node (size 1)
  q = 2, so k = 2 and J = 400 x 2 = 800
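The construction above can be sketched in code for the small example. This is only an illustration: the function names are hypothetical, and the gain is computed under the assumption (matching the arithmetic used elsewhere in these slides) that the bottom node is not counted. Under that assumption, a level-2 node contributes 100 for itself and 100 for each of its three children not already covered, so a gain of 400q is reached exactly when the q chosen sets are disjoint.

```python
from itertools import combinations

def build_smd_instance(X, C):
    """Build the 4-level lattice of the reduction; returns (size, children, k, J)."""
    q = len(X) // 3
    size = {"root": 200, "bottom": 1}
    children = {"root": set(), "bottom": set()}
    for x in X:
        size[("N", x)] = 50                          # level 3: one node per element
        children[("N", x)] = {"bottom"}
    for a in C:
        size[("N", a)] = 100                         # level 2: one node per 3-set
        children[("N", a)] = {("N", x) for x in a}
        children["root"].add(("N", a))
    return size, children, q, 400 * q

def gain(V, size, children):
    """Gain of materializing V on top of the root (bottom node ignored by assumption)."""
    total = 0
    for v in size:
        if v in ("root", "bottom"):
            continue
        # Selected views able to answer v: v itself or a selected parent of v.
        sources = [u for u in V if u == v or v in children[u]]
        best = min([size[u] for u in sources], default=size["root"])
        total += size["root"] - best
    return total

X = ["A", "B", "C", "D", "E", "F"]
C = [("A", "B", "C"), ("B", "C", "D"), ("D", "E", "F")]
size, children, k, J = build_smd_instance(X, C)

# Brute force over all choices of k views; feasible only for tiny instances.
candidates = [v for v in size if v not in ("root", "bottom")]
best = max(combinations(candidates, k), key=lambda V: gain(V, size, children))
print(k, J, best, gain(best, size, children))
```

In this example the best pair is the two nodes for (A, B, C) and (D, E, F), which is also an exact cover of X, and its gain equals J.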

20 NP-hardness: It is easy to verify that the constructed SMD instance has a solution if and only if the XC instance has an exact cover, so solving problem SMD is equivalent to solving problem XC. Hence Problem SMD is NP-hard.

21 Data Warehouse: Parts are bought from suppliers and then sold to customers at a sale price SP (table T as on slide 6). Lattice of views:
  psc 6M
  pc 4M, ps 0.8M, sc 2M
  p 0.2M, s 0.01M, c 0.1M
  none 1

22 Data Warehouse: Parts are bought from suppliers and then sold to customers at a sale price SP (table T as on slide 6). Lattice of views:
  psc 6M
  pc 6M, ps 0.8M, sc 2M
  p 0.2M, s 0.01M, c 0.1M
  none 1

23 Data Warehouse (outline): Problem; NP-hardness; Algorithm; Performance Study.

24 Greedy Algorithm: Let k be the number of views to be materialized. Given a view v and a set S of views already selected for materialization, define the benefit of selecting v for materialization as B(v, S) = Gain(S U {v}, S).

25 Greedy Algorithm:
  S ← {top view};
  For i = 1 to k do
    Select the view v not in S such that B(v, S) is maximized;
    S ← S U {v}
  The resulting S is the greedy selection.
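The pseudocode above can be turned into a short runnable sketch. It assumes the cost model used earlier in the slides (a view costs the size of its smallest materialized ancestor), uses the lattice sizes from the walk-through on the following slides (pc and sc at 6M), and leaves out the "none" view as the slides' arithmetic does. All function names are illustrative.

```python
# Sizes in millions of rows, as in the walk-through that follows (pc = sc = 6M).
SIZE = {"psc": 6.0, "pc": 6.0, "ps": 0.8, "sc": 6.0, "p": 0.2, "s": 0.01, "c": 0.1}
TOP = "psc"

def can_answer(source, target):
    """source can answer target if it keeps every group-by attribute of target."""
    return set(target) <= set(source)

def cost(view, materialized):
    """Assumed cost model: size of the smallest materialized view that answers `view`."""
    return min(SIZE[m] for m in materialized if can_answer(m, view))

def benefit(v, S):
    """B(v, S) = Gain(S U {v}, S)."""
    return sum(cost(w, S) - cost(w, S | {v}) for w in SIZE)

def greedy(k):
    """Return the k greedy picks, in order, as (view, benefit) pairs."""
    S = {TOP}
    picks = []
    for _ in range(k):
        candidates = [v for v in SIZE if v not in S]
        v = max(candidates, key=lambda u: benefit(u, S))
        picks.append((v, benefit(v, S)))
        S = S | {v}
    return picks

picks = greedy(2)
print([(v, round(b, 2)) for v, b in picks])
print("total gain:", round(sum(b for _, b in picks), 2))
```

With k = 2 this selects ps and then c, for a total gain of 21.5 (in millions), the same selection the walk-through arrives at.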

26 1.1 Data Cube (k = 2): Lattice: psc 6M; pc 6M, ps 0.8M, sc 6M; p 0.2M, s 0.01M, c 0.1M; none 1. A benefit table (in M) is filled in with rows pc, ps, sc, p, s, c and columns 1st Choice, 2nd Choice.
  1st choice, pc: benefit per view = 6M - 6M = 0; total 0 x 3 = 0

27 1.1 Data Cube (k = 2): 1st choice, ps: benefit per view = 6M - 0.8M = 5.2M; total 5.2 x 3 = 15.6

28 1.1 Data Cube (k = 2): 1st choice, sc: benefit per view = 6M - 6M = 0; total 0 x 3 = 0

29 1.1 Data Cube (k = 2): 1st choice, p: benefit per view = 6M - 0.2M = 5.8M; total 5.8 x 1 = 5.8

30 1.1 Data Cube (k = 2): 1st choice, s: benefit per view = 6M - 0.01M = 5.99M; total 5.99 x 1 = 5.99

31 1.1 Data Cube (k = 2): 1st choice, c: benefit per view = 6M - 0.1M = 5.9M; total 5.9 x 1 = 5.9

32 1.1 Data Cube (k = 2): 2nd choice, pc: benefit per view = 6M - 6M = 0; total 0 x 2 = 0

33 1.1 Data Cube (k = 2): 2nd choice, sc: benefit per view = 6M - 6M = 0; total 0 x 2 = 0

34 1.1 Data Cube (k = 2): 2nd choice, p: benefit per view = 0.8M - 0.2M = 0.6M; total 0.6 x 1 = 0.6

35 1.1 Data Cube (k = 2): 2nd choice, s: benefit per view = 0.8M - 0.01M = 0.79M; total 0.79 x 1 = 0.79

36 1.1 Data Cube (k = 2): 2nd choice, c: benefit per view = 6M - 0.1M = 5.9M; total 5.9 x 1 = 5.9

37 1.1 Data Cube (k = 2): Completed benefit table (in M):
        1st Choice         2nd Choice
  pc    0 x 3 = 0          0 x 2 = 0
  ps    5.2 x 3 = 15.6     -
  sc    0 x 3 = 0          0 x 2 = 0
  p     5.8 x 1 = 5.8      0.6 x 1 = 0.6
  s     5.99 x 1 = 5.99    0.79 x 1 = 0.79
  c     5.9 x 1 = 5.9      5.9 x 1 = 5.9
The two views to be materialized are 1. ps and 2. c, i.e., V = {ps, c}.
Gain(V U {top view}, {top view}) = 15.6 + 5.9 = 21.5

38 Data Warehouse (outline): Problem; NP-hardness; Algorithm; Performance Study.

39 Performance Study: How badly can the Greedy Algorithm perform?

40 1.1 Data Cube (k = 2): Lattice: a 200; b 100, c 99, d 100; four groups of 20 nodes each, p1 … p20, q1 … q20, r1 … r20, s1 … s20 (97 each); none 1. A benefit table is filled in with rows b, c, d, … and columns 1st Choice, 2nd Choice.
  1st choice, b: benefit per view = 200 - 100 = 100; total 41 x 100 = 4100

41 1.1 Data Cube (k = 2): 1st choice, c: benefit per view = 200 - 99 = 101; total 41 x 101 = 4141

42 1.1 Data Cube (k = 2): 1st choice, d: total 41 x 100 = 4100

43 1.1 Data Cube (k = 2): 2nd choice (after c is selected), b: benefit per view = 200 - 100 = 100; total 21 x 100 = 2100

44 1.1 Data Cube (k = 2): 2nd choice (after c is selected), d: total 21 x 100 = 2100
Greedy: V = {b, c}, Gain(V U {top view}, {top view}) = 4141 + 2100 = 6241

45 1.1 Data Cube (k = 2): If b is selected first instead (1st choice 41 x 100 = 4100), the 2nd-choice benefits are c: 21 x 101 + 20 x 1 = 2141 and d: 41 x 100 = 4100.
Greedy: V = {b, c}, Gain(V U {top view}, {top view}) = 4141 + 2100 = 6241
Optimal: V = {b, d}, Gain(V U {top view}, {top view}) = 4100 + 4100 = 8200

46 1.1 Data Cube (k = 2): Lattice as on slide 40.
Greedy: V = {b, c}, Gain(V U {top view}, {top view}) = 4141 + 2100 = 6241
Optimal: V = {b, d}, Gain(V U {top view}, {top view}) = 4100 + 4100 = 8200
Greedy / Optimal = 6241 / 8200 = 0.7611
If this ratio = 1, Greedy gives an optimal solution. If this ratio ≈ 0, Greedy may give a “bad” solution. Does this ratio have a lower bound? It is proved that this ratio is at least 0.63 (that is, about 1 - 1/e).
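The comparison above can be reproduced with a small sketch. The lattice edges are not spelled out in the transcript, so the structure below is an assumption chosen to match the slide's counts: b sits above groups p and q, c above q and r, and d above r and s. Only b, c and d are considered as candidates, as in the slide's table, and the "none" node is not counted. All names are illustrative.

```python
from itertools import combinations

# Assumed structure: which 20-node groups each candidate view can answer.
COVERS = {"b": "pq", "c": "qr", "d": "rs"}
SIZE = {"a": 200, "b": 100, "c": 99, "d": 100}
GROUP_SIZE = 20  # nodes per group p, q, r, s

def cost_of_group_node(group, selected):
    """Cheapest selected view above a node of `group`; falls back to the top view a."""
    return min([SIZE[v] for v in selected if group in COVERS[v]], default=SIZE["a"])

def gain(selected):
    """Gain(selected U {a}, {a}), counting b, c, d and the 80 group nodes."""
    total = sum(SIZE["a"] - SIZE[v] for v in selected)
    for group in "pqrs":
        total += GROUP_SIZE * (SIZE["a"] - cost_of_group_node(group, selected))
    return total

def greedy(k):
    """Pick k views one at a time, each time maximizing the marginal gain."""
    chosen = []
    for _ in range(k):
        rest = [v for v in COVERS if v not in chosen]
        chosen.append(max(rest, key=lambda v: gain(chosen + [v]) - gain(chosen)))
    return chosen

greedy_set = greedy(2)
optimal_set = max(combinations(COVERS, 2), key=gain)  # brute force over all pairs
ratio = gain(greedy_set) / gain(optimal_set)
print(greedy_set, gain(greedy_set))
print(optimal_set, gain(optimal_set))
print(round(ratio, 4))
```

Greedy picks c first because 4141 is slightly larger than 4100, and that choice overlaps with whatever it picks next, which is why it ends up below the optimal pair {b, d}.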

47 Performance Study: This is just an example to show that this greedy algorithm can perform badly. A complete proof of the lower bound can be found in the paper.

