Presentation is loading. Please wait.

Presentation is loading. Please wait.

Database Management Systems Data Mining - Data Cubes

Similar presentations


Presentation on theme: "Database Management Systems Data Mining - Data Cubes"— Presentation transcript:

1 15-721 Database Management Systems Data Mining - Data Cubes
Anastassia Ailamaki

2 Copyright: C. Faloutsos (2001)
Substitute lecturer Christos Faloutsos 15-721 Copyright: C. Faloutsos (2001)

3 Copyright: C. Faloutsos (2001)
Outline Problem Getting the data: Data Warehouses, DataCubes, OLAP Supervised learning: decision trees Unsupervised learning association rules (clustering) 15-721 Copyright: C. Faloutsos (2001)

4 Problem PGH NY ??? SF Given: multiple data sources
Find: patterns (classifiers, rules, clusters, outliers...) sales(p-id, c-id, date, $price) customers( c-id, age, income, ...) NY SF ??? PGH 15-721 Copyright: C. Faloutsos (2001)

5 Copyright: C. Faloutsos (2001)
Data Ware-housing First step: collect the data, in a single place (= Data Warehouse) How? How often? How about discrepancies / non-homegeneities? 15-721 Copyright: C. Faloutsos (2001)

6 Copyright: C. Faloutsos (2001)
Data Ware-housing First step: collect the data, in a single place (= Data Warehouse) How? A: Triggers/Materialized views How often? A: [Art!] How about discrepancies / non-homegeneities? A: Wrappers/Mediators 15-721 Copyright: C. Faloutsos (2001)

7 Copyright: C. Faloutsos (2001)
Data Ware-housing Step 2: collect counts. (DataCubes/OLAP) Eg.: 15-721 Copyright: C. Faloutsos (2001)

8 Copyright: C. Faloutsos (2001)
OLAP Problem: “is it true that shirts in large sizes sell better in dark colors?” sales ... 15-721 Copyright: C. Faloutsos (2001)

9 Copyright: C. Faloutsos (2001)
DataCubes ‘color’, ‘size’: DIMENSIONS ‘count’: MEASURE f color size color; size 15-721 Copyright: C. Faloutsos (2001)

10 Copyright: C. Faloutsos (2001)
DataCubes ‘color’, ‘size’: DIMENSIONS ‘count’: MEASURE f size color color; size 15-721 Copyright: C. Faloutsos (2001)

11 Copyright: C. Faloutsos (2001)
DataCubes ‘color’, ‘size’: DIMENSIONS ‘count’: MEASURE f size color color; size 15-721 Copyright: C. Faloutsos (2001)

12 Copyright: C. Faloutsos (2001)
DataCubes ‘color’, ‘size’: DIMENSIONS ‘count’: MEASURE f size color color; size 15-721 Copyright: C. Faloutsos (2001)

13 Copyright: C. Faloutsos (2001)
DataCubes ‘color’, ‘size’: DIMENSIONS ‘count’: MEASURE f size color color; size 15-721 Copyright: C. Faloutsos (2001)

14 Copyright: C. Faloutsos (2001)
DataCubes ‘color’, ‘size’: DIMENSIONS ‘count’: MEASURE f size color color; size DataCube 15-721 Copyright: C. Faloutsos (2001)

15 Copyright: C. Faloutsos (2001)
DataCubes SQL query to generate DataCube: Naively (and painfully:) select size, color, count(*) from sales where p-id = ‘shirt’ group by size, color select size, count(*) group by size ... 15-721 Copyright: C. Faloutsos (2001)

16 Copyright: C. Faloutsos (2001)
DataCubes SQL query to generate DataCube: with ‘cube by’ keyword: select size, color, count(*) from sales where p-id = ‘shirt’ cube by size, color 15-721 Copyright: C. Faloutsos (2001)

17 Copyright: C. Faloutsos (2001)
DataCubes DataCube issues: Q1: How to store them (and/or materialize portions on demand) Q2: How to index them Q3: Which operations to allow 15-721 Copyright: C. Faloutsos (2001)

18 Copyright: C. Faloutsos (2001)
DataCubes DataCube issues: Q1: How to store them (and/or materialize portions on demand) A: ROLAP/MOLAP Q2: How to index them A: bitmap indices Q3: Which operations to allow A: roll-up, drill down, slice, dice [More details: book by Han+Kamber] 15-721 Copyright: C. Faloutsos (2001)

19 Copyright: C. Faloutsos (2001)
DataCubes Q1: How to store a dataCube? 15-721 Copyright: C. Faloutsos (2001)

20 Copyright: C. Faloutsos (2001)
DataCubes Q1: How to store a dataCube? A1: Relational (R-OLAP) 15-721 Copyright: C. Faloutsos (2001)

21 Copyright: C. Faloutsos (2001)
DataCubes Q1: How to store a dataCube? A2: Multi-dimensional (M-OLAP) A3: Hybrid (H-OLAP) 15-721 Copyright: C. Faloutsos (2001)

22 Copyright: C. Faloutsos (2001)
DataCubes Pros/Cons: ROLAP strong points: (DSS, Metacube) 15-721 Copyright: C. Faloutsos (2001)

23 Copyright: C. Faloutsos (2001)
DataCubes Pros/Cons: ROLAP strong points: (DSS, Metacube) use existing RDBMS technology scale up better with dimensionality 15-721 Copyright: C. Faloutsos (2001)

24 Copyright: C. Faloutsos (2001)
DataCubes Pros/Cons: MOLAP strong points: (EssBase/hyperion.com) faster indexing (careful with: high-dimensionality; sparseness) HOLAP: (MS SQL server OLAP services) detail data in ROLAP; summaries in MOLAP 15-721 Copyright: C. Faloutsos (2001)

25 Copyright: C. Faloutsos (2001)
DataCubes Q1: How to store a dataCube Q2: What operations should we support? Q3: How to index a dataCube? 15-721 Copyright: C. Faloutsos (2001)

26 Copyright: C. Faloutsos (2001)
DataCubes Q2: What operations should we support? f color size color; size 15-721 Copyright: C. Faloutsos (2001)

27 Copyright: C. Faloutsos (2001)
DataCubes Q2: What operations should we support? Roll-up f size color color; size 15-721 Copyright: C. Faloutsos (2001)

28 Copyright: C. Faloutsos (2001)
DataCubes Q2: What operations should we support? Drill-down f size color color; size 15-721 Copyright: C. Faloutsos (2001)

29 Copyright: C. Faloutsos (2001)
DataCubes Q2: What operations should we support? Slice f size color color; size 15-721 Copyright: C. Faloutsos (2001)

30 Copyright: C. Faloutsos (2001)
DataCubes Q2: What operations should we support? Dice f size color color; size 15-721 Copyright: C. Faloutsos (2001)

31 Copyright: C. Faloutsos (2001)
DataCubes Q2: What operations should we support? Roll-up Drill-down Slice Dice 15-721 Copyright: C. Faloutsos (2001)

32 Copyright: C. Faloutsos (2001)
DataCubes Q1: How to store a dataCube Q2: What operations should we support? Q3: How to index a dataCube? 15-721 Copyright: C. Faloutsos (2001)

33 Copyright: C. Faloutsos (2001)
DataCubes Q3: How to index a dataCube? 15-721 Copyright: C. Faloutsos (2001)

34 Copyright: C. Faloutsos (2001)
DataCubes Q3: How to index a dataCube? A1: Bitmaps 15-721 Copyright: C. Faloutsos (2001)

35 Copyright: C. Faloutsos (2001)
DataCubes Q3: How to index a dataCube? A2: Join indices (see [Han+Kamber]) 15-721 Copyright: C. Faloutsos (2001)

36 D/W - OLAP - Conclusions
D/W: copy (summarized) data + analyze OLAP - concepts: DataCube R/M/H-OLAP servers ‘dimensions’; ‘measures’ 15-721 Copyright: C. Faloutsos (2001)

37 Copyright: C. Faloutsos (2001)
Outline Problem Getting the data: Data Warehouses, DataCubes, OLAP Supervised learning: decision trees Unsupervised learning association rules (clustering) 15-721 Copyright: C. Faloutsos (2001)

38 Decision trees - Problem
?? 15-721 Copyright: C. Faloutsos (2001)

39 Copyright: C. Faloutsos (2001)
Decision trees Pictorially, we have num. attr#1 (eg., ‘age’) num. attr#2 (eg., chol-level) + - 15-721 Copyright: C. Faloutsos (2001)

40 Copyright: C. Faloutsos (2001)
Decision trees and we want to label ‘?’ num. attr#2 (eg., chol-level) ? - - + + + - + - + - + - + num. attr#1 (eg., ‘age’) 15-721 Copyright: C. Faloutsos (2001)

41 Copyright: C. Faloutsos (2001)
Decision trees so we build a decision tree: num. attr#2 (eg., chol-level) ? - - + + + 40 - + - + - + - + 50 num. attr#1 (eg., ‘age’) 15-721 Copyright: C. Faloutsos (2001)

42 Copyright: C. Faloutsos (2001)
Decision trees so we build a decision tree: age<50 N Y chol. <40 + Y N - ... 15-721 Copyright: C. Faloutsos (2001)

43 Copyright: C. Faloutsos (2001)
Outline Problem Getting the data: Data Warehouses, DataCubes, OLAP Supervised learning: decision trees problem approach scalability enhancements Unsupervised learning association rules (clustering) 15-721 Copyright: C. Faloutsos (2001)

44 Copyright: C. Faloutsos (2001)
Decision trees Typically, two steps: tree building tree pruning (for over-training/over-fitting) 15-721 Copyright: C. Faloutsos (2001)

45 Copyright: C. Faloutsos (2001)
Tree building How? num. attr#1 (eg., ‘age’) num. attr#2 (eg., chol-level) + - 15-721 Copyright: C. Faloutsos (2001)

46 Copyright: C. Faloutsos (2001)
skip Tree building How? A: Partition, recursively - pseudocode: Partition ( Dataset S) if all points in S have same label then return evaluate splits along each attribute A pick best split, to divide S into S1 and S2 Partition(S1); Partition(S2) 15-721 Copyright: C. Faloutsos (2001)

47 Copyright: C. Faloutsos (2001)
skip Tree building Q1: how to introduce splits along attribute Ai Q2: how to evaluate a split? 15-721 Copyright: C. Faloutsos (2001)

48 Copyright: C. Faloutsos (2001)
skip Tree building Q1: how to introduce splits along attribute Ai A1: for num. attributes: binary split, or multiple split for categorical attributes: compute all subsets (expensive!), or use a greedy algo 15-721 Copyright: C. Faloutsos (2001)

49 Copyright: C. Faloutsos (2001)
skip Tree building Q1: how to introduce splits along attribute Ai Q2: how to evaluate a split? 15-721 Copyright: C. Faloutsos (2001)

50 Copyright: C. Faloutsos (2001)
Tree building Q1: how to introduce splits along attribute Ai Q2: how to evaluate a split? A: by how close to uniform each subset is - ie., we need a measure of uniformity: 15-721 Copyright: C. Faloutsos (2001)

51 Copyright: C. Faloutsos (2001)
skip Tree building entropy: H(p+, p-) Any other measure? 1 0.5 1 p+ 15-721 Copyright: C. Faloutsos (2001)

52 Copyright: C. Faloutsos (2001)
skip Tree building entropy: H(p+, p-) ‘gini’ index: 1-p+2 - p-2 1 p+ 1 0.5 0.5 1 p+ 15-721 Copyright: C. Faloutsos (2001)

53 Tree building skip entropy: H(p+, p-) ‘gini’ index: 1-p+2 - p-2
(How about multiple labels?) 15-721 Copyright: C. Faloutsos (2001)

54 Copyright: C. Faloutsos (2001)
skip Tree building Intuition: entropy: #bits to encode the class label gini: classification error, if we randomly guess ‘+’ with prob. p+ 15-721 Copyright: C. Faloutsos (2001)

55 Copyright: C. Faloutsos (2001)
Tree building Thus, we choose the split that reduces entropy/classification-error the most: Eg.: num. attr#1 (eg., ‘age’) num. attr#2 (eg., chol-level) + - 15-721 Copyright: C. Faloutsos (2001)

56 Tree building skip Before split: we need
(n+ + n-) * H( p+, p-) = (7+6) * H(7/13, 6/13) bits total, to encode all the class labels After the split we need: 0 bits for the first half and (2+6) * H(2/8, 6/8) bits for the second half 15-721 Copyright: C. Faloutsos (2001)

57 Copyright: C. Faloutsos (2001)
Tree pruning What for? ... num. attr#1 (eg., ‘age’) num. attr#2 (eg., chol-level) + - 15-721 Copyright: C. Faloutsos (2001)

58 Copyright: C. Faloutsos (2001)
Tree pruning Shortcut for scalability: DYNAMIC pruning: stop expanding the tree, if a node is ‘reasonably’ homogeneous ad hoc threshold [Agrawal+, vldb92] Minimum Description Language (MDL) criterion (SLIQ) [Mehta+, edbt96] 15-721 Copyright: C. Faloutsos (2001)

59 Copyright: C. Faloutsos (2001)
skip Tree pruning Q: How to do it? A1: use a ‘training’ and a ‘testing’ set - prune nodes that improve classification in the ‘testing’ set. (Drawbacks?) A2: or, rely on MDL (= Minimum Description Language) - in detail: 15-721 Copyright: C. Faloutsos (2001)

60 Copyright: C. Faloutsos (2001)
skip Tree pruning envision the problem as compression (of what?) 15-721 Copyright: C. Faloutsos (2001)

61 Copyright: C. Faloutsos (2001)
skip Tree pruning envision the problem as compression (of what?) and try to min. the # bits to compress (a) the class labels AND (b) the representation of the decision tree 15-721 Copyright: C. Faloutsos (2001)

62 Copyright: C. Faloutsos (2001)
skip (MDL) a brilliant idea - eg.: best n-degree polynomial to compress these points: the one that minimizes (sum of errors + n ) 15-721 Copyright: C. Faloutsos (2001)

63 Copyright: C. Faloutsos (2001)
Outline Problem Getting the data: Data Warehouses, DataCubes, OLAP Supervised learning: decision trees problem approach scalability enhancements Unsupervised learning association rules (clustering) 15-721 Copyright: C. Faloutsos (2001)

64 Scalability enhancements
Interval Classifier [Agrawal+,vldb92]: dynamic pruning SLIQ: dynamic pruning with MDL; vertical partitioning of the file (but label column has to fit in core) SPRINT: even more clever partitioning 15-721 Copyright: C. Faloutsos (2001)

65 Conclusions for classifiers
Classification through trees Building phase - splitting policies Pruning phase (to avoid over-fitting) For scalability: dynamic pruning clever data partitioning 15-721 Copyright: C. Faloutsos (2001)

66 Copyright: C. Faloutsos (2001)
Overall Conclusions Data Mining: of high commercial interest DM = DB+ ML+ Stat Data warehousing / OLAP: to get the data Tree classifiers (SLIQ, SPRINT) Association Rules - ‘a-priori’ algorithm (clustering: BIRCH, CURE, OPTICS) 15-721 Copyright: C. Faloutsos (2001)

67 Copyright: C. Faloutsos (2001)
Reading material Agrawal, R., T. Imielinski, A. Swami, ‘Mining Association Rules between Sets of Items in Large Databases’, SIGMOD 1993. M. Mehta, R. Agrawal and J. Rissanen, `SLIQ: A Fast Scalable Classifier for Data Mining', Proc. of the Fifth Int'l Conference on Extending Database Technology (EDBT), Avignon, France, March 1996 15-721 Copyright: C. Faloutsos (2001)

68 Additional references
Agrawal, R., S. Ghosh, et al. (Aug , 1992). An Interval Classifier for Database Mining Applications. VLDB Conf. Proc., Vancouver, BC, Canada. Jiawei Han and Micheline Kamber, Data Mining , Morgan Kaufman, 2001, chapters , , 7.3.5 15-721 Copyright: C. Faloutsos (2001)


Download ppt "Database Management Systems Data Mining - Data Cubes"

Similar presentations


Ads by Google