Carnegie Mellon Univ. Dept. of Computer Science 15-415 - Database Applications C. Faloutsos Data Warehousing / Data Mining

Carnegie Mellon C. Faloutsos 2 General Overview
Relational model – SQL; db design
Indexing; Q-opt; transaction processing
Advanced topics
–Distributed Databases
–RAID; Authorization / Stat. DB
–Spatial Access Methods
–Multimedia Indexing
–Data Warehousing / Data Mining

Carnegie Mellon C. Faloutsos 3 Data mining - detailed outline
Problem
Getting the data: Data Warehouses, DataCubes, OLAP
Supervised learning: decision trees
Unsupervised learning
–association rules
–(clustering)

Carnegie Mellon C. Faloutsos 4 Problem
Given: multiple data sources
Find: patterns (classifiers, rules, clusters, outliers...)
sales(p-id, c-id, date, $price)
customers(c-id, age, income, ...)
[figure: data sources in NY, SF, PGH; '???' marks the patterns to be found]

Carnegie Mellon C. Faloutsos 5 Data Warehousing
First step: collect the data in a single place (= Data Warehouse)
How?
How often?
How about discrepancies / non-homogeneities?

Carnegie Mellon C. Faloutsos 6 Data Warehousing
First step: collect the data in a single place (= Data Warehouse)
How? A: Triggers / Materialized views
How often? A: [Art!]
How about discrepancies / non-homogeneities? A: Wrappers / Mediators

Carnegie Mellon C. Faloutsos 7 Data Warehousing
Step 2: collect counts (DataCubes/OLAP). E.g.:

Carnegie Mellon C. Faloutsos 8 OLAP
Problem: "is it true that shirts in large sizes sell better in dark colors?"
[table: sales]

Carnegie Mellon C. Faloutsos 9-14 DataCubes
'color', 'size': DIMENSIONS
'count': MEASURE
[figure, built up over slides 9-14: the lattice of aggregates (grand total, counts by color, counts by size, and counts by color & size) together form the DataCube]

Carnegie Mellon C. Faloutsos 15 DataCubes
SQL query to generate DataCube: Naively (and painfully):
select size, color, count(*)
from sales
where p-id = 'shirt'
group by size, color

select size, count(*)
from sales
where p-id = 'shirt'
group by size
...

Carnegie Mellon C. Faloutsos 16 DataCubes
SQL query to generate DataCube, with the 'cube by' keyword:
select size, color, count(*)
from sales
where p-id = 'shirt'
cube by size, color
(in standard SQL and most current systems, the spelling is GROUP BY CUBE(size, color))
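For systems without the CUBE operator, the same aggregates can be assembled one group-by at a time. Below is a minimal, hedged Python sketch (pandas; the toy rows and column names are assumptions for illustration) that enumerates all 2^2 group-bys of the shirt sales:

    import pandas as pd
    from itertools import combinations

    # toy rows standing in for: select ... from sales where p-id = 'shirt'
    sales = pd.DataFrame({
        'size':  ['S', 'S', 'L', 'L', 'L'],
        'color': ['dark', 'bright', 'dark', 'dark', 'bright'],
    })

    dims = ['size', 'color']
    for k in range(len(dims), -1, -1):      # one group-by per subset of the dimensions
        for group in combinations(dims, k):
            if group:
                print(sales.groupby(list(group)).size())   # e.g. counts per (size, color)
            else:
                print('total:', len(sales))                # the grand-total cell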

Carnegie Mellon C. Faloutsos 17 DataCubes
DataCube issues:
Q1: How to store them (and/or materialize portions on demand)?
Q2: How to index them?
Q3: Which operations to allow?

Carnegie Mellon C. Faloutsos 18 DataCubes
DataCube issues:
Q1: How to store them (and/or materialize portions on demand)? A: ROLAP/MOLAP
Q2: How to index them? A: bitmap indices
Q3: Which operations to allow? A: roll-up, drill-down, slice, dice
[More details: book by Han+Kamber]

Carnegie Mellon C. Faloutsos 19 DataCubes Q1: How to store a dataCube?

Carnegie Mellon C. Faloutsos 20 DataCubes Q1: How to store a dataCube? A1: Relational (R-OLAP)

Carnegie Mellon C. Faloutsos 21 DataCubes Q1: How to store a dataCube? A2: Multi-dimensional (M-OLAP) A3: Hybrid (H-OLAP)

Carnegie Mellon C. Faloutsos 22-23 DataCubes
Pros/Cons: ROLAP strong points (DSS, Metacube):
–uses existing RDBMS technology
–scales up better with dimensionality

Carnegie Mellon C. Faloutsos 24 DataCubes
Pros/Cons: MOLAP strong points (EssBase / hyperion.com):
–faster indexing (careful with: high dimensionality; sparseness)
HOLAP (MS SQL Server OLAP services):
–detail data in ROLAP; summaries in MOLAP

Carnegie Mellon C. Faloutsos 25 DataCubes
Q1: How to store a dataCube?
Q2: What operations should we support?
Q3: How to index a dataCube?

Carnegie Mellon C. Faloutsos 26-30 DataCubes
Q2: What operations should we support?
[figures, slides 26-30: roll-up, drill-down, slice, and dice, each illustrated on the color/size lattice]

Carnegie Mellon C. Faloutsos 31 DataCubes
Q2: What operations should we support?
Roll-up
Drill-down
Slice
Dice
(Pivot/rotate; drill-across; drill-through; top-N; moving averages; etc.)
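As a hedged illustration of the first four operations, a small pandas sketch (the toy frame and its column names are assumptions, as before):

    import pandas as pd

    sales = pd.DataFrame({'size':  ['S', 'S', 'L', 'L', 'L'],
                          'color': ['dark', 'bright', 'dark', 'dark', 'bright']})
    cube = sales.groupby(['size', 'color']).size()     # finest-granularity cells

    rollup = cube.groupby(level='size').sum()          # roll-up: aggregate 'color' away
    sliced = cube.xs('dark', level='color')            # slice: fix color = 'dark'
    diced  = cube.loc[(['L'], ['dark', 'bright'])]     # dice: keep a sub-cube on both dims
    # drill-down is the inverse of roll-up: from `rollup` back to the detailed `cube`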

Carnegie Mellon C. Faloutsos 32 DataCubes
Q1: How to store a dataCube?
Q2: What operations should we support?
Q3: How to index a dataCube?

Carnegie Mellon C. Faloutsos 33 DataCubes Q3: How to index a dataCube?

Carnegie Mellon C. Faloutsos 34 DataCubes Q3: How to index a dataCube? A1: Bitmaps
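A minimal sketch of the bitmap idea (plain Python, using integers as bit-vectors; the toy column is an assumption):

    # one bitmap per distinct value of the indexed column; bit i <=> row i matches
    color_column = ['dark', 'bright', 'dark', 'dark', 'bright']   # toy data

    bitmaps = {}
    for i, value in enumerate(color_column):
        bitmaps[value] = bitmaps.get(value, 0) | (1 << i)

    # predicates become bitwise ops; counting set bits answers COUNT(*) queries
    dark = bitmaps['dark']
    print(bin(dark), bin(dark).count('1'))   # rows matching color = 'dark', and how many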

Carnegie Mellon C. Faloutsos 35 DataCubes Q3: How to index a dataCube? A2: Join indices (see [Han+Kamber])

Carnegie Mellon C. Faloutsos 36 D/W - OLAP - Conclusions
D/W: copy (summarized) data + analyze
OLAP - concepts:
–DataCube
–R/M/H-OLAP servers
–'dimensions'; 'measures'

Carnegie Mellon C. Faloutsos 37 Outline
Problem
Getting the data: Data Warehouses, DataCubes, OLAP
Supervised learning: decision trees
Unsupervised learning
–association rules
–(clustering)

Carnegie Mellon C. Faloutsos 38 Decision trees - Problem
[figure: a cloud of labeled points and a new point to classify, marked '??']

Carnegie Mellon C. Faloutsos 39 Decision trees
Pictorially, we have:
[figure: scatter plot; x-axis: num. attr#1 (e.g., 'age'); y-axis: num. attr#2 (e.g., chol-level)]

Carnegie Mellon C. Faloutsos 40 Decision trees
and we want to label '?':
[figure: the same scatter plot, with the unlabeled point '?' added]

Carnegie Mellon C. Faloutsos 41 Decision trees
so we build a decision tree:
[figure: the scatter plot partitioned by the splits attr#1 = 50 and attr#2 = 40]

Carnegie Mellon C. Faloutsos 42 Decision trees
so we build a decision tree:
age < 50?
 Y: '+'
 N: chol < 40?
     N: '-'
     Y: ...

Carnegie Mellon C. Faloutsos 43 Outline
Problem
Getting the data: Data Warehouses, DataCubes, OLAP
Supervised learning: decision trees
–problem
–approach
–scalability enhancements
Unsupervised learning
–association rules
–(clustering)

Carnegie Mellon C. Faloutsos 44 Decision trees
Typically, two steps:
–tree building
–tree pruning (to avoid over-training/over-fitting)

Carnegie Mellon C. Faloutsos 45 Tree building
How?
[figure: the labeled scatter plot again; num. attr#1 (e.g., 'age') vs. num. attr#2 (e.g., chol-level)]

Carnegie Mellon C. Faloutsos 46 Tree building
How? A: Partition, recursively - pseudocode:
Partition(Dataset S):
  if all points in S have same label then return
  evaluate splits along each attribute A
  pick best split, to divide S into S1 and S2
  Partition(S1); Partition(S2)
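A runnable Python rendering of this pseudocode; the entropy-based scoring and the (features, label) data layout are assumptions, one reasonable instantiation among several:

    from collections import Counter
    from math import log2

    def entropy(labels):
        n = len(labels)
        return -sum(c / n * log2(c / n) for c in Counter(labels).values())

    def partition(S, depth=0):
        """S: list of ((attr1, attr2, ...), label) pairs."""
        labels = [lbl for _, lbl in S]
        if len(set(labels)) <= 1:                     # all points share a label
            print('  ' * depth + f'leaf: {labels[0]}')
            return
        best = None
        for a in range(len(S[0][0])):                 # evaluate splits along each attribute
            for t in sorted({x[a] for x, _ in S}):
                S1 = [p for p in S if p[0][a] <= t]
                S2 = [p for p in S if p[0][a] > t]
                if not S1 or not S2:
                    continue
                cost = (len(S1) * entropy([l for _, l in S1]) +
                        len(S2) * entropy([l for _, l in S2]))
                if best is None or cost < best[0]:
                    best = (cost, a, t, S1, S2)       # pick best split
        if best is None:                              # identical points, mixed labels
            print('  ' * depth + f'leaf: {Counter(labels).most_common(1)[0][0]}')
            return
        _, a, t, S1, S2 = best
        print('  ' * depth + f'attr#{a} <= {t}?')
        partition(S1, depth + 1)                      # Partition(S1); Partition(S2)
        partition(S2, depth + 1)

    partition([((30, 35), '-'), ((45, 38), '-'), ((55, 60), '+'), ((62, 70), '+')])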

Carnegie Mellon C. Faloutsos 47 Tree building
Q1: how to introduce splits along attribute Ai
Q2: how to evaluate a split?

Carnegie Mellon C. Faloutsos 48 Tree building
Q1: how to introduce splits along attribute Ai
A1:
–for num. attributes: binary split, or multiple split
–for categorical attributes: compute all subsets (expensive!), or use a greedy algo

Carnegie Mellon C. Faloutsos 49 Tree building
Q1: how to introduce splits along attribute Ai
Q2: how to evaluate a split?

Carnegie Mellon C. Faloutsos 50 Tree building
Q1: how to introduce splits along attribute Ai
Q2: how to evaluate a split?
A: by how close to uniform each subset is - i.e., we need a measure of uniformity:

Carnegie Mellon C. Faloutsos 51 Tree building
entropy: H(p+, p-)
[plot: H as a function of p]
Any other measure?

Carnegie Mellon C. Faloutsos 52 Tree building
entropy: H(p+, p-)
'gini' index: 1 - p+^2 - p-^2
[plots: both measures as functions of p]

Carnegie Mellon C. Faloutsos 53 Tree building
entropy: H(p+, p-)
'gini' index: 1 - p+^2 - p-^2
(How about multiple labels?)

Carnegie Mellon C. Faloutsos 54 Tree building
Intuition:
entropy: #bits to encode the class label
gini: classification error, if we randomly guess '+' with prob. p+

Carnegie Mellon C. Faloutsos 55 Tree building
Thus, we choose the split that reduces entropy/classification-error the most. E.g.:
[figure: the (age, chol-level) scatter plot with a candidate split]

Carnegie Mellon C. Faloutsos 56 Tree building
Before the split: we need (n+ + n-) * H(p+, p-) = (7+6) * H(7/13, 6/13) bits total, to encode all the class labels.
After the split: we need 0 bits for the first half and (2+6) * H(2/8, 6/8) bits for the second half.
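Checking the slide's arithmetic with a short hedged sketch (base-2 logs; the 'first half' is assumed to be the pure region with the remaining 5 points):

    from math import log2

    def H(p):                                   # binary entropy, in bits
        return 0.0 if p in (0, 1) else -(p * log2(p) + (1 - p) * log2(1 - p))

    before = (7 + 6) * H(7 / 13)                # ~12.94 bits to encode all labels
    after  = 0 + (2 + 6) * H(2 / 8)             # pure half costs 0; ~6.49 bits remain
    print(before, after, 'saved:', before - after)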

Carnegie Mellon C. Faloutsos 57 Tree pruning
What for?
[figure: the (age, chol-level) scatter plot]

Carnegie Mellon C. Faloutsos 58 Tree pruning
Shortcut for scalability - DYNAMIC pruning: stop expanding the tree, if a node is 'reasonably' homogeneous
–ad-hoc threshold [Agrawal+, vldb92]
–Minimum Description Length (MDL) criterion (SLIQ) [Mehta+, edbt96]

Carnegie Mellon C. Faloutsos 59 Tree pruning
Q: How to do it?
A1: use a 'training' and a 'testing' set - prune nodes that improve classification in the 'testing' set. (Drawbacks?)
A2: or, rely on MDL (= Minimum Description Length) - in detail:

Carnegie Mellon C. Faloutsos 60 Tree pruning
envision the problem as compression (of what?)

Carnegie Mellon C. Faloutsos 61 Tree pruning
envision the problem as compression (of what?)
and try to minimize the # bits to compress (a) the class labels AND (b) the representation of the decision tree

Carnegie Mellon C. Faloutsos 62 (MDL)
a brilliant idea - e.g.: the best n-degree polynomial to compress these points is the one that minimizes (sum of errors + n)
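The same idea in a short numpy sketch; the toy data, the squared-error measure, and the one-unit-per-degree penalty weighting are all assumptions for illustration:

    import numpy as np

    rng = np.random.default_rng(0)
    x = np.linspace(0, 1, 30)
    y = 20 * x**2 - 10 * x + rng.normal(0, 0.5, x.size)   # toy points: quadratic + noise

    def mdl_cost(n):
        residuals = np.polyval(np.polyfit(x, y, n), x) - y
        return (residuals ** 2).sum() + n                  # (sum of errors) + (model size)

    print('chosen degree:', min(range(8), key=mdl_cost))   # the penalty favors degree ~2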

Carnegie Mellon C. Faloutsos 63 Outline
Problem
Getting the data: Data Warehouses, DataCubes, OLAP
Supervised learning: decision trees
–problem
–approach
–scalability enhancements
Unsupervised learning
–association rules
–(clustering)

Carnegie Mellon C. Faloutsos 64 Scalability enhancements
Interval Classifier [Agrawal+, vldb92]: dynamic pruning
SLIQ: dynamic pruning with MDL; vertical partitioning of the file (but the label column has to fit in core)
SPRINT: even more clever partitioning

Carnegie Mellon C. Faloutsos 65 Conclusions for classifiers
Classification through trees
Building phase - splitting policies
Pruning phase (to avoid over-fitting)
For scalability:
–dynamic pruning
–clever data partitioning

Carnegie Mellon C. Faloutsos 66 Outline
Problem
Getting the data: Data Warehouses, DataCubes, OLAP
Supervised learning: decision trees
–problem
–approach
–scalability enhancements
Unsupervised learning
–association rules
–(clustering)

Carnegie Mellon C. Faloutsos 67 Association rules - idea [Agrawal+SIGMOD93]
Consider the 'market basket' case:
(milk, bread)
(milk)
(milk, chocolate)
(milk, bread)
Find 'interesting things', e.g., rules of the form:
milk, bread -> chocolate | 90%

Carnegie Mellon C. Faloutsos 68 Association rules - idea
In general, for a given rule Ij, Ik, ... Im -> Ix | c
'c' = 'confidence': how often people buy Ix, given that they have bought Ij, ... Im
's' = support: how often people buy Ij, ... Im, Ix
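A hedged check of the two definitions against the four baskets from the previous slide (the 90% above is illustrative, not computed from them):

    baskets = [{'milk', 'bread'}, {'milk'}, {'milk', 'chocolate'}, {'milk', 'bread'}]

    def support(itemset):                        # fraction of baskets containing itemset
        return sum(itemset <= b for b in baskets) / len(baskets)

    lhs, rhs = {'milk'}, {'bread'}               # rule: milk -> bread
    s = support(lhs | rhs)                       # s = 2/4 = 0.5
    c = s / support(lhs)                         # c = 0.5 / 1.0 = 0.5
    print(f'support={s:.2f} confidence={c:.2f}')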

Carnegie Mellon C. Faloutsos 69 Association rules - idea
Problem definition: given
–a set of 'market baskets' (= binary matrix, of N rows/baskets and M columns/products)
–min-support 's' and
–min-confidence 'c'
find
–all the rules with higher support and confidence

Carnegie Mellon C. Faloutsos 70 Association rules - idea
Closely related concept: "large itemset"
Ij, Ik, ... Im, Ix is a 'large itemset' if it appears more than 'min-support' times
Observation: once we have a 'large itemset', we can find out the qualifying rules easily (how?)
Thus, let's focus on how to find 'large itemsets'

Carnegie Mellon C. Faloutsos 71 Association rules - idea
Naive solution: scan the database once, keeping 2**|I| counters
Drawback? Improvement?

Carnegie Mellon C. Faloutsos 72 Association rules - idea
Naive solution: scan the database once, keeping 2**|I| counters
Drawback? 2**1000 is prohibitive...
Improvement? scan the db |I| times, looking for 1-, 2-, etc., itemsets
E.g., for |I|=3 items only (A, B, C), we have:

Carnegie Mellon C. Faloutsos 73 Association rules - idea
[figure: first pass - counts of the 1-itemsets A, B, C, against min-sup = 10]

Carnegie Mellon C. Faloutsos 74 Association rules - idea
[figure: second pass - counts of the candidate 2-itemsets (A,B), (A,C), (B,C), against min-sup = 10]

Carnegie Mellon C. Faloutsos 75 Association rules - idea
Anti-monotonicity property: if an itemset fails to be 'large', so will every superset of it (hence all supersets can be pruned)
Sketch of the (famous!) 'a-priori' algorithm:
Let L(i-1) be the set of large itemsets with i-1 elements
Let C(i) be the set of candidate itemsets (of size i)

Carnegie Mellon C. Faloutsos 76 Association rules - idea
Compute L(1), by scanning the database.
Repeat, for i = 2, 3, ...:
–'join' L(i-1) with itself, to generate C(i); two itemsets can be joined if they agree on their first i-2 elements
–prune the itemsets of C(i) (how?)
–scan the db, finding the counts of the C(i) itemsets - set this to be L(i)
–unless L(i) is empty, repeat the loop
(see example 6.1 in [Han+Kamber])
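A compact runnable sketch of the loop above (plain Python; the simpler 'union of pairs' join stands in for the sorted first-(i-2)-elements join, and min_sup is an absolute count; both are assumptions for brevity):

    from itertools import combinations

    def apriori(baskets, min_sup):
        """Return {large itemset: count}, via the level-wise a-priori scan."""
        large = {}
        Ci = {frozenset([item]) for b in baskets for item in b}     # C(1): single items
        k = 1
        while Ci:
            counts = {s: sum(s <= b for b in baskets) for s in Ci}  # one scan of the db
            Li = {s for s, c in counts.items() if c >= min_sup}     # keep the large ones
            large.update({s: counts[s] for s in Li})
            k += 1
            Ci = {a | b for a in Li for b in Li if len(a | b) == k} # join step
            Ci = {c for c in Ci                                     # prune step:
                  if all(frozenset(sub) in large                    # every (k-1)-subset
                         for sub in combinations(c, k - 1))}        # must itself be large
        return large

    baskets = [{'milk', 'bread'}, {'milk'}, {'milk', 'chocolate'}, {'milk', 'bread'}]
    print(apriori(baskets, min_sup=2))   # {milk}: 4, {bread}: 2, {milk, bread}: 2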

Carnegie Mellon C. Faloutsos 77 Association rules - Conclusions
Association rules: a new tool to find patterns
easy to understand its output
fine-tuned algorithms exist
still an active area of research

Carnegie Mellon C. Faloutsos 78 Overall Conclusions
Data Mining: of high commercial interest
DM = DB + ML + Stat
Data warehousing / OLAP: to get the data
Tree classifiers (SLIQ, SPRINT)
Association Rules - the 'a-priori' algorithm
(clustering: BIRCH, CURE, OPTICS)

Carnegie Mellon C. Faloutsos 79 Reading material
Agrawal, R., T. Imielinski and A. Swami, 'Mining Association Rules between Sets of Items in Large Databases', SIGMOD 1993.
M. Mehta, R. Agrawal and J. Rissanen, 'SLIQ: A Fast Scalable Classifier for Data Mining', Proc. of the Fifth Int'l Conference on Extending Database Technology (EDBT), Avignon, France, March 1996.

Carnegie Mellon C. Faloutsos 80 Additional references
Agrawal, R., S. Ghosh, et al. (Aug. 1992). An Interval Classifier for Database Mining Applications. VLDB Conf. Proc., Vancouver, BC, Canada.
Jiawei Han and Micheline Kamber, Data Mining, Morgan Kaufman, 2001, chapters , , 7.3.5.