CMU SCS Carnegie Mellon Univ., Dept. of Computer Science, 15-415/615 – DB Applications: Data Warehousing / Data Mining (R&G, ch. 25 and 26)



Data mining - detailed outline
Problem
Getting the data: Data Warehouses, DataCubes, OLAP
Supervised learning: decision trees
Unsupervised learning
–association rules

Problem
Given: multiple data sources, e.g.
  sales(p-id, c-id, date, $price)
  customers(c-id, age, income, ...)
Find: patterns (classifiers, rules, clusters, outliers ...)
[figure: data sources spread over several sites (NY, SF, PGH) feeding one analysis]

Data Warehousing
First step: collect the data in a single place (= Data Warehouse)
How?
How often?
How about discrepancies / non-homogeneities?

Data Warehousing
First step: collect the data in a single place (= Data Warehouse)
How? A: Triggers / Materialized views
How often? A: [Art!]
How about discrepancies / non-homogeneities? A: Wrappers / Mediators

Data Warehousing
Step 2: collect counts (DataCubes / OLAP). E.g.:

OLAP
Problem: “is it true that shirts in large sizes sell better in dark colors?”
[example ‘sales’ table, elided in the transcript]

DataCubes
‘color’, ‘size’: DIMENSIONS
‘count’: MEASURE
[figure, built up over several slides: the DataCube as the lattice of all group-bys over the dimensions - (color, size), (color), (size), and the grand total]

DataCubes
SQL query to generate the DataCube - naively (and painfully):

  select size, color, count(*)
  from sales
  where p-id = 'shirt'
  group by size, color

  select size, count(*)
  from sales
  where p-id = 'shirt'
  group by size

  ... (and so on: one query per group-by combination, 2^d of them for d dimensions)

DataCubes
SQL query to generate the DataCube, with the ‘cube by’ keyword:

  select size, color, count(*)
  from sales
  where p-id = 'shirt'
  cube by size, color

(in the SQL:1999 standard this is spelled group by cube(size, color))
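To make the cube's semantics concrete, here is a minimal Python sketch (an illustration, not the lecture's code): it materializes one counter table per group-by combination, over a hypothetical list of (size, color) sale rows.

  from itertools import combinations
  from collections import Counter

  # hypothetical shirt sales, one row per sale; dimensions: (size, color)
  rows = [("L", "dark"), ("L", "dark"), ("L", "light"),
          ("M", "dark"), ("M", "light"), ("S", "light")]
  dims = ("size", "color")

  cube = {}
  for k in range(len(dims) + 1):
      for subset in combinations(range(len(dims)), k):
          # project each row on the chosen dimensions; '*' = aggregated away
          project = lambda r: tuple(r[i] if i in subset else "*"
                                    for i in range(len(dims)))
          cube[subset] = Counter(project(r) for r in rows)

  print(cube[(0, 1)][("L", "dark")])   # the 'group by size, color' cell -> 2
  print(cube[(0,)][("L", "*")])        # the 'group by size' cell        -> 3
  print(cube[()][("*", "*")])          # the grand total                 -> 6

Note that the loop ranges over all 2^d dimension subsets - exactly the 2^d group-bys that the naive SQL of the previous slide spells out by hand.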

DataCubes
DataCube issues:
Q1: How to store them (and/or materialize portions on demand)?
Q2: Which operations to allow?

DataCubes
DataCube issues:
Q1: How to store them (and/or materialize portions on demand)? A: ROLAP / MOLAP
Q2: Which operations to allow? A: roll-up, drill-down, slice, dice
[More details: book by Han+Kamber]

DataCubes
Q1: How to store a DataCube?
A1: Relational (R-OLAP)
A2: Multi-dimensional (M-OLAP)
A3: Hybrid (H-OLAP)

DataCubes
Pros/Cons - ROLAP strong points (DSS, Metacube):
use existing RDBMS technology
scales up better with dimensionality

DataCubes
Pros/Cons - MOLAP strong points (EssBase / hyperion.com):
faster indexing (careful with: high dimensionality; sparseness)
HOLAP (MS SQL Server OLAP services): detail data in ROLAP; summaries in MOLAP
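A toy sketch of the storage trade-off, assuming a two-dimensional cube with tiny domains: ROLAP keeps one tuple per non-empty cell, MOLAP a dense array addressed by coordinate offsets (constant-time lookups, but wasted space when the cube is sparse).

  sizes = ["S", "M", "L"]
  colors = ["dark", "light"]

  # ROLAP: counts as relational tuples (cell -> count), non-empty cells only
  rolap = {("L", "dark"): 2, ("L", "light"): 1, ("M", "dark"): 1}

  # MOLAP: dense |sizes| x |colors| array; the position encodes the coordinates
  molap = [[0] * len(colors) for _ in sizes]
  for (s, c), count in rolap.items():
      molap[sizes.index(s)][colors.index(c)] = count

  print(molap[sizes.index("L")][colors.index("dark")])   # O(1) array lookup -> 2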

DataCubes
Q1: How to store a DataCube?
Q2: What operations should we support?

DataCubes
Q2: What operations should we support? Roll-up, drill-down, slice, dice
[figure, one slide per operation: each illustrated on the (color, size) cube lattice]

DataCubes
Q2: What operations should we support?
Roll-up
Drill-down
Slice
Dice
(Pivot/rotate; drill-across; drill-through; top-N; moving averages; etc.)
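Three of these operations, sketched on a dict-of-cells cube like the one above (drill-down is the inverse of roll-up, so it needs access to the finer-grained data and is omitted here):

  from collections import Counter

  # finest-level cells: (size, color) -> count (hypothetical numbers)
  cells = {("S", "dark"): 1, ("M", "dark"): 4, ("L", "dark"): 9,
           ("S", "light"): 5, ("M", "light"): 3, ("L", "light"): 2}

  def roll_up(cells, dim):
      """Aggregate a dimension away, marking it '*'."""
      out = Counter()
      for key, count in cells.items():
          k = list(key); k[dim] = "*"
          out[tuple(k)] += count
      return dict(out)

  def slice_(cells, dim, value):          # '_' avoids the builtin slice()
      """Fix one dimension to a single value."""
      return {k: v for k, v in cells.items() if k[dim] == value}

  def dice(cells, dim, values):
      """Keep only a subset of values along a dimension."""
      return {k: v for k, v in cells.items() if k[dim] in values}

  print(roll_up(cells, 1))         # totals per size
  print(slice_(cells, 0, "L"))     # large shirts only
  print(dice(cells, 1, {"dark"}))  # dark colors only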

D/W - OLAP - Conclusions
D/W: copy (summarized) data + analyze
OLAP concepts:
–DataCube
–R/M/H-OLAP servers
–‘dimensions’; ‘measures’

Outline
Problem
Getting the data: Data Warehouses, DataCubes, OLAP
Supervised learning: decision trees
Unsupervised learning
–association rules
–(clustering)

Decision trees - Problem
[figure: points labeled ‘+’/‘-’ in two dimensions; a new point, marked ‘??’, must be labeled]

Decision trees
Pictorially, we have:
[figure: scatter plot with axes num. attr#1 (e.g., ‘age’) and num. attr#2 (e.g., chol-level)]

Decision trees
... and we want to label ‘?’:
[figure: the same scatter plot, plus the unlabeled query point ‘?’]

Decision trees
so we build a decision tree:
[figure: the scatter plot, partitioned at age = 50 and chol-level = 40]

Decision trees
so we build a decision tree:

  age < 50?
    Y: +
    N: chol-level < 40?
        N: -
        Y: ...
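The same tree as a hypothetical classifier; the thresholds 50 and 40, and which subtree is elided, are read off the figure:

  def classify(age, chol):
      # decision tree from the slide: split on age first, then on cholesterol
      if age < 50:
          return "+"
      if chol >= 40:
          return "-"
      return "?"   # this subtree is elided ('...') on the slide

  print(classify(35, 70))   # -> '+'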

Outline
Problem
Getting the data: Data Warehouses, DataCubes, OLAP
Supervised learning: decision trees
–problem
–approach
–scalability enhancements
Unsupervised learning
–association rules
–(clustering)

Decision trees
Typically, two steps:
–tree building
–tree pruning (to guard against over-training / over-fitting)

Tree building
How?
[figure: the labeled scatter plot again - num. attr#1 (e.g., ‘age’) vs num. attr#2 (e.g., chol-level)]

Tree building
How? A: Partition, recursively - pseudocode:

  Partition(Dataset S):
    if all points in S have the same label:
      return
    evaluate splits along each attribute A
    pick the best split, to divide S into S1 and S2
    Partition(S1); Partition(S2)
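A runnable Python version of this pseudocode - a sketch only, assuming numeric attributes, binary splits at observed values, and weighted entropy (see the next slides) as the split score:

  import math

  def entropy(labels):
      """H over a list of class labels, in bits."""
      out = 0.0
      for c in set(labels):
          p = labels.count(c) / len(labels)
          out -= p * math.log2(p)
      return out

  def partition(points, labels):
      """points: list of numeric tuples. Returns a nested-dict decision tree."""
      if len(set(labels)) == 1:           # all points have the same label
          return {"label": labels[0]}
      best = None                         # (weighted entropy, attribute, threshold)
      for a in range(len(points[0])):     # evaluate splits along each attribute
          for t in sorted({p[a] for p in points}):
              left = [l for p, l in zip(points, labels) if p[a] < t]
              right = [l for p, l in zip(points, labels) if p[a] >= t]
              if not left or not right:   # degenerate split, skip
                  continue
              score = (len(left) * entropy(left)
                       + len(right) * entropy(right)) / len(labels)
              if best is None or score < best[0]:
                  best = (score, a, t)
      if best is None:                    # identical points with mixed labels
          return {"label": max(set(labels), key=labels.count)}
      _, a, t = best                      # pick best split: S -> S1, S2
      S1 = [(p, l) for p, l in zip(points, labels) if p[a] < t]
      S2 = [(p, l) for p, l in zip(points, labels) if p[a] >= t]
      return {"attr": a, "thr": t,
              "left": partition([p for p, _ in S1], [l for _, l in S1]),
              "right": partition([p for p, _ in S2], [l for _, l in S2])}

E.g., partition([(30, 80), (60, 30), (65, 90)], ['+', '-', '+']) returns a single split on attribute 1 at threshold 80, with pure leaves on both sides.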

Tree building [details]
Q1: how to introduce splits along attribute Ai?
Q2: how to evaluate a split?

Tree building [details]
Q1: how to introduce splits along attribute Ai?
A1:
–for numerical attributes: binary split, or multi-way split
–for categorical attributes: compute all subsets (expensive!), or use a greedy algorithm


Tree building [details]
Q2: how to evaluate a split?
A: by how close to uniform each subset is - i.e., we need a measure of uniformity:

Tree building [details]
entropy: H(p+, p-) = -p+ log2(p+) - p- log2(p-)
‘gini’ index: 1 - p+^2 - p-^2
[figure: both plotted as a function of p+; each peaks at p+ = 1/2]
(Any other measure? How about multiple labels?)

Tree building [details]
Intuition:
entropy: #bits to encode the class label
gini: classification error, if we randomly guess ‘+’ with prob. p+

Tree building [details]
Thus, we choose the split that reduces entropy / classification error the most. E.g.:
[figure: the (age, chol-level) scatter plot with the candidate split highlighted]

Tree building [details]
Before the split, we need
  (n+ + n-) * H(p+, p-) = (7+6) * H(7/13, 6/13)
bits in total to encode all the class labels.
After the split, we need
  0 bits for the first half (it is pure), and
  (2+6) * H(2/8, 6/8) bits for the second half.
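Checking the slide's arithmetic in a few lines (H is the binary entropy, in bits):

  import math

  def H(p):
      """Binary entropy H(p, 1-p), in bits."""
      return 0.0 if p in (0, 1) else -p * math.log2(p) - (1 - p) * math.log2(1 - p)

  before = (7 + 6) * H(7 / 13)     # 13 labels: 7 '+' and 6 '-'
  after = 0 + (2 + 6) * H(2 / 8)   # pure first half: 0 bits; second: 2 '+', 6 '-'
  print(round(before, 2), round(after, 2))   # 12.94 vs 6.49: the split pays off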

Tree pruning [details]
What for?
[figure: an over-grown tree carving the (age, chol-level) scatter plot into tiny regions]

Tree pruning [details]
Shortcut for scalability - DYNAMIC pruning: stop expanding the tree if a node is ‘reasonably’ homogeneous
–ad-hoc threshold [Agrawal+, vldb92]
–(Minimum Description Length (MDL) criterion (SLIQ) [Mehta+, edbt96])

Tree pruning [details]
Q: How to do it?
A1: use a ‘training’ and a ‘testing’ set - prune nodes that improve classification on the ‘testing’ set. (Drawbacks?)
(A2: or, rely on MDL (= Minimum Description Length))
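A minimal sketch of A1 (often called reduced-error pruning), assuming the nested-dict trees built by the partition() sketch earlier: collapse a subtree into a leaf whenever the leaf does no worse on the held-out set.

  def predict(tree, point):
      """Walk a nested-dict tree down to a leaf label."""
      while "label" not in tree:
          tree = tree["left"] if point[tree["attr"]] < tree["thr"] else tree["right"]
      return tree["label"]

  def errors(tree, points, labels):
      return sum(predict(tree, p) != l for p, l in zip(points, labels))

  def prune(tree, points, labels):
      """Prune bottom-up, using held-out (points, labels)."""
      if "label" in tree or not points:
          return tree
      a, t = tree["attr"], tree["thr"]
      S1 = [(p, l) for p, l in zip(points, labels) if p[a] < t]
      S2 = [(p, l) for p, l in zip(points, labels) if p[a] >= t]
      tree["left"] = prune(tree["left"], [p for p, _ in S1], [l for _, l in S1])
      tree["right"] = prune(tree["right"], [p for p, _ in S2], [l for _, l in S2])
      # candidate: a leaf with the majority label of the held-out points here
      # (a simplification; classic variants use the training-set majority)
      leaf = {"label": max(set(labels), key=labels.count)}
      return leaf if errors(leaf, points, labels) <= errors(tree, points, labels) else tree

One drawback the slide may be hinting at: the held-out data is ‘spent’ on pruning, leaving less data for building the tree itself.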

Outline
Problem
Getting the data: Data Warehouses, DataCubes, OLAP
Supervised learning: decision trees
–problem
–approach
–scalability enhancements
Unsupervised learning
–association rules
–(clustering)

Scalability enhancements [details]
Interval Classifier [Agrawal+, vldb92]: dynamic pruning
SLIQ: dynamic pruning with MDL; vertical partitioning of the file (but the label column has to fit in core)
SPRINT: even more clever partitioning

Conclusions for classifiers
Classification through trees
Building phase - splitting policies
Pruning phase (to avoid over-fitting)
For scalability:
–dynamic pruning
–clever data partitioning

Outline
Problem
Getting the data: Data Warehouses, DataCubes, OLAP
Supervised learning: decision trees
–problem
–approach
–scalability enhancements
Unsupervised learning
–association rules
–(clustering)

Association rules - idea [Agrawal+ SIGMOD93]
Consider the ‘market basket’ case:
  (milk, bread)
  (milk)
  (milk, chocolate)
  (milk, bread)
Find ‘interesting things’, e.g., rules of the form:
  milk, bread -> chocolate | 90%

Association rules - idea
In general, for a given rule
  Ij, Ik, ... Im -> Ix | c
‘c’ = ‘confidence’: how often people buy Ix, given that they have bought Ij, ... Im
‘s’ = ‘support’: how often people buy Ij, ... Im, Ix together

Association rules - idea
Problem definition - given:
–a set of ‘market baskets’ (= a binary matrix, of N rows/baskets and M columns/products)
–min-support ‘s’, and
–min-confidence ‘c’
find:
–all the rules with at least that much support and confidence
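In code, with the four baskets from the ‘market basket’ slide, support and confidence are one-liners (a sketch of the definitions, not the paper's algorithm):

  baskets = [{"milk", "bread"}, {"milk"}, {"milk", "chocolate"}, {"milk", "bread"}]

  def support(itemset):
      """Fraction of baskets that contain every item of the itemset."""
      return sum(itemset <= b for b in baskets) / len(baskets)

  def confidence(lhs, rhs):
      """How often rhs is bought, given that lhs was bought."""
      return support(lhs | rhs) / support(lhs)

  print(support({"milk", "bread"}))        # 0.5
  print(confidence({"milk"}, {"bread"}))   # 0.5, i.e., milk -> bread | 50%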

Association rules - idea
Closely related concept: ‘large itemset’
Ij, Ik, ... Im, Ix is a ‘large itemset’ if it appears more than ‘min-support’ times
Observation: once we have a ‘large itemset’, we can find the qualifying rules easily (how? see the sketch below)
Thus, let’s focus on how to find ‘large itemsets’
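The ‘how?’ made concrete: split the large itemset into every possible left- and right-hand side and keep the rules that clear min-confidence (this sketch reuses support() from above):

  from itertools import combinations

  def rules_from(itemset, min_conf):
      """All rules lhs -> rhs derivable from one large itemset."""
      items = frozenset(itemset)
      for k in range(1, len(items)):
          for lhs in (frozenset(c) for c in combinations(items, k)):
              rhs = items - lhs
              conf = support(lhs | rhs) / support(lhs)
              if conf >= min_conf:
                  yield set(lhs), set(rhs), conf

  for lhs, rhs, conf in rules_from({"milk", "bread"}, 0.5):
      print(lhs, "->", rhs, "|", f"{conf:.0%}")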

Association rules - idea
Naive solution: scan the database once, keeping 2**|I| counters
Drawback? Improvement?

Association rules - idea
Naive solution: scan the database once, keeping 2**|I| counters
Drawback? 2**1000 counters is prohibitive ...
Improvement? scan the db |I| times, looking for 1-, 2-, etc. itemsets
E.g., for |I| = 3 items only (A, B, C), we have:

Association rules - idea
[figure: first pass - one counter for each single item A, B, C; min-sup: 10]

Association rules - idea
[figure: second pass - counters only for the pairs (A,B), (A,C), (B,C) built from items that survived the first pass; min-sup: 10]

Association rules - idea
Anti-monotonicity property: if an itemset fails to be ‘large’, so will every superset of it (hence all its supersets can be pruned)
Sketch of the (famous!) ‘a-priori’ algorithm:
Let L(i-1) be the set of large itemsets with i-1 elements
Let C(i) be the set of candidate itemsets of size i

Association rules - idea
Compute L(1), by scanning the database.
Repeat, for i = 2, 3, ...:
–‘join’ L(i-1) with itself, to generate C(i); two itemsets can be joined if they agree on their first i-2 elements
–prune the itemsets of C(i) (how? drop any candidate with a non-large (i-1)-subset - anti-monotonicity)
–scan the db, finding the counts of the surviving C(i) itemsets; those above min-support form L(i)
–unless L(i) is empty, repeat the loop
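The whole loop as a compact, runnable sketch; the join step is simplified to set unions of size i (the ‘agree on the first i-2 elements’ trick is an optimization of exactly this generation step):

  from itertools import combinations

  def apriori(baskets, min_sup):
      """min_sup is an absolute count; returns {large itemset: count}."""
      counts = {}
      level = []
      for item in {i for b in baskets for i in b}:   # L(1): first scan
          c = sum(item in b for b in baskets)
          if c >= min_sup:
              counts[frozenset([item])] = c
              level.append(frozenset([item]))
      i = 2
      while level:
          # join: L(i-1) with itself, keeping unions of size i
          cands = {a | b for a in level for b in level if len(a | b) == i}
          # prune: every (i-1)-subset must itself be large (anti-monotonicity)
          cands = {c for c in cands
                   if all(frozenset(s) in counts for s in combinations(c, i - 1))}
          level = []
          for c in cands:                            # scan the db, count survivors
              n = sum(c <= b for b in baskets)
              if n >= min_sup:                       # this is L(i)
                  counts[c] = n
                  level.append(c)
          i += 1
      return counts

  baskets = [{"milk", "bread"}, {"milk"}, {"milk", "chocolate"}, {"milk", "bread"}]
  print(apriori(baskets, min_sup=2))   # {milk}: 4, {bread}: 2, {milk, bread}: 2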

Association rules - Conclusions
Association rules: a great tool to find patterns
easy to understand its output
fine-tuned algorithms exist

Overall Conclusions
Data Mining = ‘Big Data’ Analytics = Business Intelligence:
–of high commercial, government and research interest
DM = DB + ML + Stat + Sys
Data warehousing / OLAP: to get the data
Tree classifiers (SLIQ, SPRINT)
Association Rules - the ‘a-priori’ algorithm
(clustering: BIRCH, CURE, OPTICS)

Reading material
Agrawal, R., T. Imielinski and A. Swami, ‘Mining Association Rules between Sets of Items in Large Databases’, SIGMOD 1993.
M. Mehta, R. Agrawal and J. Rissanen, ‘SLIQ: A Fast Scalable Classifier for Data Mining’, Proc. of the Fifth Int’l Conference on Extending Database Technology (EDBT), Avignon, France, March 1996.

Additional references
Agrawal, R., S. Ghosh, et al. (Aug. 1992). ‘An Interval Classifier for Database Mining Applications’. VLDB Conf. Proc., Vancouver, BC, Canada.
Jiawei Han and Micheline Kamber, Data Mining: Concepts and Techniques, Morgan Kaufmann, 2001, chapters ..., ..., 7.3.5.