CMU SCS Carnegie Mellon Univ. Dept. of Computer Science 15-415/615 – DB Applications Lecture # 24: Data Warehousing / Data Mining (R&G, ch 25 and 26)

Slides:



Advertisements
Similar presentations
Recap: Mining association rules from large datasets
Advertisements

Huffman Codes and Asssociation Rules (II) Prof. Sin-Min Lee Department of Computer Science.
Frequent Itemset Mining Methods. The Apriori algorithm Finding frequent itemsets using candidate generation Seminal algorithm proposed by R. Agrawal and.
Data Mining Techniques Association Rule
An Array-Based Algorithm for Simultaneous Multidimensional Aggregates By Yihong Zhao, Prasad M. Desphande and Jeffrey F. Naughton Presented by Kia Hall.
OLAP Tuning. Outline OLAP 101 – Data warehouse architecture – ROLAP, MOLAP and HOLAP Data Cube – Star Schema and operations – The CUBE operator – Tuning.
LOGO Association Rule Lecturer: Dr. Bo Yuan
1 Data Mining Classification Techniques: Decision Trees (BUSINESS INTELLIGENCE) Slides prepared by Elizabeth Anglo, DISCS ADMU.
ICDM'06 Panel 1 Apriori Algorithm Rakesh Agrawal Ramakrishnan Srikant (description by C. Faloutsos)
SLIQ: A Fast Scalable Classifier for Data Mining Manish Mehta, Rakesh Agrawal, Jorma Rissanen Presentation by: Vladan Radosavljevic.
Association Analysis. Association Rule Mining: Definition Given a set of records each of which contain some number of items from a given collection; –Produce.
Chapter 5: Mining Frequent Patterns, Association and Correlations
Data Mining Techniques So Far: Cluster analysis K-means Classification Decision Trees J48 (C4.5) Rule-based classification JRIP (RIPPER) Logistic Regression.
Data Mining Association Analysis: Basic Concepts and Algorithms Introduction to Data Mining by Tan, Steinbach, Kumar © Tan,Steinbach, Kumar Introduction.
Data Mining Association Analysis: Basic Concepts and Algorithms Lecture Notes for Chapter 6 Introduction to Data Mining by Tan, Steinbach, Kumar © Tan,Steinbach,
Data Mining Association Analysis: Basic Concepts and Algorithms Lecture Notes for Chapter 6 Introduction to Data Mining by Tan, Steinbach, Kumar © Tan,Steinbach,
Data Mining Association Analysis: Basic Concepts and Algorithms Lecture Notes for Chapter 6 Introduction to Data Mining by Tan, Steinbach, Kumar © Tan,Steinbach,
Data Mining Association Analysis: Basic Concepts and Algorithms
Classification and Prediction
Spatial and Temporal Data Mining V. Megalooikonomou Introduction to Decision Trees ( based on notes by Jiawei Han and Micheline Kamber and on notes by.
Association Analysis: Basic Concepts and Algorithms.
Data Mining Association Analysis: Basic Concepts and Algorithms
Carnegie Mellon Carnegie Mellon Univ. Dept. of Computer Science Database Applications C. Faloutsos Data Warehousing / Data Mining.
Association Rule Mining (Some material adapted from: Mining Sequential Patterns by Karuna Pande Joshi)‏
Fast Algorithms for Association Rule Mining
Mining Association Rules
Classification.
CSE6011 Warehouse Models & Operators  Data Models  relations  stars & snowflakes  cubes  Operators  slice & dice  roll-up, drill down  pivoting.
Online Analytical Processing (OLAP) Hweichao Lu CS157B-02 Spring 2007.
Chapter 7 Decision Tree.
1 © Goharian & Grossman 2003 Introduction to Data Mining (CS 422) Fall 2010.
Mining Association Rules between Sets of Items in Large Databases presented by Zhuang Wang.
Apriori algorithm Seminar of Popular Algorithms in Data Mining and Machine Learning, TKK Presentation Lauri Lahti.
Data Mining Association Analysis: Basic Concepts and Algorithms Lecture Notes for Chapter 6 Introduction to Data Mining by Tan, Steinbach, Kumar © Tan,Steinbach,
Data Mining Association Analysis: Basic Concepts and Algorithms Lecture Notes for Chapter 6 Introduction to Data Mining By Tan, Steinbach, Kumar Lecture.
Modul 7: Association Analysis. 2 Association Rule Mining  Given a set of transactions, find rules that will predict the occurrence of an item based on.
OnLine Analytical Processing (OLAP)
OLAP & DSS SUPPORT IN DATA WAREHOUSE By - Pooja Sinha Kaushalya Bakde.
CSE4334/5334 DATA MINING CSE4334/5334 Data Mining, Fall 2014 Department of Computer Science and Engineering, University of Texas at Arlington Chengkai.
Association Rule Mining Data Mining and Knowledge Discovery Prof. Carolina Ruiz and Weiyang Lin Department of Computer Science Worcester Polytechnic Institute.
Winter 2006Winter 2002 Keller, Ullman, CushingJudy Cushing 19–1 Warehousing The most common form of information integration: copy sources into a single.
Chapter 20 Data Analysis and Mining. 2 n Decision Support Systems  Obtain high-level information out of detailed information stored in (DB) transaction-processing.
Data Mining Find information from data data ? information.
CMU SCS Carnegie Mellon Univ. Dept. of Computer Science /615 – DB Applications Data Warehousing / Data Mining (R&G, ch 25 and 26)
© Tan,Steinbach, Kumar Introduction to Data Mining 4/18/ Data Mining: Association Analysis This lecture node is modified based on Lecture Notes for.
Security in Outsourced Association Rule Mining. Agenda  Introduction  Approximate randomized technique  Encryption  Summary and future work.
CIS671-Knowledge Discovery and Data Mining Vasileios Megalooikonomou Dept. of Computer and Information Sciences Temple University AI reminders (based on.
CMU SCS Carnegie Mellon Univ. Dept. of Computer Science /615 – DB Applications Data Warehousing / Data Mining (R&G, ch 25 and 26) C. Faloutsos and.
CMU SCS : Multimedia Databases and Data Mining Lecture #30: Data Mining - assoc. rules C. Faloutsos.
An Interval Classifier for Database Mining Applications Rakes Agrawal, Sakti Ghosh, Tomasz Imielinski, Bala Iyer, Arun Swami Proceedings of the 18 th VLDB.
Data Mining  Association Rule  Classification  Clustering.
Data Mining Association Analysis: Basic Concepts and Algorithms Lecture Notes for Chapter 6 Introduction to Data Mining by Tan, Steinbach, Kumar © Tan,Steinbach,
Chapter 8 Association Rules. Data Warehouse and Data Mining Chapter 10 2 Content Association rule mining Mining single-dimensional Boolean association.
1 Data Mining Lecture 6: Association Analysis. 2 Association Rule Mining l Given a set of transactions, find rules that will predict the occurrence of.
1 Top Down FP-Growth for Association Rule Mining By Ke Wang.
Chapter 3 Data Mining: Classification & Association Chapter 4 in the text box Section: 4.3 (4.3.1),
Data Mining Association Analysis: Basic Concepts and Algorithms
Association rule mining
Frequent Pattern Mining
Data Warehousing / Data Mining
Database Management Systems Data Mining - Data Cubes
Data Mining Association Analysis: Basic Concepts and Algorithms
Data Mining Association Analysis: Basic Concepts and Algorithms
Association Rule Mining
Data Mining Association Analysis: Basic Concepts and Algorithms
I don’t need a title slide for a lecture
©Jiawei Han and Micheline Kamber
©Jiawei Han and Micheline Kamber
15-826: Multimedia Databases and Data Mining
Association Analysis: Basic Concepts
Presentation transcript:

CMU SCS Carnegie Mellon Univ. Dept. of Computer Science /615 – DB Applications Lecture # 24: Data Warehousing / Data Mining (R&G, ch 25 and 26)

CMU SCS Faloutsos & PavloCMU SCS /6152 Data mining - detailed outline Problem Getting the data: Data Warehouses, DataCubes, OLAP Supervised learning: decision trees Unsupervised learning –association rules –(clustering)

CMU SCS Faloutsos & PavloCMU SCS /6153 Problem Given: multiple data sources Find: patterns (classifiers, rules, clusters, outliers...) sales(p-id, c-id, date, $price) customers( c-id, age, income,...) NY SF ??? PGH

CMU SCS Faloutsos & PavloCMU SCS /6154 Data Ware-housing First step: collect the data, in a single place (= Data Warehouse) How? How often? How about discrepancies / non- homegeneities?

CMU SCS Faloutsos & PavloCMU SCS /6155 Data Ware-housing First step: collect the data, in a single place (= Data Warehouse) How? A: Triggers/Materialized views How often? A: [Art!] How about discrepancies / non- homegeneities? A: Wrappers/Mediators

CMU SCS Faloutsos & PavloCMU SCS /6156 Data Ware-housing Step 2: collect counts. (DataCubes/OLAP) Eg.:

CMU SCS Faloutsos & PavloCMU SCS /6157 OLAP Problem: “is it true that shirts in large sizes sell better in dark colors?” sales...

CMU SCS Faloutsos & PavloCMU SCS /6158 DataCubes ‘color’, ‘size’: DIMENSIONS ‘count’: MEASURE  color size color; size

CMU SCS Faloutsos & PavloCMU SCS /6159 DataCubes ‘color’, ‘size’: DIMENSIONS ‘count’: MEASURE  color size color; size

CMU SCS Faloutsos & PavloCMU SCS /61510 DataCubes ‘color’, ‘size’: DIMENSIONS ‘count’: MEASURE  color size color; size

CMU SCS Faloutsos & PavloCMU SCS /61511 DataCubes ‘color’, ‘size’: DIMENSIONS ‘count’: MEASURE  color size color; size

CMU SCS Faloutsos & PavloCMU SCS /61512 DataCubes ‘color’, ‘size’: DIMENSIONS ‘count’: MEASURE  color size color; size

CMU SCS Faloutsos & PavloCMU SCS /61513 DataCubes ‘color’, ‘size’: DIMENSIONS ‘count’: MEASURE  color size color; size DataCube

CMU SCS Faloutsos & PavloCMU SCS /61514 DataCubes SQL query to generate DataCube: Naively (and painfully:) select size, color, count(*) from sales where p-id = ‘shirt’ group by size, color select size, count(*) from sales where p-id = ‘shirt’ group by size...

CMU SCS Faloutsos & PavloCMU SCS /61515 DataCubes SQL query to generate DataCube: with ‘cube by’ keyword: select size, color, count(*) from sales where p-id = ‘shirt’ cube by size, color

CMU SCS Faloutsos & PavloCMU SCS /61516 DataCubes DataCube issues: Q1: How to store them (and/or materialize portions on demand) Q2: Which operations to allow

CMU SCS Faloutsos & PavloCMU SCS /61517 DataCubes DataCube issues: Q1: How to store them (and/or materialize portions on demand) A: ROLAP/MOLAP Q2: Which operations to allow A: roll-up, drill down, slice, dice [More details: book by Han+Kamber]

CMU SCS Faloutsos & PavloCMU SCS /61518 DataCubes Q1: How to store a dataCube?

CMU SCS Faloutsos & PavloCMU SCS /61519 DataCubes Q1: How to store a dataCube? A1: Relational (R-OLAP)

CMU SCS Faloutsos & PavloCMU SCS /61520 DataCubes Q1: How to store a dataCube? A2: Multi-dimensional (M-OLAP) A3: Hybrid (H-OLAP)

CMU SCS Faloutsos & PavloCMU SCS /61521 DataCubes Pros/Cons: ROLAP strong points: (DSS, Metacube)

CMU SCS Faloutsos & PavloCMU SCS /61522 DataCubes Pros/Cons: ROLAP strong points: (DSS, Metacube) use existing RDBMS technology scale up better with dimensionality

CMU SCS Faloutsos & PavloCMU SCS /61523 DataCubes Pros/Cons: MOLAP strong points: (EssBase/hyperion.com) faster indexing (careful with: high-dimensionality; sparseness) HOLAP: (MS SQL server OLAP services) detail data in ROLAP; summaries in MOLAP

CMU SCS Faloutsos & PavloCMU SCS /61524 DataCubes Q1: How to store a dataCube Q2: What operations should we support?

CMU SCS Faloutsos & PavloCMU SCS /61525 DataCubes Q2: What operations should we support?  color size color; size

CMU SCS Faloutsos & PavloCMU SCS /61526 DataCubes Q2: What operations should we support? Roll-up  color size color; size

CMU SCS Faloutsos & PavloCMU SCS /61527 DataCubes Q2: What operations should we support? Drill-down  color size color; size

CMU SCS Faloutsos & PavloCMU SCS /61528 DataCubes Q2: What operations should we support? Slice  color size color; size

CMU SCS Faloutsos & PavloCMU SCS /61529 DataCubes Q2: What operations should we support? Dice  color size color; size

CMU SCS Faloutsos & PavloCMU SCS /61530 DataCubes Q2: What operations should we support? Roll-up Drill-down Slice Dice (Pivot/rotate; drill-across; drill-through top N moving averages, etc)

CMU SCS Faloutsos & PavloCMU SCS /61531 D/W - OLAP - Conclusions D/W: copy (summarized) data + analyze OLAP - concepts: –DataCube –R/M/H-OLAP servers –‘dimensions’; ‘measures’

CMU SCS Faloutsos & PavloCMU SCS /61532 Outline Problem Getting the data: Data Warehouses, DataCubes, OLAP Supervised learning: decision trees Unsupervised learning –association rules –(clustering)

CMU SCS Faloutsos & PavloCMU SCS /61533 Decision trees - Problem ??

CMU SCS Faloutsos & PavloCMU SCS /61534 Decision trees Pictorially, we have num. attr#1 (eg., ‘age’) num. attr#2 (eg., chol-level)

CMU SCS Faloutsos & PavloCMU SCS /61535 Decision trees and we want to label ‘?’ num. attr#1 (eg., ‘age’) num. attr#2 (eg., chol-level) ?

CMU SCS Faloutsos & PavloCMU SCS /61536 Decision trees so we build a decision tree: num. attr#1 (eg., ‘age’) num. attr#2 (eg., chol-level) ? 50 40

CMU SCS Faloutsos & PavloCMU SCS /61537 Decision trees so we build a decision tree: age<50 Y + chol. <40 N -... Y N

CMU SCS Faloutsos & PavloCMU SCS /61538 Outline Problem Getting the data: Data Warehouses, DataCubes, OLAP Supervised learning: decision trees –problem –approach –scalability enhancements Unsupervised learning –association rules –(clustering)

CMU SCS Faloutsos & PavloCMU SCS /61539 Decision trees Typically, two steps: –tree building –tree pruning (for over-training/over-fitting)

CMU SCS Faloutsos & PavloCMU SCS /61540 Tree building How? num. attr#1 (eg., ‘age’) num. attr#2 (eg., chol-level)

CMU SCS Faloutsos & PavloCMU SCS /61541 Tree building How? A: Partition, recursively - pseudocode: Partition ( Dataset S) if all points in S have same label then return evaluate splits along each attribute A pick best split, to divide S into S1 and S2 Partition(S1); Partition(S2)

CMU SCS Faloutsos & PavloCMU SCS /61542 Tree building Q1: how to introduce splits along attribute A i Q2: how to evaluate a split? Not In Exam=N.I.E.

CMU SCS Faloutsos & PavloCMU SCS /61543 Tree building Q1: how to introduce splits along attribute A i A1: –for num. attributes: binary split, or multiple split –for categorical attributes: compute all subsets (expensive!), or use a greedy algo N.I.E.

CMU SCS Faloutsos & PavloCMU SCS /61544 Tree building Q1: how to introduce splits along attribute A i Q2: how to evaluate a split? N.I.E.

CMU SCS Faloutsos & PavloCMU SCS /61545 Tree building Q1: how to introduce splits along attribute A i Q2: how to evaluate a split? A: by how close to uniform each subset is - ie., we need a measure of uniformity: N.I.E.

CMU SCS Faloutsos & PavloCMU SCS /61546 Tree building entropy: H(p+, p-) p Any other measure? N.I.E.

CMU SCS Faloutsos & PavloCMU SCS /61547 Tree building entropy: H(p +, p - ) p ‘gini’ index: 1-p p - 2 p N.I.E.

CMU SCS Faloutsos & PavloCMU SCS /61548 Tree building entropy: H(p +, p - ) ‘gini’ index: 1-p p - 2 (How about multiple labels?) N.I.E.

CMU SCS Faloutsos & PavloCMU SCS /61549 Tree building Intuition: entropy: #bits to encode the class label gini: classification error, if we randomly guess ‘+’ with prob. p + N.I.E.

CMU SCS Faloutsos & PavloCMU SCS /61550 Tree building Thus, we choose the split that reduces entropy/classification-error the most: Eg.: num. attr#1 (eg., ‘age’) num. attr#2 (eg., chol-level) N.I.E.

CMU SCS Faloutsos & PavloCMU SCS /61551 Tree building Before split: we need (n + + n - ) * H( p +, p - ) = (7+6) * H(7/13, 6/13) bits total, to encode all the class labels After the split we need: 0 bits for the first half and (2+6) * H(2/8, 6/8) bits for the second half N.I.E.

CMU SCS Faloutsos & PavloCMU SCS /61552 Tree pruning What for? num. attr#1 (eg., ‘age’) num. attr#2 (eg., chol-level) N.I.E.

CMU SCS Faloutsos & PavloCMU SCS /61553 Tree pruning Shortcut for scalability: DYNAMIC pruning: stop expanding the tree, if a node is ‘reasonably’ homogeneous –ad hoc threshold [Agrawal+, vldb92] –( Minimum Description Language (MDL) criterion (SLIQ) [Mehta+, edbt96] ) N.I.E.

CMU SCS Faloutsos & PavloCMU SCS /61554 Tree pruning Q: How to do it? A1: use a ‘training’ and a ‘testing’ set - prune nodes that improve classification in the ‘testing’ set. (Drawbacks?) (A2: or, rely on MDL (= Minimum Description Language) ) N.I.E.

CMU SCS Faloutsos & PavloCMU SCS /61555 Outline Problem Getting the data: Data Warehouses, DataCubes, OLAP Supervised learning: decision trees –problem –approach –scalability enhancements Unsupervised learning –association rules –(clustering) N.I.E.

CMU SCS Faloutsos & PavloCMU SCS /61556 Scalability enhancements Interval Classifier [Agrawal+,vldb92]: dynamic pruning SLIQ: dynamic pruning with MDL; vertical partitioning of the file (but label column has to fit in core) SPRINT: even more clever partitioning N.I.E.

CMU SCS Faloutsos & PavloCMU SCS /61557 Conclusions for classifiers Classification through trees Building phase - splitting policies Pruning phase (to avoid over-fitting) For scalability: –dynamic pruning –clever data partitioning

CMU SCS Faloutsos & PavloCMU SCS /61558 Outline Problem Getting the data: Data Warehouses, DataCubes, OLAP Supervised learning: decision trees –problem –approach –scalability enhancements Unsupervised learning –association rules –(clustering)

CMU SCS Faloutsos & PavloCMU SCS /61559 Association rules - idea [Agrawal+SIGMOD93] Consider ‘market basket’ case: (milk, bread) (milk) (milk, chocolate) (milk, bread) Find ‘interesting things’, eg., rules of the form: milk, bread -> chocolate | 90%

CMU SCS Faloutsos & PavloCMU SCS /61560 Association rules - idea In general, for a given rule Ij, Ik,... Im -> Ix | c ‘c’ = ‘confidence’ (how often people by Ix, given that they have bought Ij,... Im ‘s’ = support: how often people buy Ij,... Im, Ix

CMU SCS Faloutsos & PavloCMU SCS /61561 Association rules - idea Problem definition: given – a set of ‘market baskets’ (=binary matrix, of N rows/baskets and M columns/products) –min-support ‘s’ and –min-confidence ‘c’ find –all the rules with higher support and confidence

CMU SCS Faloutsos & PavloCMU SCS /61562 Association rules - idea Closely related concept: “large itemset” Ij, Ik,... Im, Ix is a ‘large itemset’, if it appears more than ‘min- support’ times Observation: once we have a ‘large itemset’, we can find out the qualifying rules easily (how?) Thus, let’s focus on how to find ‘large itemsets’

CMU SCS Faloutsos & PavloCMU SCS /61563 Association rules - idea Naive solution: scan database once; keep 2**|I| counters Drawback? Improvement?

CMU SCS Faloutsos & PavloCMU SCS /61564 Association rules - idea Naive solution: scan database once; keep 2**|I| counters Drawback? 2**1000 is prohibitive... Improvement? scan the db |I| times, looking for 1-, 2-, etc itemsets Eg., for |I|=3 items only (A, B, C), we have

CMU SCS Faloutsos & PavloCMU SCS /61565 Association rules - idea AB C first pass min-sup:10

CMU SCS Faloutsos & PavloCMU SCS /61566 Association rules - idea AB C first pass min-sup:10 A,B A,C B,C

CMU SCS Faloutsos & PavloCMU SCS /61567 Association rules - idea Anti-monotonicity property: if an itemset fails to be ‘large’, so will every superset of it (hence all supersets can be pruned) Sketch of the (famous!) ‘a-priori’ algorithm Let L(i-1) be the set of large itemsets with i-1 elements Let C(i) be the set of candidate itemsets (of size i)

CMU SCS Faloutsos & PavloCMU SCS /61568 Association rules - idea Compute L(1), by scanning the database. repeat, for i=2,3..., ‘join’ L(i-1) with itself, to generate C(i) two itemset can be joined, if they agree on their first i-2 elements prune the itemsets of C(i) (how?) scan the db, finding the counts of the C(i) itemsets - set this to be L(i) unless L(i) is empty, repeat the loop

CMU SCS Faloutsos & PavloCMU SCS /61569 Association rules - Conclusions Association rules: a great tool to find patterns easy to understand its output fine-tuned algorithms exist

CMU SCS Faloutsos & PavloCMU SCS /61570 Overall Conclusions Data Mining = ``Big Data’’ Analytics = Business Intelligence: – of high commercial, government and research interest DM = DB+ ML+ Stat+Sys Data warehousing / OLAP: to get the data Tree classifiers (SLIQ, SPRINT) Association Rules - ‘a-priori’ algorithm (clustering: BIRCH, CURE, OPTICS)

CMU SCS Faloutsos & PavloCMU SCS /61571 Reading material Agrawal, R., T. Imielinski, A. Swami, ‘Mining Association Rules between Sets of Items in Large Databases’, SIGMOD M. Mehta, R. Agrawal and J. Rissanen, `SLIQ: A Fast Scalable Classifier for Data Mining', Proc. of the Fifth Int'l Conference on Extending Database Technology (EDBT), Avignon, France, March 1996

CMU SCS Faloutsos & PavloCMU SCS /61572 Additional references Agrawal, R., S. Ghosh, et al. (Aug , 1992). An Interval Classifier for Database Mining Applications. VLDB Conf. Proc., Vancouver, BC, Canada. Jiawei Han and Micheline Kamber, Data Mining, Morgan Kaufman, 2001, chapters , , 7.3.5