CS 345: Topics in Data Warehousing Thursday, November 18, 2004.



Review of Tuesday's Class
Data Mining
– What is data mining?
– Types of data mining
– Data mining pitfalls
Decision Tree Classifiers
– What is a decision tree?
– Learning decision trees
– Entropy
– Information Gain
– Cross-Validation

Overview of Today's Class
Assignment #3 clarifications
Association Rule Mining
– Market basket analysis
– What is an association rule?
– Frequent itemsets
Association rule mining algorithms
– A-priori algorithm
– Speeding up A-priori using hashing
– One- and two-pass algorithms
* Adapted from slides by Vipin Kumar (Minnesota) and Rajeev Motwani (Stanford)

Aggregate Tables
[Diagram: FACT table, Dimension table, and FACT AGG table]
n dimension columns = 2^n possible aggregates; 2 are special:
– All columns: identical to the original dimension table
– No grouping columns: only 1 row, so there is no reason to join it to FACT AGG; eliminate this foreign key from the fact aggregate table

Candidate Column Sets
Including fact aggregates that use some base dimension tables is optional

Association Rule Mining
Given a set of transactions, find rules that will predict the occurrence of an item based on the occurrences of other items in the transaction
Also known as market basket analysis
Example of Association Rules (from a table of market-basket transactions):
{Diaper} → {Beer}
{Milk, Bread} → {Eggs, Coke}
{Beer, Bread} → {Milk}
Implication means co-occurrence, not causality!

Definition: Frequent Itemset
Itemset
– A collection of one or more items, e.g. {Milk, Bread, Diaper}
– k-itemset: an itemset that contains k items
Support count (σ)
– Frequency of occurrence of an itemset
– E.g. σ({Milk, Bread, Diaper}) = 2
Support (s)
– Fraction of transactions that contain an itemset
– E.g. s({Milk, Bread, Diaper}) = 2/5
Frequent Itemset
– An itemset whose support is greater than or equal to a minsup threshold
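The slide's transaction table did not survive transcription. A minimal sketch in Python, using the standard five-basket example dataset that is consistent with the counts quoted above (σ({Milk, Bread, Diaper}) = 2, s = 2/5):

```python
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]

def support_count(itemset, transactions):
    """Support count sigma(X): number of transactions containing all of X."""
    return sum(1 for t in transactions if itemset <= t)

def support(itemset, transactions):
    """Support s(X): fraction of transactions containing X."""
    return support_count(itemset, transactions) / len(transactions)

print(support_count({"Milk", "Bread", "Diaper"}, transactions))  # 2
print(support({"Milk", "Bread", "Diaper"}, transactions))        # 0.4
```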

Definition: Association Rule
Association Rule
– An implication expression of the form X → Y, where X and Y are itemsets
– Example: {Milk, Diaper} → {Beer}
Rule Evaluation Metrics
– Support (s): fraction of transactions that contain both X and Y
– Confidence (c): measures how often items in Y appear in transactions that contain X

Association Rule Mining Task
Given a set of transactions T, the goal of association rule mining is to find all rules having
– support ≥ minsup threshold
– confidence ≥ minconf threshold
High confidence = strong pattern
High support = occurs often
– Less likely to be a random occurrence
– Larger potential benefit from acting on the rule

Application 1 (Retail Stores)
Real market baskets
– chain stores keep TBs of customer purchase info
– Value? How typical customers navigate stores; positioning tempting items; suggests cross-sell opportunities, e.g., hamburger sale while raising ketchup price
High support needed, or no $$'s

Application 2 (Information Retrieval)
Scenario 1
– baskets = documents; items = words in documents
– frequent word-groups = linked concepts
Scenario 2
– items = sentences; baskets = documents containing sentences
– frequent sentence-groups = possible plagiarism

Application 3 (Web Search)
Scenario 1
– baskets = web pages; items = outgoing links
– pages with similar references → about same topic
Scenario 2
– baskets = web pages; items = incoming links
– pages with similar in-links → mirrors, or same topic

Mining Association Rules
Example of Rules:
{Milk, Diaper} → {Beer} (s=0.4, c=0.67)
{Milk, Beer} → {Diaper} (s=0.4, c=1.0)
{Diaper, Beer} → {Milk} (s=0.4, c=0.67)
{Beer} → {Milk, Diaper} (s=0.4, c=0.67)
{Diaper} → {Milk, Beer} (s=0.4, c=0.5)
{Milk} → {Diaper, Beer} (s=0.4, c=0.5)
Observations:
– All the above rules are binary partitions of the same itemset: {Milk, Diaper, Beer}
– Rules originating from the same itemset have identical support but can have different confidence
– Thus, we may decouple the support and confidence requirements
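The six rules above can be reproduced by enumerating every binary partition of {Milk, Diaper, Beer}. A sketch, using the same assumed five-basket dataset (the slide's table was lost, but these baskets match every (s, c) pair quoted above):

```python
from itertools import combinations

transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]

def count(itemset):
    return sum(1 for t in transactions if itemset <= t)

itemset = frozenset({"Milk", "Diaper", "Beer"})
rules = {}
for k in range(1, len(itemset)):
    for lhs in combinations(sorted(itemset), k):
        X = frozenset(lhs)
        Y = itemset - X
        s = count(itemset) / len(transactions)  # identical for every partition
        c = count(itemset) / count(X)           # varies with the antecedent X
        rules[(X, Y)] = (round(s, 2), round(c, 2))

print(rules[(frozenset({"Milk", "Diaper"}), frozenset({"Beer"}))])  # (0.4, 0.67)
```

Since the support s depends only on the whole itemset, all six rules share s = 0.4; only the confidence changes.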

Mining Association Rules
Goal – find all association rules such that
– support ≥ minsup
– confidence ≥ minconf
Reduction to the Frequent Itemsets Problem
– Find all frequent itemsets X
– Given X = {A1, …, Ak}, generate all rules X−Aj → Aj
– Confidence = sup(X) / sup(X−Aj)
– Support = sup(X)
– Exclude rules whose confidence is too low
– Observe: X−Aj is also frequent → its support is already known
Finding all frequent itemsets is the hard part!

Itemset Lattice
Given m items, there are 2^m − 1 possible candidate itemsets

Scale of Problem
WalMart
– sells m = 100,000 items
– tracks n = 1,000,000,000 baskets
Web
– several billion pages
– approximately one new "word" per page
Exponential number of itemsets
– m items → 2^m − 1 possible itemsets
– Cannot possibly examine all itemsets for large m
– Even itemsets of size 2 may be too many: m = 100,000 → 5 billion item pairs

Frequent Itemsets in SQL
DBMSs are poorly suited to association rule mining
Star schema
– Sales Fact
– Transaction ID degenerate dimension
– Item dimension
Finding frequent 3-itemsets:
SELECT Fact1.ItemID, Fact2.ItemID, Fact3.ItemID, COUNT(*)
FROM Fact AS Fact1
JOIN Fact AS Fact2
  ON Fact1.TID = Fact2.TID AND Fact1.ItemID < Fact2.ItemID
JOIN Fact AS Fact3
  ON Fact2.TID = Fact3.TID AND Fact2.ItemID < Fact3.ItemID
GROUP BY Fact1.ItemID, Fact2.ItemID, Fact3.ItemID
HAVING COUNT(*) >= 1000
Finding frequent k-itemsets requires joining k copies of the fact table
– Joins are non-equijoins
– Impossibly expensive!

Association Rules and Data Warehouses
Typical procedure:
– Use the data warehouse to apply filters (mine association rules for certain regions, dates)
– Export all fact rows matching the filters to a flat file; sort by transaction ID so items in the same transaction are grouped together
– Perform association rule mining on the flat file
An alternative:
– Database vendors are beginning to add specialized data mining capabilities
– Efficient algorithms for common data mining tasks (decision trees, association rules, clustering, etc.) are built in to the database system
– Not standardized yet

Finding Frequent Pairs
Frequent 2-Sets
– hard case already
– focus for now; later extend to k-sets
Naïve Algorithm
– Counters – all m(m−1)/2 item pairs (m = # of distinct items)
– Single pass – scanning all baskets
– Basket of size b – increments b(b−1)/2 counters
Failure?
– if memory < m(m−1)/2 counters
– m = 100,000 → 5 billion item pairs
– Naïve algorithm is impractical for large m
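The naïve single-pass counting can be sketched in a few lines (a hypothetical helper, using the same assumed five-basket dataset). Each basket of size b touches b(b−1)/2 counters, and up to m(m−1)/2 counters may be needed overall, which is exactly what makes the approach fail in limited memory:

```python
from itertools import combinations
from collections import Counter

def naive_pair_counts(baskets):
    """One linear scan; a basket of size b increments b*(b-1)/2 pair counters."""
    counts = Counter()
    for basket in baskets:
        for pair in combinations(sorted(basket), 2):
            counts[pair] += 1
    return counts

baskets = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]
counts = naive_pair_counts(baskets)
print(counts[("Beer", "Diaper")])  # 3
```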

Pruning Candidate Itemsets
Monotonicity principle:
– If an itemset is frequent, then all of its subsets must also be frequent
Monotonicity holds due to the following property of the support measure:
∀X, Y: (X ⊆ Y) ⇒ s(X) ≥ s(Y)
Contrapositive:
– If an itemset is infrequent, then all of its supersets must also be infrequent

Illustrating the Monotonicity Principle
[Diagram: itemset lattice — one itemset is found to be infrequent, and all of its supersets are pruned]

A-Priori Algorithm
A-Priori – 2-pass approach in limited memory
Pass 1
– m counters (one per candidate item)
– Linear scan of baskets b
– Increment counters for each item in b
– Mark as frequent the f items of count at least s
Pass 2
– f(f−1)/2 counters (candidate pairs of frequent items)
– Linear scan of baskets b
– Increment counters for each pair of frequent items in b
Failure – if memory < f(f−1)/2 counters
– Suppose that 10% of items are frequent
– Memory is m²/200 vs. m²/2
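The two passes can be sketched as follows (same assumed example data; the threshold s is an absolute support count here):

```python
from itertools import combinations
from collections import Counter

def apriori_pairs(baskets, s):
    # Pass 1: one counter per item; keep the f items with count >= s
    item_counts = Counter()
    for b in baskets:
        item_counts.update(b)
    frequent = {i for i, c in item_counts.items() if c >= s}
    # Pass 2: counters only for pairs of frequent items -- f(f-1)/2 of them
    pair_counts = Counter()
    for b in baskets:
        for pair in combinations(sorted(set(b) & frequent), 2):
            pair_counts[pair] += 1
    return {p for p, c in pair_counts.items() if c >= s}

baskets = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]
print(sorted(apriori_pairs(baskets, 3)))
```

With s = 3, pass 1 keeps {Bread, Milk, Diaper, Beer}, so pass 2 counts only pairs drawn from those four items rather than all pairs over six items.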

Finding Larger Itemsets
Goal – extend A-Priori to frequent k-sets, k > 2
Monotonicity: itemset X is frequent only if X − {Xj} is frequent for all Xj
Idea
– Stage k finds all frequent k-sets
– Stage 1 gets all frequent items
– Stage k maintains counters for all candidate k-sets
– Candidates – k-sets whose (k−1)-subsets are all frequent
Total cost: number of passes = max size of frequent itemset
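The staged scheme above can be sketched as one loop: candidates at stage k are unions of frequent (k−1)-sets, pruned by the monotonicity check before counting (same assumed example data; absolute count threshold):

```python
from itertools import combinations
from collections import Counter

def apriori(baskets, s):
    # Stage 1: frequent items
    counts = Counter(item for b in baskets for item in b)
    levels = [{frozenset([i]) for i, c in counts.items() if c >= s}]
    k = 2
    while levels[-1]:
        prev = levels[-1]
        # Candidates: unions of two frequent (k-1)-sets that have size k...
        candidates = {a | b for a in prev for b in prev if len(a | b) == k}
        # ...pruned unless every (k-1)-subset is frequent (monotonicity)
        candidates = {c for c in candidates
                      if all(frozenset(sub) in prev for sub in combinations(c, k - 1))}
        counts = Counter()
        for basket in baskets:  # one pass per stage
            for c in candidates:
                if c <= basket:
                    counts[c] += 1
        levels.append({c for c, n in counts.items() if n >= s})
        k += 1
    return [lvl for lvl in levels if lvl]

baskets = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]
levels = apriori(baskets, 3)
```

The number of passes over the baskets equals the size of the largest frequent itemset, matching the cost claim above.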

A-Priori Algorithm
[Tables: support counts for items (1-itemsets), pairs (2-itemsets), and triplets (3-itemsets); Minimum Support = 3]
(No need to generate pair candidates involving Coke or Eggs)
If every subset is considered: 6C1 + 6C2 + 6C3 = 41 candidates
With support-based pruning: 15

Memory Usage – A-Priori
[Diagram: Pass 1 memory holds counters for candidate items; Pass 2 memory holds the frequent items plus counters for candidate pairs]

PCY Idea
Improvement upon A-Priori
– Uses less memory
– Proposed by Park, Chen, and Yu
Observe – during Pass 1, memory is mostly idle
Idea
– Use the idle memory for a hash table H
– Pass 1 – hash pairs from each basket b into H, incrementing a counter at each hash location
– At end – keep a bitmap of high-frequency hash locations
– Pass 2 – the bitmap is an extra condition for candidate pairs
Similar to bit-vector filtering in "Bloom join"

Memory Usage – PCY
[Diagram: Pass 1 memory holds counters for candidate items plus the hash table; Pass 2 memory holds the frequent items, the bitmap summarizing the hash table, and counters for candidate pairs]

PCY Algorithm
Pass 1
– m counters and hash table T
– Linear scan of baskets b
– Increment counters for each item in b
– Increment the hash-table counter for each item pair in b
Mark as frequent the f items of count at least s
Summarize T as a bitmap (count ≥ s ⇒ bit = 1)
Pass 2
– Counter only for qualified pairs (Xi, Xj): both items are frequent AND the pair hashes to a frequent bucket (bit = 1)
– Linear scan of baskets b
– Increment counters for candidate qualified pairs of items in b
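A compact sketch of PCY's two passes (same assumed example data; the bucket count and the toy hash function are arbitrary illustrative choices). A truly frequent pair always lands in a bucket whose count is at least s, so the bitmap can only discard infrequent pairs:

```python
from itertools import combinations
from collections import Counter

NUM_BUCKETS = 50  # arbitrary small choice for illustration

def bucket(pair):
    # deterministic toy hash on the two item names
    return sum(ord(ch) for item in pair for ch in item) % NUM_BUCKETS

def pcy_pairs(baskets, s):
    # Pass 1: count items, and hash every pair into a small bucket array
    item_counts = Counter()
    buckets = [0] * NUM_BUCKETS
    for b in baskets:
        item_counts.update(b)
        for pair in combinations(sorted(b), 2):
            buckets[bucket(pair)] += 1
    frequent_items = {i for i, c in item_counts.items() if c >= s}
    bitmap = [c >= s for c in buckets]  # summarize the hash table as a bitmap
    # Pass 2: count a pair only if both items are frequent AND its bucket bit is set
    pair_counts = Counter()
    for b in baskets:
        for pair in combinations(sorted(set(b) & frequent_items), 2):
            if bitmap[bucket(pair)]:
                pair_counts[pair] += 1
    return {p for p, c in pair_counts.items() if c >= s}

baskets = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]
print(sorted(pcy_pairs(baskets, 3)))
```

Hash collisions can let infrequent pairs through to pass 2 (the false positives that Multistage PCY attacks), but the final threshold check keeps the output exact.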

Multistage PCY Algorithm
Problem – false positives from hashing
New idea
– Multiple rounds of hashing
– After Pass 1, get the list of qualified pairs
– In Pass 2, hash only qualified pairs
– Fewer pairs hash to buckets → fewer false positives (buckets with count ≥ s, yet containing no pair of count ≥ s)
– In Pass 3, less likely to qualify infrequent pairs
Repetition – reduces memory, but costs more passes
Failure – if memory < O(f + F)

Memory Usage – Multistage PCY
[Diagram: Pass 1 memory holds counters for candidate items plus hash table 1; Pass 2 holds the frequent items, bitmap 1, and hash table 2; Pass 3 holds the frequent items, bitmaps 1 and 2, and counters for candidate pairs]

Approximation Techniques
Goal
– find all frequent k-sets
– reduce to 2 passes
– must lose something → accuracy
Approaches
– Sampling algorithm
– SON (Savasere, Omiecinski, Navathe) algorithm
– Toivonen's algorithm

Sampling Algorithm
Pass 1 – load a random sample of baskets into memory
Run A-Priori (or an enhancement) on the sample
– Scale down the support threshold (e.g., for a 1% sample, use s/100 as the support threshold)
– Compute all frequent k-sets in memory from the sample
– Need to leave enough space for counters
Pass 2
– Keep counters only for the frequent k-sets of the random sample
– Get exact counts for the candidates to validate them
Error?
– No false positives (eliminated by Pass 2)
– False negatives possible (X frequent overall, but not in the sample)
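A sketch under simplifying assumptions: the pass-1 mining is brute force rather than A-Priori, the sample is passed in explicitly rather than drawn at random, and `frequent_itemsets` is a hypothetical helper. The structure — mine the sample at a scaled threshold, then validate exactly — follows the slide:

```python
from itertools import combinations
from collections import Counter

def frequent_itemsets(data, thresh, max_k=3):
    """Brute-force frequent itemsets of an in-memory dataset (stand-in for A-Priori)."""
    found = set()
    for k in range(1, max_k + 1):
        counts = Counter()
        for b in data:
            for c in combinations(sorted(b), k):
                counts[c] += 1
        found |= {frozenset(c) for c, n in counts.items() if n >= thresh}
    return found

def sampling_algorithm(transactions, sample, minsup_frac):
    # Pass 1: mine the in-memory sample, threshold scaled to the sample size
    candidates = frequent_itemsets(sample, minsup_frac * len(sample))
    # Pass 2: keep counters only for the sample's frequent itemsets; count exactly
    exact = Counter()
    for b in transactions:
        for c in candidates:
            if c <= b:
                exact[c] += 1
    # No false positives survive pass 2; false negatives remain possible
    return {c for c in candidates if exact[c] >= minsup_frac * len(transactions)}

baskets = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]
# Degenerate "sample" of 100% for a deterministic demonstration
result = sampling_algorithm(baskets, baskets, minsup_frac=0.6)
```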

SON Algorithm
Pass 1 – Batch Processing
– Scan data on disk
– Repeatedly fill memory with a new batch of data
– Run the sampling algorithm on each batch
– Generate candidate frequent itemsets
Candidate itemsets – those frequent in some batch
Pass 2 – validate candidate itemsets
Monotonicity property: itemset X is frequent overall ⇒ X is frequent in at least one batch
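A sketch of the SON batching (brute-force counting inside each batch as a stand-in for A-Priori; assumed example data). Because an itemset frequent overall must be frequent in at least one batch at the batch-scaled threshold, the union of per-batch results is a complete candidate set, and pass 2 makes the answer exact:

```python
from itertools import combinations
from collections import Counter

def frequent_in(data, thresh, max_k=3):
    """Brute-force frequent itemsets of one in-memory batch."""
    found = set()
    for k in range(1, max_k + 1):
        counts = Counter()
        for b in data:
            for c in combinations(sorted(b), k):
                counts[c] += 1
        found |= {frozenset(c) for c, n in counts.items() if n >= thresh}
    return found

def son(transactions, minsup_frac, num_batches=2):
    # Pass 1: frequent itemsets of each batch, at the batch-scaled threshold
    candidates = set()
    for i in range(num_batches):
        batch = transactions[i::num_batches]
        candidates |= frequent_in(batch, minsup_frac * len(batch))
    # Pass 2: validate all candidates with exact counts over the full data
    exact = Counter()
    for b in transactions:
        for c in candidates:
            if c <= b:
                exact[c] += 1
    return {c for c in candidates if exact[c] >= minsup_frac * len(transactions)}

baskets = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]
result = son(baskets, minsup_frac=0.6)
```

Unlike the plain sampling algorithm, SON has no false negatives at all: every batch is examined, so no frequent itemset can be missed.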

Toivonen's Algorithm
Lower the threshold in the sampling algorithm
– Example – if the support threshold is 1%, use 0.8% as the support threshold when evaluating the sample
– Goal – overkill to avoid any false negatives
Negative border
– Itemset X is infrequent in the sample, but all of its immediate subsets are frequent
– Example: AB, BC, AC frequent, but ABC infrequent
Pass 2
– Count the candidates and the negative border
– Negative-border itemsets all infrequent ⇒ the candidates are exactly the frequent itemsets
– Otherwise? – start over!
Achievement? – reduced failure probability, while keeping the candidate count low enough for memory
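A sketch of one trial of Toivonen's algorithm (brute-force counting as a stand-in for A-Priori; assumed example data and illustrative names). It returns None when some negative-border itemset proves frequent, which signals the caller to start over with a fresh sample:

```python
from itertools import combinations
from collections import Counter

def toivonen(transactions, sample, minsup_frac, lowered_frac, max_k=3):
    def frequent_in(data, thresh):
        found = set()
        for k in range(1, max_k + 1):
            counts = Counter()
            for b in data:
                for c in combinations(sorted(b), k):
                    counts[c] += 1
            found |= {frozenset(c) for c, n in counts.items() if n >= thresh}
        return found

    # Mine the sample at the deliberately lowered threshold
    sample_freq = frequent_in(sample, lowered_frac * len(sample))
    items = {i for b in transactions for i in b}

    def subsets_frequent(c):
        if len(c) == 1:
            return True  # only subset is the empty set, frequent by convention
        return all(frozenset(s) in sample_freq for s in combinations(c, len(c) - 1))

    # Negative border: not frequent in the sample, all immediate subsets are
    border = set()
    for s in sample_freq | {frozenset()}:
        for i in items - set(s):
            cand = frozenset(s | {i})
            if cand not in sample_freq and subsets_frequent(cand):
                border.add(cand)

    # Pass 2: exact counts for the sample-frequent itemsets and the border
    thresh = minsup_frac * len(transactions)
    exact = Counter()
    for b in transactions:
        for c in sample_freq | border:
            if c <= b:
                exact[c] += 1
    if any(exact[c] >= thresh for c in border):
        return None  # a frequent itemset escaped the sample: start over
    return {c for c in sample_freq if exact[c] >= thresh}

baskets = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]
# Degenerate 100% "sample" with threshold lowered from 0.6 to 0.4
result = toivonen(baskets, baskets, minsup_frac=0.6, lowered_frac=0.4)
```

If no border itemset is frequent, monotonicity guarantees nothing outside sample_freq ∪ border can be frequent either, so the returned set is exact.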