Mining Association Rules in Large Databases


Mining Association Rules in Large Databases
- Association rule mining
- Mining single-dimensional Boolean association rules from transactional databases
- Mining multilevel association rules from transactional databases
- Mining multidimensional association rules from transactional databases and data warehouses
- From association mining to correlation analysis
- Constraint-based association mining
- Summary

Rule Measures: Support and Confidence
[Venn diagram: customers who buy beer, customers who buy diapers, and customers who buy both]
- Find all rules X & Y ⇒ Z with minimum support and confidence
  - support, s: probability that a transaction contains {X ∪ Y ∪ Z}
  - confidence, c: conditional probability that a transaction containing {X ∪ Y} also contains Z
- With minimum support 50% and minimum confidence 50%, we have
  - A ⇒ C (50%, 66.6%)
  - C ⇒ A (50%, 100%)

Association Rule Mining
- Given a set of transactions, find rules that predict the occurrence of an item based on the occurrences of other items in the transaction
- Market-basket transactions; examples of association rules:
  {Diaper} ⇒ {Beer}, {Milk, Bread} ⇒ {Eggs, Coke}, {Beer, Bread} ⇒ {Milk}
- Implication means co-occurrence, not causality!

Definition: Frequent Itemset
- Itemset: a collection of one or more items, e.g., {Milk, Bread, Diaper}
- k-itemset: an itemset that contains k items
- Support count (σ): frequency of occurrence of an itemset, e.g., σ({Milk, Bread, Diaper}) = 2
- Support (s): fraction of transactions that contain an itemset, e.g., s({Milk, Bread, Diaper}) = 2/5
- Frequent itemset: an itemset whose support is greater than or equal to a minsup threshold
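These definitions are easy to check in a few lines of code. A minimal Python sketch; the five transactions below are the standard market-basket illustration that accompanies these slides (the table itself was an image, so treat the data as an assumption):

    # Assumed market-basket data (the slide's transaction table was a figure).
    transactions = [
        {"Bread", "Milk"},
        {"Bread", "Diaper", "Beer", "Eggs"},
        {"Milk", "Diaper", "Beer", "Coke"},
        {"Bread", "Milk", "Diaper", "Beer"},
        {"Bread", "Milk", "Diaper", "Coke"},
    ]

    def support_count(itemset, transactions):
        """sigma(X): number of transactions that contain every item of X."""
        return sum(1 for t in transactions if itemset <= t)

    def support(itemset, transactions):
        """s(X): fraction of transactions that contain X."""
        return support_count(itemset, transactions) / len(transactions)

    X = {"Milk", "Bread", "Diaper"}
    print(support_count(X, transactions))  # 2
    print(support(X, transactions))        # 0.4 (= 2/5)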

Definition: Association Rule
- An implication expression of the form X ⇒ Y, where X and Y are disjoint itemsets; example: {Milk, Diaper} ⇒ {Beer}
- Rule evaluation metrics:
  - Support (s): fraction of transactions that contain both X and Y
  - Confidence (c): how often items in Y appear in transactions that contain X
- Example: for {Milk, Diaper} ⇒ {Beer}, s = σ(Milk, Diaper, Beer) / |T| = 2/5 = 0.4 and c = σ(Milk, Diaper, Beer) / σ(Milk, Diaper) = 2/3 = 0.67

Association Rule Mining Task
- Given a set of transactions T, the goal of association rule mining is to find all rules having
  - support ≥ minsup threshold
  - confidence ≥ minconf threshold
- Brute-force approach:
  - List all possible association rules
  - Compute the support and confidence for each rule
  - Prune rules that fail the minsup and minconf thresholds
  ⇒ Computationally prohibitive!

Mining Association Rules
- Example rules:
  {Milk, Diaper} ⇒ {Beer} (s=0.4, c=0.67)
  {Milk, Beer} ⇒ {Diaper} (s=0.4, c=1.0)
  {Diaper, Beer} ⇒ {Milk} (s=0.4, c=0.67)
  {Beer} ⇒ {Milk, Diaper} (s=0.4, c=0.67)
  {Diaper} ⇒ {Milk, Beer} (s=0.4, c=0.5)
  {Milk} ⇒ {Diaper, Beer} (s=0.4, c=0.5)
- Observations:
  - All of the above rules are binary partitions of the same itemset {Milk, Diaper, Beer}
  - Rules originating from the same itemset have identical support but can have different confidence
  - Thus we may decouple the support and confidence requirements

Mining Association Rules—An Example
- Min. support 50%, min. confidence 50%
- For rule A ⇒ C:
  support = support({A, C}) = 50%
  confidence = support({A, C}) / support({A}) = 66.6%
- The Apriori principle: any subset of a frequent itemset must be frequent

Mining Frequent Itemsets: the Key Step
- Find the frequent itemsets: the sets of items that have minimum support
  - A subset of a frequent itemset must also be a frequent itemset, i.e., if {A, B} is frequent, both {A} and {B} must be frequent
  - Iteratively find frequent itemsets with cardinality from 1 to k (k-itemsets)
- Use the frequent itemsets to generate association rules

Mining Association Rules
- Two-step approach:
  1. Frequent itemset generation: generate all itemsets whose support ≥ minsup
  2. Rule generation: generate high-confidence rules from each frequent itemset, where each rule is a binary partitioning of a frequent itemset
- Frequent itemset generation is still computationally expensive

Frequent Itemset Generation
- Given d items, there are 2^d possible candidate itemsets

Frequent Itemset Generation
- Brute-force approach:
  - Each itemset in the lattice is a candidate frequent itemset
  - Count the support of each candidate by scanning the database, matching each transaction against every candidate
  - Complexity ~ O(NMw), expensive since M = 2^d!

Computational Complexity
- Given d unique items:
  - Total number of itemsets = 2^d
  - Total number of possible association rules: R = 3^d − 2^(d+1) + 1
  - If d = 6, R = 602 rules
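That count is quick to verify by brute force: for every non-empty antecedent of size k there are 2^(d−k) − 1 possible non-empty consequents. A small sketch (not from the slides):

    from math import comb

    def num_rules(d):
        # comb(d, k) choices of a k-item antecedent; any non-empty subset
        # of the remaining d - k items can serve as the consequent
        return sum(comb(d, k) * (2 ** (d - k) - 1) for k in range(1, d))

    assert num_rules(6) == 3**6 - 2**7 + 1 == 602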

Frequent Itemset Generation Strategies
- Reduce the number of candidates (M)
  - Complete search: M = 2^d
  - Use pruning techniques to reduce M
- Reduce the number of transactions (N)
  - Reduce the size of N as the size of the itemset increases
  - Used by DHP and vertical-based mining algorithms
- Reduce the number of comparisons (NM)
  - Use efficient data structures to store the candidates or transactions
  - No need to match every candidate against every transaction

Reducing the Number of Candidates
- Apriori principle: if an itemset is frequent, then all of its subsets must also be frequent
- The Apriori principle holds due to the following property of the support measure: for all X, Y: (X ⊆ Y) ⇒ s(X) ≥ s(Y)
  - The support of an itemset never exceeds the support of its subsets
  - This is known as the anti-monotone property of support

Illustrating the Apriori Principle
[Itemset-lattice diagram: once an itemset is found to be infrequent, all of its supersets are pruned]

The Apriori Algorithm
- Join step: C_k is generated by joining L_{k-1} with itself
- Prune step: any (k-1)-itemset that is not frequent cannot be a subset of a frequent k-itemset
- Pseudo-code (C_k: candidate itemsets of size k; L_k: frequent itemsets of size k):
  L_1 = {frequent items};
  for (k = 1; L_k ≠ ∅; k++) do begin
      C_{k+1} = candidates generated from L_k;
      for each transaction t in database do
          increment the count of all candidates in C_{k+1} that are contained in t
      L_{k+1} = candidates in C_{k+1} with min_support
  end
  return ∪_k L_k;
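The pseudocode translates almost line for line into Python. A minimal runnable sketch (variable names are my own, and the four-transaction database D is assumed from the standard version of the example on the next slide), not a reference implementation:

    from itertools import combinations

    def apriori(transactions, min_count):
        transactions = [frozenset(t) for t in transactions]
        items = {i for t in transactions for i in t}
        # L1: frequent 1-itemsets
        L = {frozenset([i]) for i in items
             if sum(i in t for t in transactions) >= min_count}
        frequent = set(L)
        k = 2
        while L:
            # join step: unite (k-1)-itemsets that differ in one item,
            # then prune candidates having an infrequent (k-1)-subset
            C = {a | b for a in L for b in L if len(a | b) == k}
            C = {c for c in C
                 if all(frozenset(s) in L for s in combinations(c, k - 1))}
            counts = {c: sum(c <= t for t in transactions) for c in C}
            L = {c for c, n in counts.items() if n >= min_count}
            frequent |= L
            k += 1
        return frequent

    # assumed database D: {A,C,D}, {B,C,E}, {A,B,C,E}, {B,E}, min support 2
    D = [{"A", "C", "D"}, {"B", "C", "E"}, {"A", "B", "C", "E"}, {"B", "E"}]
    for s in sorted(apriori(D, min_count=2), key=lambda s: (len(s), sorted(s))):
        print(sorted(s))   # ends with ['B', 'C', 'E'], the only frequent 3-itemset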

The Apriori Algorithm — Example
[Figure: walk-through on database D — scan D to count C1 and obtain L1; join L1 to form C2, scan D to obtain L2; join L2 to form C3, scan D to obtain L3]

Generate Hash Tree
- Suppose you have 15 candidate itemsets of length 3: {1 4 5}, {1 2 4}, {4 5 7}, {1 2 5}, {4 5 8}, {1 5 9}, {1 3 6}, {2 3 4}, {5 6 7}, {3 4 5}, {3 5 6}, {3 5 7}, {6 8 9}, {3 6 7}, {3 6 8}
- You need:
  - A hash function (here: items 1, 4, 7 / 2, 5, 8 / 3, 6, 9 hash to the left / middle / right branch)
  - A max leaf size: the maximum number of itemsets stored in a leaf node (if the number of candidate itemsets exceeds the max leaf size, split the node)
[Figure: the resulting candidate hash tree]
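A toy version of such a hash tree, assuming the slide's hash function (1,4,7 / 2,5,8 / 3,6,9) and a max leaf size of 3; the class layout is my own sketch:

    MAX_LEAF = 3

    def h(item):
        return (item - 1) % 3          # 1,4,7 -> 0; 2,5,8 -> 1; 3,6,9 -> 2

    class Node:
        def __init__(self):
            self.children = {}         # interior node: hash value -> child
            self.itemsets = []         # leaf node: stored candidates

        def insert(self, itemset, depth=0):
            if self.children:          # interior: hash on the depth-th item
                child = self.children.setdefault(h(itemset[depth]), Node())
                child.insert(itemset, depth + 1)
            else:
                self.itemsets.append(itemset)
                if len(self.itemsets) > MAX_LEAF and depth < len(itemset):
                    for s in self.itemsets:    # split the overfull leaf
                        c = self.children.setdefault(h(s[depth]), Node())
                        c.insert(s, depth + 1)
                    self.itemsets = []

    root = Node()
    for c in [(1,4,5),(1,2,4),(4,5,7),(1,2,5),(4,5,8),(1,5,9),(1,3,6),(2,3,4),
              (5,6,7),(3,4,5),(3,5,6),(3,5,7),(6,8,9),(3,6,7),(3,6,8)]:
        root.insert(c)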

Association Rule Discovery: Hash Tree
[Figure: the candidate hash tree built with hash function 1,4,7 / 2,5,8 / 3,6,9; hashing on item 1, 4, or 7 selects the left branch]

How to Generate Candidates?
- Suppose the items in L_{k-1} are listed in an order
- Step 1: self-joining L_{k-1}
  insert into C_k
  select p.item_1, p.item_2, …, p.item_{k-1}, q.item_{k-1}
  from L_{k-1} p, L_{k-1} q
  where p.item_1 = q.item_1, …, p.item_{k-2} = q.item_{k-2}, p.item_{k-1} < q.item_{k-1}
- Step 2: pruning
  forall itemsets c in C_k do
      forall (k-1)-subsets s of c do
          if (s is not in L_{k-1}) then delete c from C_k

How to Count Supports of Candidates?
- Why is counting the supports of candidates a problem?
  - The total number of candidates can be huge
  - One transaction may contain many candidates
- Method:
  - Candidate itemsets are stored in a hash tree
  - A leaf node of the hash tree contains a list of itemsets and counts
  - An interior node contains a hash table
  - Subset function: finds all the candidates contained in a transaction

Example of Generating Candidates
- L3 = {abc, abd, acd, ace, bcd}
- Self-joining: L3 * L3
  - abcd from abc and abd
  - acde from acd and ace
- Pruning: acde is removed because ade is not in L3
- C4 = {abcd}

Methods to Improve Apriori's Efficiency
- Hash-based itemset counting: a k-itemset whose corresponding hashing bucket count is below the threshold cannot be frequent
- Transaction reduction: a transaction that does not contain any frequent k-itemset is useless in subsequent scans
- Partitioning: any itemset that is potentially frequent in DB must be frequent in at least one of the partitions of DB
- Sampling: mine a subset of the given data with a lower support threshold, plus a method to verify completeness
- Dynamic itemset counting: add new candidate itemsets only when all of their subsets are estimated to be frequent

Compact Representation of Frequent Itemsets
- Some itemsets are redundant because they have the same support as their supersets
- The number of frequent itemsets can grow exponentially with the number of items
- We therefore need a compact representation

Maximal Frequent Itemset
- An itemset is maximal frequent if none of its immediate supersets is frequent
[Lattice diagram: the border separates the maximal frequent itemsets from the infrequent itemsets]

Closed Itemset
- An itemset is closed if none of its immediate supersets has the same support as the itemset

Maximal vs Closed Itemsets
[Figure: itemset lattice annotated with the transaction IDs supporting each itemset; some itemsets are not supported by any transaction]

Maximal vs Closed Frequent Itemsets
[Figure: lattice with minimum support = 2, marking itemsets that are closed but not maximal and itemsets that are both closed and maximal; # closed = 9, # maximal = 4]

Maximal vs Closed Itemsets
[Diagram: maximal frequent itemsets ⊆ closed frequent itemsets ⊆ frequent itemsets]
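Given the support counts of all frequent itemsets (from Apriori, say), both families fall straight out of the definitions above. A small sketch (the tiny support table at the end is made up for illustration):

    def closed_and_maximal(support):
        """support: dict of frozenset -> support count, frequent itemsets only."""
        closed, maximal = set(), set()
        for X, s in support.items():
            supersets = [Y for Y in support if X < Y]
            if all(support[Y] < s for Y in supersets):
                closed.add(X)      # no frequent superset has equal support
            if not supersets:
                maximal.add(X)     # no frequent superset at all
        return closed, maximal

    sup = {frozenset("A"): 3, frozenset("B"): 3, frozenset("AB"): 3}
    print(closed_and_maximal(sup))   # only {A, B} is closed (and maximal)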

Is Apriori Fast Enough? — Performance Bottlenecks
- The core of the Apriori algorithm:
  - Use frequent (k−1)-itemsets to generate candidate frequent k-itemsets
  - Use database scans and pattern matching to collect counts for the candidate itemsets
- The bottleneck of Apriori: candidate generation
  - Huge candidate sets: 10^4 frequent 1-itemsets will generate 10^7 candidate 2-itemsets; to discover a frequent pattern of size 100, e.g., {a_1, a_2, …, a_100}, one needs to generate 2^100 ≈ 10^30 candidates
  - Multiple scans of the database: needs (n + 1) scans, where n is the length of the longest pattern

Alternative Methods for Frequent Itemset Generation
- Representation of the database: horizontal vs vertical data layout

FP-growth Algorithm
- Use a compressed representation of the database: an FP-tree
- Once an FP-tree has been constructed, use a recursive divide-and-conquer approach to mine the frequent itemsets

FP-tree Construction
[Figure: the tree starts as a single null root; after reading TID=1 it holds the path A:1 → B:1, and after reading TID=2 it additionally holds the path B:1 → C:1 → D:1 under the root]

FP-tree Construction (continued)
[Figure: the complete FP-tree for the transaction database, with a header table whose node-link pointers chain together all nodes carrying the same item]
- Pointers are used to assist frequent itemset generation
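The construction shown in the figure is compact to code up: reorder each transaction by descending global frequency, then insert it as a shared prefix path. A sketch with my own field names (the five transactions are assumed from the header table f:4, c:4, a:3, b:3, m:3, p:3 that appears on a later slide):

    from collections import Counter, defaultdict

    class FPNode:
        def __init__(self, item, parent):
            self.item, self.parent = item, parent
            self.count = 0
            self.children = {}

    def build_fptree(transactions, min_count):
        freq = Counter(i for t in transactions for i in t)
        freq = {i: n for i, n in freq.items() if n >= min_count}
        root, header = FPNode(None, None), defaultdict(list)
        for t in transactions:
            # keep frequent items only, most frequent first (ties by name)
            path = sorted((i for i in t if i in freq),
                          key=lambda i: (-freq[i], i))
            node = root
            for item in path:
                if item not in node.children:
                    node.children[item] = FPNode(item, node)
                    header[item].append(node.children[item])  # node-link
                node = node.children[item]
                node.count += 1
        return root, header

    root, header = build_fptree(
        [{"f","c","a","m","p"}, {"f","c","a","b","m"}, {"f","b"},
         {"c","b","p"}, {"f","c","a","m","p"}], min_count=3)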

FP-growth
- Conditional pattern base for D: P = {(A:1, B:1, C:1), (A:1, B:1), (A:1, C:1), (A:1), (B:1, C:1)}
- Recursively apply FP-growth on P
- Frequent itemsets found (with sup > 1): AD, BD, CD, ACD, BCD
[Figure: the FP-tree from which the prefix paths ending in D are collected]

Tree Projection
- Set enumeration tree over the items
  - Possible extensions of A: E(A) = {B, C, D, E}
  - Possible extensions of ABC: E(ABC) = {D, E}

Tree Projection
- Items are listed in lexicographic order
- Each node P stores the following information:
  - The itemset for node P
  - The list of possible lexicographic extensions of P: E(P)
  - A pointer to the projected database of its ancestor node
  - A bit vector recording which transactions in the projected database contain the itemset

Projected Database
- For each transaction T, the projected transaction at node A is T ∩ E(A)
[Figure: the original database alongside the projected database for node A]
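The projection step is essentially one line per transaction. A toy sketch, assuming E(A) consists of the items lexicographically after A and restricting to transactions that contain A (the transactions here are made up):

    def project(transactions, item):
        # E(item): items that can lexicographically extend `item`
        ext = lambda t: {i for i in t if i > item}
        return [ext(t) for t in transactions if item in t]

    db = [{"A", "B", "E"}, {"B", "C"}, {"A", "C", "D"}]
    print(project(db, "A"))      # [{'B', 'E'}, {'C', 'D'}]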

Benefits of the FP-tree Structure
- Completeness:
  - Never breaks a long pattern of any transaction
  - Preserves complete information for frequent pattern mining
- Compactness:
  - Reduces irrelevant information: infrequent items are gone
  - Frequency-descending ordering: more frequent items are more likely to be shared
  - Never larger than the original database (not counting node-links and counts)
  - Example: for the Connect-4 DB, the compression ratio can be over 100

Mining Frequent Patterns Using the FP-tree
- General idea (divide and conquer): recursively grow frequent pattern paths using the FP-tree
- Method:
  - For each item, construct its conditional pattern base and then its conditional FP-tree
  - Repeat the process on each newly created conditional FP-tree
  - Until the resulting FP-tree is empty or contains only one path (a single path generates all the combinations of its sub-paths, each of which is a frequent pattern)

Major Steps to Mine an FP-tree
1. Construct the conditional pattern base for each node in the FP-tree
2. Construct the conditional FP-tree from each conditional pattern base
3. Recursively mine the conditional FP-trees and grow the frequent patterns obtained so far
   - If a conditional FP-tree contains a single path, simply enumerate all the patterns

Step 1: From FP-tree to Conditional Pattern Base
- Start at the frequent-item header table of the FP-tree
- Traverse the FP-tree by following the node-links of each frequent item
- Accumulate all transformed prefix paths of that item to form its conditional pattern base

Header table (item: frequency): f: 4, c: 4, a: 3, b: 3, m: 3, p: 3
[Figure: FP-tree with the path f:4 → c:3 → a:3, beneath which hang m:2 → p:2 and b:1 → m:1, plus a second branch c:1 → b:1 → p:1 under the root]

Conditional pattern bases:
  item | conditional pattern base
  c    | f:3
  a    | fc:3
  b    | fca:1, f:1, c:1
  m    | fca:2, fcab:1
  p    | fcam:2, cb:1

Properties of the FP-tree for Conditional Pattern Base Construction
- Node-link property: for any frequent item a_i, all possible frequent patterns that contain a_i can be obtained by following a_i's node-links, starting from a_i's head in the FP-tree header table
- Prefix path property: to calculate the frequent patterns for a node a_i in a path P, only the prefix sub-path of a_i in P needs to be accumulated, and its frequency count should carry the same count as node a_i

Step 2: Construct the Conditional FP-tree
- For each pattern base:
  - Accumulate the count for each item in the base
  - Construct the FP-tree for the frequent items of the pattern base
- Example: the m-conditional pattern base is fca:2, fcab:1; accumulating counts gives f:3, c:3, a:3 (b:1 is infrequent), yielding the m-conditional FP-tree {} → f:3 → c:3 → a:3
- All frequent patterns concerning m: m, fm, cm, am, fcm, fam, cam, fcam

Step 3: Recursively Mine the Conditional FP-trees
- m-conditional FP-tree: {} → f:3 → c:3 → a:3
- Conditional pattern base of "am": (fc:3), giving the am-conditional FP-tree {} → f:3 → c:3
- Conditional pattern base of "cm": (f:3), giving the cm-conditional FP-tree {} → f:3
- Conditional pattern base of "cam": (f:3), giving the cam-conditional FP-tree {} → f:3

FP-growth vs. Apriori: Scalability with the Support Threshold
[Figure: run time vs. support threshold on data set T25I20D10K]

FP-growth vs. Tree-Projection: Scalability with the Support Threshold
[Figure: run time vs. support threshold on data set T25I20D100K]

Rule Generation
- How do we efficiently generate rules from frequent itemsets?
- In general, confidence does not have an anti-monotone property: c(ABC → D) can be larger or smaller than c(AB → D)
- But the confidence of rules generated from the same itemset does have an anti-monotone property, e.g., for L = {A, B, C, D}: c(ABC → D) ≥ c(AB → CD) ≥ c(A → BCD)
- Confidence is anti-monotone w.r.t. the number of items on the RHS of the rule

Rule Generation for the Apriori Algorithm
[Figure: lattice of rules for one frequent itemset; once a low-confidence rule is found, all rules below it in the lattice are pruned]

Rule Generation for the Apriori Algorithm
- A candidate rule is generated by merging two rules that share the same prefix in the rule consequent
- Example: join(CD ⇒ AB, BD ⇒ AC) produces the candidate rule D ⇒ ABC
- Prune rule D ⇒ ABC if its subset rule AD ⇒ BC does not have high confidence
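Combining this merge step with the anti-monotonicity of confidence on the consequent side gives a compact level-wise rule generator. A sketch, assuming `support` maps every subset of the itemset to its support count; the counts in the usage example are the ones from the earlier market-basket slides (an assumption, since the table was a figure):

    def gen_rules(itemset, support, min_conf):
        itemset = frozenset(itemset)
        rules, consequents = [], [frozenset([i]) for i in itemset]
        while consequents:
            survivors = []
            for Y in consequents:
                X = itemset - Y
                if not X:
                    continue
                conf = support[itemset] / support[X]
                if conf >= min_conf:
                    rules.append((X, Y, conf))
                    survivors.append(Y)   # only high-confidence consequents grow
            # merge consequents that share all but one item (same-prefix join)
            consequents = list({a | b for a in survivors for b in survivors
                                if len(a | b) == len(a) + 1})
        return rules

    sup = {frozenset(s): n for s, n in [
        (("Milk",), 4), (("Diaper",), 4), (("Beer",), 3),
        (("Milk", "Diaper"), 3), (("Milk", "Beer"), 2), (("Diaper", "Beer"), 3),
        (("Milk", "Diaper", "Beer"), 2)]}
    for X, Y, c in gen_rules(("Milk", "Diaper", "Beer"), sup, min_conf=0.6):
        print(sorted(X), "->", sorted(Y), round(c, 2))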

Effect of Support Distribution
- Many real data sets have a skewed support distribution
[Figure: support distribution of a retail data set]

Effect of Support Distribution
- How do we set an appropriate minsup threshold?
  - If minsup is set too high, we could miss itemsets involving interesting rare items (e.g., expensive products)
  - If minsup is set too low, mining becomes computationally expensive and the number of itemsets is very large
- Using a single minimum support threshold may not be effective

Iceberg Queries
- Iceberg query: compute aggregates over one or a set of attributes only for those groups whose aggregate value is above a certain threshold
- Example:
  select P.custID, P.itemID, sum(P.qty)
  from purchase P
  group by P.custID, P.itemID
  having sum(P.qty) >= 10
- Compute iceberg queries efficiently with Apriori:
  - First compute the lower dimensions
  - Then compute higher dimensions only when all the lower ones are above the threshold

Pattern Evaluation
- Association rule algorithms tend to produce too many rules, many of them uninteresting or redundant
  - E.g., {A, B, C} → {D} is redundant if {A, B} → {D} has the same support and confidence
- Interestingness measures can be used to prune or rank the derived patterns
- In the original formulation of association rules, support and confidence are the only measures used

Application of Interestingness Measures
[Figure: applying interestingness measures to post-process discovered patterns]

Computing Interestingness Measures
- Given a rule X → Y, the information needed to compute its interestingness can be obtained from a contingency table

Contingency table for X → Y:
           Y      ¬Y
  X       f11    f10   | f1+
  ¬X      f01    f00   | f0+
          f+1    f+0   | |T|

  f11: support count of X and Y
  f10: support count of X and ¬Y
  f01: support count of ¬X and Y
  f00: support count of ¬X and ¬Y

- Used to define various measures: support, confidence, lift, Gini, J-measure, etc.

Drawback of Confidence
            Coffee   ¬Coffee
  Tea         15        5     |  20
  ¬Tea        75        5     |  80
              90       10     | 100

- Association rule: Tea → Coffee
- Confidence = P(Coffee | Tea) = 15/20 = 0.75, but P(Coffee) = 0.9
- Although the confidence is high, the rule is misleading: P(Coffee | ¬Tea) = 75/80 = 0.9375

Interestingness Measurements
- Objective measures: two popular ones are support and confidence
- Subjective measures (Silberschatz & Tuzhilin, KDD'95): a rule (pattern) is interesting if it is
  - unexpected (surprising to the user), and/or
  - actionable (the user can do something with it)

Criticism of Support and Confidence
- Example 1 (Aggarwal & Yu, PODS'98): among 5000 students,
  - 3000 play basketball
  - 3750 eat cereal
  - 2000 both play basketball and eat cereal
- "play basketball → eat cereal" [40%, 66.7%] is misleading, because the overall percentage of students eating cereal is 75%, higher than 66.7%
- "play basketball → not eat cereal" [20%, 33.3%] is far more accurate, although it has lower support and confidence

Statistical Independence
- Population of 1000 students:
  - 600 students know how to swim (S)
  - 700 students know how to bike (B)
  - 420 students know how to swim and bike (S, B)
- P(S ∧ B) = 420/1000 = 0.42, and P(S) × P(B) = 0.6 × 0.7 = 0.42
- P(S ∧ B) = P(S) × P(B): statistical independence
- P(S ∧ B) > P(S) × P(B): positively correlated
- P(S ∧ B) < P(S) × P(B): negatively correlated

Statistics-based Measures
- Measures that take statistical dependence into account, for example:
  - Lift = P(Y | X) / P(Y)
  - Interest = P(X, Y) / (P(X) P(Y))
  - PS = P(X, Y) − P(X) P(Y)
  - φ-coefficient = (P(X, Y) − P(X) P(Y)) / √(P(X)(1 − P(X)) P(Y)(1 − P(Y)))

Example: Lift/Interest
            Coffee   ¬Coffee
  Tea         15        5     |  20
  ¬Tea        75        5     |  80
              90       10     | 100

- Association rule: Tea → Coffee
- Confidence = P(Coffee | Tea) = 0.75, but P(Coffee) = 0.9
- Lift = 0.75 / 0.9 = 0.8333 (< 1, therefore Tea and Coffee are negatively associated)
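The same numbers fall straight out of the contingency table; a quick check in code:

    f11, f10, f01, f00 = 15, 5, 75, 5     # (Tea,Coffee), (Tea,~Coffee), ...
    n = f11 + f10 + f01 + f00             # 100 transactions
    conf = f11 / (f11 + f10)              # P(Coffee|Tea) = 0.75
    p_y = (f11 + f01) / n                 # P(Coffee) = 0.9
    print(round(conf / p_y, 4))           # lift = 0.8333 < 1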

Drawback of Lift & Interest

          Y    ¬Y                         Y    ¬Y
  X      10     0  |  10          X      90     0  |  90
  ¬X      0    90  |  90          ¬X      0    10  |  10
         10    90  | 100                 90    10  | 100

  Lift = 0.1/(0.1 × 0.1) = 10     Lift = 0.9/(0.9 × 0.9) = 1.11

- The rarer co-occurrence gets the much higher lift, even though X and Y co-occur in 90% of the transactions in the second table
- Statistical independence: if P(X, Y) = P(X) P(Y), then Lift = 1

- There are lots of measures proposed in the literature
- Some measures are good for certain applications, but not for others
- What criteria should we use to determine whether a measure is good or bad?
- What about Apriori-style support-based pruning? How does it affect these measures?

Constraint-Based Mining
- Interactive, exploratory mining of gigabytes of data: could it be real? Only by making good use of constraints!
- What kinds of constraints can be used in mining?
  - Knowledge type constraint: classification, association, etc.
  - Data constraint: SQL-like queries, e.g., find product pairs sold together in Vancouver in Dec. '98
  - Dimension/level constraints: relevance to region, price, brand, customer category
  - Rule constraints: e.g., small sales (price < $10) triggers big sales (sum > $200)
  - Interestingness constraints: strong rules (min_support ≥ 3%, min_confidence ≥ 60%)

Rule Constraints in Association Mining
- Two kinds of rule constraints:
  - Rule form constraints: meta-rule guided mining, e.g., P(x, y) ∧ Q(x, w) ⇒ takes(x, "database systems")
  - Rule (content) constraints: constraint-based query optimization (Ng et al., SIGMOD'98), e.g., sum(LHS) < 100 ∧ min(LHS) > 20 ∧ count(LHS) > 3 ∧ sum(RHS) > 1000
- 1-variable vs. 2-variable constraints (Lakshmanan et al., SIGMOD'99):
  - 1-var: a constraint confining only one side (L or R) of the rule, e.g., as shown above
  - 2-var: a constraint confining both sides (L and R), e.g., sum(LHS) < min(RHS) ∧ max(RHS) < 5 × sum(LHS)

Constraint-Based Association Queries
- Database: (1) trans(TID, Itemset), (2) itemInfo(Item, Type, Price)
- A constrained association query (CAQ) is of the form {(S1, S2) | C}, where C is a set of constraints on S1, S2, including a frequency constraint
- A classification of (single-variable) constraints:
  - Class constraint: S ⊆ A, e.g., S ⊆ Item
  - Domain constraints:
    - S θ v, θ ∈ {=, ≠, <, ≤, >, ≥}, e.g., S.Price < 100
    - v θ S, θ is ∈ or ∉, e.g., snacks ∉ S.Type
    - V θ S, or S θ V, θ ∈ {⊆, ⊂, ⊄, =, ≠}, e.g., {snacks, sodas} ⊆ S.Type
  - Aggregation constraint: agg(S) θ v, where agg ∈ {min, max, sum, count, avg} and θ ∈ {=, ≠, <, ≤, >, ≥}, e.g., count(S1.Type) = 1, avg(S2.Price) < 100

Constrained Association Query Optimization Problem
- Given a CAQ = {(S1, S2) | C}, the algorithm should be:
  - Sound: it finds only frequent sets that satisfy the given constraints C
  - Complete: all frequent sets satisfying the given constraints C are found
- A naïve solution: apply Apriori to find all frequent sets, then test them for constraint satisfaction one by one
- Our approach: comprehensively analyze the properties of the constraints and push them as deeply as possible inside the frequent set computation

Anti-monotone and Monotone Constraints
- A constraint C_a is anti-monotone iff, for any pattern S not satisfying C_a, none of the super-patterns of S can satisfy C_a
- A constraint C_m is monotone iff, for any pattern S satisfying C_m, every super-pattern of S also satisfies it

Succinct Constraints
- A subset of items I_s is a succinct set if it can be expressed as σ_p(I) for some selection predicate p, where σ is the selection operator
- SP ⊆ 2^I is a succinct power set if there is a fixed number of succinct sets I_1, …, I_k ⊆ I such that SP can be expressed in terms of the strict power sets of I_1, …, I_k using union and minus
- A constraint C_s is succinct provided SAT_{C_s}(I) is a succinct power set

Convertible Constraints
- Suppose all items in patterns are listed in a total order R
- A constraint C is convertible anti-monotone iff a pattern S satisfying the constraint implies that each suffix of S w.r.t. R also satisfies C
- A constraint C is convertible monotone iff a pattern S satisfying the constraint implies that each pattern of which S is a suffix w.r.t. R also satisfies C

Relationships Among Categories of Constraints
[Diagram: succinctness, anti-monotonicity, monotonicity, and convertible constraints shown as overlapping regions, with inconvertible constraints outside]

Property of Constraints: Anti-Monotone
- Anti-monotonicity: if a set S violates the constraint, any superset of S violates the constraint
- Examples:
  - sum(S.Price) ≤ v is anti-monotone
  - sum(S.Price) ≥ v is not anti-monotone
  - sum(S.Price) = v is partly anti-monotone
- Application: push "sum(S.Price) ≤ 1000" deeply into iterative frequent set computation, as sketched below
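A minimal illustration of pushing an anti-monotone constraint into candidate generation: a candidate that violates it is dropped before any support counting, since no superset can satisfy it either (the prices here are made up):

    price = {"a": 300, "b": 500, "c": 900}

    def ok(itemset, v=1000):
        # sum(S.Price) <= v is anti-monotone: once violated, stays violated
        return sum(price[i] for i in itemset) <= v

    candidates = [{"a"}, {"b"}, {"c"}, {"a", "b"}, {"a", "c"}, {"b", "c"}]
    kept = [c for c in candidates if ok(c)]
    print(kept)   # {'a','c'} and {'b','c'} are dropped before any counting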

Characterization of Anti-Monotonicity

  Constraint                      Anti-monotone?
  S θ v, θ ∈ {=, ≤, ≥}            yes
  v ∈ S                           no
  S ⊇ V                           no
  S ⊆ V                           yes
  S = V                           partly
  min(S) ≤ v                      no
  min(S) ≥ v                      yes
  min(S) = v                      partly
  max(S) ≤ v                      yes
  max(S) ≥ v                      no
  max(S) = v                      partly
  count(S) ≤ v                    yes
  count(S) ≥ v                    no
  count(S) = v                    partly
  sum(S) ≤ v                      yes
  sum(S) ≥ v                      no
  sum(S) = v                      partly
  avg(S) θ v, θ ∈ {=, ≤, ≥}       convertible
  (frequency constraint)          (yes)

Example of Convertible Constraints: avg(S) ≥ v
- Let R be the value-descending order over the set of items, e.g., I = {9, 8, 6, 4, 3, 1}
- avg(S) ≥ v is convertible monotone w.r.t. R: if S is a suffix of S1, then avg(S1) ≥ avg(S)
  - {8, 4, 3} is a suffix of {9, 8, 4, 3}: avg({9, 8, 4, 3}) = 6 ≥ avg({8, 4, 3}) = 5
  - If S satisfies avg(S) ≥ v, so does S1: {8, 4, 3} satisfies avg(S) ≥ 4, so does {9, 8, 4, 3}
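The suffix property is easy to sanity-check numerically:

    def avg(s):
        return sum(s) / len(s)

    suffix, pattern = [8, 4, 3], [9, 8, 4, 3]   # suffix w.r.t. descending R
    assert avg(suffix) == 5 and avg(pattern) == 6
    # if the suffix satisfies avg(S) >= 4, so does every pattern extending it
    assert avg(suffix) >= 4 and avg(pattern) >= 4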

Property of Constraints: Succinctness
- For any sets S1 and S2 satisfying C, S1 ∪ S2 satisfies C
- Given A1, the set of size-1 itemsets satisfying C, any set S satisfying C is based on A1, i.e., it contains a subset belonging to A1
- Examples:
  - sum(S.Price) ≥ v is not succinct
  - min(S.Price) ≤ v is succinct
- Optimization: if C is succinct, then C is pre-counting prunable; the satisfaction of the constraint alone is not affected by the iterative support counting

Characterization of Constraints by Succinctness

  Constraint                      Succinct?
  S θ v, θ ∈ {=, ≤, ≥}            yes
  v ∈ S                           yes
  S ⊇ V                           yes
  S ⊆ V                           yes
  S = V                           yes
  min(S) ≤ v                      yes
  min(S) ≥ v                      yes
  min(S) = v                      yes
  max(S) ≤ v                      yes
  max(S) ≥ v                      yes
  max(S) = v                      yes
  count(S) ≤ v                    weakly
  count(S) ≥ v                    weakly
  count(S) = v                    weakly
  sum(S) ≤ v                      no
  sum(S) ≥ v                      no
  sum(S) = v                      no
  avg(S) θ v, θ ∈ {=, ≤, ≥}       no
  (frequency constraint)          (no)

Why Is the Big Pie Still There?
- More on constraint-based mining of associations
  - Boolean vs. quantitative associations; association on discrete vs. continuous data
- From association to correlation and causal structure analysis
  - Association does not necessarily imply correlation or causal relationships
- From intra-transaction associations to inter-transaction associations
  - E.g., break the barriers of transactions (Lu et al., TOIS'99)
- From association analysis to classification and clustering analysis
  - E.g., clustering association rules

Summary
- Association rule mining is probably the most significant contribution from the database community to KDD
  - A large number of papers have been published, and many interesting issues have been explored
- An interesting research direction: association analysis in other types of data, such as spatial data, multimedia data, and time series data