Association Rule Mining Dr. P. Viswanath, RGMCET November 21, 2018 Data Mining: Concepts and Techniques
Overview
  Data Mining
  Association rule mining
  Apriori method
  Some other methods
  Conclusion
Data Mining
Data mining is the process of discovering previously unknown and potentially useful relationships among data elements in a large database. Techniques from statistics, pattern recognition, machine intelligence, and databases can all be used for this purpose, but scalability to large data sets is the main concern.
Data Mining: Some tasks
The relationships that can be discovered could be:
  a kind of rule between various elements
    Quantitative descriptive rule
    Quantitative discriminant rule
    Association rule
  natural groups among data items
    Data clustering
  a prediction about the future
    Time series analysis
Association Rule
An example: customers buy items from a supermarket, and the goods bought by each customer are stored in a database. Let the items be {A, B, C, …}.
Association Rule
A rule states that if a person buys a set of items {A,C,E}, then he/she will most likely also buy another set of items {D,F}. {A,C,E} → {D,F} is the association rule. E.g.: people who buy potato chips also buy cool drinks. Potato chips → cool drinks.
Association Rule
But how good are these rules? That is, how much can we trust them? Are they useful, that is, how frequently is a rule applicable?
Association Rule
{D} → {A} is an association rule. According to the given database, this rule holds [confidence is high]. But only one person bought both D and A [support is low].
Association Rule
{A} → {C} is an association rule. According to the given database, this rule holds only partly [confidence is not high]. But 2 out of 4 people bought both A and C [support is moderate].
Notation and Definitions
Let I be the set of all items, and let X, Y, … be subsets of I. We call X, Y, … itemsets. If X has k items, then X is called a k-itemset. If I has size n, that is, there are n items in total, then the total number of non-empty itemsets is 2^n - 1. An association rule is of the form X → Y.
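The 2^n - 1 count can be checked by enumerating the itemsets of a small item set. The sketch below is illustrative only; the function name all_itemsets and the four items are assumptions, not part of the slides.

```python
from itertools import combinations

def all_itemsets(items):
    """Yield every non-empty subset of `items` as a frozenset."""
    ordered = sorted(items)
    for k in range(1, len(ordered) + 1):
        for combo in combinations(ordered, k):
            yield frozenset(combo)

items = {"A", "B", "C", "D"}          # hypothetical item set I with n = 4
itemsets = list(all_itemsets(items))

# Number of k-itemsets for each k; the totals add up to 2**n - 1.
k_counts = {k: sum(1 for s in itemsets if len(s) == k) for k in range(1, 5)}
print(len(itemsets), k_counts)
```

Each additional item doubles the number of subsets, which is why the count grows as 2^n.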
Notation and Definitions
The support of the rule X → Y is the fraction of transactions that contain both X and Y. That is, support = (# transactions containing X and Y) / (total # of transactions). The confidence of the rule = (# transactions containing both X and Y) / (# transactions containing X). Very often these are given as percentages rather than fractions.
The Example
For the rule A → C: support = 0.5 (or 50%), confidence = 0.666 (or 66.6%).
Notation
Normally support is defined for an itemset: Support(X) = fraction (or percentage) of transactions containing X. Confidence is defined for a rule: Confidence(X → Y) = Support(X ∪ Y) / Support(X), where Support(X ∪ Y) counts the transactions containing all items of both X and Y.
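The two definitions can be sketched in a few lines of Python. The four-transaction database below is hypothetical: it was chosen only so that it agrees with the figures quoted on the slides for A → C and D → A. The function names are likewise assumptions.

```python
def support(itemset, transactions):
    """Fraction of transactions that contain every item of `itemset`."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

def confidence(lhs, rhs, transactions):
    """Support(lhs ∪ rhs) / Support(lhs); assumes lhs occurs at least once."""
    return support(set(lhs) | set(rhs), transactions) / support(lhs, transactions)

# Hypothetical database, consistent with the numbers quoted in the slides.
transactions = [{"A", "B", "C"}, {"A", "C"}, {"A", "D"}, {"B", "E", "F"}]

print(support({"A", "C"}, transactions))        # support of A -> C
print(confidence({"A"}, {"C"}, transactions))   # confidence of A -> C
print(confidence({"D"}, {"A"}, transactions))   # confidence of D -> A
```

Note that confidence is not symmetric: on this data C → A has a different confidence from A → C, although both rules share the same support.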
An Exercise Problem
Transaction Id    Items bought
100               A,B,C
101               B,C
102               A,C
103               A,B,D
104
105               A,C,E
106               B,D
107
Find the support and confidence of A → B.
Find the support and confidence of B → A.
The Problem
Given a transactional database, find all association rules that satisfy a given minimum support and minimum confidence.
The Problem
This problem boils down to two subproblems: (1) Find all itemsets whose support is more than the minimum value. This is called frequent itemset mining. (2) Find the association rules using the frequent itemsets.
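The second subproblem is the easier one. A minimal sketch, assuming the frequent itemsets and their supports are already available in a dictionary that contains every frequent itemset (so each left-hand side can be looked up). The function name generate_rules, the example supports, and the 0.8 confidence threshold are assumptions for illustration.

```python
from itertools import combinations

def generate_rules(frequent, min_conf):
    """frequent: dict mapping frozenset -> support (as a fraction).
    Returns a list of (lhs, rhs, support, confidence) tuples."""
    rules = []
    for itemset, sup in frequent.items():
        if len(itemset) < 2:
            continue  # a rule needs a non-empty left and right side
        for r in range(1, len(itemset)):
            for combo in combinations(sorted(itemset), r):
                lhs = frozenset(combo)
                rhs = itemset - lhs
                conf = sup / frequent[lhs]  # Support(X ∪ Y) / Support(X)
                if conf >= min_conf:
                    rules.append((lhs, rhs, sup, conf))
    return rules

# Hypothetical output of frequent itemset mining (supports as fractions).
frequent = {
    frozenset({"A"}): 0.75,
    frozenset({"B"}): 0.5,
    frozenset({"C"}): 0.5,
    frozenset({"A", "C"}): 0.5,
}
rules = generate_rules(frequent, min_conf=0.8)
for lhs, rhs, sup, conf in rules:
    print(sorted(lhs), "->", sorted(rhs), sup, conf)
```

No database scan is needed in this step: every confidence is a ratio of two supports that were already computed.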
The Problem
Frequent itemset mining is the more difficult problem: find all itemsets whose support is more than a given value. How difficult is this problem?
A simple algorithm
Let the minimum support be s% and let there be m transactions. An itemset is frequent if it is present in more than sm/100 transactions; sm/100 is the threshold count. A simple naïve algorithm for this is …
A Naive Algorithm
For each itemset, create a counter. Initialize all counters to zero. For each transaction in the database, find all subsets of the transaction and increment their respective counters. Select those itemsets whose counter value is more than the given threshold value.
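The naive method can be sketched as follows. Two deviations from the slide are assumptions made for brevity: counters are created lazily (a Counter) instead of one per possible itemset up front, and the threshold test is "at least" (>=), the common convention, where the slide says "more than" (which would be >). The database is hypothetical.

```python
from collections import Counter
from itertools import combinations

def naive_frequent_itemsets(transactions, min_count):
    """Single scan: count every non-empty subset of every transaction."""
    counters = Counter()
    for t in transactions:
        items = sorted(t)
        for k in range(1, len(items) + 1):
            for combo in combinations(items, k):
                counters[frozenset(combo)] += 1
    # Keep the itemsets whose count reaches the threshold.
    return {s: c for s, c in counters.items() if c >= min_count}

# Hypothetical database of four transactions; threshold count is 2.
transactions = [{"A", "B", "C"}, {"A", "C"}, {"A", "D"}, {"B", "E", "F"}]
frequent = naive_frequent_itemsets(transactions, min_count=2)
for itemset, count in frequent.items():
    print(sorted(itemset), count)
```

A transaction with w items contributes 2^w - 1 subsets, so the work per transaction is only small when transactions are short.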
Analysis of the Algorithm
If there are n items, then the total number of counters is 2^n - 1. If n is small (perhaps < 20), this is a feasible solution. But when n is large (like 1000), it is not feasible to create 2^1000 - 1 counters. As an exercise, try to find out how big this number is.
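One way to approach the exercise is to let Python's arbitrary-precision integers compute the number and count its decimal digits. This is only a sketch; the variable names are assumptions.

```python
n = 1000
n_counters = 2**n - 1              # one counter per non-empty itemset
n_digits = len(str(n_counters))    # number of decimal digits
print(n_digits)
```

Comparing the digit count with, say, the number of bytes of memory in any real machine makes the infeasibility concrete.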
Analysis
Time complexity is O(m) in the number of transactions m, treating the number of items per transaction as bounded [Good]. Number of database scans = only one [Good]. Space complexity is O(2^n) [Very Bad]. In data mining, the number of database scans is an important measure of scalability.
Other Naïve Method
The other way is to use only one counter and find the support of each itemset separately. For this, one has to scan the database 2^n - 1 times. Space complexity is reduced, but time complexity is increased.
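A minimal sketch of the one-counter variant, which also tallies how many database scans it performs. The function name, the hypothetical database, and the ">=" threshold test (the slides say "more than") are assumptions.

```python
from itertools import combinations

def one_counter_frequent_itemsets(transactions, items, min_count):
    """Test each itemset separately with a single counter and one full scan."""
    frequent = {}
    scans = 0
    ordered = sorted(items)
    for k in range(1, len(ordered) + 1):
        for combo in combinations(ordered, k):
            candidate = frozenset(combo)
            count = 0                      # the single counter, reused
            for t in transactions:         # one full database scan
                if candidate <= t:
                    count += 1
            scans += 1
            if count >= min_count:
                frequent[candidate] = count
    return frequent, scans

# Hypothetical database with n = 6 distinct items.
transactions = [{"A", "B", "C"}, {"A", "C"}, {"A", "D"}, {"B", "E", "F"}]
items = set().union(*transactions)
frequent, scans = one_counter_frequent_itemsets(transactions, items, min_count=2)
print(len(frequent), scans)
```

The scan count equals the number of itemsets, which is the time cost paid for using constant extra space.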
Apriori Algorithm
One of the earliest algorithms to solve this problem in a better way. It uses an important property of itemsets: every subset of a frequent itemset must also be frequent. I.e., if {A,B} is a frequent itemset, both {A} and {B} must also be frequent. If either {A} or {B} is not frequent, then {A,B} is not frequent either.
Apriori Algorithm
Some itemsets can be discarded at an early stage. For example, if X is a non-frequent itemset, then no superset of X needs to be considered. But if X is frequent, then a superset of X may also be frequent.
Apriori Algorithm
This is a bottom-up method: first find the frequent 1-itemsets, then the frequent 2-itemsets, and so on. Suppose we have already found the frequent k-itemsets; we call this set Lk.
Apriori Algorithm Continued …
From the frequent k-itemsets we generate candidates that may be frequent (k+1)-itemsets; we call this candidate set Ck+1. We then count the occurrences of these candidates in the database and obtain Lk+1.
How candidates are generated
If {A,B,C} and {A,B,D} are two itemsets in L3, then {A,B,C,D} is a candidate itemset in C4, provided all its subsets of size 3 are in L3. If, for example, {B,C,D} is not in L3, then {A,B,C,D} cannot be frequent and is removed from C4. [This is called the pruning step.]
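Candidate generation can be sketched as a join followed by the pruning step. The join rule used here (merge two sorted k-itemsets that share their first k-1 items) is an assumption; it is the usual convention and matches the {A,B,C} and {A,B,D} example, but the slide does not spell it out. The function name is also hypothetical.

```python
from itertools import combinations

def generate_candidates(L_k, k):
    """L_k: set of frozensets, each of size k. Returns the candidate set C_{k+1}."""
    sorted_sets = sorted(tuple(sorted(s)) for s in L_k)
    candidates = set()
    for i in range(len(sorted_sets)):
        for j in range(i + 1, len(sorted_sets)):
            a, b = sorted_sets[i], sorted_sets[j]
            if a[:k - 1] != b[:k - 1]:
                continue                   # join step: prefixes must match
            cand = frozenset(a) | frozenset(b)
            # Pruning step: every k-subset of the candidate must be in L_k.
            if all(frozenset(sub) in L_k for sub in combinations(cand, k)):
                candidates.add(cand)
    return candidates

L3_full = {frozenset("ABC"), frozenset("ABD"), frozenset("ACD"), frozenset("BCD")}
L3_missing = {frozenset("ABC"), frozenset("ABD"), frozenset("ACD")}  # no {B,C,D}

C4_full = generate_candidates(L3_full, 3)
C4_missing = generate_candidates(L3_missing, 3)
print(C4_full, C4_missing)
```

In the second call {A,B,C,D} is produced by the join but discarded by the pruning step, because {B,C,D} is not in L3.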
The Apriori Algorithm
Ck: candidate itemsets of size k
Lk: frequent itemsets of size k

Find L1;
for (k = 1; Lk != ∅; k++) do begin
    Ck+1 = candidates generated from Lk;
    for each transaction t in the database do
        increment the count of all candidates in Ck+1 that are contained in t;
    Lk+1 = candidates in Ck+1 with min_support;
end
return ∪k Lk;
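The pseudocode can be turned into a short runnable sketch. This is one possible reading, not a reference implementation: candidates are formed by uniting pairs from Lk that differ in a single item and then pruned, the threshold is an absolute count tested with ">=", and the database is hypothetical.

```python
from itertools import combinations

def apriori(transactions, min_count):
    """Return a dict mapping each frequent itemset to its count."""
    transactions = [frozenset(t) for t in transactions]

    # Find L1 with one scan.
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    L_k = {s: c for s, c in counts.items() if c >= min_count}

    frequent = dict(L_k)
    k = 1
    while L_k:
        # C_{k+1}: unions of two k-itemsets of size k+1, all k-subsets frequent.
        prev = set(L_k)
        candidates = set()
        for a, b in combinations(prev, 2):
            union = a | b
            if len(union) == k + 1 and all(
                frozenset(sub) in prev for sub in combinations(union, k)
            ):
                candidates.add(union)

        # One database scan to count the candidates contained in each transaction.
        counts = {c: 0 for c in candidates}
        for t in transactions:
            for c in candidates:
                if c <= t:
                    counts[c] += 1

        L_k = {s: c for s, c in counts.items() if c >= min_count}
        frequent.update(L_k)   # accumulate the union of all L_k
        k += 1
    return frequent

transactions = [{"A", "B", "C"}, {"A", "C"}, {"A", "D"}, {"B", "E", "F"}]
result = apriori(transactions, min_count=2)
for itemset, count in result.items():
    print(sorted(itemset), count)
```

On this data only three 2-itemsets are ever counted, because D, E and F are eliminated after the first scan.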
The Apriori Algorithm: Example
[Figure: worked example showing database D and the successive sets C1, L1, C2, L2, C3, L3, each obtained after a scan of D.]
Analysis of Apriori Algorithm
If the largest frequent itemset has size k, then we need to scan the database at least k times. The space required depends on the number of candidates generated. But this is certainly better than the naïve methods.
Exercise Problem
Transaction Id    Items bought
100               A,B,C,D,E
101               A,B,C,D,F
102               B,C,F
103               A,C,F,G
Let the minimum support required be 50%. Find all frequent itemsets using the Apriori algorithm. At each stage, show the candidates generated and describe how the Apriori property is used to prune the candidate set.
Methods to Improve Apriori’s Efficiency
  Hash-based itemset counting: a k-itemset whose corresponding hashing bucket count is below the threshold cannot be frequent.
  Transaction reduction: a transaction that does not contain any frequent k-itemset is useless in subsequent scans.
  Partitioning: any itemset that is potentially frequent in DB must be frequent in at least one of the partitions of DB.
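The partitioning idea can be sketched as a two-phase procedure: mine each partition locally, take the union of the local results as global candidates, then verify them with one more scan. Everything below is an illustrative assumption: the function names, the brute-force local miner (used only for brevity), the equal-size split, and the ">=" threshold test.

```python
from collections import Counter
from itertools import combinations
from math import ceil

def local_frequent(partition, min_frac):
    """Itemsets whose relative support within one partition reaches min_frac."""
    counts = Counter()
    for t in partition:
        items = sorted(t)
        for k in range(1, len(items) + 1):
            for combo in combinations(items, k):
                counts[frozenset(combo)] += 1
    threshold = min_frac * len(partition)
    return {s for s, c in counts.items() if c >= threshold}

def partitioned_frequent(transactions, n_parts, min_frac):
    size = ceil(len(transactions) / n_parts)
    parts = [transactions[i:i + size] for i in range(0, len(transactions), size)]

    # Phase 1: anything frequent in DB must be locally frequent somewhere.
    candidates = set()
    for part in parts:
        candidates |= local_frequent(part, min_frac)

    # Phase 2: one scan of the whole database to verify the candidates.
    counts = {c: 0 for c in candidates}
    for t in transactions:
        for c in candidates:
            if c <= t:
                counts[c] += 1
    threshold = min_frac * len(transactions)
    return {s: c for s, c in counts.items() if c >= threshold}

transactions = [{"A", "B", "C"}, {"A", "C"}, {"A", "D"}, {"B", "E", "F"}]
result = partitioned_frequent(transactions, n_parts=2, min_frac=0.5)
for itemset, count in result.items():
    print(sorted(itemset), count)
```

The local phase can report itemsets that are not globally frequent, which is why the verifying scan is needed; it cannot miss any that are.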
Methods to Improve Apriori’s Efficiency
  Sampling: mine a subset of the given data with a lower support threshold, plus a method to determine completeness.
  Dynamic itemset counting: add new candidate itemsets only when all of their subsets are estimated to be frequent.
Mining Frequent Patterns Without Candidate Generation
Compress a large database into a compact Frequent-Pattern tree (FP-tree) structure that is highly condensed but complete for frequent pattern mining, and that avoids costly database scans.
FP-tree based mining
Develop an efficient FP-tree-based frequent pattern mining method. It is a divide-and-conquer methodology: decompose mining tasks into smaller ones. It avoids candidate generation: sub-database test only!
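To make the structure concrete, here is a minimal sketch of FP-tree construction only (the mining phase is omitted). It uses two scans: one to count items, one to insert each transaction with its frequent items ordered by decreasing count. The class and function names, the alphabetical tie-breaking, the ">=" threshold, and the database are assumptions.

```python
from collections import Counter

class FPNode:
    def __init__(self, item, parent):
        self.item = item          # item label (None for the root)
        self.count = 0            # transactions passing through this node
        self.parent = parent
        self.children = {}        # item -> FPNode

def build_fp_tree(transactions, min_count):
    # Scan 1: count single items and keep the frequent ones.
    item_counts = Counter(item for t in transactions for item in set(t))
    frequent = {i: c for i, c in item_counts.items() if c >= min_count}

    root = FPNode(None, None)
    header = {i: [] for i in frequent}   # item -> nodes carrying that item

    # Scan 2: insert each transaction along a shared prefix path.
    for t in transactions:
        ordered = sorted(
            (i for i in set(t) if i in frequent),
            key=lambda i: (-frequent[i], i),   # most frequent first
        )
        node = root
        for item in ordered:
            child = node.children.get(item)
            if child is None:
                child = FPNode(item, node)
                node.children[item] = child
                header[item].append(child)
            child.count += 1
            node = child
    return root, header

transactions = [{"A", "B", "C"}, {"A", "C"}, {"A", "D"}, {"B", "E", "F"}]
root, header = build_fp_tree(transactions, min_count=2)
for item, nodes in sorted(header.items()):
    print(item, [n.count for n in nodes])
```

Transactions that share a prefix share a path, which is where the compression comes from; the per-item node lists give the entry points a mining phase would follow.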
Partition based methods
Partition the database and then apply divide-and-conquer strategies.
Summary
Association rule mining is probably the most significant contribution from the database community to KDD. A large number of papers have been published, and many interesting issues have been explored. An interesting research direction: association analysis in other types of data, such as spatial data, multimedia data, and time series data.
Thank you!