
Advanced Database Systems F24DS2 / F29AT2
The Apriori Algorithm
David Corne, room EM G.39, x 3410; any questions, feel free to contact me.

Reading
The main technical material in this lecture (the Apriori algorithm and its variants) is based on: Fast Algorithms for Mining Association Rules, by Rakesh Agrawal and Ramakrishnan Srikant, IBM Almaden Research Center. You don't have to read that paper, but why not read it anyway; the pdf is on my teaching resources page.

We will be talking about datasets like the one on the following page. This is a transaction database, where each record represents a transaction between (usually) a customer and a shop. Each record in a supermarket's transaction DB, for example, corresponds to a basket of specific items.

[Transaction table: each record has an ID and a basket drawn from the items apples, beer, cheese, dates, eggs, fish, glue, honey, ice-cream; the individual rows did not survive extraction.]
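Since the table itself was lost, here is a purely hypothetical Python stand-in, just to show the shape of such a database (the transactions below are invented for illustration, not the slide's data):

    # Hypothetical stand-in for the slide's transaction DB: each record
    # is the set of items in one customer's basket.
    db = [
        frozenset({"apples", "beer", "cheese", "dates"}),
        frozenset({"beer", "eggs", "fish"}),
        frozenset({"apples", "cheese", "honey", "ice-cream"}),
        frozenset({"beer", "cheese", "glue"}),
    ]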

And we will also be talking about things like rules, confidence and coverage (see last lecture). But note that we will now talk about itemsets instead of rules. Also, the coverage of a rule is the same as the support of an itemset. Don't get confused!

Find rules in two stages
Agrawal and colleagues divided the problem of finding good rules into two phases:
1. Find all itemsets with a specified minimal support (coverage). An itemset is just a specific set of items, e.g. {apples, cheese}. The Apriori algorithm can efficiently find all itemsets whose coverage is above a given minimum.
2. Use these itemsets to help generate interesting rules. Having done stage 1, we have considerably narrowed down the possibilities, and can do reasonably fast processing of the large itemsets to generate candidate rules.

Terminology
k-itemset: a set of k items. E.g. {beer, cheese, eggs} is a 3-itemset, {cheese} is a 1-itemset, {honey, ice-cream} is a 2-itemset.
support: an itemset has support s% if s% of the records in the DB contain that itemset.
minimum support: the Apriori algorithm starts with the specification of a minimum level of support, and will focus on itemsets with this level or above.
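To make the support definition concrete, here is a minimal Python sketch (the function name and the list-of-frozensets DB representation are assumptions of mine, not from the slides):

    def support(itemset, db):
        # percentage of records in the DB that contain every item in `itemset`
        hits = sum(1 for record in db if itemset <= record)
        return 100.0 * hits / len(db)

    # e.g. support(frozenset({"beer", "cheese"}), db)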

Terminology
large itemset: doesn't mean an itemset with many items. It means one whose support is at least minimum support.
L_k: the set of all large k-itemsets in the DB.
C_k: a set of candidate large k-itemsets. The algorithm we will look at generates this set, which contains all the k-itemsets that might be large, and from it eventually generates L_k.

Terminology: sets
Let A be a set (A = {cat, dog}), let B be a set (B = {dog, eel, rat}), and let C = {eel, rat}. I use 'A + B' to mean A union B. So A + B = {cat, dog, eel, rat}. When X is a subset of Y, I use Y - X to mean the set of things in Y which are not in X. E.g. B - C = {dog}.
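The same set operations, written directly with Python's built-in sets:

    A = {"cat", "dog"}
    B = {"dog", "eel", "rat"}
    C = {"eel", "rat"}
    print(A | B)   # A + B, i.e. the union: {cat, dog, eel, rat}
    print(B - C)   # set difference: {dog}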

[Transaction table over the items a, b, c, d, e, f, g, h, i; the rows did not survive extraction.] E.g. the 3-itemset {a, b, h} has support 15%, the 2-itemset {a, i} has support 0%, and the 4-itemset {b, c, d, h} has support 5%. If minimum support is 10%, then {b} is a large itemset, but {b, c, d, h} is a small itemset!

The Apriori algorithm for finding large itemsets efficiently in big DBs
1: Find all large 1-itemsets
2: For (k = 2; while L_{k-1} is non-empty; k++)
3:    { C_k = apriori-gen(L_{k-1})
4:      For each c in C_k, initialise c.count to zero
5:      For all records r in the DB
6:        { C_r = subset(C_k, r); for each c in C_r, c.count++ }
7:      Set L_k := all c in C_k whose count >= minsup
8:    } /* end; return all of the L_k sets */
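As a companion to the pseudocode, here is a minimal Python sketch of the whole loop. This is my own rendering, not code from the slides or the paper: the DB is assumed to be a list of frozensets, minsup is taken as an absolute count, and apriori_gen is the candidate-generation routine sketched later in these slides.

    def apriori(db, minsup):
        # Line 1: find all large 1-itemsets with a single scan of the DB
        counts = {}
        for r in db:
            for item in r:
                c = frozenset([item])
                counts[c] = counts.get(c, 0) + 1
        L = {1: {s for s, n in counts.items() if n >= minsup}}
        k = 2
        while L[k - 1]:                      # line 2
            Ck = apriori_gen(L[k - 1], k)    # line 3: candidate k-itemsets
            count = {c: 0 for c in Ck}       # line 4
            for r in db:                     # lines 5-6: one pass over the DB
                for c in Ck:
                    if c <= r:               # c is contained in record r
                        count[c] += 1
            L[k] = {c for c in Ck if count[c] >= minsup}   # line 7
            k += 1
        return [s for s in L.values() if s]  # line 8: all non-empty L_k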

Explaining the Apriori Algorithm ...
1: Find all large 1-itemsets
To start off, we simply find all of the large 1-itemsets. This is done by a basic scan of the DB. We take each item in turn, and count the number of times that item appears in a basket. In our running example, suppose minimum support was 60%; then the only large 1-itemsets would be {a}, {b}, {c}, {d} and {f}. So we get L_1 = {{a}, {b}, {c}, {d}, {f}}.

Explaining the Apriori Algorithm ...
1: Find all large 1-itemsets
2: For (k = 2; while L_{k-1} is non-empty; k++)
We already have L_1. This next bit just means that the remainder of the algorithm generates L_2, L_3, and so on until we get to an L_k that's empty. They are generated as follows:

Explaining the Apriori Algorithm ...
1: Find all large 1-itemsets
2: For (k = 2; while L_{k-1} is non-empty; k++)
3:    { C_k = apriori-gen(L_{k-1})
Given the large (k-1)-itemsets, this step generates some candidate k-itemsets that might be large. Because of how apriori-gen works, the set C_k is guaranteed to contain all the large k-itemsets, but it also contains some that will turn out not to be 'large'.

Explaining the Apriori Algorithm ...
1: Find all large 1-itemsets
2: For (k = 2; while L_{k-1} is non-empty; k++)
3:    { C_k = apriori-gen(L_{k-1})
4:      For each c in C_k, initialise c.count to zero
We are going to work out the support for each of the candidate k-itemsets in C_k, by working out how many times each of these itemsets appears in a record in the DB. This step starts us off by initialising these counts to zero.

Explaining the Apriori Algorithm ...
1: Find all large 1-itemsets
2: For (k = 2; while L_{k-1} is non-empty; k++)
3:    { C_k = apriori-gen(L_{k-1})
4:      For each c in C_k, initialise c.count to zero
5:      For all records r in the DB
6:        { C_r = subset(C_k, r); for each c in C_r, c.count++ }
We now take each record r in the DB and do this: get all the candidate k-itemsets from C_k that are contained in r. For each of these, update its count.
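Line 6's subset operation can be sketched in one line of Python (the name and representation follow the hypothetical main-loop sketch above):

    def subset(Ck, r):
        # return the candidate itemsets in Ck that are contained in record r
        return [c for c in Ck if c <= r]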

Explaining the Apriori Algorithm ...
1: Find all large 1-itemsets
2: For (k = 2; while L_{k-1} is non-empty; k++)
3:    { C_k = apriori-gen(L_{k-1})
4:      For each c in C_k, initialise c.count to zero
5:      For all records r in the DB
6:        { C_r = subset(C_k, r); for each c in C_r, c.count++ }
7:      Set L_k := all c in C_k whose count >= minsup
Now we have the count for every candidate. Those whose count is big enough are valid large itemsets of the right size. We therefore now have L_k. We now go back into the for loop of line 2 and start working towards finding L_{k+1}.

Explaining the Apriori Algorithm ...
1: Find all large 1-itemsets
2: For (k = 2; while L_{k-1} is non-empty; k++)
3:    { C_k = apriori-gen(L_{k-1})
4:      For each c in C_k, initialise c.count to zero
5:      For all records r in the DB
6:        { C_r = subset(C_k, r); for each c in C_r, c.count++ }
7:      Set L_k := all c in C_k whose count >= minsup
8:    } /* end; return all of the L_k sets */
We finish at the point where we get an empty L_k. The algorithm returns all of the (non-empty) L_k sets, which gives us an excellent start in finding interesting rules (although the large itemsets themselves will usually be very interesting and useful).

apriori-gen: notes
Suppose we have worked out that the large 2-itemsets are: L_2 = {{milk, noodles}, {milk, tights}, {noodles, quorn}}. apriori-gen now generates 3-itemsets that all may be large. An obvious way to do this would be to generate all of the possible 3-itemsets that you can make from {milk, noodles, tights, quorn}. But this would include, for example, {milk, tights, quorn}. Now, if this really was a large 3-itemset, that would mean the number of records containing all three is >= minsup; this means it would have to be true that the number of records containing {tights, quorn} is >= minsup. But it can't be, because this is not one of the large 2-itemsets.

apriori-gen: the join step
apriori-gen is clever in generating not too many candidate large itemsets, while making sure not to lose any that do turn out to be large. To explain it, we need to note that there is always an ordering of the items. We will assume alphabetical order, and that the data structures used always keep members of a set in alphabetical order; a < b will mean that a comes before b in alphabetical order. Suppose we have L_k and wish to generate C_{k+1}. First we take every distinct pair of sets in L_k, {a_1, a_2, ..., a_k} and {b_1, b_2, ..., b_k}, and do this: in all cases where {a_1, a_2, ..., a_{k-1}} = {b_1, b_2, ..., b_{k-1}} and a_k < b_k, {a_1, a_2, ..., a_k, b_k} is a candidate (k+1)-itemset.
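A Python sketch of the join step (again my own rendering; it relies on each itemset being kept in sorted order, as the slide assumes):

    def join(L_prev):
        # L_prev: the set of large (k-1)-itemsets, each a frozenset
        candidates = set()
        ordered = [sorted(s) for s in L_prev]
        for a in ordered:
            for b in ordered:
                # same first k-2 items, and a's last item precedes b's
                if a[:-1] == b[:-1] and a[-1] < b[-1]:
                    candidates.add(frozenset(a + [b[-1]]))
        return candidates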

An illustration of that
Suppose the 2-itemsets are: L_2 = {{milk, noodles}, {milk, tights}, {noodles, quorn}, {noodles, peas}, {noodles, tights}}. The pairs that satisfy the condition {a_1, ..., a_{k-1}} = {b_1, ..., b_{k-1}} and a_k < b_k are:
{milk, noodles} | {milk, tights}
{noodles, peas} | {noodles, quorn}
{noodles, peas} | {noodles, tights}
{noodles, quorn} | {noodles, tights}
So the candidate 3-itemsets are: {milk, noodles, tights}, {noodles, peas, quorn}, {noodles, peas, tights}, {noodles, quorn, tights}.

apriori-gen: the prune step
Now we have some candidate (k+1)-itemsets, and are guaranteed to have all of the ones that could possibly be large, but we have the chance to prune out some more before we enter the next stage of Apriori that counts their support. In the prune step, we take the candidate (k+1)-itemsets we have, and remove any for which some k-subset of it is not a large k-itemset. Such an itemset couldn't possibly be a large (k+1)-itemset. E.g. in the current example (with n = noodles, etc.), we have:
L_2 = {{milk, n}, {milk, tights}, {n, quorn}, {n, peas}, {n, tights}}
and the candidate 3-itemsets so far: {m, n, t}, {n, p, q}, {n, p, t}, {n, q, t}.
Now, {p, q} is not a large 2-itemset, so {n, p, q} is pruned; {p, t} is not a large 2-itemset, so {n, p, t} is pruned; {q, t} is not a large 2-itemset, so {n, q, t} is pruned. After this we finally have C_3 = {{milk, noodles, tights}}.
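Combining the join and prune steps gives the apriori_gen routine assumed in the earlier main-loop sketch; once more, this is an illustrative rendering under those assumptions, not the paper's code:

    from itertools import combinations

    def apriori_gen(L_prev, k):
        pruned = set()
        for c in join(L_prev):
            # keep c only if every (k-1)-subset of c is itself a large itemset
            if all(frozenset(s) in L_prev for s in combinations(c, k - 1)):
                pruned.add(c)
        return pruned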

Understanding rules
The Apriori algorithm finds interesting (i.e. frequent) itemsets. E.g. it may find that {apples, bananas, milk} has coverage 30%; so 30% of transactions contain all three of these things. What can you say about the coverage of {apples, milk}? (It must be at least 30%, since every basket containing all three items certainly contains these two.) We can invent several potential rules, e.g.: IF basket contains apples and bananas, THEN it also contains milk. Suppose the support of {apples, bananas} is 40%; what is the confidence of this rule?

Understanding rules II
Suppose itemset A = {beer, cheese, eggs} has 30% support in the DB; {beer, cheese} has 40%, {beer, eggs} has 30%, {cheese, eggs} has 50%, and each of beer, cheese and eggs alone has 50% support.
What is the confidence of: IF basket contains beer and cheese, THEN basket also contains eggs?
The confidence of a rule 'if A then B' is simply: support(A + B) / support(A). So it's 30/40 = 0.75; this rule has 75% confidence.
What is the confidence of: IF basket contains beer, THEN basket also contains cheese and eggs?
30/50 = 0.6, so this rule has 60% confidence.
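The confidence formula is a one-line computation; here is a small Python check of the two answers above (the function name is mine):

    def confidence(support_ab, support_a):
        # confidence of "IF A THEN B" = support(A + B) / support(A)
        return support_ab / support_a

    print(confidence(30, 40))  # beer & cheese => eggs : 0.75
    print(confidence(30, 50))  # beer => cheese & eggs : 0.6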

Understanding rules III
Suppose the rule 'If A then B' has confidence c, and suppose support(A) = 2 * support(B). What can be said about the confidence of 'If B then A'?
The confidence c is support(A + B) / support(A) = support(A + B) / (2 * support(B)). Let d be the confidence of 'If B then A': d = support(A + B) / support(B). Clearly, d = 2c. E.g. A might be milk and B might be newspapers.
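A quick numeric check of that identity, with numbers chosen by me purely for illustration: if support(B) = 20%, support(A) = 40% and support(A + B) = 10%, then c = 10/40 = 0.25 and d = 10/20 = 0.5, which is indeed 2c.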

What this lecture was about
The Apriori algorithm for efficiently finding frequent large itemsets in large DBs; associated terminology; and associated notes about rules, and working out the confidence of a rule based on the support of its component itemsets.

Appendix

A full run-through of Apriori
[Transaction table: 20 records (IDs) over the items a, b, c, d, e, f, g; the rows did not survive extraction.] We will assume this is our transaction database D, and we will assume minsup is 4 (20%). This will not be run through in the lecture; it is here to help with revision.

First we find all the large 1-itemsets, i.e., in this case, all the 1-itemsets that are contained by at least 4 records in the DB. In this example, that's all of them. So, L_1 = {{a}, {b}, {c}, {d}, {e}, {f}, {g}}.
Now we set k = 2 and run apriori-gen to generate C_2. The join step when k = 2 just gives us the set of all alphabetically ordered pairs from L_1, and we cannot prune any away, so we have:
C_2 = {{a, b}, {a, c}, {a, d}, {a, e}, {a, f}, {a, g}, {b, c}, {b, d}, {b, e}, {b, f}, {b, g}, {c, d}, {c, e}, {c, f}, {c, g}, {d, e}, {d, f}, {d, g}, {e, f}, {e, g}, {f, g}}

So we have C_2 = {{a, b}, {a, c}, {a, d}, {a, e}, {a, f}, {a, g}, {b, c}, {b, d}, {b, e}, {b, f}, {b, g}, {c, d}, {c, e}, {c, f}, {c, g}, {d, e}, {d, f}, {d, g}, {e, f}, {e, g}, {f, g}}.
Line 4 of the Apriori algorithm now tells us to set a counter for each of these to 0. Line 5 now prepares us to take each record in the DB in turn, and find which of the itemsets in C_2 are contained in it. The first record r1 is {a, b, d, g}; those of C_2 it contains are {a, b}, {a, d}, {a, g}, {b, d}, {b, g}, {d, g}. Hence C_r1 = {{a, b}, {a, d}, {a, g}, {b, d}, {b, g}, {d, g}}, and the rest of line 6 tells us to increment the counters of these itemsets. The second record r2 is {c, d, e}; C_r2 = {{c, d}, {c, e}, {d, e}}, and we increment the counters for these three itemsets. ... After all 20 records, we look at the counters, and in this case we will find that the itemsets whose counters are >= minsup (4) are {a, c}, {a, d}, {c, d}, {c, e}, {c, f}. So, L_2 = {{a, c}, {a, d}, {c, d}, {c, e}, {c, f}}.

So we have L_2 = {{a, c}, {a, d}, {c, d}, {c, e}, {c, f}}. We now set k = 3 and run apriori-gen on L_2. The join step finds the following pairs that meet the required pattern:
{a, c} : {a, d}
{c, d} : {c, e}
{c, d} : {c, f}
{c, e} : {c, f}
This leads to the candidate 3-itemsets {a, c, d}, {c, d, e}, {c, d, f} and {c, e, f}.
We prune {c, d, e} since {d, e} is not in L_2.
We prune {c, d, f} since {d, f} is not in L_2.
We prune {c, e, f} since {e, f} is not in L_2.
We are left with C_3 = {{a, c, d}}. We now run lines 5-7, to count how many records contain {a, c, d}. The count is 4, so L_3 = {{a, c, d}}.

So we have L_3 = {{a, c, d}}. We now set k = 4, but when we run apriori-gen on L_3 we get the empty set, and hence eventually we find L_4 = {}. This means we now finish, and return the set of all of the non-empty L_k sets; these are all of the large itemsets:
Result = {{a}, {b}, {c}, {d}, {e}, {f}, {g}, {a, c}, {a, d}, {c, d}, {c, e}, {c, f}, {a, c, d}}
Each large itemset is intrinsically interesting, and may be of business value. Simple rule-generation algorithms can now use the large itemsets as a starting point.