Tutorial on Assignment 3 in Data Mining 2008 Association Rule Mining

Tutorial on Assignment 3 in Data Mining 2008 Association Rule Mining
Gyozo Gidofalvi Uppsala Database Laboratory

Announcements Monday the 15th of December, has been added as a possible date for examination of assignment 3 and/or 4. Updated course home page with assignment 3 material. 11/30/2018 Gyozo Gidofalvi

Outline Association rules and Apriori-based frequent itemset mining
Frequent pattern growth based frequent itemset mining Amos II / AmosQL crash course Frequent itemset mining - elements of a DB implementation using Amos II / AmosQL The assignment 11/30/2018 Gyozo Gidofalvi

Association Rules and Apriori-Based Frequent Itemset Mining

Association Rules – The Basic Idea
By examining transactions, or shop carts, we can find which items are commonly purchased together. This knowledge can be used in advertising or in goods placement in stores. Association rules have the general form: I1  I2 where I1 and I2 are disjoint sets of items that for example can be purchased in a store. The rule should be read as: Given that someone has bought the items in the set I1 they are likely to also buy the items in the set I2 . 11/30/2018 Gyozo Gidofalvi

Large Itemsets An itemset is a set of single items from the database of transactions. If some items often occur together they can form an association rule. The support of an itemset expresses how often the itemset appears in a single transaction in the database, i.e., the support of an itemset I is the ratio of transactions in which the itemset is a subset. Number of transactions containing I Total number of transactions An itemset is considered to be a large itemset if its support is above some threshold. SuppI = 11/30/2018 Gyozo Gidofalvi

Finding the Large Itemsets
The Brute Force Approach Just take all items and form all possible combinations of them and count away. Unfortunately, this will take some time... A better approach: The Apriori Algorithm Basic idea: An itemset can only be a large itemset if all its subsets are large itemsets 11/30/2018 Gyozo Gidofalvi

Example Assume that we have a transaction database with 100 transactions and we have the items a, b and c. Assume that the minimum support is set to 0.60, which gives us a minimum support count of 60. Since the count of {b} is below the minimum support count no itemset containing b can be a large itemset. This means that when we look for itemsets of size 2 we do not have to look for any sets containing b, which in turn leaves us with the only possible candidate {a,c}. Itemset Count {a} 65 {b} 50 {c} 80 11/30/2018 Gyozo Gidofalvi

The Apriori Algorithm A general outline of the algorithm is the following: Count all itemsets of size K. Prune the unsupported itemsets of size K. Generate new candidates of size K+1. Prune the new candidates based on supported subsets. Repeat from 1 until no more candidates or large itemsets are found. When all supported itemsets have been found, generate the rules. 11/30/2018 Gyozo Gidofalvi

Candidate Generation Assume that we have the large itemsets of size 2
{a,b}, {b,c}, {a,c} and {c,d} From these sets we can form the candidates of size 3 {a,b,c}, {a,c,d} and {b,c,d} But... Why not {a,b,d}? Answer: 11/30/2018 Gyozo Gidofalvi

Candidate Pruning Again, assume that we have the large 2-itemsets
{a,b}, {b,c}, {a,c} and {c,d} And the 3-candidates {a,b,c}, {a,c,d} and {b,c,d} Can all of these 3-candidates be large itemsets? Answer: 11/30/2018 Gyozo Gidofalvi

Find Out if the Itemset is Large
When we have found our final candidates we can simply count all occurrences of the itemset in the transaction database and then remove those that have too low support count. If at least one of the candidates have enough support count we loop again until we can find no more candidates or large itemsets. 11/30/2018 Gyozo Gidofalvi

Rule Metrics When we have our large itemsets we want to form association rules from them. As we said earlier the association rule has the form I1  I2 The support, Supptot , of the rule is the support of the itemset Itot where Itot = I1 U I2 The confidence, C, of the rule is Supptot SuppI1 C = 11/30/2018 Gyozo Gidofalvi

Rule Metrics (cont.) How can we interpret the support and the confidence of a rule? The support is how common the rule is in the transaction database. The confidence is how often the left hand side of the rule is associated with the right hand side. In what situations can we have a rule with: High support and low confidence? Low support and high confidence? High support and high confidence? 11/30/2018 Gyozo Gidofalvi

Frequent Pattern Growth Based Frequent Itemset Mining

The Frequent Pattern Growth Approach to FIM
The bottleneck of Apriori is candidate generation and testing. High cost of mining long itemsets! Idea: Instead of bottom up, in a top-down fashion extend frequent patterns by adding a single locally frequent item to it Question: What does “locally” mean? Answer: To find the frequent itemsets that contain an item i, the only transactions that need to be considered are transactions that contain i. Definition: A frequent item i-related projected transaction table, is denoted as PT_i, that collects all frequent items (larger than i) in the transactions containing i. Let’s look at an example! 11/30/2018 Gyozo Gidofalvi

Example Transactions Relational format Frequent items Filtered
Items co-occuring with item 1 Frequent items Projected table on item 1 Items co-occuring with item 3 (and 1) Frequent items Projected table on item 3 (and 1) 11/30/2018 Gyozo Gidofalvi

Frequent itemset tree depth-first Discover all frequent itemsets by recursively filtering and projecting transactions in a depth-first-search manner until there are frequent items in the filtered/projected table. 11/30/2018 Gyozo Gidofalvi

Amos II / AmosQL Crash Course

Amos II / AmosQL Resources
Amos II: extensible, main-memory Object-Relational Database Management System (ORDBMS) Runs on PCs on Windows and Linux AmosQL: object-relational and functional query language Homepage: Tutorial: 11/30/2018 Gyozo Gidofalvi

Basic Amos II Data Model
11/30/2018 Gyozo Gidofalvi

Objects Regard database as set of objects
Built-in atomic types, literals: String, Integer, Real, Boolean Collection types: bags , e.g., bag(“Tore”) and vectors, e.g., {1,2,3} Surrogate types: Objects have unique object identifiers (OIDs), e.g. OID[45] Explicit creation and deletion AmosQL Example: Amos> create type Person; Amos> create Person instances :tore; creates new object, say OID[45], and binds environment variable :tore to it. 11/30/2018 Gyozo Gidofalvi

Types Classification of objects
Objects are grouped by the types to which they belong (are instances of) Organized in type/subtype Directed Acyclic Graph Defines that OIDs of one type is subset of OIDs of other types 11/30/2018 Gyozo Gidofalvi

Functions Define semantics of objects: More than one argument allowed
Properties of objects Relationships among objects (mappings between arguments and results) Views on objects Stored procedures for objects More than one argument allowed Bag valued results allowed, e.g, FrequentItems Tuple valued results allowed 11/30/2018 Gyozo Gidofalvi

Vectors Constructor: {} Indexing: Conversion between vectors and bags:
Amos> set :v = {1,2,3}; Indexing: Amos> :v[1]; Conversion between vectors and bags: Amos> in(:v); Amos> vectorof(in(:v)); Useful vector operators: dim, concat 11/30/2018 Gyozo Gidofalvi

Defining Set Operations On Vectors
Amos> create function vintersect(vector v1, vector v2) -> vector as sort(select i from object i where i in v1 and i in v2); Amos> create function vdiff(vector v1, vector v2) -> vector and notany (select where i in v2)); Amos> create function vunion(vector v1, vector v2) -> vector as sort(select distinct i or i in v2); Amos> create function vcontain (vector v1, vector v2) -> boolean as select true where dim(v2) = dim(vintersect(v1, v2)); 11/30/2018 Gyozo Gidofalvi

Example: Association Rule DB
Amos> create function FIS(vector) -> integer as stored; Amos> add FIS({'beer'})=5; Amos> add FIS({'other'})=2; Amos> add FIS({'diapers'})=3; Amos> add FIS({'beer','diapers'})=2; Amos> add FIS({'beer','other'})=1; Amos> add FIS({'beer','diapers','other'})=1; Amos> add FIS({'diapers','other'})=1; Amos> create function showFIS() -> bag of <vector, integer> as select is, supp from vector is, integer supp where FIS(is)=supp; Amos> create function ARconf(vector prec, vector cons) -> real as FIS(vunion(prec,cons))/FIS(prec); Amos> ARconf({'beer'},{'diapers'}); Amos> ARconf({‘diapers'},{'beer'}); 11/30/2018 Gyozo Gidofalvi

Frequent Itemset Mining – Elements of a DB Implementation Using Amos II / AmosQL

Transactions Function (table) to store transactions:
Amos> create function transact(integer tid)->bag of integer as stored; Population of the transaction table: Amos> add transact(1)=in({1,2,3}); remove transact(1)=2; Selection of a transaction: Amos> transact(1); Transactions as a bag of tuples: Amos> create function transact_bt()->bag of <integer, integer> as select tid, item from integer tid, integer item where transact(tid)=item; Materialized transaction table as a vector: Amos> create function transact_vt()->vector of <integer, integer> as vectorof(transact_bt()); 11/30/2018 Gyozo Gidofalvi

Support Counting Function to calculate the supports of items in vt:
Amos> create function itemsupps_vt(vector of <integer, integer> vt) ->bag of <integer, integer> as groupby ((select i,t from integer i, integer t where <t,i> in vt), #'count'); 11/30/2018 Gyozo Gidofalvi

Item-Related Projection
Function to calculate the pitem-related projected minsupp-filtered materialized transaction table Amos> create function irpft(vector of <integer, integer> vt, integer minsupp, integer pitem) ->vector of <integer, integer> as vselect tid, item from integer tid, integer item, integer i, integer s where <tid, item> in vt and freqitems_vt(vt, minsupp)=<i,s> and item = i and tid = itemsuppby_vt(vt, pitem) and item > pitem; 11/30/2018 Gyozo Gidofalvi

Subset Generation Function to filter a vector v to contain only elements that are larger than p Amos> create function vlarger(vector of integer v, integer p) ->vector of integer as vselect c from integer c where c in v and c>p; Recursive function to calculate all subsets of the set of elements in a vector c (to be called with empty vector p) Amos> create function subsets_rec(vector of integer p, vector of integer c) ->vector of integer as begin result (select res from vector of integer res, vector of integer newp, integer i where res = p or (res = subsets_rec(newp, vlarger(c,i)) and newp = concat(p,{i}) and i in c)); end; 11/30/2018 Gyozo Gidofalvi

The Assignment

The Assignment You can choose your favorite programming language, but NOT MATLAB since it is not suitable for this assignment. Since you can choose your own language you cannot expect help in language issues. You are going to run the solution on UNIX so please make sure that your solution is runnable on the University's UNIX system. The program must run in a few minutes since we are going to run it during the examination. Too slow programs will be rejected. 11/30/2018 Gyozo Gidofalvi

The Assignment (cont.) The algorithm is easy to get wrong and then you will get a superexponential behaviour that causes the execution time to blow up! Please make sure you understand the algorithm before you start programming. There are example runs on the course's home page. Your solution must be able to get these results before the examination. (Oh, and we will make other trial runs ;-) Finally, please read the instructions carefully, especially the parts about the command line interface and the rule sorting. 11/30/2018 Gyozo Gidofalvi

Exercise Find the rules with support 0.5 and confidence 0.75 in the following database: TID Transaction 1 {a b c d} 2 {a c d} 3 {a b c} 4 {b c d} 5 6 7 {c d e} 8 {a c} 11/30/2018 Gyozo Gidofalvi

Reference and Further Reading
Mining Associations between Sets of Items in Large Databases. R. Agrawal, T. Imielinski, and A. Swami. In Proceedings of the ACM SIGMOD International Conference on the Management of Data, pp , May 1993. Fast Algorithms for Mining Association Rules. R. Agrawal and R. Srikant. In Proceedings of the 20th International Conference on Very Large Databases, pp , September 1994. An Effective Hash Based Algorithm for Mining Association Rules. J. S. Park, M.-S. Chen, and P. S. Yu. In Proceedings of the ACM SIGMOD International Conference on the Management of Data, pp , May 1995. Mining Frequent Patterns without Candidate Generation. J. Han, H. Pei, and Y. Yin. In Proceedings of the ACM-SIGMOD International Conference on Management of Data (SIGMOD'00), pp. 1-12, May 2000. Efficient Frequent Pattern Mining in Relational Databases. X. Shang, K.-U. Sattler, and I. Geist. 5. Workshop des GI-Arbeitskreis Knowledge Discovery (AK KD) im Rahmen der LWA 2004. 11/30/2018 Gyozo Gidofalvi

Tutorial on Assignment 3 in Data Mining 2008 Association Rule Mining

Similar presentations

Presentation on theme: "Tutorial on Assignment 3 in Data Mining 2008 Association Rule Mining"— Presentation transcript:

Similar presentations

About project

Feedback

Log in

Auth with social network:

Tutorial on Assignment 3 in Data Mining 2008 Association Rule Mining

Similar presentations

Presentation on theme: "Tutorial on Assignment 3 in Data Mining 2008 Association Rule Mining"— Presentation transcript:

Similar presentations

About project

Feedback