Association Mining
Data Mining, Spring 2012
Transactional Database
Transaction – a row in the database, e.g. {Eggs, Cheese, Milk}
Transactional dataset:
T1: {Eggs, Cheese, Milk}
T2: {Jam}
T3: {Cheese, Bacon, Eggs, Cat food}
T4: {Butter, Bread}
T5: {Butter, Eggs, Milk, Cheese}
Items and Itemsets
Item – a single element, e.g. {Milk}, {Cheese}, {Bread}
Itemset – a set of items, e.g. {Milk}, {Milk, Cheese}, {Bacon, Bread, Milk}
An itemset does not have to appear in the dataset, and can be of size 1 to n.
Transactional dataset: {Eggs, Cheese, Milk}, {Jam}, {Cheese, Bacon, Eggs, Cat food}, {Butter, Bread}, {Butter, Eggs, Milk, Cheese}
The Support Measure
Support Examples
Support(X) = (# transactions containing X) / (total # transactions)
Support({Eggs}) = 3/5 = 60%
Support({Eggs, Milk}) = 2/5 = 40%
Transactional dataset: {Eggs, Cheese, Milk}, {Jam}, {Cheese, Bacon, Eggs, Cat food}, {Butter, Bread}, {Butter, Eggs, Milk, Cheese}
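As a sketch, the support computation is a few lines of Python. The `dataset` below is the five-transaction table from these slides, and the function name `support` is my own:

```python
# The five transactions from the slides, each as a set of items.
dataset = [
    {"Eggs", "Cheese", "Milk"},
    {"Jam"},
    {"Cheese", "Bacon", "Eggs", "Cat food"},
    {"Butter", "Bread"},
    {"Butter", "Eggs", "Milk", "Cheese"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item of `itemset`."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

print(support({"Eggs"}, dataset))          # 3/5 = 0.6
print(support({"Eggs", "Milk"}, dataset))  # 2/5 = 0.4
```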
Minimum Support
Minsup – the minimum support threshold for an itemset to be considered frequent (user defined).
Frequent itemset – an itemset in a database whose support is greater than or equal to minsup.
Support(X) ≥ minsup → frequent
Support(X) < minsup → infrequent
Minimum Support Examples
Minimum support = 50%
Support({Eggs}) = 3/5 = 60% → Pass
Support({Eggs, Milk}) = 2/5 = 40% → Fail
Transactional dataset: {Eggs, Cheese, Milk}, {Jam}, {Cheese, Bacon, Eggs, Cat food}, {Butter, Bread}, {Butter, Eggs, Milk, Cheese}
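A minimal frequent/infrequent check, assuming the same reconstructed dataset and a user-supplied `minsup` fraction (helper names are mine):

```python
# Reconstructed five-transaction dataset from the slides.
dataset = [
    {"Eggs", "Cheese", "Milk"},
    {"Jam"},
    {"Cheese", "Bacon", "Eggs", "Cat food"},
    {"Butter", "Bread"},
    {"Butter", "Eggs", "Milk", "Cheese"},
]

def support(itemset, transactions):
    itemset = set(itemset)
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def is_frequent(itemset, transactions, minsup):
    """Support(X) >= minsup means X is frequent."""
    return support(itemset, transactions) >= minsup

print(is_frequent({"Eggs"}, dataset, 0.5))          # pass: 60% >= 50%
print(is_frequent({"Eggs", "Milk"}, dataset, 0.5))  # fail: 40% < 50%
```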
Association Rules
Confidence Example 1
Confidence(X => Y) = support(X ∪ Y) / support(X)
{Eggs} => {Bread}
Confidence = support({Eggs, Bread}) / support({Eggs})
Confidence = (0/5) / (3/5) = 0%
Transactional dataset: {Eggs, Cheese, Milk}, {Jam}, {Cheese, Bacon, Eggs, Cat food}, {Butter, Bread}, {Butter, Eggs, Milk, Cheese}
Confidence Example 2
{Milk} => {Eggs, Cheese}
Confidence = support({Milk, Eggs, Cheese}) / support({Milk})
Confidence = (2/5) / (2/5) = 100%
Transactional dataset: {Eggs, Cheese, Milk}, {Jam}, {Cheese, Bacon, Eggs, Cat food}, {Butter, Bread}, {Butter, Eggs, Milk, Cheese}
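Confidence is just a ratio of two support values; a sketch (helper names are mine, dataset as reconstructed from the slides):

```python
# Reconstructed five-transaction dataset from the slides.
dataset = [
    {"Eggs", "Cheese", "Milk"},
    {"Jam"},
    {"Cheese", "Bacon", "Eggs", "Cat food"},
    {"Butter", "Bread"},
    {"Butter", "Eggs", "Milk", "Cheese"},
]

def support(itemset, transactions):
    itemset = set(itemset)
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def confidence(X, Y, transactions):
    """conf(X => Y) = support(X union Y) / support(X)."""
    return support(set(X) | set(Y), transactions) / support(set(X), transactions)

print(confidence({"Milk"}, {"Eggs", "Cheese"}, dataset))
print(confidence({"Eggs"}, {"Bread"}, dataset))
```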
Strong Association Rules
Minimum confidence (minconf) – a user-defined lower bound on confidence.
Strong association rule – a rule X => Y whose confidence is at least minconf; this is a potentially interesting rule for the user.
Conf(X => Y) ≥ minconf → strong
Conf(X => Y) < minconf → uninteresting
Minimum Confidence Example
Minconf = 50%
{Eggs} => {Bread}: Confidence = (0/5)/(3/5) = 0% → Fail
{Milk} => {Eggs, Cheese}: Confidence = (2/5)/(2/5) = 100% → Pass
Association Mining
Association mining finds the strong rules contained in a dataset, starting from its frequent itemsets. It can be divided into two major subtasks:
1. Frequent itemset generation
2. Rule generation
Transactional Database Revisited
Some algorithms map items to letters or numbers:
- Numbers are more compact
- Numbers are easier to compare
Basic Set Logic
Subset – an itemset X is contained in an itemset Y (X ⊆ Y).
Superset – an itemset Y contains an itemset X (Y ⊇ X).
Example: X = {1,2}, Y = {1,2,3,5}: here X ⊂ Y.
Apriori
Arranges the database into a temporary lattice structure to find associations.
Apriori principle:
1. An itemset in the lattice with support < minsup can only produce supersets with support < minsup.
2. The subsets of a frequent itemset are always frequent.
Apriori prunes non-frequent itemsets from the lattice structure using minsup, which:
- Reduces the number of comparisons
- Reduces the number of candidate itemsets
Monotonicity
Monotone (upward closed) – a measure f where X ⊆ Y implies f(X) ≤ f(Y).
Anti-monotone (downward closed) – a measure f where X ⊆ Y implies f(Y) ≤ f(X).
Support is anti-monotone: adding items to an itemset can never increase its support. Apriori uses this property to prune the lattice structure.
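The anti-monotone property can be checked exhaustively on a small example; this sketch uses an invented five-transaction database and brute-force enumeration (all names are mine):

```python
from itertools import combinations

# A small toy database; the property itself holds for any database.
transactions = [{1, 2, 3}, {1, 2}, {2, 3}, {1, 3}, {1, 2, 3, 4}]

def support(itemset, db):
    return sum(1 for t in db if set(itemset) <= t) / len(db)

items = {1, 2, 3, 4}
# Anti-monotonicity: extending an itemset can never increase its support.
for k in range(1, len(items)):
    for X in combinations(sorted(items), k):
        for extra in items - set(X):
            assert support(set(X) | {extra}, transactions) <= support(X, transactions)
print("anti-monotone: support never grows when items are added")
```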
Itemset Lattice
Lattice Pruning
Lattice Example
Count the occurrences of each 1-itemset in the database and compute their support:
Support = #occurrences / #rows in db
Prune anything with support less than minsup = 30%.
Lattice Example
Count the occurrences of each 2-itemset in the database and compute their support.
Prune anything with support less than minsup = 30%.
Lattice Example
Count the occurrences of the last 3-itemset in the database and compute its support.
Prune anything with support less than minsup = 30%.
Example - Results Frequent itemsets: {1}, {2}, {3}, {1,2}, {1,3}, {2,3}, {1,2,3}
Apriori Algorithm
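A compact, unoptimized sketch of Apriori in Python. The database `db` is hypothetical, chosen so its supports match the tables on the following slides; real implementations use hash trees or bitmaps rather than rescanning the database for every candidate:

```python
from itertools import combinations

def apriori(transactions, minsup):
    """Level-wise frequent-itemset mining (sketch).
    transactions: list of sets; minsup: fraction in [0, 1].
    Returns a dict {frozenset itemset: support}."""
    n = len(transactions)

    def sup(itemset):
        return sum(1 for t in transactions if itemset <= t) / n

    # Frequent 1-itemsets.
    items = sorted({i for t in transactions for i in t})
    current = [frozenset([i]) for i in items if sup(frozenset([i])) >= minsup]
    frequent = {c: sup(c) for c in current}

    k = 2
    while current:
        # Join frequent (k-1)-itemsets, then prune any candidate
        # with an infrequent (k-1)-subset (the Apriori principle).
        candidates = set()
        for i, a in enumerate(current):
            for b in current[i + 1:]:
                c = a | b
                if len(c) == k and all(
                    frozenset(s) in frequent for s in combinations(c, k - 1)
                ):
                    candidates.add(c)
        current = [c for c in candidates if sup(c) >= minsup]
        frequent.update((c, sup(c)) for c in current)
        k += 1
    return frequent

# Hypothetical database matching the supports shown on the next slides.
db = [{1, 3, 5}, {1, 3, 5}, {1, 2, 5}, {2, 3, 4, 5}]
frequent = apriori(db, minsup=0.7)
for s in sorted(frequent, key=lambda x: (len(x), sorted(x))):
    print(sorted(s), frequent[s])
```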
Frequent Itemset Generation
Transactional Database, minsup = 70%
1. Generate all 1-itemsets
2. Calculate the support of each itemset
3. Determine whether or not each itemset is frequent

Itemset | Support | Frequent
{1} | 75% | Yes
{2} | 50% | No
{3} | 75% | Yes
{4} | 25% | No
{5} | 100% | Yes
Frequent Itemset Generation
Generate all 2-itemsets from the frequent 1-itemsets, minsup = 70%:
{1} U {3} = {1,3}, {1} U {5} = {1,5}, {3} U {5} = {3,5}

Itemset | Support | Frequent
{1,3} | 50% | No
{1,5} | 75% | Yes
{3,5} | 75% | Yes
Frequent Itemset Generation
Generate all 3-itemsets, minsup = 70%:
{1,3} U {1,5} = {1,3,5}
Because its subset {1,3} is not frequent, {1,3,5} is pruned; its support is below minsup.

Itemset | Support | Frequent
{1,3,5} | 50% | No
Frequent Itemset Results
All frequent itemsets generated are output:
{1}, {3}, {5}
{1,5}, {3,5}
Apriori Rule Mining
Rule Combinations
1. 2-itemset {1,2}: {1}=>{2}, {2}=>{1}
2. 3-itemset {1,2,3}: {1}=>{2,3}, {2,3}=>{1}, {1,2}=>{3}, {3}=>{1,2}, {1,3}=>{2}, {2}=>{1,3}
In general, a k-itemset yields 2^k - 2 candidate rules: every split of the itemset into a non-empty antecedent and a non-empty consequent.
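Enumerating the candidate rules of an itemset is a small exercise in set partitioning; a sketch (the function name is mine):

```python
from itertools import combinations

def candidate_rules(itemset):
    """All rules X => Y with X ∪ Y = itemset and X, Y non-empty, disjoint."""
    items = set(itemset)
    rules = []
    for r in range(1, len(items)):          # antecedent sizes 1 .. k-1
        for X in combinations(sorted(items), r):
            X = frozenset(X)
            rules.append((X, frozenset(items - X)))
    return rules

rules = candidate_rules({1, 2, 3})
print(len(rules))  # a k-itemset yields 2**k - 2 rules: here 2**3 - 2 = 6
```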
Strong Rule Generation
1. I = {{1}, {3}, {5}}
2. Rules: X => Y
3. Minconf = 80%

Rule | Confidence | Strong
{1}=>{3} | – | No
{3}=>{1} | – | No
{1}=>{5} | – | Yes
{5}=>{1} | – | No
{3}=>{5} | – | Yes
{5}=>{3} | – | No
Strong Rule Generation
1. Frequent itemset: {2, 3, 5}
2. Rules: X => Y
3. Minconf = 80%

Rule | Confidence | Strong
{2}=>{3,5} | – | Yes
{3,5}=>{2} | – | No
{2,3}=>{5} | – | Yes
{5}=>{2,3} | – | No
{2,5}=>{3} | – | Yes
{3}=>{2,5} | – | No
Strong Rules Results All strong rules generated are output: {1}=>{5} {3}=>{5} {2}=>{3,5} {2,3}=>{5} {2,5}=>{3}
Other Frequent Itemsets
Closed frequent itemset – a frequent itemset X that has no immediate superset with the same support count as X.
Maximal frequent itemset – a frequent itemset none of whose immediate supersets is frequent.
Itemset Relationships
Maximal Frequent Itemsets ⊆ Closed Frequent Itemsets ⊆ Frequent Itemsets
Every maximal frequent itemset is closed, and every closed frequent itemset is frequent.
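A sketch that classifies a frequent-itemset table into closed and maximal sets, following the definitions above; the `freq` table is hypothetical and assumed to be a complete Apriori output (every frequent subset is listed):

```python
# Hypothetical frequent-itemset table: itemset -> support.
freq = {
    frozenset({1}): 0.75, frozenset({3}): 0.75, frozenset({5}): 1.0,
    frozenset({1, 5}): 0.75, frozenset({3, 5}): 0.75,
}

def closed_and_maximal(frequent):
    """Classify each frequent itemset via its immediate frequent supersets."""
    closed, maximal = set(), set()
    for X, s in frequent.items():
        immediate = [Y for Y in frequent if X < Y and len(Y) == len(X) + 1]
        if all(frequent[Y] != s for Y in immediate):
            closed.add(X)     # no immediate superset has the same support
        if not immediate:
            maximal.add(X)    # no immediate superset is frequent at all
    return closed, maximal

closed, maximal = closed_and_maximal(freq)
```

On this table {5}, {1,5}, and {3,5} are closed, while only {1,5} and {3,5} are maximal, illustrating the containment above.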
Targeted Association Mining
* Users may only be interested in specific results * Potential to get smaller, faster, and more focused results * Examples: 1. User wants to know how often only bread and garlic cloves occur together. 2. User wants to know what items occur with toilet paper.
Itemset Trees
* Itemset tree – a data structure that helps users query for a specific itemset and its support.
* Items within a transaction are mapped to integer values and ordered so that each transaction is in lexical order: {Bread, Onion, Garlic} = {1, 2, 3}
* Why use numbers?
- they make the tree more compact
- numbers follow ordering easily
Itemset Trees
An itemset tree T contains:
* A root pair (I, f(I)), where I is an itemset and f(I) is its count.
* A (possibly empty) set {T1, T2, ..., Tk}, each element of which is an itemset tree.
* If an itemset Ij is contained in the root's itemset I, then it is also contained in the root's children.
* If Ij is not contained in the root, it may still be contained in the root's children if: first_item(I) < first_item(Ij) and last_item(I) < last_item(Ij).
Building an Itemset Tree
Let ci be a node in the itemset tree, and let I be a transaction from the dataset. Loop:
Case 1: ci = I – increment ci's count.
Case 2: ci is a child of I – make I the parent node of ci.
Case 3: ci and I share a common lexical overlap, e.g. {1,2,4} vs. {1,2,6} – make a node for the overlap and make I and ci its children.
Case 4: ci is a parent of I – loop to check ci's children; if none apply, make I a child of ci.
Note: {2,6} and {1,2,6} do not have a lexical overlap.
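The four cases can be sketched as a recursive insert. This is a simplified reading of the construction: transactions are sorted tuples of item ids, and each node stores the number of transactions exactly equal to its itemset (the lecture's trees may aggregate counts differently). All names and the sample dataset are mine:

```python
class Node:
    def __init__(self, itemset, count=0):
        self.itemset = tuple(itemset)   # items kept in lexical (sorted) order
        self.count = count              # transactions exactly equal to this itemset
        self.children = []

def common_prefix(a, b):
    """Lexical overlap of two sorted itemsets, e.g. (1,2,4)/(1,2,6) -> (1,2)."""
    n = 0
    while n < min(len(a), len(b)) and a[n] == b[n]:
        n += 1
    return a[:n]

def insert(node, t):
    """Insert sorted transaction t; assumes node.itemset is a prefix of t."""
    if node.itemset == t:                       # Case 1: identical itemsets
        node.count += 1
        return
    for i, c in enumerate(node.children):
        p = common_prefix(c.itemset, t)
        if len(p) <= len(node.itemset):
            continue                            # no overlap beyond current prefix
        if p == c.itemset:                      # Case 4: child is a prefix of t
            insert(c, t)
        elif p == t:                            # Case 2: t is a prefix of the child
            new = Node(t, count=1)
            new.children.append(c)
            node.children[i] = new
        else:                                   # Case 3: proper lexical overlap
            mid = Node(p)
            mid.children = [c, Node(t, count=1)]
            node.children[i] = mid
        return
    node.children.append(Node(t, count=1))      # no overlapping child: new leaf

def nodes(n):
    yield n
    for c in n.children:
        yield from nodes(c)

tree = Node(())                                 # root holds the empty itemset
for t in [(1, 2), (1, 2), (2,), (2,), (2,), (2, 4), (2, 9), (1, 2, 6)]:
    insert(tree, t)
counts = {n.itemset: n.count for n in nodes(tree)}
print(counts)
```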
Itemset Trees - Creation
(Figures: the dataset's transactions are inserted one at a time; the successive insertions illustrate the cases child node, child node, child node, lexical overlap, parent node, child node.)
Itemset Trees – Querying
Let I be an itemset, ci a node in the tree, and totalSup the total count for I in the tree.
For all ci s.t. first_item(ci) ≤ first_item(I):
Case 1: If I is contained in ci, add ci's support to totalSup.
Case 2: If I is not contained in ci and last_item(ci) < last_item(I), proceed down the tree.
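A sketch of the query walk under the same simplified semantics (each node stores the count of transactions exactly equal to its itemset); the hand-built tree below is illustrative, not the lecture's figure:

```python
class Node:
    def __init__(self, itemset, count=0, children=()):
        self.itemset = tuple(itemset)
        self.count = count
        self.children = list(children)

def query(node, I):
    """Total count of transactions in node's subtree that contain itemset I."""
    total = node.count if set(I) <= set(node.itemset) else 0
    for c in node.children:
        if c.itemset[0] > I[0]:
            continue            # subtree starts past first_item(I): prune
        if set(I) <= set(c.itemset) or c.itemset[-1] < I[-1]:
            total += query(c, I)  # subtree may still contain I: descend
    return total

# Hand-built tree over the transactions {1,2} x2, {2} x3, {2,4}, {2,9}, {1,2,6}.
root = Node((), children=[
    Node((1, 2), count=2, children=[Node((1, 2, 6), count=1)]),
    Node((2,), count=3, children=[Node((2, 4), count=1), Node((2, 9), count=1)]),
])
print(query(root, (2,)), query(root, (2, 9)))  # 8 transactions contain {2}; 1 contains {2,9}
```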
Example 1
Itemset Trees - Querying Querying Example 1: Query: {2} totalSup = 0
Itemset Trees - Querying Querying Example 1: Query: {2} 2 = 2 Add to support: totalSup = 3
Itemset Trees - Querying
Querying Example 1: Query: {2}
{1,2} contains 2. Add its count to support: totalSup = 3 + 2 = 5
Itemset Trees - Querying
Querying Example 1: Query: {2}
3 > 2, and end of subtree. Return support: totalSup = 5
Example 2
Itemset Trees - Querying Querying Example 2: Query: {2,9} totalSup = 0
Itemset Trees - Querying
Querying Example 2: Query: {2,9}
totalSup = 0
2 ≤ 2, 2 < 9: continue
Itemset Trees - Querying
Querying Example 2: Query: {2,9}
totalSup = 0
2 ≤ 2, 4 < 9, but {2,4} doesn't contain {2,9}: go to next sibling
Itemset Trees - Querying Querying Example 2: Query: {2,9} totalSup = 1 {2,9} = {2,9} Add to support!
Itemset Trees - Querying Querying Example 2: Query: {2,9} totalSup = 1 1 < 2 2 < 9 continue
Itemset Trees - Querying Querying Example 2: Query: {2,9} totalSup = 1 1 < 2 5 < 9 {1,2,3,5} doesn’t contain {2,9}, go to next sibling
Itemset Trees - Querying Querying Example 2: Query: {2,9} totalSup = 1 1 < 2 6 < 9 {1,2,6} doesn’t contain {2,9}, go to next node
Itemset Trees - Querying
Querying Example 2: Query: {2,9}
3 ≤ 2 fails, 9 < 9 fails: end of tree.
Return support: totalSup = 1. Nodes visited = 8.