Huffman Codes and Association Rules (II) Prof. Sin-Min Lee Department of Computer Science
Huffman Code Example Given: A B C D E with frequencies 3, 1, 2, 4, 6. Sorting from smallest to largest frequency changes the order to: B C A D E (1, 2, 3, 4, 6)
Huffman Code Example – Step 1 Because B and C have the two lowest frequencies, they are merged first. The new node BC has weight 1 + 2 = 3
Huffman Code Example – Step 2 Reordering by increasing weight again gives: BC, A, D, E (3, 3, 4, 6)
Huffman Code Example – Step 3 Merging the two smallest nodes again, BC (3) and A (3), gives a new node BCA with weight 6
Huffman Code Example – Step 4 The remaining nodes are D (4), E (6), and BCA (6). Because of the tie at weight 6, several equivalent orderings are possible, e.g. D, E, BCA or D, BCA, E (weights 4, 6, 6)
Huffman Code Example – Step 5 Merging the two smallest nodes, D (4) and E (6), gives a new node DE with weight 10. The remaining nodes are BCA (6) and DE (10)
Huffman Code Example – Step 6 The final merge combines BCA (6) and DE (10) into the root with weight 16, covering all of A, B, C, D, E
Huffman Code Example – Step 7 Finally, we assign a 1 to each right branch and a 0 to each left branch of the tree. Reading off root-to-leaf paths, one consistent assignment is B = 000, C = 001, A = 01, D = 10, E = 11; any prefix code with these code lengths is equally optimal. A construction sketch follows below.
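The construction above can be sketched with Python's heapq module. This is a minimal illustration, not code from the slides: the frequencies B = 1, C = 2, A = 3, D = 4, E = 6 are taken from the example, and tie-breaking between equal weights is an arbitrary choice here, so the 0/1 labels may differ from other equally optimal assignments.

```python
import heapq

def huffman_codes(freqs):
    """Build a Huffman code for a dict of symbol -> frequency.
    Internal tree nodes are (left, right) pairs; leaves are symbols."""
    heap = [(w, i, sym) for i, (sym, w) in
            enumerate(sorted(freqs.items(), key=lambda kv: kv[1]))]
    heapq.heapify(heap)
    tie = len(heap)                          # unique tie-breaker for equal weights
    while len(heap) > 1:
        w1, _, left = heapq.heappop(heap)    # smallest weight
        w2, _, right = heapq.heappop(heap)   # next smallest
        heapq.heappush(heap, (w1 + w2, tie, (left, right)))
        tie += 1
    codes = {}
    def walk(node, prefix):
        if isinstance(node, tuple):          # internal node
            walk(node[0], prefix + "0")      # left branch -> 0
            walk(node[1], prefix + "1")      # right branch -> 1
        else:
            codes[node] = prefix             # leaf: record the code
    walk(heap[0][2], "")
    return codes

print(huffman_codes({"B": 1, "C": 2, "A": 3, "D": 4, "E": 6}))
# -> {'A': '00', 'B': '010', 'C': '011', 'D': '10', 'E': '11'}
```

However ties are broken, B and C always receive 3-bit codes and A, D, E receive 2-bit codes, so the total cost 1·3 + 2·3 + 3·2 + 4·2 + 6·2 = 35 bits is the same.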
Example Items = {milk, coke, pepsi, beer, juice}. Minimum support = 3 baskets.
B1 = {m, c, b}   B2 = {m, p, j}   B3 = {m, b}   B4 = {c, j}
B5 = {m, p, b}   B6 = {m, c, b, j}   B7 = {c, b, j}   B8 = {b, c}
Frequent itemsets: {m}, {c}, {b}, {j}, {m, b}, {c, b}, {j, c}.
Association Rules Association rule R : Itemset1 => Itemset2
–Itemset1 and Itemset2 are disjoint, and Itemset2 is non-empty
–Meaning: if a transaction includes Itemset1, then it also includes Itemset2
Examples
–A, B => E, C
–A => B, C
Example B1 = {m, c, b}   B2 = {m, p, j}   B3 = {m, b}   B4 = {c, j}
B5 = {m, p, b}   B6 = {m, c, b, j}   B7 = {c, b, j}   B8 = {b, c}
An association rule: {m, b} → c.
–Confidence = 2/4 = 50% ({m, b} occurs in 4 baskets; c appears in 2 of them).
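A minimal sketch of how these numbers are computed, using the eight baskets above (the helper name support is illustrative, not from the slides):

```python
baskets = [
    {"m", "c", "b"}, {"m", "p", "j"}, {"m", "b"}, {"c", "j"},
    {"m", "p", "b"}, {"m", "c", "b", "j"}, {"c", "b", "j"}, {"b", "c"},
]

def support(itemset, baskets):
    """Number of baskets containing every item of `itemset`."""
    return sum(1 for b in baskets if itemset <= b)

antecedent, consequent = {"m", "b"}, {"c"}
sup_rule = support(antecedent | consequent, baskets)  # baskets with m, b, and c
conf = sup_rule / support(antecedent, baskets)        # fraction of {m,b}-baskets with c
print(sup_rule, conf)                                 # 2 0.5, i.e. 50% confidence
```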
From Frequent Itemsets to Association Rules Q: Given the frequent set {A, B, E}, what are the possible association rules (see the sketch below)?
–A => B, E
–A, B => E
–A, E => B
–B => A, E
–B, E => A
–E => A, B
–__ => A, B, E (empty rule), or true => A, B, E
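These seven candidate rules can be enumerated mechanically. A small illustrative sketch, generating subset => complement for every antecedent size from 0 to k-1:

```python
from itertools import combinations

def candidate_rules(itemset):
    """All rules s => (itemset - s) with a non-empty consequent,
    including the empty-antecedent rule true => itemset."""
    items = sorted(itemset)
    for r in range(len(items)):              # antecedent sizes 0 .. k-1
        for s in combinations(items, r):
            yield set(s), set(items) - set(s)

for lhs, rhs in candidate_rules({"A", "B", "E"}):
    print(sorted(lhs), "=>", sorted(rhs))
# 7 rules in total: {} => A,B,E plus the six listed above
```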
Classification vs Association Rules
Classification Rules: focus on one target field; specify class in all cases; measure: accuracy.
Association Rules: many target fields; applicable in some cases; measures: support, confidence, lift.
Rule Support and Confidence Suppose R : I => J is an association rule
–sup(R) = sup(I ∪ J) is the support of R: the support count of the itemset containing all items of both I and J
–conf(R) = sup(I ∪ J) / sup(I) is the confidence of R: the fraction of transactions containing I that also contain J
Association rules meeting minimum support and minimum confidence are sometimes called “strong” rules
Association Rules Example: Q: Given the frequent set {A, B, E}, which association rules have minsup = 2 and minconf = 50%?
A, B => E : conf = 2/4 = 50%
A, E => B : conf = 2/2 = 100%
B, E => A : conf = 2/2 = 100%
E => A, B : conf = 2/2 = 100%
Don’t qualify:
A => B, E : conf = 2/6 = 33% < 50%
B => A, E : conf = 2/7 = 28% < 50%
__ => A, B, E : conf = 2/9 = 22% < 50%
Find Strong Association Rules A rule R is strong with respect to the parameters minsup and minconf if:
–sup(R) >= minsup and conf(R) >= minconf
Problem:
–Find all association rules with given minsup and minconf
First, find all frequent itemsets
Finding Frequent Itemsets Start by finding one-item sets (easy) Q: How? A: Simply count the frequencies of all items
Finding itemsets: next level Apriori algorithm (Agrawal & Srikant)
Idea: use one-item sets to generate two-item sets, two-item sets to generate three-item sets, …
–If (A, B) is a frequent itemset, then (A) and (B) have to be frequent itemsets as well!
–In general: if X is a frequent k-itemset, then all (k-1)-item subsets of X are also frequent
Compute candidate k-itemsets by merging (k-1)-itemsets (a sketch follows below)
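A minimal sketch of this candidate-generation step. The name apriori_gen follows Agrawal & Srikant's paper, but the implementation here is an illustrative simplification, not their original code:

```python
from itertools import combinations

def apriori_gen(prev_frequent, k):
    """Candidate k-itemsets from the frequent (k-1)-itemsets:
    join step (merge pairs sharing k-2 items), then prune step
    (drop candidates with an infrequent (k-1)-subset)."""
    prev = set(map(frozenset, prev_frequent))
    joined = {a | b for a in prev for b in prev if len(a | b) == k}
    return [c for c in joined
            if all(frozenset(s) in prev for s in combinations(c, k - 1))]

# Example: from frequent 2-itemsets to candidate 3-itemsets.
print(apriori_gen([{1, 3}, {2, 3}, {2, 5}, {3, 5}], 3))   # [frozenset({2, 3, 5})]
```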
Finding Association Rules A typical question: “find all association rules with support ≥ s and confidence ≥ c.”
–Note: the “support” of an association rule is the support of the set of items it mentions.
Hard part: finding the high-support (frequent) itemsets.
–Checking the confidence of association rules involving those sets is relatively easy.
Naïve Algorithm A simple way to find frequent pairs:
–Read the file once, counting in main memory the occurrences of each pair: expand each basket of n items into its n(n-1)/2 pairs.
Fails if the square of the number of distinct items exceeds main memory (a counting sketch follows below).
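A sketch of the naive pair counting, assuming (illustratively) that baskets are available as Python sets:

```python
from collections import Counter
from itertools import combinations

def count_pairs(baskets):
    """Expand each basket of n items into its n(n-1)/2 pairs and count
    them in main memory; space grows with (#distinct items)^2."""
    counts = Counter()
    for basket in baskets:
        counts.update(combinations(sorted(basket), 2))
    return counts

print(count_pairs([{"m", "c", "b"}, {"m", "b"}, {"c", "b"}]))
# Counter({('b', 'c'): 2, ('b', 'm'): 2, ('c', 'm'): 1})
```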
[Diagram: Apriori passes alternate Construct and Filter steps: construct C1 and filter it to L1 (first pass); construct C2 from L1 and filter it to L2 (second pass); construct C3, …]
Fast Algorithms for Mining Association Rules, by Rakesh Agrawal and Ramakrishnan Srikant, IBM Almaden Research Center [Agrawal, Srikant 94]
Database D:
TID 100: {1, 3, 4}   TID 200: {2, 3, 5}   TID 300: {1, 2, 3, 5}   TID 400: {2, 5}
Ĉ1 (candidate 1-itemsets per transaction):
100: {{1}, {3}, {4}}   200: {{2}, {3}, {5}}   300: {{1}, {2}, {3}, {5}}   400: {{2}, {5}}
L1: {1} (support 2), {2} (3), {3} (3), {5} (3)
C2: {1 2}, {1 3}, {1 5}, {2 3}, {2 5}, {3 5}
Ĉ2 (candidate 2-itemsets per transaction):
100: {{1 3}}   200: {{2 3}, {2 5}, {3 5}}   300: {{1 2}, {1 3}, {1 5}, {2 3}, {2 5}, {3 5}}   400: {{2 5}}
L2: {1 3} (support 2), {2 3} (2), {2 5} (3), {3 5} (2)
C3: {2 3 5}   Ĉ3: 200: {{2 3 5}}, 300: {{2 3 5}}   L3: {2 3 5} (support 2)
Dynamic Programming Approach
–Want a proof of the principle of optimality and of overlapping subproblems
–Principle of Optimality: the optimal solution to L_k includes the optimal solution of L_(k-1) (proof by contradiction)
–Overlapping Subproblems: lemma that every subset of a frequent itemset is a frequent itemset (proof by contradiction)
The Apriori Algorithm: Example Consider a database, D, consisting of 9 transactions. Suppose the minimum support count required is 2 (i.e. min_sup = 2/9 = 22%). Let the minimum confidence required be 70%. We first find the frequent itemsets using the Apriori algorithm; then association rules are generated using min. support & min. confidence.
TID    List of Items
T100   I1, I2, I5
T200   I2, I4
T300   I2, I3
T400   I1, I2, I4
T500   I1, I3
T600   I2, I3
T700   I1, I3
T800   I1, I2, I3, I5
T900   I1, I2, I3
Step 1: Generating 1-itemset Frequent Pattern
Scan D for the count of each candidate; then compare each candidate's support count with the minimum support count.
C1 = L1 (every item meets min_sup here):
Itemset   Sup. Count
{I1}      6
{I2}      7
{I3}      6
{I4}      2
{I5}      2
In the first iteration of the algorithm, each item is a member of the set of candidate 1-itemsets, C1. The set of frequent 1-itemsets, L1, consists of the candidate 1-itemsets satisfying minimum support. A counting sketch follows below.
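This first pass is just frequency counting. A minimal illustrative sketch over the nine transactions above:

```python
from collections import Counter

D = [  # the nine transactions T100 .. T900
    {"I1", "I2", "I5"}, {"I2", "I4"}, {"I2", "I3"},
    {"I1", "I2", "I4"}, {"I1", "I3"}, {"I2", "I3"},
    {"I1", "I3"}, {"I1", "I2", "I3", "I5"}, {"I1", "I2", "I3"},
]
min_sup = 2

C1 = Counter(item for t in D for item in t)             # candidate 1-itemsets
L1 = {item: c for item, c in C1.items() if c >= min_sup}
print(L1)   # I1: 6, I2: 7, I3: 6, I4: 2, I5: 2 (ordering may vary)
```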
Step 2: Generating 2-itemset Frequent Pattern
Generate C2 candidates from L1; scan D for the count of each candidate; compare each count with the minimum support count.
C2 (candidates with counts):
{I1, I2} 4   {I1, I3} 4   {I1, I4} 1   {I1, I5} 2   {I2, I3} 4
{I2, I4} 2   {I2, I5} 2   {I3, I4} 0   {I3, I5} 1   {I4, I5} 0
L2 (frequent 2-itemsets):
{I1, I2} 4   {I1, I3} 4   {I1, I5} 2   {I2, I3} 4   {I2, I4} 2   {I2, I5} 2
Step 2: Generating 2-itemset Frequent Pattern [Cont.] To discover the set of frequent 2-itemsets, L2, the algorithm uses L1 Join L1 to generate a candidate set of 2-itemsets, C2. Next, the transactions in D are scanned and the support count for each candidate itemset in C2 is accumulated (the counts shown above). The set of frequent 2-itemsets, L2, is then determined, consisting of those candidate 2-itemsets in C2 having minimum support. Note: we haven't used the Apriori Property yet.
Step 3: Generating 3-itemset Frequent Pattern
The generation of the set of candidate 3-itemsets, C3, involves use of the Apriori Property. To find C3, we compute L2 Join L2:
C3 = L2 Join L2 = {{I1, I2, I3}, {I1, I2, I5}, {I1, I3, I5}, {I2, I3, I4}, {I2, I3, I5}, {I2, I4, I5}}.
Now the Join step is complete and the Prune step will be used to reduce the size of C3. The Prune step helps avoid heavy computation due to large Ck.
After pruning, scanning D for the count of each remaining candidate, and comparing with the minimum support count:
C3: {I1, I2, I3} (sup. 2), {I1, I2, I5} (sup. 2)   L3: {I1, I2, I3} (sup. 2), {I1, I2, I5} (sup. 2)
Step 3: Generating 3-itemset Frequent Pattern [Cont.] Based on the Apriori property that all subsets of a frequent itemset must also be frequent, we can determine that the four latter candidates cannot possibly be frequent. How? For example, let's take {I1, I2, I3}. Its 2-item subsets are {I1, I2}, {I1, I3} & {I2, I3}. Since all 2-item subsets of {I1, I2, I3} are members of L2, we keep {I1, I2, I3} in C3. Let's take another example, {I2, I3, I5}, which shows how the pruning is performed. Its 2-item subsets are {I2, I3}, {I2, I5} & {I3, I5}. BUT {I3, I5} is not a member of L2, so it is not frequent, violating the Apriori Property. Thus we have to remove {I2, I3, I5} from C3. Therefore, C3 = {{I1, I2, I3}, {I1, I2, I5}} after checking all members of the result of the Join operation for pruning (a code sketch of this check follows below). Now the transactions in D are scanned to determine L3, consisting of those candidate 3-itemsets in C3 having minimum support.
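A small sketch that reproduces this prune check over the six join results (illustrative code):

```python
from itertools import combinations

L2 = {frozenset(s) for s in ({"I1", "I2"}, {"I1", "I3"}, {"I1", "I5"},
                             {"I2", "I3"}, {"I2", "I4"}, {"I2", "I5"})}

def survives_prune(candidate, prev_frequent):
    """Apriori prune: keep a k-itemset only if every (k-1)-subset is frequent."""
    return all(frozenset(s) in prev_frequent
               for s in combinations(candidate, len(candidate) - 1))

for c in ({"I1", "I2", "I3"}, {"I1", "I2", "I5"}, {"I1", "I3", "I5"},
          {"I2", "I3", "I4"}, {"I2", "I3", "I5"}, {"I2", "I4", "I5"}):
    print(sorted(c), survives_prune(c, L2))
# Only {I1,I2,I3} and {I1,I2,I5} survive, matching C3 above.
```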
Step 4: Generating 4-itemset Frequent Pattern The algorithm uses L3 Join L3 to generate a candidate set of 4-itemsets, C4. Although the join results in {{I1, I2, I3, I5}}, this itemset is pruned since its subset {I2, I3, I5} is not frequent. Thus C4 = φ, and the algorithm terminates, having found all of the frequent itemsets. This completes our Apriori algorithm. What's next? These frequent itemsets will be used to generate strong association rules (where strong association rules satisfy both minimum support & minimum confidence).
Step 5: Generating Association Rules from Frequent Itemsets Procedure: For each frequent itemset “l”, generate all nonempty proper subsets of l. For every nonempty proper subset s of l, output the rule “s => (l - s)” if support_count(l) / support_count(s) >= min_conf, where min_conf is the minimum confidence threshold. Back to the example: We had L = {{I1}, {I2}, {I3}, {I4}, {I5}, {I1,I2}, {I1,I3}, {I1,I5}, {I2,I3}, {I2,I4}, {I2,I5}, {I1,I2,I3}, {I1,I2,I5}}.
–Let's take l = {I1,I2,I5}.
–All its nonempty proper subsets are {I1,I2}, {I1,I5}, {I2,I5}, {I1}, {I2}, {I5}.
Step 5: Generating Association Rules from Frequent Itemsets [Cont.] Let the minimum confidence threshold be, say, 70%. The resulting association rules are shown below, each listed with its confidence.
–R1: I1 ^ I2 => I5: confidence = sc{I1,I2,I5}/sc{I1,I2} = 2/4 = 50%. R1 is rejected.
–R2: I1 ^ I5 => I2: confidence = sc{I1,I2,I5}/sc{I1,I5} = 2/2 = 100%. R2 is selected.
–R3: I2 ^ I5 => I1: confidence = sc{I1,I2,I5}/sc{I2,I5} = 2/2 = 100%. R3 is selected.
Step 5: Generating Association Rules from Frequent Itemsets [Cont.]
–R4: I1 => I2 ^ I5: confidence = sc{I1,I2,I5}/sc{I1} = 2/6 = 33%. R4 is rejected.
–R5: I2 => I1 ^ I5: confidence = sc{I1,I2,I5}/sc{I2} = 2/7 = 29%. R5 is rejected.
–R6: I5 => I1 ^ I2: confidence = sc{I1,I2,I5}/sc{I5} = 2/2 = 100%. R6 is selected.
In this way, we have found three strong association rules (a generation sketch follows below).
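The whole of Step 5 can be sketched compactly. The support counts below are taken from the worked example; the function name rules_from is illustrative, not from the slides:

```python
from itertools import combinations

sc = {  # support counts from the example database
    frozenset({"I1"}): 6, frozenset({"I2"}): 7, frozenset({"I5"}): 2,
    frozenset({"I1", "I2"}): 4, frozenset({"I1", "I5"}): 2,
    frozenset({"I2", "I5"}): 2, frozenset({"I1", "I2", "I5"}): 2,
}

def rules_from(l, sc, min_conf):
    """Emit s => (l - s) for every nonempty proper subset s of l
    whose confidence sc(l) / sc(s) meets min_conf."""
    l = frozenset(l)
    for r in range(1, len(l)):
        for s in map(frozenset, combinations(l, r)):
            conf = sc[l] / sc[s]
            if conf >= min_conf:
                yield sorted(s), sorted(l - s), conf

for lhs, rhs, conf in rules_from({"I1", "I2", "I5"}, sc, 0.7):
    print(lhs, "=>", rhs, f"{conf:.0%}")
# Prints exactly the three strong rules R2, R3, and R6 above.
```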
Example [Diagram: rule generation from the large itemset ABCDE. The simple algorithm tests every candidate rule (ACDE => B, ABCE => D, ACD => BE, ADE => BC, CDE => AB, ACE => BD, BCE => AD, ABE => CD, ABC => DE, …), while the fast algorithm extends only the consequents of rules that already met the threshold: e.g., from ACDE => B and ABCE => D it generates ACE => BD.]