1 FINDING FUZZY SETS FOR QUANTITATIVE ATTRIBUTES FOR MINING OF FUZZY ASSOCIATE RULES By H.N.A. Pham, T.W. Liao, and E. Triantaphyllou Department of Industrial Engineering 3128 CEBA Building Louisiana State University Baton Rouge, LA and
2 Outline
Introduction
Background
A fuzzy approach for mining association rules
Experimental evaluation
Conclusions
3 Introduction
Association analysis is a new and attractive research area in data mining.
The Apriori algorithm (R. Agrawal, IBM, 1993) is a key technique for association analysis.
Although the Apriori principle considerably reduces the search space, the technique still requires heavy computation, particularly for large databases.
This research proposes an approach that finds fuzzy sets for the quantitative attributes of a database by using clustering techniques, and then applies techniques for mining fuzzy association rules.
4 Outline
Introduction
Background
  Association rules and the Apriori algorithm
  Necessity to find fuzzy sets for quantitative attributes
A fuzzy approach for mining fuzzy association rules
Experimental evaluation
Conclusions
5 Association rules: Market basket analysis
Analyzes customer buying habits by finding associations between the different items that customers place in their “shopping baskets” (rules of the form X → Y, where X and Y are sets of items).
Items: I = {I1 = beer, I2 = cake, I3 = onigiri}
A transactional database:
TID1: {I1, I2, I3}
TID2: {I1, I2}
TID3: {I2, I3}
TID4: {I2}
TID5: {I1, I2}
An association rule: {I1} → {I3}
How often do people buy beer and onigiri together?
6 Rule measures: Support and Confidence
Association rule: X → Y
support s = probability that a transaction contains both X and Y
confidence c = conditional probability that a transaction containing X also contains Y
Examples: A → C (s = 50%, c = 66.6%); C → A (s = 50%, c = 100%)
[Venn diagram: customers who buy beer, customers who buy onigiri, and customers who buy both]
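The two measures can be checked by hand on the five-transaction database from the market basket slide. The following is a minimal Python sketch; the function names `support` and `confidence` are illustrative choices, not part of the original slides.

```python
# Transactions from the market-basket slide (I1 = beer, I2 = cake, I3 = onigiri).
transactions = [
    {"I1", "I2", "I3"},
    {"I1", "I2"},
    {"I2", "I3"},
    {"I2"},
    {"I1", "I2"},
]

def support(itemset, db):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in db) / len(db)

def confidence(antecedent, consequent, db):
    """support(X ∪ Y) / support(X) for the rule X → Y."""
    both = set(antecedent) | set(consequent)
    return support(both, db) / support(antecedent, db)

s = support({"I1", "I3"}, transactions)
c = confidence({"I1"}, {"I3"}, transactions)
print(f"{{I1}} -> {{I3}}: support = {s:.1%}, confidence = {c:.1%}")
```

Only TID1 contains both I1 and I3, and three transactions contain I1, so the rule {I1} → {I3} has support 1/5 and confidence 1/3 on this database.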
7 Association mining: the Apriori algorithm
It is composed of two steps:
1. Find all frequent itemsets: by definition, each of these itemsets occurs at least as frequently as a pre-determined minimum support count.
2. Generate strong association rules from the frequent itemsets: by definition, these rules must satisfy minimum support and minimum confidence.
(Agrawal, 1993)
8 Association mining: the Apriori principle
Min. support 50%, min. confidence 50%
For rule A → C:
support = support({A and C}) = 50%
confidence = support({A and C}) / support({A}) = 66.6%
The Apriori principle: any subset of a frequent itemset must be frequent (equivalently, if an itemset is not frequent, neither are its supersets).
9 The Apriori algorithm: Finding frequent itemsets using candidate generation
1. Find the frequent itemsets: the sets of items whose support is at least the minimum support.
   A subset of a frequent itemset must also be a frequent itemset; i.e., if {A, B} is a frequent itemset, both {A} and {B} must be frequent itemsets.
   Iteratively find the frequent itemsets L_k with cardinality from 1 to k (k-itemsets) from the candidate itemsets C_k (L_k ⊆ C_k).
2. Use the frequent itemsets to generate association rules.
C_1 → … → L_{i-1} → C_i → L_i → C_{i+1} → … → L_k
10 Example (min_sup_count = 2)
Transactional data D:
TID     List of item IDs
T100    I1, I2, I5
T200    I2, I4
T300    I2, I3
T400    I1, I2, I4
T500    I1, I3
T600    I2, I3
T700    I1, I3
T800    I1, I2, I3, I5
T900    I1, I2, I3
C1 (scan D for the count of each candidate):
Itemset   Sup. count
{I1}      6
{I2}      7
{I3}      6
{I4}      2
{I5}      2
L1 (compare each candidate support count with the minimum support count):
Itemset   Sup. count
{I1}      6
{I2}      7
{I3}      6
{I4}      2
{I5}      2
11 Example (min_sup_count = 2)
Generate candidates C2 from L1 by using the Apriori principle, then scan D for the count of each candidate:
Itemset     Sup. count
{I1, I2}    4
{I1, I3}    4
{I1, I4}    1
{I1, I5}    2
{I2, I3}    4
{I2, I4}    2
{I2, I5}    2
{I3, I4}    0
{I3, I5}    1
{I4, I5}    0
L2 (compare each candidate support count with the minimum support count):
Itemset     Sup. count
{I1, I2}    4
{I1, I3}    4
{I1, I5}    2
{I2, I3}    4
{I2, I4}    2
{I2, I5}    2
Generate candidates C3 from L2 by using the Apriori principle, then scan D for the count of each candidate:
Itemset         Sup. count
{I1, I2, I3}    2
{I1, I2, I5}    2
L3 (compare each candidate support count with the minimum support count):
Itemset         Sup. count
{I1, I2, I3}    2
{I1, I2, I5}    2
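The level-wise search in this example can be reproduced with a short Python sketch. It is a simplified illustration of candidate generation (join, then prune with the Apriori principle), not the authors' implementation; the names `apriori`, `count`, and `levels` are assumptions made for this example.

```python
from itertools import combinations

# The nine transactions from the example (min_sup_count = 2).
D = {
    "T100": {"I1", "I2", "I5"},
    "T200": {"I2", "I4"},
    "T300": {"I2", "I3"},
    "T400": {"I1", "I2", "I4"},
    "T500": {"I1", "I3"},
    "T600": {"I2", "I3"},
    "T700": {"I1", "I3"},
    "T800": {"I1", "I2", "I3", "I5"},
    "T900": {"I1", "I2", "I3"},
}

def count(itemset, db):
    """Number of transactions that contain the whole itemset."""
    return sum(itemset <= items for items in db.values())

def apriori(db, min_count):
    """Return [L1, L2, ...], each a dict mapping frozenset -> support count."""
    items = sorted({i for t in db.values() for i in t})
    candidates = [frozenset([i]) for i in items]
    levels = []
    k = 1
    while candidates:
        counted = {c: count(c, db) for c in candidates}
        frequent = {c: n for c, n in counted.items() if n >= min_count}
        if not frequent:
            break
        levels.append(frequent)
        k += 1
        # Join step: unions of two frequent (k-1)-itemsets that have exactly k items.
        joined = {a | b for a in frequent for b in frequent if len(a | b) == k}
        # Prune step (Apriori principle): every (k-1)-subset must be frequent.
        candidates = [
            c for c in joined
            if all(frozenset(s) in frequent for s in combinations(c, k - 1))
        ]
    return levels

levels = apriori(D, min_count=2)
for k, level in enumerate(levels, start=1):
    print(f"L{k}:", sorted((sorted(s), n) for s, n in level.items()))
```

On this database the sketch stops after three levels: the only 4-item candidate, {I1, I2, I3, I5}, is pruned because its subset {I1, I3, I5} is not frequent.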
12 Necessity to find fuzzy sets for quantitative attributes
[Table: Transaction ID, Age, Married, NumCars]
A quantitative association rule with min_sup = min_conf = 50%: (Age = 33 or 39) and (Married = Yes) -> (NumCars = 2)
A quantitative association rule with min_sup = min_conf = 50%: (Age = ) and (Married = Yes) -> (NumCars = 2)
A fuzzy association rule with min_sup = min_conf = 50%: (Age = middle-aged) and (Married = Yes) -> (NumCars = 2)
13 Solution: Sharp boundary intervals
It is composed of two steps:
1. Partition the attribute domains into small intervals and combine adjacent intervals into larger ones such that the combined intervals have enough support.
2. Replace each original attribute by its attribute-interval pairs, so that the quantitative problem is transformed into a Boolean one.
(Srikant and Agrawal, 1996)
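The second step, mapping quantitative values to attribute-interval items, can be sketched in Python as follows. The records and interval boundaries below are hypothetical values chosen only for illustration, and the sketch does not cover the partition-and-merge logic of the first step.

```python
# Hypothetical records and interval boundaries, chosen only for illustration.
records = [
    {"Age": 23, "NumCars": 1},
    {"Age": 29, "NumCars": 0},
    {"Age": 34, "NumCars": 2},
    {"Age": 38, "NumCars": 2},
]
intervals = {
    "Age": [(20, 29), (30, 39)],
    "NumCars": [(0, 1), (2, 3)],
}

def to_boolean_items(record, intervals):
    """Map each quantitative value to the attribute-interval pair that contains it."""
    items = set()
    for attr, value in record.items():
        for low, high in intervals[attr]:
            if low <= value <= high:
                items.add(f"{attr}:{low}-{high}")
    return items

boolean_db = [to_boolean_items(r, intervals) for r in records]
for items in boolean_db:
    print(sorted(items))
```

Each record becomes a set of Boolean items such as `Age:30-39`, so an ordinary Boolean miner can run on `boolean_db`. Note that ages 29 and 30 fall into different items even though they are adjacent, which is the boundary effect this approach suffers from.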
14 Example: Sharp boundary intervals
[Tables: the original table (Transaction ID, Age, Married, NumCars) and its Boolean transformation, with Yes/No columns for the Age intervals, Married, NumCars:0-1, and NumCars:2-3, for transactions 100, 200, 300, and 400]
Algorithms ignore or over-emphasize the elements near the boundaries of the intervals in the mining process.
The use of sharp boundary intervals is also not intuitive with respect to human perception.
15 Solution: Experts
A user or expert must provide the algorithm with the required fuzzy sets of the quantitative attributes and their corresponding membership functions.
Fuzzy sets and membership functions provided by experts may not be suitable for mining fuzzy association rules in the database.
16 Solution: Fuzzy sets for quantitative attributes
It is composed of three steps:
Step 1: Transform the original database into one with positive integers
Step 2: For each attribute i:
  Cluster the values of the i-th attribute into k medoids
  Classify the i-th attribute into k fuzzy sets
  Generate membership functions for each fuzzy set
End for
Step 3: Transform the database based on the fuzzy sets
(Ada, 1998)
Drawback: the association between attributes is lost in the mining approach.
17 Outline
Introduction
Background
A fuzzy approach for mining fuzzy association rules
  Fuzzy approach
  Fuzzy mining of association rules
Experimental evaluation
Conclusions
18 Fuzzy approach
It is composed of five steps:
Step 1: Transform the original database into one with positive integers
Step 2: Cluster the values of the attributes into k medoids
Step 3: Classify the attributes into k fuzzy sets
Step 4: Generate membership functions for each fuzzy set
Step 5: Transform the database based on the fuzzy sets
19 Fuzzy approach: Step 2
Clustering:
The clustering method treats the search space of a database with n attributes as an n-dimensional space.
Uses the Matlab fuzzy toolbox.
Does not lose the association between attributes in the mining approach.
20 Fuzzy approach: Step 3
Classify:
Let {m_1, m_2, …, m_k} be the k medoids found in Step 2, where m_i = {a_{i1}, a_{i2}, …, a_{in}} is the i-th medoid.
Let the j-th attribute have range [min_j, max_j], and let {a_{1j}, a_{2j}, …, a_{kj}} be the set of mid-points of the j-th attribute.
The k fuzzy sets of this attribute then range over [min_j, a_{2j}], [a_{1j}, a_{3j}], …, [a_{(i-1)j}, a_{(i+1)j}], …, and [a_{(k-1)j}, max_j].
[Figure: the matrix of medoids m_1 … m_k with entries a_{11} … a_{kn}, and the i-th fuzzy set drawn over [a_{(i-1)j}, a_{(i+1)j}] with mid-point a_{ij}, between min_j and max_j]
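The range construction in this step can be written out as a small Python sketch. The function name `fuzzy_set_ranges` and the sample mid-points are hypothetical; the sketch assumes the mid-points of one attribute have already been obtained from the medoids and are sorted in increasing order.

```python
def fuzzy_set_ranges(midpoints, lo, hi):
    """Ranges [min, a_2], [a_1, a_3], ..., [a_(k-1), max] for one attribute.

    `midpoints` are the attribute's coordinates a_1..a_k taken from the k medoids;
    `lo` and `hi` are the attribute's minimum and maximum values.
    """
    a = sorted(midpoints)
    # bounds[i] is the left end of the i-th range and bounds[i + 2] its right end.
    bounds = [lo] + a + [hi]
    return [(bounds[i], bounds[i + 2]) for i in range(len(a))]

# Hypothetical mid-points for one attribute with domain [0, 100].
ranges = fuzzy_set_ranges([20, 50, 80], lo=0, hi=100)
print(ranges)
```

Each range spans from the previous mid-point to the next one, so neighbouring fuzzy sets overlap, and the first and last ranges are closed off by the attribute's minimum and maximum.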
21 Fuzzy approach: Step 4 Generate membership functions (triangular function):
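22 Fuzzy approach: Step 5
Transform the database based on fuzzy sets:
Let T_{ij} be the value of the i-th transaction at the j-th attribute, and let f_{kj} be the membership function of the k-th fuzzy set of the j-th attribute.
T_{ij} is replaced by the l-th fuzzy label if f_{lj}(T_{ij}) = max_k f_{kj}(T_{ij}).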
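Steps 4 and 5 can be illustrated together with a short Python sketch. It assumes the standard triangular membership function, which peaks at the mid-point and falls linearly to zero at the ends of the range; the exact formula on the Step 4 slide is not reproduced here, so this form is an assumption. The fuzzy sets, labels, and helper names are hypothetical.

```python
def triangular(x, left, peak, right):
    """Triangular membership: 1 at `peak`, linear down to 0 at `left` and `right`."""
    if x == peak:
        return 1.0
    if left < x < peak:
        return (x - left) / (peak - left)
    if peak < x < right:
        return (right - x) / (right - peak)
    return 0.0

# Hypothetical fuzzy sets for one attribute: label -> (left, mid-point, right).
fuzzy_sets = {
    "Low": (0, 20, 50),
    "Medium": (20, 50, 80),
    "High": (50, 80, 100),
}

def fuzzy_label(x, sets):
    """Step 5: pick the label whose membership function is largest at x."""
    return max(sets, key=lambda name: triangular(x, *sets[name]))

for value in (30, 45, 90):
    print(value, "->", fuzzy_label(value, fuzzy_sets))
```

When two labels have equal membership, this sketch keeps the one listed first; the slides do not say how ties are resolved.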
23 Example of fuzzy approach
[Tables: the original database (ID, Salary, IQ); after Step 2, the mid-point, range, and fuzzy label of each fuzzy set for Salary (Low_S, Medium_S, High_S) and for IQ (Low_I, Medium_I, High_I); after Steps 3, 4, 5, the IQ and Salary memberships of each ID]
Transformed database (IQ, Salary), in row order: (Low_I, Low_S), (Medium_I, Medium_S), (Medium_I, Medium_S), (Low_I, Low_S), (High_I, High_S), (Low_I, Low_S), (Low_I, Low_S)
24 Fuzzy mining of association rules (Attilia, 2000)
It is composed of two steps:
1. Find all itemsets whose fuzzy support (FS) is above the user-specified minimum support. These itemsets are called frequent itemsets.
2. Use the frequent itemsets to generate the desired rules. Let X and Y be frequent itemsets. The rule X => Y holds if its fuzzy confidence (FC) is larger than the user-specified minimum confidence.
25 Fuzzy mining of association rules - cont
D = {t_1, t_2, …, t_n}: the set of transactions, where X is a set of attributes and A is the set of corresponding fuzzy sets for X
Z = X ∪ Y, C = A ∪ B
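The formulas for FS and FC are not reproduced in this transcript. The Python sketch below assumes a commonly used product-based definition: the fuzzy support of <X, A> is the average over all transactions of the product of the membership degrees of its items, and the fuzzy confidence of X => Y is FS of <Z, C> divided by FS of <X, A>. Both the definition and the membership degrees in `D` are assumptions made for illustration.

```python
from math import prod

# Hypothetical membership degrees: (attribute, fuzzy set) -> degree in [0, 1].
D = [
    {("Salary", "Low"): 0.9, ("IQ", "Low"): 0.8},
    {("Salary", "Low"): 0.6, ("IQ", "Low"): 0.5},
    {("Salary", "Low"): 0.0, ("IQ", "Low"): 0.2},
    {("Salary", "Low"): 0.1, ("IQ", "Low"): 0.0},
]

def fuzzy_support(itemset, db):
    """Average over transactions of the product of the items' membership degrees."""
    return sum(prod(t.get(item, 0.0) for item in itemset) for t in db) / len(db)

def fuzzy_confidence(antecedent, consequent, db):
    """FS of <Z, C> divided by FS of <X, A>, with Z = X ∪ Y and C = A ∪ B."""
    return fuzzy_support(antecedent + consequent, db) / fuzzy_support(antecedent, db)

X = [("Salary", "Low")]
Y = [("IQ", "Low")]
fs = fuzzy_support(X + Y, D)
fc = fuzzy_confidence(X, Y, D)
print(f"FS = {fs:.3f}, FC = {fc:.4f}")
```

A rule would then be kept when `fs` reaches the minimum support and `fc` exceeds the minimum confidence, mirroring the two steps of the crisp Apriori procedure.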
26 Outline
Introduction
Background
A fuzzy approach for mining fuzzy association rules
Experimental evaluation
Conclusions
27 Experiments: Synthetic datasets
Synthetic datasets of varying sizes:
Name         |D|     |T|    Size (MB)
D100k.T10    100K    10     3
D100k.T20    100K    20     6
D320k.T30    320K    30     18
|D| = number of transactions
|T| = average number of items per transaction
28 Experiment environment
Software
  Database: Microsoft Access 2003
  Languages: C++, Visual Basic, and Matlab
  Platform: Windows
Hardware
  PC with a Pentium IV 2.66 GHz processor and 1 GB of RAM
29 Evaluate meaning of rules
From the Salary and IQ database, the approach yields the following rules with minimum support = 43% and minimum confidence = 50%:
Rule 1: If the 1st variable is low, approximately 7000 [4000, 10000], then the 2nd variable is low, approximately 100 [50, 120]
Rule 2: If the 1st variable is medium, approximately [7000, 20000], then the 2nd variable is medium, approximately 140 [100, 165]
Frequent itemsets at minimum support = 43%:
The Apriori algorithm:
  No frequent itemsets
Mining quantitative algorithm with fuzzy approach:
  Frequent itemset 1: 1st variable is low, approximately 7000 [4000, 10000]; 2nd variable is low, approximately 100 [50, 120]
  Frequent itemset 2: 1st variable is medium, approximately [7000, 20000]; 2nd variable is medium, approximately 140 [100, 165]
30 Evaluate meaning of rules - cont
Frequent itemsets at minimum support = 15%:
The Apriori algorithm:
  Frequent itemset 1: 1st variable is 5000, 2nd variable is 85
  Frequent itemset 2: 1st variable is 7000, 2nd variable is 100
  Frequent itemset 3: 1st variable is 9000, 2nd variable is 110
  Frequent itemset 4: 1st variable is 10000, 2nd variable is 120
  Frequent itemset 5: 1st variable is 15000, 2nd variable is 140
  Frequent itemset 6: 1st variable is 20000, 2nd variable is 165
  Frequent itemset 7: 1st variable is 30000, 2nd variable is 183
Mining quantitative algorithm:
  Frequent itemset 1: 1st variable is low, approximately 7000 [4000, 10000]; 2nd variable is low, approximately 100 [50, 120]
  Frequent itemset 2: 1st variable is high, approximately [15000, 32000]; 2nd variable is high, approximately 183 [140, 200]
  Frequent itemset 3: 1st variable is medium, approximately [7000, 20000]; 2nd variable is medium, approximately 140 [100, 165]
31 Evaluate fuzziness
[Tables: IQ's membership and Salary's membership for each ID, under Ada's approach and under the new approach]
Using Yager's measure of fuzziness with p = 1:
Ada_fuzziness_Salary ≈ ≤ NewApproach_fuzziness_Salary ≈
Ada_fuzziness_IQ ≈ 0.51 ≤ NewApproach_fuzziness_IQ ≈ 0.59
The new approach is fuzzier than Ada's.
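The slide does not spell out the fuzziness measure. The Python sketch below assumes the usual form of Yager's index, which measures how far a fuzzy set is from its complement: f_p(A) = 1 - D_p(A, A^c) / n^(1/p), where D_p is the Minkowski distance between the membership vectors of A and its complement. The function name and sample memberships are hypothetical.

```python
def yager_fuzziness(memberships, p=1):
    """Yager-style fuzziness: 1 - D_p(A, complement of A) / n**(1/p).

    Returns 0 for a crisp set (all degrees 0 or 1) and 1 when every degree is 0.5.
    """
    n = len(memberships)
    # |mu - (1 - mu)| = |2*mu - 1| is the per-element distance to the complement.
    distance = sum(abs(2 * mu - 1) ** p for mu in memberships) ** (1 / p)
    return 1 - distance / n ** (1 / p)

print(yager_fuzziness([0.0, 1.0, 1.0, 0.0]))    # crisp set
print(yager_fuzziness([0.5, 0.5]))              # maximally fuzzy set
print(yager_fuzziness([0.25, 0.75, 1.0, 0.0]))  # in between
```

Under this definition a larger value means the membership degrees sit closer to 0.5, which is the sense in which one approach is described as fuzzier than the other.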
32 Evaluate fuzziness - cont
Frequent itemsets at minimum support = 15%:
Ada's approach:
  Frequent itemset 1: 1st variable is low, approximately 5000 [4000, 10000]; 2nd variable is low, approximately 85 [50, 120]
  Frequent itemset 2: 1st variable is high, approximately [15000, 32000]; 2nd variable is high, approximately 165 [140, 200]
  Frequent itemset 3: 1st variable is medium, approximately [7000, 20000]; 2nd variable is medium, approximately 120 [100, 165]
New approach:
  Frequent itemset 1: 1st variable is low, approximately 7000 [4000, 10000]; 2nd variable is low, approximately 100 [50, 120]
  Frequent itemset 2: 1st variable is high, approximately [15000, 32000]; 2nd variable is high, approximately 183 [140, 200]
  Frequent itemset 3: 1st variable is medium, approximately [7000, 20000]; 2nd variable is medium, approximately 140 [100, 165]
In Ada's approach, the mid-points of the ranges are moved away from the centre values, which changes the meaning of the frequent itemsets.
33 Execution time (sec.) with different minimum support thresholds
[Table: execution time of Apriori and Fuzzy* on each dataset at Min_sup = 35%, 40%, and 50%]
*: does not include the transfer time
[Table: time to transfer each database into fuzzy sets]
34 Execution time (sec.) with different minimum support thresholds - cont
The execution time (transfer + mining time) of the fuzzy method is better than that of Apriori.
Moreover, the meaning of the rules is more understandable.
35 Conclusions
Proposed an approach to find fuzzy sets for quantitative attributes for mining association rules.
An experimental evaluation shows that both the meaning of the rules and the execution time are better when the fuzzy approach is used for mining association rules than with the other algorithms.
Future work:
  Improve the fuzzy mining approach
  Develop incremental algorithms for association analysis using Support Vector Machines
36 THANK YOU