
1 On information theory and association rule interestingness. Loo Kin Kong, 5th July 2002.

2 In my last DB talk...
I talked about some subjective approaches for finding interesting association rules. Subjective approaches require that a domain expert work through a huge set of mined rules.
Others have adopted a different approach: finding "optimal rules" instead, where optimality is defined by some objective interestingness measure.

3 Contents
Basic concepts of probability; entropy and information; the maximum entropy method (MEM); paper review: Pruning Redundant Association Rules Using Maximum Entropy Principle.

4 Basic concepts
A finite probability space is a pair (S, P), in which S is a finite non-empty set and P is a mapping P: S → [0,1] satisfying Σ_{s ∈ S} P(s) = 1.
Each s ∈ S is called an event, and P(s) is the probability of the event s. I'll also use p_s to denote P(s) in this talk.
A partition U is a collection of mutually exclusive events whose union equals S. This is sometimes also known as a system of events.
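As a concrete illustration (not part of the slides), these definitions map directly onto a few lines of Python; the names below are mine.

```python
# A minimal sketch of a finite probability space and a partition.
S = {"a1", "a2", "a3", "b"}                      # finite non-empty set of events
P = {"a1": 0.2, "a2": 0.3, "a3": 0.1, "b": 0.4}  # mapping S -> [0, 1]

# Probability-space requirement: probabilities sum to 1 over S.
assert abs(sum(P[s] for s in S) - 1.0) < 1e-12

# A partition U: mutually exclusive events whose union equals S.
U = [{"a1", "a2", "a3"}, {"b"}]
assert set().union(*U) == S                       # union equals S
assert sum(len(block) for block in U) == len(S)   # blocks are disjoint
```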

5 Basic concepts (cont'd)
The product U·B of two partitions U = {a_1, ..., a_N} and B = {b_1, ..., b_M} is defined as the partition whose elements are all intersections a_i ∩ b_j of the elements of U and B.
[Figure: the blocks a_1, a_2, a_3 of U overlaid with the blocks b_1, b_2, b_3 of B; each cell, such as a_1 ∩ b_1, is an element of U·B.]
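A small sketch of the partition product; the example blocks are made up.

```python
# Product of two partitions, as defined above: all non-empty
# pairwise intersections a_i ∩ b_j.
def partition_product(U, B):
    return [a & b for a in U for b in B if a & b]

U = [{1, 2}, {3, 4, 5}, {6}]
B = [{1, 3, 6}, {2, 4, 5}]
print(partition_product(U, B))
# [{1}, {2}, {3}, {4, 5}, {6}]
```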

6 Self-information
The probability P(s) of an event s is a measure of our uncertainty about whether s will occur. If P(s) = 0.999, s is almost certain to occur; if P(s) = 0.1, we can quite reasonably believe that s will not occur.
The self-information of s is defined as I(s) = −log P(s).
Note that the smaller the value of P(s), the larger the value of I(s), and when P(s) = 0, I(s) = ∞. The intuition: when something that was supposed to be very unlikely actually happens, it carries a lot of information.
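As a quick illustration (not from the slides), self-information can be computed directly; the helper name below is mine.

```python
import math

def self_information(p, base=2):
    """I(s) = -log P(s); returns +inf when P(s) = 0."""
    if p == 0:
        return math.inf
    return -math.log(p, base)

print(self_information(0.999))  # ~0.0014 bits: almost certain, little information
print(self_information(0.1))    # ~3.32 bits: unlikely event, much information
```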

7 Entropy
The measure of uncertainty about which event of a partition U will occur is called the entropy of U.
The entropy H(U) of a partition U is defined as H(U) = −p_1 log p_1 − p_2 log p_2 − ... − p_N log p_N, where p_1, ..., p_N are the probabilities of the events a_1, ..., a_N of U.
Note that each term is the self-information of an event weighted by its probability, and H(U) is maximal when p_1 = p_2 = ... = p_N = 1/N.
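A small sketch of the entropy formula above; the uniform case attains the maximum mentioned on the slide.

```python
import math

def entropy(probs, base=2):
    """H(U) = -sum p_i log p_i, with the convention 0 * log 0 = 0."""
    return -sum(p * math.log(p, base) for p in probs if p > 0)

print(entropy([0.25, 0.25, 0.25, 0.25]))  # 2.0 bits: maximal for 4 events
print(entropy([0.97, 0.01, 0.01, 0.01]))  # ~0.24 bits: one event dominates
```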

8 Conditional entropy
Let U = {a_1, ..., a_N} and B = {b_1, ..., b_M} be two partitions.
The conditional entropy of U given b_j is H(U|b_j) = −Σ_i P(a_i|b_j) log P(a_i|b_j).
The conditional entropy of U given B is then H(U|B) = Σ_j P(b_j) H(U|b_j).
We can go on to show that H(U·B) = H(B) + H(U|B) = H(U) + H(B|U).
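A short sketch, with made-up joint probabilities, that computes H(U|B) and checks the identity H(U·B) = H(B) + H(U|B).

```python
import math

def entropy(probs, base=2):
    return -sum(p * math.log(p, base) for p in probs if p > 0)

# Joint probabilities P(a_i, b_j) for a 2x2 example (illustrative numbers).
joint = {("a1", "b1"): 0.4, ("a1", "b2"): 0.1,
         ("a2", "b1"): 0.2, ("a2", "b2"): 0.3}
p_b = {"b1": 0.6, "b2": 0.4}                     # marginals of B

# H(U|B) = sum_j P(b_j) * H(U | b_j)
h_u_given_b = sum(
    p_b[b] * entropy([joint[(a, b)] / p_b[b] for a in ("a1", "a2")])
    for b in ("b1", "b2"))

# Chain rule: H(U·B) = H(B) + H(U|B)
assert abs(entropy(joint.values()) - (entropy(p_b.values()) + h_u_given_b)) < 1e-12
```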

9 Mutual information
Suppose U and B are two partitions of S. The mutual information I(U,B) between U and B is I(U,B) = H(U) + H(B) − H(U·B).
Applying the equality H(U·B) = H(B) + H(U|B) = H(U) + H(B|U), we get I(U,B) = H(U) − H(U|B) = H(B) − H(B|U).
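A sketch that computes I(U,B) from a joint distribution using the first identity above; the example numbers are made up.

```python
import math

def entropy(probs, base=2):
    return -sum(p * math.log(p, base) for p in probs if p > 0)

# I(U,B) = H(U) + H(B) - H(U·B), computed from a joint distribution.
def mutual_information(joint):
    p_u, p_b = {}, {}
    for (a, b), p in joint.items():
        p_u[a] = p_u.get(a, 0.0) + p
        p_b[b] = p_b.get(b, 0.0) + p
    return entropy(p_u.values()) + entropy(p_b.values()) - entropy(joint.values())

joint = {("a1", "b1"): 0.4, ("a1", "b2"): 0.1,
         ("a2", "b1"): 0.2, ("a2", "b2"): 0.3}
print(mutual_information(joint))  # ~0.125 bits of shared information
```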

10 The maximum entropy method (MEM)
The MEM determines the probabilities p_i of the events in a partition U, subject to various given constraints. By the MEM, when some of the p_i's are unknown, they are chosen to maximize the entropy of U subject to the given constraints. Let's illustrate the MEM with an example.

11 Example: rolling a die
Let p_1, ..., p_6 denote the probabilities that the outcome of rolling the die is 1, ..., 6 respectively. The entropy of this partition U is H(U) = −p_1 log p_1 − ... − p_6 log p_6.
If we have no information about the die, the MEM chooses p_1 = p_2 = ... = p_6 = 1/6.
Suppose now we know that a player bets $10 on "odd" each game and on average wins $2 per game, so p_1 + p_3 + p_5 = 0.6 and p_2 + p_4 + p_6 = 0.4. By the MEM, we get p_1 = p_3 = p_5 = 0.2 and p_2 = p_4 = p_6 = 0.4/3 ≈ 0.133.
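The die example can also be checked numerically; this is a small sketch assuming scipy is available, maximizing entropy under the two betting constraints.

```python
import numpy as np
from scipy.optimize import minimize

def neg_entropy(p):
    p = np.clip(p, 1e-12, 1.0)
    return float(np.sum(p * np.log2(p)))   # minimizing this maximizes H(U)

constraints = [
    {"type": "eq", "fun": lambda p: p[0] + p[2] + p[4] - 0.6},  # odd faces
    {"type": "eq", "fun": lambda p: p[1] + p[3] + p[5] - 0.4},  # even faces
]
result = minimize(neg_entropy, x0=np.full(6, 1 / 6),
                  bounds=[(0, 1)] * 6, constraints=constraints)
print(np.round(result.x, 3))  # ~[0.2, 0.133, 0.2, 0.133, 0.2, 0.133]
```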

12 Paper review
S. Jaroszewicz and D. A. Simovici, "Pruning Redundant Association Rules Using Maximum Entropy Principle", published in PAKDD '02.
Highlights: the goal is to identify a small, non-redundant set of interesting association rules that describes the data as completely as possible; the solution proposed in the paper uses the maximum entropy approach.

13 Definitions
A constraint C is a pair C = (I, p), where I is an itemset and p ∈ [0,1] is the probability of I occurring in a transaction.
The set of constraints generated by an association rule I → J is defined as C(I → J) = {(I, supp(I)), (I ∪ J, supp(I ∪ J))}.
A rule K → J is a sub-rule of I → J if K ⊆ I.
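For concreteness, a lightweight sketch of these definitions; the helper names and the tiny transaction set are mine, not the paper's.

```python
def support(itemset, transactions):
    """Fraction of transactions that contain every item of the itemset."""
    itemset = frozenset(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def constraints_of_rule(I, J, transactions):
    """C(I -> J) = {(I, supp(I)), (I ∪ J, supp(I ∪ J))}."""
    I, J = frozenset(I), frozenset(J)
    return {I: support(I, transactions), I | J: support(I | J, transactions)}

transactions = [frozenset(t) for t in
                [{"a", "b"}, {"a", "b", "c"}, {"a", "c"}, {"b", "c"}]]
print(constraints_of_rule({"a"}, {"b"}, transactions))
# supp({a}) = 0.75, supp({a, b}) = 0.5
```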

14 The active interestingness of a rule w.r.t. a set of constraints
The active interestingness reflects the impact of adding the constraints generated by the rule to the current set of constraints: it is defined as a divergence between the distribution induced by C alone and the distribution induced by C together with C(I → J), where D is some divergence function and Q_C is the probability distribution induced by C.
Q_C is obtained using the MEM. For simplicity, how Q_C is obtained is omitted here; the procedure is proposed in the paper and a proof is available.
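To make the idea concrete, here is a hedged sketch using KL divergence as one possible choice of the divergence D (the slide leaves D abstract); the two distributions are made-up stand-ins for Q_C before and after adding the rule's constraints, and their maximum-entropy computation is omitted as on the slide.

```python
import math

def kl_divergence(q_after, q_before, base=2):
    """KL(q_after || q_before), used here only as an example divergence D."""
    return sum(p * math.log(p / q_before[x], base)
               for x, p in q_after.items() if p > 0)

q_before = {"00": 0.25, "01": 0.25, "10": 0.25, "11": 0.25}  # induced by C alone
q_after  = {"00": 0.10, "01": 0.20, "10": 0.20, "11": 0.50}  # after adding C(I -> J)
print(kl_divergence(q_after, q_before))  # ~0.24 bits: the rule changed the model
```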

15 The passive interestingness of a rule w.r.t. a set of constraints
The passive interestingness is the difference between the confidence of the rule estimated from the data and the confidence estimated from the probability distribution induced by the constraints, where P_C(X) denotes the probability of an itemset X induced by C and conf(I → J) is the confidence of the rule I → J.
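A minimal sketch of this comparison, assuming the constraint-induced itemset probabilities are already available as a dictionary (computing them via the MEM is omitted, as on the slide); the numbers and the helper name are mine.

```python
def passive_interestingness(conf_data, p_c, I, J):
    """Compare conf(I -> J) measured on the data with the confidence
    implied by the constraint-induced probabilities p_c."""
    conf_induced = p_c[frozenset(I | J)] / p_c[frozenset(I)]
    return abs(conf_data - conf_induced)

p_c = {frozenset({"a"}): 0.75, frozenset({"a", "b"}): 0.45}  # made-up induced probabilities
print(passive_interestingness(conf_data=0.667, p_c=p_c, I={"a"}, J={"b"}))
# ~0.067: the data's confidence differs little from the model's prediction
```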

16 I-nonredundancy
A rule I → J is considered I-nonredundant with respect to R, where R is a set of association rules, if I = ∅, or if I(C_{I,J}(R), I → J) is larger than some threshold, where I(·) is either I_act(·) or I_pass(·) and C_{I,J}(R) is the set of constraints induced by all sub-rules of I → J in R.

17 Pruning redundant association rules
Input: a set R of association rules.
1. For each singleton A_i in the database:
2.   R_i = {∅ → A_i}
3.   k = 1
4.   For each rule I → A_i ∈ R with |I| = k, do
5.     If I → A_i is I-nonredundant w.r.t. R_i then
6.       R_i = R_i ∪ {I → A_i}
7.   k = k + 1
8.   Goto 4
9. R = ∪ R_i
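Below is a rough Python sketch of this loop, not the authors' implementation; the I-nonredundancy test is assumed to be supplied as a callback.

```python
# Rules are (antecedent, consequent) pairs with frozenset antecedents and a
# single-item consequent; `is_nonredundant` stands for the I-nonredundancy
# test (active or passive interestingness against a threshold).
def prune_redundant_rules(rules, is_nonredundant):
    pruned = []
    consequents = {c for _, c in rules}
    for a_i in consequents:                            # step 1: each singleton A_i
        r_i = [(frozenset(), a_i)]                     # step 2: start from ∅ -> A_i
        max_k = max(len(I) for I, c in rules if c == a_i)
        for k in range(1, max_k + 1):                  # steps 3, 7, 8: grow |I|
            for I, c in rules:                         # step 4
                if c == a_i and len(I) == k and is_nonredundant(I, a_i, r_i):
                    r_i.append((I, a_i))               # steps 5-6: keep the rule
        pruned.extend(r_i)                             # step 9: R = ∪ R_i
    return pruned
```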

18 Results
A census dataset of elderly people, with 300K tuples, was used in the experiments. With the support threshold set to 1%, Apriori found 247,476 association rules (without considering the confidence of the rules).
The proposed algorithm trimmed the rule set to 194 rules when the interestingness threshold was set to 0.3; the running time was 4,801 s. When the interestingness threshold was lowered to 0.1, the algorithm trimmed the rule set to 2,056 rules; the running time was 15,480 s.

19 Entropy and interestingness
Some common measures for ranking the interestingness of association rules are based on entropy and mutual information. Examples include entropy gain and the Gini index.

20 Conclusion
Entropy and mutual information are tools for quantifying the uncertainty of events. The maximum entropy method (MEM) is an application of entropy that allows us to make reasonable guesses about the probabilities of events. The MEM can be applied to prune uninteresting association rules.

21 References
R. J. Bayardo Jr. and R. Agrawal. Mining the Most Interesting Rules. Proc. KDD '99, 1999.
D. Hankerson, G. A. Harris, and P. D. Johnson, Jr. Introduction to Information Theory and Compression. CRC Press LLC, 1998.
S. Jaroszewicz and D. A. Simovici. A General Measure of Rule Interestingness. Proc. PKDD '01, 2001.
S. Jaroszewicz and D. A. Simovici. Pruning Redundant Association Rules Using Maximum Entropy Principle. Proc. PAKDD '02, 2002.
A. Papoulis. Probability, Random Variables, and Stochastic Processes, Third Edition. McGraw-Hill, 1991.

22 Q & A

