MAXIMALLY INFORMATIVE K-ITEMSETS
Motivation
- Subgroup Discovery typically produces very many patterns with high levels of redundancy
- Grammatically different patterns represent the same subgroup
- Complements and combinations of patterns, e.g.:
  marital-status = ‘Married-civ-spouse’ ∧ age ≥ 29
  marital-status = ‘Married-civ-spouse’ ∧ education-num ≥ 8
  marital-status = ‘Married-civ-spouse’ ∧ age ≤ 76
  age ≤ 67 ∧ marital-status = ‘Married-civ-spouse’
  marital-status = ‘Married-civ-spouse’
  age ≥ 33 ∧ marital-status = ‘Married-civ-spouse’
  …
Dissimilar Patterns
- Optimize the dissimilarity of the patterns reported
- Consider the additional value of each individual pattern reported
- Consider the extent of patterns
- Treat patterns as binary features (items)
- The joint entropy of an itemset captures the informativeness of a pattern set
Joint Entropy of an Itemset
- Binary features (items) x1, x2, ..., xn
- Itemset of size k: X = {x1, ..., xk}
- Joint entropy: H(X) = −Σ p(x1 = b1, ..., xk = bk) · lg p(x1 = b1, ..., xk = bk), where the sum ranges over all value combinations (b1, ..., bk) ∈ {0, 1}^k
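As a sketch (not code from the slides), the joint entropy of an itemset can be estimated from data by counting the value combinations its items take; the function name and the dict-per-row data format are illustrative assumptions:

```python
from collections import Counter
from math import log2

def joint_entropy(rows, itemset):
    """Empirical joint entropy H(X) of the items in `itemset`.

    rows: list of dicts mapping item name -> 0/1 (illustrative format)
    """
    n = len(rows)
    counts = Counter(tuple(r[x] for x in itemset) for r in rows)
    return -sum(c / n * log2(c / n) for c in counts.values())

# a single fair item carries exactly 1 bit
rows = [{"x": 0}, {"x": 1}]
print(joint_entropy(rows, ["x"]))  # 1.0
```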
Joint Entropy
- Each item cuts the database into 2 parts (not necessarily of equal size)
- Each additional item cuts each part in 2
- Joint entropy is maximal when all parts have equal size
Definition: miki
A maximally informative k-itemset (miki) is an itemset of size k that maximizes the joint entropy. Formally: an itemset X ⊆ I of cardinality k is a maximally informative k-itemset iff H(Y) ≤ H(X) for all itemsets Y ⊆ I of cardinality k.
Properties of Joint Entropy and miki's
- Symmetric treatment of 0's and 1's
- Both infrequent and very frequent items are discouraged; the optimum is at p(xi) = 0.5
- Items in a miki are (relatively) independent
These goals are orthogonal to mining associations, which focuses on the value 1, encourages frequent items, and finds items that are dependent.
More Properties
- At most 1 bit of information per item
- Monotonicity of joint entropy: if X and Y are itemsets with X ⊆ Y, then H(X) ≤ H(Y)
- Unit growth of joint entropy: if X and Y are itemsets with X ⊆ Y, then H(Y) ≤ H(X) + |Y \ X|
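Both properties can be checked numerically on a toy dataset (the helper and the data below are illustrative, not from the slides); with X = {x} and Y = {x, y} the unit-growth bound allows at most one extra bit:

```python
from collections import Counter
from math import log2

def joint_entropy(rows, itemset):
    # empirical joint entropy over the value combinations of `itemset`
    n = len(rows)
    counts = Counter(tuple(r[x] for x in itemset) for r in rows)
    return -sum(c / n * log2(c / n) for c in counts.values())

rows = [{"x": 0, "y": 0}, {"x": 0, "y": 1}, {"x": 1, "y": 0},
        {"x": 1, "y": 1}, {"x": 1, "y": 1}, {"x": 1, "y": 0}]

h_x = joint_entropy(rows, ["x"])
h_xy = joint_entropy(rows, ["x", "y"])

# monotonicity: X ⊆ Y implies H(X) <= H(Y)
assert h_x <= h_xy
# unit growth: H(Y) <= H(X) + |Y \ X|, here |Y \ X| = 1
assert h_xy <= h_x + 1 + 1e-9
```

In this particular dataset y is a fair coin given either value of x, so the unit-growth bound is met with equality.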
Properties: Independence Bound
- Independence bound on joint entropy: for an itemset X = {x1, ..., xk}, H(X) ≤ H(x1) + ... + H(xk)
- Every item adds at most H(xi)
- Items potentially share information, hence the ≤
- Equality holds iff the items are independent
- A candidate itemset can be discarded if the bound is not above the current maximum (no need to check the data)
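The pruning test can be sketched as follows; `can_skip_scan` and the two-copy dataset are illustrative assumptions, not the slides' code:

```python
from math import log2

def item_entropy(rows, item):
    # H(xi) for a single binary item
    p = sum(r[item] for r in rows) / len(rows)
    return 0.0 if p in (0.0, 1.0) else -p * log2(p) - (1 - p) * log2(1 - p)

def can_skip_scan(rows, candidate, current_max):
    # independence bound: H(X) <= sum_i H(xi); when the bound is not
    # above the best joint entropy found so far, the candidate cannot
    # be a miki and the table scan can be skipped
    return sum(item_entropy(rows, x) for x in candidate) <= current_max

rows = [{"a": 0, "b": 0}, {"a": 1, "b": 1}]  # a and b are perfect copies
print(can_skip_scan(rows, ["a", "b"], 2.0))  # True: bound 2.0 is not above 2.0
print(can_skip_scan(rows, ["a", "b"], 1.5))  # False: bound 2.0 might beat 1.5
```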
Example 1

  A B C D
  1 1 1 0
  1 1 0 0
  1 1 1 0
  1 0 0 0
  0 1 1 0
  0 0 0 1
  0 0 1 1
  0 0 0 1

- H(A) = 1, H(B) = 1, H(C) = 1
- H(D) = −⅜ lg ⅜ − ⅝ lg ⅝ ≈ 0.95
- {A, B, C} is a miki: H({A, B, C}) = 2.5 ≤ 3
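The slide's numbers can be reproduced from the table (the helper and row encoding are illustrative):

```python
from collections import Counter
from math import log2

def joint_entropy(rows, itemset):
    n = len(rows)
    counts = Counter(tuple(r[x] for x in itemset) for r in rows)
    return -sum(c / n * log2(c / n) for c in counts.values())

# the 8 rows of the example, columns A, B, C, D
data = ["1110", "1100", "1110", "1000", "0110", "0001", "0011", "0001"]
rows = [{k: int(v) for k, v in zip("ABCD", line)} for line in data]

print(round(joint_entropy(rows, "D"), 2))    # 0.95
print(round(joint_entropy(rows, "ABC"), 2))  # 2.5
```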
Partitions of Itemsets
- Group items that share information
- Obtain a tighter bound
- Precompute the joint entropy of small itemsets (e.g. 2- or 3-itemsets)
- Joint entropy of a partition: if P = {B1, ..., Bm} is a partition of an itemset, the joint entropy of P is defined as H(P) = H(B1) + ... + H(Bm)
Partition Properties
- Partitioned bound on joint entropy: if P = {B1, ..., Bm} is a partition of an itemset X, then H(X) ≤ H(P)
- Independence bound on the partitioned joint entropy: if P = {B1, ..., Bm} is a partition of an itemset X = {x1, ..., xk}, then H(P) ≤ H(x1) + ... + H(xk)
Example 2

  A B C D
  1 1 1 0
  1 1 0 0
  1 1 1 0
  1 0 0 0
  0 1 1 0
  0 0 0 1
  0 0 1 1
  0 0 0 1

- B and D are similar
- {{B, D}, {C}} is a partition of {B, C, D}
- H({B, C, D}) = 2.16
- H({{B, D}, {C}}) = 2.41
- H(B) + H(C) + H(D) = 2.95
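These three values illustrate the chain H(X) ≤ H(P) ≤ Σ H(xi) and can be verified on the table (helper and row encoding are illustrative):

```python
from collections import Counter
from math import log2

def joint_entropy(rows, itemset):
    n = len(rows)
    counts = Counter(tuple(r[x] for x in itemset) for r in rows)
    return -sum(c / n * log2(c / n) for c in counts.values())

data = ["1110", "1100", "1110", "1000", "0110", "0001", "0011", "0001"]
rows = [{k: int(v) for k, v in zip("ABCD", line)} for line in data]

h_itemset = joint_entropy(rows, "BCD")                              # true joint entropy
h_partition = joint_entropy(rows, "BD") + joint_entropy(rows, "C")  # partitioned bound
h_independence = sum(joint_entropy(rows, x) for x in "BCD")         # independence bound

print(round(h_itemset, 2), round(h_partition, 2), round(h_independence, 2))
# 2.16 2.41 2.95
```

The partitioned bound 2.41 is noticeably tighter than the independence bound 2.95 because B and D share information.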
Algorithms
- Algorithm 1: exhaustively consider all itemsets of size k and return the optimum
- Algorithm 2: use the independence bound to skip the table scan when the bound is not above the current optimum
- Algorithm 3: use the partitioned bound on the joint entropy, with a random partition into k/2 blocks
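A minimal sketch of Algorithms 1 and 2 combined (function names and data layout are illustrative assumptions): enumerate all k-itemsets, but scan the data only when the independence bound can still beat the current optimum:

```python
from collections import Counter
from itertools import combinations
from math import log2

def joint_entropy(rows, itemset):
    n = len(rows)
    counts = Counter(tuple(r[x] for x in itemset) for r in rows)
    return -sum(c / n * log2(c / n) for c in counts.values())

def miki_exhaustive(rows, items, k):
    # Algorithm 2 sketch: exhaustive search with independence-bound pruning
    h1 = {x: joint_entropy(rows, [x]) for x in items}  # single-item entropies
    best, best_h = None, -1.0
    for cand in combinations(items, k):
        if sum(h1[x] for x in cand) <= best_h:
            continue  # bound not above current maximum: skip the table scan
        h = joint_entropy(rows, cand)
        if h > best_h:
            best, best_h = set(cand), h
    return best, best_h

# the 8-row example table, columns A, B, C, D
data = ["1110", "1100", "1110", "1000", "0110", "0001", "0011", "0001"]
rows = [{k: int(v) for k, v in zip("ABCD", line)} for line in data]
best, h = miki_exhaustive(rows, "ABCD", 3)
print(sorted(best), round(h, 2))  # ['A', 'B', 'C'] 2.5
```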
Algorithms
- Algorithm 4: consider a prefix X of size k − l of the current itemset; if the upper bound on the extensions of X is below the current optimum, skip all extensions of X (l = 3 gives the best results in practice)
- Algorithm 5: repeatedly add the item that improves the joint entropy the most (greedy forward selection)
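The forward selection of Algorithm 5 can be sketched as follows (function name and data layout are illustrative assumptions); on the 8-row example it recovers the miki found exhaustively:

```python
from collections import Counter
from math import log2

def joint_entropy(rows, itemset):
    n = len(rows)
    counts = Counter(tuple(r[x] for x in itemset) for r in rows)
    return -sum(c / n * log2(c / n) for c in counts.values())

def miki_greedy(rows, items, k):
    # Algorithm 5 sketch: forward selection; at each step add the item
    # that increases the joint entropy of the chosen set the most
    chosen = []
    for _ in range(k):
        nxt = max((x for x in items if x not in chosen),
                  key=lambda x: joint_entropy(rows, chosen + [x]))
        chosen.append(nxt)
    return set(chosen)

# the 8-row example table, columns A, B, C, D
data = ["1110", "1100", "1110", "1000", "0110", "0001", "0011", "0001"]
rows = [{k: int(v) for k, v in zip("ABCD", line)} for line in data]
print(sorted(miki_greedy(rows, "ABCD", 3)))  # ['A', 'B', 'C']
```

Greedy selection needs only about n·k table scans instead of an exponential number, at the cost of a (usually small) loss in joint entropy, as the result slides below show.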
Example
[Figure: 82 subgroups discovered in 2-dimensional space; a miki of 4 patterns]
Mushroom (119 × 8124): number of table scans

         k=2     k=3      k=4       k=5       k=6       k=7
Alg 1    7,021   273,819  7.94·10⁶  1.82·10⁸  3.47·10⁹  5.6·10¹⁰
Alg 2    1,226   54,917   69,134    1.23·10⁶  1.95·10⁷  –
Alg 3    48      360      29,747    211,934   4.58·10⁶  –
Alg 4    –       –        60        29,747    209,329   4.4·10⁶
Alg 5    237     354      470       585       699       812
Mushroom (119 × 8124): running time (min:sec)

         k=2    k=3    k=4     k=5     k=6     k=7
Alg 1    0:18   16:29  735:23  >1000   –       –
Alg 2    0:03   1:34   34:21   692:42  >1000   –
Alg 3    0:36   0:38   1:36    23:25   445:58  >1000
Alg 4    –      –      1:37    16:17   244:11  >1000
Alg 5    0:00   0:00   0:01    0:03    0:04    –
Joint Entropy of miki's (Mushroom)
[Figure: entropy of the miki and of the greedy approximation vs. number of items, with the y = x upper bound]
Joint Entropy of miki's (LumoLogp)
[Figure: entropy of the miki and of the greedy approximation vs. number of items, with the y = x upper bound]