Maximally Informative k-Itemsets
Motivation
- Subgroup Discovery typically produces very many patterns with high levels of redundancy:
  - grammatically different patterns that represent the same subgroup
  - complements
  - combinations of patterns
- Example of redundant patterns describing the same subgroup:
  marital-status = ‘Married-civ-spouse’ ∧ age ≥ 29
  marital-status = ‘Married-civ-spouse’ ∧ education-num ≥ 8
  marital-status = ‘Married-civ-spouse’ ∧ age ≤ 76
  age ≤ 67 ∧ marital-status = ‘Married-civ-spouse’
  marital-status = ‘Married-civ-spouse’
  age ≥ 33 ∧ marital-status = ‘Married-civ-spouse’
  …
Dissimilar patterns
- Optimize the dissimilarity of the patterns reported
- Consider the additional value of each individual pattern reported
- Consider the extent of patterns: treat patterns as binary features (items)
- The joint entropy of an itemset then captures the informativeness of a pattern set
Joint Entropy of an itemset
- Binary features (items) x1, x2, ..., xn
- Itemset of size k: X = {x1, …, xk}
- Joint entropy:
  H(X) = −∑_{B ∈ {0,1}^k} p(X = B) lg p(X = B)
  where lg is the base-2 logarithm and the sum runs over all k-bit outcomes B
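The definition translates directly into code. Below is a minimal sketch (not from the slides): a joint_entropy function that counts how often each 0/1 pattern of the chosen items occurs and sums −p lg p over the observed patterns. The row-per-transaction dict representation is an assumption made for illustration.

    from collections import Counter
    from math import log2

    def joint_entropy(rows, itemset):
        """Joint entropy H(X) in bits of a set of binary items.

        rows    : list of dicts mapping item name -> 0 or 1 (one per transaction)
        itemset : iterable of item names
        """
        items = tuple(itemset)
        # Count how often each 0/1 pattern of the selected items occurs.
        counts = Counter(tuple(row[i] for i in items) for row in rows)
        n = len(rows)
        # Sum -p lg p over the observed patterns (unobserved patterns contribute 0).
        return -sum((c / n) * log2(c / n) for c in counts.values())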
Joint Entropy
- Each item cuts the database into 2 parts (not necessarily of equal size)
- Each additional item cuts each part in 2
- Joint entropy is maximal when all parts have equal size
Example: joint entropy of three items
Items A, B, C over 8 rows, with pattern counts:
111: 2×, 110: 1×, 100: 1×, 011: 1×, 000: 2×, 001: 1×

H(2, 1, 1, 1, 2, 1) = −¼ lg ¼ − ⅛ lg ⅛ − ⅛ lg ⅛ − ⅛ lg ⅛ − ¼ lg ¼ − ⅛ lg ⅛ = 2.5 bits
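Using the joint_entropy sketch from above, the example can be checked numerically (the eight rows are transcribed from the pattern counts on this slide):

    # The eight rows of the example: 111 (2x), 110, 100, 011, 000 (2x), 001
    patterns = ["111", "111", "110", "100", "011", "000", "000", "001"]
    rows = [dict(zip("ABC", map(int, p))) for p in patterns]
    print(joint_entropy(rows, "ABC"))  # 2.5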
Definition miki
A Maximally Informative k-Itemset (miki) is an itemset of size k that maximizes the joint entropy: an itemset X ⊆ I of cardinality k is a maximally informative k-itemset iff, for all itemsets Y ⊆ I of cardinality k, H(X) ≥ H(Y).
Properties of joint entropy, miki’s
- Symmetric treatment of 0’s and 1’s
- Both infrequent and frequent items are discouraged; an item is optimal at p(xi) = 0.5
- Items in a miki are (relatively) independent
These goals are orthogonal to mining associations, which focuses on the value 1, encourages frequent items, and finds items that are dependent.
More Properties
- At most 1 bit of information per item
- Monotonicity of joint entropy: if X and Y are two itemsets such that X ⊆ Y, then H(X) ≤ H(Y)
- Unit growth of joint entropy: H(X ∪ {y}) ≤ H(X) + 1, so adding an item never adds more than 1 bit
A small numeric check of both properties follows below.
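Both properties can be observed on the eight example rows defined earlier, again assuming the joint_entropy sketch:

    h_ab  = joint_entropy(rows, "AB")   # ~1.81 bits
    h_abc = joint_entropy(rows, "ABC")  # 2.5 bits
    assert h_ab <= h_abc <= h_ab + 1    # monotone, and item C adds at most 1 bit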
Properties: Independence Bound
independence bound on joint entropy: suppose that X = {x1, …, xk} is an itemset; then
H(X) ≤ ∑_{i=1}^{k} H(xi)
- Every item adds at most H(xi)
- Items potentially share information, hence the ≤
- Equality holds iff the items are independent
- A candidate itemset can be discarded if the bound is not above the current maximum (no need to check the data)
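A sketch of how the bound is used for pruning; independence_bound and worth_scanning are illustrative names, not from the slides, and a real implementation would precompute the individual entropies H(xi) once rather than recomputing them per candidate:

    def independence_bound(rows, itemset):
        """Upper bound on H(X): the sum of the items' individual entropies."""
        return sum(joint_entropy(rows, [i]) for i in itemset)

    def worth_scanning(rows, candidate, current_max):
        """Skip the table scan when the bound cannot beat the current maximum."""
        return independence_bound(rows, candidate) > current_max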
Example 1
H(A) = 1, H(B) = 1, H(C) = 1, H(D) = −⅜ lg ⅜ − ⅝ lg ⅝ ≈ 0.96
{A, B, C} is a miki: H({A, B, C}) = 2.5 ≤ H(A) + H(B) + H(C) = 3
[Data table over the items A, B, C, D omitted.]
Partitions of itemsets
- Group items that share information to obtain a tighter bound
- Precompute the joint entropy of small itemsets (e.g. 2- or 3-itemsets)
joint entropy of partition: suppose that P = {B1, …, Bm} is a partition of an itemset; the joint entropy of P is defined as
H(P) = ∑_{i=1}^{m} H(Bi)
Partition Properties
partitioned bound on joint entropy: suppose that P = {B1, …, Bm} is a partition of an itemset X; then
H(X) ≤ H(P)
independence bound on partitioned joint entropy: suppose that P = {B1, …, Bm} is a partition of an itemset X = {x1, …, xk}; then
H(P) ≤ ∑_{i=1}^{k} H(xi)
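A sketch of the partition entropy, with a numeric check of the two bounds on the example rows from earlier; partition_entropy and the partition {{A, B}, {C}} are illustrative choices, not the ones from the slides:

    def partition_entropy(rows, partition):
        """H(P) for a partition P = {B1, ..., Bm}: sum of the blocks' joint entropies."""
        return sum(joint_entropy(rows, block) for block in partition)

    # H(X) <= H(P) <= sum of the individual entropies H(xi)
    h_x = joint_entropy(rows, "ABC")                      # 2.5
    h_p = partition_entropy(rows, [["A", "B"], ["C"]])    # ~2.81
    assert h_x <= h_p <= independence_bound(rows, "ABC")  # 2.5 <= 2.81 <= 3.0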
Example 2
B and D are similar, so {{B, D}, {C}} is a partition of {B, C, D} that groups shared information:
H({B, C, D}) = 2.16 ≤ H({{B, D}, {C}}) = 2.41 ≤ H(B) + H(C) + H(D) = 2.96
[Data table over the items A, B, C, D omitted.]
Algorithms
- Algorithm 1: exhaustively consider all itemsets of size k and return the optimal one
- Algorithm 2: use the independence bound to skip the table scan when the bound is not above the current optimum (a sketch follows below)
- Algorithm 3: use the partitioned bound on joint entropy, with a random partition into k/2 blocks
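A sketch of Algorithm 2 under the assumptions above: exhaustive enumeration, with the independence bound deciding whether a candidate needs a table scan. find_miki is an illustrative name; in practice the per-item entropies would be cached rather than recomputed per candidate.

    from itertools import combinations

    def find_miki(rows, items, k):
        """Exhaustive search for a miki of size k, pruned with the independence bound."""
        best_set, best_h = None, -1.0
        for candidate in combinations(items, k):
            # Algorithm 2: skip the table scan if the bound cannot beat the optimum.
            if independence_bound(rows, candidate) <= best_h:
                continue
            h = joint_entropy(rows, candidate)
            if h > best_h:
                best_set, best_h = candidate, h
        return best_set, best_h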
Algorithms
- Algorithm 4: consider the prefix X of size k − l of the current itemset; if the upper bound on every extension of X is below the current optimum, skip all extensions of X (l = 3 gives the best results in practice)
- Algorithm 5: repeatedly add the item that improves the joint entropy the most (forward selection; sketched below)
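A sketch of the forward selection of Algorithm 5, again assuming the joint_entropy helper from earlier. Each step scans once per remaining item, which is why its scan counts in the tables below grow only linearly with k instead of combinatorially:

    def greedy_miki(rows, items, k):
        """Forward selection: grow the set by the item that raises H the most."""
        selected = []
        for _ in range(k):
            best = max((i for i in items if i not in selected),
                       key=lambda i: joint_entropy(rows, selected + [i]))
            selected.append(best)
        return selected, joint_entropy(rows, selected)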
Example 3
82 subgroups discovered in a 2-dimensional space
Miki of 4 patterns
Mushroom (119 items × 8124 rows): number of table scans

        k=2     k=3      k=4       k=5       k=6       k=7
Alg 1   7,021   273,819  7.94·10⁶  1.82·10⁸  3.47·10⁹  5.6·10¹⁰
Alg 2   12      265      4,917     69,134    1.23·10⁶  1.95·10⁷
Alg 3   83      602      9,747     211,934   4.58·10⁶  –
Alg 4   –       –        –         –         209,329   4.4·10⁶
Alg 5   237     354      470       585       699       812

(– : value not recoverable from the source)
Mushroom (119 items × 8124 rows): running time (min:sec)

        k=2   k=3    k=4     k=5     k=6     k=7
Alg 1   0:18  16:29  735:23  >1000   –       –
Alg 2   0:03  1:34   34:21   692:42  –       –
Alg 3   0:36  0:38   1:36    23:25   445:58  –
Alg 4   –     –      –       1:37    16:17   244:11
Alg 5   0:01  –      –       –       –       0:04

(– : value not recoverable from the source)
Joint Entropy of miki’s (Mushroom)
[Chart: joint entropy vs. number of items, showing the entropy of the miki and the entropy of the greedy approximation against the y = x upper bound.]
Joint Entropy of miki’s (LumoLogp)
[Chart: joint entropy vs. number of items, showing the entropy of the miki and the entropy of the greedy approximation against the y = x upper bound.]