
1 MAXIMALLY INFORMATIVE K-ITEMSETS

2 Motivation
• Subgroup Discovery typically produces very many patterns with high levels of redundancy
• Grammatically different patterns represent same subgroup
• Complements
• Combinations of patterns
  marital-status = ‘Married-civ-spouse’ ∧ age ≥ 29
  marital-status = ‘Married-civ-spouse’ ∧ education-num ≥ 8
  marital-status = ‘Married-civ-spouse’ ∧ age ≤ 76
  age ≤ 67 ∧ marital-status = ‘Married-civ-spouse’
  marital-status = ‘Married-civ-spouse’
  age ≥ 33 ∧ marital-status = ‘Married-civ-spouse’
  …

3 Dissimilar patterns
• Optimize dissimilarity of the patterns reported
• Additional value of individual patterns reported
• Consider extent of patterns
• Treat patterns as binary features/items
• Joint entropy of the itemset captures the informativeness of the pattern set

4 Joint Entropy of Itemset
• Binary features (items) x1, x2, …, xn
• Itemset of size k: X = {x1, …, xk}
• Joint entropy: H(X) = − Σ p(X = B) · lg p(X = B), summing over all value combinations B ∈ {0, 1}^k
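To make the definition concrete, here is a minimal Python sketch of this joint entropy computation over a 0/1 database; the function name and the list-of-rows representation are illustrative assumptions, not part of the slides.

```python
from collections import Counter
from math import log2

def joint_entropy(rows, itemset):
    """Joint entropy H(X) of the columns in `itemset` over a binary database.

    rows    : list of tuples/lists of 0/1 values (one per record)
    itemset : indices of the items (columns) that form X
    """
    n = len(rows)
    # Count how often each combination of 0/1 values occurs for the itemset.
    counts = Counter(tuple(row[i] for i in itemset) for row in rows)
    # H(X) = - sum over observed combinations B of p(B) * lg p(B)
    return -sum((c / n) * log2(c / n) for c in counts.values())
```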

5 Joint Entropy
• Each item cuts the database into 2 parts (not necessarily of equal size)
• Each additional item cuts each part in 2
• Joint entropy is maximal when all parts have equal size

6 Definition: miki
A Maximally Informative k-Itemset (miki) is an itemset of size k that maximizes the joint entropy:
An itemset X ⊆ I of cardinality k is a maximally informative k-itemset, iff for all itemsets Y ⊆ I of cardinality k, H(Y) ≤ H(X).

7 Properties of joint entropy, miki's
• Symmetric treatment of 0's and 1's
• Both infrequent and frequent items discouraged
• Optimal at p(xi) = 0.5
• Items in a miki are (relatively) independent
• Goals orthogonal to mining associations:
  - Focus on value 1
  - Frequent items are encouraged
  - Find items that are dependent

8 More Properties
• At most 1 bit of information per item

Monotonicity of joint entropy: Suppose X and Y are two itemsets such that X ⊆ Y. Then H(X) ≤ H(Y).

Unit growth of joint entropy: Suppose X and Y are two itemsets such that X ⊆ Y. Then H(Y) ≤ H(X) + |Y \ X|.

9 Properties: Independence Bound
Independence bound on joint entropy: Suppose that X = {x1, …, xk} is an itemset. Then H(X) ≤ H(x1) + … + H(xk).
• Every item adds at most H(xi)
• Items potentially share information, hence ≤
• = iff items are independent
• A candidate itemset can be discarded if the bound is not above the current maximum (no need to check the data)
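A small sketch of how this pruning rule might look in code, reusing the joint_entropy helper sketched after slide 4; the helper names here are hypothetical.

```python
def independence_bound(rows, itemset):
    """Upper bound on H(itemset): the sum of the single-item entropies H(xi)."""
    return sum(joint_entropy(rows, (i,)) for i in itemset)

def worth_scanning(rows, candidate, current_best):
    """A candidate can be discarded without a table scan if its bound
    cannot exceed the best joint entropy found so far."""
    return independence_bound(rows, candidate) > current_best
```

In practice the single-item entropies would be precomputed once, so evaluating the bound costs only a few additions per candidate.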

10 Example 1
• H(A) = 1, H(B) = 1, H(C) = 1
• H(D) = −⅜ lg ⅜ − ⅝ lg ⅝ ≈ 0.96
• {A, B, C} is a miki
• H({A, B, C}) = 2.5 ≤ 3

A B C D
1 1 1 0
1 1 0 0
1 1 1 0
1 0 0 0
0 1 1 0
0 0 0 1
0 0 1 1
0 0 0 1
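As a sanity check, the numbers on this slide can be reproduced with the joint_entropy sketch above; the rows variable below simply encodes the slide's table.

```python
# The example database from slide 10; columns are A, B, C, D (indices 0-3).
rows = [
    (1, 1, 1, 0), (1, 1, 0, 0), (1, 1, 1, 0), (1, 0, 0, 0),
    (0, 1, 1, 0), (0, 0, 0, 1), (0, 0, 1, 1), (0, 0, 0, 1),
]
print(joint_entropy(rows, (3,)))       # H(D)        ≈ 0.954
print(joint_entropy(rows, (0, 1, 2)))  # H({A, B, C}) = 2.5
```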

11 Partitions of itemsets
• Group items that share information
• Obtain a tighter bound
• Precompute joint entropy of small itemsets (e.g. 2- or 3-itemsets)

Joint entropy of partition: Suppose that P = {B1, …, Bm} is a partition of an itemset. The joint entropy of P is defined as H(P) = H(B1) + … + H(Bm).

12 Partition Properties
Partitioned bound on joint entropy: Suppose that P = {B1, …, Bm} is a partition of an itemset X. Then H(X) ≤ H(P).

Independence bound on partitioned joint entropy: Suppose that P = {B1, …, Bm} is a partition of an itemset X = {x1, …, xk}. Then H(P) ≤ H(x1) + … + H(xk).
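A minimal sketch of the partition entropy H(P), again building on the joint_entropy helper above; the function name is illustrative.

```python
def partition_entropy(rows, partition):
    """H(P) for a partition P = [B1, ..., Bm] of an itemset,
    where each block Bi is a tuple of item (column) indices."""
    return sum(joint_entropy(rows, block) for block in partition)
```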

13 Example 2
• B and D are similar
• {{B, D}, {C}} is a partition of {B, C, D}
• H({B, C, D}) = 2.16
• H({{B, D}, {C}}) = 2.41
• H(B) + H(C) + H(D) = 2.96

A B C D
1 1 1 0
1 1 0 0
1 1 1 0
1 0 0 0
0 1 1 0
0 0 0 1
0 0 1 1
0 0 0 1
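Using the partition_entropy sketch and the rows from Example 1, the slide's values can be reproduced:

```python
# Slide 13's numbers: B, C, D are columns 1, 2, 3 of the example database.
print(joint_entropy(rows, (1, 2, 3)))           # H({B, C, D})     ≈ 2.16
print(partition_entropy(rows, [(1, 3), (2,)]))  # H({{B, D}, {C}}) ≈ 2.41
```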

14 Algorithms
Algorithm 1: Exhaustively consider all itemsets of size k, and return the optimal.
Algorithm 2: Use the independence bound to skip the table scan if the bound is not above the current optimal.
Algorithm 3: Use the partitioned bound on joint entropy, with a random partition into k/2 blocks.
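Under the same assumptions as the earlier sketches (joint_entropy as defined after slide 4), Algorithms 1 and 2 might look as follows; the function names are illustrative and this is a sketch, not the authors' reference implementation.

```python
from itertools import combinations

def miki_exhaustive(rows, n_items, k):
    """Algorithm 1: evaluate every k-itemset with a table scan, keep the best."""
    best = max(combinations(range(n_items), k),
               key=lambda s: joint_entropy(rows, s))
    return best, joint_entropy(rows, best)

def miki_with_bound(rows, n_items, k):
    """Algorithm 2: only scan candidates whose independence bound
    exceeds the best joint entropy seen so far."""
    h1 = [joint_entropy(rows, (i,)) for i in range(n_items)]  # single-item entropies
    best, best_h = None, -1.0
    for itemset in combinations(range(n_items), k):
        if sum(h1[i] for i in itemset) <= best_h:
            continue  # pruned: no table scan needed for this candidate
        h = joint_entropy(rows, itemset)
        if h > best_h:
            best, best_h = itemset, h
    return best, best_h
```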

15 Algorithms
Algorithm 4: Consider the prefix X of size k − l of the current itemset. If the upper bound on any extension of X is below the current optimal, then skip all extensions of X. l = 3 gives the best results in practice.
Algorithm 5: Repeatedly add the item that improves the joint entropy the most (forward selection).
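Algorithm 5 (greedy forward selection) admits a particularly short sketch, again reusing the joint_entropy helper; the name is illustrative.

```python
def miki_greedy(rows, n_items, k):
    """Algorithm 5: greedy forward selection. Repeatedly add the item
    that increases the joint entropy of the current set the most."""
    selected = []
    for _ in range(k):
        remaining = [i for i in range(n_items) if i not in selected]
        # Pick the item whose addition yields the highest joint entropy.
        best_item = max(remaining,
                        key=lambda i: joint_entropy(rows, selected + [i]))
        selected.append(best_item)
    return selected, joint_entropy(rows, selected)
```

The greedy search evaluates roughly n + (n−1) + … + (n−k+1) candidate itemsets in total, which is why the scan counts for Alg 5 on the next slide stay in the hundreds.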

16 Example 1
• 82 subgroups discovered in a 2-dimensional space
• Miki of 4 patterns

17 Mushroom (119 x 8124): number of table scans for k = 2 … 7
Alg 1: 7021, 273,819, 7.94×10^6, 1.82×10^8, 3.47×10^9, 5.6×10^10
Alg 2: 1226, 54,917, 69,134, 1.23×10^6, 1.95×10^7
Alg 3: 48, 360, 29,747, 211,934, 4.58×10^6
Alg 4: 60, 29,747, 209,329, 4.4×10^6
Alg 5: 237, 354, 470, 585, 699, 812

18 Mushroom (119 x 8124): running time for k = 2 … 7
Alg 1: 0:18, 16:29, 735:23, >1000
Alg 2: 0, 0:03, 1:34, 34:21, 692:42, >1000
Alg 3: 0:36, 0:38, 1:36, 23:25, 445:58, >1000
Alg 4: 1:37, 16:17, 244:11, >1000
Alg 5: 0, 0, 0:01, 0:03, 0:04

19 Joint Entropy of miki's (Mushroom)
[Plot: joint entropy vs. number of items, showing the entropy of the miki, the entropy of the greedy approximation, and the y = x reference line]

20 Joint Entropy of miki's (LumoLogp)
[Plot: joint entropy vs. number of items, showing the entropy of the miki, the entropy of the greedy approximation, and the y = x reference line]

