Copyright © 2005 by Limsoon Wong Convexity in Itemset Spaces Limsoon Wong Institute for Infocomm Research.


1 Convexity in Itemset Spaces
Limsoon Wong
Institute for Infocomm Research

2 Plan
Frequent itemsets
– Convexity
– Equivalence classes, generators, & closed patterns
– Plateau representation
– Efficient mining of generators & closed patterns
Emerging patterns
Odds ratio patterns
Relative risk patterns

3 Frequent Itemsets

4 Association Rules
Buyer’s behaviour in supermarket
Mgmt are interested in rules such as

5 Frequent Itemsets
List of items: ℐ = {a, b, c, d, e, f}
List of transactions: T = {T_1, T_2, T_3, T_4, T_5}
T_1 = {a, c, d}
T_2 = {b, c, e}
T_3 = {a, b, c, e, f}
T_4 = {b, e}
T_5 = {a, b, c, e}
For each itemset I ⊆ ℐ, sup(I,T) = |{T_i ∈ T | I ⊆ T_i}|
Freq itemsets: F_T = F(ms,T) = {I ⊆ ℐ | sup(I,T) ≥ ms}
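The definitions on this slide can be checked directly on the five example transactions. The sketch below is a brute-force enumeration, not the mining method of the talk; the names `sup` and `frequent_itemsets` are chosen here for illustration.

```python
from itertools import combinations

# Transactions from the slide's running example.
TRANSACTIONS = [
    {"a", "c", "d"},
    {"b", "c", "e"},
    {"a", "b", "c", "e", "f"},
    {"b", "e"},
    {"a", "b", "c", "e"},
]
ITEMS = sorted(set().union(*TRANSACTIONS))


def sup(itemset, transactions):
    """sup(I,T): number of transactions that contain every item of I."""
    return sum(1 for t in transactions if set(itemset) <= t)


def frequent_itemsets(transactions, ms):
    """F(ms,T) by exhaustive enumeration (exponential; toy data only)."""
    result = {}
    for size in range(len(ITEMS) + 1):
        for combo in combinations(ITEMS, size):
            s = sup(combo, transactions)
            if s >= ms:
                result[frozenset(combo)] = s
    return result


F_T = frequent_itemsets(TRANSACTIONS, ms=2)
for itemset, s in sorted(F_T.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), s)
```

With ms = 2 this yields exactly the 16 subsets of {a, b, c, e}, the empty itemset included.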

6 A Priori Property
Freq itemsets from our example (ms = 2)
A priori property: I ∈ F_T ⇒ ∀ I’ ⊆ I, I’ ∈ F_T
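A quick way to see the a priori property at work is to test it exhaustively on the example. This is only a sanity check under ms = 2; the helper names are illustrative.

```python
from itertools import combinations

TRANSACTIONS = [
    {"a", "c", "d"},
    {"b", "c", "e"},
    {"a", "b", "c", "e", "f"},
    {"b", "e"},
    {"a", "b", "c", "e"},
]
MS = 2


def sup(itemset):
    """Number of transactions containing the itemset."""
    return sum(1 for t in TRANSACTIONS if set(itemset) <= t)


def subsets(itemset):
    """All subsets of an itemset, as frozensets."""
    items = sorted(itemset)
    return [frozenset(c) for r in range(len(items) + 1)
            for c in combinations(items, r)]


def satisfies_apriori(itemset):
    """If the itemset is frequent, every subset must be frequent too."""
    if sup(itemset) < MS:
        return True  # the property says nothing about infrequent itemsets
    return all(sup(s) >= MS for s in subsets(itemset))


universe = subsets(set().union(*TRANSACTIONS))
print(all(satisfies_apriori(i) for i in universe))
```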

7 Lattice of Freq Itemsets
F_T can be very large
Is there a concise rep?
Observation:
– {a, b, c, e} is maximal
– { } is minimal
– everything else is between them
⇒ Is ⟨{ }, {a, b, c, e}⟩ a concise rep for F_T?

8 Convexity
An itemset space S is convex if, for all X, Y ∈ S s.t. X ⊆ Y, we have Z ∈ S whenever X ⊆ Z ⊆ Y
An itemset X is most general in S if there is no proper subset of X in S. These itemsets form the left bound L of S
An itemset X is most specific in S if there is no proper superset of X in S. These itemsets form the right bound R of S
⟨L, R⟩ is a concise rep of S
[L, R] = {Z | X ∈ L, Y ∈ R, X ⊆ Z ⊆ Y} = S
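The convexity test and the border construction can be sketched in a few lines. The two small spaces used below are hypothetical examples, not taken from the slides, and the function names are illustrative.

```python
from itertools import combinations


def fs(items):
    return frozenset(items)


def interval(x, y):
    """All Z with x subset-of Z subset-of y (empty if x is not a subset of y)."""
    if not x <= y:
        return set()
    extra = sorted(y - x)
    return {x | frozenset(c) for r in range(len(extra) + 1)
            for c in combinations(extra, r)}


def is_convex(space):
    """S is convex if every Z between two members X and Y is also in S."""
    space = set(space)
    return all(interval(x, y) <= space for x in space for y in space)


def borders(space):
    """Left bound L (most general) and right bound R (most specific)."""
    space = set(space)
    left = {x for x in space if not any(y < x for y in space)}
    right = {x for x in space if not any(y > x for y in space)}
    return left, right


def expand(left, right):
    """[L, R]: everything sandwiched between some X in L and some Y in R."""
    return set().union(*(interval(x, y) for x in left for y in right))


S = {fs("a"), fs("ab"), fs("ac"), fs("abc")}   # hypothetical convex space
hole = {fs("a"), fs("abc")}                    # misses ab and ac
L, R = borders(S)
print(is_convex(S), is_convex(hole), expand(L, R) == S)
```

For a convex space the pair of borders loses nothing: expanding ⟨L, R⟩ gives back S.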

9 Convexity of Freq Itemsets
Proposition 1: The freq itemset space is convex
⇒ ⟨L, R⟩ is a concise rep for a freq itemset space

10 Is it good enough?
⟨{ }, {a, b, c, e}⟩ can be a concise rep for F_T
But we can’t get support values for elems in F_T

11 What is a good concise rep?
A good concise rep for F_T should enable these tasks below efficiently, w/o accessing T again:
– Task 1: Enumerate {I ∈ F_T}
– Task 2: Enumerate {(I, sup(I,T)) | I ∈ F_T}
– Task 3: Given I, decide if I ∈ F_T, & if so report sup(I,T)
– Task 4: Enumerate itemsets w/ sup in a given range
– etc.

12 Closed Itemset Rep
A pattern is a closed pattern if each of its proper supersets has a smaller support than it
The closed itemset rep of F_T is CR = {(I, sup(I,T)) | I ∈ F_T, I is a closed pattern}
Proposition 2: {(I, sup(I,T)) | I ∈ F_T} = {(I, max{sup(I’,T) | (I’, sup(I’,T)) ∈ CR, I ⊆ I’}) | I ∈ F_T}
⇒ May be inefficient for Tasks 2, 3, 4
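Proposition 2 can be illustrated on the running example: build CR once, then recover supports from CR alone. This is a brute-force sketch; `is_closed` and `sup_from_cr` are names assumed here. Note that each lookup scans all of CR, which hints at why the slide calls the representation potentially inefficient.

```python
from itertools import combinations

TRANSACTIONS = [
    {"a", "c", "d"},
    {"b", "c", "e"},
    {"a", "b", "c", "e", "f"},
    {"b", "e"},
    {"a", "b", "c", "e"},
]
MS = 2
ITEMS = sorted(set().union(*TRANSACTIONS))


def sup(itemset):
    return sum(1 for t in TRANSACTIONS if set(itemset) <= t)


frequent = {frozenset(c): sup(c)
            for r in range(len(ITEMS) + 1)
            for c in combinations(ITEMS, r) if sup(c) >= MS}


def is_closed(itemset):
    """Closed: adding any further item strictly lowers the support."""
    return all(sup(itemset | {x}) < sup(itemset)
               for x in ITEMS if x not in itemset)


# Closed itemset rep: closed frequent itemsets with their supports.
CR = {i: s for i, s in frequent.items() if is_closed(i)}


def sup_from_cr(itemset):
    """Proposition 2: support = max support over closed supersets in CR."""
    candidates = [s for closed, s in CR.items() if itemset <= closed]
    return max(candidates) if candidates else None  # None: not frequent


print(sorted("".join(sorted(c)) for c in CR))
print(sup_from_cr(frozenset("ab")))
```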

13 Generator Rep
A pattern is a generator if each of its proper subsets has a larger support than it
The generator rep of F_T is GR = ⟨{(I, sup(I,T)) | I ∈ F_T, I is a generator}, GBd-⟩, where GBd- are the minimal infrequent itemsets
Proposition 3: {(I, sup(I,T)) | I ∈ F_T} = {(I, min{sup(I’,T) | I’ ∈ GR, I’ ⊆ I}) | I ∈ F_T}
⇒ May be inefficient for Tasks 2, 3, 4
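The generator rep and Proposition 3 can be tried out the same way. In this sketch GBd- is used to decide frequency and the generators supply the support; all helper names are assumptions made for illustration.

```python
from itertools import combinations

TRANSACTIONS = [
    {"a", "c", "d"},
    {"b", "c", "e"},
    {"a", "b", "c", "e", "f"},
    {"b", "e"},
    {"a", "b", "c", "e"},
]
MS = 2
ITEMS = sorted(set().union(*TRANSACTIONS))


def sup(itemset):
    return sum(1 for t in TRANSACTIONS if set(itemset) <= t)


frequent = {frozenset(c): sup(c)
            for r in range(len(ITEMS) + 1)
            for c in combinations(ITEMS, r) if sup(c) >= MS}


def is_generator(itemset):
    """Generator: dropping any single item strictly raises the support."""
    return all(sup(itemset - {x}) > sup(itemset) for x in itemset)


generators = {i: s for i, s in frequent.items() if is_generator(i)}

# GBd-: infrequent itemsets all of whose immediate subsets are frequent.
neg_border = {frozenset(c)
              for r in range(1, len(ITEMS) + 1)
              for c in combinations(ITEMS, r)
              if sup(c) < MS and all(sup(set(c) - {x}) >= MS for x in c)}


def sup_from_gr(itemset):
    """Proposition 3: support = min support over generator subsets."""
    if any(b <= itemset for b in neg_border):
        return None  # contains a minimal infrequent itemset
    return min(s for g, s in generators.items() if g <= itemset)


print(sorted("".join(sorted(g)) for g in generators))
print(sorted("".join(sorted(b)) for b in neg_border))
```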

14 Freq Itemset Plateaus
Decompose freq itemset lattice into plateaus wrt itemset support: S = ∪_i P_i, with P_i = {I ∈ S | sup(I,T) = i}
Proposition 6: Each P_i is convex
⇒ S = ∪_i [L_i, R_i], where [L_i, R_i] = P_i
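On the running example the plateau decomposition is small enough to list by hand. The sketch below groups the frequent itemsets by support, checks each plateau for convexity, and prints its borders; it is a brute-force illustration with assumed helper names.

```python
from collections import defaultdict
from itertools import combinations

TRANSACTIONS = [
    {"a", "c", "d"},
    {"b", "c", "e"},
    {"a", "b", "c", "e", "f"},
    {"b", "e"},
    {"a", "b", "c", "e"},
]
MS = 2
ITEMS = sorted(set().union(*TRANSACTIONS))


def sup(itemset):
    return sum(1 for t in TRANSACTIONS if set(itemset) <= t)


# Plateau P_i: all frequent itemsets with support exactly i.
plateaus = defaultdict(set)
for r in range(len(ITEMS) + 1):
    for combo in combinations(ITEMS, r):
        if sup(combo) >= MS:
            plateaus[sup(combo)].add(frozenset(combo))


def interval(x, y):
    """All Z with x subset-of Z subset-of y."""
    if not x <= y:
        return set()
    extra = sorted(y - x)
    return {x | frozenset(c) for k in range(len(extra) + 1)
            for c in combinations(extra, k)}


def is_convex(space):
    return all(interval(x, y) <= space for x in space for y in space)


def borders(space):
    left = {x for x in space if not any(y < x for y in space)}
    right = {x for x in space if not any(y > x for y in space)}
    return left, right


def show(space):
    return sorted("".join(sorted(x)) for x in space)


for level in sorted(plateaus, reverse=True):
    left, right = borders(plateaus[level])
    print(level, is_convex(plateaus[level]), show(left), show(right))
```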

15 From Generators & Closed Patterns To Equivalence Classes
The equivalence class of an itemset I is [I]_T = {I’ | {T_i ∈ T | I’ ⊆ T_i} = {T_j ∈ T | I ⊆ T_j}}
Proposition 4: [I]_T is convex. Furthermore, if [L, R] = [I]_T, then L = min [I]_T, and R = max [I]_T is a singleton
Proposition 5:
– An itemset I is a generator iff I ∈ min [I]_T
– An itemset I is a closed pattern iff I ∈ max [I]_T
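Equivalence classes are easy to materialise on toy data by grouping itemsets on the set of transactions they occur in. The sketch below does that for the running example and reads off the generators (minimal members) and the closed pattern (the single maximal member) of each class; the names are illustrative.

```python
from collections import defaultdict
from itertools import combinations

TRANSACTIONS = [
    {"a", "c", "d"},
    {"b", "c", "e"},
    {"a", "b", "c", "e", "f"},
    {"b", "e"},
    {"a", "b", "c", "e"},
]
MS = 2
ITEMS = sorted(set().union(*TRANSACTIONS))


def tidset(itemset):
    """Indices of the transactions that contain the itemset."""
    return frozenset(i for i, t in enumerate(TRANSACTIONS)
                     if set(itemset) <= t)


# Group frequent itemsets by the transactions they occur in.
classes = defaultdict(set)
for r in range(len(ITEMS) + 1):
    for combo in combinations(ITEMS, r):
        tids = tidset(combo)
        if len(tids) >= MS:
            classes[tids].add(frozenset(combo))


def minimal(space):
    return {x for x in space if not any(y < x for y in space)}


def maximal(space):
    return {x for x in space if not any(y > x for y in space)}


def show(space):
    return sorted("".join(sorted(x)) for x in space)


for tids, members in sorted(classes.items(), key=lambda kv: -len(kv[0])):
    print(len(tids), "generators:", show(minimal(members)),
          "closed:", show(maximal(members)))
```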

16 Plateaus = Generators + Closed Patterns
Theorem 7: Let [L_i, R_i] = P_i be a freq itemset plateau of F_T. Then
– P_i = [X_1]_T ∪ … ∪ [X_k]_T, where R_i = {X_1, …, X_k}
– R_i are the closed patterns in P_i
– L_i = ∪_j min [X_j]_T are the generators in P_i

17 Freq Itemset Plateau Rep
The freq itemset plateau rep of F_T is PR = {(⟨L_i, R_i⟩, i) | i ≥ ms}, where [L_i, R_i] is the plateau at support level i in F_T
Proposition 8: {(I, sup(I,T)) | I ∈ F_T} = {(I, i) | (⟨L_i, R_i⟩, i) ∈ PR, X ∈ L_i, Y ∈ R_i, X ⊆ I ⊆ Y}
⇒ All 4 tasks are obviously efficient
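Task 3 against PR amounts to a border lookup. In the sketch below PR is built by brute force from the running example (the talk mines it with GC-growth instead), and `lookup` then answers support queries from the borders alone. `build_pr` and `lookup` are names assumed for illustration.

```python
from collections import defaultdict
from itertools import combinations

TRANSACTIONS = [
    {"a", "c", "d"},
    {"b", "c", "e"},
    {"a", "b", "c", "e", "f"},
    {"b", "e"},
    {"a", "b", "c", "e"},
]
MS = 2
ITEMS = sorted(set().union(*TRANSACTIONS))


def sup(itemset):
    return sum(1 for t in TRANSACTIONS if set(itemset) <= t)


frequent = {frozenset(c): sup(c)
            for r in range(len(ITEMS) + 1)
            for c in combinations(ITEMS, r) if sup(c) >= MS}


def build_pr(freq):
    """PR as a list of (L_i, R_i, i), one entry per support level i."""
    levels = defaultdict(set)
    for itemset, s in freq.items():
        levels[s].add(itemset)
    pr = []
    for level, members in levels.items():
        left = {x for x in members if not any(y < x for y in members)}
        right = {x for x in members if not any(y > x for y in members)}
        pr.append((left, right, level))
    return pr


def lookup(pr, itemset):
    """Task 3: sup(I,T) if I is frequent, else None, using PR only."""
    for left, right, level in pr:
        if any(x <= itemset for x in left) and any(itemset <= y for y in right):
            return level
    return None


PR = build_pr(frequent)
print(lookup(PR, frozenset("abe")), lookup(PR, frozenset("ad")))
```

Because the plateaus are disjoint and each is convex, at most one entry of PR can match a given itemset.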

18 Remarks
PR is a good concise rep for freq itemsets
PR is more flexible compared to other reps
PR unifies diff notions used in data mining
Nice... But can we mine PR fast?

19 Mining PR Fast
To mine PR fast, mine its borders fast
To mine its borders fast, mine equiv classes in the plateau fast
To mine equiv classes fast, mine generators & closed patterns of equivalence classes fast

20 From SE-Tree To Trie To FP-Tree
[Figure: SE-tree of possible itemsets over {a, b, c, d}; trie of the transactions T_1 = {a,c,d}, T_2 = {b,c,d}, T_3 = {a,b,c,d}, T_4 = {a,d}; FP-tree with head table]
<_1: right-to-left, top-to-bottom traversal of SE-tree
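The trie of transactions on this slide can be approximated by a plain prefix tree with counts. The sketch below assumes alphabetical item order and does not reproduce the head table or the item ordering of the slide's FP-tree; `Node`, `build_trie`, and `dump` are illustrative names.

```python
class Node:
    """Prefix-tree node: count of transactions passing through it."""

    def __init__(self):
        self.count = 0
        self.children = {}


def build_trie(transactions):
    """Insert each transaction as a path of its items in sorted order."""
    root = Node()
    for t in transactions:
        node = root
        node.count += 1
        for item in sorted(t):
            node = node.children.setdefault(item, Node())
            node.count += 1
    return root


def dump(node, prefix=()):
    """List (path, count) pairs in depth-first order."""
    lines = []
    for item in sorted(node.children):
        child = node.children[item]
        path = prefix + (item,)
        lines.append(("".join(path), child.count))
        lines.extend(dump(child, path))
    return lines


# Transactions shown on this slide.
T = [{"a", "c", "d"}, {"b", "c", "d"}, {"a", "b", "c", "d"}, {"a", "d"}]
root = build_trie(T)
for path, count in dump(root):
    print(path, count)
```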

21 GC-growth: Fast Simultaneous Mining of Generators & Closed Patterns

22 Step 1: FP-tree construction

23 Step 2: Right-to-left, top-to-bottom traversal

24 Step 5: Confirm X_i is generator
Proposition 9: Generators enjoy the apriori property. That is, every subset of a generator is also a generator
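Proposition 9 can be checked on the running example from slide 5. The sketch below is a brute-force check, not the GC-growth step itself. It relies on the fact that an itemset is a generator exactly when removing any single item raises the support, and it also shows the pruning consequence: supersets of a non-generator are never generators.

```python
from itertools import combinations

TRANSACTIONS = [
    {"a", "c", "d"},
    {"b", "c", "e"},
    {"a", "b", "c", "e", "f"},
    {"b", "e"},
    {"a", "b", "c", "e"},
]
MS = 2
ITEMS = sorted(set().union(*TRANSACTIONS))


def sup(itemset):
    return sum(1 for t in TRANSACTIONS if set(itemset) <= t)


def subsets(itemset):
    items = sorted(itemset)
    return [frozenset(c) for r in range(len(items) + 1)
            for c in combinations(items, r)]


def is_generator(itemset):
    """Generator: removing any single item strictly raises the support."""
    itemset = frozenset(itemset)
    return all(sup(itemset - {x}) > sup(itemset) for x in itemset)


frequent = [i for i in subsets(ITEMS) if sup(i) >= MS]
generators = [i for i in frequent if is_generator(i)]

# Proposition 9: every subset of a generator is itself a generator.
print(all(is_generator(s) for g in generators for s in subsets(g)))

# Pruning: ac is not a generator, so no superset of ac can be one.
print([is_generator(x) for x in ("ac", "abc", "ace", "abce")])
```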

25 Step 7: Find closed pattern of X_i
Proposition 10: Let X be a generator. Then the closed pattern of X is ∩{X’’ | X’ ∈ H[last(X)], X ⊆ X’, X’ prefix of X’’, T[X’’] = true}.
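Independently of the FP-tree bookkeeping in Proposition 10, the closed pattern of an itemset is the intersection of all transactions that contain it. The sketch below computes that directly from the transaction list of slide 5; it does not use the head table H, and `closure` is a name assumed here.

```python
TRANSACTIONS = [
    {"a", "c", "d"},
    {"b", "c", "e"},
    {"a", "b", "c", "e", "f"},
    {"b", "e"},
    {"a", "b", "c", "e"},
]


def closure(itemset):
    """Intersection of all transactions containing the itemset.

    Returns None when no transaction contains it.
    """
    covering = [t for t in TRANSACTIONS if set(itemset) <= t]
    if not covering:
        return None
    return frozenset(set.intersection(*covering))


for gen in ("a", "b", "ab", "bc"):
    print(gen, "->", "".join(sorted(closure(gen))))
```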

26 Correctness of GC-growth
Theorem 11: GC-growth is sound and complete for mining generators and closed patterns

27 Performance of GC-growth
GC-growth mines both generators and closed patterns, but is comparable in speed to the fastest algorithms that mine only closed patterns

28 Emerging Patterns

29 Differentiation and Contrast
[Figure: EPs occur in x% of edible mushrooms vs 0% of poisonous mushrooms]
Example: {odor=none, gill_size=broad, ring_number=1}: 64% (edible) vs 0% (poisonous)

30 Emerging Patterns
An emerging pattern is a set of conditions
– usually involving several features
– that most members of a class P satisfy
– but none or few of the other class N satisfy
⇒ I is an emerging pattern if sup(I,P) / sup(I,N) > k, for some fixed threshold k
NB: For this talk, we restrict ourselves to “jumping” emerging patterns
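The ratio test is simple to sketch once a zero denominator is handled. The two tiny classes below are hypothetical data, not from the talk, and the sketch assumes sup counts transactions as on slide 5. A jumping emerging pattern is taken to be one that occurs in P but never in N.

```python
import math

# Hypothetical toy classes; not taken from the slides.
POS = [{"x", "y"}, {"x", "y", "z"}, {"x", "z"}, {"y"}]
NEG = [{"z"}, {"y", "z"}, {"y"}]


def sup(itemset, transactions):
    return sum(1 for t in transactions if set(itemset) <= t)


def growth(itemset):
    """sup(I,P) / sup(I,N), with a zero denominator mapped to infinity."""
    p, n = sup(itemset, POS), sup(itemset, NEG)
    if n == 0:
        return math.inf if p > 0 else 0.0
    return p / n


def is_emerging(itemset, k):
    return growth(itemset) > k


def is_jumping(itemset):
    """Jumping EP: present in the positive class, absent in the negative."""
    return sup(itemset, POS) > 0 and sup(itemset, NEG) == 0


for pattern in ({"x"}, {"y"}, {"z"}, {"x", "y"}):
    print(sorted(pattern), growth(pattern), is_jumping(pattern))
```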

31 Convexity of Emerging Patterns
Theorem 12: Let E be an EP space and P_i = {I ∈ E | sup(I) = i}. Then E = ∪_i P_i, E is convex, and each P_i is convex. That is, E can be decomposed into convex plateaus

32 EP Plateau Rep
A concise rep for E = ∪_i P_i is the EP plateau rep: EP_PR = {(⟨L_i, R_i⟩, i) | [L_i, R_i] = P_i}
Proposition 13: {(I, sup(I)) | I ∈ E} = {(I, i) | (⟨L_i, R_i⟩, i) ∈ EP_PR, X ∈ L_i, Y ∈ R_i, X ⊆ I ⊆ Y}
⇒ All 4 tasks are obviously efficient

33 Efficient Mining of EP_PR
Modify GC-growth so that for each equiv class C, it outputs its support in +ve transactions S_pos[C] & in -ve transactions S_neg[C]
Then [R[C], C] are emerging patterns if S_pos[C] / S_neg[C] > k
NB. Assume the threshold for EP is k

34 Odds Ratio Patterns

35 [Figure: EPs occur in x% of edible mushrooms vs 0% of poisonous mushrooms]
Example: {odor=none, gill_size=broad, ring_number=1}: 64% (edible) vs 0% (poisonous)
What if this is 4%? 0.4%? 0.04%?
Is an emerging pattern that is absent in most of the positive transactions a “real” pattern?

36 Odds Ratio
Odds ratio for a (compound) factor P in a case-control study D is OR(P,D) = (P_{D,ed} / P_{D,-d}) / (P_{D,e-} / P_{D,--})
⇒ P is an odds ratio pattern if OR(P,D) > k, for some threshold k
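A small worked example makes the four counts concrete. The sketch assumes the subscripts read as follows: e = record contains P (exposed), d = record is a case (diseased), and - = the respective condition is absent. That reading and the eight records below are assumptions made for illustration.

```python
import math

# Hypothetical case-control records: (items, is_case).
RECORDS = [
    ({"s", "t"}, True),
    ({"s"}, True),
    ({"s", "t"}, True),
    ({"t"}, True),
    ({"s"}, False),
    ({"t"}, False),
    ({"t"}, False),
    (set(), False),
]


def counts(pattern, records):
    """Return (ed, nd, en, nn): exposed/non-exposed among cases, then controls."""
    ed = sum(1 for items, d in records if pattern <= items and d)
    nd = sum(1 for items, d in records if not pattern <= items and d)
    en = sum(1 for items, d in records if pattern <= items and not d)
    nn = sum(1 for items, d in records if not pattern <= items and not d)
    return ed, nd, en, nn


def odds_ratio(pattern, records=RECORDS):
    """OR = (ed / nd) / (en / nn), computed as (ed * nn) / (nd * en)."""
    ed, nd, en, nn = counts(pattern, records)
    num, den = ed * nn, nd * en
    if den == 0:
        return math.inf if num > 0 else math.nan
    return num / den


for pattern in ({"s"}, {"t"}, {"s", "t"}):
    print(sorted(pattern), odds_ratio(pattern))
```

For {s} the counts are 3, 1, 1, 3, so OR = (3/1) / (1/3) = 9.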

37 Nonconvexity of Odds Ratio Pattern Space
Proposition 14: Let S_k^OR(ms,D) = {P ∈ F(ms,D) | OR(P,D) ≥ k}. Then S_k^OR(ms,D) is not convex

38 Convexity of Odds Ratio Pattern Space Plateaus
Theorem 15: Let S_{n,k}^OR(ms,D) = {P ∈ F(ms,D) | P_{D,ed} = n, OR(P,D) ≥ k}. Then S_{n,k}^OR(ms,D) is convex
⇒ The space of odds ratio patterns is not convex in general, but becomes convex when stratified into plateaus based on support levels
⇒ The space of odds ratio patterns can be concisely represented by plateau borders

39 Efficient Mining of Odds Ratio Pattern Space Plateaus
How to find these fast is key!
GC-growth can find these fast :-)

40 Performance
FPClose* and CLOSET+
– closed patterns only
Our method computes
– closed patterns
– generators, and
– odds ratio patterns (OR > 2.5)
⇒ Patterns that are much more statistically sophisticated than frequent patterns can now be mined efficiently

41 Relative Risk Patterns

42 Relative Risk
Relative risk for a (compound) factor P in a prospective study D is RR(P,D) [formula not preserved in the transcript]
⇒ P is a relative risk pattern if RR(P,D) > k, for some threshold k

43 Nonconvexity of Relative Risk Pattern Space
Proposition 16: Let S_k^RR(ms,D) = {P ∈ F(ms,D) | RR(P,D) ≥ k}. Then S_k^RR(ms,D) is not convex

44 Convexity of Relative Risk Pattern Space Plateaus
Theorem 17: Let S_{n,k}^RR(ms,D) = {P ∈ F(ms,D) | P_{D,ed} = n, RR(P,D) ≥ k}. Then S_{n,k}^RR(ms,D) is convex
⇒ The space of relative risk patterns is not convex in general, but becomes convex when stratified into plateaus based on support levels
⇒ The space of relative risk patterns can be concisely represented by plateau borders

45 Efficient Mining of Relative Risk Pattern Space Plateaus
[Pseudocode figure; surviving fragment: x := RR(R,D);]
How to find these fast is key!
GC-growth can find these fast :-)

46 Concluding Remarks
Equiv classes & plateaus are fundamental in
– Frequent itemsets
– Emerging patterns
– Odds ratio patterns
– Relative risk patterns, ...
Equiv classes & plateaus of these complex patterns are convex spaces
⇒ Complex pattern spaces are concisely representable by borders
⇒ Complex pattern spaces can be efficiently and completely mined

47 Future Works

48 Improve Implementations
Impact of item ordering
Impact of pushing complex statistical filters deeper into equivalence class generators
Modular pattern mining by construction of a fast equiv class generator and multiple statistical condition filters
[Diagram: Generate borders of equiv classes & support levels → Test for odds ratio / Test for relative risk / Test for χ²]

49 Apply to Classification
Develop classifiers based on the mined patterns
– Simple ensemble
– PCL
Impact on accuracy of using generators vs closed patterns
f(X) = argmax_{c ∈ C} Σ_{r ∈ R_c, r > 50% accuracy} r(X)

50 Enrich Data Mining Foundations
Increase statistical sophistication of patterns mined
Increase dimensions and size of data handled

51 Acknowledgements
Haiquan Li, Jinyan Li, Mengling Feng, Yap Peng Tan

