The Multi-hop Closure Theorem for the RoloDex Model using pTrees
Arijit Chatterjee, Arjun G. Roy, Mohammad Hossain, William Perrizo
Computer Science Department, North Dakota State University
Data Mining Reference : http://www.egeen.ee/u/vilo/edu/2004-05/Andmekaevandus/index.cgi?f=Intro
Vertical Structuring into predicate Trees (pTrees)

Given a table of horizontal records, R(A1 A2 A3 A4). Horizontally structured records are traditionally Vertically Processed (scanned vertically, record by record), so VPHD: Vertical Processing of Horizontal Data.

Task: determine the number of occurrences of 7 0 1 4.

R(A1 A2 A3 A4) in base 2, three bits per attribute:
  A1  A2  A3  A4
  010 111 110 001
  011 111 110 000
  010 110 101 001
  010 111 101 111
  011 010 001 100
  010 010 001 101
  111 000 001 100
  111 000 001 100

With pTrees we instead use Horizontal Processing of Vertical Data (HPVD):
1. Project attributes (4 files): R[A1] R[A2] R[A3] R[A4].
2. Vertically slice off each bit position (12 files): R11 R12 R13 R21 R22 R23 R31 R32 R33 R41 R42 R43.
3. Compress each bit slice into a pTree (P11 P12 P13 P21 P22 P23 P31 P32 P33 P41 P42 P43), e.g., compress R11 into P11.

Compressing R11 = 0 0 0 0 0 0 1 1 into P11: record the truth of the predicate pure1 (all 1-bits) in a tree, recursively on halves, until the half is pure.
1. Whole is pure1? false: 0
2. Left half pure1? false: 0. But it is pure (pure0), so this branch ends.
3. Right half pure1? false: 0
4. Left half of right half pure1? false: 0
5. Right half of right half pure1? true: 1

To count occurrences of 7 0 1 4, use the bit pattern 111 000 001 100: AND the twelve pTrees, complementing (') each one whose pattern bit is 0, and count the 1-bits of the result.
P11^P12^P13^P'21^P'22^P'23^P'31^P'32^P33^P41^P'42^P'43
The count is 2.
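The slicing, compression and counting steps on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names (`bit_slices`, `ptree`, `count_pattern`) are hypothetical, the table is taken to have eight rows with the record 7 0 1 4 appearing twice (as implied by the count of 2 and the P11 steps), and the AND is done on uncompressed bit slices rather than on the compressed trees.

```python
# Rows of R(A1, A2, A3, A4) in base 2, three bits per attribute.
# Assumption: eight rows, with 7 0 1 4 (111 000 001 100) occurring twice.
ROWS = [
    "010 111 110 001",
    "011 111 110 000",
    "010 110 101 001",
    "010 111 101 111",
    "011 010 001 100",
    "010 010 001 101",
    "111 000 001 100",
    "111 000 001 100",
]


def bit_slices(rows):
    """Return the vertical bit slices R11..R43 as lists of 0/1."""
    flat = [row.replace(" ", "") for row in rows]
    return [[int(r[k]) for r in flat] for k in range(len(flat[0]))]


def ptree(bits):
    """Record pure1 recursively on halves; a pure half ends the branch.

    Each node is (pure1 truth value, children or None).
    """
    if all(bits):
        return (1, None)              # pure1
    if not any(bits):
        return (0, None)              # pure0: branch ends
    mid = len(bits) // 2
    return (0, (ptree(bits[:mid]), ptree(bits[mid:])))


def count_pattern(slices, pattern):
    """AND the slices, complementing those whose pattern bit is 0."""
    result = [1] * len(slices[0])
    for bits, p in zip(slices, pattern):
        result = [r & (b if p == "1" else 1 - b) for r, b in zip(result, bits)]
    return sum(result)


slices = bit_slices(ROWS)
p11 = ptree(slices[0])                                  # tree for R11
count_7014 = count_pattern(slices, "111000001100")      # pattern for 7 0 1 4
print(p11, count_7014)
```

The nested tuple printed for `p11` mirrors the five steps listed for P11: the root and both halves are not pure1, the left half is pure0 and stops, and only the last quarter is pure1.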
Multi-relationships and the RoloDex Model

RoloDex Model: 2 entity axes, many relationship cards.

[Figure: a RoloDex of relationship cards over the entities Customer, Item, People, Author, Movie, Doc, Term, Gene, Exp, PI and Course. Cards shown: customer rates movie card, customer rates movie as 5 card, cust item card, termdoc card, authordoc card, docdoc card, termterm card (share stem?), genegene card (ppi), expPI card, expgene card, Course Enrollments.]

DataCube Model for 3 entities: items, people and terms.

Relational Model: tables Items (i1 ... i5), People (p1 ... p4), Terms (t1 ... t6) and Relationship (p1, i1, t1).

The bottom-line point of this slide is that all entities are interrelated through multiple relationships (multiple "hops"). How can we data mine those multi-relationships?
Multi-hop rule mining using the RoloDex model

A hop is a relationship, R, hopping from one entity, E, to another, F. Given a high-value relationship data set, we pay the one-time cost of creating pTrees both ways: R(E,F), as a matrix, has both E-pTrees (horizontal bit slices, $R_e$) and F-pTrees (vertical bit slices, $R_f$).

Standard ARM finds strong (frequent, confident) single-entity rules $A \Rightarrow C$ (i.e., both $A$ and $C \subseteq E$; counts are on the other entity, F):
  $\mathrm{ct}(\&_{e \in A} R_e) \ge \mathrm{mnsp}$
  $\mathrm{ct}(\&_{e \in A} R_e \;\&\; \&_{e \in C} R_e) \,/\, \mathrm{ct}(\&_{e \in A} R_e) \ge \mathrm{mncf}$

What about multi-entity rules (i.e., $A \subseteq E$, $C \subseteq F$)?

A 1-hop ($A \subseteq E$ and $C \subseteq F$) F-focused rule is strong if ("focused on" refers to where the counts are taken):
  $\mathrm{ct}(\&_{e \in A} R_e) \ge \mathrm{mnsp}$
  $\mathrm{ct}(\&_{e \in A} R_e \;\&\; P_C) \,/\, \mathrm{ct}(\&_{e \in A} R_e) \ge \mathrm{mncf}$
where $P_C$ is the bitmap of C over F.

1-hop, F-focused strong rules can be mined efficiently because of:
1. (antecedent downward closure) If A is frequent, then all of its subsets are frequent. Or, if A is infrequent, then all of its supersets are infrequent. Since frequency involves only A, we can mine for all qualifying antecedents efficiently using downward closure.
2. (consequent upward closure) If $A \Rightarrow C$ is non-confident, then so is $A \Rightarrow D$ for all subsets, D, of C. So for each frequent antecedent, A, use upward closure to mine for all of its confident consequents.

The theorem suggested here is: for (a+c)-hop strong rule mining with a focus entity which is a hops from the antecedent and c hops from the consequent, if a {c} is odd [even], use downward [upward] closure on the frequency {confidence} step in the mining. In this case A is 1 hop from F (1 is odd, use downward closure) and C is 0 hops from F (0 is even, use upward closure).

A 1-hop ($A \subseteq E$ and $C \subseteq F$) E-focused rule is strong if:
  $\mathrm{ct}(P_A) \ge \mathrm{mnsp}$
  $\mathrm{ct}(P_A \;\&\; \&_{f \in C} R_f) \,/\, \mathrm{ct}(P_A) \ge \mathrm{mncf}$
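The 1-hop F-focused strength test can be sketched with Python integers standing in for pTrees (bit k set means position k is in the list). Everything concrete here is an assumption made for illustration: the relationship `R`, the thresholds, and the helper names `bitmap`, `ct`, `support`, `confidence` and `is_strong` are hypothetical, and no compression is modelled.

```python
def bitmap(positions):
    """Build an uncompressed bitmap with a one-bit at each given position."""
    m = 0
    for p in positions:
        m |= 1 << p
    return m


def ct(b):
    """Count of one-bits."""
    return bin(b).count("1")


NF = 5  # hypothetical size of entity F (positions 0..4)

# Hypothetical E-pTrees: R[e] is the bitmap over F of row e of R(E, F).
R = {
    1: bitmap([0, 1, 2, 3]),
    2: bitmap([0, 1, 2]),
    3: bitmap([1, 2, 4]),
}


def antecedent(A):
    """AND of R[e] over e in A (all ones for the empty set)."""
    acc = (1 << NF) - 1
    for e in A:
        acc &= R[e]
    return acc


def support(A):
    return ct(antecedent(A))


def confidence(A, C):
    """ct(antecedent & P_C) / ct(antecedent), with P_C the bitmap of C on F."""
    return ct(antecedent(A) & bitmap(C)) / support(A)


def is_strong(A, C, mnsp, mncf):
    s = support(A)
    return s >= mnsp and s > 0 and confidence(A, C) >= mncf


print(support({1, 2}), support({1, 2, 3}))        # support cannot grow with A
print(confidence({1, 2}, {1, 2}), confidence({1, 2}, {1}))
print(is_strong({1, 2}, {1, 2}, mnsp=2, mncf=0.6))
```

On this toy data the support shrinks when the antecedent grows (downward closure) and the confidence shrinks when the consequent shrinks (upward closure), matching the two properties listed on the slide.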
2-hop F-focused (the focus is on the middle entity, F), with relationships R(E,F) and S(F,G), $A \subseteq E$, $C \subseteq G$. $A \Rightarrow C$ is strong if:
  $\mathrm{ct}(\&_{e \in A} R_e) \ge \mathrm{mnsp}$
  $\mathrm{ct}(\&_{e \in A} R_e \;\&\; \&_{g \in C} S_g) \,/\, \mathrm{ct}(\&_{e \in A} R_e) \ge \mathrm{mncf}$
(a, c) = (1, 1): down, down
1. (antecedent downward closure) If A is infrequent, then so are all of its supersets.
2. (consequent downward closure) If $A \Rightarrow C$ is non-confident, so is $A \Rightarrow D$ for all supersets, D, of C.

2-hop G-focused:
  $\mathrm{ct}(\&_{f \in \mathrm{list}(\&_{e \in A} R_e)} S_f) \ge \mathrm{mnsp}$
  $\mathrm{ct}(\&_{f \in \mathrm{list}(\&_{e \in A} R_e)} S_f \;\&\; P_C) \,/\, \mathrm{ct}(\&_{f \in \mathrm{list}(\&_{e \in A} R_e)} S_f) \ge \mathrm{mncf}$
(a, c) = (2, 0): up, up
1. (antecedent upward closure) If A is infrequent, then so are all of its subsets.
2. (consequent upward closure) If $A \Rightarrow C$ is non-confident, so is $A \Rightarrow D$ for all subsets, D, of C.

2-hop E-focused:
  $\mathrm{ct}(P_A) \ge \mathrm{mnsp}$
  $\mathrm{ct}(P_A \;\&\; \&_{f \in \mathrm{list}(\&_{g \in C} S_g)} R_f) \,/\, \mathrm{ct}(P_A) \ge \mathrm{mncf}$
(a, c) = (0, 2): up, up
1. (antecedent upward closure) If A is infrequent, then so are all of its subsets.
2. (consequent upward closure) If $A \Rightarrow C$ is non-confident, so is $A \Rightarrow D$ for all subsets, D, of C.

It was the 2-hop F-focus that generated the interest in multi-hop rule mining, in particular with:
R = a "friends" relationship (e.g., from Facebook)
S = a "buys" relationship between people and items.
Is it a strong rule that friends of those who bought a set of items also buy those items?
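A small sketch of the 2-hop F-focused case with a friends relationship and a buys relationship may make the "down, down" behaviour concrete. The data, the sizes and the names `FRIENDS`, `BUYERS`, `and_over`, `support` and `confidence` are all hypothetical, and bitmaps are plain Python integers rather than compressed pTrees.

```python
def bitmap(positions):
    m = 0
    for p in positions:
        m |= 1 << p
    return m


def ct(b):
    return bin(b).count("1")


NF = 5  # hypothetical: the focus entity F is people 0..4

# R_e: for each person e in E, the bitmap over F of e's friends (made-up data).
FRIENDS = {0: bitmap([1, 2, 3]), 1: bitmap([0, 2, 3, 4])}

# S_g: for each item g in G, the bitmap over F of the people who bought g.
BUYERS = {"a": bitmap([2, 3, 4]), "b": bitmap([3, 4])}


def and_over(keys, rel):
    """AND rel[k] over all keys; the empty AND is all ones over F."""
    acc = (1 << NF) - 1
    for k in keys:
        acc &= rel[k]
    return acc


def support(A):
    """ct of the AND of R_e over e in A: people who are friends of all of A."""
    return ct(and_over(A, FRIENDS))


def confidence(A, C):
    """Fraction of those people who also bought every item in C."""
    ante = and_over(A, FRIENDS)
    cons = and_over(C, BUYERS)
    return ct(ante & cons) / ct(ante)


A = {0, 1}
print(support(A))                    # common friends of persons 0 and 1
print(confidence(A, {"a"}))          # they all bought item a
print(confidence(A, {"a", "b"}))     # fewer bought both a and b
```

Here both sides are one hop from F, so enlarging either A or C can only remove one-bits: support and confidence can only fall, which is the downward closure used in both steps.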
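5-hop, focus on G, with relationships R(E,F), S(F,G), T(G,H), U(H,I), V(I,J), $A \subseteq E$, $C \subseteq J$. $A \Rightarrow C$ is strong if:
  $\mathrm{ct}(\&_{f \in \mathrm{list}(\&_{e \in A} R_e)} S_f) \ge \mathrm{mnsp}$
  $\mathrm{ct}(\&_{f \in \mathrm{list}(\&_{e \in A} R_e)} S_f \;\&\; \&_{h \in \mathrm{list}(\&_{i \in \mathrm{list}(\&_{j \in C} V_j)} U_i)} T_h) \,/\, \mathrm{ct}(\&_{f \in \mathrm{list}(\&_{e \in A} R_e)} S_f) \ge \mathrm{mncf}$

5-hop focus on G:
1. (antecedent has upward closure)
2. (consequent has downward closure)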
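The nested ANDs in the 5-hop expression follow one pattern: AND the pTrees selected by the current list, convert the result back to a list, and repeat with the next relationship. A minimal sketch of that pattern follows; the helper `nested_and`, the entity sizes and all relationship data are hypothetical, chosen only so the result can be checked by hand.

```python
def bitmap(positions):
    m = 0
    for p in positions:
        m |= 1 << p
    return m


def positions(b):
    """The list corresponding to a bitmap."""
    return [k for k in range(b.bit_length()) if (b >> k) & 1]


def ct(b):
    return bin(b).count("1")


def and_over(keys, rel, width):
    """AND rel[k] over keys; the empty AND is all ones over `width` bits."""
    acc = (1 << width) - 1
    for k in keys:
        acc &= rel[k]
    return acc


def nested_and(start, chain):
    """Apply each (relationship, target width) in turn, hopping to the focus."""
    keys, acc = list(start), None
    for rel, width in chain:
        acc = and_over(keys, rel, width)
        keys = positions(acc)
    return acc


# Hypothetical data: each dict maps a key to a bitmap over the next entity.
R = {0: bitmap([0, 1]), 1: bitmap([0, 1, 2]), 2: bitmap([1])}        # E -> F
S = {0: bitmap([0, 1, 2]), 1: bitmap([1, 2, 3]), 2: bitmap([0, 3])}  # F -> G
V = {0: bitmap([0, 1]), 1: bitmap([1, 2])}                           # J -> I
U = {0: bitmap([0, 1]), 1: bitmap([1, 2]), 2: bitmap([0, 2])}        # I -> H
T = {0: bitmap([0, 1]), 1: bitmap([1, 2, 3]), 2: bitmap([2, 3])}     # H -> G

ante = nested_and({0, 1}, [(R, 3), (S, 4)])             # 2 hops from A to G
cons_small = nested_and({0}, [(V, 3), (U, 3), (T, 4)])  # 3 hops from C to G
cons_big = nested_and({0, 1}, [(V, 3), (U, 3), (T, 4)])

conf_small = ct(ante & cons_small) / ct(ante)
conf_big = ct(ante & cons_big) / ct(ante)
print(positions(ante), positions(cons_small), positions(cons_big))
print(conf_small, conf_big)
```

In this toy instance the larger consequent set yields a bitmap covered by the smaller one's (three hops, downward), while enlarging the antecedent set can only add one-bits on the two-hop side (upward).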
Multi-hop closure property theorem: "For transitive (a+c)-hop strong rule mining with a focus entity which is a hops from the antecedent and c hops from the consequent, if a [or c] is odd, one can use downward closure on that step; if it is even, one can use upward closure."
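The parity rule in the theorem is easy to tabulate. The sketch below uses a hypothetical helper, `closures(a, c)`, and simply applies the rule to the (a, c) pairs that appear on the earlier slides.

```python
def closure_direction(hops):
    """Odd hop count: downward closure; even hop count: upward closure."""
    return "downward" if hops % 2 == 1 else "upward"


def closures(a, c):
    """Closure to use on the frequency step (a) and the confidence step (c)."""
    return {
        "frequency": closure_direction(a),
        "confidence": closure_direction(c),
    }


# (a, c) pairs from the earlier slides.
cases = {
    "1-hop F-focused": (1, 0),
    "2-hop F-focused": (1, 1),
    "2-hop G-focused": (2, 0),
    "2-hop E-focused": (0, 2),
    "5-hop G-focused": (2, 3),
}
for name, (a, c) in cases.items():
    print(name, closures(a, c))
```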
The Multi-hop Closure Theorem

A condition is downward [upward] closed if, when it is true of A, it is true for all subsets [supersets], D, of A.

Given an (a+c)-hop multi-relationship, where the focus entity is a hops from the antecedent and c hops from the consequent, if a [or c] is odd/even then downward/upward closure applies.

A pTree, X, is said to be "covered by" a pTree, Y, if wherever there is a one-bit in X, there is a one-bit at that same position in Y (the list corresponding to the bitmap, Y, is a superset of the list corresponding to the bitmap, X).

Lemma-0: For any two pTrees, X, Y: X&Y is covered by X, and thus $\mathrm{ct}(X \& Y) \le \mathrm{ct}(X)$ and $\mathrm{list}(X \& Y) \subseteq \mathrm{list}(X)$.
Proof-0: ANDing with Y may zero some of X's ones but it will never change any zeros to ones.

Lemma-1: Let $A \subseteq D$. Then $\&_{a \in A} X_a$ covers $\&_{a \in D} X_a$ (ANDing over a subset always covers ANDing over a superset).

Lemma-2: Let $A \subseteq D$. Then $\&_{c \in \mathrm{list}(\&_{a \in D} X_a)} Y_c$ covers $\&_{c \in \mathrm{list}(\&_{a \in A} X_a)} Y_c$.

Proof-1&2: Let $Z = \&_{a \in D-A} X_a$; then $\&_{a \in D} X_a = Z \;\&\; (\&_{a \in A} X_a)$, and Lemma-1 follows from Lemma-0. Lemma-0 also gives $A' \subseteq D'$, where $D' = \mathrm{list}(\&_{a \in A} X_a)$ and $A' = \mathrm{list}(\&_{a \in D} X_a)$, so applying Lemma-1 to $A' \subseteq D'$ and the pTrees $Y_c$ gives Lemma-2.

Lemma-3: Let $A \subseteq D$. Then $\&_{e \in \mathrm{list}(\&_{c \in \mathrm{list}(\&_{a \in A} X_a)} Y_c)} W_e$ covers $\&_{e \in \mathrm{list}(\&_{c \in \mathrm{list}(\&_{a \in D} X_a)} Y_c)} W_e$.
Proof-3: Lemma-3 follows in the same way from Lemma-1 and Lemma-2.

Continuing this establishes: if there is an odd number of nested &'s, then the expression with D is covered by the expression with A. Therefore the count with D $\le$ the count with A. Thus, if the frequency expression and the confidence expression exceed their thresholds for D, then the same is true for every subset A of D. This establishes downward closure. Exactly analogously, if there is an even number of nested &'s, we get the upward closures.
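The covering claims in Lemma-1, Lemma-2 and Lemma-3 can be sanity-checked by brute force on small random bitmaps. The sketch below is not part of the proof; the relationships `X`, `Y`, `W`, the fixed width and the helper names are hypothetical, and the empty AND is taken to be all ones.

```python
import random
from itertools import combinations

WIDTH = 6                     # every entity has positions 0..5 (assumption)
FULL = (1 << WIDTH) - 1


def and_over(keys, rel):
    """AND rel[k] over keys; the empty AND is the all-ones bitmap."""
    acc = FULL
    for k in keys:
        acc &= rel[k]
    return acc


def positions(b):
    return [k for k in range(WIDTH) if (b >> k) & 1]


def covers(y, x):
    """True if every one-bit of x is also a one-bit of y."""
    return (x & ~y) == 0


def subsets(keys):
    keys = list(keys)
    for r in range(len(keys) + 1):
        for combo in combinations(keys, r):
            yield frozenset(combo)


rng = random.Random(0)        # fixed seed so the run is repeatable
X = {a: rng.randrange(FULL + 1) for a in range(4)}
Y = {c: rng.randrange(FULL + 1) for c in range(WIDTH)}
W = {e: rng.randrange(FULL + 1) for e in range(WIDTH)}


def depth1(s):                # one nested AND
    return and_over(s, X)


def depth2(s):                # two nested ANDs
    return and_over(positions(depth1(s)), Y)


def depth3(s):                # three nested ANDs
    return and_over(positions(depth2(s)), W)


def lemmas_hold():
    for d in subsets(range(4)):
        for a in subsets(d):                      # a is a subset of d
            if not covers(depth1(a), depth1(d)):  # Lemma-1 (odd)
                return False
            if not covers(depth2(d), depth2(a)):  # Lemma-2 (even)
                return False
            if not covers(depth3(a), depth3(d)):  # Lemma-3 (odd)
                return False
    return True


print(lemmas_hold())
```

The direction of covering flips with each extra level of nesting, which is the alternation between downward and upward closure that the theorem states.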