The Multi-hop closure theorem for the Rolodex Model using pTrees

Similar presentations
Huffman Codes and Association Rules (II) Prof. Sin-Min Lee Department of Computer Science.

Copyright © Cengage Learning. All rights reserved. CHAPTER 5 SEQUENCES, MATHEMATICAL INDUCTION, AND RECURSION.
Data Mining Association Analysis: Basic Concepts and Algorithms
Association Analysis. Association Rule Mining: Definition Given a set of records each of which contain some number of items from a given collection; –Produce.
Data Mining Association Analysis: Basic Concepts and Algorithms Introduction to Data Mining by Tan, Steinbach, Kumar © Tan,Steinbach, Kumar Introduction.
732A02 Data Mining - Clustering and Association Analysis ………………… Jose M. Peña Association rules Apriori algorithm FP grow algorithm.
Association Rule Mining. Generating assoc. rules from frequent itemsets  Assume that we have discovered the frequent itemsets and their support  How.
RoloDex Model The Data Cube Model gives a great picture of relationships, but can become gigantic (instances are bitmapped rather than listed, so there.
Data Mining on Streams  We should use runlists for stream data mining (unless there is some spatial structure to the data, of course, then we need to.
Toward a Unified Theory of Data Mining DUALITIES: PARTITION FUNCTION EQUIVALENCE RELATION UNDIRECTED GRAPH Assume a Partition has uniquely labeled components.
MULTI-LAYERED SOFTWARE SYSTEM FRAMEWORK FOR DISTRIBUTED DATA MINING
Data Mining 1 Data Mining is one aspect of Database Query Processing (on the "what if" or pattern and trend end of Query Processing, rather than the "please.
Association Rule Mining on Remotely Sensed Imagery Using Peano-trees (P-trees) Qin Ding, Qiang Ding, and William Perrizo Computer Science Department North.
TEMPLATE DESIGN © Predicate-Tree based Pretty Good Protection of Data William Perrizo, Arjun G. Roy Department of Computer.
So as not to duplicate axes, this copy of G should be folded over to coincide with the other copy, producing a "conical" unipartite.
A hop is a relationship, R, hopping from entity, E, to entity, F. Strong Rule Mining finds all frequent, confident rules R(E,F)
Our Approach: Vertical, compressed data structures, variously called either Predicate-trees or Peano-trees (horizontal data is processed vertically).
Knowledge Discovery in Protected Vertical Information Dr. William Perrizo University Distinguished Professor of Computer Science North Dakota State University,
MYRRH ManY-Relationship-Rule Harvester.
Huffman code and Lossless Decomposition Prof. Sin-Min Lee Department of Computer Science.
Data Mining Association Rules Mining Frequent Itemset Mining Support and Confidence Apriori Approach.
Efficient Quantitative Frequent Pattern Mining Using Predicate Trees Baoying Wang, Fei Pan, Yue Cui William Perrizo North Dakota State University.
Vertical Set Square Distance Based Clustering without Prior Knowledge of K Amal Perera,Taufik Abidin, Masum Serazi, Dept. of CS, North Dakota State University.
Proof And Strategies Chapter 2. Lecturer: Amani Mahajoub Omer Department of Computer Science and Software Engineering Discrete Structures Definition Discrete.
P Left half of rt half ? false  Left half pure1? false  Whole is pure1? false  0 5. Rt half of right half? true  1.
Adding Probabilities 12-5
Item-Based P-Tree Collaborative Filtering applied to the Netflix Data
By Arijit Chatterjee Dr
pTrees predicate Tree technologies
Reducing Number of Candidates
Decision Tree Induction for High-Dimensional Data Using P-Trees
Efficient Ranking of Keyword Queries Using P-trees
Knowledge discovery & data mining Association rules and market basket analysis--introduction UCLA CS240A Course Notes*
DUALITIES: PARTITION FUNCTION EQUIVALENCE RELATION UNDIRECTED GRAPH
Frequent Pattern Mining
Taibah University College of Computer Science & Engineering Course Title: Discrete Mathematics Code: CS 103 Chapter 2 Sets Slides are adapted from “Discrete.
Association Rules.
The vertex-labelled, edge-labelled graph
MYRRH A hop is a relationship, R, hopping from one entity, E, to another entity, F. Strong Rule Mining (SRM) finds all frequent and confident rules, A→C.
North Dakota State University Fargo, ND USA
William Norris Professor and Head, Department of Computer Science
Yue (Jenny) Cui and William Perrizo North Dakota State University
Market Basket Analysis and Association Rules
PTrees (predicate Trees): fast, accurate, DM-ready horizontal processing of compressed, vertical data structures. Project onto each attribute (4 files)
Association Rule Mining
Pre-Processing What is the best amount of amortized preprocessing?
Vertical K Median Clustering
Incremental Interactive Mining of Constrained Association Rules from Biological Annotation Data Imad Rahal, Dongmei Ren, Amal Perera, Hassan Najadat and.
CHAPTER 1 - Sets and Intervals
prove that it is an addition (it is a nudge worth reading about).
Frequent patterns and Association Rules
Functional Analytic Unsupervised and Supervised data mining Technology
Computer Security Foundations
Department of Computer Science National Tsing Hua University
Standard k-means Clustering: Assume X={x1, x2,
PARTIALLY ORDERED SET DIRECTED ACYCLIC GRAPH
The P-tree Structure and its Algebra Qin Ding Maleq Khan Amalendu Roy
REVISION: Relation. Introduction to Relations and Functions.
Association Analysis: Basic Concepts
Presentation transcript:

The Multi-hop closure theorem for the Rolodex Model using pTrees
Arijit Chatterjee, Arjun G. Roy, Mohammad Hossain, William Perrizo
Computer Science Department, North Dakota State University

Data Mining Reference: http://www.egeen.ee/u/vilo/edu/2004-05/Andmekaevandus/index.cgi?f=Intro

Vertical structuring into predicate Trees (pTrees). Given a table of horizontal records, R(A1, A2, A3, A4), the data is traditionally processed vertically (VPHD: Vertical Processing of Horizontal Data): for horizontally structured records, a query such as "determine the number of occurrences of (7, 0, 1, 4)" requires a vertical scan. We instead use Horizontal Processing of Vertical Data (HPVD) to answer it.

[The slide shows the example table in base 10 and base 2: rows (2,7,6,1), (6,7,6,0), (3,7,5,1), (2,7,5,7), (3,2,1,4), (2,2,1,5), (7,0,1,4); e.g., (2,7,6,1) = 010 111 110 001.]

To structure vertically: project onto each attribute (4 files, R[A1]..R[A4]), then vertically slice off each bit position (12 files, R11, R12, R13, R21, ..., R43), then compress each bit slice into a pTree, e.g., compress R11 into P11. Record the truth of the predicate pure1 (all 1-bits) in a tree, recursively on halves, until a half is pure:

1. Whole is pure1? false → 0
2. Left half pure1? false → 0. But it is pure (pure0), so this branch ends.
3. Right half pure1? false → 0
4. Left half of right half? false → 0. But it is pure (pure0), so this branch ends.
5. Right half of right half? true → 1

To count occurrences of (7, 0, 1, 4) = 111 000 001 100, AND the pTrees of the 1-bit positions with the complement pTrees (P') of the 0-bit positions and take the root count:

P11 ∧ P12 ∧ P13 ∧ P'21 ∧ P'22 ∧ P'23 ∧ P'31 ∧ P'32 ∧ P33 ∧ P41 ∧ P'42 ∧ P'43
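As a concrete illustration, here is a minimal Python sketch of HPVD counting. It is not the authors' compressed P-tree structure: each bit slice is kept as an uncompressed integer bitmap, and the names (bit_slices, count_tuple) are ours. The counting step, though, is exactly the AND of (complemented) bit slices described above; with the seven rows recoverable from this transcript it returns 1.

```python
# Minimal sketch of HPVD counting (illustrative only; real pTrees compress
# each bit slice into a tree of pure1/pure0 halves, here each slice is
# just an uncompressed Python-int bitmap).

def bit_slices(rows, bits=3):
    """slices[j][b]: bit i is set iff row i has bit b of attribute j set.
    Bit b = 0 is the most significant bit, matching R11, R12, R13 above."""
    slices = [[0] * bits for _ in rows[0]]
    for i, row in enumerate(rows):
        for j, val in enumerate(row):
            for b in range(bits):
                if (val >> (bits - 1 - b)) & 1:
                    slices[j][b] |= 1 << i
    return slices

def count_tuple(rows, target, bits=3):
    """Count occurrences of `target` by ANDing bit slices (P) or their
    complements (P') according to target's bit pattern."""
    mask = (1 << len(rows)) - 1          # bitmap of all rows
    slices = bit_slices(rows, bits)
    acc = mask
    for j, val in enumerate(target):
        for b in range(bits):
            s = slices[j][b]
            acc &= s if (val >> (bits - 1 - b)) & 1 else mask & ~s
    return bin(acc).count("1")           # the root count

R = [(2,7,6,1), (6,7,6,0), (3,7,5,1), (2,7,5,7), (3,2,1,4), (2,2,1,5), (7,0,1,4)]
print(count_tuple(R, (7,0,1,4)))         # -> 1 for these seven rows
```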

Multi-relationships and the RoloDex Model. The DataCube model for three entities (items, people, and terms) generalizes to the RoloDex model: a collection of 2-entity relationship "cards" that share entity axes, like cards on a rolodex.

[The slide sketches many such cards: a Customer-Item card ("customer rates movie as 5"), a People-Item buying card, a People-Term card, a Course-Enrollment card, a term-doc card, an author-doc card, a term-term card (share stem?), a doc-doc card, a gene-gene card (ppi), an exp-PI card, and an exp-gene card, together with the same data laid out in the Relational Model as entity tables (Items i1..i5, People p1..p4, Terms t1..t6) plus relationship tables.]

The bottom-line point of this slide is that all entities are interrelated through multiple relationships (multiple "hops"). How can we data mine those multi-relationships?
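A minimal sketch of the RoloDex idea in Python (the class and variable names are ours, not from the slides): each card is one binary relationship between two entity axes, stored as bitmaps both ways so that either side can play the role of the focus entity.

```python
class Card:
    """One RoloDex card: a binary relationship between entity axes E and F,
    stored redundantly as E-pTrees (per e, a bitmap over F) and F-pTrees
    (per f, a bitmap over E), so it can be ANDed fast from either side."""
    def __init__(self, pairs, n_e, n_f):
        self.e_maps = [0] * n_e          # Re: for row e, bitmap over F
        self.f_maps = [0] * n_f          # Rf: for column f, bitmap over E
        for e, f in pairs:
            self.e_maps[e] |= 1 << f
            self.f_maps[f] |= 1 << e

# A tiny rolodex: two cards sharing the People axis (made-up data).
buys    = Card([(0, 1), (0, 3), (1, 2), (2, 1)], n_e=3, n_f=4)  # People x Items
friends = Card([(0, 1), (1, 0), (1, 2), (2, 1)], n_e=3, n_f=3)  # People x People
rolodex = {("People", "Items"): buys, ("People", "People"): friends}
```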

Multi-hop rule mining using the RoloDex model: A hop is a relationship, R, hopping from one entity, E, to another, F. Given a high-value "relationship" data set, we pay the one-time cost of creating pTrees both ways: R, as a matrix, has both E-pTrees (horizontal bit slices, Re) and F-pTrees (vertical bit slices, Rf).

Standard ARM finds strong (frequent, confident) single-entity rules (i.e., both A and C ⊆ E; counts are taken on the other entity, F):
ct(&e∈A Re) ≥ mnsp
ct(&e∈A Re ∧ &e∈C Re) / ct(&e∈A Re) ≥ mncf

What about multi-entity rules (i.e., A ⊆ E, C ⊆ F)? A 1-hop (A ⊆ E and C ⊆ F) F-focused rule is strong if ("focused on" refers to where the counts are taken; PC is the F-bitmap of the consequent set C):
ct(&e∈A Re) ≥ mnsp
ct(&e∈A Re ∧ PC) / ct(&e∈A Re) ≥ mncf

1-hop, F-focused strong rules can be mined efficiently because of:
1. (antecedent downward closure) If A is frequent, then all of its subsets are frequent; or, if A is infrequent, then all of its supersets are infrequent. Since frequency involves only A, we can mine for all qualifying antecedents efficiently using downward closure.
2. (consequent upward closure) If A→C is non-confident, then so is A→D for every subset, D, of C. So for each frequent antecedent, A, use upward closure to mine for all of its confident consequents.

The theorem suggested here is: for (a+c)-hop strong rule mining with a focus entity which is a hops from the antecedent and c hops from the consequent, if a [respectively c] is odd, use downward closure on the frequency [confidence] step of the mining; if even, use upward closure. In this case A is 1 hop from F (1 is odd: downward closure) and C is 0 hops from F (0 is even: upward closure).

A 1-hop (A ⊆ E and C ⊆ F) E-focused rule is strong if (PA is the E-bitmap of the antecedent set A):
ct(PA) ≥ mnsp
ct(PA ∧ &f∈C Rf) / ct(PA) ≥ mncf
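A hedged sketch of the 1-hop, F-focused test, reusing the Card bitmaps from the RoloDex sketch above (function and parameter names are ours):

```python
def ct(bm):
    """Root count: number of 1-bits in a bitmap."""
    return bin(bm).count("1")

def one_hop_f_focused(e_maps, A, P_C, n_f, mnsp, mncf):
    """1-hop rule A -> C with A a list of E-indices, P_C the F-bitmap of C.
    Counts are taken on F. Returns (support, confidence) if strong, else None."""
    acc = (1 << n_f) - 1
    for e in A:
        acc &= e_maps[e]                 # &_{e in A} Re
    supp = ct(acc)
    if supp < mnsp:                      # antecedent infrequent: downward
        return None                      # closure prunes all supersets of A
    conf = ct(acc & P_C) / supp          # ct(&Re & P_C) / ct(&Re)
    return (supp, conf) if conf >= mncf else None

# e.g., with the `buys` card above: do people 0 and 2 jointly imply item 1?
print(one_hop_f_focused(buys.e_maps, [0, 2], 1 << 1, n_f=4, mnsp=1, mncf=0.5))
```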

ct(&eARe &gCSg) / ct(&eARe)  mncf 2-hop F-focused (The focus is on middle entity, F) C G S(F,G) 1 4 1 3 AC strong if: ct(&eARe)  mnsp ct(&eARe &gCSg) / ct(&eARe)  mncf 1 2 1 1 1. (antecedent downward closure) If A is infrequent, then so are all of its supersets. 2 3 4 5 F 2. (consequent downward closure) If AC is non-confident, so is AD for all supersets, D. 4 1 1,1 down, down 3 1 2 1 1 1 A  E R(E,F) 2-hop G-focused ct(&f&eAReSf)mnsp  mncf ct(&f&eAReSf & PC) / &f&eAReSf 1. (antecedent upward closure) If A is infrequent, then so for are all subsets. 2,0 up, up 2. (consequent upward closure) If AC is non-confident, so is AD for all subsets, D. 2-hop E-focused ct(PA)mnsp  mncf ct(PA&f&gCSgRf ) / ct(PA) 0,2 up,up 1. (antecedent upward closure) If A is infrequent, then so for are all subsets. 2. (consequent upward closure) If AC is non-confident, so is AD for all subsets, D. It was 2-hop F-focus that generated the interest in multi-hop rule mining, in particular: R = a "friends" relationship (e.g., from Facebook) S = a "buys" relationship between people and items. Is it a strong rule that friends of those who bought a set of items, also buy those items?

ct( &f&eAReSf &h(& )UiTh ) / ct(&f&eAReSf) S(F,G) R(E,F) 1 2 3 4 E F 5 G A C T(G,H) H U(H,I) I V(I,J) J 5-hop Focus on G: (if yellow then green)  mnsp ct( &f&eAReSf &h(& )UiTh ) / ct(&f&eAReSf)  mncnf i(&jCVj) 5-hop focus on G: 1. (antecedent has upward closure) 2. (consequent has downward closure)

Multi-hop closure property theorem: "For transitive (a+c)-hop strong rule mining with a focus entity that is a hops from the antecedent and c hops from the consequent, if a [respectively c] is odd, one can use downward closure on that step; if it is even, upward closure."

&elist(&clist(&aDXa)Yc)We The Multi-hop Closure Theorem A condition is downward [upward] closed: If when it is true of A, it is true for all subsets [supersets], D, of A. Given an (a+c)-hop multi-relationship, where the focus entity is a hops from the antecedent and c hops from the consequent, if a [or c] is odd/even then downward/upward closure applies. A pTree, X, is said to be "covered by" a pTree, Y, if  one-bit in X, there is a one-bit at that same position in Y (the list corresponding to the bitmap, Y, is a superset of the list corresponding to the bitmap, X) Lemma-0: For any two pTrees, X, Y; X&Y is covered by X and thus ct(X&Y)  ct(X) and list(X&Y)list(X) Proof-0: ANDing with Y may zero some of X's ones but it will never change any zeros to ones. Lemma-1: Let AD, &aAXa covers &aDXa (ANDing over a superset always covers) Lemma-2: Let AD, &clist(&aDXa)Yc covers &clist(&aAXa)Yc Proof-1&2: Let Z=&aD-AXa then &aDXa =Z&(&aAXa). lemma-1 now follows from lemma-0, as does D'=list(&aAXa) A'=list(&aDXa)  so by lemma-1, we get lemma-2: Lemma-2: Let AD, &clist(&aDXa)Yc covers &clist(&aAXa)Yc Lemma-3: AD, &elist(&clist(&aAXa)Yc)We covers &elist(&clist(&aDXa)Yc)We Proof-3: lemma-3 follows in the same way from lemma-1 and lemma-2. Continuing this establishes: If there are an odd number of nested &'s then the expression with D is covered by the expression with A. Therefore the count with D  with A. Thus, if the frequent expression and the confidence expression are > threshold for A then the same is true for D. This establishes downward closure. Exactly analogously, if there are an even number of nested &'s we get the upward closures.