Published by Bethanie Alexander. Modified over 8 years ago.
1
APPENDIX: Data Mining DUALITIES

1. PARTITION ⇔ FUNCTION ⇔ EQUIVALENCE RELATION ⇔ UNDIRECTED GRAPH

Given any set, S:

A Partition is a decomposition of a set into subsets which are mutually exclusive (non-overlapping) and collectively exhaustive (every set point is in one subset). We assume the partition has unique labels on each of its component subsets (required for unambiguous reference).

An Equivalence Relation equates pairs of S-points, x~y, such that x~x; x~y ⇒ y~x; x~y and y~z ⇒ x~z.

A Partition of S
  Induces the Function which takes each point in the set to the label of its component,
  Induces the Equivalence Relation which equates two points iff they are in the same component,
  Induces the Undirected Graph with an edge connecting each S-pair from the same component.

A Function, f, from S to R
  Induces the Partition into the pre-image components of its R-points, $\{f^{-1}(r)\}_{r \in R}$,
  Induces the Equivalence Relation which equates x~y iff f(x)=f(y)=r (labeling that component as r),
  Induces the Undirected Graph with an edge connecting x with y iff f(x)=f(y).

An Equivalence Relation on S, x~y
  Induces the Partition into its equivalence sets, $\{\, y \in S : y \sim x \,\}$, one component per canonical representative x,
  Induces the Function, f(y)=x iff y is in the component of x,
  Induces the Undirected Graph with an edge connecting x with y iff x~y.

2. PARTIALLY ORDERED SET ⇔ CLOSED DIRECTED ACYCLIC GRAPH

A Closed Directed Acyclic Graph on S induces the Partially Ordered Set containing (x, y) iff there is an edge from x to y in the graph.

A Partial Ordering, ≤, on S induces the Directed Acyclic Graph with an edge running from x to y iff x ≤ y.
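The first duality can be illustrated with a minimal Python sketch. The function names and the sample labeling function (x mod 3) are assumptions made for illustration only; they do not come from the slides.

```python
from collections import defaultdict
from itertools import combinations

def partition_from_function(points, f):
    """Group points by label f(x): the pre-image components f^-1(r)."""
    components = defaultdict(set)
    for x in points:
        components[f(x)].add(x)
    return dict(components)

def relation_from_partition(partition):
    """Return the predicate x ~ y, true iff x and y share a component."""
    label_of = {x: r for r, comp in partition.items() for x in comp}
    return lambda x, y: label_of[x] == label_of[y]

def graph_from_partition(partition):
    """Undirected edges joining every pair of distinct points in a component."""
    return {frozenset(pair)
            for comp in partition.values()
            for pair in combinations(sorted(comp), 2)}

# Hypothetical example: S = {0,...,5}, labeled by remainder mod 3.
S = range(6)
partition = partition_from_function(S, lambda x: x % 3)
equiv = relation_from_partition(partition)
edges = graph_from_partition(partition)
```

Starting from the function, the sketch derives the partition, then the equivalence predicate and the edge set from that partition, so all four views describe the same grouping of S.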
2
APPENDIX: RoloDex Model

The dualities on the previous slide apply to unary relationships (relationships in which there is one entity and we are relating pairs of instances from that entity).

The relationships we had been talking about were all Bipartite Relationships, in which we have two separate entities (or two disjoint subsets from the same entity), and we only relate x to y if x and y are instances from different entities.

Often, we need to analyze an even more complex situation in which we have a combination of bipartite and unipartite relationships, called Bipartite - Unipartite on Part (BUP) relationships.

Examples:
In Bioinformatics, bipartite relationships between genes and experiments (a gene is related to an experiment iff it expresses at a threshold level in that experiment) are studied in conjunction with unipartite relationships on gene pairs (e.g., gene-gene or protein-protein interactions).
In Market Research, bipartite relationships between items and customers are studied in conjunction with unipartite relationships on the customers (e.g., x~y iff x and y are males under 25).

For these BUP situations, we suggest the RoloDex Model. In this model, each relationship is expressed as a "card" in a rolodex revolving around the entities involved in that relationship.
3
The Bipartite, Unipartite-on-Part (BUP), Experiment-Gene-Gene Relationship, EGG

[Figure: a rectangular Experiment-Gene card and a Gene-Gene card sharing the G=Genes axis (genes 1-4); the other axes are E=Experiments and a second copy of G=Genes.]

So as not to duplicate axes, the second copy of G=Genes should ideally be folded over to coincide with the other copy, producing a "conical" unipartite card.

Each conical RoloDex card revolving about the Gene Entity axis is a separate Gene-Gene (or Protein-Protein) interaction.

Each rectangular RoloDex card revolving about the Gene Entity axis and about the Experiment axis is a separate Gene-Experiment relationship.
4
RoloDex Model

[Figure: a RoloDex of cards revolving about shared entity axes (Customer, Item, Gene, Doc, Exp, Author, term, People, PI). Cards shown: cust-item card, author-doc card, term-doc card, doc-doc card, term-term card (share stem?), exp-gene card, gene-gene card (ppi), exp-PI card, itemset-itemset card. Annotations on the itemset-itemset card: Supp(A) = CusFreq(ItemSet); Conf(A⇒B) = Supp(A∪B)/Supp(A).]

For an Axis-Card pair (Entity-Relationship pair), a c(a,b), a support count (or ratio or %) is defined for AxisSets, A:
  for a graph relationship, $\mathrm{supp}_G(A,\ a\ c(a,b)) = |\{\, b : \exists a \in A,\ (a,b) \in c \,\}|$
  for a multigraph, $\mathrm{supp}_{MG}$ is the histogram over b of (a,b)-EdgeCounts, a ∈ A.
Other quantifiers can be used also (e.g., the universal, ∀, is used in MBR).

Most interestingness measures are based on one of these supports.
  In IR, $df(t) = \mathrm{supp}_G(\{t\},\ t\ c(t,d))$; tf(t,d) is the one histogram bar in $\mathrm{supp}_{MG}(\{t\},\ t\ c(t,d))$.
  In MBR, $\mathrm{supp}(I) = \mathrm{supp}_G(I,\ i\ c(i,t))$.
  In MDA, $\mathrm{supp}_{MG}(GSet,\ g\ c(g,e))$.

Of course all supports are inherited redundantly by the card, c(a,b).
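As a rough illustration of the two support notions, the sketch below computes an existential graph support and a multigraph histogram. The term-document card and its counts are made-up sample data, and the function names are assumptions, not notation from the slides.

```python
from collections import Counter

def supp_g(axis_set, card):
    """Existential graph support: number of b with some a in axis_set
    such that (a, b) is an edge of the card."""
    return len({b for (a, b) in card if a in axis_set})

def supp_mg(axis_set, multicard):
    """Multigraph support: histogram over b of (a, b) edge counts, a in axis_set."""
    hist = Counter()
    for (a, b), count in multicard.items():
        if a in axis_set:
            hist[b] += count
    return hist

# Hypothetical term-document card: (term, doc) -> number of occurrences.
term_doc = {("gene", "d1"): 3, ("gene", "d2"): 1, ("tree", "d2"): 2}

df_gene = supp_g({"gene"}, term_doc.keys())  # document frequency of "gene"
tf_gene = supp_mg({"gene"}, term_doc)        # term frequency of "gene" per document
```

Here df_gene plays the role of df(t), and each entry of tf_gene is one histogram bar, tf(t,d).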
5
Cousin Association Rules (CARs)

For a card (RELATIONSHIP), c(I,T), one has
  I-Association Rules among disjoint Isets, A⇒C, A,C ⊆ I, with A∩C=∅, and
  T-Association Rules among disjoint Tsets, A⇒C, A,C ⊆ T, with A∩C=∅.

Two measures of quality of A⇒C are:
  SUPP(A∪C), where, e.g., for any Iset, A, $SUPP(A) \equiv |\{\, t \mid (i,t) \in E\ \ \forall i \in A \,\}|$
  CONF(A⇒C) = SUPP(A∪C)/SUPP(A)

First Cousin Association Rules: Given any card, c(T,U), sharing axis, T, with the bipartite relationship, b(T,I), Cousin Association Rules are those in which the antecedent Tset is generated by a subset, S, of U as follows:
  $\{\, t \in T \mid \exists u \in S \text{ such that } (t,u) \in C \,\}$
(Note this should be called an "existential first cousin AR" since we are using the existential quantifier. One can use the universal quantifier instead, as was used in MBR ARM.)
E.g., if S ⊆ U, A = C(S), A' ⊆ T, then A⇒A' is a CAR and we can also label it S⇒A'.

First Cousin Association Rules Once Removed (FCAR-1r) are those in which both Tsets are generated by another bipartite relationship, and we can label the antecedent and/or the consequent using the generating set or the Tset.
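The support, confidence and cousin-set definitions can be sketched in a few lines of Python. The item-transaction card and the transaction-store cousin card below are hypothetical sample data, and the helper names are assumptions introduced for illustration.

```python
def supp(itemset, card):
    """Universal support: number of t related to every item in itemset."""
    tsets = [{t for (i, t) in card if i == item} for item in itemset]
    return len(set.intersection(*tsets)) if tsets else 0

def conf(antecedent, consequent, card):
    """CONF(A => C) = SUPP(A u C) / SUPP(A); assumes SUPP(A) > 0."""
    return supp(antecedent | consequent, card) / supp(antecedent, card)

def cousin_tset(generator, cousin_card):
    """Existential first-cousin Tset: every t with some u in generator
    such that (t, u) is on the cousin card."""
    return {t for (t, u) in cousin_card if u in generator}

# Hypothetical item-transaction card, as (item, transaction) pairs.
item_trans = {("milk", "t1"), ("bread", "t1"), ("milk", "t2"),
              ("milk", "t3"), ("bread", "t3"), ("eggs", "t3")}
# Hypothetical cousin card sharing the transaction axis: (transaction, store).
trans_store = {("t1", "s1"), ("t2", "s1"), ("t3", "s2")}

support_ab = supp({"milk", "bread"}, item_trans)
confidence = conf({"milk"}, {"bread"}, item_trans)
antecedent = cousin_tset({"s1"}, trans_store)
```

The set antecedent is the Tset generated by the store subset {s1}; a rule using it could be labeled either by that Tset or by the generating set.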
6
The Cousin Association Rules

Second Cousin Association Rules are those in which the antecedent Tset is generated by a subset of an axis which shares a card with T, which shares the card, B, with I. 2CARs can be denoted using the generating (second cousin) set or the Tset antecedent.

Second Cousin Association Rules once removed are those in which the antecedent Tset is generated by a subset of an axis which shares a card with T, which shares the card, B, with I, and the consequent is generated by C(T,U) (a first cousin Tset). 2CAR-1rs can be denoted using any combination of the generating (second cousin) set or the Tset antecedent and the generating (first cousin) set or Tset consequent.

Second Cousin Association Rules twice removed are those in which the antecedent Tset is generated by a subset of an axis which shares a card with T, which shares the card, B, with I, and the consequent is generated by a subset of an axis which shares a card with T, which shares another first cousin card with I. 2CAR-2rs can be denoted using any combination of the generating (second cousin) set or the Tset antecedent and the generating (second cousin) set or Tset consequent. Note that 2CAR-2rs are also 2CAR-1rs, so they can be denoted as above also.

Third Cousin Association Rules are those....

We note that these definitions give us many opportunities to define quality measures.
7
Measuring Quality in the RoloDex Model

[Figure: the RoloDex of cards from the earlier slide (cust-item card, author-doc card, term-doc card, doc-doc card, term-term card (share stem?), exp-gene card, gene-gene card (ppi), exp-PI card) on the axes Customer, Item, Gene, Doc, Exp, Author, term, People, PI.]

For Distance CARMA relationships, quality (e.g., supp or conf or ???) can be measured using information on any/all cards along the relationship (multiple cards can contribute factors or terms or in some other way???).
8
Generalized CARs:

First, we propose a definition of Generalized Association Rules (GARs) which contains the standard "1 Entity Itemset" AR definition as a special case.

Association Pathway Mining (APM) is a DM technique (with application to bioinformatics?).

Given Relationships, R_1, R_2 (RoloDex cards) with shared Entity, F (axis), arranged as E --R_1-- F --R_2-- G, and given A ⊆ E and C ⊆ G, then A⇒C is a Generalized F Association Rule, with

  $\mathrm{Support}_{R_1R_2}(A \Rightarrow C) = |\{\, t \in E_2 \mid \forall a \in A,\ (a,t) \in R_1 \text{ and } \forall c \in C,\ (c,t) \in R_2 \,\}|$
  $\mathrm{Confidence}_{R_1R_2}(A \Rightarrow C) = \mathrm{Support}_{R_1R_2}(A \Rightarrow C) \,/\, \mathrm{Support}_{R_1}(A)$

where, as always, $\mathrm{Support}_{R_1}(A) = |\{\, t \in F \mid \forall a \in A,\ (a,t) \in R_1 \,\}|$. (Here E_2 is the shared middle entity, written F elsewhere on this slide.)

If E=G, the GAR is a standard AR iff A∩C=∅.

Association Pathway Mining (APM) is the identification and assessment (e.g., support, confidence, etc.) of chains of GARs in a RoloDex. Restricting to the mining of cousin GARs reduces the number of strong rules or pathway links.
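A minimal sketch of GAR support and confidence follows, assuming the universal quantifier on both sides as in the standard AR case. The experiment-gene and pathway-gene cards are invented sample data, and all names are illustrative assumptions.

```python
def support_set(A, C, R1, R2, F):
    """All t in F related via R1 to every a in A and via R2 to every c in C."""
    return {t for t in F
            if all((a, t) in R1 for a in A)
            and all((c, t) in R2 for c in C)}

def gar_support(A, C, R1, R2, F):
    return len(support_set(A, C, R1, R2, F))

def gar_confidence(A, C, R1, R2, F):
    """Support(A => C) / Support_R1(A); assumes Support_R1(A) > 0."""
    support_a = len(support_set(A, set(), R1, R2, F))
    return gar_support(A, C, R1, R2, F) / support_a

# Hypothetical shared axis F (genes) with cards R1 (experiment, gene)
# and R2 (pathway, gene).
F = {"g1", "g2", "g3"}
R1 = {("e1", "g1"), ("e1", "g2"), ("e2", "g1"), ("e2", "g2"), ("e2", "g3")}
R2 = {("p1", "g1"), ("p1", "g3")}

support = gar_support({"e1", "e2"}, {"p1"}, R1, R2, F)
confidence = gar_confidence({"e1", "e2"}, {"p1"}, R1, R2, F)
```

With an empty consequent the support set reduces to the ordinary R_1 support of A, which is why the confidence denominator reuses the same helper.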
9
More generally, for entities E, F, G and relationships, R_1(E,F) and R_2(F,G), arranged as E --R_1-- F --R_2-- G with A ⊆ E and C ⊆ G:

  $\text{Support-Set}_{R_1R_2}(A,C) = SS_{R_1R_2}(A,C) = \{\, t \in E_2 \mid \forall a \in A\ (a,t) \in R_1,\ \forall c \in C\ (c,t) \in R_2 \,\}$

(Here E_2 is the shared middle entity, F.) If E_2 has real labels,

  $\text{Label-Weighted-Support}_{R_1R_2}(A,C) = LWS_{R_1R_2}(A,C) = \sum_{t \in SS_{R_1R_2}(A,C)} \mathrm{label}(t)$

(the un-generalized case occurs when all weights are 1).

[Figure: cards R_1 and R_2 over the axes E, F, G, with sets A ⊆ E and C ⊆ G, labels l_{2,2}, l_{2,3} on F, and the support set SS_{R_1R_2} marked on F.]

Downward closure property of Support Sets: SS(A',C') ⊇ SS(A,C) for all A' ⊆ A, C' ⊆ C.

Therefore, if all labels are non-negative, then LWS(A,C) ≤ LWS(A',C'). (A necessary condition for LWS(A,C) to exceed a threshold is that all LWS(A',C') exceed that threshold, for all A' ⊆ A, C' ⊆ C.)

So an Apriori-like frequent set pair mine would go as: Start with pairs of 1-sets (in E and G). The only candidate 2-antecedents with 1-consequents (equivalently, 2-consequents with 1-antecedents) would be those formed by joining...

The weighted support concept can be extended to the case where R_1 and/or R_2 have labels as well.

Vertical methods can be applied by converting F to vertical format (F instances are the rows, and pertinent features from other cards/axes are "rolled over" to F as derived feature attributes).
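The label-weighted support and its downward closure can be checked on a small example. This is a sketch under assumed sample data (the cards, labels and names are hypothetical), with the universal quantifier used for both A and C.

```python
def support_set(A, C, R1, R2, F):
    """All t in F related via R1 to every a in A and via R2 to every c in C."""
    return {t for t in F
            if all((a, t) in R1 for a in A)
            and all((c, t) in R2 for c in C)}

def lws(A, C, R1, R2, labels):
    """Label-weighted support: sum of label(t) over the support set.
    labels maps every element of the shared axis F to a real weight."""
    return sum(labels[t] for t in support_set(A, C, R1, R2, labels.keys()))

# Hypothetical cards and non-negative labels on the shared axis.
R1 = {("e1", "g1"), ("e1", "g2"), ("e2", "g1"), ("e2", "g2"), ("e2", "g3")}
R2 = {("p1", "g1"), ("p1", "g3")}
labels = {"g1": 2.0, "g2": 0.5, "g3": 1.0}

big = lws({"e1", "e2"}, {"p1"}, R1, R2, labels)   # support set {g1}
small = lws({"e2"}, {"p1"}, R1, R2, labels)       # support set {g1, g3}
```

Shrinking the antecedent from {e1, e2} to {e2} can only enlarge the support set, so with non-negative labels the weighted support cannot decrease; this is the pruning condition an Apriori-like search would rely on.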
10
VERTIGO: A Vertically Structured Rendition of the GO (Gene Ontology)?

How do we include GO data into the Data Mining processes?
1. Treat it as a horizontally structured dataset.
2. View GO as a Gene Set hierarchy (that seems to be how it is used, often?) with the other aspects of it as node labels. One could then minimize it, as a subset of the Set Enumeration Tree with highly structured labels?
3. Preprocess pertinent GO information into derived attributes on a Gene Table.
4. Use the RoloDex Model for it?

Preliminary thoughts on this last alternative include: Each of the three major annotation areas (Molecular Function, Cellular Location, Biological Process) is a Gene-to-Annotation Card.
11
APPENDIX: Vertical Select-Project-Join (SPJ) Queries

A Select-Project-Join query has joins, selections and projections. Typically there is a central fact relation to which several dimension relations are to be joined (standard STAR DW).

E.g., the Student(S), Course(C), Enrol(E) STAR DB below (each encoded cell shows the value followed by its bit encoding):

S |s    |name |gen|
  |0 000|CLAY |M 0|
  |1 001|THAIS|M 0|
  |2 010|GOOD |F 1|
  |3 011|BAID |F 1|
  |4 100|PERRY|M 0|
  |5 101|JOAN |F 1|

C |c    |name|st|term|
  |0 000|BI  |ND|F 0 |
  |1 001|DB  |ND|S 1 |
  |2 010|DM  |NJ|S 1 |
  |3 011|DS  |ND|F 0 |
  |4 100|SE  |NJ|S 1 |
  |5 101|AI  |ND|F 0 |

E |s    |c    |grade|
  |0 000|1 001|B 10 |
  |0 000|0 000|A 11 |
  |3 011|1 001|A 11 |
  |3 011|3 011|D 00 |
  |1 001|3 011|D 00 |
  |1 001|0 000|B 10 |
  |2 010|2 010|B 10 |
  |2 010|3 011|A 11 |
  |4 100|4 100|B 10 |
  |5 101|5 101|B 10 |

Vertical bit-sliced (uncompressed) attributes are stored one bit position per file. Each slice is listed here top to bottom in table row order, as derived from the tables above:

  S.s2 = 000011   S.s1 = 001100   S.s0 = 010101   S.g = 001101
  C.c2 = 000011   C.c1 = 001100   C.c0 = 010101   C.t = 011010
  E.s2 = 0000000011   E.s1 = 0011001100   E.s0 = 0011110001
  E.c2 = 0000000011   E.c1 = 0001101100   E.c0 = 1011100101
  E.g1 = 1110011111   E.g0 = 0110000100

Vertical (un-bit-sliced) attributes are stored as single columns:

  S.name    C.name   C.st
  |CLAY |   |BI|     |ND|
  |THAIS|   |DB|     |ND|
  |GOOD |   |DM|     |NJ|
  |BAID |   |DS|     |ND|
  |PERRY|   |SE|     |NJ|
  |JOAN |   |AI|     |ND|
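Bit slicing itself is simple to sketch. The helper names below are assumptions for illustration; the input values are the S.s keys and the E.grade codes from the tables on this slide (A=11, B=10, D=00).

```python
def bit_slices(values, width):
    """Return one bit-string per bit position, most significant bit first.
    Each string lists that bit for every row, in row order."""
    return ["".join(str((v >> k) & 1) for v in values)
            for k in range(width - 1, -1, -1)]

def from_slices(slices):
    """Rebuild the original integer column from its bit slices."""
    return [int("".join(bits), 2) for bits in zip(*slices)]

student_ids = [0, 1, 2, 3, 4, 5]                 # S.s
grades = [2, 3, 3, 0, 0, 2, 2, 3, 2, 2]          # E.grade as 2-bit codes

s2, s1, s0 = bit_slices(student_ids, 3)
g1, g0 = bit_slices(grades, 2)
```

Because the slices are lossless, from_slices recovers the original column, which is what lets later steps work purely on bit vectors and only materialize values for the final output.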
12
Vertical preliminary Select-Project-Join Query Processing (SPJ)

In the SCORE database (Students, Courses, Offerings, Rooms, Enrollments), numeric attributes are represented vertically as P-trees (not compressed). Categorical attributes are projected to a 1-column vertical file. Each encoded cell below shows the decimal value followed by its binary code.

S: s     n gen
  |0 000|A|M|
  |1 001|T|M|
  |2 100|S|F|
  |3 111|B|F|
  |4 010|C|M|
  |5 011|J|F|

C: c    n cred
  |0 00|B|1 01|
  |1 01|D|3 11|
  |2 10|M|3 11|
  |3 11|S|2 10|

O: o     c    r
  |0 000|0 00|0 01|
  |1 001|0 00|1 01|
  |2 010|1 01|0 00|
  |3 011|1 01|1 01|
  |4 100|2 10|0 00|
  |5 101|2 10|2 10|
  |6 110|2 10|3 11|
  |7 111|3 11|2 10|

R: r    cap
  |0 00|30 11|
  |1 01|20 10|
  |2 10|30 11|
  |3 11|10 01|

E: s     o     grade
  |0 000|1 001|2 10|
  |0 000|0 000|3 11|
  |3 011|1 001|3 11|
  |3 011|3 011|0 00|
  |1 001|3 011|0 00|
  |1 001|0 000|2 10|
  |2 010|2 010|2 10|
  |2 010|7 111|3 11|
  |4 100|4 100|2 10|
  |5 101|5 101|2 10|

SELECT S.n, C.n FROM S, C, O, R, E WHERE S.s=E.s & C.c=O.c & O.o=E.o & O.r=R.r & S.g=M & C.r=2 & E.g=A & R.c=20;

(In the query, C.r is the cred attribute of C and R.c is the cap attribute of R.)

[Figure: P-tree diagrams for the bit slices S.s2, S.s1, S.s0; C.c1, C.c0, C.r1, C.r0; O.o2, O.o1, O.o0, O.c1, O.c0, O.r1, O.r0; R.r1, R.r0, R.c1, R.c0; E.s2, E.s1, E.s0, E.o2, E.o1, E.o0, E.g1, E.g0; and the 1-column vertical files S.n, S.g, C.n.]
13
For the selections, S.g=M=1b, C.r=2=10b, E.g=A=11b, R.c=20=10b, create the selection masks using ANDs and COMPLEMENTS:

  SM   = P_S.g
  Cr2  = P_C.r1 ^ P'_C.r0
  EgA  = P_E.g1 ^ P_E.g0
  Rc20 = P_R.c1 ^ P'_R.c0

[Figure: P-tree diagrams of the bit slices of S, C, O, R and E, with the selection masks SM, Cr2, EgA and Rc20 built from them.]

Apply these selection masks (numeric values are zeroed out, others are blanked out).

[Figure: the same P-trees after masking; the surviving values include S.n = A, T, C and C.n = S.]
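Building an equality mask from bit slices can be sketched as below, using uncompressed bit lists in place of P-trees (a simplifying assumption). The function name is hypothetical; the slice values are read off the cred column of C and the cap column of R in the SCORE tables.

```python
def equality_mask(slices, value):
    """AND together each bit slice (or its complement) so that a row's
    mask bit is 1 exactly when its attribute equals `value`.
    slices: list of 0/1 lists, most significant bit first."""
    width = len(slices)
    mask = [1] * len(slices[0])
    for pos, column in enumerate(slices):
        wanted = (value >> (width - 1 - pos)) & 1
        # Use the slice itself where the wanted bit is 1, its complement where 0.
        mask = [m & (bit if wanted else 1 - bit) for m, bit in zip(mask, column)]
    return mask

# C.cred codes 01, 11, 11, 10 and R.cap codes 11, 10, 11, 01, bit-sliced.
cred_slices = [[0, 1, 1, 1], [1, 1, 1, 0]]
cap_slices = [[1, 1, 1, 0], [1, 0, 1, 1]]

cr2 = equality_mask(cred_slices, 0b10)    # C.r = 2
rc20 = equality_mask(cap_slices, 0b10)    # R.c = 20, encoded as 10b

course_names = ["B", "D", "M", "S"]
selected = [n for n, keep in zip(course_names, cr2) if keep]
```

Only course S survives the Cr2 mask and only room 1 survives Rc20, matching the masked values shown on the slide.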
14
For the joins, S.s=E.s, C.c=O.c, O.o=E.o, O.r=R.r, one approach is to follow an indexed nested loop like method (noting that attribute P-trees ARE an index for that attribute).

The join O.r=R.r is simply part of a selection on O (R doesn't contribute output nor participate in any further operations).
  Use the Rc20-masked R as the outer relation.
  Use O as the indexed inner relation to produce that O-selection mask.

Get the 1st R.r value, 01b (there's only 1).
Mask the O tuples: OM = P'_O.r1 ^ P_O.r0
This is the only R.r value (if there were more, one would do the same for each, then OR those masks to get the final O-mask).

Next, we apply the O-mask, OM, to O.

[Figure: P-trees for the masked S, E, C and R relations, the mask Rc20, the O.r slices combined into OM, and the O.o and O.c slices after OM is applied.]
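The outer-loop step described here (one equality mask per surviving outer value, ORed together) can be sketched as follows. Uncompressed bit lists stand in for P-trees, the function names are assumptions, and the offering-room column is hypothetical sample data rather than the O table from the slides.

```python
def equality_mask(slices, value):
    """Mask of rows whose bit-sliced attribute equals `value` (MSB first)."""
    width = len(slices)
    mask = [1] * len(slices[0])
    for pos, column in enumerate(slices):
        wanted = (value >> (width - 1 - pos)) & 1
        mask = [m & (bit if wanted else 1 - bit) for m, bit in zip(mask, column)]
    return mask

def join_mask(outer_values, outer_mask, inner_slices):
    """For each unmasked outer join value, mask the matching inner rows,
    then OR the per-value masks into one inner selection mask."""
    result = [0] * len(inner_slices[0])
    for value, keep in zip(outer_values, outer_mask):
        if keep:
            matches = equality_mask(inner_slices, value)
            result = [r | m for r, m in zip(result, matches)]
    return result

room_ids = [0, 1, 2, 3]          # R.r
rc20 = [0, 1, 0, 0]              # selection mask on R (cap = 20)
# Hypothetical offerings held in rooms 1, 0, 1, 2, 1, bit-sliced (MSB first).
offering_room_slices = [[0, 0, 0, 1, 0], [1, 0, 1, 0, 1]]

om = join_mask(room_ids, rc20, offering_room_slices)
```

The resulting om would then be ANDed with any other selection mask on the inner relation before the next join in the chain uses that relation as its outer side.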
15
For the final 3 joins, C.c=O.c, O.o=E.o, E.s=S.s, the same indexed nested loop like method can be used.

Get the 1st masked C.c value, 11b.
  Mask the corresponding O tuples: OM = P_O.c1 ^ P_O.c0
Get the 1st masked O.o value, 111b.
  Mask the corresponding E tuples: EM = P_E.o2 ^ P_E.o1 ^ P_E.o0
Get the 1st masked E.s value, 010b.
  Mask the corresponding S tuples: SM = P'_S.s2 ^ P_S.s1 ^ P'_S.s0
Get the S.n-value(s), C, pair it with the C.n-value(s), S, and output the concatenation of S.n and C.n: (C, S).

There was just one masked tuple at each stage in this example. In general, one would loop through the masked portion of the extant domain at each level (thus, Indexed Horizontal Nested Loop or IHNL).

[Figure: P-trees for the masked S, E, C and O relations and the masks OM, EM and SM produced at each join step.]
16
Vertical Select-Project-Join-Classification Query

Given the previous SCORE Training Database (not presented as just one training table), predict what course a male student will register for, given that he got an A in a previous course held in a room with a capacity of 20.

This is a matter of applying the previous complex SPJ query first to get the pertinent Training table, and then classifying the above unclassified sample (e.g., using 1-nearest neighbour classification).

The result of the SPJ is the single-row Training Set, (S,C), and so the prediction is course=C.