APPENDIX: Data Mining DUALITIES

1. PARTITION ↔ FUNCTION ↔ EQUIVALENCE RELATION ↔ UNDIRECTED GRAPH

Given any set, S:
A Partition is a decomposition of S into subsets which are mutually exclusive (non-overlapping) and collectively exhaustive (every point of S is in exactly one subset). We assume the partition has a unique label on each of its component subsets (required for unambiguous reference).
An Equivalence Relation relates pairs of S-points, x ~ y, such that x~x; x~y ⟹ y~x; x~y and y~z ⟹ x~z.

A Partition of S:
induces the Function which takes each point of the set to the label of its component;
induces the Equivalence Relation which equates two points iff they are in the same component;
induces the Undirected Graph with an edge connecting each S-pair from the same component.

A Function, f, from S to R:
induces the Partition into the pre-image components of the R-points, {f⁻¹(r)}_{r∈R};
induces the Equivalence Relation which equates x ~ y iff f(x) = f(y) = r (labeling that component r);
induces the Undirected Graph with an edge connecting x and y iff f(x) = f(y).

An Equivalence Relation on S, x ~ y:
induces the Partition into its equivalence classes, x_comp = {y∈S | y~x} (the x's serve as canonical component representatives);
induces the Function f(y) = x iff y ∈ x_comp;
induces the Undirected Graph with an edge connecting x and y iff x ~ y.

2. PARTIALLY ORDERED SET ↔ CLOSED DIRECTED ACYCLIC GRAPH

A Closed Directed Acyclic Graph on S induces the Partial Order containing (x, y) iff there is an edge from x to y in the graph.
A Partial Order, ≤, on S induces the Directed Acyclic Graph with an edge running from x to y iff x ≤ y.
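To make the first duality concrete, here is a minimal Python sketch (illustrative only; the set S and the function f are invented, not from the slides) in which a function f on a finite set S induces the partition, the equivalence relation, and the undirected graph exactly as defined above:

```python
from collections import defaultdict
from itertools import combinations

S = [0, 1, 2, 3, 4, 5]
f = lambda x: x % 3                      # any function S -> R

# Induced partition: the pre-image components {f^-1(r)}, labeled by r
partition = defaultdict(list)
for x in S:
    partition[f(x)].append(x)

# Induced equivalence relation: x ~ y iff f(x) = f(y)
equiv = lambda x, y: f(x) == f(y)

# Induced undirected graph: an edge {x, y} iff x ~ y
edges = [(x, y) for x, y in combinations(S, 2) if equiv(x, y)]

print(dict(partition))                   # {0: [0, 3], 1: [1, 4], 2: [2, 5]}
print(edges)                             # [(0, 3), (1, 4), (2, 5)]
```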

APPENDIX: RoloDex Model

The dualities on the previous slide apply to unary relationships (relationships in which there is one entity and we relate pairs of instances from that entity). The relationships we have been discussing were all Bipartite Relationships, in which we have two separate entities (or two disjoint subsets of the same entity) and we only relate x to y if x and y are instances of different entities. Often we need to analyze an even more complex situation combining bipartite and unipartite relationships, called Bipartite-Unipartite on Part (BUP) relationships. Examples: In Bioinformatics, bipartite relationships between genes and experiments (a gene is related to an experiment iff it expresses at a threshold level in that experiment) are studied in conjunction with unipartite relationships on gene pairs (e.g., gene-gene or protein-protein interactions). In Market Research, bipartite relationships between items and customers are studied in conjunction with unipartite relationships on the customers (e.g., x~y iff x and y are males under 25). For these BUP situations we suggest the RoloDex Model, in which each relationship is expressed as a "card" in a rolodex revolving around the entities (axes) involved in that relationship.
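As a toy illustration of a BUP situation (the gene and experiment names are invented), each RoloDex card is just an edge set, and the bipartite and unipartite cards share only the Gene axis:

```python
# Bipartite card: gene g relates to experiment e iff g expresses in e
GE = {("g1", "e1"), ("g2", "e1"), ("g2", "e2"), ("g3", "e2")}

# Unipartite card on the Gene axis (e.g., protein-protein interaction)
GG = {frozenset({"g1", "g2"}), frozenset({"g3", "g4"})}

genes_in_e1 = {g for (g, e) in GE if e == "e1"}
partners = lambda g: {h for pair in GG if g in pair for h in pair - {g}}
print(genes_in_e1)        # {'g1', 'g2'}
print(partners("g1"))     # {'g2'}
```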

The Bipartite, Unipartite-on-Part (BUP), Experiment-Gene-Gene Relationship, EGG. [Figure: two copies of the G=Genes axis (1 2 3 4) and the E=Experiments axis.] So as not to duplicate axes, one copy of G=Genes should ideally be folded over to coincide with the other copy, producing a "conical" unipartite card. Each conical RoloDex card revolving about the Gene entity axis is a separate gene-gene (or protein-protein) interaction relationship. Each rectangular RoloDex card revolving about the Gene entity axis and the Experiment axis is a separate gene-experiment relationship.

RoloDex Model: supports. For each Axis-Card pair (Entity-Relationship pair), a—c(a,b), define a support count (or ratio or %) for Axis sets A:
for a graph relationship, supp_G^∃(A, a—c(a,b)) = |{b : ∃a∈A, (a,b)∈c}|;
for a multigraph, supp_MG^∃(A, a—c(a,b)) is the histogram, over b, of the (a,b)-edge counts, a∈A.
Other quantifiers can be used as well (e.g., the universal, ∀, is used in MBR).
Most interestingness measures are based on one of these supports:
In IR, df(t) = supp_G^∃({t}, t—c(t,d)); tf(t,d) is the single histogram bar in supp_MG^∃({t}, t—c(t,d)).
In MBR, supp(I) = supp_G^∀(I, i—c(i,t)), so ItemSet Supp(A) = CustFreq(ItemSet) and Conf(A⇒B) = Supp(A∪B)/Supp(A).
In MDA, supp_MG^∃(GSet, g—c(g,e)).
Of course, all supports are inherited redundantly by the card, c(a,b).
[Figure: a RoloDex with axes Customer, Item, Gene, Doc, Exp, Author, term, People, PI and cards custitem, authordoc, termdoc, docdoc, termterm (share stem?), expgene, genegene (ppi), expPI, itemset-antecedent itemset.]
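A short sketch (with assumed toy data) of the two support counts just defined, using the existential quantifier: the graph support supp_G counts the distinct b's reachable from A, while the multigraph support supp_MG is the histogram of (a,b)-edge counts over b:

```python
from collections import Counter

# Multigraph card: duplicate (a, b) edges are allowed, hence a list
card = [("a1", "b1"), ("a1", "b2"), ("a2", "b2"), ("a2", "b2")]
A = {"a1", "a2"}

# supp_G(A) = |{b : exists a in A with (a, b) in card}|
supp_G = len({b for (a, b) in card if a in A})

# supp_MG(A) = histogram over b of the (a, b)-edge counts, a in A
supp_MG = Counter(b for (a, b) in card if a in A)

print(supp_G)     # 2
print(supp_MG)    # Counter({'b2': 3, 'b1': 1})
```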

Cousin Association Rules (CARs)

Given a card (RELATIONSHIP), c(I,T), one has I-Association Rules among disjoint Isets, A⇒C, ∀A,C⊆I with A∩C=∅, and T-Association Rules among disjoint Tsets, A⇒C, ∀A,C⊆T with A∩C=∅. Two measures of the quality of A⇒C are:
SUPP(A⇒C), where, e.g., for any Iset A, SUPP(A) ≡ |{t | (i,t)∈E ∀i∈A}|
CONF(A⇒C) = SUPP(A∪C)/SUPP(A)
First Cousin Association Rules: Given any card, c(T,U), sharing axis T with the bipartite relationship b(T,I), cousin association rules are those in which the antecedent Tset is generated by a subset, S, of U as follows: {t∈T | ∃u∈S such that (t,u)∈c}. (Note this should be called an "existential first cousin AR," since we are using the existential quantifier; one can use the universal quantifier instead, as was done in MBR ARM.) E.g., if S⊆U, A=C(S) and A'⊆T, then A⇒A' is a CAR, and we can also label it S⇒A'.
First Cousin Association Rules once removed (FCAR-1r) are those in which both Tsets are generated by another bipartite relationship; we can label the antecedent and/or the consequent using either the generating set or the Tset.
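The following sketch (toy cards, invented data) computes SUPP and CONF as defined above, plus the existential first-cousin generator C(S). The universal reading of SUPP is an assumption consistent with the MBR convention on the previous slide:

```python
E = {("i1", "t1"), ("i1", "t2"), ("i2", "t2")}        # card c(I, T)
T = {"t1", "t2", "t3"}
cousin = {("t1", "u1"), ("t2", "u2")}                 # cousin card c(T, U)

def supp(iset):
    """SUPP(A) = |{t : (i, t) in E for every i in A}|."""
    return len({t for t in T if all((i, t) in E for i in iset)})

def conf(A, C):
    """CONF(A => C) = SUPP(A u C) / SUPP(A)."""
    return supp(A | C) / supp(A)

def generated(S):
    """Existential generator: {t in T : exists u in S with (t, u) in cousin}."""
    return {t for (t, u) in cousin if u in S}

print(supp({"i1"}), conf({"i1"}, {"i2"}))   # 2 0.5
print(generated({"u2"}))                    # {'t2'}
```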

Cousin Association Rules, continued

Second Cousin Association Rules (2CARs) are those in which the antecedent Tset is generated by a subset of an axis which shares a card with T, which in turn shares the card, B, with I. 2CARs can be denoted using either the generating (second cousin) set or the Tset antecedent.
Second Cousin Association Rules once removed (2CAR-1r) are those in which the antecedent Tset is generated as above and the consequent is generated by c(T,U) (a first cousin Tset). 2CAR-1rs can be denoted using any combination of the generating (second cousin) set or the Tset antecedent, and the generating (first cousin) set or the Tset consequent.
Second Cousin Association Rules twice removed (2CAR-2r) are those in which the antecedent Tset is generated as above and the consequent is generated by a subset of an axis which shares another first cousin card with I. 2CAR-2rs can be denoted using any combination of the generating (second cousin) set or the Tset antecedent and the generating (second cousin) set or the Tset consequent. Note that 2CAR-2rs are also 2CAR-1rs, so they can be denoted as above as well.
Third Cousin Association Rules are those....
We note that these definitions give us many opportunities to define quality measures.

Measuring Quality in the RoloDex Model

[Figure: the RoloDex of axes (Customer, Item, Gene, Doc, Exp, Author, term, People, PI) and cards (custitem, authordoc, termdoc, docdoc, termterm, expgene, genegene (ppi), expPI) from before.]
For distance CARMA relationships, quality (e.g., supp or conf or other measures) can be measured using information on any or all cards along the relationship; multiple cards can contribute factors, terms, or contributions combined in some other way.

Generalized CARs

First, we propose a definition of Generalized Association Rules (GARs) which contains the standard "1 Entity Itemset" AR definition as a special case. Given relationships R1, R2 (RoloDex cards) with a shared entity (axis), F, so that E—R1—F—R2—G, and given A⊆E and C⊆G, then A⇒C is a Generalized F Association Rule, with
Support_{R1,R2}(A⇒C) = |{t∈F | ∃a∈A, (a,t)∈R1 and ∃c∈C, (c,t)∈R2}|
Confidence_{R1,R2}(A⇒C) = Support_{R1,R2}(A⇒C) / Support_{R1}(A),
where, as always, Support_{R1}(A) = |{t∈F | ∃a∈A, (a,t)∈R1}|.
If E=G, the GAR is a standard AR iff A∩C=∅.
Association Pathway Mining (APM), a data mining technique (with applications to bioinformatics?), is the identification and assessment (e.g., support, confidence, etc.) of chains of GARs in a RoloDex. Restricting to the mining of cousin GARs reduces the number of strong rules or pathway links.
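A minimal sketch (both relations invented) of the Generalized F Association Rule support and confidence over two cards sharing the axis F:

```python
R1 = {("e1", "f1"), ("e2", "f1"), ("e2", "f2")}   # card E--F
R2 = {("g1", "f1"), ("g2", "f2")}                 # card F--G
F = {"f1", "f2", "f3"}

def supp_R1(A):
    """Support_{R1}(A) = |{t in F : exists a in A with (a, t) in R1}|."""
    return len({t for t in F if any((a, t) in R1 for a in A)})

def supp_gar(A, C):
    """|{t in F : exists a in A,(a,t) in R1 and exists c in C,(c,t) in R2}|."""
    return len({t for t in F
                if any((a, t) in R1 for a in A)
                and any((c, t) in R2 for c in C)})

def conf_gar(A, C):
    return supp_gar(A, C) / supp_R1(A)

A, C = {"e2"}, {"g2"}
print(supp_gar(A, C), conf_gar(A, C))   # 1 0.5
```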

More generally, for entities E, F, G and relationships R1(E,F) and R2(F,G), with A⊆E and C⊆G (A ⊆ E —R1— F —R2— G ⊇ C):
Support-Set_{R1,R2}(A,C) = SS_{R1,R2}(A,C) = {t∈F | ∀a∈A, (a,t)∈R1 and ∀c∈C, (c,t)∈R2}.
If the F instances carry real labels, Label-Weighted-Support_{R1,R2}(A,C) = LWS_{R1,R2}(A,C) = Σ_{t∈SS_{R1,R2}} label(t) (the un-generalized case occurs when all labels are 1).
Downward closure property of Support Sets: SS(A',C') ⊇ SS(A,C) ∀A'⊆A, C'⊆C. Therefore, if all labels are non-negative, then LWS(A,C) ≤ LWS(A',C'); so a necessary condition for LWS(A,C) to exceed a threshold is that all LWS(A',C') exceed that threshold, ∀A'⊆A, C'⊆C. So an Apriori-like frequent-set-pair miner would go as follows: start with pairs of 1-sets (in E and G); the only candidate 2-antecedents with 1-consequents (equivalently, 2-consequents with 1-antecedents) would be those formed by joining...
The weighted support concept can be extended to the case where R1 and/or R2 have labels as well. Vertical methods can be applied by converting F to vertical format (the F instances are the rows, and pertinent features from other cards/axes are "rolled over" to F as derived feature attributes).
[Figure: axes E, F, G with cards R1 and R3, labels l2,2 and l2,3 on F, and the support set SS_{R1,R2} of A and C.]
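A sketch (hypothetical labels and relations) of label-weighted support and its anti-monotonicity: shrinking A or C only relaxes the universal conditions, so SS grows and, with non-negative labels, LWS cannot decrease:

```python
R1 = {("e1", "f1"), ("e2", "f1"), ("e2", "f2")}
R2 = {("g1", "f1"), ("g2", "f2")}
F = {"f1", "f2"}
label = {"f1": 2.0, "f2": 0.5}            # real labels on the F instances

def SS(A, C):
    """{t in F : (a,t) in R1 for all a in A and (c,t) in R2 for all c in C}."""
    return {t for t in F
            if all((a, t) in R1 for a in A)
            and all((c, t) in R2 for c in C)}

def LWS(A, C):
    return sum(label[t] for t in SS(A, C))

A, C = {"e1", "e2"}, {"g1"}
for A_sub in ({"e1"}, {"e2"}):            # every 1-subset A' of A
    assert SS(A, C) <= SS(A_sub, C)       # SS(A',C') contains SS(A,C)
    assert LWS(A, C) <= LWS(A_sub, C)     # non-negative labels => LWS order
print(LWS(A, C))                          # 2.0
```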

VERTIGO: A Vertically Structured Rendition of the GO (Gene Ontology)?

How do we include GO data in the data mining process?
1. Treat it as a horizontally structured dataset.
2. View GO as a Gene Set hierarchy (that seems to be how it is most often used?) with the other aspects of it as node labels. One could then minimize it as a subset of the Set Enumeration Tree with highly structured labels.
3. Preprocess pertinent GO information into derived attributes on a Gene Table.
4. Use the RoloDex Model for it. Preliminary thoughts on this alternative: each of the three major annotation areas (Molecular Function, Cellular Location, Biological Process) becomes a Gene-to-Annotation card.

APPENDIX: Vertical Select-Project-Join (SPJ) Queries

A Select-Project-Join query has joins, selections and projections. Typically there is a central fact relation to which several dimension relations are joined (a standard STAR data warehouse). E.g., the Student (S), Course (C), Enrol (E) STAR DB below (the bit encoding of each numeric value is shown beside it):

S|s    |name |gen|     C|c    |name|st|term|     E|s    |c    |grade|
 |0 000|CLAY |M 0|      |0 000|BI  |ND|F 0 |      |0 000|1 001|B 10 |
 |1 001|THAIS|M 0|      |1 001|DB  |ND|S 1 |      |0 000|0 000|A 11 |
 |2 010|GOOD |F 1|      |2 010|DM  |NJ|S 1 |      |3 011|1 001|A 11 |
 |3 011|BAID |F 1|      |3 011|DS  |ND|F 0 |      |3 011|3 011|D 00 |
 |4 100|PERRY|M 0|      |4 100|SE  |NJ|S 1 |      |1 001|3 011|D 00 |
 |5 101|JOAN |F 1|      |5 101|AI  |ND|F 0 |      |1 001|0 000|B 10 |
                                                  |2 010|2 010|B 10 |
                                                  |2 010|3 011|A 11 |
                                                  |4 100|4 100|B 10 |
                                                  |5 101|5 101|B 10 |

The vertical bit-sliced (uncompressed) attributes are stored as one bit column per bit position; reading each column down the rows of the tables above gives:
S.s2=000011  S.s1=001100  S.s0=010101  S.g=001101
C.c2=000011  C.c1=001100  C.c0=010101  C.t=011010
E.s2=0000000011  E.s1=0011001100  E.s0=0011110001
E.c2=0000000011  E.c1=0001101100  E.c0=1011100101
E.g1=1110011111  E.g0=0110000100

The vertical (un-bit-sliced) attributes are stored as:
S.name: |CLAY |THAIS|GOOD |BAID |PERRY|JOAN |
C.name: |BI|DB|DM|DS|SE|AI|
C.st:   |ND|ND|NJ|ND|NJ|ND|
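The bit-slicing itself is mechanical; a minimal Python sketch (just the slicing, not the P-tree structure) that reproduces the S.s columns listed above:

```python
def bit_slices(values, nbits):
    """Map each bit position to that bit of every value, read down the rows."""
    return {b: "".join(str((v >> b) & 1) for v in values)
            for b in range(nbits - 1, -1, -1)}

S_s = [0, 1, 2, 3, 4, 5]                 # the S.s column of the STAR DB
for b, col in bit_slices(S_s, 3).items():
    print(f"S.s{b} = {col}")
# S.s2 = 000011
# S.s1 = 001100
# S.s0 = 010101
```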

Vertical preliminary Select-Project-Join Query Processing (SPJ)

In the SCORE database (Students, Courses, Offerings, Rooms, Enrollments), numeric attributes are represented vertically as P-trees (not compressed); categorical attributes are projected to one-column vertical files.

S|s    |n|gen|     C|c   |n|cred |     R|r   |cap  |
 |0 000|A|M  |      |0 00|B|1 01 |      |0 00|30 11|
 |1 001|T|M  |      |1 01|D|3 11 |      |1 01|20 10|
 |2 100|S|F  |      |2 10|M|3 11 |      |2 10|30 11|
 |3 111|B|F  |      |3 11|S|2 10 |      |3 11|10 01|
 |4 010|C|M  |
 |5 011|J|F  |

O|o    |c   |r   |     E|s    |o    |grade|
 |0 000|0 00|0 01|      |0 000|1 001|2 10 |
 |1 001|0 00|1 01|      |0 000|0 000|3 11 |
 |2 010|1 01|0 00|      |3 011|1 001|3 11 |
 |3 011|1 01|1 01|      |3 011|3 011|0 00 |
 |4 100|2 10|0 00|      |1 001|3 011|0 00 |
 |5 101|2 10|2 10|      |1 001|0 000|2 10 |
 |6 110|2 10|3 11|      |2 010|2 010|2 10 |
 |7 111|3 11|2 10|      |2 010|7 111|3 11 |
                        |4 100|4 100|2 10 |
                        |5 101|5 101|2 10 |

SELECT S.n, C.n
FROM   S, C, O, R, E
WHERE  S.s=E.s & C.c=O.c & O.o=E.o & O.r=R.r & S.g=M & C.r=2 & E.g=A & R.c=20;

[Figure: the corresponding vertical columns S.s2..S.s0, S.n, S.g, C.c1..C.c0, C.n, C.r, O.o2..O.o0, O.c, O.r, R.r, R.c, E.s, E.o, E.g, in decimal and binary.]

SELECT S.n, C.n
FROM   S, C, O, R, E
WHERE  S.s=E.s & C.c=O.c & O.o=E.o & O.r=R.r & S.g=M & C.r=2 & E.g=A & R.c=20;

For the selections, S.g=M=1b, C.r=2=10b, E.g=A=11b, R.c=20=10b, create the selection masks using ANDs and COMPLEMENTS:
SM   = P_S.g                  (rows where S.g = 1b)
Cr2  = P_C.r1 ∧ P'_C.r0       (rows where C.r = 10b)
EgA  = P_E.g1 ∧ P_E.g0        (rows where E.g = 11b)
Rc20 = P_R.c1 ∧ P'_R.c0       (rows where R.c = 10b)
Then apply these selection masks (zero out the numeric values of non-qualifying rows; blank out the others).
[Figure: the vertical bit columns of S, C, O, R and E before and after masking.]
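A sketch of the mask construction using Python ints as row bit-vectors (convention assumed here: first row = most significant bit; the columns below are read off the E and C tables above):

```python
NE, NC = 10, 4                            # rows in E and C

P_Eg1 = int("1110011111", 2)              # E.g high bit, down the 10 E rows
P_Eg0 = int("0110000100", 2)              # E.g low bit
EgA = P_Eg1 & P_Eg0                       # E.g = A = 11b: AND both bits
print(f"{EgA:0{NE}b}")                    # 0110000100

P_Cr1 = int("0111", 2)                    # C.r high bit, down the 4 C rows
P_Cr0 = int("1110", 2)                    # C.r low bit
Cr2 = P_Cr1 & ~P_Cr0 & ((1 << NC) - 1)    # C.r = 2 = 10b: AND with complement
print(f"{Cr2:0{NC}b}")                    # 0001
```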

SELECT S.n, C.n
FROM   S, C, O, R, E
WHERE  S.s=E.s & C.c=O.c & O.o=E.o & O.r=R.r & S.g=M & C.r=2 & E.g=A & R.c=20;

For the joins, S.s=E.s, C.c=O.c, O.o=E.o, O.r=R.r, one approach is to follow an indexed-nested-loop-like method (noting that attribute P-trees ARE an index for that attribute). The join O.r=R.r is simply part of a selection on O (R contributes no output and participates in no further operations):
Use the Rc20-masked R as the outer relation.
Use O as the indexed inner relation to produce that O-selection mask.
Get the 1st R.r value, 01b (there is only one). Mask the O tuples: OM = P'_O.r1 ∧ P_O.r0.
This is the only R.r value (if there were more, one would do the same for each, then OR those masks to get the final O-mask).
Next, we apply the O-mask, OM, to O.
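The same idiom in Python (row bit-vectors as ints, with the O.r columns read off the O table above): for the one surviving R.r value, 01b, the O-mask is the AND of the complemented high bit with the low bit:

```python
N_O = 8
ALL = (1 << N_O) - 1

P_Or1 = int("00000111", 2)    # O.r high bit, down the 8 O rows
P_Or0 = int("11010010", 2)    # O.r low bit

OM = (~P_Or1 & ALL) & P_Or0   # rows where O.r = 01b
print(f"{OM:08b}")            # 11010000  (o = 0, 1, 3)

# With more surviving R.r values, build one mask per value and OR them.
```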

SELECT S.n, C.n
FROM   S, C, O, R, E
WHERE  S.s=E.s & C.c=O.c & O.o=E.o & O.r=R.r & S.g=M & C.r=2 & E.g=A & R.c=20;

For the final three joins, C.c=O.c, O.o=E.o, E.s=S.s, the same indexed-nested-loop-like method can be used:
Get the 1st masked C.c value, 11b. Mask the corresponding O tuples: P_O.c1 ∧ P_O.c0.
Get the 1st masked O.o value, 111b. Mask the corresponding E tuples: P_E.o2 ∧ P_E.o1 ∧ P_E.o0.
Get the 1st masked E.s value, 010b. Mask the corresponding S tuples: P'_S.s2 ∧ P_S.s1 ∧ P'_S.s0.
Get the S.n value(s), C; pair each with the C.n value(s), S; output the concatenation, C.n S.n.
There was just one masked tuple at each stage in this example. In general, one would loop through the masked portion of the extant domain at each level (thus, Indexed Horizontal Nested Loop, or IHNL).
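A generic helper (a sketch, same int-as-bit-vector convention as above) that builds the equality mask for any bit-sliced attribute, shown on the C.c=O.c step:

```python
def value_mask(bit_cols_lsb_first, value, nrows):
    """Rows where the bit-sliced attribute equals `value`.
    bit_cols_lsb_first[i] is the bit-i column as an int bit-vector."""
    all_rows = (1 << nrows) - 1
    m = all_rows
    for i, col in enumerate(bit_cols_lsb_first):
        # AND in the column itself where value has a 1, its complement where 0
        m &= col if (value >> i) & 1 else (~col & all_rows)
    return m

O_c0 = int("00110001", 2)     # O.c low bit, down the 8 O rows
O_c1 = int("00001111", 2)     # O.c high bit
print(f"{value_mask([O_c0, O_c1], 0b11, 8):08b}")   # 00000001 (only o = 7)
```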

Vertical Select-Project-Join-Classification Query

Given the previous SCORE training database (not presented as a single training table), predict which course a male student will register for, given that he got an A in a previous course held in a room with a capacity of 20. This is a matter of first applying the previous complex SPJ query to produce the pertinent training table, and then classifying the unclassified sample (e.g., using 1-nearest-neighbour classification). The result of the SPJ is the single-row training set, (S, C), and so the prediction is course=C.
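Under the stated assumptions the classification step is trivial to sketch: the SPJ output is a one-row training table, so 1-nearest-neighbour returns its class label no matter which distance is used (the feature encoding below is a hypothetical stand-in for the SPJ output row):

```python
training = [(("S",), "C")]    # (features, class) rows from the SPJ result

def predict_1nn(sample, table,
                dist=lambda a, b: sum(x != y for x, y in zip(a, b))):
    # Return the class of the training row nearest to the sample
    return min(table, key=lambda row: dist(row[0], sample))[1]

print(predict_1nn(("S",), training))   # C
```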