PARTIALLY ORDERED SET DIRECTED ACYCLIC GRAPH


Toward a Unified Theory of Data Mining

DUALITIES: PARTITION ↔ FUNCTION ↔ EQUIVALENCE RELATION ↔ UNDIRECTED GRAPH

Assume a Partition has uniquely labeled components (required for unambiguous reference).
The Partition-Induced Function takes a point to the label of its component.
The Function-Induced Equivalence Relation equates a pair iff they map to the same value.
The Equivalence Relation-Induced Undirected Graph has an edge for each equivalent pair.
The Undirected Graph-Induced Partition is its connected-component partition.

DUALITY: PARTIALLY ORDERED SET ↔ DIRECTED ACYCLIC GRAPH

The Directed Acyclic Graph-Induced Partially Ordered Set contains (s1,s2) iff it is an edge in the transitive closure of the graph.
The Partially Ordered Set-Induced Directed Acyclic Graph contains (s1,s2) iff it is in the POSET.
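The chain of induced structures can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names and the toy partition are hypothetical, a partition is represented as a dict from label to component, and the DAG-induced order is taken to be the transitive closure of the edge set.

```python
from itertools import combinations

def partition_to_function(partition):
    """Partition-induced function: each point maps to the label of its component."""
    return {x: label for label, comp in partition.items() for x in comp}

def function_to_equivalence(f):
    """Function-induced equivalence: unordered pairs of distinct points with equal values."""
    return {frozenset((x, y)) for x, y in combinations(f, 2) if f[x] == f[y]}

def equivalence_to_graph(points, pairs):
    """Equivalence-induced undirected graph: one edge per equivalent pair."""
    adj = {x: set() for x in points}
    for pair in pairs:
        x, y = tuple(pair)
        adj[x].add(y)
        adj[y].add(x)
    return adj

def graph_to_partition(adj):
    """Graph-induced partition: the connected components of the graph."""
    seen, comps = set(), set()
    for start in adj:
        if start in seen:
            continue
        comp, stack = set(), [start]
        while stack:
            x = stack.pop()
            if x not in comp:
                comp.add(x)
                stack.extend(adj[x] - comp)
        seen |= comp
        comps.add(frozenset(comp))
    return comps

def dag_to_poset(edges):
    """DAG-induced order: the transitive closure of the edge set."""
    closure = set(edges)
    changed = True
    while changed:
        changed = False
        for a, b in list(closure):
            for c, d in list(closure):
                if b == c and (a, d) not in closure:
                    closure.add((a, d))
                    changed = True
    return closure

# Hypothetical toy data: three labeled components over the points 1..6.
partition = {"red": {1, 2, 3}, "blue": {4}, "green": {5, 6}}
f = partition_to_function(partition)
pairs = function_to_equivalence(f)
adj = equivalence_to_graph(f.keys(), pairs)
recovered = graph_to_partition(adj)
order = dag_to_poset({(1, 2), (2, 3)})
```

Going around the loop recovers the same components but not their labels, which is why the slide asks for uniquely labeled components up front.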

DUALITIES: GRAPH MODEL ↔ CUBE MODEL (multidimensional) ↔ TABLE MODEL (Relation)

CUBE MODEL ↔ TABLE MODEL: in the Cube Model, every potential tuple is bitmapped.

GRAPH MODEL ↔ CUBE MODEL: Given a GRAPH G=(N,E), where N = nodes and E = edges, and assuming N and E have attributes (and therefore primary keys, alternate keys, ...), each node and each edge has associated with it a structure or tuple of descriptive information (its LABEL). If uniformly structured relational tuple labels suffice (we will assume that), we write N(NL1..NLn) and E(EL1..ELm).

Given a LABELLED GRAPH G(N(L1..Ln), E(L1..Lm)), it can be expressed losslessly as a DATA CUBE (in fact, gainfully). If it is a multi-hyper-graph with up to h k-edges (h k-polygonal edges), then we build a k-cube with each dimension as N (i.e., N^k). For a k-partite graph, build a k-cube either with each dimension as N (sparse cube) or with each part, Ni, i=1..k, as a dimension (dense cube). (For a Directed Graph, the horizontal dimension will always be taken to be the initial node of each edge.)
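For the simplest case (k = 2, an unlabeled directed graph), the graph-to-cube direction and its inverse might look like the following sketch. The function names and the three-node graph are hypothetical, and indexing the first dimension by the initial node is an assumption about the layout.

```python
def graph_to_cube(nodes, edges):
    """Bitmap every potential edge: cube[i][j] is 1 iff (nodes[i], nodes[j]) is an edge.

    The first index is the initial node of the edge, the second the terminal node.
    """
    index = {n: k for k, n in enumerate(nodes)}
    cube = [[0] * len(nodes) for _ in nodes]
    for initial, terminal in edges:
        cube[index[initial]][index[terminal]] = 1
    return cube

def cube_to_edges(nodes, cube):
    """Recover the edge list from the bitmap (the lossless direction back)."""
    return {(nodes[i], nodes[j])
            for i, row in enumerate(cube)
            for j, bit in enumerate(row) if bit}

# Hypothetical toy graph.
nodes = ["a", "b", "c"]
edges = {("a", "b"), ("b", "c"), ("a", "c")}
cube = graph_to_cube(nodes, edges)
positions = len(nodes) ** 2          # one position per potential edge
extant = sum(map(sum, cube))         # number of edges that actually exist
```

Here the cube holds 9 positions for 3 extant edges, which is the size cost the cube model pays for bitmapping potential rather than extant instances.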

RoloDex Model

The Data Cube Model gives a great picture of relationships, but can become gigantic (instances are bitmapped rather than listed, so there needs to be a position for each potential instance, not just each extant instance).

This inefficiency is especially severe in the very common Bipartite - Unipartite on Part (BUP) relationships. Examples:
In Bioinformatics, bipartite relationships between genes (one entity) and experiments or treatments (another entity) are studied in conjunction with unipartite relationships on the gene part (e.g., gene-gene or protein-protein interactions).
In Market Research, bipartite relationships between items and customers are studied in conjunction with unipartite relationships on the customer part (or on the product part, or both).

For this situation, the Relational Model provides no picture and the Data Cube Model is too inefficient (it requires that the unipartite relationship be redundantly replicated for every instance of the other bi-part). We suggest the RoloDex Model.

The Bipartite, Unipartite-on-Part Experiment Gene Relationship, EGG

[Figure: a card over the axes Exp and G, and a card over two copies of the axis G.]

So as not to duplicate axes, this copy of G should be folded over to coincide with the other copy, producing a "conical" unipartite card.

RoloDex Model

For every Axis-Card pair (Entity-Relationship pair), ac(a,b), there is a support count for AxisSets (or ratio or %). For every AxisSet A:
for a graph relationship, suppG(A, ac(a,b)) = |{b : ∃a∈A, (a,b)∈c}|;
for a multigraph, suppMG is the histogram over b of (a,b)-EdgeCounts, a∈A.
Other quantifiers can be used also (e.g., the universal, ∀, is used in MBR).

Supp(A) = CusFreq(ItemSet)
Conf(A→B) = Supp(A∪B)/Supp(A)

Most interestingness measures are based on one of these supports:
In IR, df(t) = suppG({t}, tc(t,d)); tf(t,d) is the one histogram bar in suppMG({t}, tc(t,d)).
In MBR, supp(I) = suppG(I, ic(i,t)).
In MDA, suppMG(GSet, gc(g,e)).
Of course all supports are inherited redundantly by the card, c(a,b).

[Figure: RoloDex with axes ItemSet, Customer, Item, People, Author, PI, Doc, term, Gene and Exp, and cards itemset-itemset, cust-item, term-doc, author-doc, doc-doc, term-term (share stem?), gene-gene (ppi), exp-PI and exp-gene.]
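A small Python sketch of these supports may help fix the ideas. Everything named here is hypothetical: a card is modeled as a set of (a, b) pairs, a multigraph card as a dict from (a, b) to an edge count, and the quantifier is passed in as Python's built-in any (existential) or all (universal).

```python
from collections import Counter

def supp_g(A, card, quantifier=any):
    """Count the b values related to some (any) or to every (all) a in A."""
    bs = {b for _, b in card}
    return sum(1 for b in bs if quantifier((a, b) in card for a in A))

def supp_mg(A, multicard):
    """Histogram over b of the (a, b) edge counts, for a in A."""
    hist = Counter()
    for (a, b), count in multicard.items():
        if a in A:
            hist[b] += count
    return hist

def conf(A, B, card):
    """Confidence Supp(A ∪ B) / Supp(A), using the universal quantifier as in MBR."""
    return supp_g(A | B, card, all) / supp_g(A, card, all)

# Hypothetical item-customer card: (item, customer) pairs.
card = {("milk", 1), ("bread", 1), ("milk", 2),
        ("bread", 3), ("milk", 3), ("eggs", 3)}
existential = supp_g({"milk", "bread"}, card, any)   # customers buying either item
universal = supp_g({"milk", "bread"}, card, all)     # customers buying both items
rule_conf = conf({"milk"}, {"bread"}, card)

# Hypothetical term-document multigraph card: edge counts are term occurrences.
termdoc = {("graph", "d1"): 3, ("graph", "d2"): 1, ("cube", "d1"): 2}
tf_hist = supp_mg({"graph"}, termdoc)                # tf("graph", d) per document
df_graph = supp_g({"graph"}, set(termdoc))           # number of documents with "graph"
```

For a singleton AxisSet such as {t} the two quantifiers coincide, which is why df(t) can be read off suppG either way.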

Cousin Association Rule Mining Approach (CARMA)

For every card (RELATIONSHIP) c(I,T) one has
Association Rules among disjoint Isets, A→C, for all A,C ⊆ I with A∩C=∅, and
Association Rules among disjoint Tsets, A→C, for all A,C ⊆ T with A∩C=∅.

Two measures of quality of A→C are:
SUPP(A→C), where e.g., for any Iset, A, SUPP(A) ≡ |{ t | (i,t)∈E ∀i∈A }|
CONF(A→C) = SUPP(A→C)/SUPP(A)

First Cousin Association Rules (inspired by Dietmar's work!): Given any card sharing an axis with the bipartite relationship, B(T,I), e.g., C(T,U), Cousin Association Rules are those in which the antecedent Tset is generated by a subset, S, of U as follows: {t∈T | ∃u∈S such that (t,u)∈C}. (Note this should be called an "existential first cousin AR" since we are using the existential quantifier. One can also use the universal quantifier, as in MBR ARs.)
E.g., if S ⊆ U, A=C(S) and A' ⊆ T, then A→A' is a CAR and we can also label it S→A'.

First Cousin Association Rules Once Removed (FCAR1Rs) are those in which both Tsets are generated by another bipartite relationship, and we can label the antecedent and/or the consequent using the generating set or the Tset.
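As a rough illustration, the existential first cousin rule can be computed as below. The names, the toy cards and the reading of SUPP for a Tset (the number of I instances related through B to every member of the Tset, by symmetry with the Iset definition) are assumptions, not taken from the slides.

```python
def generate_tset(S, C, quantifier=any):
    """Tset generated by S ⊆ U through the card C(T,U).

    With quantifier=any this is {t : some u in S has (t, u) in C}.
    """
    ts = {t for t, _ in C}
    return {t for t in ts if quantifier((t, u) in C for u in S)}

def supp(tset, B):
    """Number of i related through B(T,I) to every t in the Tset (assumed reading)."""
    items = {i for _, i in B}
    return sum(1 for i in items if all((t, i) in B for t in tset))

def conf(antecedent, consequent, B):
    """CONF(A→A') = SUPP(A ∪ A') / SUPP(A)."""
    return supp(antecedent | consequent, B) / supp(antecedent, B)

# Hypothetical cards: C(T,U) as (t, u) pairs and B(T,I) as (t, i) pairs.
C = {("t1", "u1"), ("t2", "u1"), ("t3", "u2")}
B = {("t1", "i1"), ("t2", "i1"), ("t3", "i1"),
     ("t1", "i2"), ("t2", "i2"), ("t3", "i3")}

S = {"u1"}
A = generate_tset(S, C)      # antecedent Tset generated by the cousin set S
A_prime = {"t3"}             # consequent Tset, disjoint from A
rule_supp = supp(A | A_prime, B)
rule_conf = conf(A, A_prime, B)
```

Passing quantifier=all to generate_tset gives the universal variant mentioned in the note.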

The Cousin Association Rule Mining Approach (CARMA)

Second Cousin Association Rules are those in which the antecedent Tset is generated by a subset of an axis which shares a card with T, which shares the card, B, with I. 2CARs can be denoted using the generating (second cousin) set or the Tset antecedent.

Second Cousin Association Rules once removed are those in which the antecedent Tset is generated by a subset of an axis which shares a card with T, which shares the card, B, with I, and the consequent is generated by C(T,U) (a first cousin Tset). 2CAR-1rs can be denoted using any combination of the generating (second cousin) set or the Tset antecedent and the generating (first cousin) set or the Tset consequent.

Second Cousin Association Rules twice removed are those in which the antecedent Tset is generated by a subset of an axis which shares a card with T, which shares the card, B, with I, and the consequent is generated by a subset of an axis which shares a card with T, which shares another first cousin card with I. 2CAR-2rs can be denoted using any combination of the generating (second cousin) set or the Tset antecedent and the generating (second cousin) set or the Tset consequent. Note that 2CAR-2rs are also 2CAR-1rs, so they can be denoted as above also.

Third Cousin Association Rules are those....

We note that these definitions give us many opportunities to define quality measures.

Measuring CARMA Quality in the RoloDex Model

For Distance CARMA relationships, quality (e.g., supp or conf or???) can be measured using information on any/all cards along the relationship (multiple cards can contribute factors or terms or in some other way???).

[Figure: RoloDex with axes Customer, Item, People, Author, PI, Doc, term, Gene and Exp, and cards cust-item, term-doc, author-doc, doc-doc, term-term (share stem?), gene-gene (ppi), exp-PI and exp-gene.]

Generalized CARMA

First, we propose a definition of Generalized Association Rules (GARs) which contains the standard "1 Entity Itemset" AR definition as a special case. Association Pathway Mining (APM) is a DM technique (with application to bioinformatics?).

Given Relationships R1, R2 (RoloDex cards) with shared Entity F (axis), E -R1- F -R2- G, and given A⊆E and C⊆G, then A→C is a Generalized F Association Rule, with

Support_R1R2(A→C) = | {t∈E2 | ∀a∈A, (a,t)∈R1 and ∀c∈C, (c,t)∈R2} |
Confidence_R1R2(A→C) = Support_R1R2(A→C) / Support_R1(A)

where, as always, Support_R1(A) = |{t∈F | ∀a∈A, (a,t)∈R1}|. If E=G, the GAR is a standard AR iff A∩C=∅.

Association Pathway Mining (APM) is the identification and assessment (e.g., support, confidence, etc.) of chains of GARs in a RoloDex. Restricting to the mining of cousin GARs reduces the number of strong rules or pathway links.
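The generalized support and confidence translate almost directly into code. In this sketch the function names and toy cards are hypothetical, and t is assumed to range over the shared axis F (the slide writes E2 in one formula and F in the other).

```python
def support_set(A, C, R1, R2, F):
    """{t in F : every a in A has (a, t) in R1 and every c in C has (c, t) in R2}."""
    return {t for t in F
            if all((a, t) in R1 for a in A) and all((c, t) in R2 for c in C)}

def support(A, C, R1, R2, F):
    """Support of the generalized rule A→C over the cards R1 and R2."""
    return len(support_set(A, C, R1, R2, F))

def confidence(A, C, R1, R2, F):
    """Support(A→C) / Support_R1(A); an empty consequent gives Support_R1(A)."""
    return support(A, C, R1, R2, F) / support(A, set(), R1, R2, F)

# Hypothetical cards: R1(E,F) as (gene, experiment), R2(G,F) as (PI, experiment).
F = {"x1", "x2", "x3"}
R1 = {("g1", "x1"), ("g1", "x2"), ("g2", "x1"), ("g2", "x2"), ("g2", "x3")}
R2 = {("p1", "x1"), ("p1", "x3"), ("p2", "x2")}

A = {"g1", "g2"}
C = {"p1"}
rule_support = support(A, C, R1, R2, F)
rule_confidence = confidence(A, C, R1, R2, F)
```

With R1 = R2 and E = G this reduces to the usual market-basket support and confidence for disjoint itemsets.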

More generally, for entities E, F, G and relationships R1(E,F) and R2(F,G), with A ⊆ E -R1- F -R2- G ⊇ C:

Support-Set_R1R2(A,C) = SS_R1R2(A,C) = {t∈E2 | ∀a∈A (a,t)∈R1, ∀c∈C (c,t)∈R2}

If E2 has real labels,
Label-Weighted-Support_R1R2(A,C) = LWS_R1R2(A,C) = Σ_{t∈SS_R1R2(A,C)} label(t)
(the un-generalized case occurs when all weights are 1).

Downward closure property of Support Sets: SS(A',C') ⊇ SS(A,C) for all A'⊆A, C'⊆C.
Therefore, if all labels are non-negative, then LWS(A,C) ≤ LWS(A',C') (a necessary condition for LWS(A,C) to exceed a threshold is that all LWS(A',C') exceed that threshold, for all A'⊆A, C'⊆C).

So an Apriori-like frequent set pair mine would go as: Start with pairs of 1-sets (in E and G). The only candidate 2-antecedents with 1-consequents (equivalently, 2-consequents with 1-antecedents) would be those formed by joining ...

The weighted support concept can be extended to the case where R1 and/or R2 have labels as well.

Vertical methods can be applied by converting F to vertical format (F instances are the rows and pertinent features from other cards/axes are "rolled over" to F as derived feature attributes).

[Figure: axes E, F and G with cards R1 and R3, the sets A and C, the support set SS_R1R2, and labels l2,2 and l2,3.]
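The downward closure argument can be checked numerically on a toy example. The sketch below is an assumption-laden illustration: names and data are hypothetical, labels are attached to the instances of the shared axis (taken here to be F), and the check simply enumerates all subsets of A and C.

```python
from itertools import combinations

def support_set(A, C, R1, R2, F):
    """{t in F : every a in A has (a, t) in R1 and every c in C has (c, t) in R2}."""
    return {t for t in F
            if all((a, t) in R1 for a in A) and all((c, t) in R2 for c in C)}

def lws(A, C, R1, R2, labels):
    """Label-weighted support: the sum of label(t) over the support set."""
    return sum(labels[t] for t in support_set(A, C, R1, R2, labels))

def subsets(s):
    """All subsets of s, including the empty set and s itself."""
    items = sorted(s)
    return [set(c) for r in range(len(items) + 1) for c in combinations(items, r)]

# Hypothetical cards and non-negative labels on the shared axis.
R1 = {("g1", "x1"), ("g1", "x2"), ("g2", "x1"), ("g2", "x2"), ("g2", "x3")}
R2 = {("p1", "x1"), ("p1", "x3"), ("p2", "x2")}
labels = {"x1": 2.0, "x2": 0.5, "x3": 1.0}

A, C = {"g1", "g2"}, {"p1"}
full = lws(A, C, R1, R2, labels)

# Every sub-pair (A', C') has a support set containing SS(A, C),
# hence a label-weighted support at least as large.
closure_holds = all(
    support_set(A2, C2, R1, R2, labels) >= support_set(A, C, R1, R2, labels)
    and lws(A2, C2, R1, R2, labels) >= full
    for A2 in subsets(A) for C2 in subsets(C)
)

# With all weights equal to 1 the weighted support is the plain support count.
unit = lws(A, C, R1, R2, {t: 1 for t in labels})
```

This is the property an Apriori-style miner would rely on: a candidate pair can be pruned as soon as any of its sub-pairs falls below the threshold.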

VERTIGO: A Vertically Structured Rendition of the GO (Gene Ontology)?

How do we include GO data into the Data Mining processes?
1. Treat it as a horizontally structured dataset.
2. View GO as a Gene Set hierarchy (that seems to be how it is used, often?) with the other aspects of it as node labels. One could then minimize it, as a subset of the Set Enumeration Tree with highly structured labels?
3. Preprocess pertinent GO information into derived attributes on a Gene Table.
4. Use the RoloDex Model for it? Preliminary thoughts on this alternative include: Each of the three major annotation areas (Molecular Function, Cellular Location, Biological Process) is a Gene-to-Annotation Card.