DUALITIES: PARTITION ↔ FUNCTION ↔ EQUIVALENCE RELATION ↔ UNDIRECTED GRAPH


DUALITIES: PARTITION ↔ FUNCTION ↔ EQUIVALENCE RELATION ↔ UNDIRECTED GRAPH
Assume a Partition has uniquely labeled components (required for unambiguous reference).
The Partition-Induced Function takes a point to the label of its component.
The Function-Induced Equivalence Relation equates a pair iff they map to the same value.
The Equivalence Relation-Induced Undirected Graph has an edge for each equivalent pair.
The Undirected Graph-Induced Partition is its connected-component partition.
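
The cycle of inductions on this slide can be traced in a few lines of Python. This is only an illustrative sketch: the function names and the sample partition are made up for the example, and the graph is stored as a set of unordered pairs.

```python
from itertools import combinations

def partition_to_function(partition):
    """Partition-induced function: map each point to the label of its component."""
    return {x: label for label, comp in partition.items() for x in comp}

def function_to_equivalence(f):
    """Function-induced equivalence: unordered pairs of distinct points with equal value."""
    return {frozenset((a, b)) for a, b in combinations(f, 2) if f[a] == f[b]}

def graph_to_partition(nodes, edges):
    """Graph-induced partition: connected components, as a set of frozensets."""
    adj = {n: set() for n in nodes}
    for e in edges:
        a, b = tuple(e)
        adj[a].add(b)
        adj[b].add(a)
    seen, comps = set(), set()
    for start in adj:
        if start in seen:
            continue
        stack, comp = [start], set()
        while stack:
            n = stack.pop()
            if n in comp:
                continue
            comp.add(n)
            stack.extend(adj[n] - comp)
        seen |= comp
        comps.add(frozenset(comp))
    return comps

# Hypothetical labeled partition of {1..6}
partition = {"red": {1, 2, 3}, "blue": {4, 5}, "green": {6}}
f = partition_to_function(partition)
edges = function_to_equivalence(f)          # equivalence pairs double as graph edges
recovered = graph_to_partition(f.keys(), edges)
```

Going around the cycle returns the same components. The labels themselves are not carried by the graph, which is why the slide asks for uniquely labeled components up front.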

DUALITIES: PARTIALLY ORDERED SET ↔ DIRECTED ACYCLIC GRAPH
The Directed Acyclic Graph-Induced Partially Ordered Set contains (s1,s2) iff it is an edge in the transitive closure.
The Partially Ordered Set-Induced Directed Acyclic Graph contains (s1,s2) iff it is in the POSET.
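
As a rough illustration of the DAG-to-POSET direction, the sketch below computes the closure of a small DAG by following edges from each node. It assumes the strict reading of the order (no reflexive pairs), and the node names and `dag_to_poset` helper are invented for the example.

```python
def dag_to_poset(nodes, edges):
    """Transitive closure of a DAG's edges, as a set of ordered pairs (strict order)."""
    succ = {n: set() for n in nodes}
    for a, b in edges:
        succ[a].add(b)
    order = set()
    for start in nodes:
        stack, reach = [start], set()
        while stack:
            n = stack.pop()
            for m in succ[n]:
                if m not in reach:
                    reach.add(m)
                    stack.append(m)
        order |= {(start, m) for m in reach}
    return order

# Hypothetical DAG: a -> b -> c and a -> d
nodes = ["a", "b", "c", "d"]
edges = [("a", "b"), ("b", "c"), ("a", "d")]
order = dag_to_poset(nodes, edges)

# The reverse direction is immediate: the pairs of the order are the edges of a DAG.
is_transitive = all((x, z) in order for (x, y) in order for (w, z) in order if y == w)
```

The pair (a, c) is not an edge of the original DAG but does appear in the order, which is the sense in which the POSET is read off the closure rather than off the edge list.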

DUALITIES: GRAPH MODEL ↔ CUBE MODEL (multidimensional) ↔ TABLE MODEL (Relation)
CUBE MODEL ↔ TABLE MODEL: in the Cube Model, every potential tuple is bitmapped.
GRAPH MODEL ↔ CUBE MODEL: Given a GRAPH G=(N,E), where N=nodes and E=edges, assume N and E have attributes (and therefore primary keys, alternate keys, ...). Each node and each edge then has associated with it a structure or tuple of descriptive information (its LABEL). If uniformly structured relational tuple labels suffice (we will assume that they do), the graph is N(NL1..NLn), E(EL1..ELm).

DUALITIES: Given a LABELLED GRAPH G(N(L1..Ln), E(L1..Lm)), it can be expressed losslessly as a DATA CUBE (in fact, gainfully):
If it is a multi-hyper-graph with up to h k-edges (h k-polygonal edges), then we build a k-cube with each dimension as N (i.e., N^k).
For a k-partite graph, build a k-cube, either with each dimension as N (sparse cube) or with each part, N_i, i=1..k, as a dimension (dense cube).
(For a Directed Graph, the horizontal dimension will always be taken to be the initial node of the edge.)
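
To make the sparse versus dense choice concrete for the bipartite case (k = 2), here is a small sketch that bitmaps the same edge set both ways using plain nested lists. The gene and experiment names are hypothetical, and edge labels are ignored to keep the example short.

```python
def dense_cube(part1, part2, edges):
    """One dimension per part: a |part1| x |part2| bitmap."""
    row = {n: i for i, n in enumerate(part1)}
    col = {n: j for j, n in enumerate(part2)}
    cube = [[0] * len(part2) for _ in part1]
    for a, b in edges:
        cube[row[a]][col[b]] = 1
    return cube

def sparse_cube(part1, part2, edges):
    """Every dimension is all of N = part1 + part2: an |N| x |N| bitmap."""
    nodes = list(part1) + list(part2)
    idx = {n: i for i, n in enumerate(nodes)}
    cube = [[0] * len(nodes) for _ in nodes]
    for a, b in edges:
        cube[idx[a]][idx[b]] = 1
    return cube

# Hypothetical bipartite graph between genes and experiments
genes = ["g1", "g2", "g3"]
exps = ["e1", "e2"]
edges = [("g1", "e1"), ("g2", "e1"), ("g3", "e2")]

dense = dense_cube(genes, exps, edges)
sparse = sparse_cube(genes, exps, edges)
dense_cells = sum(len(r) for r in dense)    # 3 * 2 positions
sparse_cells = sum(len(r) for r in sparse)  # 5 * 5 positions
```

Both cubes hold the same three set bits; the sparse form simply reserves a position for every pair of nodes, including pairs within the same part that can never be related.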

RoloDex Model
The Data Cube Model gives a great picture of relationships, but can become gigantic (instances are bitmapped rather than listed, so there needs to be a position for each potential instance, not just each extant instance).
This inefficiency is especially severe in the very common Bipartite - Unipartite on Part (BUP) relationships. Examples:
In Bioinformatics, bipartite relationships between genes (one entity) and experiments or treatments (another entity) are studied in conjunction with unipartite relationships on the gene part (e.g., gene-gene or protein-protein interactions).
In Market Research, bipartite relationships between items and customers are studied in conjunction with unipartite relationships on the customer part (or on the product part, or both).
For this situation, the Relational Model provides no picture and the Data Cube Model is too inefficient (it requires that the unipartite relationship be redundantly replicated for every instance of the other bi-part). We suggest the RoloDex Model.

The Bipartite, Unipartite-on-Part Experiment Gene Relationship, EGG
[Figure: an exp-gene card on axes Exp and G, with a second copy of the G axis carrying the gene-gene card.]
So as not to duplicate axes, this copy of G should be folded over to coincide with the other copy, producing a "conical" unipartite card.

RoloDex Model
Supp(A) = CusFreq(ItemSet)
Conf(A→B) = Supp(A∪B)/Supp(A)
[Figure: RoloDex cards arranged on shared axes. Axes: ItemSet, ItemSet (antecedent), Item, Customer, People, Author, PI, Doc, term, Gene (G), Exp. Cards: itemset-itemset, cust-item, term-doc, author-doc, doc-doc (hyperlink analysis), gene-gene (ppi), exp-PI, exp-gene, term-term (share stem?).]
Each axis, a, inherits a frequency attribute from each of its cards, c(a,b), denoted bf(c.a) ≡ "# of bs related to a" (e.g., df(t) = doc freq of term, t). Of course, bf(c.a) is inherited redundantly by c(a,b).
Each card, c(a,b), inherits a frequency attribute from each of its axes, a [b], denoted af(a,b) ≡ "# times a is related to b in c" [bf(a,b) ≡ "# times b~a in c"].
Each card, c(a,b), can be expanded by each of its axes, e.g., a, to a-sets (each a value is identified with the singleton, {a}) (e.g., itemsets in MBR), or a-sets can become a new axis (e.g., doc in IR). Note: if term is expanded by singleton termsets to be part of doc, then the termdoc card becomes a cone (see first slide).
Next we put some of the descriptive attributes in their places.
Note: Conf / non-conf rules partition the itemset-itemset card. Can we usefully list confident rules by specifying the boundary (SVM style)? That presupposes spatial continuity of conf rules (which may not be a correct assumption), but it may be on another similar card?

RoloDex Model combining term-doc (a term is a document with 1 term in it) and item-itemset
[Figure: RoloDex cards arranged on shared axes. Axes: Item, ItemSet, ItemSet (antecedent), Customer, People, Author, PI, Gene, Exp. Cards: itemset-itemset, cust-itemset, term-doc, author-doc, term-term, gene-gene (ppi), exp-gene, exp-PI.]
ItemSet can be replaced by ItemBag (allowing duplicates and promoting count analysis). This is an accommodation of the multi-graph situation.

RoloDex uncombining term-doc and item-itemset (using itembag (basket), so that the item count in a basket is defined)
[Figure: RoloDex cards arranged on shared axes. Axes: Item, ItemBag, Customer, People, Author, PI, Doc, term, Gene (G), Exp. Cards: itembag-itembag, cust-itembag, term-doc, author-doc, doc-doc, gene-gene (ppi), exp-PI, exp-gene, term-term (share stem?).]
What is term frequency? doc frequency?
TD is a bag-edged graph, i.e., Edge(TD) is a bag, meaning an edge can occur multiple times (the same term "can occur in" a doc many times). If we don't distinguish those occurrences other than by existence (we could distinguish them into type classes, e.g., verb, noun, ...), then TD can be realized as a set-edged graph with a count label; otherwise we must use a bag-edged graph with a type label. Usually, TD is the former and the count label is term frequency.
Document frequency is a Term node label, which is the node degree (# of docs to which the term relates).
A market basket is also a bag-edged graph which is realized as a set-edged graph with a count label.
Try to do this with the Data Cube Model!
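
The step from a bag-edged term-doc graph to a set-edged graph with a count label can be sketched with `collections.Counter`. The occurrence list below is invented for the example; term frequency falls out as the count label on each edge, and document frequency as the degree of each term node.

```python
from collections import Counter

# Hypothetical bag of (term, doc) edges: one entry per occurrence of a term in a doc
occurrences = [
    ("data", "d1"), ("data", "d1"), ("mining", "d1"),
    ("data", "d2"), ("graph", "d2"), ("graph", "d2"), ("graph", "d2"),
]

# Set-edged graph with a count label: each distinct edge keeps its multiplicity
tf = Counter(occurrences)

# Document frequency: degree of each term node in the set-edged graph
df = Counter(term for term, _doc in tf)
```

Here `tf[("data", "d1")]` is the term frequency of "data" in d1, and `df["data"]` counts the distinct docs that "data" relates to. The same two steps apply to a market basket, with (item, basket) pairs in place of (term, doc).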

Association Rule Mining
Given any BIPARTITE RELATIONSHIP E on N = T ∪ I (a degree-two relationship between the disjoint parts T and I), E generates
I-Association Rules (I-AR), A→C, with A,C ⊆ I and A∩C=∅ (disjoint I-sets), and
T-Association Rules (T-AR), A→C, with A,C ⊆ T and A∩C=∅ (disjoint T-sets).
The two main measures of quality of A→C are:
A→C is T-frequent iff T-SUPPORT(A∪C) ≥ MINSUPP, and
A→C is T-confident iff T-SUPPORT(A∪C) / T-SUPPORT(A) ≥ MINCONF,
where T-SUPPORT(A) ≡ |{ t | (i,t)∈E ∀i∈A }|, T-SUPPORT(A∪C) ≡ |{ t | (i,t)∈E ∀i∈A∪C }|, and MINSUPP and MINCONF are user-chosen parameters.
Likewise, A→C is I-frequent iff I-SUPPORT(A∪C) ≥ MINSUPP, and A→C is I-confident iff I-SUPPORT(A∪C) / I-SUPPORT(A) ≥ MINCONF.
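
A small worked example may help with the two measures. The sketch below counts T-support over a list of (i, t) edges and then checks a rule against MINSUPP and MINCONF; the helper names, the basket data, and the thresholds are all assumptions made for illustration.

```python
def t_support(itemset, edges):
    """Number of t that are related to every i in itemset."""
    items_of = {}
    for i, t in edges:
        items_of.setdefault(t, set()).add(i)
    wanted = set(itemset)
    return sum(1 for items in items_of.values() if wanted <= items)

def rule_quality(A, C, edges, minsupp, minconf):
    """Return (support, confidence, is_frequent, is_confident) for the rule A -> C."""
    A, C = set(A), set(C)
    if A & C:
        raise ValueError("antecedent and consequent must be disjoint")
    supp_rule = t_support(A | C, edges)
    supp_a = t_support(A, edges)
    conf = supp_rule / supp_a if supp_a else 0.0
    return supp_rule, conf, supp_rule >= minsupp, conf >= minconf

# Hypothetical (item, transaction) edges
edges = [
    ("bread", "t1"), ("milk", "t1"),
    ("bread", "t2"), ("milk", "t2"), ("eggs", "t2"),
    ("bread", "t3"),
    ("milk", "t4"), ("eggs", "t4"),
]

supp, conf, frequent, confident = rule_quality(
    {"bread"}, {"milk"}, edges, minsupp=2, minconf=0.6
)
```

For the rule bread → milk, two transactions contain both items and three contain bread, so the support is 2 and the confidence is 2/3. Swapping the roles of the two columns in `edges` gives the I-support version of the same computation.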

Clustering (partitioning) and Classification
A CLUSTERING or PARTITION of N is a dual formulation of an equivalence relation on N, since the partition, {Ci}, into equivalence classes is a clustering and vice versa.
A LABEL FUNCTION, L:{Ci}→Labels (assuming the cluster components, the Ci's, are labeled by their IDs), is also dual to a clustering or partitioning on N, in the sense that the pre-image partition, {L⁻¹(Lk)}, is a clustering and vice versa.
CLASSIFYING s ∈ S(A1, …, An) using the training set, R(A1, …, An, L), is usually just a matter of identifying the best R→R[L] pre-image cluster for s based on R.
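
The pre-image view of classification can be sketched as follows. The slide does not say what makes a cluster "best", so this example substitutes a simple stand-in: the cluster containing the training tuple nearest to s under squared Euclidean distance. The training rows and function names are hypothetical.

```python
def preimage_partition(R):
    """Group training tuples by class label: the pre-image cluster of each label."""
    clusters = {}
    for *attrs, label in R:
        clusters.setdefault(label, []).append(tuple(attrs))
    return clusters

def classify(s, R):
    """Pick the label whose pre-image cluster holds the tuple closest to s.

    'Closest' is an assumed criterion (squared Euclidean distance), not the slide's.
    """
    clusters = preimage_partition(R)

    def dist(label):
        return min(sum((a - b) ** 2 for a, b in zip(s, t)) for t in clusters[label])

    return min(clusters, key=dist)

# Hypothetical training set R(A1, A2, L)
R = [
    (1.0, 1.0, "low"), (1.5, 2.0, "low"),
    (8.0, 9.0, "high"), (9.0, 8.5, "high"),
]

clusters = preimage_partition(R)
label_a = classify((2.0, 1.5), R)
label_b = classify((8.5, 8.0), R)
```

Any other notion of "best" cluster (majority vote among neighbors, distance to a cluster mean, and so on) slots into `dist` without changing the pre-image partition itself.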