DUALITIES: PARTITION ↔ FUNCTION ↔ EQUIVALENCE RELATION ↔ UNDIRECTED GRAPH
Assume a Partition has uniquely labeled components (required for unambiguous reference).
The Partition-Induced Function takes a point to the label of its component.
The Function-Induced Equivalence Relation equates a pair iff both points map to the same value.
The Equivalence Relation-Induced Undirected Graph has an edge for each equivalent pair.
The Undirected Graph-Induced Partition is its connected-component partition.
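The four constructions can be chained into a round trip. The following is a minimal sketch in Python, assuming a partition is given as a dict from component label to a set of points; the function names and the sample data are illustrative, not part of the original material.

```python
from itertools import combinations

def partition_to_function(partition):
    """partition: dict label -> set of points. Returns dict point -> label."""
    return {p: label for label, block in partition.items() for p in block}

def function_to_equivalence(f):
    """Unordered pairs of distinct points that map to the same value."""
    return {frozenset((a, b)) for a, b in combinations(f, 2) if f[a] == f[b]}

def equivalence_to_graph(points, equiv):
    """Adjacency dict with one undirected edge per equivalent pair."""
    adj = {p: set() for p in points}
    for pair in equiv:
        a, b = tuple(pair)
        adj[a].add(b)
        adj[b].add(a)
    return adj

def graph_to_partition(adj):
    """Connected components of the graph, as a set of frozensets."""
    seen, blocks = set(), set()
    for start in adj:
        if start in seen:
            continue
        stack, comp = [start], set()
        while stack:
            n = stack.pop()
            if n in comp:
                continue
            comp.add(n)
            stack.extend(adj[n] - comp)
        seen |= comp
        blocks.add(frozenset(comp))
    return blocks

# Hypothetical example data: three labeled components.
partition = {"x": {1, 2, 3}, "y": {4}, "z": {5, 6}}
f = partition_to_function(partition)
equiv = function_to_equivalence(f)
graph = equivalence_to_graph(f.keys(), equiv)
recovered = graph_to_partition(graph)
```

The recovered components are the original blocks without their labels, which is why the slide assumes labels are supplied separately.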
DUALITIES: PARTIALLY ORDERED SET ↔ DIRECTED ACYCLIC GRAPH
The Directed Acyclic Graph-Induced Partially Ordered Set contains (s1,s2) iff it is an edge in the transitive closure of the graph.
The Partially Ordered Set-Induced Directed Acyclic Graph contains (s1,s2) iff it is in the POSET.
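As a sketch of the DAG-to-POSET direction, the Python below computes the closure by reachability. It assumes the DAG is given as a set of (s1,s2) edge pairs and produces the strict order only (reflexive pairs are omitted); the function name and sample edges are illustrative.

```python
def transitive_closure(edges):
    """Return all pairs (s1, s2) where s2 is reachable from s1 via one or more edges."""
    succ = {}
    for a, b in edges:
        succ.setdefault(a, set()).add(b)
        succ.setdefault(b, set())
    closure = set()
    for start in succ:
        stack = list(succ[start])
        reached = set()
        while stack:
            n = stack.pop()
            if n in reached:
                continue
            reached.add(n)
            stack.extend(succ[n])
        closure |= {(start, n) for n in reached}
    return closure

# Hypothetical DAG: a -> b -> c and a -> d.
dag_edges = {("a", "b"), ("b", "c"), ("a", "d")}
poset = transitive_closure(dag_edges)
```

Going the other way, taking the pairs of the POSET as edges gives a DAG whose closure is the POSET itself.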
DUALITIES: GRAPH MODEL ↔ CUBE MODEL (multidimensional) ↔ TABLE MODEL (Relation)
CUBE MODEL ↔ TABLE MODEL: in the Cube Model, every potential tuple is bitmapped.
GRAPH MODEL ↔ CUBE MODEL: Given a GRAPH G=(N,E), with N = nodes and E = edges, assume N and E have attributes (and therefore primary keys, alternate keys, ...). Each node and each edge then has associated with it a structure or tuple of descriptive information (its LABEL). If uniformly structured relational tuple labels suffice (we will assume that), we write N(NL1..NLn) and E(EL1..ELm).
DUALITIES: Given a LABELLED GRAPH G(N(L1..Ln), E(L1..Lm)), it can be expressed losslessly as a DATA CUBE (in fact, gainfully). If it is a multi-hyper-graph with up to h k-edges (h k-polygonal edges), then we build a k-cube with each dimension as N (i.e., N^k). For a k-partite graph, build a k-cube either with each dimension as N (sparse cube) or with each part, N_i, i=1..k, as a dimension (dense cube). (For a Directed Graph, the horizontal dimension will always be taken to be the initial node of each edge.)
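The sparse and dense cubes can be compared on a small bipartite graph. This is a minimal sketch, assuming the cube is a bitmap with one position per potential edge and ignoring labels; `bitmap_cube` and the gene/experiment data are hypothetical.

```python
from itertools import product

def bitmap_cube(edges, dims):
    """One position per potential k-edge; the bit is 1 iff that edge exists.

    dims: list of k lists of node ids, one list per cube dimension.
    """
    edge_set = set(edges)
    return {pos: int(pos in edge_set) for pos in product(*dims)}

# Hypothetical 2-partite graph: genes and experiments.
genes = ["g1", "g2", "g3"]
exps = ["e1", "e2"]
edges = [("g1", "e1"), ("g3", "e2")]

nodes = genes + exps
sparse = bitmap_cube(edges, [nodes, nodes])  # each dimension is N
dense = bitmap_cube(edges, [genes, exps])    # each dimension is one part N_i
```

Both cubes hold the same two edges, but the sparse cube reserves 25 positions where the dense cube reserves 6.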
RoloDex Model
The Data Cube Model gives a great picture of relationships, but can become gigantic (instances are bitmapped rather than listed, so there needs to be a position for each potential instance, not just each extant instance).
This inefficiency is especially severe in the very common Bipartite - Unipartite on Part (BUP) relationships. Examples:
In Bioinformatics, bipartite relationships between genes (one entity) and experiments or treatments (another entity) are studied in conjunction with unipartite relationships on the gene part (e.g., gene-gene or protein-protein interactions).
In Market Research, bipartite relationships between items and customers are studied in conjunction with unipartite relationships on the customer part (or on the product part, or both).
For this situation, the Relational Model provides no picture and the Data Cube Model is too inefficient (it requires that the unipartite relationship be redundantly replicated for every instance of the other bi-part). We suggest the RoloDex Model.
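A rough position count illustrates the replication cost. The sketch below rests on one reading of the text: the cube stores a gene-gene plane for every experiment, while the RoloDex keeps one exp-gene card and one gene-gene card. The function names and the sizes used are hypothetical.

```python
def cube_positions(n_genes, n_exps):
    """Gene x Gene x Exp cube: the gene-gene plane is repeated per experiment."""
    return n_genes * n_genes * n_exps

def rolodex_positions(n_genes, n_exps):
    """One exp-gene card plus one gene-gene card sharing the gene axis."""
    return n_exps * n_genes + n_genes * n_genes

# Hypothetical sizes, chosen only for illustration.
n_genes, n_exps = 1000, 50
cube = cube_positions(n_genes, n_exps)
rolodex = rolodex_positions(n_genes, n_exps)
```

With these sizes the cube needs 50,000,000 positions against 1,050,000 for the two cards.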
The Bipartite, Unipartite-on-Part Experiment Gene Relationship, EGG
[Figure: an exp-gene card on axes Exp and G, joined to a gene-gene card on two copies of the G axis.]
So as not to duplicate axes, this copy of G should be folded over to coincide with the other copy, producing a "conical" unipartite card.
RoloDex Model
[Figure: RoloDex of cards on shared axes. Axes: ItemSet, ItemSet (antecedent), Customer, Item, People, Author, PI, Doc, term, Gene (G), Exp. Cards: itemset-itemset card, cust-item card, term-doc card, author-doc card, gene-gene card (ppi), doc-doc card (hyperlink analysis), exp-PI card, exp-gene card, term-term card (share stem?).]
Supp(A) = CusFreq(ItemSet)
Conf(A→B) = Supp(A∪B)/Supp(A)
Each axis, a, inherits a frequency attribute from each of its cards, c(a,b), denoted bf(c.a), "# of b's related to a" (e.g., df(t) = doc freq of term, t). Of course, bf(c.a) is inherited redundantly by c(a,b).
Each card, c(a,b), inherits a frequency attribute from each of its axes, a [b], denoted af(a,b), "# times a is related to b in c" [bf(a,b), "# times b~a in c"].
Each card, c(a,b), can be expanded by each of its axes, e.g., a, to a-sets (each a value is identified with the singleton, {a}) (e.g., itemsets in MBR), or a-sets can become a new axis (e.g., doc in IR). Note: if term is expanded by singleton termsets to be part of doc, then the termdoc card becomes a cone (see first slide).
Next we put some of the descriptive attributes in their places.
Note: Conf / non-conf rules partition the itemset-itemset card. Can we usefully list confident rules by specifying the boundary (SVM style)? That presupposes spatial continuity of conf rules (may not be a correct assumption), but it may be on another similar card?
RoloDex Model combining term doc (a term is a document with 1 term in it) and item itemset
[Figure: RoloDex with the term and doc axes combined and the item and itemset axes combined. Axes: Item, ItemSet, ItemSet (antecedent), PI, People, Author, Customer, Gene, Exp, doctermgene. Cards: itemset-itemset card, cust-itemset card, gene-gene card (ppi), termdoc, author doc, termterm, exp-gene card, exp-PI card.]
ItemSet can be replaced by ItemBag (allowing duplicates and promoting count analysis). This is an accommodation of the multi-graph situation.
RoloDex uncombining term-doc and item-itemset (using itembag (basket), so that item count in a basket is defined)
[Figure: RoloDex with separate axes. Axes: Item, ItemBag, People, Author, Customer, Doc, term, Gene (G), PI, Exp. Cards: itembag-itembag card, cust-itembag card, termdoc card, authordoc card, genegene card (ppi), docdoc card, expPI card, expgene card, termterm card (share stem?).]
What is term frequency? doc frequency?
TD is a bag-edged graph, i.e., Edge(TD) is a bag, meaning an edge can occur multiple times (the same term "can occur in" a doc many times). If we don't distinguish those occurrences other than by existence (we could distinguish them into type classes, e.g., verb, noun, ...), then TD can be realized as a set-edged graph with a count label; otherwise we must use a bag-edged graph with a type label. Usually, TD is the former, and the count label is term frequency. Document frequency is a Term node label, which is the node degree (# of docs to which the term relates).
A market basket is also a bag-edged graph which is realized as a set-edged graph with a count label. Try to do this with the Data Cube Model!
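The realization of a bag-edged graph as a set-edged graph with a count label can be sketched as follows. This is an illustrative Python sketch, assuming the bag of edges is a list of (term, doc) occurrences; the function names and sample occurrences are hypothetical.

```python
from collections import Counter

def bag_to_counted_edges(bag_edges):
    """Collapse a bag of (term, doc) edges into distinct edges with a count label."""
    return Counter(bag_edges)

def term_node_degree(counted_edges):
    """Document frequency: number of distinct docs each term is related to."""
    df = Counter()
    for term, _doc in counted_edges:  # iterates over the distinct edges only
        df[term] += 1
    return df

# Hypothetical occurrences: "graph" appears twice in d1 and once in d2.
occurrences = [("graph", "d1"), ("graph", "d1"), ("graph", "d2"), ("cube", "d2")]
tf = bag_to_counted_edges(occurrences)  # edge label: term frequency
df = term_node_degree(tf)               # node label: document frequency
```

Here the term frequency of "graph" in d1 is 2, while its document frequency is also 2 because it relates to two distinct docs.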
Association Rule Mining
Given any BIPARTITE RELATIONSHIP (a degree two bipartite relationship), E, on N = T ∪ I, it generates I-Association Rules (I-AR), A→C, A,C ⊆ I, with A∩C=∅ (disjoint I-sets), and T-Association Rules (T-AR), A→C, A,C ⊆ T, with A∩C=∅ (disjoint T-sets).
The main two measures of quality of A→C are:
T-frequent iff T-SUPPORT(A→C) ≥ MINSUPP, and
T-confident iff T-SUPPORT(A→C) / T-SUPPORT(A) ≥ MINCONF,
where T-SUPPORT(A) ≡ |{ t | (i,t)∈E ∀i∈A }|, T-SUPPORT(A→C) ≡ |{ t | (i,t)∈E ∀i∈A∪C }|, and MINSUPP and MINCONF are user-chosen parameters.
Likewise, A→C is I-frequent iff I-SUPPORT(A→C) ≥ MINSUPP, and A→C is I-confident iff I-SUPPORT(A→C) / I-SUPPORT(A) ≥ MINCONF.
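The T-support and T-confidence definitions translate directly into code. The sketch below assumes E is a set of (i,t) pairs and T is the set of all t values; the item names, transactions, and threshold values are hypothetical.

```python
def t_support(E, T, itemset):
    """Number of t in T related to every item i in the itemset."""
    return sum(1 for t in T if all((i, t) in E for i in itemset))

def t_confidence(E, T, A, C):
    """T-SUPPORT(A -> C) / T-SUPPORT(A); returns 0.0 if A has no support."""
    denom = t_support(E, T, A)
    return t_support(E, T, set(A) | set(C)) / denom if denom else 0.0

# Hypothetical market-basket data: items related to transactions.
T = {"t1", "t2", "t3", "t4"}
E = {("bread", "t1"), ("milk", "t1"),
     ("bread", "t2"), ("milk", "t2"),
     ("bread", "t3"),
     ("milk", "t4")}

A, C = {"bread"}, {"milk"}
supp_A = t_support(E, T, A)
supp_AC = t_support(E, T, A | C)
conf = t_confidence(E, T, A, C)

MINSUPP, MINCONF = 2, 0.6  # user-chosen parameters (illustrative values)
is_frequent = supp_AC >= MINSUPP
is_confident = conf >= MINCONF
```

The I-support measures follow by swapping the roles of T and I, that is, by counting items related to every t in a T-set.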
Clustering (partitioning) and Classification
A CLUSTERING or PARTITION of N is a dual formulation of an equivalence relation on N, since the partition, {C_i}, into equivalence classes is a clustering and vice versa.
A LABEL FUNCTION, L:{C_i}→Labels (assuming the cluster components, the C_i's, are labeled by their IDs), is also dual to a clustering or partitioning on N in the sense that the pre-image partition, {L⁻¹(L_k)}, is a clustering and vice versa.
CLASSIFYING s ∈ S(A1, …, An) using the training set, R(A1, …, An, L), is usually just a matter of identifying the best R→R[L] pre-image cluster for s based on R.
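The pre-image partition and one possible notion of "best" cluster can be sketched as follows. The text does not say how the best cluster is chosen; this sketch assumes numeric attributes and picks the cluster containing the training tuple nearest to s (squared Euclidean distance), which is a swapped-in criterion for illustration only. All names and data are hypothetical.

```python
from collections import defaultdict

def preimage_partition(R):
    """R: list of (attribute_tuple, label). Returns dict label -> list of attribute tuples."""
    clusters = defaultdict(list)
    for attrs, label in R:
        clusters[label].append(attrs)
    return dict(clusters)

def classify(s, R):
    """Assumed criterion: label of the pre-image cluster holding the tuple closest to s."""
    clusters = preimage_partition(R)

    def dist(u, v):
        return sum((x - y) ** 2 for x, y in zip(u, v))

    return min(clusters, key=lambda lab: min(dist(s, r) for r in clusters[lab]))

# Hypothetical training set R(A1, A2, L).
R = [((1.0, 1.0), "low"), ((1.5, 0.5), "low"),
     ((8.0, 9.0), "high"), ((9.0, 8.0), "high")]
clusters = preimage_partition(R)
label = classify((7.5, 8.5), R)
```

Other criteria (for example, distance to a cluster mean, or a vote among several near tuples) fit the same structure by changing only the key function.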