Scalable Algorithms for Association Mining. Mohammed J. Zaki. IEEE Transactions on Knowledge and Data Engineering, 2000. 2018/12/10. Presenter: 吳建良
Abstract. Frequent itemset mining. Vertical tid-list database format. Lattice-theoretic approach: prefix-based and maximal-clique-based partitioning. Pattern search strategies: bottom-up, top-down, and hybrid search. Requires only a few database scans.
Symbol: Definition
I: the set of items
D: the database of transactions
tid: the identifier of a transaction
itemset: a subset of I
k-itemset: an itemset with k items
σ(X): the support of an itemset X
frequent itemset: an itemset whose support ≥ the minimum support
Fk: the set of frequent k-itemsets
A→B: an association rule, with support σ(A∪B) and confidence σ(A∪B) / σ(A)
Example
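The support and confidence definitions can be made concrete with a short Python sketch. The six-transaction database `DB` and the helper names `support` and `confidence` are illustrative assumptions, not data or code recovered from the slides.

```python
# Toy horizontal database: tid -> set of items.
# Illustrative assumption; the figure on the Example slide was not recoverable.
DB = {
    1: {"A", "C", "T", "W"},
    2: {"C", "D", "W"},
    3: {"A", "C", "T", "W"},
    4: {"A", "C", "D", "W"},
    5: {"A", "C", "D", "T", "W"},
    6: {"C", "D", "T"},
}


def support(itemset, db=DB):
    """sigma(X): number of transactions that contain every item of X."""
    items = set(itemset)
    return sum(1 for transaction in db.values() if items <= transaction)


def confidence(antecedent, consequent, db=DB):
    """Confidence of the rule A -> B: sigma(A u B) / sigma(A)."""
    both = set(antecedent) | set(consequent)
    return support(both, db) / support(antecedent, db)


if __name__ == "__main__":
    print(support("CW"))         # absolute support of {C, W}
    print(confidence("A", "C"))  # confidence of A -> C
```

Support is counted here as an absolute number of transactions; dividing by `len(DB)` would give the relative form.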
Itemset Enumeration: Lattice-theoretic approach. Partial order: a relation that is reflexive, antisymmetric, and transitive. Partially ordered set (poset): a set equipped with a partial order. Lattice: a poset in which any two elements a and b have a unique join and a unique meet; join a ∨ b = least upper bound(a, b), meet a ∧ b = greatest lower bound(a, b). Atom: an element that immediately succeeds the least element.
Power set lattice P(I). Gray circle: frequent itemset. Black circle: maximal frequent itemset.
Lemma. Lemma 1: All subsets of a frequent itemset are frequent. Corollary: All supersets of an infrequent itemset are infrequent. Lemma 2: The maximal frequent itemsets uniquely determine all frequent itemsets.
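Lemma 2 can be illustrated with a small sketch: given the maximal frequent itemsets, every frequent itemset is a non-empty subset of at least one of them. The function name `frequent_from_maximal` and the two maximal itemsets in `MAXIMAL` are assumptions chosen for illustration. The lemma recovers which itemsets are frequent, not their support values.

```python
from itertools import combinations


def frequent_from_maximal(maximal_itemsets):
    """Enumerate every non-empty subset of each maximal frequent itemset."""
    frequent = set()
    for maximal in maximal_itemsets:
        items = sorted(maximal)
        for size in range(1, len(items) + 1):
            # combinations() yields sorted tuples, so duplicates collapse in the set
            frequent.update(combinations(items, size))
    return frequent


# Assumed maximal frequent itemsets of a toy database (illustrative only).
MAXIMAL = ["ACTW", "CDW"]
ALL_FREQUENT = frequent_from_maximal(MAXIMAL)

if __name__ == "__main__":
    for itemset in sorted(ALL_FREQUENT, key=lambda t: (len(t), t)):
        print("".join(itemset))
```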
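Support Counting. L(X): the tid-list of item X, i.e., the identifiers of all transactions in the database that contain X. Support of a k-itemset: intersect the tid-lists of any two of its (k-1)-subsets; the support is the size of the resulting tid-list. Example: L(CD) = L(C) ∩ L(D); L(CDW) = L(CD) ∩ L(CW).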
Example
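The tid-list intersections L(CD) = L(C) ∩ L(D) and L(CDW) = L(CD) ∩ L(CW) can be sketched in Python as follows. The database `DB` and the helper `to_tidlists` are illustrative assumptions; the only operation the slides prescribe is the set intersection itself.

```python
def to_tidlists(db):
    """Convert a horizontal database (tid -> items) to vertical form (item -> tids)."""
    tidlists = {}
    for tid, items in db.items():
        for item in items:
            tidlists.setdefault(item, set()).add(tid)
    return tidlists


# Toy database, assumed for illustration.
DB = {
    1: {"A", "C", "T", "W"},
    2: {"C", "D", "W"},
    3: {"A", "C", "T", "W"},
    4: {"A", "C", "D", "W"},
    5: {"A", "C", "D", "T", "W"},
    6: {"C", "D", "T"},
}

L = to_tidlists(DB)
L_CD = L["C"] & L["D"]   # tid-list of the 2-itemset CD
L_CW = L["C"] & L["W"]   # tid-list of the 2-itemset CW
L_CDW = L_CD & L_CW      # tid-list of CDW from two of its 2-subsets

if __name__ == "__main__":
    print(sorted(L_CDW), len(L_CDW))  # tid-list and support of CDW
```

Because the support is just the length of the intersected tid-list, no further pass over the horizontal database is needed once the vertical layout has been built.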
Lattice Decomposition: Prefix-Based Classes. Equivalence relation: a binary relation ≡ that is reflexive, symmetric, and transitive; it partitions the set P into disjoint subsets called equivalence classes. An equivalence relation θk on the lattice P(I): for all X, Y ∈ P(I), X ≡θk Y ⇔ p(X, k) = p(Y, k), where p(X, k) = X[1:k] is the k-length prefix of X. θk is called the prefix-based equivalence relation. Lemma: Each equivalence class [X]θk induced by the equivalence relation θk is a sublattice of P(I).
Example of Equivalence Classes. Classes of P(I) induced by θ1. Classes of [A]θ1 induced by θ2.
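Partitioning by the k-length prefix can be sketched as a simple grouping step. The function name `prefix_classes` and the list of frequent 2-itemsets `F2` are assumptions for illustration; items are ordered lexicographically so that the prefix is well defined.

```python
from collections import defaultdict


def prefix_classes(itemsets, k):
    """Group itemsets by p(X, k), the first k items in lexicographic order."""
    classes = defaultdict(list)
    for itemset in itemsets:
        items = tuple(sorted(itemset))
        if len(items) >= k:  # shorter itemsets have no k-length prefix
            classes[items[:k]].append(items)
    return dict(classes)


# Assumed frequent 2-itemsets of a toy database (illustrative only).
F2 = ["AC", "AT", "AW", "CD", "CT", "CW", "DW", "TW"]
CLASSES = prefix_classes(F2, 1)

if __name__ == "__main__":
    for prefix, members in CLASSES.items():
        print("".join(prefix), ["".join(m) for m in members])
```

Each resulting class can be mined independently of the others, which is the property the decomposition relies on.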
Search for Frequent Itemsets. Bottom-up Search. Algorithm:
Search for Frequent Itemsets (cont.). Example for [A]θ1.
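Since the pseudocode on the algorithm slide was not recoverable, the following is a minimal sketch of a bottom-up search within one prefix class, written from the description above: intersect the tid-lists of every pair of atoms, keep the frequent results as a new class, and recurse. The tid-lists in `TIDLISTS`, the threshold `MIN_SUP`, and the helper `mine_class` are illustrative assumptions.

```python
def bottom_up(atoms, min_sup, frequent):
    """atoms: list of (itemset tuple, tid set) pairs sharing a prefix, in sorted order."""
    for i, (x_i, tids_i) in enumerate(atoms):
        new_class = []
        for x_j, tids_j in atoms[i + 1:]:
            tids = tids_i & tids_j              # one tid-list intersection
            if len(tids) >= min_sup:
                candidate = x_i + x_j[-1:]      # extend the shared prefix by one item
                frequent[candidate] = len(tids)
                new_class.append((candidate, tids))
        if new_class:
            bottom_up(new_class, min_sup, frequent)


# Assumed vertical toy database and minimum support (illustrative only).
TIDLISTS = {
    "A": {1, 3, 4, 5},
    "C": {1, 2, 3, 4, 5, 6},
    "D": {2, 4, 5, 6},
    "T": {1, 3, 5, 6},
    "W": {1, 2, 3, 4, 5},
}
MIN_SUP = 3


def mine_class(prefix, tidlists=TIDLISTS, min_sup=MIN_SUP):
    """Build the frequent 2-itemset atoms of class [prefix] and search it bottom-up."""
    frequent, atoms = {}, []
    for item in sorted(tidlists):
        if item > prefix:
            tids = tidlists[prefix] & tidlists[item]
            if len(tids) >= min_sup:
                atoms.append(((prefix, item), tids))
                frequent[(prefix, item)] = len(tids)
    bottom_up(atoms, min_sup, frequent)
    return frequent


CLASS_A = mine_class("A")

if __name__ == "__main__":
    for itemset, sup in sorted(CLASS_A.items()):
        print("".join(itemset), sup)
```

Only the tid-lists of the current class are live at each recursion level, which keeps the memory footprint of the search small.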
Search for Frequent Itemsets (cont.). Top-down Search. Algorithm:
Search for Frequent Itemsets (cont.). Example for [A]θ1. Gray circle: infrequent itemset. Black circle: maximal frequent itemset. White circle: minimal infrequent itemset.
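The top-down idea can be sketched as follows: start from the union of all atoms of a class and, whenever an itemset is infrequent, descend to its subsets with one atom removed. This is a level-wise variant written for illustration, not the pseudocode from the slide; the function name `top_down` and the toy tid-lists are assumptions. Processing one size at a time ensures that every itemset reported is maximal.

```python
def top_down(atoms, min_sup):
    """atoms: dict mapping each atom (a string of items) to its tid set.

    Returns the maximal frequent itemsets of the class as frozensets of items.
    """
    maximal = []
    level = {frozenset(atoms)}                   # start at the top element
    while level:
        next_level = set()
        for group in level:
            itemset = frozenset().union(*group)  # items covered by this group of atoms
            if any(itemset <= m for m in maximal):
                continue                         # subset of a known maximal itemset
            tids = set.intersection(*(atoms[a] for a in group))
            if len(tids) >= min_sup:
                maximal.append(itemset)
            elif len(group) > 1:
                # infrequent: try every subset with one atom removed
                next_level.update(group - {a} for a in group)
        level = next_level
    return maximal


# Assumed vertical toy database; single items act as the atoms (illustrative only).
TIDLISTS = {
    "A": {1, 3, 4, 5},
    "C": {1, 2, 3, 4, 5, 6},
    "D": {2, 4, 5, 6},
    "T": {1, 3, 5, 6},
    "W": {1, 2, 3, 4, 5},
}
MAXIMAL = top_down(TIDLISTS, min_sup=3)

if __name__ == "__main__":
    print(sorted("".join(sorted(m)) for m in MAXIMAL))
```

Only maximal frequent itemsets are reported; their subsets are frequent by Lemma 1 and are never intersected explicitly.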
Search for Frequent Itemsets (cont.). Hybrid Search. Algorithm:
Search for Frequent Itemsets (cont.). Example for [A]θ1, assuming that AD and ADW are frequent.
Generating Smaller Classes: Maximal Clique Approach. Pseudoequivalence relation: a binary relation ≡ that is reflexive and symmetric, but not necessarily transitive; it partitions the set P into possibly overlapping subsets called pseudoequivalence classes. k-association graph Gk = (V, E), with vertex set V and edge set E.
Maximal Clique Approach (cont.). Clique: a complete subgraph of a graph. Mk: the set of maximal cliques in Gk. A pseudoequivalence relation φk on the lattice P(I); φk is called the maximal-clique-based pseudoequivalence relation. Bottom-up search: reduces the number of intersections. Top-down search: leads to a smaller maximum element size.
Example
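A minimal sketch of the clique step, assuming the k = 1 case in which vertices are frequent items and an edge joins two items whenever they form a frequent 2-itemset. Maximal cliques are enumerated here with the basic Bron-Kerbosch recursion, which is a choice made for this sketch and not necessarily the procedure used in the paper. The names `association_graph` and `maximal_cliques` and the list `F2` are illustrative assumptions.

```python
def association_graph(frequent_pairs):
    """Adjacency sets of the graph whose edges are the given frequent 2-itemsets."""
    adj = {}
    for u, v in frequent_pairs:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def maximal_cliques(adj):
    """Enumerate all maximal cliques with the basic Bron-Kerbosch recursion."""
    cliques = []

    def expand(r, p, x):
        # r: current clique, p: candidates that extend it, x: vertices already tried
        if not p and not x:
            cliques.append(frozenset(r))
            return
        for v in list(p):
            expand(r | {v}, p & adj[v], x & adj[v])
            p = p - {v}
            x = x | {v}

    expand(set(), set(adj), set())
    return cliques


# Assumed frequent 2-itemsets of a toy database (illustrative only).
F2 = ["AC", "AT", "AW", "CD", "CT", "CW", "DW", "TW"]
CLIQUES = maximal_cliques(association_graph(F2))

if __name__ == "__main__":
    print(sorted("".join(sorted(c)) for c in CLIQUES))
```

Each maximal clique bounds the itemsets that can be frequent within it, so the resulting classes tend to be smaller than the prefix-based ones, at the cost of possible overlap between classes.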
Experiment: Algorithms Compared
Eclat: prefix-based, bottom-up search
MaxEclat: prefix-based, hybrid search
Clique: maximal-clique-based, bottom-up search
MaxClique: maximal-clique-based, hybrid search
Topdown: maximal-clique-based, top-down search
AprClique: maximal-clique-based, horizontal data layout, hash tree
Partition: decomposes the database into nonoverlapping partitions, uses vertical tid-lists to generate local frequent itemsets, then merges all local frequent itemsets and computes global counts
Experimental Result
Experimental Result cont.
Experimental Result cont.