An Efficient Algorithm for Enumerating Closed Patterns in Transaction Databases
Takeaki Uno, Tatsuya Asai, Yuzo Uchida, Hiroki Arimura
National Institute of Informatics, JAPAN; Kyushu University, JAPAN; Hokkaido University, JAPAN
4/Oct/2004, Discovery Science
Transaction Database
・ Transaction database T : a database composed of transactions defined on an itemset I, i.e., every t ∈ T satisfies t ⊆ I
- basket data
- links of web pages
- words in documents
・ A subset of I is called a pattern
・ Real-world data is often large and sparse
T = { {1,2,5,6,7}, {2,3,4,5}, {1,2,7,8,9}, {1,7,9}, {2,7,9}, {2} }
Occurrences of Pattern
・ For a pattern P,
- occurrence of P : a transaction in T that includes P
- denotation of P : the set of occurrences of P
・ The size of the denotation is called the frequency of P
T = { {1,2,5,6,7,9}, {2,3,4,5}, {1,2,7,8,9}, {1,7,9}, {2,7,9}, {2} }
denotation of {1,2} = { {1,2,5,6,7,9}, {1,2,7,8,9} }
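A minimal Python sketch of the two definitions above, run on the example database T. The function names `denotation` and `frequency` are illustrative choices, not names from the talk.

```python
# Example database T from the slide; each transaction is a set of items.
T = [
    {1, 2, 5, 6, 7, 9},
    {2, 3, 4, 5},
    {1, 2, 7, 8, 9},
    {1, 7, 9},
    {2, 7, 9},
    {2},
]


def denotation(pattern, transactions):
    """Return every transaction that includes all items of `pattern`."""
    pattern = set(pattern)
    return [t for t in transactions if pattern <= t]


def frequency(pattern, transactions):
    """Frequency of a pattern = size of its denotation."""
    return len(denotation(pattern, transactions))


print(denotation({1, 2}, T))   # the two transactions that contain both 1 and 2
print(frequency({1, 2}, T))    # 2
```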
Frequent Pattern
・ Given a minimum support θ,
Frequent pattern: a pattern s.t. (frequency) ≧ θ (a subset of items that is included in at least θ transactions)
T = { {1,2,5,6,7,9}, {2,3,4,5}, {1,2,7,8,9}, {1,7,9}, {2,7,9}, {2} }
Ex.) non-empty patterns included in at least 3 transactions:
{1} {2} {7} {9} {1,7} {1,9} {2,7} {2,9} {7,9} {1,7,9} {2,7,9}
・ Frequent patterns play an important role in discovering interesting knowledge
・ However, the number of frequent patterns is often large…
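The example can be checked with a brute-force sketch in Python. This is only an illustration of the definition (it tests every candidate itemset level by level), not the algorithm proposed in the talk; `frequent_patterns` is a hypothetical helper name.

```python
from itertools import combinations

T = [
    {1, 2, 5, 6, 7, 9},
    {2, 3, 4, 5},
    {1, 2, 7, 8, 9},
    {1, 7, 9},
    {2, 7, 9},
    {2},
]


def frequent_patterns(transactions, theta):
    """List all non-empty patterns included in at least `theta` transactions."""
    items = sorted(set().union(*transactions))
    frequent = []
    for size in range(1, len(items) + 1):
        level = []
        for candidate in combinations(items, size):
            c = set(candidate)
            support = sum(1 for t in transactions if c <= t)
            if support >= theta:
                level.append(candidate)
        if not level:
            # Every subset of a frequent pattern is frequent, so once a
            # size has no frequent pattern, no larger size has one either.
            break
        frequent.extend(level)
    return frequent


for p in frequent_patterns(T, 3):
    print(p)
```

With θ = 3 this prints eleven patterns, from `(1,)` up to `(2, 7, 9)`.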
Closed Pattern [Pasquier et al. 1999]
・ Patterns having the same denotation are quite similar
・ Classify patterns into equivalence classes by their denotations
・ Closed pattern: the maximal pattern in an equivalence class (= the intersection of the occurrences in the denotation)
・ Closure of a pattern: the closed pattern belonging to its equivalence class
[Figure: equivalence classes of patterns, each consisting of one closed pattern and its non-closed patterns]
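A short Python sketch of the closure operation as defined above (the intersection of all occurrences), using the example database of the earlier slides. Returning `None` for a pattern that occurs in no transaction is an assumption of this sketch; the slides do not define that case.

```python
T = [
    {1, 2, 5, 6, 7, 9},
    {2, 3, 4, 5},
    {1, 2, 7, 8, 9},
    {1, 7, 9},
    {2, 7, 9},
    {2},
]


def closure(pattern, transactions):
    """Intersection of all transactions that include `pattern`."""
    pattern = set(pattern)
    occurrences = [t for t in transactions if pattern <= t]
    if not occurrences:
        return None  # assumption: closure is left undefined here
    return frozenset(set.intersection(*occurrences))


def is_closed(pattern, transactions):
    """A pattern is closed iff it equals its own closure."""
    return closure(pattern, transactions) == frozenset(pattern)


print(sorted(closure({1}, T)))   # [1, 7, 9]
print(is_closed({1, 7}, T))      # False: {1,7} and {1,7,9} share a denotation
print(is_closed({1, 7, 9}, T))   # True
```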
Advantages of Closed Pattern
Completeness: [Mannila ’96; Pasquier ’99]
- The set C of frequent closed patterns has the complete information on the set F of all frequent patterns and their frequencies
- Any maximal association rule can be constructed from the set C
- The set C is sufficient for building classification rules over itemsets
Compactness: [Mannila ’96]
- |Maximal Frequent| ≦ |C| ≦ |F|
- Frequent closed patterns are possibly exponentially fewer than frequent patterns
Th 1 [This paper]: For any n and m, we can construct a database of n items and m transactions such that $|C| = O(m^2)$ while $|F| = 2^{\Omega(n+m)}$.
Problem and Result
・ PROBLEM: given a transaction database, find all frequent closed patterns
・ There are many existing studies, in both theory and practice
・ We propose prefix preserving closure extension, and an efficient algorithm LCM (Linear time Closed pattern Miner)
・ Theoretical advantage: linear time in the number of frequent closed patterns, and small memory use
・ Practical advantage: faster than the other algorithms for many datasets (almost all datasets of KDDcup and FIMI’03)
Existing Approach
・ Frequent pattern mining based approaches: enumerate frequent patterns, and output the closed patterns among them
・ Reduce the computation time by avoiding non-closed patterns. During the enumeration,
- eliminate unnecessary patterns from memory
- prune unnecessary branches of the recursion (not complete)
Our Approach
・ Existing algorithms
- may operate on many non-closed patterns
- require much memory for storing obtained patterns
・ We propose
- closure extension based enumeration: operates on closed patterns only (linear time)
- prefix preserving closure extension: no memory needed for previously obtained patterns (small memory)
- some algorithms for fast computation: faster than other algorithms
Closure Extension [Pasquier et al. ’99]
・ Closure extension: a rule for constructing a closed pattern from another closed pattern: add an item, and take its closure
(closed pattern + item → closure)
・ Any closed pattern other than closure(φ) is a closure extension of at least one other closed pattern
・ Any closed pattern is strictly smaller than any of its closure extensions
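The rule "add an item, and take its closure" can be sketched in Python as follows. The helper name `closure_extensions` and the `theta` parameter (minimum support, default 1) are assumptions of this sketch, and the database is the running example of the earlier slides.

```python
T = [
    {1, 2, 5, 6, 7, 9},
    {2, 3, 4, 5},
    {1, 2, 7, 8, 9},
    {1, 7, 9},
    {2, 7, 9},
    {2},
]


def closure_extensions(closed_pattern, transactions, theta=1):
    """All closed patterns obtained by adding one item and taking the closure."""
    p = set(closed_pattern)
    items = set().union(*transactions)
    result = set()
    for i in sorted(items - p):
        occurrences = [t for t in transactions if (p | {i}) <= t]
        # Skip items whose extension is infrequent (or occurs nowhere).
        if occurrences and len(occurrences) >= theta:
            result.add(frozenset(set.intersection(*occurrences)))
    return result


for h in sorted(closure_extensions({7, 9}, T), key=sorted):
    print(sorted(h))
```

Different items can lead to the same closed pattern (here items 5 and 6 both give {1,2,5,6,7,9}), which is why plain closure extension yields a graph rather than a tree.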
Acyclic Relation [essentially Pasquier et al. ’99]
・ Closure extension induces an acyclic search graph
・ We can compute all closed patterns by closure extension, in time linear in their number
・ However, we still have to store the obtained closed patterns in memory…
[Figure: acyclic search graph induced by closure extension; label: frequent]
Prefix Preserving Closure Extension [new]
・ Prefix preserving closure extension (ppc extension) is a variation of closure extension
Def. closure tail of a closed pattern P ⇔ the minimum j s.t. closure(P ∩ {1,…,j}) = P
Def. H = closure(P ∪ {i}) (a closure extension of P) is a ppc extension of P ⇔ i > closure tail of P and H ∩ {1,…,i-1} = P ∩ {1,…,i-1}
・ No duplication occurs in depth-first search
・ “Any” closed pattern H is generated from another “unique” closed pattern by ppc extension (i.e., from closure(H ∩ {1,…,i-1}), where i is the closure tail of H)
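The two definitions can be written down directly in Python. This is a naive sketch that recomputes closures from scratch, assuming items are positive integers (so 0 can serve as the closure tail of closure(φ)) and additionally requiring i ∉ P; the function names are illustrative, and the database is the running example.

```python
T = [
    {1, 2, 5, 6, 7, 9},
    {2, 3, 4, 5},
    {1, 2, 7, 8, 9},
    {1, 7, 9},
    {2, 7, 9},
    {2},
]


def closure(pattern, transactions):
    """Intersection of all occurrences; None if the pattern occurs nowhere."""
    occurrences = [t for t in transactions if set(pattern) <= t]
    if not occurrences:
        return None
    return frozenset(set.intersection(*occurrences))


def closure_tail(P, transactions):
    """Minimum j such that closure(P ∩ {1..j}) = P (0 means the empty prefix)."""
    P = frozenset(P)
    for j in [0] + sorted(P):
        prefix = {x for x in P if x <= j}
        if closure(prefix, transactions) == P:
            return j
    raise ValueError("P is not a closed pattern")


def ppc_extension(P, i, transactions):
    """Return closure(P ∪ {i}) if it is a ppc extension of P, else None."""
    P = frozenset(P)
    if i in P or i <= closure_tail(P, transactions):
        return None
    H = closure(P | {i}, transactions)
    if H is None:
        return None
    # Prefix preservation: H and P must agree on all items smaller than i.
    if {x for x in H if x < i} != {x for x in P if x < i}:
        return None
    return H


print(closure_tail({1, 7, 9}, T))   # 1, because closure({1}) = {1,7,9}
print(ppc_extension({2}, 3, T))     # the closed pattern {2,3,4,5}
print(ppc_extension({2}, 4, T))     # None: the closure adds item 3 < 4
```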
Relation of ppc extension [new]
・ Any closed pattern other than the root is a ppc extension of a unique closed pattern
・ Hence ppc extension forms a tree
・ We can perform depth-first search by ppc extension, without storing closed patterns in memory
[Figure: tree of ppc extensions; label: frequent]
Example
T = { {1,2,5,6,7,9}, {2,3,4,5}, {1,2,7,8,9}, {1,7,9}, {2,7,9}, {2} }
・ closure extension: acyclic
・ ppc extension: tree
[Figure: search graph over the closed patterns φ, {2}, {7,9}, {1,7,9}, {2,5}, {2,7,9}, {1,2,7,9}, {2,3,4,5}, {1,2,7,8,9}, {1,2,5,6,7,9}; edges are marked as closure extension or ppc extension]
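The example can be reproduced with a small depth-first search over ppc extensions. This is a simplified Python sketch of the search strategy, not the authors' C implementation: it assumes positive integer items and θ ≥ 1, and it relies on the fact that the closure tail of a ppc extension by item i is i itself, so the tail is passed down the recursion instead of being recomputed. The name `lcm_sketch` is hypothetical.

```python
T = [
    {1, 2, 5, 6, 7, 9},
    {2, 3, 4, 5},
    {1, 2, 7, 8, 9},
    {1, 7, 9},
    {2, 7, 9},
    {2},
]


def lcm_sketch(transactions, theta=1):
    """Enumerate frequent closed patterns by DFS over ppc extensions."""
    transactions = [set(t) for t in transactions]
    if not transactions or len(transactions) < theta:
        return []
    items = sorted(set().union(*transactions))
    found = []

    def expand(P, occurrences, tail):
        found.append(P)  # no lookup of earlier patterns is ever needed
        for i in items:
            if i <= tail or i in P:
                continue
            occ_i = [t for t in occurrences if i in t]
            if len(occ_i) < theta:
                continue
            H = frozenset(set.intersection(*occ_i))
            # H always contains P, so the prefixes agree exactly when
            # every item of H below i already belongs to P.
            if all(x in P for x in H if x < i):
                expand(H, occ_i, i)

    root = frozenset(set.intersection(*transactions))
    expand(root, transactions, 0)
    return found


for pattern in lcm_sketch(T, theta=1):
    print(sorted(pattern))
```

Each closed pattern is reported exactly once, although nothing is stored for duplicate detection; the list `found` is only the output.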
Fast Computation
To generate a ppc extension for a closed pattern P and an item i, we
1. compute the denotation of P ∪{i}
2. compute the closure of P ∪{i}
3. compare the prefix
We propose efficient algorithms for these tasks
Occurrence Deliver [new]
・ Compute the denotations of P ∪{i} for all i’s at once, by transposing the trimmed database
・ The trimmed database is composed of
- items to be added
- transactions including P
・ Takes linear time in the size of the trimmed database
・ Efficient for sparse datasets
[Figure: for pattern {1,2} with denotation {A,B,C}, the trimmed database over items 3, 4, 5 is transposed to give the denotations of {1,2,3}, {1,2,4}, {1,2,5}]
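A minimal Python sketch of the transposition idea: one pass over the occurrences of P fills a bucket per candidate item, so every denotation of P ∪{i} is obtained at once. Representing transactions by their index in a list, and trimming to items larger than a given `tail`, are assumptions of this sketch; the database is the running example rather than the one in the figure.

```python
from collections import defaultdict

T = [
    {1, 2, 5, 6, 7, 9},
    {2, 3, 4, 5},
    {1, 2, 7, 8, 9},
    {1, 7, 9},
    {2, 7, 9},
    {2},
]


def occurrence_deliver(pattern, occurrence_ids, transactions, tail):
    """Map each candidate item i to the ids of the occurrences of P ∪ {i}."""
    buckets = defaultdict(list)
    for tid in occurrence_ids:
        for item in transactions[tid]:
            # Trimming: only items that may still be added are delivered.
            if item > tail and item not in pattern:
                buckets[item].append(tid)
    return dict(buckets)


# Pattern {2} occurs in transactions 0, 1, 2, 4 and 5; its closure tail is 2.
buckets = occurrence_deliver({2}, [0, 1, 2, 4, 5], T, tail=2)
for item in sorted(buckets):
    print(item, buckets[item])
```

The work done is proportional to the total size of the scanned transactions, which is why the approach suits sparse data.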
Anytime Database Reduction [new]
・ Reduce the database, as in [fp-growth, etc]
◆ Remove an item e, if e is included in fewer than θ transactions or in all transactions
◆ Merge identical transactions into one
・ Apply trimming and this reduction recursively, so that the database becomes small in the lower levels of the recursion
・ For taking closure, keep the intersection of the merged transactions (the closure operation is to take the intersection of transactions)
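One way to picture the reduction is the following Python sketch. The data layout is an assumption made for illustration: each row is a triple (suffix, prefix, weight), where the suffix holds the items that may still be added, the prefix holds the remaining items needed only for closure, and the weight counts how many original transactions the row stands for. The names `reduce_database`, `suffix`, `prefix` and `weight` are hypothetical.

```python
from collections import defaultdict


def reduce_database(rows, theta):
    """Drop useless items from suffixes, then merge rows with equal suffixes.

    Returns (reduced_rows, common_items), where common_items are the suffix
    items contained in every transaction.
    """
    total = sum(weight for _, _, weight in rows)
    count = defaultdict(int)
    for suffix, _, weight in rows:
        for item in suffix:
            count[item] += weight
    common = {e for e, c in count.items() if c == total}
    keep = {e for e, c in count.items() if theta <= c < total}

    merged = {}
    for suffix, prefix, weight in rows:
        key = frozenset(set(suffix) & keep)
        if key in merged:
            old_prefix, old_weight = merged[key]
            # Keep the intersection of the merged prefixes for closure.
            merged[key] = (old_prefix & frozenset(prefix), old_weight + weight)
        else:
            merged[key] = (frozenset(prefix), weight)
    reduced = [(key, p, w) for key, (p, w) in merged.items()]
    return reduced, common


# The running example, split into items >= 6 (suffix) and items < 6 (prefix).
rows = [
    ({6, 7, 9}, {1, 2, 5}, 1),
    (set(), {2, 3, 4, 5}, 1),
    ({7, 8, 9}, {1, 2}, 1),
    ({7, 9}, {1}, 1),
    ({7, 9}, {2}, 1),
    (set(), {2}, 1),
]
reduced, common = reduce_database(rows, theta=3)
for suffix, prefix, weight in reduced:
    print(sorted(suffix), sorted(prefix), weight)
```

Here six transactions shrink to two weighted rows, because items 6 and 8 are infrequent and the remaining suffixes coincide.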
Experiments
・ Computational environment
- CPU, memory: AMD Athlon XP 1600+, 224MB
- OS, programming language, compiler: Linux, C, gcc
・ Algorithms compared with: FP-growth, afopt, MAFIA, PATRICIAMINE, kDCI (all of these marked high scores at the FIMI03 competition)
・ Datasets: 13 datasets (real world, machine learning, and artificial) used in FIMI03 and KDD-cup, with specified supports
Result
・ Won on 12 datasets for every support (all except the Accidents dataset at middle supports)
・ Outperforms the others especially at smaller supports
results
Conclusion
・ Closed patterns: representatives of frequent patterns [Pasquier et al. ’00]
- much fewer than frequent patterns (possibly exponentially)
- useful in compact representation and rule induction
・ We proposed an algorithm LCM for mining closed patterns in databases
- prefix preserving closure extension for a tree-shaped search space
- time complexity is linear in the number of closed patterns, with a small memory footprint
- practical speed up: occurrence deliver and anytime database reduction
・ Experiments show that LCM outperforms other algorithms in most instances of the KDDcup and FIMI datasets, especially with small supports
・ Future work: closed patterns for sequences, trees, and other structures
・ LCM has been submitted to the FIMI04 competition; please look forward to it!
List of Datasets
Real datasets
・ BMS-WebView-1
・ BMS-WebView-2
・ BMS-POS
・ Retail
・ Kosarak
・ Accidents
Machine learning benchmark
・ Chess
・ Mushroom
・ Pumsb
・ Pumsb*
・ Connect
Artificial datasets
・ T10I4D100K
・ T40I10D100K