Published by Rudolf Lindsey. Modified over 9 years ago.
1
LCM ver.2: Efficient Mining Algorithms for Frequent/Closed/Maximal Itemsets
Takeaki Uno, Masashi Kiyomi, Hiroki Arimura
National Institute of Informatics, JAPAN
Hokkaido University, JAPAN
1/Nov/2004, Frequent Itemset Mining Implementations ’04
2
Summary
task | Our approach | Typical approach
FI mining | Backtracking with Hypercube decomposition (few freq. counting) | Backtracking
CI mining | Backtracking with PPC-extension (complete enumeration) (small memory) | Apriori with pruning
MFI mining | Backtracking with pruning (small memory) | Apriori with pruning
freq. counting | Occurrence deliver (linear time computation) | Down project
database maintenance | array with Anytime database reduction (simple) (fast initialization) | Trie (FP-tree)
maximality check | More database reductions (small memory) | store all itemsets
3
Frequent Itemset Mining
Almost all computation time is spent on frequency counting
⇒ How to reduce
- #FI to be checked
- cost of frequency counting
4
Hypercube Decomposition [from ver.1]
Reduce #FI to be checked
1. Decompose the set of all FI’s into hypercubes, each of which is included in an equivalence class
2. Enumerate the maximal and minimal itemsets of each hypercube (with frequency counting)
3. Generate the other FI’s between the maximal and minimal itemsets (without frequency counting)
Efficient when support is small
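Step 3 above can be sketched as follows. This is a minimal illustration, not the LCM implementation: the function name `hypercube_itemsets` is hypothetical, and it assumes the caller has already found a minimal and a maximal itemset of the same equivalence class, so that every itemset between them has the same frequency and needs no counting.

```python
from itertools import combinations


def hypercube_itemsets(minimal, maximal):
    """Yield every itemset X with minimal <= X <= maximal.

    Assumption: minimal and maximal lie in one equivalence class,
    so all yielded itemsets share the frequency already counted
    for the two endpoints.
    """
    base = frozenset(minimal)
    free = sorted(set(maximal) - base)  # items that may or may not be added
    for r in range(len(free) + 1):
        for extra in combinations(free, r):
            yield base | frozenset(extra)


if __name__ == "__main__":
    for itemset in hypercube_itemsets({1}, {1, 2, 3}):
        print(sorted(itemset))
```

A hypercube with k free items yields 2^k itemsets for only two frequency counts, which is consistent with the slide's remark that the gain is largest when the support is small and equivalence classes are large.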
5
Occurrence Deliver [ver1]
Compute the denotations of P ∪ {i} for all i’s at once, by transposing the trimmed database
The trimmed database is composed of
- items to be added
- transactions including P
Linear time in the size of the trimmed database
Efficient for sparse datasets
[Figure: a database and its trimmed database for itemset {1,2} with denotation {A,B,C}; transposing the trimmed database gives the denotations of {1,2,3}, {1,2,4}, and {1,2,5}]
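The transposition described above can be sketched as below. The function name `occurrence_deliver` and the small database are hypothetical (the transaction contents of the slide's figure are not recoverable); only the itemset {1,2} with denotation A, B, C is taken from the slide.

```python
from collections import defaultdict


def occurrence_deliver(database, occ_p, candidates):
    """Return {i: denotation of P ∪ {i}} for every candidate item i.

    database   : dict mapping transaction id -> set of items
    occ_p      : ids of the transactions that include P (denotation of P)
    candidates : items that may be added to P
    Each cell of the trimmed database is visited once, so the cost is
    linear in the size of the trimmed database.
    """
    buckets = defaultdict(list)
    for tid in occ_p:
        for item in database[tid]:
            if item in candidates:
                buckets[item].append(tid)  # "deliver" tid to the item's bucket
    return dict(buckets)


if __name__ == "__main__":
    # Hypothetical database; P = {1, 2} occurs in A, B and C.
    database = {
        "A": {1, 2, 3, 5},
        "B": {1, 2, 4, 5},
        "C": {1, 2, 3, 4, 5},
        "D": {2, 3},
    }
    print(occurrence_deliver(database, ["A", "B", "C"], {3, 4, 5}))
```

The frequency of P ∪ {i} is then simply the length of the bucket of i.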
6
Loss of Occurrence Deliver [new]
Avoiding frequency counting of infrequent itemsets P ∪ {e} has been considered important
However, in our experiments the computation time for such itemsets is 1/3 of the total computation cost on average (if we sort items by their frequency (size of tuple list))
Occurrence deliver has the advantage of a simple structure
[Figure: occurrences delivered to items 3 to 9 for P ∪ {e}, compared against the threshold θ]
7
Anytime Database Reduction [new]
Database reduction: reduce the database by [fp-growth, etc]
◆ removing item e, if e is included in fewer than θ transactions or included in all transactions
◆ merging identical transactions into one
Anytime database reduction: recursively apply trimming and this reduction in the recursion
⇒ the database becomes small in the lower levels of the recursion
In the recursion tree, lower-level iterations are exponentially more numerous than upper-level iterations
⇒ very efficient
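One reduction step of the kind listed above might look like the following sketch. The function name `reduce_database` is hypothetical, and representing merged transactions by an integer weight is an assumption of this sketch; it is one straightforward way to keep frequencies correct after merging.

```python
from collections import Counter


def reduce_database(transactions, theta):
    """One database-reduction step on a weighted transaction list.

    transactions : list of (set of items, weight) pairs
    theta        : minimum support
    Returns (merged, common): merged maps each distinct reduced
    transaction to its total weight, and common is the set of items
    that occur in every transaction.
    """
    total = sum(w for _, w in transactions)
    freq = Counter()
    for items, w in transactions:
        for e in items:
            freq[e] += w
    # Drop items that are infrequent or that occur in all transactions.
    keep = {e for e, f in freq.items() if theta <= f < total}
    common = {e for e, f in freq.items() if f == total}
    # Merge transactions that became identical, summing their weights.
    merged = Counter()
    for items, w in transactions:
        merged[frozenset(items) & keep] += w
    return dict(merged), common


if __name__ == "__main__":
    db = [({1, 2, 3}, 1), ({1, 2, 4}, 1), ({1, 2, 3}, 1), ({1, 5}, 1)]
    print(reduce_database(db, theta=2))
```

Applying this step to the trimmed database at every recursive call, rather than once at the start, is what makes the reduction "anytime".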
8
Example of Anytime D. R. [new]
[Figure: trimming and anytime database reduction applied alternately to an example database along the recursion (items i, j)]
9
Array (reduced) vs. Trie (FP-tree) [new]
Trie can compress the trimmed database [fp-growth, etc]
In experiments on FIMI instances, we computed the average compression ratio by Trie for the trimmed database over all iterations
⇒ #items (cells) in Tries: 1/2 on average, 1/6 at minimum (dense case)
If a Trie is constructed as a binary tree, it needs at least 3 pointers for each item
⇒ memory use (computation time): twice on average, 2/3 at minimum
Initialization is fast (LCM: O(||T||), Trie: O(|T| log |T| + ||T||))
10
Results
11
Closed Itemset Mining
How to
- avoid (prune) non-closed itemsets? (existing pruning is not complete)
- operate closure quickly?
- save memory use? (existing approaches use much memory)
12
Prefix Preserving Closure Extension [ver1]
Prefix preserving closure extension (PPC-extension) is a variation of closure extension
Def. closure tail of a closed itemset P ⇔ the minimum j s.t. closure(P ∩ {1,…,j}) = P
Def. H = closure(P ∪ {i}) (closure extension of P) is a PPC-extension of P ⇔ i > closure tail and H ∩ {1,…,i-1} = P ∩ {1,…,i-1}
⇒ no duplication occurs by depth-first search
“Any” closed itemset H is generated from another “unique” closed itemset by PPC-extension (i.e., from closure(H ∩ {1,…,i-1}))
13
Example of ppc-extension [ver1]
T = {1,2,5,6,7,9}, {2,3,4,5}, {1,2,7,8,9}, {1,7,9}, {2,7,9}, {2}
Closed itemsets: φ, {2}, {7,9}, {1,7,9}, {2,5}, {2,7,9}, {1,2,7,9}, {2,3,4,5}, {1,2,7,8,9}, {1,2,5,6,7,9}
[Figure: the closure extensions form an acyclic graph over the closed itemsets; the ppc extensions form a tree]
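A depth-first enumeration by PPC-extension can be sketched on the example database T as follows. This is a simplified illustration, not the LCM code: the function name `closed_itemsets` is hypothetical, frequencies are recounted by scanning the whole database (no occurrence deliver or database reduction), and the item used in the last extension stands in for the closure tail as the lower bound on i. It assumes integer items, a non-empty database, and minsup ≥ 1.

```python
def closed_itemsets(transactions, minsup=1):
    """Enumerate frequent closed itemsets by PPC-extension (sketch)."""
    db = [frozenset(t) for t in transactions]
    items = sorted(set().union(*db))
    found = []

    def expand(p, start):
        found.append(p)
        for pos in range(start, len(items)):
            i = items[pos]
            if i in p:
                continue
            occ = [t for t in db if p <= t and i in t]
            if len(occ) < minsup:
                continue
            # Closure extension: intersection of all occurrences of P ∪ {i}.
            h = frozenset.intersection(*occ)
            # Prefix-preserving test: H ⊇ P always holds, so it suffices
            # that H gained no item smaller than i.
            if all(x in p for x in h if x < i):
                expand(h, pos + 1)

    if len(db) >= minsup:
        expand(frozenset.intersection(*db), 0)  # closure of the empty set
    return found


if __name__ == "__main__":
    T = [{1, 2, 5, 6, 7, 9}, {2, 3, 4, 5}, {1, 2, 7, 8, 9},
         {1, 7, 9}, {2, 7, 9}, {2}]
    for c in closed_itemsets(T):
        print(sorted(c))
```

Because each closed itemset has exactly one parent under PPC-extension, the search needs no table of previously found itemsets to avoid duplicates.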
14
Results
15
Maximal Frequent Itemset Mining
How to
- avoid (prune) non-maximal itemsets?
- check maximality quickly?
- save memory? (existing maximality checks and pruning use much memory)
16
Backtracking-based Pruning [new]
During the backtracking algorithm for FI,
K : current itemset
H : a MFI including K
re-sort items s.t. the items of H are located at the end
Then, a new MFI is NEVER found in the recursive calls w.r.t. items in H
⇒ omit such recursive calls
We can avoid many non-MFI’s
[Figure: items 4 to 10 before and after re-sorting; recursive calls are made for the items outside H, and no recursive call for the items in H]
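The pruning rule above can be sketched as follows. This is an illustrative reconstruction, not the LCM code: the names `maximal_frequent_itemsets`, `greedy_mfi`, and `expand` are hypothetical, H is found by a simple greedy extension (an assumption; the slide does not say how H is obtained), and frequencies are recounted by scanning the database. H is reported only in the subtree where all of H \ K are still candidate items, so each MFI is output once without storing previously found itemsets.

```python
def maximal_frequent_itemsets(transactions, minsup):
    """Backtracking for MFIs with the re-sorting based pruning (sketch)."""
    db = [frozenset(t) for t in transactions]
    items = sorted(set().union(*db))
    found = []

    def frequent(s):
        return sum(1 for t in db if s <= t) >= minsup

    def greedy_mfi(k):
        # One pass suffices: an item rejected once stays infrequent
        # for every larger itemset.
        h = set(k)
        for e in items:
            if e not in h and frequent(h | {e}):
                h.add(e)
        return frozenset(h)

    def expand(k, cand):
        # cand: items i still allowed here, each with K ∪ {i} frequent
        h = greedy_mfi(k)
        if h - k <= set(cand):      # H belongs to this subtree
            found.append(h)
        head = [i for i in cand if i not in h]
        order = head + [i for i in cand if i in h]  # items of H go last
        for pos, i in enumerate(head):
            ki = k | {i}
            expand(ki, [j for j in order[pos + 1:] if frequent(ki | {j})])
        # No recursive call for the items in H: such calls could only
        # produce subsets of H.

    if frequent(frozenset()):
        expand(frozenset(), [i for i in items if frequent(frozenset([i]))])
    return found


if __name__ == "__main__":
    T = [{1, 2, 5, 6, 7, 9}, {2, 3, 4, 5}, {1, 2, 7, 8, 9},
         {1, 7, 9}, {2, 7, 9}, {2}]
    for m in maximal_frequent_itemsets(T, minsup=2):
        print(sorted(m))
```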
17
Fast Maximality Check (CI, MFI) [new]
To reduce the computation cost of the maximality check and the closedness check, we use more database reduction
At anytime database reduction, we keep
◆ the intersection of merged transactions, for the closure operation
◆ the sum of merged transactions as a weighted transaction database, for the maximality check
The closure is the intersection of the transactions
The frequencies of the itemsets larger by one item are given by the sum of the transactions in the trimmed database
By using these reduced databases, the computation time becomes short (no more than that of frequency counting)
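The two checks can be illustrated on a weighted trimmed database as follows. This is a minimal sketch under stated assumptions: the function names are hypothetical, `occ` is assumed to hold the (merged, weighted) transactions that include the current itemset K with all of their items, and `occ` is non-empty.

```python
from collections import Counter


def closure_and_counts(occ):
    """Return (closure, counts) for a weighted occurrence list.

    occ    : non-empty list of (frozenset of items, weight) pairs,
             all of which include the current itemset K
    closure: intersection of the transactions
    counts : counts[e] = frequency of K ∪ {e} (weighted sum)
    """
    closure = frozenset.intersection(*(t for t, _ in occ))
    counts = Counter()
    for t, w in occ:
        for e in t:
            counts[e] += w
    return closure, counts


def is_closed(k, occ):
    closure, _ = closure_and_counts(occ)
    return closure == frozenset(k)


def is_maximal(k, occ, theta):
    _, counts = closure_and_counts(occ)
    # K is maximal iff no single added item keeps it frequent.
    return all(c < theta for e, c in counts.items() if e not in k)


if __name__ == "__main__":
    occ = [(frozenset({1, 2, 3}), 2), (frozenset({1, 2, 4}), 1)]
    print(is_closed({1, 2}, occ), is_maximal({1, 2}, occ, theta=2))
```

Both checks cost one pass over the reduced database, the same order as frequency counting.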
18
Results
19
Experiments
CPU, memory, OS: AMD Athlon XP 1600+, 224MB, Linux
Compared with: FP-growth, afopt, Mafia, Patriciamine, kDCI (all of these scored highly at the FIMI03 competition)
13 datasets of the FIMI repository
Result
- Fast at large supports for all instances of FI, CI, MFI
- Fast for all instances of CI (except for Accidents)
- Fast for all sparse datasets of FI, CI, MFI
- Slow only for Accidents and T40I10D100K of FI and MFI, and Pumsb* of MFI
20
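Summary of Results
large supports | FI | CI | MFI
sparse (7) | LCM | LCM | LCM
middle (5) | LCM | LCM | LCM
dense (1) | LCM | LCM | LCM

small supports | FI | CI | MFI
sparse (7) | LCM | LCM | LCM
middle (5) | Both | LCM | Both
dense (1) | Others | LCM | Others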
21
results
22
Conclusion
- When equivalence classes are large, PPC-extension and Hypercube decomposition work well
- Anytime database reduction and Occurrence deliver have advantages in initialization, sparse cases, and simplicity compared to Trie and Down project
- Backtracking-based pruning saves memory usage
- More database reduction works as well as approaches that store itemsets in memory
23
Future Work
LCM is weak at MFI mining and dense datasets
- More efficient pruning for MFI
- Some new data structures for dense cases
- Fast radix sort for anytime database reduction
- IO optimization
- ?????
24
List of Datasets
Real datasets
・ BMS-WebView-1
・ BMS-WebView-2
・ BMS-POS
・ Retail
・ Kosarak
・ Accidents
Machine learning benchmark
・ Chess
・ Mushroom
・ Pumsb
・ Pumsb*
・ Connect
Artificial datasets
・ T10I4D100K
・ T40I10D100K