1
LCM ver.3: Collaboration of Array, Bitmap and Prefix Tree for Frequent Itemset Mining
Takeaki Uno, Masashi Kiyomi (National Institute of Informatics, JAPAN), Hiroki Arimura (Hokkaido University, JAPAN)
20/Aug/2005, Open Source Data Mining '05
2
Computation of Pattern Mining

(#iterations) is not much larger than (#solutions) (linearly bounded!), i.e., linear time in #solutions, polynomial delay.

TIME = #iterations × (time of an iteration: frequency counting, data structure reconstruction, closure operation, pruning, ...) + I/O (coding technique)

For frequent itemsets and closed itemsets, the enumeration methods are almost optimal, so we focus on the data structure and the computation within an iteration.

Goal: clarify the features of enumeration algorithms on real-world data sets. For what cases (parameters) is which technique good? "Theoretical intuitions/evidences" are important.
3
Motivation

Several data structures have been proposed for storing huge datasets and accelerating the computation (frequency counting). Each has its own advantages and disadvantages:

1. Bitmap — Good: dense data, large support; Bad: sparse data, small support
2. Prefix tree — Good: non-sparse data, structured data; Bad: sparse data, non-structured data
3. Array list (with deletion of duplications) — Good: non-dense data; Bad: very dense data

Datasets have both dense parts and sparse parts. How can we fit?

[Figure: the same toy database shown as a bitmap, a prefix tree, and array lists]
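The three representations above can be sketched for one toy database. A minimal, illustrative Python sketch (the variable names and the four-transaction database are invented for the example, not taken from LCM's source):

```python
# One toy database stored three ways, mirroring the slide's trade-offs.

transactions = [{"a", "b"}, {"a", "c", "d", "e", "f"}, {"c", "d"}, {"a", "c", "e", "f"}]
items = sorted({i for t in transactions for i in t})        # ['a', ..., 'f']
index = {item: pos for pos, item in enumerate(items)}

# 1. Bitmap: one bit per (transaction, item); compact when data is dense.
bitmaps = [sum(1 << index[i] for i in t) for t in transactions]

# 2. Prefix tree: shared prefixes collapse into one path; good for structured data.
root = {}
for t in transactions:
    node = root
    for i in sorted(t):
        node = node.setdefault(i, {})

# 3. Array list: each transaction as a sorted list; good for sparse data.
arrays = [sorted(t) for t in transactions]

print(bitmaps)   # integers whose set bits mark the items present
print(arrays)
```

Each representation holds exactly the same database; only the memory layout (and hence the cost of frequency counting) differs.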
4
Observations

Usually, databases satisfy a power law: the part covered by a few items is dense, and the rest is very sparse. Using reduced conditional databases, in almost all iterations the size of the database is very small, so quick operations for small databases are very efficient.

[Figure: items × transactions matrix with a dense part and a sparse part, shrinking over the recursion depth]
5
Idea of Combination

Choose a constant c, and let F be the set of the c items of largest frequency. Split each transaction T into two parts: the dense part, composed of the items in F, and the sparse part, composed of the items not in F.

- Store the dense parts in a bitmap and the sparse parts in array lists.
- Use a prefix tree of constant size for frequency counting.

In this way we can take all their advantages.

[Figure: items × transactions matrix split at the c most frequent items into dense and sparse parts]
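The split above can be sketched as follows. This is a hedged illustration: the helper `split_database`, the choice c = 2, and the toy database are assumptions made for the example, not part of LCM v3's code.

```python
# Sketch: keep the c most frequent items as a bitmap, the rest as a sorted array.
from collections import Counter

def split_database(transactions, c):
    freq = Counter(i for t in transactions for i in t)
    dense_items = [i for i, _ in freq.most_common(c)]    # the c most frequent items
    pos = {i: b for b, i in enumerate(dense_items)}
    split = []
    for t in transactions:
        dense = sum(1 << pos[i] for i in t if i in pos)  # bitmap part
        sparse = sorted(i for i in t if i not in pos)    # array-list part
        split.append((dense, sparse))
    return dense_items, split

db = [["a", "b", "c"], ["a", "c", "d"], ["c", "d"]]
dense_items, split = split_database(db, 2)
print(dense_items, split)
```

Every transaction becomes one machine word (the dense bitmap) plus a short array, so the dense columns no longer blow up the array lists and the sparse tail no longer wastes bitmap space.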
6
Complete Prefix Tree

We use the complete prefix tree: the prefix tree including all patterns.

[Figure: complete prefix tree on items a, b, c, d]
7
Complete Prefix Tree

We use the complete prefix tree: the prefix tree including all patterns. The parent of a pattern is obtained by clearing its highest bit (Ex. 010110 → 000110), so no pointer is needed.

Ex) transactions {a,b,c}, {a,c,d}, {c,d}

We construct the complete prefix tree for the dense parts of the transactions. If c is small, then its size 2^c is not huge.

[Figure: complete prefix tree with nodes labeled by bit patterns 0000–1111]
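The pointer-free parent rule can be sketched in a few lines (the function name `parent` is illustrative):

```python
def parent(v):
    """Clear the highest set bit of v (v > 0), e.g. 0b010110 -> 0b000110.

    In the complete prefix tree every node is its own bit pattern, so the
    parent is computed arithmetically and no parent pointer is stored.
    """
    return v & ~(1 << (v.bit_length() - 1))

print(bin(parent(0b010110)))   # prints 0b110
```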
8
Complete Prefix Tree

We use the complete prefix tree: the prefix tree including all patterns. Any prefix tree is a subtree of it. The parent of a pattern is obtained by clearing its highest bit (Ex. 010110 → 000110), so no pointer is needed.

We construct the complete prefix tree for the dense parts of the transactions. If c is small, then its size 2^c is not huge.

Ex) transactions {a,b,c}, {a,c,d}, {c,d}
Ex) transactions {a,b,c,d}, {a}, {a,d}

[Figure: complete prefix tree with nodes labeled by bit patterns 0000–1111; the transactions' prefix tree embedded as a subtree]
9
Frequency Counting

- Frequency of a pattern (vertex) = # descendant leaves
- Occurrences obtained by adding item i = patterns with the ith bit = 1
- A bottom-up sweep is good: linear time in the size of the prefix tree

[Figure: complete prefix tree with the counts accumulated bottom-up at each node]
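The bottom-up sweep can be sketched as follows. This is a minimal illustration assuming c = 4 dense items and a `weight` array indexed by the dense-part bit pattern (all names are invented for the example):

```python
def parent(v):
    """Clear the highest set bit: the pointer-free parent in the complete prefix tree."""
    return v & ~(1 << (v.bit_length() - 1))

def subtree_counts(weight, c):
    """One bottom-up pass: each node accumulates its descendants' weights,
    so the total work is linear in the tree size 2**c."""
    count = list(weight)
    for v in range((1 << c) - 1, 0, -1):   # every child has a larger index
        count[parent(v)] += count[v]       # than its parent, so one pass suffices
    return count

c = 4
weight = [0] * (1 << c)
for t in (0b0111, 0b1011, 0b1100):         # dense parts of three transactions
    weight[t] += 1

count = subtree_counts(weight, c)
print(count[0])                            # prints 3: the root covers every transaction
```

Because `parent(v) < v` always holds, sweeping the array from the top index down visits every child before its parent, which is exactly the "bottom-up sweep" of the slide.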
10
"Constant Size" Dominates

How many iterations input a "constant size database"?

| dataset                                      | constant size DB (small supp., 1M solutions) | strategy changes | constant size DB (large supp., 1K solutions) | strategy changes |
| pumsb                                        | 99.9%                                        | 0.1%             | 99%                                          | 0.2%             |
| pumsb*, connect, chess, accidents            | 99%                                          | 0.1 - 0.5%       | 95 - 99%                                     | 0.2 - 1%         |
| kosarak, mushroom, BMS-WebView2, T40I10D100K | 90 - 99%                                     | 1 - 2%           | 30 - 90%                                     | 2 - 4%           |
| retail, BMS-pos, T10I4D100K                  | 30 - 90%                                     | 3 - 5%           | 30 - 90%                                     | 3 - 5%           |

"Small iterations" dominate the computation time, and the "strategy change" is not a heavy task.
11
More Advantages

- Reconstruction of prefix trees is a heavy task; the complete prefix tree needs no reconstruction.
- Coding prefix trees is not easy; the complete prefix tree is easy to code.
- The radix sort used for detecting identical transactions is heavy when the data is dense; bitmaps for the dense parts accelerate the radix sort.
12
For Closed/Maximal Itemsets

Compute the closure/maximality by storing the previously obtained itemsets; no additional function is needed. Depth-first search (closure extension type) needs the prefix of each itemset. By taking the intersection/weighted union of the prefixes at each node of the prefix tree, we can compute these efficiently (from LCM v2).

[Figure: complete prefix tree with the prefix of a pattern highlighted]
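The closure operation itself can be sketched generically. Note this is the textbook definition (intersection of all occurrences), not LCM v3's prefix-tree implementation; the names are illustrative:

```python
def closure(P, transactions):
    """Closure of itemset P: the intersection of all transactions containing P.

    An itemset is closed exactly when it equals its own closure.
    """
    occ = [t for t in transactions if P <= t]       # occurrences of P
    return set.intersection(*occ) if occ else set(P)

db = [{"a", "b", "c"}, {"a", "c", "d"}, {"c", "d"}]
print(sorted(closure({"a"}, db)))   # prints ['a', 'c']: every transaction with a also has c
```

Scanning all occurrences per pattern is expensive; the slide's point is that maintaining intersections/weighted unions of prefixes at the prefix-tree nodes yields the same information much more cheaply.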
13
Experiments

- CPU, memory, OS: Pentium4 3.2GHz, 2GB memory, Linux
- Compared with: FP-growth, afopt, Mafia, PatriciaMine, kDCI, nonordfp, aim2, DCI-closed (all of which marked high scores at the FIMI04 competition)
- 14 datasets of the FIMI repository
- We applied the data structure to LCM2

Memory usage decreased to half for dense datasets, but not for sparse datasets.
14
Experimental Results
16
Discussion and Future Work

- The combination of bitmaps and array lists reduces memory space efficiently for dense datasets.
- Using prefix trees for a constant number of items is sufficient for speeding up frequency counting on non-sparse datasets.
- The data structure is orthogonal to other methods for closed/maximal itemset mining: maximality check, pruning, closure operations, etc.
- Bitmaps and prefix trees are not so efficient for semi-structured data (semi-structure gives huge variations, which are hard to represent by bits and hard to share).
- Simplify the techniques so that they can be applied easily.
- Stable memory allocation (no need for dynamic allocation).

Future work: other pattern mining problems.