Frequent Itemset Mining on Graphics Processors Wenbin Fang, Mian Lu, Xiangye Xiao, Bingsheng He 1, Qiong Luo Hong Kong Univ. of Sci. and Tech. Microsoft Research Asia 1 Presenter: Wenbin Fang
2/33 Outline Contribution Introduction Design Evaluation Conclusion
3/33 Contribution
Accelerate the Apriori algorithm for Frequent Itemset Mining using Graphics Processors (GPUs). Two GPU implementations:
1. Pure Bitmap-based implementation (PBI): processing entirely on the GPU.
2. Trie-based implementation (TBI): GPU/CPU co-processing.
4/33 Frequent Itemset Mining (FIM)
Finding groups of items, or itemsets, that co-occur frequently in a transaction database.
Transactions: TID 1: A, B, C, D; TID 2: A, B, D; TID 3: A, C, D; TID 4: C, D.
Minimum support: 2. Frequent 1-itemsets: A: 3, B: 2, C: 3, D: 4.
5/33 Frequent Itemset Mining (FIM)
Aims at finding groups of items, or itemsets, that co-occur frequently in a transaction database. (Same transaction database, minimum support 2.)
Frequent 1-itemsets: A, B, C, D. Frequent 2-itemsets: AB: 2, AC: 2, AD: 3, BD: 2, CD: 3.
6/33 Frequent Itemset Mining (FIM)
Aims at finding groups of items, or itemsets, that co-occur frequently in a transaction database. (Same transaction database, minimum support 2.)
1-itemsets: A, B, C, D. 2-itemsets: AB, AC, AD, BD, CD. 3-itemsets: ABD, ACD.
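The counts on these three slides can be checked with a brute-force sketch (Python is used here purely for illustration; it is not the paper's code, and Apriori exists precisely to avoid this exhaustive enumeration at scale):

```python
from itertools import combinations

# Toy transaction database and minimum support from the slides.
transactions = [
    {"A", "B", "C", "D"},
    {"A", "B", "D"},
    {"A", "C", "D"},
    {"C", "D"},
]
min_support = 2

def support(itemset, db):
    """Number of transactions containing every item of the itemset."""
    return sum(1 for t in db if itemset <= t)

# Brute-force enumeration of all frequent itemsets (fine at toy scale).
items = sorted(set().union(*transactions))
frequent = {}
for k in range(1, len(items) + 1):
    for combo in combinations(items, k):
        s = support(set(combo), transactions)
        if s >= min_support:
            frequent["".join(combo)] = s
```

Running it reproduces the slides' results, e.g. AD with support 3 and the two frequent 3-itemsets ABD and ACD.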
7/33 Graphics Processors (GPUs)
Exist in commodity machines, mainly for graphics rendering. Specialized for compute-intensive, highly data-parallel applications. Compared with CPUs, GPUs provide 10x the computational horsepower and 10x higher memory bandwidth.
(CPU vs. GPU diagram, from the NVIDIA CUDA Programming Guide.)
8/33 Programming on GPUs OpenGL/DirectX AMD CTM NVIDIA CUDA The many-core architecture model of the GPU SIMD parallelism (Single Instruction, Multiple Data)
9/33 Hierarchical multi-threading in NVIDIA CUDA
Threads are grouped into thread blocks (configurable # of threads per block and # of thread blocks); a warp = 32 GPU threads is the SIMD scheduling unit.
10/33 General Purpose GPU Computing (GPGPU) Applications utilizing GPUs Scientific computing Molecular Dynamics Simulation Weather forecasting Linear algebra Computational finance Database applications Basic DB Operators [SIGMOD’04] Sorting [SIGMOD’06] Join [SIGMOD’08]
11/33 Our work
As a first step, we consider GPU-based Apriori, with the intention of extending to another efficient FIM algorithm, FP-growth. Why Apriori?
1. A classic algorithm for mining frequent itemsets.
2. Also applied in other data mining tasks, e.g., clustering and functional dependency.
12/33 The Apriori Algorithm
Input: 1) transaction database; 2) minimum support. Output: all frequent itemsets.
L_1 = {all frequent 1-itemsets}
k = 2
while (L_{k-1} != empty) {
  // Generate candidate k-itemsets.
  C_k <- self-join on L_{k-1}
  C_k <- (k-1)-subset test on C_k
  // Generate frequent k-itemsets.
  L_k <- support counting on C_k
  k += 1
}
Flow: frequent 1-itemsets -> candidate 2-itemsets -> frequent 2-itemsets -> candidate 3-itemsets -> frequent 3-itemsets -> ... -> candidate k-itemsets -> frequent k-itemsets.
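The pseudocode above can be rendered as a minimal runnable version (a sequential Python sketch of the same level-wise loop, not the paper's GPU implementation):

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Level-wise Apriori: self-join, (k-1)-subset test, support counting."""
    # L1: frequent 1-itemsets.
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    L = {s for s, c in counts.items() if c >= min_support}
    frequent = set(L)
    k = 2
    while L:
        # Self-join: union two (k-1)-itemsets that share k-2 items.
        candidates = {a | b for a in L for b in L if len(a | b) == k}
        # Subset test: every (k-1)-subset of a candidate must be frequent.
        candidates = {c for c in candidates
                      if all(frozenset(s) in L for s in combinations(c, k - 1))}
        # Support counting: keep candidates meeting the minimum support.
        L = {c for c in candidates
             if sum(1 for t in transactions if c <= t) >= min_support}
        frequent |= L
        k += 1
    return frequent
```

On the slides' toy database with minimum support 2, this returns the eleven frequent itemsets, including ABD and ACD.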
13/33 Outline Contribution Introduction Design Evaluation Conclusion
14/33 GPU-based Apriori
Input: 1) transaction database; 2) minimum support. Output: all frequent itemsets. The level-wise loop is the same as in the Apriori pseudocode: self-join, (k-1)-subset test, support counting.
Pure Bitmap-based Impl. (PBI): itemsets as bitmaps, candidate generation on the GPU; transactions as bitmaps, support counting on the GPU.
Trie-based Impl. (TBI): itemsets in a trie, candidate generation on the CPU; transactions as bitmaps, support counting on the GPU.
15/33 Horizontal and vertical data layout
Horizontal layout (support counting scans all transactions):
TID 1: A, B, C, D; TID 2: A, B, D; TID 3: A, C, D; TID 4: C, D.
Vertical layout (support counting works on specific itemsets):
AB: 1, 2; AC: 1, 3; AD: 1, 2, 3; BD: 1, 2; CD: 1, 3, 4; ABD: 1, 2; ACD: 1, 3.
1. Intersect two TID lists. 2. Count the number of transactions in the intersection result.
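The two vertical-layout counting steps can be sketched as follows (Python sets stand in for TID lists; the values come from the example database):

```python
# Vertical layout: each itemset maps to the set of TIDs containing it.
tid = {"AB": {1, 2}, "AD": {1, 2, 3}}

# Support of the candidate ABD:
abd_tids = tid["AB"] & tid["AD"]   # 1. intersect the two TID lists
support = len(abd_tids)            # 2. count the surviving transactions
```

The intersection {1, 2} says ABD occurs in transactions 1 and 2, so its support is 2.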
16/33 Bitmap representation for transactions
Rows are itemsets, columns are transactions T1..T4:
AB: 1100; AC: 1010; AD: 1110; BD: 1100; CD: 1011.
Intersection = bitwise AND operation. Counting = number of 1s in a string of bits.
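The same intersection as bit operations (one bit per transaction; putting T1 in the most significant bit is an illustrative choice, not mandated by the slides):

```python
# Transaction bitmaps: one row per itemset, bits T1..T4 from high to low.
bm = {"AB": 0b1100, "AC": 0b1010, "AD": 0b1110, "CD": 0b1011}

inter = bm["AB"] & bm["AD"]        # intersection = bitwise AND
support = bin(inter).count("1")    # counting = number of 1s in the bit string
```

AB AND AD gives 1100, whose two set bits are the support of ABD.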
17/33 Lookup table
The lookup table maps every 16-bit value to its count of 1s: index (0) -> 0, (1) -> 1, ..., (65534) -> 15, (65535) -> 16. With 2^16 one-byte entries it fits in constant memory: 1. cacheable; 2. 64 KB; 3. shared by all GPU threads.
Example: 2 = # of 1s = TABLE[12] (decimal 12 = binary 1100, a string of bits).
Candidate bitmaps for the running example: ABD: 1100; ACD: 1010.
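A sketch of the table and its use (the sizes match the slide: 2^16 one-byte entries = 64 KB; a Python list stands in for GPU constant memory):

```python
# One entry per 16-bit value: the entry is that value's count of 1-bits.
TABLE = [bin(i).count("1") for i in range(1 << 16)]

def popcount32(x):
    """Count the 1s of a 32-bit word with two 16-bit table lookups."""
    return TABLE[x & 0xFFFF] + TABLE[(x >> 16) & 0xFFFF]
```

TABLE[12] is 2 (12 is binary 1100) and TABLE[65535] is 16, matching the slide's endpoints.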
18/33 Support Counting on the GPU (cont.)
One thread block per candidate itemset: block 1 handles ABD, block 2 handles ACD.
Transaction bitmaps: AB: 1100; AC: 1010; AD: 1110; BD: 1100; CD: 1011. Candidates: ABD: 1100; ACD: 1010.
1. Intersect two transaction lists (bitwise AND). 2. Count the number of transactions in the intersection result via the lookup table. Support(ABD) = 2.
19/33 Support Counting on the GPU (cont.)
Within a thread block, each thread ANDs one 32-bit int of the two parent bitmaps (e.g., AB AND AD for ABD), looks up the count of 1s of each 16-bit half in the lookup table, and a parallel reduce sums the per-thread counts into the itemset's support. Memory is read through the vector type int4 (four ints in one instruction).
Example: counts of 1s: 2 -> support: 2.
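One thread block's work can be modelled sequentially (a Python stand-in: the loop body is what each thread does for its word, and the final sum stands in for the block's parallel reduction):

```python
# 16-bit popcount table, as on the previous slide.
TABLE = [bin(i).count("1") for i in range(1 << 16)]

def itemset_support(row_a, row_b, table=TABLE):
    """AND the 32-bit words of two parent bitmaps (e.g. AB and AD for ABD),
    look up the 1-count of each 16-bit half, and sum the partial counts
    (the sum models the thread block's parallel reduction)."""
    total = 0
    for wa, wb in zip(row_a, row_b):
        w = wa & wb
        total += table[w & 0xFFFF] + table[(w >> 16) & 0xFFFF]
    return total
```

For the 4-transaction example, itemset_support([0b1100], [0b1110]) gives 2, the support of ABD.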
20/33 GPU-based Apriori: candidate generation
(Support counting runs on the GPU.) Candidate generation has two steps:
1. Join: e.g., join two 2-itemsets to obtain a candidate 3-itemset: AC JOIN AD => ACD.
2. Subset test: e.g., test all 2-subsets of ACD: {AC, AD, CD}.
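The two candidate-generation steps, sketched on sorted item strings (hypothetical helper names for illustration, not the paper's API):

```python
from itertools import combinations

def join(a, b):
    """Join two sorted (k-1)-itemsets that share their first k-2 items."""
    if a[:-1] == b[:-1] and a[-1] < b[-1]:
        return a + b[-1:]
    return None

def subset_test(candidate, frequent_prev):
    """Keep a candidate only if all its (k-1)-subsets are frequent."""
    k = len(candidate)
    return all("".join(s) in frequent_prev
               for s in combinations(candidate, k - 1))

L2 = {"AB", "AC", "AD", "BD", "CD"}
cand = join("AC", "AD")        # "ACD"
ok = subset_test(cand, L2)     # its 2-subsets {AC, AD, CD} are all frequent
```

By contrast, ABC would fail the subset test because BC is not frequent.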
21/33 GPU-based Apriori: Pure Bitmap-based Impl. (PBI): itemsets as bitmaps, candidate generation on the GPU.
22/33 Pure Bitmap-based Impl. (PBI)
One GPU thread generates one candidate itemset. Itemsets are bitmaps over the items (columns A, B, C, D; rows = itemsets):
AB: 1100; AC: 1010; AD: 1001; BD: 0101; CD: 0011 -> ABD: 1101; ACD: 1011.
Join = bitwise OR (e.g., AB JOIN AD = ABD). Subset test = binary search (e.g., for the 2-subsets {AB, AD, BD}).
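A sketch of PBI's two operations (mapping items A, B, C, D to bits 3..0 is an illustrative assumption; the frequent 2-itemsets are kept sorted so membership can be tested by binary search, as on the GPU):

```python
from bisect import bisect_left

# Frequent 2-itemsets as bitmaps over the items (A,B,C,D -> bits 3..0).
L2 = sorted([0b1100, 0b1010, 0b1001, 0b0101, 0b0011])  # AB, AC, AD, BD, CD

def contains(sorted_bms, bm):
    """Binary-search a sorted array of itemset bitmaps."""
    i = bisect_left(sorted_bms, bm)
    return i < len(sorted_bms) and sorted_bms[i] == bm

ab, ad = 0b1100, 0b1001
abd = ab | ad                      # join = bitwise OR -> ABD (0b1101)
# Subset test: drop each set bit in turn and binary-search the result.
subsets = [abd & ~(1 << b) for b in range(4) if abd >> b & 1]
is_candidate = all(contains(L2, s) for s in subsets)
```

Here the 2-subsets of ABD are AB, AD, and BD, all frequent, so ABD survives; BC (0b0110) is correctly absent from L2.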
23/33 GPU-based Apriori: Trie-based Impl. (TBI): itemsets in a trie, candidate generation on the CPU; transactions as bitmaps, support counting on the GPU.
24/33 Trie-based Impl. (TBI)
1-itemsets {A, B, C, D} and 2-itemsets {AB, AC, AD, BD, CD} are stored in a trie: depth 0 is the root; depth 1 holds A, B, C, D; depth 2 holds B, C, D under A, D under B, and D under C.
Candidate generation joins siblings: AB JOIN AC = ABC (2-subsets {AB, AC, BC}); AB JOIN AD = ABD ({AB, AD, BD}); AC JOIN AD = ACD ({AC, AD, CD}).
Candidate 3-itemsets after the subset test: {ABD, ACD} (ABC is pruned because BC is not frequent).
Candidate generation runs on the CPU, since trie traversal on the GPU suffers from 1) irregular memory access and 2) branch divergence.
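TBI's CPU-side candidate generation can be sketched with a nested-dict trie (a simplified stand-in for the paper's trie; the subset-test pruning is the standard Apriori step):

```python
from itertools import combinations

def insert(trie, itemset):
    """Insert one itemset (a string of items) as a path in the trie."""
    node = trie
    for item in itemset:
        node = node.setdefault(item, {})

def gen_candidates(node, prefix, k, out):
    """Join every pair of sibling items sharing the same (k-2)-prefix."""
    if len(prefix) == k - 2:
        last = sorted(node)
        for i, a in enumerate(last):
            for b in last[i + 1:]:
                out.append(prefix + (a, b))
        return
    for item, child in sorted(node.items()):
        gen_candidates(child, prefix + (item,), k, out)

L2 = ["AB", "AC", "AD", "BD", "CD"]
trie = {}
for s in L2:
    insert(trie, s)

raw = []
gen_candidates(trie, (), 3, raw)   # joins under A give ABC, ABD, ACD
# (k-1)-subset test: ABC is pruned because its subset BC is not frequent.
C3 = ["".join(c) for c in raw
      if all("".join(sub) in set(L2) for sub in combinations(c, 2))]
```

The surviving candidates are ABD and ACD, as on the slide.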
25/33 Outline Contribution Introduction Design Evaluation Conclusion
26/33 Experimental setup
Platform: Intel Core2 quad-core CPU (2.66 GHz x 4) vs. NVIDIA GTX 280 GPU (1.3 GHz x 30 x 8 processors); the GPU's memory bandwidth (GB/sec) is roughly 10x the CPU's. Development environment: Windows XP + Visual Studio + CUDA.
Experimental datasets (Density = Avg. Length / # Items):
T40I10D100K (synthetic): 1,000 items, avg. length 40, 100,000 transactions, density 4%.
Retail: 16,470 items, avg. length 10.3, 88,162 transactions, density 0.06%.
Chess: 75 items, avg. length 37, 3,196 transactions, density 49%.
27/33 Apriori Implementations
Impl. | Candidate generation (itemsets) | Support counting (transactions)
BORGELT | single-threaded on the CPU (trie) | single-threaded on the CPU
GOETHALS | single-threaded on the CPU (trie) | multi-threaded on the CPU (horizontal layout)
TBI-CPU | single-threaded on the CPU (trie) | multi-threaded on the CPU (bitmap)
TBI-GPU | single-threaded on the CPU (trie) | multi-threaded on the GPU (bitmap)
PBI-GPU | multi-threaded on the GPU (bitmap) | multi-threaded on the GPU (bitmap)
BORGELT is the best Apriori implementation in the FIMI repository (Frequent Itemset Mining Implementations Repository).
28/33 TBI-CPU vs. GOETHALS
TBI-CPU: trie itemsets / candidate generation on the CPU; bitmap transactions / support counting on the CPU.
GOETHALS: trie / CPU; horizontal layout / CPU.
(Charts: dense dataset Chess, sparse dataset Retail.)
Shows the impact of the bitmap representation for transactions in support counting: 1.2x ~ 25.7x speedup.
29/33 TBI-GPU vs. TBI-CPU
TBI-GPU: trie itemsets / candidate generation on the CPU; bitmap transactions / support counting on the GPU.
TBI-CPU: trie / CPU; bitmap / CPU.
(Charts: dense dataset Chess, sparse dataset Retail.)
Shows the impact of GPU acceleration in support counting: 1.1x ~ 7.8x speedup.
30/33 PBI-GPU vs. TBI-GPU
PBI-GPU: bitmap itemsets / candidate generation on the GPU; bitmap transactions / support counting on the GPU.
TBI-GPU: trie / CPU; bitmap / GPU.
(Charts: dense dataset Chess, sparse dataset Retail.)
Shows the impact of bitmap-based vs. trie-based itemsets in candidate generation: PBI-GPU is faster on the dense dataset; TBI-GPU is better on the sparse dataset.
31/33 PBI-GPU/TBI-GPU vs. BORGELT
PBI-GPU: bitmap itemsets / candidate generation on the GPU; bitmap transactions / support counting on the GPU.
TBI-GPU: trie / CPU; bitmap / GPU.
BORGELT: trie / CPU.
(Charts: dense dataset Chess, sparse dataset Retail.)
Comparison to the best Apriori implementation in FIMI: 1.2x ~ 24.2x speedup.
32/33 Comparison to FP-growth (implementation from the PARSEC benchmark), with minsup 1%, 60%, and 0.01%.
33/33 Conclusion
GPU-based Apriori:
Pure Bitmap-based impl.: bitmap representation for itemsets and for transactions; GPU processing.
Trie-based impl.: trie representation for itemsets; bitmap representation for transactions; GPU + CPU co-processing.
Better than CPU-based Apriori, but still worse than CPU-based FP-growth.
Backup slide: time breakdown on the dense dataset Chess and on the sparse dataset Retail.