林俊宏 Parallel Association Rule Mining based on FI-Growth Algorithm Bundit Manaskasemsak, Nunnapus Benjamas, Arnon Rungsawang
Outline Introduction 1 FI-Growth algorithm Parallel FI-Growth Experiments and results Conclusion 5
Introduction Association rule mining is one of the most important techniques in data mining. consists of two main steps: frequent itemsets generation tries to extract the most frequent patterns; rule generation uses these frequent patterns to generate interesting rules. 3 林俊宏
Two fundamental algorithms proposed for finding the frequent itemsets from large databases Apriori algorithm Closed algorithm Proposed to reduce this cost. The Fp-growth algorithm FI-growth algorithm Introduction 4 林俊宏
Transaction-oriented databases are usually very large. Mining useful rules from such large and volatile databases is a challenging problem. Fast association rule mining inevitably requires large computing resources. cluster computing technology offers a potential solution parallel Apriori approach, parallel FP-growth approach Introduction 5 林俊宏
The objective of this paper utilize parallelization on a computing cluster environment for fast extraction of frequent itemsets from large dense databases. propose an alternative approach parallel association rule mining based on the FI- growth algorithm Introduction 6 林俊宏
Similar to the FP-growth algorithm, FI-growth represents the data set as a prefix sharing tree, called an “FI-tree”. It commonly consists of two phases: FI-tree construction Mining FI-Growth algorithm 7 林俊宏
FI-Growth algorithm Constructing an FI-tree requires scanning the database only twice: the first scan creates the header table the second scan creates the items-tree. A3 B1 C4 D2 E4 F4 A3 C4 D2 E4 F4 Note that : the items in all lists must be in the same relative order. 8 林俊宏
Combining operation the same sub-paths are grouped and their counts summed. The combining operation has the following properties. 1) Self-reflective property: tree(a) © tree(a) is equal to tree(a) itself. 2) Commutative property: tree(a 1 ) © tree(a 2 ) is equal to tree(a 2 ) © tree(a 1 ). 3) Associative property: (tree(a 1 ) © tree(a 2 )) © tree(a 3 ) is equal to tree(a 1 ) © (tree(a 2 ) © tree(a 3 )). FI-Growth algorithm e: 1 d:2 f: 1 e: 1 d:2 f: 1 e: 1 d:2 f: 1 9 林俊宏
The result (grey nodes) replaces the old one that is linked from root. 10 林俊宏
root a:3 c:2 e:1 d:2 c:2 e:1 e:2 f:2 f:1 f:4 f:3 e:4 e:1 d:2 f:1 e:1 d:2 f:1 f:2 FI-Growth algorithm Branching step Subset finding step Pruning step 11 林俊宏
Parallel FI-Growth a parallel version of the FI-growth algorithm employ a data parallelism technique on a PC cluster partition the transaction one-time synchronization to exchange their sub-trees 12 林俊宏
Hierarchical minimum support two solutions to avoid such a problem: All processors synchronize their lists of item counts utilizing two values of minimum support: min_supL1 is defined and used to prune the local header table min_supL2 is defined to prune the local items-tree. in this paper, we use the second approach. Parallel FI-Growth 13 林俊宏
Parallelization min_supL1 = 1(20%) min_supL2 = 2(40%) Parallel FI-Growth 14 林俊宏
FI-Tree synchronization Exchanging of local header table: To reduce the communication overhead, only the list of items is broadcast to other processors. Sending of local sub-tree: which local sub-tree(s) should be kept, and which should be sent to the target processors Parallel FI-Growth 15 林俊宏
Experiments and results Hardware and environment configuration: Tested on a cluster of x86-64 based SMP machines named “Bedrocks”. Each machine consists of dual 3.2GHz Intel quad-core processors, 4GB of main memory, and an 80GB SATA disk. equipped with the Linux-based operating system inter-connected via a 1000Base-TX Ethernet switch the parallel algorithm is written in the C language uses the MPICH message passing library version All experiments were run under no-load conditions 16 林俊宏
Data set: For the test data set, we utilized the standard “IBM synthetic data generator” to synthesize a transaction database unique items 16 million records (each has average transaction length of 10) Experiments and results 17 林俊宏
18 林俊宏
Conclusion research in many areas, including run-time memory requirements In this paper propose a parallel FI-growth algorithm to accelerate association rule mining. In future work, effects of partitioning memory requirements reduce the communication overhead load balancing 19 林俊宏
20 林俊宏