Parallel Association Rule Mining based on FI-Growth Algorithm
Bundit Manaskasemsak, Nunnapus Benjamas, Arnon Rungsawang
林俊宏 2010.06.01


Outline
1 Introduction
2 FI-Growth algorithm
3 Parallel FI-Growth
4 Experiments and results
5 Conclusion

Introduction
- Association rule mining is one of the most important techniques in data mining.
- It consists of two main steps:
  - Frequent itemset generation extracts the most frequent patterns.
  - Rule generation uses these frequent patterns to generate interesting rules.
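The two steps above can be illustrated with a brute-force sketch. This is not the FI-growth algorithm and not the authors' code: it enumerates every candidate itemset, which is only practical for tiny inputs. The function names, the toy transactions, and the use of confidence as the measure of an "interesting" rule are assumptions made for illustration.

```python
from itertools import combinations

def frequent_itemsets(transactions, min_count):
    """Step 1: count every candidate itemset and keep the frequent ones."""
    items = sorted({i for t in transactions for i in t})
    result = {}
    for size in range(1, len(items) + 1):
        for cand in combinations(items, size):
            count = sum(1 for t in transactions if set(cand) <= set(t))
            if count >= min_count:
                result[frozenset(cand)] = count
    return result

def generate_rules(freq, min_conf):
    """Step 2: split each frequent itemset into lhs -> rhs and keep
    the rules whose confidence reaches min_conf (assumed measure)."""
    out = []
    for itemset, count in freq.items():
        for k in range(1, len(itemset)):
            for lhs in combinations(sorted(itemset), k):
                lhs = frozenset(lhs)
                # Every subset of a frequent itemset is itself frequent,
                # so freq[lhs] is always present.
                conf = count / freq[lhs]
                if conf >= min_conf:
                    out.append((lhs, itemset - lhs, conf))
    return out

# Hypothetical toy data
transactions = [{"a", "b"}, {"a", "b", "c"}, {"a", "c"}, {"b"}]
freq = frequent_itemsets(transactions, min_count=2)
found = generate_rules(freq, min_conf=0.9)
```

With this toy data, the only rule that reaches the 0.9 threshold is c -> a.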

Introduction
- Two fundamental algorithms were proposed for finding frequent itemsets in large databases:
  - Apriori algorithm
  - Closed algorithm
- Algorithms proposed to reduce the cost of these approaches:
  - FP-growth algorithm
  - FI-growth algorithm

Introduction
- Transaction-oriented databases are usually very large.
- Mining useful rules from such large and volatile databases is a challenging problem.
- Fast association rule mining inevitably requires large computing resources.
- Cluster computing technology offers a potential solution:
  - parallel Apriori approach
  - parallel FP-growth approach

Introduction
- The objective of this paper:
  - Utilize parallelization on a computing cluster environment for fast extraction of frequent itemsets from large dense databases.
  - Propose an alternative approach: parallel association rule mining based on the FI-growth algorithm.

FI-Growth algorithm
- Similar to the FP-growth algorithm, FI-growth represents the data set as a prefix-sharing tree, called an “FI-tree”.
- It consists of two phases:
  - FI-tree construction
  - Mining

FI-Growth algorithm
- Constructing an FI-tree requires scanning the database only twice:
  - The first scan creates the header table.
  - The second scan creates the items-tree.
- Note that the items in all lists must be in the same relative order.
[Figure: header table with item counts A:3, B:1, C:4, D:2, E:4, F:4, followed by the list A:3, C:4, D:2, E:4, F:4]
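The two-scan construction can be sketched as follows. This is a generic prefix-sharing tree, not necessarily the exact FI-tree layout: the slides do not give the node structure, so the `[count, children]` representation, the function names, the alphabetical item order, and the toy transactions are all assumptions.

```python
from collections import Counter

def build_header_table(transactions, min_count):
    """First scan: count each item and keep those meeting min_count."""
    counts = Counter(item for t in transactions for item in set(t))
    return {item: c for item, c in counts.items() if c >= min_count}

def build_items_tree(transactions, header):
    """Second scan: insert each transaction as a path of [count, children]
    nodes. Items are sorted so every list uses the same relative order
    (alphabetical order is an assumption of this sketch)."""
    root = {}
    for t in transactions:
        node = root
        for item in sorted(i for i in set(t) if i in header):
            entry = node.setdefault(item, [0, {}])
            entry[0] += 1      # one more transaction shares this prefix
            node = entry[1]    # descend into the children
    return root

# Hypothetical toy database
transactions = [["a", "c", "e"], ["a", "b", "c"], ["c", "e"], ["a", "e", "f"]]
header = build_header_table(transactions, min_count=2)
tree = build_items_tree(transactions, header)
```

In this toy run the items b and f each occur once, so they are dropped from the header table and never enter the tree.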

FI-Growth algorithm
- Combining operation: the same sub-paths are grouped and their counts summed.
- The combining operation (written ©) has the following properties:
  1) Self-reflective property: tree(a) © tree(a) is equal to tree(a) itself.
  2) Commutative property: tree(a1) © tree(a2) is equal to tree(a2) © tree(a1).
  3) Associative property: (tree(a1) © tree(a2)) © tree(a3) is equal to tree(a1) © (tree(a2) © tree(a3)).
[Figure: three sub-trees, each with nodes d:2, e:1, f:1]
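A minimal sketch of grouping identical sub-paths and summing their counts, assuming each tree is a dictionary mapping an item to `[count, children]`. The representation and the function name `combine` are assumptions. The sketch only illustrates the summation and checks the commutative and associative properties on toy data; it does not model the self-reflective property, which depends on FI-tree details the slides do not give.

```python
def combine(t1, t2):
    """Merge two prefix trees: nodes on the same sub-path are grouped
    and their counts summed; other nodes are carried over unchanged."""
    merged = {}
    for item in sorted(set(t1) | set(t2)):
        c1, kids1 = t1.get(item, (0, {}))
        c2, kids2 = t2.get(item, (0, {}))
        merged[item] = [c1 + c2, combine(kids1, kids2)]
    return merged

# Hypothetical sub-trees
t1 = {"d": [2, {"e": [1, {}], "f": [1, {}]}]}
t2 = {"d": [1, {"e": [1, {}]}], "e": [1, {}]}
t3 = {"e": [2, {"f": [1, {}]}]}

combined = combine(t1, t2)
```

Because the merge is a per-path sum, the order in which sub-trees are combined does not change the result, which is what lets sub-trees be merged in any order.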

The result (grey nodes) replaces the old one that is linked from root.

FI-Growth algorithm
- Branching step
- Subset finding step
- Pruning step
[Figure: example FI-tree under "root", with item:count nodes such as a:3, c:2, d:2, e:4, f:4]

Parallel FI-Growth
- A parallel version of the FI-growth algorithm
- Employs a data parallelism technique on a PC cluster
- Partitions the transactions
- Uses a one-time synchronization in which processors exchange their sub-trees

Parallel FI-Growth
- Hierarchical minimum support
- Two solutions to avoid such a problem:
  - All processors synchronize their lists of item counts.
  - Use two values of minimum support:
    - min_supL1 is defined and used to prune the local header table.
    - min_supL2 is defined to prune the local items-tree.
- In this paper, we use the second approach.

Parallel FI-Growth
- Parallelization
  - min_supL1 = 1 (20%)
  - min_supL2 = 2 (40%)
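The figures 1 (20%) and 2 (40%) are consistent with a local partition of five transactions, which is an assumption of the sketch below. It converts the percentages into absolute counts and applies min_supL1 to a local header table. The helper names and the toy partition are hypothetical, and the sketch does not attempt to show how min_supL2 is applied to the items-tree, since the slides do not spell that out.

```python
import math
from collections import Counter

def absolute_min_support(percent, num_transactions):
    """Convert a percentage threshold into a minimum transaction count."""
    return math.ceil(percent * num_transactions / 100)

def prune_header(counts, min_count):
    """Keep only the items whose local count reaches min_count."""
    return {item: c for item, c in counts.items() if c >= min_count}

# Hypothetical local partition held by one processor (5 transactions)
local_partition = [["a", "c"], ["a", "d"], ["c", "e"], ["a", "c", "e"], ["b", "c"]]

min_supL1 = absolute_min_support(20, len(local_partition))
min_supL2 = absolute_min_support(40, len(local_partition))

local_counts = Counter(item for t in local_partition for item in set(t))
local_header = prune_header(local_counts, min_supL1)
```

With the lower threshold min_supL1, items that occur only once locally (here b and d) stay in the local header table.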

Parallel FI-Growth
- FI-tree synchronization
  - Exchanging of local header tables: to reduce the communication overhead, only the list of items is broadcast to other processors.
  - Sending of local sub-trees: each processor determines which local sub-tree(s) should be kept and which should be sent to the target processors.
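The keep-or-send decision can be sketched as below. The slides do not state how items are mapped to target processors, so the round-robin assignment over the sorted union of the broadcast item lists is purely a hypothetical rule, as are the function names and the toy data. No message passing is performed; the sketch only computes what a processor would keep and what it would send.

```python
def assign_owners(item_lists, num_procs):
    """Map each item to a target processor. Round-robin over the sorted
    union of all broadcast item lists is an assumed rule."""
    all_items = sorted(set().union(*item_lists))
    return {item: idx % num_procs for idx, item in enumerate(all_items)}

def split_subtrees(local_tree, owners, rank):
    """Separate the sub-trees this processor keeps from those it sends,
    grouping the outgoing ones by target processor."""
    keep, send = {}, {}
    for item, subtree in local_tree.items():
        target = owners[item]
        if target == rank:
            keep[item] = subtree
        else:
            send.setdefault(target, {})[item] = subtree
    return keep, send

# Hypothetical item lists broadcast by two processors
item_lists = [["a", "c", "e"], ["c", "d", "f"]]
owners = assign_owners(item_lists, num_procs=2)

# Hypothetical local tree on processor 0: item -> [count, children]
local_tree = {"a": [3, {}], "c": [2, {}], "e": [1, {}]}
keep, send = split_subtrees(local_tree, owners, rank=0)
```

Since every processor derives the same owner map from the same broadcast item lists, each can decide locally what to send without any further coordination.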

Experiments and results
- Hardware and environment configuration:
  - Tested on a cluster of x86-64 based SMP machines named “Bedrocks”.
  - Each machine has dual 3.2GHz Intel quad-core processors, 4GB of main memory, and an 80GB SATA disk.
  - Each machine runs a Linux-based operating system.
  - The machines are interconnected via a 1000Base-TX Ethernet switch.
  - The parallel algorithm is written in the C language.
  - It uses the MPICH message passing library, version 1.2.7.
  - All experiments were run under no-load conditions.

Experiments and results
- Data set:
  - For the test data set, we used the standard “IBM synthetic data generator” to synthesize a transaction database.
  - 1000 unique items
  - 16 million records (average transaction length of 10)


Conclusion
- Research in many areas, including:
  - run-time
  - memory requirements
- In this paper:
  - We propose a parallel FI-growth algorithm to accelerate association rule mining.
- In future work:
  - effects of partitioning
  - memory requirements
  - reducing the communication overhead
  - load balancing
