林俊宏 2010.06.01 Parallel Association Rule Mining based on FI-Growth Algorithm Bundit Manaskasemsak, Nunnapus Benjamas, Arnon Rungsawang.

Slides:



Advertisements
Similar presentations
Mining Association Rules
Advertisements

Frequent Itemset Mining Methods. The Apriori algorithm Finding frequent itemsets using candidate generation Seminal algorithm proposed by R. Agrawal and.
Mining Frequent Patterns Using FP-Growth Method Ivan Tanasić Department of Computer Engineering and Computer Science, School of Electrical.
A distributed method for mining association rules
CSE 634 Data Mining Techniques
FP-Growth algorithm Vasiljevic Vladica,
FP (FREQUENT PATTERN)-GROWTH ALGORITHM ERTAN LJAJIĆ, 3392/2013 Elektrotehnički fakultet Univerziteta u Beogradu.
Data Mining Association Analysis: Basic Concepts and Algorithms
SKELETON BASED PERFORMANCE PREDICTION ON SHARED NETWORKS Sukhdeep Sodhi Microsoft Corp Jaspal Subhlok University of Houston.
Association Rules l Mining Association Rules between Sets of Items in Large Databases (R. Agrawal, T. Imielinski & A. Swami) l Fast Algorithms for.
Data Mining Techniques So Far: Cluster analysis K-means Classification Decision Trees J48 (C4.5) Rule-based classification JRIP (RIPPER) Logistic Regression.
Data Mining Association Analysis: Basic Concepts and Algorithms Lecture Notes for Chapter 6 Introduction to Data Mining by Tan, Steinbach, Kumar © Tan,Steinbach,
732A02 Data Mining - Clustering and Association Analysis ………………… Jose M. Peña Association rules Apriori algorithm FP grow algorithm.
FP-growth. Challenges of Frequent Pattern Mining Improving Apriori Fp-growth Fp-tree Mining frequent patterns with FP-tree Visualization of Association.
1 Complexity of Network Synchronization Raeda Naamnieh.
Data Mining Association Analysis: Basic Concepts and Algorithms
1 Mining Frequent Patterns Without Candidate Generation Apriori-like algorithm suffers from long patterns or quite low minimum support thresholds. Two.
4/3/01CS632 - Data Mining1 Data Mining Presented By: Kevin Seng.
Association Analysis: Basic Concepts and Algorithms.
Association Rule Mining. Generating assoc. rules from frequent itemsets  Assume that we have discovered the frequent itemsets and their support  How.
Data Mining Association Analysis: Basic Concepts and Algorithms
Efficient Data Mining for Path Traversal Patterns CS401 Paper Presentation Chaoqiang chen Guang Xu.
Fast Algorithms for Association Rule Mining
SEG Tutorial 2 – Frequent Pattern Mining.
CLUSTER COMPUTING Prepared by: Kalpesh Sindha (ITSNS)
USpan: An Efficient Algorithm for Mining High Utility Sequential Patterns Authors: Junfu Yin, Zhigang Zheng, Longbing Cao In: Proceedings of the 18th ACM.
Mining Frequent Patterns without Candidate Generation Presented by Song Wang. March 18 th, 2009 Data Mining Class Slides Modified From Mohammed and Zhenyu’s.
Sequential PAttern Mining using A Bitmap Representation
Secure Incremental Maintenance of Distributed Association Rules.
Jiawei Han, Jian Pei, and Yiwen Yin School of Computing Science Simon Fraser University Mining Frequent Patterns without Candidate Generation SIGMOD 2000.
AR mining Implementation and comparison of three AR mining algorithms Xuehai Wang, Xiaobo Chen, Shen chen CSCI6405 class project.
Data Mining Frequent-Pattern Tree Approach Towards ARM Lecture
Data Mining Association Analysis: Basic Concepts and Algorithms Lecture Notes for Chapter 6 Introduction to Data Mining by Tan, Steinbach, Kumar © Tan,Steinbach,
EFFICIENT ITEMSET EXTRACTION USING IMINE INDEX By By U.P.Pushpavalli U.P.Pushpavalli II Year ME(CSE) II Year ME(CSE)
Efficient Data Mining for Calling Path Patterns in GSM Networks Information Systems, accepted 5 December 2002 SPEAKER: YAO-TE WANG ( 王耀德 )
Tools for Privacy Preserving Distributed Data Mining
Mining High Utility Itemset in Big Data
Shared Memory Parallelization of Decision Tree Construction Using a General Middleware Ruoming Jin Gagan Agrawal Department of Computer and Information.
Mining Frequent Patterns without Candidate Generation.
Load-Balancing Routing in Multichannel Hybrid Wireless Networks With Single Network Interface So, J.; Vaidya, N. H.; Vehicular Technology, IEEE Transactions.
Parallelization of Classification Algorithms For Medical Imaging on a Cluster Computing System 指導教授 : 梁廷宇 老師 系所 : 碩光通一甲 姓名 : 吳秉謙 學號 :
Association Rule Mining in Peer-to-Peer Systems Ran Wolff Assaf Shcuster Department of Computer Science Technion I.I.T. Haifa 32000,Isreal.
Mining Frequent Patterns without Candidate Generation : A Frequent-Pattern Tree Approach 指導教授:廖述賢博士 報 告 人:朱 佩 慧 班 級:管科所博一.
Parallel Mining Frequent Patterns: A Sampling-based Approach Shengnan Cong.
Cache Coherence Protocols 1 Cache Coherence Protocols in Shared Memory Multiprocessors Mehmet Şenvar.
1 Inverted Matrix: Efficient Discovery of Frequent Items in Large Datasets in the Context of Interactive Mining -SIGKDD’03 Mohammad El-Hajj, Osmar R. Zaïane.
CanTree: a tree structure for efficient incremental mining of frequent patterns Carson Kai-Sang Leung, Quamrul I. Khan, Tariqul Hoque ICDM ’ 05 報告者:林靜怡.
 Frequent Word Combinations Mining and Indexing on HBase Hemanth Gokavarapu Santhosh Kumar Saminathan.
Association Analysis (3)
High-level Interfaces for Scalable Data Mining Ruoming Jin Gagan Agrawal Department of Computer and Information Sciences Ohio State University.
HEMANTH GOKAVARAPU SANTHOSH KUMAR SAMINATHAN Frequent Word Combinations Mining and Indexing on HBase.
A Scalable Association Rules Mining Algorithm Based on Sorting, Indexing and Trimming Chuang-Kai Chiou, Judy C. R Tseng Proceedings of the Sixth International.
Hierarchical Load Balancing for Large Scale Supercomputers Gengbin Zheng Charm++ Workshop 2010 Parallel Programming Lab, UIUC 1Charm++ Workshop 2010.
By Shivaraman Janakiraman, Magesh Khanna Vadivelu.
1 Top Down FP-Growth for Association Rule Mining By Ke Wang.
Data Mining Practical Machine Learning Tools and Techniques Chapter 6.3: Association Rules Rodney Nielsen Many / most of these slides were adapted from:
Implementation of Classifier Tool in Twister Magesh khanna Vadivelu Shivaraman Janakiraman.
Fast Mining Frequent Patterns with Secondary Memory Kawuu W. Lin, Sheng-Hao Chung, Sheng-Shiung Huang and Chun-Cheng Lin Department of Computer Science.
MapReduce MapReduce is one of the most popular distributed programming models Model has two phases: Map Phase: Distributed processing based on key, value.
Indexing Structures for Files and Physical Database Design
Byung Joon Park, Sung Hee Kim
Market Basket Analysis and Association Rules
Communication and Memory Efficient Parallel Decision Tree Construction
A Parameterised Algorithm for Mining Association Rules
Mining Association Rules from Stars
COMP5331 FP-Tree Prepared by Raymond Wong Presented by Raymond Wong
732A02 Data Mining - Clustering and Association Analysis
Market Basket Analysis and Association Rules
FP-Growth Wenlong Zhang.
Finding Frequent Itemsets by Transaction Mapping
Presentation transcript:

林俊宏 Parallel Association Rule Mining based on FI-Growth Algorithm Bundit Manaskasemsak, Nunnapus Benjamas, Arnon Rungsawang

Outline Introduction 1 FI-Growth algorithm Parallel FI-Growth Experiments and results Conclusion 5

Introduction  Association rule mining is one of the most important techniques in data mining.  consists of two main steps:  frequent itemsets generation tries to extract the most frequent patterns;  rule generation uses these frequent patterns to generate interesting rules. 3 林俊宏

 Two fundamental algorithms proposed for finding the frequent itemsets from large databases  Apriori algorithm  Closed algorithm  Proposed to reduce this cost.  The Fp-growth algorithm  FI-growth algorithm Introduction 4 林俊宏

 Transaction-oriented databases are usually very large.  Mining useful rules from such large and volatile databases is a challenging problem.  Fast association rule mining inevitably requires large computing resources.  cluster computing technology offers a potential solution  parallel Apriori approach,  parallel FP-growth approach Introduction 5 林俊宏

 The objective of this paper  utilize parallelization on a computing cluster environment for fast extraction of frequent itemsets from large dense databases.  propose an alternative approach  parallel association rule mining based on the FI- growth algorithm Introduction 6 林俊宏

 Similar to the FP-growth algorithm,  FI-growth represents the data set as a prefix sharing tree, called an “FI-tree”.  It commonly consists of two phases:  FI-tree construction  Mining FI-Growth algorithm 7 林俊宏

FI-Growth algorithm  Constructing an FI-tree requires scanning the database only twice:  the first scan creates the header table  the second scan creates the items-tree. A3 B1 C4 D2 E4 F4 A3 C4 D2 E4 F4 Note that : the items in all lists must be in the same relative order. 8 林俊宏

 Combining operation  the same sub-paths are grouped and their counts summed.  The combining operation has the following properties.  1) Self-reflective property: tree(a) © tree(a) is equal to tree(a) itself.  2) Commutative property: tree(a 1 ) © tree(a 2 ) is equal to tree(a 2 ) © tree(a 1 ).  3) Associative property: (tree(a 1 ) © tree(a 2 )) © tree(a 3 ) is equal to tree(a 1 ) © (tree(a 2 ) © tree(a 3 )). FI-Growth algorithm e: 1 d:2 f: 1 e: 1 d:2 f: 1 e: 1 d:2 f: 1 9 林俊宏

The result (grey nodes) replaces the old one that is linked from root. 10 林俊宏

root a:3 c:2 e:1 d:2 c:2 e:1 e:2 f:2 f:1 f:4 f:3 e:4 e:1 d:2 f:1 e:1 d:2 f:1 f:2 FI-Growth algorithm  Branching step  Subset finding step  Pruning step 11 林俊宏

Parallel FI-Growth  a parallel version of the FI-growth algorithm  employ a data parallelism technique on a PC cluster  partition the transaction  one-time synchronization to exchange their sub-trees 12 林俊宏

 Hierarchical minimum support  two solutions to avoid such a problem:  All processors synchronize their lists of item counts  utilizing two values of minimum support: min_supL1 is defined and used to prune the local header table min_supL2 is defined to prune the local items-tree.  in this paper, we use the second approach. Parallel FI-Growth 13 林俊宏

 Parallelization  min_supL1 = 1(20%)  min_supL2 = 2(40%) Parallel FI-Growth 14 林俊宏

 FI-Tree synchronization  Exchanging of local header table: To reduce the communication overhead, only the list of items is broadcast to other processors.  Sending of local sub-tree: which local sub-tree(s) should be kept, and which should be sent to the target processors Parallel FI-Growth 15 林俊宏

Experiments and results  Hardware and environment configuration:  Tested on a cluster of x86-64 based SMP machines named “Bedrocks”.  Each machine consists of dual 3.2GHz Intel quad-core processors, 4GB of main memory, and an 80GB SATA disk.  equipped with the Linux-based operating system  inter-connected via a 1000Base-TX Ethernet switch  the parallel algorithm is written in the C language  uses the MPICH message passing library version  All experiments were run under no-load conditions 16 林俊宏

 Data set:  For the test data set, we utilized the standard “IBM synthetic data generator” to synthesize a transaction database unique items 16 million records (each has average transaction length of 10) Experiments and results 17 林俊宏

18 林俊宏

Conclusion  research in many areas, including  run-time  memory requirements  In this paper  propose a parallel FI-growth algorithm to accelerate association rule mining.  In future work,  effects of partitioning  memory requirements  reduce the communication overhead  load balancing 19 林俊宏

20 林俊宏