Prabhanjan Kambadur, Amol Ghoting, Anshul Gupta and Andrew Lumsdaine. International Conference on Parallel Computing (ParCO),2009 Extending Task Parallelism.

Presentation transcript:

Prabhanjan Kambadur, Amol Ghoting, Anshul Gupta and Andrew Lumsdaine. International Conference on Parallel Computing (ParCO), 2009. Extending Task Parallelism for Frequent Pattern Mining.

Overview
Introduce Frequent Pattern Mining (FPM).
  Formal definition.
  Apriori algorithm for FPM.
Task-parallel implementation of Apriori.
  Requirements for efficient parallelization.
Cilk-style task scheduling.
  Shortcomings w.r.t. Apriori.
Clustered task scheduling policy.
Results.

FPM: A Formal Definition
Let $I = \{i_1, i_2, \ldots, i_n\}$ be a set of $n$ items.
Let $D = \{T_1, T_2, \ldots, T_m\}$ be a set of $m$ transactions such that $T_j \subseteq I$.
A set $i \subseteq I$ of size $k$ is called a $k$-itemset.
The support of a $k$-itemset $i$ is $\sum_{j=1}^{m} \mathbf{1}[\, i \subseteq T_j \,]$, that is, the number of transactions in $D$ having $i$ as a subset.
"The Frequent Pattern Mining problem aims to find all itemsets $i \subseteq I$ whose support is greater than or equal to a user-supplied value."
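The support definition can be checked directly by scanning the transactions. The following is a minimal sketch, not the authors' code: it assumes items are single characters and that a transaction is a string of distinct items, and the names `Itemset`, `is_subset`, `support_count` and `support` are made up for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Assumption for brevity: an itemset or transaction is a string of
// single-character items, e.g. "ABCE" stands for {A, B, C, E}.
using Itemset = std::string;

// Returns true if every item of `itemset` occurs in `transaction`.
bool is_subset(const Itemset& itemset, const Itemset& transaction) {
  return std::all_of(itemset.begin(), itemset.end(), [&](char item) {
    return transaction.find(item) != std::string::npos;
  });
}

// Support count: the number of transactions that contain `itemset`.
std::size_t support_count(const std::vector<Itemset>& db,
                          const Itemset& itemset) {
  return static_cast<std::size_t>(
      std::count_if(db.begin(), db.end(), [&](const Itemset& t) {
        return is_subset(itemset, t);
      }));
}

// Relative support in [0, 1]; the database must not be empty.
double support(const std::vector<Itemset>& db, const Itemset& itemset) {
  assert(!db.empty());
  return static_cast<double>(support_count(db, itemset)) / db.size();
}
```

An itemset is then frequent when `support(db, itemset)` is at least the user-supplied threshold.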

Apriori Algorithm for FPM
Transaction Database
TID  Items
1    A B C E
2    B C A F
3    G H A C
4    A D B H
5    E D A B
6    A B C D
7    B D A G
8    A C D B

Apriori Algorithm
The transaction database (TIDs 1 to 8) is converted into a TID list: one entry per item (A, B, C, D, E, F, G, H) holding the IDs of the transactions that contain that item.

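Apriori Algorithm for FPM
Join: the TID lists of single items (A, B, C, D) are joined to obtain the TID lists of 2-itemsets such as AB and CD. For example, the TID list of CD is 6, 8.
Support (AB) = 87.5%
Support (CD) = 25%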
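The conversion to TID lists and the join can be sketched as follows. This is an illustration under stated assumptions rather than the implementation behind the slides: items are single characters, transactions are numbered from 1 in input order, and `TidList`, `build_tid_lists` and `join` are hypothetical names. The join is an intersection of sorted lists.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <iterator>
#include <map>
#include <string>
#include <vector>

using TidList = std::vector<int>;  // transaction IDs in ascending order

// Vertical layout: item -> TIDs of the transactions that contain it.
// Transactions are numbered from 1 in input order, so every list is sorted.
std::map<char, TidList> build_tid_lists(const std::vector<std::string>& db) {
  std::map<char, TidList> lists;
  for (std::size_t j = 0; j < db.size(); ++j) {
    const int tid = static_cast<int>(j) + 1;
    for (char item : db[j]) {
      TidList& tids = lists[item];
      // Guard against an item repeated within one transaction.
      if (tids.empty() || tids.back() != tid) tids.push_back(tid);
    }
  }
  return lists;
}

// Join step: the TID list of the union of two itemsets is the
// intersection of their TID lists. Both inputs must be sorted.
TidList join(const TidList& a, const TidList& b) {
  assert(std::is_sorted(a.begin(), a.end()));
  assert(std::is_sorted(b.begin(), b.end()));
  TidList out;
  std::set_intersection(a.begin(), a.end(), b.begin(), b.end(),
                        std::back_inserter(out));
  return out;
}
```

The support of a joined itemset is the length of the resulting list divided by the number of transactions, so no further pass over the database is needed.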

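Apriori Algorithm for FPM
Minimum support = 37.5% (3/8).
[Figure: the transaction database feeds the TID lists of items A to H. Tasks are spawned to join itemsets level by level, and each level ends with a "wait all". Itemsets shown: AB, AC, AD, BC, BD, CD at the second level and ABC, ABD at the third.]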
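The enumeration in this figure can be reproduced with a short sequential sketch. It is not the authors' implementation: it assumes single-character items that are distinct within a transaction, it extends each itemset only with siblings that share its prefix, and the names `Node`, `extend` and `mine` are hypothetical. The comment marks where a task-parallel version would spawn and wait.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <iterator>
#include <map>
#include <string>
#include <utility>
#include <vector>

using TidList = std::vector<int>;
using Node = std::pair<std::string, TidList>;  // itemset and its TID list

// Records every itemset in `siblings` (all share one prefix) as frequent,
// joins it with the siblings to its right, keeps the frequent results and
// recurses on them.
void extend(const std::vector<Node>& siblings, std::size_t min_count,
            std::map<std::string, std::size_t>& frequent) {
  for (std::size_t i = 0; i < siblings.size(); ++i) {
    frequent[siblings[i].first] = siblings[i].second.size();
    std::vector<Node> children;
    for (std::size_t j = i + 1; j < siblings.size(); ++j) {
      TidList tids;
      std::set_intersection(
          siblings[i].second.begin(), siblings[i].second.end(),
          siblings[j].second.begin(), siblings[j].second.end(),
          std::back_inserter(tids));
      if (tids.size() >= min_count)
        children.emplace_back(siblings[i].first + siblings[j].first.back(),
                              tids);
    }
    // A task-parallel version would spawn this call as a task and wait
    // for all spawned tasks after the loop.
    extend(children, min_count, frequent);
  }
}

// Returns every frequent itemset with its support count.
std::map<std::string, std::size_t> mine(const std::vector<std::string>& db,
                                        std::size_t min_count) {
  assert(min_count > 0);
  std::map<char, TidList> lists;
  for (std::size_t t = 0; t < db.size(); ++t)
    for (char item : db[t]) lists[item].push_back(static_cast<int>(t) + 1);
  std::vector<Node> roots;
  for (const auto& entry : lists)
    if (entry.second.size() >= min_count)
      roots.emplace_back(std::string(1, entry.first), entry.second);
  std::map<std::string, std::size_t> frequent;
  extend(roots, min_count, frequent);
  return frequent;
}
```

With the eight example transactions and a minimum count of 3, the sketch keeps A, B, C, D, AB, AC, AD, BC, BD, ABC and ABD, and drops CD, whose count is 2.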

Cilk-style parallelization
1 thread: depth-first discovery, post-order finish.
[Figure: two copies of the task tree n; n-1, n-2; n-3, n-4, n-3, n-4; n-5, n-6, one annotated with the order of discovery and one with the order of completion.]
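The single-thread behavior can be traced with a small sketch. It assumes a Fibonacci-shaped task tree in which task n spawns tasks n-1 and n-2, which is only an illustrative choice, and `Trace` and `fib_task` are hypothetical names.

```cpp
#include <cassert>
#include <vector>

struct Trace {
  std::vector<int> discovered;  // order in which tasks start
  std::vector<int> completed;   // order in which tasks finish
};

// Task n spawns tasks n-1 and n-2 and then waits for both. With a single
// thread, each spawned child runs to completion before its sibling starts.
long fib_task(int n, Trace& trace) {
  assert(n >= 0);
  trace.discovered.push_back(n);
  long result = n;
  if (n >= 2) {
    long left = fib_task(n - 1, trace);   // spawn
    long right = fib_task(n - 2, trace);  // spawn
    result = left + right;                // after "wait all"
  }
  trace.completed.push_back(n);
  return result;
}
```

For n = 4 the discovery order is 4, 3, 2, 1, 0, 1, 2, 1, 0 (depth-first) and the completion order is 1, 0, 2, 1, 3, 1, 0, 2, 4 (post-order), so a parent always finishes after all of its children.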

Cilk-style parallelization
Thread-local deques.
[Figure: the task tree n; n-1, n-2; n-3, n-4, ... is executed by Thd 1 and Thd 2. Successive snapshots of the two deques show Steal (n-1) followed by Steal (n-3).]
1. Breadth-first theft.
2. Steal one task at a time.
3. Stealing is expensive.
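A thread-local deque with these properties might look like the sketch below. It is a simplified model rather than Cilk's or PFunc's data structure: tasks are plain integers, a mutex stands in for the lock-free protocols real runtimes use, and `WorkDeque` and `steal_one` are hypothetical names.

```cpp
#include <cassert>
#include <deque>
#include <mutex>

// The owner pushes and pops at the back (newest task first, which gives
// depth-first execution). A thief takes from the front (oldest task first,
// which gives breadth-first theft).
class WorkDeque {
 public:
  void push(int task) {
    std::lock_guard<std::mutex> lock(mutex_);
    tasks_.push_back(task);
  }

  // Owner side: take the most recently pushed task.
  bool pop(int& task) {
    std::lock_guard<std::mutex> lock(mutex_);
    if (tasks_.empty()) return false;
    task = tasks_.back();
    tasks_.pop_back();
    return true;
  }

  // Thief side: take the oldest task.
  bool steal(int& task) {
    std::lock_guard<std::mutex> lock(mutex_);
    if (tasks_.empty()) return false;
    task = tasks_.front();
    tasks_.pop_front();
    return true;
  }

 private:
  std::mutex mutex_;
  std::deque<int> tasks_;
};

// Moves exactly one task from `victim` to `thief`.
bool steal_one(WorkDeque& victim, WorkDeque& thief) {
  assert(&victim != &thief);
  int task = 0;
  if (!victim.steal(task)) return false;
  thief.push(task);
  return true;
}
```

Every call to `steal_one` pays for synchronization on both deques yet transfers a single task, which is the cost the three points above describe.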

Efficient Parallelization of FPM
[Figure: AB, AC, AD all read the TID list of A; ABC, ABD both read the TID list of AB.]
Shortcomings of Cilk-style w.r.t. FPM:
1. Exploits data locality only between parent-child tasks.
2. Stealing does not consider data locality.
3. Tasks are stolen one at a time.
Tasks with overlapping memory accesses should be:
1. Executed by the same thread.
2. Stolen together by the same thread.

Clustered Scheduling Policy
Cluster k-itemsets based on their common (k-1)-prefix.
[Figure: the thread-local deque next to a thread-local hash table. AB, AC, AD fall into the bucket Hash(A); ABC, ABD fall into the bucket Hash(A) xor Hash(B).]
1. Hash Table - std::hash_map.
2. Hash - std::hash.
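The bucket key on this slide can be sketched as the XOR of the hashes of the items in the (k-1)-prefix. The code below is an illustration, not PFunc code: items are single characters, `cluster_key` and `cluster` are hypothetical names, and the standard `std::unordered_map` is swapped in for the `std::hash_map` the slide mentions.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <string>
#include <unordered_map>
#include <vector>

// Key of a k-itemset: XOR of the hashes of the items in its (k-1)-prefix.
// A 1-itemset has an empty prefix, so its key is 0.
std::size_t cluster_key(const std::string& itemset) {
  assert(!itemset.empty());
  std::size_t key = 0;
  for (std::size_t i = 0; i + 1 < itemset.size(); ++i)
    key ^= std::hash<char>()(itemset[i]);
  return key;
}

using Buckets = std::unordered_map<std::size_t, std::vector<std::string>>;

// Groups itemsets so that all itemsets with the same prefix key share
// one bucket.
Buckets cluster(const std::vector<std::string>& itemsets) {
  Buckets buckets;
  for (const std::string& itemset : itemsets)
    buckets[cluster_key(itemset)].push_back(itemset);
  return buckets;
}
```

XOR ignores item order and different prefixes can collide on the same key, so a bucket may occasionally mix unrelated itemsets. That affects locality only, not correctness.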

Clustered Scheduling Policy
[Figure: the Thd 1 hash table and the Thd 2 hash table. AB, AC, AD sit in the bucket Hash(A); ABC, ABD sit in the bucket Hash(A) xor Hash(B).]

Clustered Scheduling Policy
Steal an entire bucket of tasks.
[Figure: the bucket Hash(A) with AB, AC, AD in the Thd 1 hash table; the bucket Hash(A) xor Hash(B) with ABC, ABD in the Thd 2 hash table.]
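Stealing a whole bucket, as opposed to one task, can be sketched as below. This is a single-threaded model with no locking, the bucket chosen is simply the first one the victim's table yields, and `Table` and `steal_bucket` are hypothetical names rather than anything from PFunc.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <unordered_map>
#include <vector>

// Thread-local task table: bucket key -> tasks (itemsets) in that bucket.
using Table = std::unordered_map<std::size_t, std::vector<std::string>>;

// Moves one complete bucket from `victim` to `thief`, so that tasks
// sharing a prefix stay together. Returns false if there is nothing to
// steal.
bool steal_bucket(Table& victim, Table& thief) {
  assert(&victim != &thief);
  if (victim.empty()) return false;
  Table::iterator it = victim.begin();
  std::vector<std::string>& dest = thief[it->first];
  dest.insert(dest.end(), it->second.begin(), it->second.end());
  victim.erase(it);
  return true;
}
```

One steal now transfers every task that reads the same parent TID list, so the cost of a theft is spread over several tasks.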

Where does PFunc fit in?
Customizable task scheduling and priorities.
  Cilk-style, LIFO, FIFO and priority-based scheduling are built in.
  Custom scheduling policies are simple to implement, e.g., the clustered scheduling policy.
  The policy is chosen at compile time, much like the STL (e.g., std::vector).
namespace pfunc { struct hashS : public schedS {}; template <...> struct scheduler { … }; } // namespace pfunc
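Compile-time selection of a scheduling policy can be illustrated with tag types and template specialization. The sketch below shows the general C++ technique only and does not reproduce PFunc's interface: the namespace `sketch`, the tags `lifo_policy` and `fifo_policy`, and the `put`/`get` members are all made up for this example.

```cpp
#include <cassert>
#include <deque>

namespace sketch {

// Empty tag types that name a policy.
struct lifo_policy {};
struct fifo_policy {};

// Primary template: declared only, specialized once per policy.
template <typename Policy>
class scheduler;

// LIFO: the most recently added task is returned first.
template <>
class scheduler<lifo_policy> {
 public:
  void put(int task) { tasks_.push_back(task); }
  int get() {
    assert(!tasks_.empty());
    int task = tasks_.back();
    tasks_.pop_back();
    return task;
  }

 private:
  std::deque<int> tasks_;
};

// FIFO: the oldest task is returned first.
template <>
class scheduler<fifo_policy> {
 public:
  void put(int task) { tasks_.push_back(task); }
  int get() {
    assert(!tasks_.empty());
    int task = tasks_.front();
    tasks_.pop_front();
    return task;
  }

 private:
  std::deque<int> tasks_;
};

}  // namespace sketch
```

Declaring `sketch::scheduler<sketch::fifo_policy> queue;` fixes the policy when the program is compiled, so no virtual dispatch is needed at run time. A new policy is added by defining another tag and another specialization.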

So, how does it work?
Select scheduling policy and priority: hash table-based scheduling, with a reference to the itemset as the priority.
Program: Task T; SetPriority (T, ref (ABD)); Spawn (T);
Scheduler: GetPriority (T), generate the hash key Hash(A) xor Hash(B), place the task.
[Figure: the task queue holds ABC, ABD in one bucket and BCD, BCE in another.]

Performance Analysis
[Figure: performance results with 8 threads. Platform: dual AMD 8356, Linux, GCC 4.3.2.]

Performance Analysis - IPC
Higher is better. Platform: dual AMD 8356, Linux, GCC 4.3.2.
Columns: Dataset, Support, IPC (Cilk), IPC (Clustered).
Datasets: accidents, chess, connect, kosarak, pumsb, pumsb_star, mushroom, T40I10D100K, T10I4D100K.
(Table values not preserved in this transcript.)

Performance Analysis - L1 DTLB Misses
Lower is better. Platform: dual AMD 8356, Linux, GCC 4.3.2.
Columns: Dataset, Support, Cilk DTLB L1M/L2H, Clustered DTLB L1M/L2H.
Datasets: accidents, chess, connect, kosarak, pumsb, pumsb_star, mushroom, T40I10D100K, T10I4D100K.
(Table values not preserved in this transcript.)

Performance Analysis - L2 DTLB Misses
Lower is better. Platform: dual AMD 8356, Linux, GCC 4.3.2.
Columns: Dataset, Support, Cilk DTLB L1M/L2M, Clustered DTLB L1M/L2M.
Datasets: accidents, chess, connect, kosarak, pumsb, pumsb_star, mushroom, T40I10D100K, T10I4D100K.
(Table values not preserved in this transcript.)

Conclusions
For task-parallel FPM, clustered scheduling outperforms Cilk-style scheduling.
  Exploits data locality.
  Better work-stealing policy.
PFunc provides support for facile customizations.
  Task scheduling policy, task priorities, etc.
  Being released under COIN-OR.
  Eclipse Public License version 1.0.
Future work.
  Task queues based on multi-dimensional index structures.
  K-d trees.

Fibonacci 37
Columns: Threads, Cilk (secs), PFunc/Cilk, TBB/Cilk, PFunc/TBB.
(Table values not preserved in this transcript.)
…x faster than TBB.
2x slower than Cilk, but provides more flexibility.
Fibonacci is the worst-case behavior!