Published by Georgia Stokes. Modified over 6 years ago.
2
MapReduce
MapReduce is one of the most popular distributed programming models.
The model has two phases:
  Map phase: distributed processing based on (key, value) pairs
  Reduce phase: summarizing the output of the Map phase into a single output
Advantages:
  Reduces communication overhead
  Improves fault tolerance
Disadvantages:
  Not dynamic
  No synchronization between jobs
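The two phases can be illustrated with a minimal single-process sketch. It is not a distributed implementation; the function names (`map_phase`, `shuffle`, `reduce_phase`) and the toy transactions are assumptions made for illustration only.

```python
from collections import defaultdict

def map_phase(records, mapper):
    """Apply the mapper to every record and collect (key, value) pairs."""
    pairs = []
    for record in records:
        pairs.extend(mapper(record))
    return pairs

def shuffle(pairs):
    """Group values by key, as a framework would between the two phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups, reducer):
    """Summarize the values of each key into a single output value."""
    return {key: reducer(key, values) for key, values in groups.items()}

def count_items_mapper(transaction):
    # Emit one (item, 1) pair per item in the transaction.
    return [(item, 1) for item in transaction]

def sum_reducer(key, values):
    return sum(values)

# Hypothetical toy input, not taken from the slides.
transactions = [["bread", "milk"], ["bread", "diaper"], ["milk", "bread"]]
item_counts = reduce_phase(
    shuffle(map_phase(transactions, count_items_mapper)), sum_reducer
)
```

Note that each mapper and reducer call here is independent of the others, which is what makes the model easy to distribute and also why jobs cannot synchronize with each other.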
3
Our Challenge
A number of data analytics algorithms use a recursive divide-and-conquer approach.
This approach requires synchronization.
It is therefore difficult to create distributed versions of such algorithms using the MapReduce model.
Parent-Child MapReduce
4
Parent-Child MapReduce
It is an extension of the MapReduce programming model that allows:
  Tasks to be created dynamically
  Tasks to be synchronized in a hierarchical fashion
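The two properties can be mimicked on a single machine with Python threads. This is only an analogy: it does not use or reproduce the Symphony Parent-Child MapReduce API, and `parent_task` and its `threshold` parameter are hypothetical names.

```python
from concurrent.futures import ThreadPoolExecutor

def parent_task(numbers, threshold=4):
    """Sum a list; large inputs are split and handed to child tasks."""
    if len(numbers) <= threshold:
        return sum(numbers)  # small enough: solve directly
    mid = len(numbers) // 2
    # Dynamic task creation: children are created only when they are needed.
    # A fresh pool per parent avoids deadlock from nested blocking waits.
    with ThreadPoolExecutor(max_workers=2) as children:
        left = children.submit(parent_task, numbers[:mid], threshold)
        right = children.submit(parent_task, numbers[mid:], threshold)
        # Hierarchical synchronization: the parent blocks until both
        # children have finished, then combines their results.
        return left.result() + right.result()

total = parent_task(list(range(1, 17)))
```

Each level of the recursion waits only on its own children, so synchronization follows the shape of the task tree.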
5
FP-Growth
Using Parallel FP-Growth (PFP) as a reference, we show that:
  Parent-Child MapReduce can be used to parallelize recursive divide-and-conquer algorithms
  Using Parent-Child MapReduce can lead to performance gains
6
Frequent Pattern Mining
Problem Statement
  Given a transaction database DB and a minimum support threshold ξ, find all frequent patterns with support no less than ξ.
Example
  Input: a transaction database DB, with minimum support ξ = 3
  Output: all frequent patterns, e.g., {bread}, {diaper}, ..., {bread, milk}, ...
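The problem statement can be made concrete with a brute-force sketch that counts the support of every candidate itemset. The database below is a hypothetical stand-in, since the example table from the slide is not reproduced in this transcript, and this exhaustive search is not one of the algorithms discussed later.

```python
from itertools import combinations

def frequent_patterns(db, min_support):
    """Return {pattern: support} for all patterns with support >= min_support."""
    items = sorted({item for transaction in db for item in transaction})
    result = {}
    for size in range(1, len(items) + 1):
        found = False
        for candidate in combinations(items, size):
            # Support = number of transactions containing every candidate item.
            support = sum(1 for t in db if set(candidate) <= t)
            if support >= min_support:
                result[candidate] = support
                found = True
        if not found:
            break  # no frequent pattern of this size, so none of a larger size
    return result

# Hypothetical transaction database (an assumption, not the slide's data).
db = [
    {"bread", "milk"},
    {"bread", "diaper", "beer"},
    {"milk", "diaper", "beer"},
    {"bread", "milk", "diaper"},
    {"bread", "milk", "beer"},
]
patterns = frequent_patterns(db, min_support=3)
```

The number of candidates grows exponentially with the number of items, which is why methods such as Apriori and FP-Growth avoid enumerating them all.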
7
Methods
Single Node Environment
  Apriori
  FP-Growth
  Eclat
Distributed Environment
  MRApriori (Map-Reduce Apriori)
  PFP (Parallel FP-Growth)
  DistEclat (Distributed Eclat)
8
FP-Growth
The FP-Growth algorithm has two phases.
The first phase is the construction of the FP-Tree:
  This is a data structure that summarizes the contents of the transaction database
  It consists of a header table of frequent items, a prefix tree, and links between the items in the header table and their first occurrence in the prefix tree
9
FP-Growth
The second phase is the discovery of the frequent patterns in the FP-Tree:
  This is a recursive divide-and-conquer process
  Each recursive call breaks the database into smaller projections of the database
  These projections are represented by smaller FP-Trees called conditional FP-Trees
  The process continues until the tree is empty or contains a single path
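The recursive structure of this phase can be sketched as follows. As a simplification, each projection is kept as a plain list of (items, count) entries rather than a compressed conditional FP-Tree, and the single-path shortcut is omitted, so this shows the divide-and-conquer pattern rather than a full FP-Growth implementation. The function name and tie-breaking order are assumptions; the database is the five-transaction example used in the FP-Tree slides.

```python
from collections import Counter

def fp_growth(weighted_db, min_support, suffix=()):
    """Yield (pattern, support) pairs from a list of (items, count) entries."""
    counts = Counter()
    for items, count in weighted_db:
        for item in items:
            counts[item] += count
    # Order frequent items by descending support, then alphabetically.
    order = sorted((i for i in counts if counts[i] >= min_support),
                   key=lambda i: (-counts[i], i))
    rank = {item: r for r, item in enumerate(order)}
    for item in reversed(order):  # least frequent item first
        pattern = (item,) + suffix
        yield pattern, counts[item]
        # Projection: for each entry containing `item`, keep only the
        # frequent items ranked before it.
        conditional = []
        for items, count in weighted_db:
            if item in items:
                prefix = [i for i in items if i in rank and rank[i] < rank[item]]
                if prefix:
                    conditional.append((prefix, count))
        # Recurse on the smaller projection; recursion ends when it is empty.
        yield from fp_growth(conditional, min_support, pattern)

db = [
    {"f", "a", "c", "d", "g", "i", "m", "p"},
    {"a", "b", "c", "f", "l", "m", "o"},
    {"b", "f", "h", "j", "o"},
    {"b", "c", "k", "s", "p"},
    {"a", "f", "c", "e", "l", "p", "m", "n"},
]
patterns = {frozenset(p): s for p, s in fp_growth([(t, 1) for t in db], 3)}
```

Each recursive call works on a strictly smaller projection, and the calls for different items are independent of each other, which is the property a parallel version can exploit.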
10
Building an FP-Tree
Transaction Database (minimum support: ξ = 3)
  TID   Items bought
  100   {f, a, c, d, g, i, m, p}
  200   {a, b, c, f, l, m, o}
  300   {b, f, h, j, o}
        {b, c, k, s, p}
        {a, f, c, e, l, p, m, n}
FP-Tree: Summary of Transaction Database
  [Figure: FP-Tree with header table. Header table items: f, c, a, b, m, p. Node labels shown: {} (root), f:4, c:1, b:1, p:1, c:3, a:3, m:2, p:2, m:1.]
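A minimal sketch of the construction, using the five transactions above, is given here. The `Node` class, the function names, and the explicit `tie_break` order (copied from the slide's header table, because several items have equal support) are assumptions. For simplicity the header table here stores every node of an item rather than a link to its first occurrence.

```python
from collections import Counter

class Node:
    def __init__(self, item, parent=None):
        self.item = item
        self.count = 0
        self.parent = parent
        self.children = {}

def build_fp_tree(db, min_support, tie_break):
    counts = Counter(item for t in db for item in t)
    frequent = [i for i in counts if counts[i] >= min_support]
    # Descending support; ties resolved by position in tie_break.
    order = sorted(frequent, key=lambda i: (-counts[i], tie_break.index(i)))
    rank = {item: r for r, item in enumerate(order)}
    root = Node(None)
    header = {item: [] for item in order}  # item -> list of its tree nodes
    for t in db:
        node = root
        # Drop infrequent items and insert the rest in header-table order.
        for item in sorted((i for i in t if i in rank), key=rank.get):
            child = node.children.get(item)
            if child is None:
                child = Node(item, node)
                node.children[item] = child
                header[item].append(child)
            child.count += 1  # shared prefixes only increment counts
            node = child
    return root, header

def paths(node, prefix=()):
    """List every root-to-leaf path as a tuple of 'item:count' labels."""
    if node.item is not None:
        prefix = prefix + (f"{node.item}:{node.count}",)
    if not node.children:
        return [prefix]
    return [p for child in node.children.values() for p in paths(child, prefix)]

db = [
    {"f", "a", "c", "d", "g", "i", "m", "p"},
    {"a", "b", "c", "f", "l", "m", "o"},
    {"b", "f", "h", "j", "o"},
    {"b", "c", "k", "s", "p"},
    {"a", "f", "c", "e", "l", "p", "m", "n"},
]
root, header = build_fp_tree(db, 3, tie_break=list("fcabmp"))
tree_paths = paths(root)
```

Under these assumptions the resulting paths carry the same node labels as the figure, for example the shared prefix f:4, c:3, a:3 followed by m:2, p:2.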
11
Conditional FP-Tree (Item M)
Pattern Growth
  Conditional pattern base for item M: fca:2, fcab:1
  [Figure panels: Projection of FP-Tree for paths containing item M; Conditional Pattern base for item M; Conditional FP-Tree for item M. Node labels shown: {} (root), f:3, c:3, a:3, f:4, b:1, m:2, m:1. Header table: item m, count 3.]
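The conditional pattern base for item m can be reproduced with a short sketch. It works directly on the transactions after infrequent items are dropped and the rest are sorted in header-table order (f, c, a, b, m, p), instead of walking node links in the tree; the function names are assumptions.

```python
from collections import Counter

def conditional_pattern_base(ordered_db, item):
    """Count the prefix that precedes `item` in each ordered transaction."""
    base = Counter()
    for transaction in ordered_db:
        if item in transaction:
            prefix = tuple(transaction[:transaction.index(item)])
            if prefix:
                base[prefix] += 1
    return base

def conditional_items(base, min_support):
    """Items that stay frequent inside the conditional pattern base."""
    counts = Counter()
    for prefix, count in base.items():
        for i in prefix:
            counts[i] += count
    return {i: c for i, c in counts.items() if c >= min_support}

# Example transactions restricted to frequent items, in header-table order.
ordered_db = [
    list("fcamp"),
    list("fcabm"),
    list("fb"),
    list("cbp"),
    list("fcamp"),
]
base_m = conditional_pattern_base(ordered_db, "m")
kept_m = conditional_items(base_m, min_support=3)
```

Item b appears only once in the base, so it falls below the minimum support and is left out of the conditional FP-Tree, which leaves f, c, and a with count 3 each.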
12
Parallel FP-Growth
Step 1a: Sharding
Step 1b: Parallel Counting
Step 1c: Grouping Items
Step 2: Parallel FP-Growth
  Mapper: generate group-dependent transactions
  Reducer: FP-Growth on group-dependent transactions
Step 3: Aggregation
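Steps 1b, 1c, and the Step 2 mapper can be sketched in a single process as below. This is not the Mahout code: the function names, the round-robin grouping rule, and the alphabetical tie-break are assumptions. The mapper scans each ordered transaction from the end and emits one prefix per group, so that each group's reducer receives every transaction fragment it needs.

```python
from collections import Counter, defaultdict

def f_list(db, min_support):
    """Step 1b: count items, keep frequent ones, sort by descending support."""
    counts = Counter(i for t in db for i in t)
    return sorted((i for i in counts if counts[i] >= min_support),
                  key=lambda i: (-counts[i], i))

def group_items(flist, num_groups):
    """Step 1c: assign each frequent item to a group (round-robin assumption)."""
    return {item: pos % num_groups for pos, item in enumerate(flist)}

def pfp_mapper(transaction, flist, group_of):
    """Step 2 mapper: emit (group id, group-dependent transaction) pairs."""
    rank = {item: r for r, item in enumerate(flist)}
    ordered = sorted((i for i in transaction if i in rank), key=rank.get)
    emitted, out = set(), []
    for j in range(len(ordered) - 1, -1, -1):
        gid = group_of[ordered[j]]
        if gid not in emitted:  # one prefix per group per transaction
            emitted.add(gid)
            out.append((gid, ordered[:j + 1]))
    return out

db = [
    {"f", "a", "c", "d", "g", "i", "m", "p"},
    {"a", "b", "c", "f", "l", "m", "o"},
    {"b", "f", "h", "j", "o"},
    {"b", "c", "k", "s", "p"},
    {"a", "f", "c", "e", "l", "p", "m", "n"},
]
flist = f_list(db, 3)
group_of = group_items(flist, num_groups=2)
shards = defaultdict(list)  # group id -> input of that group's reducer
for t in db:
    for gid, fragment in pfp_mapper(t, flist, group_of):
        shards[gid].append(fragment)
```

Each entry of `shards` would then be mined independently by one reducer, which is why an uneven split of work across groups shows up as uneven reducer run times.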
13
Work Done
Observations
  The computation time of Parallel FP-Growth increases exponentially as the minimum support count decreases
  Some reducers in Step 2 of the PFP process take significantly longer to execute than others
  Current parameters in PFP cannot mitigate the issue
  The properties of the database (transaction length, sparseness of the data) are more important than the size of the database
Proposed Solution
  Use the Parent-Child MapReduce feature of Symphony to increase the level of parallelization in the reduce phase of Step 2 of PFP
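The idea behind the proposed solution can be sketched with threads standing in for child tasks. This is an illustration under stated assumptions, not the R-PFP implementation and not the Symphony API: the `mine` function, the `load_threshold` parameter, and the load estimate (number of entries in the projection) are all hypothetical. A reducer that finds its projection too large hands each conditional projection to a child task instead of recursing locally.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

def mine(weighted_db, min_support, suffix=(), load_threshold=3):
    """Return [(pattern, support)], offloading large projections to children."""
    counts = Counter()
    for items, count in weighted_db:
        for item in items:
            counts[item] += count
    order = sorted((i for i in counts if counts[i] >= min_support),
                   key=lambda i: (-counts[i], i))
    rank = {item: r for r, item in enumerate(order)}
    results, projections = [], []
    for item in order:
        pattern = (item,) + suffix
        results.append((pattern, counts[item]))
        conditional = [
            ([i for i in items if i in rank and rank[i] < rank[item]], count)
            for items, count in weighted_db if item in items
        ]
        conditional = [(p, c) for p, c in conditional if p]
        if conditional:
            projections.append((conditional, pattern))
    # Hypothetical load estimate: the number of entries in this projection.
    if len(weighted_db) > load_threshold and projections:
        # Heavy task: create one child task per conditional projection and
        # wait for all of them before returning to the parent.
        with ThreadPoolExecutor(max_workers=len(projections)) as children:
            futures = [children.submit(mine, sub_db, min_support, pat, load_threshold)
                       for sub_db, pat in projections]
            for future in futures:
                results.extend(future.result())
    else:
        # Light task: recurse locally, as a plain reducer would.
        for sub_db, pat in projections:
            results.extend(mine(sub_db, min_support, pat, load_threshold))
    return results

db = [
    {"f", "a", "c", "d", "g", "i", "m", "p"},
    {"a", "b", "c", "f", "l", "m", "o"},
    {"b", "f", "h", "j", "o"},
    {"b", "c", "k", "s", "p"},
    {"a", "f", "c", "e", "l", "p", "m", "n"},
]
mined = {frozenset(p): s for p, s in mine([(t, 1) for t in db], 3)}
```

How well such a scheme balances work depends on how accurately the load of a projection can be estimated before it is mined.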
14
Implementation
Used the Apache Mahout implementation of PFP
15
Framework Comparison
16
Evaluations
We call our new algorithm R-PFP (Recursive-PFP).
  Compared the computation time of Mahout PFP against R-PFP
  Used two publicly available datasets: Kosarak and Twitter
  Same parameters for PFP and R-PFP
  Average over 5 runs

                        Kosarak     Twitter
  #Transactions         990,002     22,524,846
  #Unique Items         41,270      3,380,085
  Avg. Trans. Length    8           1
  Max. Trans. Length    2,498       30
17
Results
Up to a 3 times reduction in computation time
18
Results
19
Results
20
Summary
Work Done:
  Designed and developed R-PFP
  R-PFP is capable of deeper parallelization during processing than PFP
  Deep parallelization is achieved using the Parent-Child MapReduce feature of IBM Platform Symphony
  R-PFP achieved up to 3 times better computation time compared to PFP
Future Work:
  Develop a better method to predict FP-Tree load
  Test Parent-Child MapReduce on other recursive divide-and-conquer algorithms
21
Paper
Adetokunbo Makanju, Zahra Farzanyar, Aijun An, Nick Cercone, Zane Hu, and Yonggang Hu. Deep Parallelization of Parallel FP-Growth Using Parent-Child MapReduce. IEEE BigData 2016, Washington D.C., USA.
22
Thank you!