MapReduce
- MapReduce is one of the most popular distributed programming models.
- The model has two phases:
  - Map phase: distributed processing based on (key, value) pairs
  - Reduce phase: summarizing the output of the Map phase into a single output
- Advantages:
  - Reduces communication overhead
  - Improves fault tolerance
- Disadvantages:
  - Not dynamic (tasks cannot be created while a job is running)
  - No synchronization between jobs
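The two phases can be illustrated with a minimal single-process sketch. It is not tied to any particular MapReduce framework; the function names (`map_phase`, `shuffle`, `reduce_phase`, `run_mapreduce`) and the sample records are made up for illustration.

```python
from collections import defaultdict

def map_phase(record):
    # Map: emit one (key, value) pair per item in the record.
    return [(item, 1) for item in record.split()]

def shuffle(pairs):
    # Group all values that share a key, as a framework would do between phases.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    # Reduce: summarize the values of one key into a single output.
    return key, sum(values)

def run_mapreduce(records):
    pairs = [pair for record in records for pair in map_phase(record)]
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())

# Illustrative input only.
counts = run_mapreduce(["bread milk", "bread diaper", "milk bread"])
```

In a real deployment each call to `map_phase` and `reduce_phase` would run as a separate task on a different node; here they run in one process only to show the data flow.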
Our Challenge
- A number of data analytics algorithms use a recursive divide-and-conquer approach.
- This approach requires synchronization.
- It is therefore difficult to create distributed versions of such algorithms using the MapReduce model.
- Solution: Parent-Child MapReduce
Parent-Child MapReduce
An extension of the MapReduce programming model that allows:
- Tasks to be created dynamically
- Tasks to be synchronized in a hierarchical fashion
FP-Growth
Using Parallel FP-Growth (PFP) as a reference, we show that:
- Parent-Child MapReduce can be used to parallelize recursive divide-and-conquer algorithms
- Using Parent-Child MapReduce can lead to performance gains
Frequent Pattern Mining
Problem statement
- Given a transaction database DB and a minimum support threshold ξ, find all frequent patterns with support no less than ξ.
Example
- Input: DB: [transaction table], minimum support: ξ = 3
- Output: all frequent patterns, i.e., {bread}, {diaper}, ..., {bread, milk}, ...
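The problem statement can be made concrete with a brute-force sketch that counts the support of every candidate itemset. The five transactions below are hypothetical and were chosen only for illustration; they are not the database from the slide. The helper names `support` and `frequent_patterns` are also assumptions.

```python
from itertools import combinations

def support(itemset, db):
    # Support = number of transactions that contain every item of the itemset.
    return sum(1 for transaction in db if itemset <= transaction)

def frequent_patterns(db, min_support):
    items = sorted(set().union(*db))
    result = {}
    for size in range(1, len(items) + 1):
        found = False
        for combo in combinations(items, size):
            count = support(frozenset(combo), db)
            if count >= min_support:
                result[frozenset(combo)] = count
                found = True
        if not found:
            # If no itemset of this size is frequent, no larger one can be.
            break
    return result

# Hypothetical transaction database (illustration only).
db = [
    {"bread", "milk"},
    {"bread", "diaper", "beer"},
    {"milk", "diaper", "beer"},
    {"bread", "milk", "diaper"},
    {"bread", "milk", "beer"},
]
patterns = frequent_patterns(db, min_support=3)
```

Enumerating every candidate is exponential in the number of items, which is why algorithms such as Apriori, Eclat, and FP-Growth exist.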
Methods
- Single-node environment: Apriori, FP-Growth, Eclat
- Distributed environment: MRApriori (Map-Reduce Apriori), PFP (Parallel FP-Growth), DistEclat (Distributed Eclat)
FP-Growth
- The FP-Growth algorithm has two phases.
- The first phase is the construction of the FP-Tree.
  - The FP-Tree is a data structure that summarizes the contents of the transaction database.
  - It consists of a header table of frequent items, a prefix tree, and links between the items in the header table and their first occurrence in the prefix tree.
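The construction phase can be sketched as follows. This is a minimal illustration, not the Mahout implementation; the names `Node`, `build_fp_tree`, and the optional `item_order` parameter (used to fix how ties between equally frequent items are ordered) are assumptions. The sample data is the five-transaction database used on the FP-Tree slide.

```python
from collections import Counter

class Node:
    def __init__(self, item, parent):
        self.item = item
        self.count = 0
        self.parent = parent
        self.children = {}
        self.next = None  # link to the next node holding the same item

def build_fp_tree(transactions, min_support, item_order=None):
    counts = Counter(item for t in transactions for item in set(t))
    if item_order is None:
        # Descending support; ties broken by item name (an arbitrary choice).
        item_order = sorted(counts, key=lambda i: (-counts[i], i))
    order = [i for i in item_order if counts[i] >= min_support]
    rank = {item: r for r, item in enumerate(order)}
    root = Node(None, None)
    header = {item: None for item in order}  # item -> first node in the tree
    tails = {}                               # item -> last node linked so far
    for t in transactions:
        node = root
        # Keep frequent items only, sorted in header-table order.
        for item in sorted((i for i in set(t) if i in rank), key=rank.get):
            child = node.children.get(item)
            if child is None:
                child = Node(item, node)
                node.children[item] = child
                if header[item] is None:
                    header[item] = child
                else:
                    tails[item].next = child
                tails[item] = child
            child.count += 1
            node = child
    return root, header

transactions = [
    {"f", "a", "c", "d", "g", "i", "m", "p"},
    {"a", "b", "c", "f", "l", "m", "o"},
    {"b", "f", "h", "j", "o"},
    {"b", "c", "k", "s", "p"},
    {"a", "f", "c", "e", "l", "p", "m", "n"},
]
root, header = build_fp_tree(transactions, 3, item_order=list("fcabmp"))
```

Passing `item_order` explicitly reproduces the header-table order f, c, a, b, m, p; without it, the tie between f and c would be resolved alphabetically and the tree shape would differ while summarizing the same data.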
FP-Growth
- The second phase is the discovery of the frequent patterns in the FP-Tree.
- This is a recursive divide-and-conquer process.
- Each recursive call breaks the database into smaller projections of the database.
- These projections are represented by smaller FP-Trees called conditional FP-Trees.
- The process continues until the tree is empty or consists of a single path.
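The recursion can be sketched in a simplified form that works on conditional pattern bases stored as plain lists of (path, count) pairs rather than on linked FP-Trees, and that omits the single-path shortcut. The function name `fp_growth` and this list-based representation are assumptions made for brevity. The input is the slide's example database after dropping infrequent items and sorting each transaction in header-table order (f, c, a, b, m, p).

```python
from collections import Counter

def fp_growth(pattern_base, min_support, suffix=()):
    # pattern_base: list of (path, count); every path follows one global item order.
    counts = Counter()
    for path, n in pattern_base:
        for item in path:
            counts[item] += n
    results = {}
    for item, item_support in counts.items():
        if item_support < min_support:
            continue
        pattern = (item,) + suffix
        results[frozenset(pattern)] = item_support
        # Projection: keep only the part of each path that precedes `item`.
        conditional = [
            (path[:path.index(item)], n)
            for path, n in pattern_base
            if item in path
        ]
        conditional = [(prefix, n) for prefix, n in conditional if prefix]
        # Recurse on the smaller projected database.
        results.update(fp_growth(conditional, min_support, pattern))
    return results

ordered_db = [
    (("f", "c", "a", "m", "p"), 1),
    (("f", "c", "a", "b", "m"), 1),
    (("f", "b"), 1),
    (("c", "b", "p"), 1),
    (("f", "c", "a", "m", "p"), 1),
]
patterns = fp_growth(ordered_db, min_support=3)
```

Each recursive call receives a strictly smaller projection, and the recursion stops when a projection has no frequent items left, which corresponds to an empty conditional FP-Tree.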
Building an FP-Tree
Minimum support: ξ = 3

Transaction database
TID   Items bought
100   {f, a, c, d, g, i, m, p}
200   {a, b, c, f, l, m, o}
300   {b, f, h, j, o}
400   {b, c, k, s, p}
500   {a, f, c, e, l, p, m, n}

Header table (item, head of node links): f, c, a, b, m, p

FP-Tree (summary of the transaction database)
{}
  f:4
    c:3
      a:3
        m:2
          p:2
        b:1
          m:1
    b:1
  c:1
    b:1
      p:1
Conditional FP-Tree (Item m)
Pattern growth for item m
- Header table entry: m (support 3)
- Projection of the FP-Tree for paths containing item m:
  - {} -> f:4 -> c:3 -> a:3 -> m:2
  - {} -> f:4 -> c:3 -> a:3 -> b:1 -> m:1
- Conditional pattern base for item m: fca:2, fcab:1
- Conditional FP-Tree for item m: {} -> f:3 -> c:3 -> a:3
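The numbers for item m can be re-derived with a short sketch. It works directly on the ordered transactions instead of walking node links, and the helper names `conditional_pattern_base` and `conditional_item_counts` are assumptions.

```python
from collections import Counter

def conditional_pattern_base(ordered_transactions, item):
    # Collect the prefix that precedes `item` in every transaction containing it.
    base = Counter()
    for t in ordered_transactions:
        if item in t:
            prefix = tuple(t[:t.index(item)])
            if prefix:
                base[prefix] += 1
    return base

def conditional_item_counts(base, min_support):
    # Items that stay frequent within the base form the conditional FP-Tree.
    counts = Counter()
    for prefix, n in base.items():
        for i in prefix:
            counts[i] += n
    return {i: c for i, c in counts.items() if c >= min_support}

# Frequent items of each transaction, in header-table order f, c, a, b, m, p.
ordered = [
    ["f", "c", "a", "m", "p"],
    ["f", "c", "a", "b", "m"],
    ["f", "b"],
    ["c", "b", "p"],
    ["f", "c", "a", "m", "p"],
]
base_m = conditional_pattern_base(ordered, "m")
tree_items_m = conditional_item_counts(base_m, min_support=3)
```

Item b appears only once in the base for m, so it falls below the minimum support of 3 and is left out of the conditional FP-Tree.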
Parallel FP-Growth
- Step 1a: Sharding
- Step 1b: Parallel counting
- Step 1c: Grouping items
- Step 2: Parallel FP-Growth
  - Mapper: generate group-dependent transactions
  - Reducer: run FP-Growth on the group-dependent transactions
- Step 3: Aggregation
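The Step 2 mapper can be sketched as below. This follows the commonly described PFP scheme, in which each transaction is emitted once per item group, truncated at the last item that belongs to that group; it is a sketch under that assumption, not Mahout's code. The function name and the three-group assignment in `group_of` are hypothetical.

```python
def group_dependent_transactions(transaction, group_of):
    # transaction: items in descending-frequency order.
    # group_of: item -> group id (the output of Step 1c); other items are dropped.
    items = [i for i in transaction if i in group_of]
    emitted = set()
    out = []
    # Scan from the right so each group receives the longest prefix it needs.
    for j in range(len(items) - 1, -1, -1):
        gid = group_of[items[j]]
        if gid not in emitted:
            emitted.add(gid)
            out.append((gid, items[: j + 1]))
    return out

# Hypothetical grouping of the six frequent items into three groups.
group_of = {"f": 0, "c": 0, "a": 1, "b": 1, "m": 2, "p": 2}
pairs = group_dependent_transactions(["f", "c", "a", "m", "p"], group_of)
```

Each reducer then receives all transactions keyed by one group id and can mine them independently, which is what makes Step 2 parallel.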
Work Done
Observations
- The computation time of Parallel FP-Growth grows exponentially as the minimum support count is lowered.
- Some reducers in Step 2 of the PFP process take significantly longer to execute than others.
- The current PFP parameters cannot mitigate this issue.
- The properties of the database (transaction length, sparseness of the data) matter more than its size.
Proposed Solution
- Use the Parent-Child MapReduce feature of Symphony to increase the level of parallelization in the reduce phase of Step 2 of PFP.
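The control flow of the proposed solution can be sketched as follows. This is only an illustration: Python's `ThreadPoolExecutor` stands in for the Parent-Child MapReduce facility and is not Symphony's API, the load estimate and `load_threshold` are hypothetical, and only one level of child tasks is spawned. A parent reducer mines small projections itself, hands large ones to child tasks, and synchronizes on the children before returning.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

def project(base, item):
    # Conditional pattern base of `item`: the prefixes that precede it.
    prefixes = [(p[:p.index(item)], n) for p, n in base if item in p]
    return [(p, n) for p, n in prefixes if p]

def item_counts(base):
    counts = Counter()
    for path, n in base:
        for item in path:
            counts[item] += n
    return counts

def mine(base, min_support, suffix=()):
    # Sequential recursive mining, as a single reducer would do it.
    results = {}
    for item, s in item_counts(base).items():
        if s >= min_support:
            pattern = (item,) + suffix
            results[frozenset(pattern)] = s
            results.update(mine(project(base, item), min_support, pattern))
    return results

def mine_with_children(base, min_support, pool, load_threshold):
    results, children = {}, []
    for item, s in item_counts(base).items():
        if s < min_support:
            continue
        results[frozenset((item,))] = s
        conditional = project(base, item)
        load = sum(len(p) * n for p, n in conditional)  # hypothetical estimate
        if load > load_threshold:
            # Heavy projection: create a child task dynamically.
            children.append(pool.submit(mine, conditional, min_support, (item,)))
        else:
            results.update(mine(conditional, min_support, (item,)))
    for child in children:
        # The parent synchronizes on its children before emitting output.
        results.update(child.result())
    return results

ordered_db = [
    (("f", "c", "a", "m", "p"), 1),
    (("f", "c", "a", "b", "m"), 1),
    (("f", "b"), 1),
    (("c", "b", "p"), 1),
    (("f", "c", "a", "m", "p"), 1),
]
with ThreadPoolExecutor(max_workers=4) as pool:
    parallel = mine_with_children(ordered_db, 3, pool, load_threshold=5)
```

Threads in one Python process give no real speed-up for this workload; the sketch shows only how dynamic task creation and hierarchical synchronization fit into the reducer.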
Implementation
- Used the Apache Mahout implementation of PFP
Framework Comparison
Evaluations
- We call our new algorithm R-PFP (Recursive-PFP).
- Compared the computation time of Mahout PFP against R-PFP.
- Used two publicly available datasets: Kosarak and Twitter.
- Same parameters for PFP and R-PFP.
- Results averaged over 5 runs.

                      Kosarak    Twitter
#Transactions         990,002    22,524,846
#Unique Items         41,270     3,380,085
Avg. Trans. Length    8          1
Max. Trans. Length    2,498      30
Results
- Up to a threefold reduction in computation time
Results
Results
Summary
Work Done
- Designed and developed R-PFP.
- R-PFP is capable of deeper parallelization during processing than PFP.
- Deep parallelization is achieved using the Parent-Child MapReduce feature of IBM Platform Symphony.
- R-PFP achieved up to 3 times better computation time than PFP.
Future Work
- Develop a better method to predict FP-Tree load.
- Test Parent-Child MapReduce on other recursive divide-and-conquer algorithms.
Paper
Adetokunbo Makanju, Zahra Farzanyar, Aijun An, Nick Cercone, Zane Hu, and Yonggang Hu. "Deep Parallelization of Parallel FP-Growth Using Parent-Child MapReduce." IEEE BigData 2016, Washington D.C., USA.
Thank you!