MapReduce
- MapReduce is one of the most popular distributed programming models
- The model has two phases:
  - Map Phase: Distributed processing based on key, value pairs
  - Reduce Phase: Summarizing the output of the Map phase into a single output
- Advantages:
  - Reduces communication overheads
  - Improves fault tolerance
- Disadvantages:
  - Not dynamic
  - No synchronization between jobs
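The two phases can be illustrated with a minimal single-process sketch. It is not a distributed implementation; the function names (`map_phase`, `shuffle`, `reduce_phase`, `item_counts`) are hypothetical and chosen only for this illustration. The example counts item occurrences across transactions, similar in spirit to a counting job.

```python
from collections import defaultdict


def map_phase(records, mapper):
    """Apply the mapper to every record and collect the emitted (key, value) pairs."""
    pairs = []
    for record in records:
        pairs.extend(mapper(record))
    return pairs


def shuffle(pairs):
    """Group values by key, as a framework would do between the two phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups, reducer):
    """Summarize the values of each key into a single output value."""
    return {key: reducer(key, values) for key, values in groups.items()}


def count_items_mapper(transaction):
    # Emit (item, 1) for every item in one transaction.
    return [(item, 1) for item in transaction]


def sum_reducer(key, values):
    # Add up the partial counts for one item.
    return sum(values)


def item_counts(transactions):
    """Run the map, shuffle and reduce steps in sequence on one machine."""
    pairs = map_phase(transactions, count_items_mapper)
    return reduce_phase(shuffle(pairs), sum_reducer)
```

In a real framework the mapper calls run on different nodes and the shuffle moves data over the network; here everything runs in one process so the data flow is easy to follow.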

Our Challenge
- A number of data analytics algorithms use a recursive divide-and-conquer approach
- This approach requires synchronization
- It is therefore difficult to create distributed versions of such algorithms using the MapReduce model

Parent-Child MapReduce
It is an extension of the MapReduce programming model that allows:
- Tasks to be created dynamically
- Tasks to be synchronized in a hierarchical fashion
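The idea of dynamically created, hierarchically synchronized tasks can be sketched with Python's standard thread pools. This is only an analogy and does not use or represent the Symphony API; `run_task` and `threshold` are hypothetical names. A parent task decides at run time whether to spawn child tasks, then waits for its own children before returning.

```python
from concurrent.futures import ThreadPoolExecutor


def run_task(data, threshold=4):
    """Sum a list by recursive divide-and-conquer with parent and child tasks."""
    if len(data) <= threshold:
        # Small enough: do the work directly, no children are created.
        return sum(data)
    mid = len(data) // 2
    # The parent creates its children dynamically, based on the input it received.
    with ThreadPoolExecutor(max_workers=2) as pool:
        children = [
            pool.submit(run_task, part, threshold)
            for part in (data[:mid], data[mid:])
        ]
        # Hierarchical synchronization: a parent waits only for its own children.
        return sum(child.result() for child in children)
```

Each level uses its own small pool so that a waiting parent never blocks the workers its children need.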

FP-Growth
Using Parallel FP-Growth (PFP) as a reference, we show that:
- Parent-Child MapReduce can be used to parallelize recursive divide-and-conquer algorithms
- Using Parent-Child MapReduce can lead to performance gains

Frequent Pattern Mining
Problem Statement
- Given a transaction database DB and a minimum support threshold ξ, find all frequent patterns with support no less than ξ.
Example
- Input: a transaction database DB and a minimum support ξ = 3
- Output: all frequent patterns, i.e., {bread}, {diaper}, …, {bread, milk}, …
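The problem statement can be made concrete with a brute-force sketch that enumerates candidate patterns and counts their support. It is meant only to pin down the definitions, not as a practical miner; the function names are hypothetical, and any database passed in (such as the small one in the usage note) is made up for illustration because the slide's example database is not reproduced in the transcript.

```python
from itertools import combinations


def support(pattern, db):
    """Number of transactions in db that contain every item of the pattern."""
    return sum(1 for transaction in db if pattern <= transaction)


def frequent_patterns(db, min_support):
    """Return {pattern: support} for all patterns with support >= min_support.

    db is a list of sets of items. Candidates are enumerated by size; once no
    pattern of a given size is frequent, no larger pattern can be frequent.
    """
    items = sorted(set().union(*db))
    result = {}
    for size in range(1, len(items) + 1):
        found_any = False
        for combo in combinations(items, size):
            pattern = frozenset(combo)
            count = support(pattern, db)
            if count >= min_support:
                result[pattern] = count
                found_any = True
        if not found_any:
            break
    return result
```

For instance, with the hypothetical database `[{"bread", "milk"}, {"bread", "diaper"}, {"bread", "milk", "diaper"}, {"milk"}]` and a minimum support of 2, the pattern `{bread, milk}` has support 2 and is reported, while `{milk, diaper}` has support 1 and is not.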

Methods
- Single Node Environment
  - Apriori
  - FP-Growth
  - Eclat
- Distributed Environment
  - MRApriori (Map-Reduce Apriori)
  - PFP (Parallel FP-Growth)
  - DistEclat (Distributed Eclat)

FP-Growth
- The FP-Growth algorithm has two phases
- The first phase is the construction of the FP-Tree
  - This is a data structure that summarizes the contents of the transaction database
  - It consists of a header table of frequent items, a prefix tree, and links between the items in the header table and their first occurrence in the prefix tree

FP-Growth
- The second phase is the discovery of the frequent patterns in the FP-Tree
  - This is a recursive divide-and-conquer process
  - Each recursive call breaks the database into smaller projections of the database
  - These projections are represented by smaller FP-Trees called conditional FP-Trees
  - The process continues until the tree is empty or contains a single path

FP-Tree Summary of Transaction Database
Building an FP-Tree

Transaction Database (minimum support: ξ = 3)

TID    Items bought
100    {f, a, c, d, g, i, m, p}
200    {a, b, c, f, l, m, o}
300    {b, f, h, j, o}
       {b, c, k, s, p}
       {a, f, c, e, l, p, m, n}

Header table items: f, c, a, b, m, p

[Figure: FP-Tree with root {}; node labels include f:4, c:3, a:3, m:2, p:2, m:1, c:1, b:1, p:1]
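A minimal sketch of FP-Tree construction on the five transactions of this slide follows. The class and function names are hypothetical. Two simplifications are assumptions of this sketch: the item order is taken directly from the slide's header table (f, c, a, b, m, p) rather than derived, because several items tie on support, and the header table stores a plain list of nodes per item instead of chained node-links.

```python
from collections import Counter


class FPNode:
    """One node of the prefix tree: an item, its count, and its children."""

    def __init__(self, item, parent=None):
        self.item = item
        self.count = 0
        self.parent = parent
        self.children = {}


def frequent_items(transactions, min_support):
    """Count each item once per transaction and keep those meeting min_support."""
    counts = Counter(item for t in transactions for item in set(t))
    return {item: c for item, c in counts.items() if c >= min_support}


def build_fp_tree(transactions, item_order):
    """Insert each transaction, restricted to item_order and sorted by it."""
    rank = {item: r for r, item in enumerate(item_order)}
    root = FPNode(None)
    header = {item: [] for item in item_order}  # item -> nodes carrying that item
    for t in transactions:
        path = sorted((i for i in set(t) if i in rank), key=rank.__getitem__)
        node = root
        for item in path:
            child = node.children.get(item)
            if child is None:
                child = FPNode(item, node)
                node.children[item] = child
                header[item].append(child)
            child.count += 1  # shared prefixes accumulate counts
            node = child
    return root, header


# The five transactions listed on the slide.
TRANSACTIONS = [
    ["f", "a", "c", "d", "g", "i", "m", "p"],
    ["a", "b", "c", "f", "l", "m", "o"],
    ["b", "f", "h", "j", "o"],
    ["b", "c", "k", "s", "p"],
    ["a", "f", "c", "e", "l", "p", "m", "n"],
]
# Order assumed from the slide's header table.
ITEM_ORDER = ["f", "c", "a", "b", "m", "p"]
```

Calling `build_fp_tree(TRANSACTIONS, ITEM_ORDER)` yields a root whose child `f` has count 4 and whose child `c` has count 1, in line with the node labels f:4 and c:1 in the figure.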

Conditional FP-Tree (Item M)
Pattern Growth
- Conditional pattern base for item M: fca:2, fcab:1
- Conditional FP-Tree for item M: {} -> f:3 -> c:3 -> a:3
- Header table entry: m, 3

[Figure: projection of the FP-Tree for paths containing item M; node labels include f:4, b:1, m:2, m:1]
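The conditional pattern base for item m can be reproduced with a short sketch. As a simplification, it computes the prefixes directly from the ordered transactions instead of walking node-links in the FP-Tree; for this example both give the same result. The function names are hypothetical, and the item order f, c, a, b, m, p is assumed from the header table.

```python
from collections import Counter


def conditional_pattern_base(transactions, item_order, item):
    """Count the ordered prefixes that precede `item` in each transaction."""
    rank = {i: r for r, i in enumerate(item_order)}
    base = Counter()
    for t in transactions:
        ordered = sorted((i for i in set(t) if i in rank), key=rank.__getitem__)
        if item in ordered:
            prefix = tuple(ordered[: ordered.index(item)])
            if prefix:
                base[prefix] += 1
    return base


def conditional_frequent_items(base, min_support):
    """Items that remain frequent inside the conditional pattern base."""
    counts = Counter()
    for prefix, count in base.items():
        for i in prefix:
            counts[i] += count
    return {i: c for i, c in counts.items() if c >= min_support}


# Transactions and item order as shown on the FP-Tree slide.
TRANSACTIONS = [
    ["f", "a", "c", "d", "g", "i", "m", "p"],
    ["a", "b", "c", "f", "l", "m", "o"],
    ["b", "f", "h", "j", "o"],
    ["b", "c", "k", "s", "p"],
    ["a", "f", "c", "e", "l", "p", "m", "n"],
]
ITEM_ORDER = ["f", "c", "a", "b", "m", "p"]
```

For item `"m"` the base contains the prefix (f, c, a) twice and (f, c, a, b) once, matching fca:2 and fcab:1. With a minimum support of 3, item b drops out and f, c, a each keep a count of 3, which corresponds to the single-path conditional FP-Tree.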

Parallel FP-Growth
- Step 1a: Sharding
- Step 1b: Parallel Counting
- Step 1c: Grouping Items
- Step 2: Parallel FP-Growth
  - Mapper: Generate group-dependent transactions
  - Reducer: FP-Growth on group-dependent transactions
- Step 3: Aggregation
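The Step 2 mapper can be sketched as follows, based on the commonly described PFP scheme: scan the ordered transaction from right to left and, for each group seen for the first time, emit the prefix ending at that item. The function name, the `group_of` mapping, and the two-group split in the usage note are hypothetical; this is not Mahout's code.

```python
def group_dependent_transactions(transaction, item_order, group_of):
    """Emit (group_id, prefix) pairs for one transaction.

    item_order lists the frequent items in FP-Tree order; group_of maps each
    frequent item to the id of the group it was assigned to in Step 1c.
    """
    rank = {item: r for r, item in enumerate(item_order)}
    ordered = sorted((i for i in set(transaction) if i in rank), key=rank.__getitem__)
    emitted = set()
    output = []
    # Right-to-left scan: the first hit for a group yields its longest prefix.
    for j in range(len(ordered) - 1, -1, -1):
        gid = group_of[ordered[j]]
        if gid not in emitted:
            emitted.add(gid)
            output.append((gid, ordered[: j + 1]))
    return output
```

With items f, c, a assigned to group 0 and b, m, p to group 1, the transaction {f, a, c, d, g, i, m, p} produces `(1, [f, c, a, m, p])` and `(0, [f, c, a])`. Each reducer then receives all prefixes for one group and can run FP-Growth on them independently.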

Work Done
Observations:
- The computation time of Parallel FP-Growth grows exponentially as the minimum support count is lowered
- Some Reducers in Step 2 of the PFP process take significantly longer to execute than others
- The current parameters in PFP cannot mitigate the issue
- The properties of the database (transaction length, sparseness of the data) are more important than the size of the database
Proposed Solution:
- Use the Parent-Child MapReduce feature of Symphony to increase the level of parallelization in the reduce phase of Step 2 of PFP

Implementation
- Used the Apache Mahout implementation of PFP

Framework Comparison

Evaluations
- We call our new algorithm R-PFP (Recursive-PFP)
- Compared the computation time of Mahout PFP against R-PFP
- Used two publicly available datasets: Kosarak and Twitter
- Same parameters for PFP and R-PFP
- Average over 5 runs

                      Kosarak      Twitter
#Transactions         990,002      22,524,846
#Unique Items         41,270       3,380,085
Avg. Trans. Length    8            1
Max. Trans. Length    2,498        30

Results
- Up to a 3-fold reduction in computation time

Results

Summary
Work Done:
- Designed and developed R-PFP
- R-PFP is capable of deeper parallelization during processing than PFP
- Deep parallelization is achieved using the Parent-Child MapReduce feature of IBM Platform Symphony
- R-PFP achieved up to 3 times lower computation time than PFP
Future Work:
- Develop a better method to predict FP-Tree load
- Test Parent-Child MapReduce on other recursive divide-and-conquer algorithms

Paper
Deep Parallelization of Parallel FP-Growth Using Parent-Child MapReduce. Adetokunbo Makanju, Zahra Farzanyar, Aijun An, Nick Cercone, Zane Hu, and Yonggang Hu. IEEE BigData 2016, Washington D.C., USA.

Thank you!
