MapReduce
MapReduce is one of the most popular distributed programming models. The model has two phases:
Map phase: distributed processing based on key-value pairs.
Reduce phase: summarizing the output of the map phase into a single output.
Advantages: reduces communication overhead and improves fault tolerance.
Disadvantages: not dynamic, and no synchronization between jobs.
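The two phases can be illustrated with a minimal single-process sketch in Python. It is not tied to Hadoop or any other framework; the function names `map_phase`, `shuffle`, `reduce_phase`, and `run_job` and the three sample transactions are assumptions made for illustration only.

```python
from collections import defaultdict

def map_phase(transaction):
    # Map: emit one (key, value) pair per item in the record.
    for item in transaction:
        yield (item, 1)

def shuffle(pairs):
    # Group all values by key, as a framework would do between the phases.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    # Reduce: summarize all values for one key into a single output.
    return key, sum(values)

def run_job(records):
    pairs = [pair for record in records for pair in map_phase(record)]
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())

# Hypothetical input: three small transactions.
counts = run_job([["bread", "milk"], ["bread", "diaper"], ["milk"]])
print(counts)
```

Each map call sees only its own record and each reduce call sees only its own key, which is what makes the model easy to distribute and also why jobs cannot synchronize with one another.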

Our Challenge
A number of data analytics algorithms use a recursive divide-and-conquer approach. This approach requires synchronization, so it is difficult to create distributed versions of such algorithms using the MapReduce model. Our answer is Parent-Child MapReduce.

Parent-Child MapReduce
Parent-Child MapReduce is an extension of the MapReduce programming model that allows:
Tasks to be created dynamically.
Tasks to be synchronized in a hierarchical fashion.
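The idea of dynamically created, hierarchically synchronized tasks can be sketched with Python's standard `concurrent.futures` module. This is only an analogy running in one process, not the Symphony API: the function `parent_child_sum`, its `threshold` parameter, and the use of a sum as the workload are all assumptions for illustration.

```python
from concurrent.futures import ThreadPoolExecutor

def parent_child_sum(values, threshold=4):
    """Sum `values`; large inputs are split into child tasks at run time."""
    if len(values) <= threshold:
        return sum(values)  # small enough: solve directly, no children
    mid = len(values) // 2
    # The parent decides at run time to create two children. Each parent
    # owns its own small pool, so a waiting parent never starves its children.
    with ThreadPoolExecutor(max_workers=2) as pool:
        children = [pool.submit(parent_child_sum, part, threshold)
                    for part in (values[:mid], values[mid:])]
        # Hierarchical synchronization: the parent blocks until both
        # children (and, transitively, their descendants) have finished.
        return sum(child.result() for child in children)

total = parent_child_sum(list(range(100)))
print(total)
```

The number of tasks is not fixed up front, since it depends on the input size, and every parent waits only for its own subtree.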

FP-Growth
Using Parallel FP-Growth (PFP) as a reference, we show that:
Parent-Child MapReduce can be used to parallelize recursive divide-and-conquer algorithms.
Using Parent-Child MapReduce can lead to performance gains.

Frequent Pattern Mining
Problem statement: given a transaction database DB and a minimum support threshold ξ, find all frequent patterns with support no less than ξ.
Example:
Input: a transaction database DB and minimum support ξ = 3.
Output: all frequent patterns, i.e., {bread}, {diaper}, ..., {bread, milk}, ...
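The problem statement can be made concrete with a brute-force sketch in Python. The slide's example database did not survive in this transcript, so the five market-basket transactions below are hypothetical; the helper names `support` and `frequent_patterns` are likewise assumptions. The sketch enumerates candidates exhaustively and is meant to define the task, not to solve it efficiently.

```python
from itertools import combinations

def support(itemset, db):
    # Support = number of transactions that contain every item of the itemset.
    return sum(1 for transaction in db if itemset <= transaction)

def frequent_patterns(db, min_support):
    items = sorted(set().union(*db))
    result = {}
    for size in range(1, len(items) + 1):
        found = False
        for candidate in combinations(items, size):
            count = support(frozenset(candidate), db)
            if count >= min_support:
                result[frozenset(candidate)] = count
                found = True
        if not found:
            break  # no frequent set of this size, so none of a larger size
    return result

# Hypothetical database (not the one shown on the slide).
db = [
    {"bread", "milk"},
    {"bread", "diaper", "beer", "eggs"},
    {"milk", "diaper", "beer", "cola"},
    {"bread", "milk", "diaper", "beer"},
    {"bread", "milk", "diaper", "cola"},
]
patterns = frequent_patterns(db, min_support=3)
for itemset, count in sorted(patterns.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), count)
```

With ξ = 3 this database yields four frequent single items and four frequent pairs; no triple reaches the threshold.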

Methods
Single-node environment: Apriori, FP-Growth, Eclat.
Distributed environment: MRApriori (Map-Reduce Apriori), PFP (Parallel FP-Growth), DistEclat (Distributed Eclat).

FP-Growth
The FP-Growth algorithm has two phases. The first phase is the construction of the FP-Tree, a data structure that summarizes the contents of the transaction database. The FP-Tree consists of a header table of frequent items, a prefix tree, and links between the items in the header table and their first occurrence in the prefix tree.

FP-Growth
The second phase is the discovery of the frequent patterns in the FP-Tree. This is a recursive divide-and-conquer process: each recursive call breaks the database into smaller projections, which are represented by smaller FP-Trees called conditional FP-Trees. The process continues until the tree is empty or contains only a single path.
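The recursive divide-and-conquer structure can be sketched in Python as follows. To keep the example short, this simplified version recurses over projected databases held as plain lists instead of building conditional FP-Trees, and it stops when a projection is empty rather than using the single-path shortcut. The function name `fp_growth`, the `(items, count)` representation, and the alphabetical tie-breaking are assumptions, not the authors' implementation.

```python
from collections import Counter

def fp_growth(db, min_support, suffix=frozenset()):
    """db is a list of (items, count) pairs; returns {pattern: support}."""
    counts = Counter()
    for items, n in db:
        for item in set(items):
            counts[item] += n
    # Frequent items, most frequent first (ties broken alphabetically).
    order = sorted((i for i in counts if counts[i] >= min_support),
                   key=lambda i: (-counts[i], i))
    rank = {item: pos for pos, item in enumerate(order)}
    # Drop infrequent items and sort each transaction by rank.
    pruned = [(sorted((i for i in set(items) if i in rank), key=rank.get), n)
              for items, n in db]
    patterns = {}
    for item in order:
        pattern = suffix | {item}
        patterns[pattern] = counts[item]
        # Projection: the prefix that precedes `item` in each transaction.
        projected = [(t[:t.index(item)], n) for t, n in pruned if item in t]
        projected = [(t, n) for t, n in projected if t]
        if projected:  # recurse on the smaller, conditional database
            patterns.update(fp_growth(projected, min_support, pattern))
    return patterns

# The five transactions of the well-known f, c, a, b, m, p example.
db = [(list("facdgimp"), 1), (list("abcflmo"), 1), (list("bfhjo"), 1),
      (list("bcksp"), 1), (list("afcelpmn"), 1)]
patterns = fp_growth(db, min_support=3)
print(len(patterns), patterns[frozenset("fcam")])
```

Each recursive call works on a strictly smaller projection, which is the property that Parent-Child MapReduce later exploits to run the calls as separate tasks.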

FP-Tree: Summary of Transaction Database
Transaction database (minimum support ξ = 3):
TID   Items bought
100   {f, a, c, d, g, i, m, p}
200   {a, b, c, f, l, m, o}
300   {b, f, h, j, o}
      {b, c, k, s, p}
500   {a, f, c, e, l, p, m, n}
Header table items: f, c, a, b, m, p
[Figure: building an FP-Tree from the transaction database. Node labels recoverable from the figure: {} (root), f:4, c:3, a:3, m:2, p:2, m:1, c:1, b:1, p:1.]
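The construction can be sketched in Python using the five transactions listed on this slide. The class `FPNode`, the functions `build_fp_tree` and `render`, and the `tie_order` parameter are assumptions for illustration. Items with equal counts are ordered according to the slide's header table (f, c, a, b, m, p), and the header table here keeps a list of all nodes per item rather than a chain of links.

```python
from collections import Counter

class FPNode:
    def __init__(self, item=None, parent=None):
        self.item = item
        self.parent = parent
        self.count = 0
        self.children = {}

def build_fp_tree(db, min_support, tie_order):
    # First scan: count items and keep the frequent ones.
    counts = Counter(item for t in db for item in set(t))
    frequent = [i for i in counts if counts[i] >= min_support]
    # tie_order must contain every frequent item; it decides ties only.
    frequent.sort(key=lambda i: (-counts[i], tie_order.index(i)))
    rank = {item: pos for pos, item in enumerate(frequent)}
    root = FPNode()
    header = {item: [] for item in frequent}  # item -> its nodes in the tree
    # Second scan: insert each transaction along a shared prefix path.
    for t in db:
        node = root
        for item in sorted((i for i in set(t) if i in rank), key=rank.get):
            child = node.children.get(item)
            if child is None:
                child = FPNode(item, node)
                node.children[item] = child
                header[item].append(child)
            child.count += 1
            node = child
    return root, header

def render(node, depth=0):
    lines = []
    for child in node.children.values():
        lines.append("  " * depth + f"{child.item}:{child.count}")
        lines.extend(render(child, depth + 1))
    return lines

db = [set("facdgimp"), set("abcflmo"), set("bfhjo"), set("bcksp"), set("afcelpmn")]
root, header = build_fp_tree(db, min_support=3, tie_order="fcabmp")
print("\n".join(render(root)))
```

Transactions that share a prefix of frequent items share a path in the tree, which is why the tree is usually much smaller than the database it summarizes.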

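Conditional FP-Tree (Item m)
Conditional pattern base for item m (the projection of the FP-Tree onto paths containing m): fca:2, fcab:1
Header table entry: m, count 3
Conditional FP-Tree for item m: {} -> f:3 -> c:3 -> a:3
[Figure: pattern growth for item m, showing the projected paths of the FP-Tree (nodes f:4, b:1, m:2, m:1) and the resulting conditional FP-Tree.]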
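The numbers on this slide can be reproduced with a short Python sketch. As a shortcut, it reads the prefix of each transaction directly instead of walking the FP-Tree; both give the same conditional pattern base. The transactions are assumed to be already pruned to frequent items and sorted in the header-table order f, c, a, b, m, p, and the function names are hypothetical.

```python
from collections import Counter

def conditional_pattern_base(ordered_db, item):
    # Collect the prefix that precedes `item` in every transaction containing it.
    base = Counter()
    for t in ordered_db:
        if item in t:
            prefix = tuple(t[:t.index(item)])
            if prefix:
                base[prefix] += 1
    return base

def conditional_frequent_items(base, min_support):
    # Count items inside the base and keep those that are still frequent.
    counts = Counter()
    for prefix, n in base.items():
        for i in prefix:
            counts[i] += n
    return {i: c for i, c in counts.items() if c >= min_support}

# Transactions from the example, pruned and ordered as f, c, a, b, m, p.
ordered_db = [list("fcamp"), list("fcabm"), list("fb"), list("cbp"), list("fcamp")]
base = conditional_pattern_base(ordered_db, "m")
kept = conditional_frequent_items(base, min_support=3)
print(dict(base))
print(kept)
```

Item b occurs only once in the base, below ξ = 3, so it is dropped and the conditional FP-Tree for m reduces to the single path f:3, c:3, a:3.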

Parallel FP-Growth
Step 1a: Sharding
Step 1b: Parallel counting
Step 1c: Grouping items
Step 2: Parallel FP-Growth
  Mapper: generate group-dependent transactions
  Reducer: run FP-Growth on group-dependent transactions
Step 3: Aggregation
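The mapper of Step 2 can be sketched in Python as follows, based on the usual description of PFP: each transaction is scanned from its least frequent item backwards, and for every item group met for the first time the prefix ending at that item is emitted under the group's id. This is not Mahout's code. The grouping in `group_of`, the function names, and the sample transactions (assumed to be pruned and sorted by descending global frequency after Step 1) are assumptions.

```python
from collections import defaultdict

def group_dependent_transactions(transaction, group_of):
    # Mapper: scan from the last (least frequent) item to the first.
    emitted = set()
    for j in range(len(transaction) - 1, -1, -1):
        gid = group_of[transaction[j]]
        if gid not in emitted:
            emitted.add(gid)
            # Everything up to position j is what this group's reducer needs.
            yield gid, transaction[:j + 1]

def shuffle(ordered_db, group_of):
    # Route mapper output to one shard per group (one reducer per group).
    shards = defaultdict(list)
    for t in ordered_db:
        for gid, prefix in group_dependent_transactions(t, group_of):
            shards[gid].append(prefix)
    return dict(shards)

# Hypothetical result of Step 1c: six frequent items split into three groups.
group_of = {"f": 0, "c": 0, "a": 1, "b": 1, "m": 2, "p": 2}
ordered_db = [list("fcamp"), list("fcabm"), list("fb"), list("cbp"), list("fcamp")]
shards = shuffle(ordered_db, group_of)
for gid in sorted(shards):
    print(gid, ["".join(t) for t in shards[gid]])
```

Each reducer then runs FP-Growth on its own shard independently. Shards for groups of low-frequency items receive longer prefixes, which is one plausible source of the uneven reducer times noted in the observations.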

Work Done
Observations:
The computation time of Parallel FP-Growth increases exponentially as the minimum support count decreases.
Some reducers in Step 2 of the PFP process take significantly longer to execute.
The current parameters in PFP cannot mitigate the issue.
The properties of the database (transaction length, sparseness of the data) are more important than the size of the database.
Proposed solution:
Use the Parent-Child MapReduce feature of Symphony to increase the level of parallelization in the reduce phase of Step 2 of PFP.

Implementation
We used the Apache Mahout implementation of PFP.

Framework Comparison

Evaluations
We call our new algorithm R-PFP (Recursive-PFP).
We compared the computation time of Mahout PFP against R-PFP.
We used two publicly available datasets: Kosarak and Twitter.
The same parameters were used for PFP and R-PFP.
Results are averaged over 5 runs.

                      Kosarak      Twitter
#Transactions         990,002      22,524,846
#Unique items         41,270       3,380,085
Avg. trans. length    8            1
Max. trans. length    2,498        30

Results
Up to 3 times reduction in computation time.

Summary
Work done:
Designed and developed R-PFP.
R-PFP is capable of deeper parallelization during processing than PFP.
Deep parallelization is achieved using the Parent-Child MapReduce feature of IBM Platform Symphony.
R-PFP achieved up to 3 times better computation time compared to PFP.
Future work:
Develop a better method to predict FP-Tree load.
Test Parent-Child MapReduce on other recursive divide-and-conquer algorithms.

Paper
Adetokunbo Makanju, Zahra Farzanyar, Aijun An, Nick Cercone, Zane Hu, and Yonggang Hu. "Deep Parallelization of Parallel FP-Growth Using Parent-Child MapReduce." IEEE BigData 2016, Washington D.C., USA.

Thank you!