Mining High Utility Itemsets in Big Data

Similar presentations
Vincent S. Tseng, Cheng-Wei Wu, Bai-En Shie, and Philip S. Yu SIG KDD 2010 UP-Growth: An Efficient Algorithm for High Utility Itemset Mining 2010/8/25.

Frequent Itemset Mining Methods. The Apriori algorithm Finding frequent itemsets using candidate generation Seminal algorithm proposed by R. Agrawal and.
A distributed method for mining association rules
Mining Compressed Frequent- Pattern Sets Dong Xin, Jiawei Han, Xifeng Yan, Hong Cheng Department of Computer Science University of Illinois at Urbana-Champaign.
LIBRA: Lightweight Data Skew Mitigation in MapReduce
Query Optimization of Frequent Itemset Mining on Multiple Databases Mining on Multiple Databases David Fuhry Department of Computer Science Kent State.
Frequent Closed Pattern Search By Row and Feature Enumeration
MapReduce-based Closed Frequent Itemset Mining with Efficient Redundancy Filtering Su-Qi Wang ∗, Yu-Bin Yang ∗, Guang-Peng Chen ∗, Yang Gao ∗ and Yao Zhang†
Mining Frequent Itemsets from Uncertain Data Presented by Chun-Kit Chui, Ben Kao, Edward Hung Department of Computer Science, The University of Hong Kong.
Data Mining Techniques So Far: Cluster analysis K-means Classification Decision Trees J48 (C4.5) Rule-based classification JRIP (RIPPER) Logistic Regression.
Parallel K-Means Clustering Based on MapReduce The Key Laboratory of Intelligent Information Processing, Chinese Academy of Sciences Weizhong Zhao, Huifang.
1 UP-Growth: An Efficient Algorithm for High Utility Itemset Mining Vincent S. Tseng, Cheng-Wei Wu, Bai-En Shie, and Philip S. Yu SIG KDD 2010.
Design Patterns for Efficient Graph Algorithms in MapReduce Jimmy Lin and Michael Schatz University of Maryland MLG, January, 2014 Jaehwan Lee.
USING HADOOP & HBASE TO BUILD CONTENT RELEVANCE & PERSONALIZATION Tools to build your big data application Ameya Kanitkar.
USpan: An Efficient Algorithm for Mining High Utility Sequential Patterns Authors: Junfu Yin, Zhigang Zheng, Longbing Cao In: Proceedings of the 18th ACM.
林俊宏 Parallel Association Rule Mining based on FI-Growth Algorithm Bundit Manaskasemsak, Nunnapus Benjamas, Arnon Rungsawang.
Mining High Utility Itemsets without Candidate Generation Date: 2013/05/13 Author: Mengchi Liu, Junfeng Qu Source: CIKM "12 Advisor: Jia-ling Koh Speaker:
VLDB 2012 Mining Frequent Itemsets over Uncertain Databases Yongxin Tong 1, Lei Chen 1, Yurong Cheng 2, Philip S. Yu 3 1 The Hong Kong University of Science.
CS525: Special Topics in DBs Large-Scale Data Management Hadoop/MapReduce Computing Paradigm Spring 2013 WPI, Mohamed Eltabakh 1.
Sequential PAttern Mining using A Bitmap Representation
MapReduce: Hadoop Implementation. Outline MapReduce overview Applications of MapReduce Hadoop overview.
Hadoop/MapReduce Computing Paradigm 1 Shirish Agale.
Introduction to Hadoop and HDFS
Data Mining Association Analysis: Basic Concepts and Algorithms Lecture Notes for Chapter 6 Introduction to Data Mining By Tan, Steinbach, Kumar Lecture.
Modul 7: Association Analysis. 2 Association Rule Mining  Given a set of transactions, find rules that will predict the occurrence of an item based on.
Storage and Retrieval of Large RDF Graph Using Hadoop and MapReduce Mohammad Farhan Husain, Pankil Doshi, Latifur Khan, Bhavani Thuraisingham University.
Map-Reduce-Merge: Simplified Relational Data Processing on Large Clusters H.Yang, A. Dasdan (Yahoo!), R. Hsiao, D.S.Parker (UCLA) Shimin Chen Big Data.
Map-Reduce-Merge: Simplified Relational Data Processing on Large Clusters Hung-chih Yang(Yahoo!), Ali Dasdan(Yahoo!), Ruey-Lung Hsiao(UCLA), D. Stott Parker(UCLA)
S. Sathya, M. Victor Jose, Department of Computer Science and Engineering, Noorul Islam Centre for Higher Education, Kumaracoil, Tamilnadu, India. Proceedings of ICETECT.
Alva Erwin Department ofComputing Raj P. Gopalan, and N.R. Achuthan Department of Mathematics and Statistics Curtin University of Technology Kent St. Bentley.
Hidemoto Nakada, Hirotaka Ogawa and Tomohiro Kudoh National Institute of Advanced Industrial Science and Technology, Umezono, Tsukuba, Ibaraki ,
Mining Top-K High Utility Itemsets Date: 2013/04/08 Author: Cheng Wei Wu, Bai-En Shie, Philip S. Yu, Vincent S. Tseng Source: KDD ’12 Advisor: Dr. Jia-Ling.
Computer Science and Engineering Parallelizing Defect Detection and Categorization Using FREERIDE Leonid Glimcher P. 1 ipdps’05 Scaling and Parallelizing.
Outline Introduction – Frequent patterns and the Rare Item Problem – Multiple Minimum Support Framework – Issues with Multiple Minimum Support Framework.
LCM ver.3: Collaboration of Array, Bitmap and Prefix Tree for Frequent Itemset Mining Takeaki Uno Masashi Kiyomi Hiroki Arimura National Institute of Informatics,
Mining Frequent Itemsets from Uncertain Data Presenter : Chun-Kit Chui Chun-Kit Chui [1], Ben Kao [1] and Edward Hung [2] [1] Department of Computer Science.
Data Mining Find information from data data ? information.
Computing Scientometrics in Large-Scale Academic Search Engines with MapReduce Leonidas Akritidis Panayiotis Bozanis Department of Computer & Communication.
Mining Document Collections to Facilitate Accurate Approximate Entity Matching Presented By Harshda Vabale.
MapReduce and Data Management Based on slides from Jimmy Lin’s lecture slides ( (licensed.
Intelligent Database Systems Lab 國立雲林科技大學 National Yunlin University of Science and Technology Advisor : Dr. Hsu Graduate : Yu Cheng Chen Author: Chung-hung.
CS525: Big Data Analytics MapReduce Computing Paradigm & Apache Hadoop Open Source Fall 2013 Elke A. Rundensteiner 1.
Intelligent DataBase System Lab, NCKU, Taiwan Josh Jia-Ching Ying 1, Wang-Chien Lee 2, Tz-Chiao Weng 1 and Vincent S. Tseng 1 1 Department of Computer.
DynamicMR: A Dynamic Slot Allocation Optimization Framework for MapReduce Clusters Nanyang Technological University Shanjiang Tang, Bu-Sung Lee, Bingsheng.
 Frequent Word Combinations Mining and Indexing on HBase Hemanth Gokavarapu Santhosh Kumar Saminathan.
IBM Research ® © 2007 IBM Corporation Introduction to Map-Reduce and Join Processing.
Massive Semantic Web data compression with MapReduce Jacopo Urbani, Jason Maassen, Henri Bal Vrije Universiteit, Amsterdam HPDC ( High Performance Distributed.
M. Sulaiman Khan Dept. of Computer Science University of Liverpool 2009 COMP527: Data Mining ARM: Improvements March 10, 2009 Slide.
HEMANTH GOKAVARAPU SANTHOSH KUMAR SAMINATHAN Frequent Word Combinations Mining and Indexing on HBase.
An Energy-Efficient Approach for Real-Time Tracking of Moving Objects in Multi-Level Sensor Networks Vincent S. Tseng, Eric H. C. Lu, & Kawuu W. Lin Institute.
A Scalable Association Rules Mining Algorithm Based on Sorting, Indexing and Trimming Chuang-Kai Chiou, Judy C. R Tseng Proceedings of the Sixth International.
By Shivaraman Janakiraman, Magesh Khanna Vadivelu.
University at BuffaloThe State University of New York Pattern-based Clustering How to cluster the five objects? qHard to define a global similarity measure.
COMP7330/7336 Advanced Parallel and Distributed Computing MapReduce - Introduction Dr. Xiao Qin Auburn University
Item Based Recommender System SUPERVISED BY: DR. MANISH KUMAR BAJPAI TARUN BHATIA ( ) VAIBHAV JAISWAL( )
Implementation of Classifier Tool in Twister Magesh khanna Vadivelu Shivaraman Janakiraman.
Fast Mining Frequent Patterns with Secondary Memory Kawuu W. Lin, Sheng-Hao Chung, Sheng-Shiung Huang and Chun-Cheng Lin Department of Computer Science.
MapReduce MapReduce is one of the most popular distributed programming models Model has two phases: Map Phase: Distributed processing based on key, value.
Mining Dependent Patterns
UP-Growth: An Efficient Algorithm for High Utility Itemset Mining
Frequent Pattern Mining
Hadoop Clusters Tess Fulkerson.
Data Mining Association Analysis: Basic Concepts and Algorithms
Mining Frequent Itemsets over Uncertain Databases
Association Rule Mining
A Parameterised Algorithm for Mining Association Rules
Mining Association Rules from Stars
CS110: Discussion about Spark
Parallel Processing Priority Trie-based IP Lookup Approach
Junqiang Liu, Rong Zhao, Xiangcai Yang, Yong Zhang, Xiaoning Jiang
Presentation transcript:

Mining High Utility Itemsets in Big Data Ying Chun Lin, Cheng-Wei Wu, Vincent S. Tseng Department of Computer Science and Information Engineering, National Cheng Kung University, Taiwan Good morning, ladies and gentlemen. Today I am going to talk about "Mining High Utility Itemsets in Big Data". My name is Ying Chun Lin, and I am from the Department of Computer Science and Information Engineering of National Cheng Kung University. The co-authors of this work are Cheng-Wei Wu and my advisor, Vincent S. Tseng. Intelligent Database System Lab

Outline Introduction Definition PHUI-Growth Experiment Conclusion This is the outline of the presentation. In the introduction, I will talk about what high utility itemset mining is and why we should study it in big data. Second, I will briefly introduce the definitions. The third part is PHUI-Growth, the main framework of our work. Then, I will show the performance of PHUI-Growth in the experiment part. Last, I will close the presentation with a brief conclusion.

Introduction Let's start with the introduction.

Introduction What is high utility itemset (HUI) mining? It is one of the most important tasks in frequent pattern mining, and can be used to discover sets of items carrying high utilities (e.g., high profits). However, in the era of Big Data, there is currently no parallel solution for high utility itemset mining. (Read the first point.) For example, when you go to a shopping mall you buy some products, which are called items; each product has a profit, and we want to find out which sets of items are the most profitable ones. (Read the second point.)

Major Problems for High Utility Mining in Big Data Large amounts of transactions and varied items in big data. High computational complexity: large search space, combination explosion. Scalability issue: data cannot be held or processed on a single machine. A parallel algorithm is needed. What are the major problems for HUI mining when we face big data? (1) The number of transactions is large and items vary across the dataset. (1-1) First, high computational complexity: we face a large search space and the combination explosion problem, which makes the mining task suffer from very expensive computational costs in practice. (1-2) Second, data cannot be processed on a single machine; it may take months or years to solve a big data problem on a single machine. (2) As a result, a well-designed algorithm incorporated with a parallel programming architecture is needed.

A New Framework for High Utility Mining Parallel mining of High Utility Itemsets by pattern Growth (PHUI-Growth). Implemented on the Hadoop platform. Stores the large dataset distributed on HDFS. Designs a new pruning strategy, Discarding Local Unpromising items in the MapReduce framework (DLU-MR). We propose a new framework, Parallel mining of HUIs by pattern Growth, PHUI-Growth. (1) We implemented it on the Hadoop platform, so it inherits several nice properties from Hadoop, such as easy deployment in a high-level language, fault tolerance, low communication overheads, and high scalability on commodity hardware. (2) The dataset is distributed across HDFS. (3) A new pruning strategy, Discarding Local Unpromising items in the MapReduce framework (DLU-MR), is proposed. In traditional itemset mining, pattern explosion is a common problem, so we propose a strategy for pruning unpromising patterns efficiently and in parallel.

Definition Then comes the definition part.

Problem Definition

Unit profits (the external utility): A: 2, B: 3, C: 1, D: 3, E: 4, F: 4, G: 8

Transactional database (quantities in parentheses are the internal utilities):
T1: A(4), B(2), C(8), D(2), total utility 28
T2: A(4), B(2), C(8), total utility 22
T3: C(4), D(2), E(2), F(2), total utility 26
T4: E(2), F(2), G(1), total utility 24

Which itemsets are the high utility itemsets?

Let's introduce some definitions with this example. First, we have items A, B, C, D, E, F and G; these form the set of items, and the right column shows their unit profits. For example, the unit profit of A is 2 dollars; this is the external utility. Then we move on to our transactional database. Take T4 as an example: it is a transaction containing items E, F and G with quantities 2, 2 and 1 respectively. These quantities are called internal utilities. What about the utility of T4? We simply multiply the quantities by the unit profits and sum them up, getting 24; this is the utility of the transaction. Last comes our main goal: which itemsets, that is, combinations of items, are the high utility itemsets?

High Utility Itemset If the utility of an itemset X is no less than a user-specified minimum utility threshold θ, we call the itemset a high utility itemset: u(X) ≥ θ → X is a high utility itemset. What is a high utility itemset? (Read the text on the slide.) The above is the formal definition of a high utility itemset.

High Utility Itemset Mining (Unit profit table and transactional database as on the previous slide.) Minimum utility threshold θ = 30. Is {A, C} a high utility itemset? u({A, C}) in the transactional database = (4×2 + 8×1) + (4×2 + 8×1) = 32 ≥ θ → HUI (from T1 and T2). After the definition, let's do utility mining manually. First, we set the minimum utility threshold, 30 in this example. Then, suppose we want to know whether {A, C} is a high utility itemset. We scan the transactional database to find which transactions contain both A and C: we find them in T1 and T2, and calculate the utility of {A, C} in T1 and in T2. Because the utility of {A, C} in the transactional database, 32, is higher than the minimum utility threshold, 30, it is called a high utility itemset in this transactional database.
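The {A, C} check above can be sketched in a few lines of Python. The `PROFIT` and `DB` structures and the `utility` helper are illustrative names for this walkthrough, not part of the paper.

```python
# Unit profits and the running-example database (quantities per transaction).
PROFIT = {"A": 2, "B": 3, "C": 1, "D": 3, "E": 4, "F": 4, "G": 8}
DB = [
    {"A": 4, "B": 2, "C": 8, "D": 2},   # T1
    {"A": 4, "B": 2, "C": 8},           # T2
    {"C": 4, "D": 2, "E": 2, "F": 2},   # T3
    {"E": 2, "F": 2, "G": 1},           # T4
]

def utility(itemset):
    """Sum quantity x unit profit of `itemset` over every transaction containing it."""
    return sum(
        sum(trans[i] * PROFIT[i] for i in itemset)
        for trans in DB
        if all(i in trans for i in itemset)
    )

theta = 30
print(utility({"A", "C"}))            # 32 (16 from T1 plus 16 from T2)
print(utility({"A", "C"}) >= theta)   # True, so {A, C} is an HUI
```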

Basic Pruning Strategy: TWU Downward Closure Property Transaction-Weighted Utilization (TWU) of an itemset X: the sum of the utilities of the transactions containing X, e.g., TWU({E, F}) = 26 + 24 = 50. High-TWU itemset: e.g., TWU({E, F}) = 50 ≥ (θ = 30). TWU downward closure property: any superset of a low-TWU itemset is not a high utility itemset [Y. Liu et al., UBDM, 2005]. (Transactional database as before: T1 utility 28, T2 utility 22, T3 utility 26, T4 utility 24.) Like frequent itemset mining, we could enumerate all possible combinations of items, calculate the utility of each itemset, and determine whether each is an HUI. But this faces the pattern explosion problem, especially with big data, so we need an efficient pruning strategy. Let's define a few things before we get to it. First, the transaction-weighted utilization of an itemset X: when the itemset is contained in a transaction, its TWU contribution from that transaction is the utility of the whole transaction. Take {E, F} for example: it shows up in T3 and T4, so the TWU of {E, F} in the transactional database is 26 plus 24, which equals 50. Second, what is a high-TWU itemset? Again taking {E, F}, its TWU is greater than the minimum utility threshold, 30, so it is a high-TWU itemset. Last, we can get a feeling for the TWU downward closure property: the TWU of an itemset is the maximum possible utility the itemset could get in the transactional database. As a result, low-TWU itemsets can be pruned, because they are impossible to become HUIs in the transactional database.
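The TWU computation on the same running example can be sketched as follows; the function and table names are illustrative, not the paper's.

```python
PROFIT = {"A": 2, "B": 3, "C": 1, "D": 3, "E": 4, "F": 4, "G": 8}
DB = [
    {"A": 4, "B": 2, "C": 8, "D": 2},   # T1, utility 28
    {"A": 4, "B": 2, "C": 8},           # T2, utility 22
    {"C": 4, "D": 2, "E": 2, "F": 2},   # T3, utility 26
    {"E": 2, "F": 2, "G": 1},           # T4, utility 24
]

def transaction_utility(trans):
    """Utility of a whole transaction: sum of quantity x unit profit."""
    return sum(q * PROFIT[i] for i, q in trans.items())

def twu(itemset):
    """TWU(X): sum of the utilities of the transactions that contain X."""
    return sum(transaction_utility(t) for t in DB
               if all(i in t for i in itemset))

theta = 30
print(twu({"E", "F"}))   # 26 + 24 = 50 >= theta, so {E, F} is a high-TWU itemset
print(twu({"G"}))        # 24 < theta: no superset of {G} can be an HUI
```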

Proposed Method: PHUI-Growth After the definitions, let's talk about our method.

PHUI-Growth Overview Counting Phase: apply the downward closure property to prune low-TWU 1-itemsets, and reorganize each transaction. Mining Phase: iteratively produce k-HUIs and conditional u-transactions until all HUIs are found. In each phase, the distributed database flows through mappers (Mapper 1 through Mapper n) and reducers (Reducer 1 through Reducer m); the mining phase is an iterative MapReduce job using the basic pruning strategy and DLU-MR. This is the overview of our framework, PHUI-Growth. First, during the counting phase, we obtain the TWU of each item; we then prune the low-TWU items according to the TWU downward closure property, and the database is transformed for the mining phase. In the mining phase, high utility itemsets are discovered in each iteration, and local unpromising items are pruned in the reducers by DLU-MR, which is introduced later.

Counting Phase Calculate the TWU of all items via key-value pairs: the key is an item in a transaction, and the value is the TWU contribution of the key item (the utility of the transaction).

Distributed database (item utilities): T1: A(8) B(6) C(8) D(6); T2: A(8) B(6) C(8); T3: C(4) D(6) E(8) F(8); T4: E(8) F(8) G(8).

Mapper 1 (T1: ABCD) emits <A, 28>, <B, 28>, <C, 28>, <D, 28>. Mapper 2 (T2: ABC) emits <A, 22>, <B, 22>, <C, 22>. Mapper 3 (T3: CDEF) emits <C, 26>, <D, 26>, <E, 26>, <F, 26>. Mapper 4 (T4: EFG) emits <E, 24>, <F, 24>, <G, 24>.

Reducer 1: key A gets <A, 28>, <A, 22> and outputs <A, 50>; key B gets <B, 28>, <B, 22> and outputs <B, 50>. Reducer 2: key C gets <C, 28>, <C, 22>, <C, 26> and outputs <C, 76>; key D gets <D, 28>, <D, 26> and outputs <D, 54>. Reducer 3: key E gets <E, 26>, <E, 24> and outputs <E, 50>; key F gets <F, 26>, <F, 24> and outputs <F, 50>; key G gets <G, 24> and outputs <G, 24>.

Let's look at the counting phase. The distributed transactional database is the same as in the previous example, and the utility of each item in each transaction has already been calculated. The minimum utility threshold is 30. In the map phase, we take the item as the key and the utility of the transaction as the value; for T1, the transaction is separated into the key-value pairs <A, 28>, <B, 28>, <C, 28> and <D, 28>. In the reduce phase, all key-value pairs with the same key go to the same reducer, and the TWU of each item in the transactional database is summed. For example, B is in T1 and T2; in the map phase these become <B, 28> and <B, 22>, both pairs go to the same reducer, and simply adding them gives B's TWU of 50 in the transactional database.
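The counting phase can be mimicked with plain Python functions standing in for Hadoop mappers and reducers. This is a sketch: `count_map` and `count_reduce` are made-up names, and real Hadoop shuffles the pairs by key between the two steps instead of collecting them in one list.

```python
from collections import defaultdict

# u-transactions: each item is stored with its utility (quantity x unit profit).
U_DB = [
    {"A": 8, "B": 6, "C": 8, "D": 6},   # T1
    {"A": 8, "B": 6, "C": 8},           # T2
    {"C": 4, "D": 6, "E": 8, "F": 8},   # T3
    {"E": 8, "F": 8, "G": 8},           # T4
]

def count_map(trans):
    """Emit one <item, transaction utility> pair per item of the transaction."""
    tu = sum(trans.values())
    return [(item, tu) for item in trans]

def count_reduce(pairs):
    """Sum the values of each key: the result is the TWU of every item."""
    twu = defaultdict(int)
    for item, tu in pairs:
        twu[item] += tu
    return dict(twu)

pairs = [p for t in U_DB for p in count_map(t)]
print(count_reduce(pairs))
# {'A': 50, 'B': 50, 'C': 76, 'D': 54, 'E': 50, 'F': 50, 'G': 24}
```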

Database Transformation

TWU table: A: 50, B: 50, C: 76, D: 54, E: 50, F: 50, G: 24.

Original database (item utilities): T1: A(8), B(6), C(8), D(6); T2: A(8), B(6), C(8); T3: C(4), D(6), E(8), F(8); T4: E(8), F(8), G(8).

Transformed database: T1: A(8), B(6), D(6), C(8); T2: A(8), B(6), C(8); T3: E(8), F(8), D(6), C(4); T4: E(8), F(8).

Prune the low-TWU items, then sort the items of each transaction in TWU-increasing order (minimum utility threshold θ = 30).

Then the database is transformed for the mining phase. From the result of the counting phase, we obtain the TWU of each item (see the table). We prune the low-TWU items based on the TWU downward closure property; in this example, G is a low-TWU item. Last, each transaction is sorted in TWU-increasing order.
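The transformation step can be sketched as follows; the names are illustrative, and ties in TWU are broken alphabetically here, which reproduces the ordering shown on the slide.

```python
TWU = {"A": 50, "B": 50, "C": 76, "D": 54, "E": 50, "F": 50, "G": 24}
THETA = 30

U_DB = [
    {"A": 8, "B": 6, "C": 8, "D": 6},   # T1
    {"A": 8, "B": 6, "C": 8},           # T2
    {"C": 4, "D": 6, "E": 8, "F": 8},   # T3
    {"E": 8, "F": 8, "G": 8},           # T4
]

def transform(trans):
    """Drop low-TWU items, then sort the rest in TWU-increasing order."""
    kept = {i: u for i, u in trans.items() if TWU[i] >= THETA}
    return sorted(kept.items(), key=lambda kv: (TWU[kv[0]], kv[0]))

for t in U_DB:
    print(transform(t))
# [('A', 8), ('B', 6), ('D', 6), ('C', 8)]
# [('A', 8), ('B', 6), ('C', 8)]
# [('E', 8), ('F', 8), ('D', 6), ('C', 4)]
# [('E', 8), ('F', 8)]      <- G pruned
```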

Mining Phase Two tasks: (1) enumerating combinations of items in the transactional database, which faces the pattern explosion problem and therefore needs the pruning strategy Discarding Local Unpromising items in the MapReduce framework (DLU-MR); (2) computing the utility of an itemset in the transactional database, which, with the help of conditional u-transactions, is calculated easily. In the mining phase, if we solve these two tasks in an efficient way, the HUIs in the transactional database can be found easily.

Combination of Items In the mapper of the mining phase, a transaction becomes conditional u-transactions: the key contains an itemset, and the value contains the utility of the key itemset and the remaining items available for combining into new itemsets. The combination of items is done in parallel.

Example: the u-transaction A(8), B(6), C(8) is mapped to <{A}, 8, {B(6), C(8)}>, <{B}, 6, {C(8)}>, <{C}, 8, {∅}>. In the next iteration, <{A}, 8, {B(6), C(8)}> is mapped to <{A, B}, 14, {C(8)}> and <{A, C}, 16, {∅}>.

Let's look at the first task of the mining phase. A transaction is called a u-transaction when it contains the utility of each item in the transaction. First, we define the key-value pairs of the map phase: we separate the u-transaction into several key-value pairs. In the example, the u-transaction is sent to a mapper, and A, B and C are separated into three key-value pairs. The key of the first pair is {A}, and the value part holds the utility of A in the transaction together with the items following A in the transaction; similarly, the key of the second pair is {B}, with the utility of B and the rest of the u-transaction; the key of the last pair is {C}, and since there are no following items, its value contains only the utility of C and an empty set. Length-2 itemsets are combined in the same way: we extend the key part with an item from the following part, update the utility of the key, and keep the rest as the new following part. Thus the combination of items is solved, and the task is done in parallel. However, this may face the pattern explosion problem, so an efficient pruning strategy is proposed later.
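The mapper's key-value generation can be sketched like this. `mine_map` is a hypothetical name; a conditional u-transaction is modeled as a (key itemset, key utility, remaining suffix) triple.

```python
def mine_map(prefix, prefix_util, suffix):
    """For each item in `suffix`, emit <prefix + item, updated utility, rest of suffix>."""
    out = []
    for k, (item, util) in enumerate(suffix):
        out.append((prefix + (item,), prefix_util + util, suffix[k + 1:]))
    return out

# u-transaction A(8) B(6) C(8), starting from the empty prefix:
t = (("A", 8), ("B", 6), ("C", 8))
for key, util, rest in mine_map((), 0, t):
    print(key, util, rest)
# ('A',) 8 (('B', 6), ('C', 8))
# ('B',) 6 (('C', 8),)
# ('C',) 8 ()

# next iteration, extending the conditional u-transaction of {A}:
for key, util, rest in mine_map(("A",), 8, (("B", 6), ("C", 8))):
    print(key, util, rest)
# ('A', 'B') 14 (('C', 8),)
# ('A', 'C') 16 ()
```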

Utility of an Itemset In the reducer of the mining phase, the utility of an itemset in each transaction goes to the same reducer, and the utility of the key itemset is calculated by adding up these per-transaction utilities. Each itemset goes to a different reducer, so the calculation of utilities is done in parallel.

Example: the reducer for key {A} receives <{A}, 8, {B(6), D(6), C(8)}> and <{A}, 8, {B(6), C(8)}>, so u({A}) = 8 + 8 = 16.

Before the pruning strategy, let's look at the second task: calculating the utility of an itemset with the help of conditional u-transactions. All utilities of an itemset in each transaction go to the same reducer. In the example, all per-transaction utilities of {A} arrive at one reducer, so we can calculate the utility of {A} by simply adding these utilities up: the utility of {A} in the transactional database is 16. Thanks to this structure, the calculation of utilities is also done in parallel.

Pruning Strategy Discarding Local Unpromising items in the MapReduce framework (DLU-MR): any local superset of a low local-TWU itemset is a low utility itemset, and the pruning task is done in parallel.

Example: in the reducer for key {A}, with conditional u-transactions <{A}, 8, {B(6), D(6), C(8)}> and <{A}, 8, {B(6), C(8)}>, the local TWUs are B: 50, C: 50, D: 28, so the local unpromising item D is pruned.

In the mining phase of PHUI-Growth, the pruning strategy DLU-MR is proposed to discard local unpromising items in the framework. Take the conditional u-transactions of key {A} as an example. First, the local TWU of each item in the reducer is calculated; then the items with low local TWU are pruned from the conditional u-transactions of the key. The TWU of the remaining items gradually decreases as items are pruned from the conditional u-transactions, so the pattern explosion problem is handled in parallel and efficiently.
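A sketch of the reducer-side DLU-MR pruning on the {A} example; the names are illustrative, and each conditional u-transaction is modeled as a (key utility, suffix) pair.

```python
from collections import defaultdict

def local_twu_prune(cond_transactions, theta):
    """Compute each suffix item's local TWU over the conditional u-transactions
    of one key, then drop the locally unpromising (low local-TWU) items."""
    local = defaultdict(int)
    for prefix_util, suffix in cond_transactions:
        tu = prefix_util + sum(u for _, u in suffix)   # local transaction utility
        for item, _ in suffix:
            local[item] += tu
    pruned = [(pu, [(i, u) for i, u in suffix if local[i] >= theta])
              for pu, suffix in cond_transactions]
    return pruned, dict(local)

# conditional u-transactions of key {A}, coming from T1 and T2:
cond = [(8, [("B", 6), ("D", 6), ("C", 8)]),   # local transaction utility 28
        (8, [("B", 6), ("C", 8)])]             # local transaction utility 22
pruned, local = local_twu_prune(cond, 30)
print(local)    # {'B': 50, 'D': 28, 'C': 50}
print(pruned)   # D (local TWU 28 < 30) is dropped from the first u-transaction
```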

Experiment Now let's evaluate the performance of our method.

Experiment Settings Environment: a 5-node Hadoop cluster; each node has a 2.6 GHz CPU and 4 GB of memory.

Datasets:
Retail: 88,162 transactions, 16,470 items, average transaction length 10, maximum 76.
Chainstore: 1,112,949 transactions, 46,086 items, average length 7, maximum 170.
T10I4N10K|D|2,000K: 2,000,000 transactions, 10,000 items, average length 10, maximum 33.
Chainstore x 5: 5,564,745 transactions.

All experiments were conducted on a five-node Hadoop cluster. Each node is equipped with a 2.60 GHz CPU and 4 GB of memory. The Retail dataset was obtained from the FIMI Repository. Chainstore is a real-life dataset. The synthetic dataset T10I4N10K|D|2,000K was generated with the IBM data generator. The Chainstore x 5 dataset (five copies of Chainstore) is for checking the scalability of our method.

Performance on a Small Dataset Compared algorithms: HUI-Miner [M. Liu et al., CIKM, 2012], PHUI-Growth (Baseline), PHUI-Growth (DLU-MR). In this section, we compare the performance of PHUI-Growth with HUI-Miner [7], a state-of-the-art non-parallel HUI mining algorithm. To evaluate the effectiveness of the DLU-MR strategy, we prepared two versions of PHUI-Growth, called PHUI-Growth (Baseline) and PHUI-Growth (DLU-MR). First, execution time: when the minimum utility threshold is higher than 0.02, we do spend more time than HUI-Miner, because when the data size is small, the communication overhead dominates the execution time. However, when the minimum utility threshold is 0.01, HUI-Miner takes about 4,429 seconds, while our method takes 556 seconds. As for the number of candidates: when the threshold decreases, the number of HUIs dramatically increases and HUI-Miner produces a large number of intermediate itemsets, but the number of candidates produced by PHUI-Growth (DLU-MR) is up to two orders of magnitude smaller than that produced by HUI-Miner. (Figures: Execution Time; Number of Candidates.)

Performance on Large Datasets PHUI-Growth (Baseline) and PHUI-Growth (DLU-MR) outperform HUI-Miner significantly; mining HUIs in parallel greatly improves the performance. Regarding execution time on the large datasets, results show that PHUI-Growth (Baseline) and PHUI-Growth (DLU-MR) outperform HUI-Miner significantly. The reason they perform so well is that they effectively use the nodes of a cluster to process HUIs in parallel across multiple machines, while HUI-Miner is executed on a single, non-parallel machine. (Figures: Chainstore; T10I4N10K|D|2,000K; Chainstore x 5.)

Conclusion

Conclusion A new parallel framework, PHUI-Growth, for mining high utility itemsets in big data: it discovers HUIs in parallel from data distributed across multiple computers; DLU-MR is proposed to prune the search space in parallel and greatly improves the performance of mining HUIs; empirical evaluations show that PHUI-Growth has good scalability on large datasets. We propose a new framework, PHUI-Growth, for mining high utility itemsets in big data. The proposed algorithm efficiently mines high utility itemsets in parallel from data distributed across multiple commodity computers. A novel strategy called DLU-MR is proposed to effectively prune the search space and greatly improve the performance of mining HUIs. Empirical evaluations on different types of real and synthetic datasets show that PHUI-Growth has good scalability on large datasets and outperforms the state-of-the-art algorithms.

Q & A