Junqiang Liu, Rong Zhao, Xiangcai Yang, Yong Zhang, Xiaoning Jiang

Presentation transcript:

Efficient Parallel Algorithm for Mining High Utility Patterns Based on Spark
Junqiang Liu, Rong Zhao, Xiangcai Yang, Yong Zhang, Xiaoning Jiang
School of Information and Electronic Engineering, Zhejiang Gongshang University, Hangzhou 310018, China
23 June 2019

Content
  Motivation
  Problem Statement & Preliminaries: High Utility Pattern Mining, Sequential Algorithms, Frameworks
  Our Mining Approach
  New Parallel Algorithm Based on Spark
  Experimental Evaluation
  Conclusion and Future Work
  References

Motivation
  High Utility Pattern (HUP) Mining vs Frequent Pattern (FP) Mining
    HUP: utility = user's interest + statistical significance
    FP: support = statistical significance only
  HUP mining is much harder than FP mining
    Anti-monotonicity is satisfied for FP: support of a pattern ≤ support of its sub-pattern
    Anti-monotonicity is not satisfied for HUP: utility of a pattern ≤ utility of its sub-pattern is not guaranteed
  Parallelization to deal with the hardness of mining big data

Problem Statement & Preliminaries: High Utility Pattern Mining
  HUP: What products purchased together have high profits?
    The utility of a set of products = the profits of those products in the transactions containing them, depending on quantity and price/cost
  FP: What products are frequently purchased together?

  Shopping transactions
    Tid  Items
    t1   b:1, c:2, d:1, g:1
    t2   a:4, b:1, c:3, d:1, e:1
    t3   a:4, c:2, d:1
    t4   c:2, e:1, f:1
    ...

  Utility table
    Item  Utility
    a     1
    b     2
    c
    d     5
    ...
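
The contrast between support and utility can be checked on the four transactions shown on the slide. The following is a minimal sketch, not the authors' code: the unit profits of a, b and d come from the utility table, while those of c, e, f and g are assumed values chosen only for illustration, and the function names `support` and `utility` are hypothetical.

```python
# Transactions from the slide (first four rows only): item -> purchased quantity.
transactions = {
    "t1": {"b": 1, "c": 2, "d": 1, "g": 1},
    "t2": {"a": 4, "b": 1, "c": 3, "d": 1, "e": 1},
    "t3": {"a": 4, "c": 2, "d": 1},
    "t4": {"c": 2, "e": 1, "f": 1},
}
# Unit profits: a, b, d are from the slide; c, e, f, g are ASSUMED for illustration.
unit_profit = {"a": 1, "b": 2, "d": 5, "c": 1, "e": 4, "f": 3, "g": 1}


def support(pattern):
    """Number of transactions that contain every item of the pattern."""
    return sum(1 for items in transactions.values() if pattern.issubset(items))


def utility(pattern):
    """Sum of quantity * unit profit of the pattern's items over the
    transactions that contain the whole pattern."""
    return sum(
        sum(items[i] * unit_profit[i] for i in pattern)
        for items in transactions.values()
        if pattern.issubset(items)
    )


# Support can only shrink when a pattern grows (anti-monotone).
print(support({"c"}), support({"c", "d"}))   # 4 3
# Utility can grow when a pattern grows, so it cannot be used to prune directly.
print(utility({"c"}), utility({"c", "d"}))   # 9 22
```

Under these assumed profits, {c, d} has a lower support but a higher utility than {c}, which is why mining algorithms need an anti-monotone overestimate of utility for pruning.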

Problem Statement & Preliminaries: Well-known Sequential Mining Algorithms
  Algorithm   References      Search Strategy     Candidates  Pruning Strategy
  TwoPhase    [1] KDD         Breadth (Apriori)   With        TWU
  CTU-PROL    [3] PAKDD                           With        TWU
  IHUP        [5] TKDE        Depth (FP-Growth)   With        TWU
  UPGrowth    [6] KDD, TKDE   Depth (FP-Growth)   With        TWU
  D2HUP       [7] ICDM, TKDE  Depth (OP)          Without     Tight bound
  HUI-Miner   [8] CIKM        Depth (Eclat)       Without     Tight bounds

Problem Statement & Preliminaries: Spark / MapReduce Framework
Distributed Computing Frameworks [9,10]
  Data are distributed over a cluster of servers (nodes: master and slaves)
    One split on one node
    Represented as <key, value> pairs: input, output, and interim results
  Processing by a series of jobs
    A job is dispatched to where a data split resides, and executed in parallel
    A job is defined by a mapper and a reducer, and executed in two phases
  Resilient Distributed Dataset (RDD): memory based
    Transformations / Actions on RDD
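
The <key, value> processing model described above can be imitated on a single machine. The sketch below is plain Python rather than Spark: `run_job`, `quantity_mapper` and the two-split layout are hypothetical names and data, and the "cluster" is just a list of splits processed in a loop. In PySpark the same two steps would roughly correspond to `flatMap` followed by `reduceByKey` on an RDD.

```python
from collections import defaultdict
from functools import reduce


def run_job(splits, mapper, reducer):
    """Run one map/reduce job over a list of data splits (local stand-in for a cluster)."""
    # Map phase: every record of every split is turned into (key, value) pairs.
    interim = [pair for split in splits for record in split for pair in mapper(record)]
    # Shuffle: group the interim values by key.
    groups = defaultdict(list)
    for key, value in interim:
        groups[key].append(value)
    # Reduce phase: combine the values of each key into one result.
    return {key: reduce(reducer, values) for key, values in groups.items()}


# Two hypothetical splits, as if stored on two nodes; records are (tid, {item: quantity}).
splits = [
    [("t1", {"b": 1, "c": 2, "d": 1, "g": 1}),
     ("t2", {"a": 4, "b": 1, "c": 3, "d": 1, "e": 1})],
    [("t3", {"a": 4, "c": 2, "d": 1}),
     ("t4", {"c": 2, "e": 1, "f": 1})],
]


def quantity_mapper(record):
    """Emit one <item, quantity> pair per item of the transaction."""
    _tid, items = record
    return [(item, qty) for item, qty in items.items()]


# Total purchased quantity per item, computed as a single map/reduce job.
total_quantity = run_job(splits, quantity_mapper, lambda x, y: x + y)
print(total_quantity["c"])  # 9
```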

Our Mining Approach: Breadth-First Search, Improved Utility Lists
  Breadth-first search, adapting HUI-Miner (derived from Eclat), which is depth-first
  Improved vertical data structure: UtilityList
  Ordering items, e, c, b, a, d, in ascending transaction utilities
  <{e}, UL({e})>   <{b}, UL({b})>
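
To make the vertical structure concrete, here is a minimal sketch of a utility list in the style of HUI-Miner [8], where each entry is (tid, iutil, rutil): iutil is the utility of the item in that transaction and rutil is the utility of the items that follow it in the chosen order. It does not reproduce the improved structure of the slides. The item order is the one given on the slide; dropping f and g, and the unit profits of c and e, are assumptions made for illustration, and `build_utility_lists` is a hypothetical name.

```python
# Transactions from the slide (first four rows only): item -> purchased quantity.
transactions = {
    "t1": {"b": 1, "c": 2, "d": 1, "g": 1},
    "t2": {"a": 4, "b": 1, "c": 3, "d": 1, "e": 1},
    "t3": {"a": 4, "c": 2, "d": 1},
    "t4": {"c": 2, "e": 1, "f": 1},
}
# Unit profits: a, b, d are from the slide; c and e are ASSUMED for illustration.
unit_profit = {"a": 1, "b": 2, "c": 1, "d": 5, "e": 4}
# Item order from the slide; f and g are ASSUMED to have been pruned already.
order = ["e", "c", "b", "a", "d"]


def build_utility_lists(transactions, unit_profit, order):
    """Return {item: [(tid, iutil, rutil), ...]} for the items in `order`."""
    rank = {item: r for r, item in enumerate(order)}
    lists = {item: [] for item in order}
    for tid, items in transactions.items():
        # Keep only ordered items and sort them by the global item order.
        kept = sorted((i for i in items if i in rank), key=rank.get)
        utils = [items[i] * unit_profit[i] for i in kept]
        remaining = sum(utils)
        for item, u in zip(kept, utils):
            remaining -= u  # utility of the items that come after `item`
            lists[item].append((tid, u, remaining))
    return lists


utility_lists = build_utility_lists(transactions, unit_profit, order)
print(utility_lists["e"])  # [('t2', 4, 14), ('t4', 4, 2)]
print(utility_lists["b"])  # [('t1', 2, 5), ('t2', 2, 9)]
```

Summing the iutil column of a list gives the utility of the pattern, and adding the rutil column gives an upper bound on the utility of any extension of it.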

Our Mining Approach (cont): Join Utility Lists
  Mining high utility patterns by joining UtilityLists: two k-patterns are joined into one (k+1)-pattern
  <{e,b}, UL({e,b})> + <{e,a}, UL({e,a})> → <{e,b,a}, UL({e,b,a})>
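
The join can be sketched with the construction rule of HUI-Miner [8]: entries are matched on tid, the iutil values are added (minus the shared prefix's iutil when the two patterns have a common prefix), and the rutil of the later pattern is kept. This is a simplified illustration, not the authors' improved join. The input lists are hypothetical values that would result from the slide's first four transactions under assumed unit profits (e = 4, b = 2, a = 1), and `join` is a hypothetical name.

```python
def join(ul_px, ul_py, ul_p=None):
    """Join the utility lists of patterns Px and Py (x ordered before y).

    Each list holds (tid, iutil, rutil) entries. `ul_p` is the utility list of
    the shared prefix P; it is omitted when Px and Py are single items.
    """
    py = {tid: (iutil, rutil) for tid, iutil, rutil in ul_py}
    prefix = {tid: iutil for tid, iutil, _ in ul_p} if ul_p else {}
    joined = []
    for tid, iutil, _ in ul_px:
        if tid in py:
            # The prefix utility is counted in both inputs, so subtract it once.
            joined.append((tid, iutil + py[tid][0] - prefix.get(tid, 0), py[tid][1]))
    return joined


# Hypothetical 1-item utility lists (values derived under ASSUMED unit profits).
ul_e = [("t2", 4, 14), ("t4", 4, 2)]
ul_b = [("t1", 2, 5), ("t2", 2, 9)]
ul_a = [("t2", 4, 5), ("t3", 4, 5)]

ul_eb = join(ul_e, ul_b)           # two 1-patterns -> one 2-pattern
ul_ea = join(ul_e, ul_a)
ul_eba = join(ul_eb, ul_ea, ul_e)  # two 2-patterns sharing prefix {e} -> one 3-pattern

print(ul_eb)   # [('t2', 6, 9)]
print(ul_ea)   # [('t2', 8, 5)]
print(ul_eba)  # [('t2', 10, 5)]
```

The utility of {e,b,a} is the sum of the iutil column (10 here), obtained without rescanning the transactions.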

New Parallel Algorithm Based on Spark: Three Phases
Phps: Parallel high utility pattern mining based on Spark
  Phase I:   <i, (u(i,tid), u(t,tid))> → <i, twu(i)>
  Phase II:  <i, (tid, iutil, rutil)> → <i, List(tid, iutil, rutil, piutil)> → <i, (List(,,,), iutilSum, rutilSum)>
  Phase III: <Pk-1, UL(Pk-1)> → <Pk-2, (Pk-1, UL(Pk-1))> → <Pk, List(,,,)> → <Pk, UL(Pk)>
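
Phase I can be read as one map/reduce step over <key, value> pairs: the map side emits <i, (u(i,tid), u(t,tid))> for each item of each transaction, and the reduce side sums the transaction utilities per item to obtain twu(i). The sketch below runs this locally in plain Python and is only one interpretation of the slide, not the authors' Spark code. The unit profits of c, e, f and g and the threshold `min_util` are assumed values, and the function names are hypothetical; because only four transactions are available, the resulting item order need not match the one on the earlier slide.

```python
from collections import defaultdict

# Transactions from the slide (first four rows only): item -> purchased quantity.
transactions = {
    "t1": {"b": 1, "c": 2, "d": 1, "g": 1},
    "t2": {"a": 4, "b": 1, "c": 3, "d": 1, "e": 1},
    "t3": {"a": 4, "c": 2, "d": 1},
    "t4": {"c": 2, "e": 1, "f": 1},
}
# Unit profits: a, b, d are from the slide; c, e, f, g are ASSUMED for illustration.
unit_profit = {"a": 1, "b": 2, "d": 5, "c": 1, "e": 4, "f": 3, "g": 1}


def phase_one_map(items):
    """Emit <i, (u(i,tid), u(t,tid))> for every item i of one transaction."""
    item_utils = {i: qty * unit_profit[i] for i, qty in items.items()}
    transaction_utility = sum(item_utils.values())
    return [(i, (u, transaction_utility)) for i, u in item_utils.items()]


def phase_one_reduce(pairs):
    """Sum the transaction utilities per item, giving <i, twu(i)>."""
    twu = defaultdict(int)
    for item, (_item_utility, transaction_utility) in pairs:
        twu[item] += transaction_utility
    return dict(twu)


pairs = [p for items in transactions.values() for p in phase_one_map(items)]
twu = phase_one_reduce(pairs)

min_util = 20  # hypothetical threshold
# Items whose TWU is below the threshold cannot appear in any high utility pattern.
promising = sorted((i for i in twu if twu[i] >= min_util), key=lambda i: twu[i])
print(twu)        # {'b': 28, 'c': 48, 'd': 39, 'g': 10, 'a': 29, 'e': 27, 'f': 9}
print(promising)  # ['e', 'b', 'a', 'd', 'c']
```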

Experimental Evaluation
  2 algorithms
    Phps: our algorithm
    PhpMR: the competitor
  4 datasets
    Dataset        #Items   #Trans.    Avg. Trans. Len.
    Chess          76       3,196      37
    WebView-1      497      59,602     2.5
    T10DI6N1KD1M   1,000    933,493    10
    Chainstore     46,086   1,112,949  7.2

Experimental Evaluation: Running time with changing minUtil (figure)

Experimental Evaluation: Running time with each iteration (figure)

Conclusion and Future Work
  Conclusion
    Phps: a parallel Eclat-like algorithm based on Spark
    An improved vertical data structure
    A three-phase parallel mining framework
    An efficient algorithm
  Future Work
    Hybrid search: BF + DF
    More pruning in Phase I (filtering irrelevant items)
    Algorithms parallelizing D2HUP
    Algorithms on new parallel programming frameworks

References
[1] Y. Liu, W. Liao, and A. Choudhary. A fast high utility itemsets mining algorithm. In Proceedings of the Utility-Based Data Mining Workshop in conjunction with the 11th ACM SIGKDD, 2005, p253-262.
[2] Y.-C. Li, J.-S. Yeh, and C.-C. Chang. Isolated items discarding strategy for discovering high utility itemsets. Data & Knowledge Engineering, 2008, 64(1): 198-217.
[3] A. Erwin, R. P. Gopalan, and N. R. Achuthan. Efficient mining of high utility itemsets from large datasets. In Proceedings of PAKDD 2008, 2008, p554-561.
[4] J. W. Han, J. Pei, Y. W. Yin, et al. Mining frequent patterns without candidate generation. In Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data, 2000, p1-12.
[5] C. F. Ahmed, S. K. Tanbeer, B.-S. Jeong, et al. Efficient tree structures for high utility pattern mining in incremental databases. IEEE Transactions on Knowledge and Data Engineering, 2009, p1708-1721.
[6] V. S. Tseng, C.-W. Wu, B.-E. Shie, et al. UP-Growth: an efficient algorithm for high utility itemset mining. In Proceedings of the 16th ACM SIGKDD, 2010, p253-262.
[7] J. Liu, K. Wang, and B. Fung. Direct discovery of high utility itemsets without candidate generation. In IEEE 12th International Conference on Data Mining, 2012, p101-109.
[8] M. Liu, J. Qu. Mining high utility itemsets without candidate generation. In Proceedings of CIKM 2012, 2012, p55-64.
[9] M. Zaharia. An architecture for fast and general data processing on large clusters. Technical Report No. UCB/EECS-2014-12, University of California at Berkeley.
[10] J. Dean, S. Ghemawat. MapReduce: Simplified data processing on large clusters. In OSDI, 2004, p137-150.

Thank You! Questions?

IEEE DSC 2019 - IEEE International Conference on Data Science in Cyberspace
BDMC 2019 - Big Data Mining for Cyberspace
23 June 2019, 8:30 - 9:30
Workshop Chairs: Zhaoquan Gu and Jing Qiu
http://www.ieee-dsc.org/2019/