Efficient Parallel Algorithm for Mining High Utility Patterns Based on Spark
Junqiang Liu, Rong Zhao, Xiangcai Yang, Yong Zhang, Xiaoning Jiang
School of Information and Electronic Engineering, Zhejiang Gongshang University, Hangzhou 310018, China
23 June 2019

Contents
- Motivation
- Problem Statement & Preliminaries
  - High Utility Pattern Mining, Sequential Algorithms, Frameworks
- Our Mining Approach
- New Parallel Algorithm Based on Spark
- Experimental Evaluation
- Conclusion and Future Work
- References

Motivation
- High utility pattern (HUP) mining vs. frequent pattern (FP) mining
  - HUP: utility = user's interest + statistical significance
  - FP: support = statistical significance only
- HUP mining is much harder than FP mining
  - Anti-monotonicity holds for FP: support of a pattern ≤ support of its sub-pattern
  - Anti-monotonicity does not hold for HUP: the utility of a pattern can be smaller than, equal to, or larger than the utility of its sub-pattern
- Parallelization helps deal with this hardness when mining big data

Problem Statement & Preliminaries: High Utility Pattern Mining
- HUP: What products purchased together have high profits?
  - The utility of a set of products is the profit those products generate in the transactions that contain all of them; it depends on quantity and on price/cost
- FP: What products are frequently purchased together?

Shopping transactions (item:quantity)
Tid  Items
t1   b:1, c:2, d:1, g:1
t2   a:4, b:1, c:3, d:1, e:1
t3   a:4, c:2, d:1
t4   c:2, e:1, f:1
...

Utility table (I = item, U = utility)
I  U
a  1
b  2
c
d  5
...

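The definition of utility can be made concrete with a minimal sketch. It uses only the four transactions shown and reads the utility table as a = 1, b = 2, d = 5, which is an assumption about how the flattened table should be parsed. The function names `covering`, `support`, and `utility` are illustrative and do not come from the slides.

```python
# Transactions t1..t4 as shown on the slide: item -> purchased quantity.
TRANSACTIONS = {
    "t1": {"b": 1, "c": 2, "d": 1, "g": 1},
    "t2": {"a": 4, "b": 1, "c": 3, "d": 1, "e": 1},
    "t3": {"a": 4, "c": 2, "d": 1},
    "t4": {"c": 2, "e": 1, "f": 1},
}

# Assumption: the utility table is read as a=1, b=2, d=5.
# Only items used in the patterns below need a value here.
UNIT_UTILITY = {"a": 1, "b": 2, "d": 5}


def covering(pattern):
    """Return the ids of the transactions that contain every item of the pattern."""
    return [tid for tid, items in TRANSACTIONS.items()
            if set(pattern).issubset(items)]


def support(pattern):
    """Number of transactions containing the pattern (the FP measure)."""
    return len(covering(pattern))


def utility(pattern):
    """Sum of quantity * unit utility over the pattern's items in covering transactions."""
    return sum(TRANSACTIONS[tid][item] * UNIT_UTILITY[item]
               for tid in covering(pattern)
               for item in pattern)


for p in [{"a"}, {"a", "d"}, {"a", "b", "d"}]:
    print(sorted(p), "support =", support(p), "utility =", utility(p))
```

On these four rows, support never increases as the pattern grows (2, 2, 1), while utility goes up and then down (8, 18, 11), which is the missing anti-monotonicity that makes HUP mining harder than FP mining.
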
Problem Statement & Preliminaries: Well-known Sequential Mining Algorithms

Algorithm   References       Search Strategy     Candidates   Pruning Strategy
TwoPhase    [1] KDD          Breadth (Apriori)   With         TWU
CTU-PROL    [3] PAKDD                            With         TWU
IHUP        [5] TKDE         Depth (FP-Growth)   With         TWU
UPGrowth    [6] KDD, TKDE    Depth (FP-Growth)   With         TWU
D2HUP       [7] ICDM, TKDE   Depth (OP)          Without      Tight bound
HUI-Miner   [8] CIKM         Depth (Eclat)       Without      Tight bounds

Problem Statement & Preliminaries: Distributed Computing Frameworks (Spark / MapReduce) [9,10]
- Data are distributed over a cluster
  - One split on one node
  - Represented as <key, value> pairs: input, output, and interim results
- Processing is done by a series of jobs
  - A job is dispatched to where a data split resides and is executed in parallel
  - A job is defined by a mapper and a reducer and is executed in two phases
- Resilient Distributed Dataset (RDD): memory based
  - Transformations / actions on RDDs
[Figure: a master and slaves forming a cluster of servers (nodes)]

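The mapper/reducer model over <key, value> pairs can be sketched in plain Python. This is a single-process stand-in written for illustration, not the Spark or Hadoop API; `run_job`, `emit_items`, `count`, and the way the four transactions are divided into two splits are all assumptions.

```python
from collections import defaultdict


def run_job(splits, mapper, reducer):
    """Run one map/shuffle/reduce job over a list of data splits."""
    # Map phase: each split is processed on its own
    # (on a real cluster, on the node that holds the split).
    mapped = [[pair for record in split for pair in mapper(record)]
              for split in splits]
    # Shuffle: bring together all values that share a key.
    grouped = defaultdict(list)
    for pairs in mapped:
        for key, value in pairs:
            grouped[key].append(value)
    # Reduce phase: one reducer call per key.
    return {key: reducer(key, values) for key, values in grouped.items()}


# Two hypothetical splits holding (tid, items) records.
SPLITS = [
    [("t1", ["b", "c", "d", "g"]), ("t2", ["a", "b", "c", "d", "e"])],
    [("t3", ["a", "c", "d"]), ("t4", ["c", "e", "f"])],
]


def emit_items(record):
    """Mapper: emit <item, 1> for every item of a transaction."""
    _tid, items = record
    return [(item, 1) for item in items]


def count(_key, values):
    """Reducer: add up the values collected for one key."""
    return sum(values)


item_support = run_job(SPLITS, emit_items, count)
print(item_support)
```

In Spark the same shape is usually written as transformations on an RDD, with the interim results kept in memory rather than written out between the two phases.
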
Our Mining Approach: Breadth-First Search, Improved Utility Lists
- Breadth-first search, adapting HUI-Miner (derived from Eclat), which is depth-first
- Improved vertical data structure: UtilityList
- Items are ordered e, c, b, a, d, in ascending transaction utilities
[Figure: {e} with UL({e}); {b} with UL({b})]

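A minimal sketch of how a vertical utility-list structure can be built, following the basic (tid, iutil, rutil) layout of HUI-Miner [8] rather than the improved UtilityList of this talk, whose extra fields are not spelled out here. The item order is the one stated on the slide; the unit utilities of c and e are hypothetical placeholders, and items outside the order (f, g) are treated as already filtered out.

```python
# Transactions t1..t4 from the example: item -> purchased quantity.
TRANSACTIONS = {
    "t1": {"b": 1, "c": 2, "d": 1, "g": 1},
    "t2": {"a": 4, "b": 1, "c": 3, "d": 1, "e": 1},
    "t3": {"a": 4, "c": 2, "d": 1},
    "t4": {"c": 2, "e": 1, "f": 1},
}

# a, b, d as read from the utility table; c and e are HYPOTHETICAL values.
UNIT_UTILITY = {"a": 1, "b": 2, "c": 1, "d": 5, "e": 4}

# Item order stated on the slide.
ORDER = ["e", "c", "b", "a", "d"]


def build_utility_lists(transactions, unit_utility, order):
    """Return item -> list of (tid, iutil, rutil) entries.

    iutil is the item's utility in the transaction; rutil is the total
    utility of the items that come after it in the given order.
    """
    rank = {item: pos for pos, item in enumerate(order)}
    lists = {item: [] for item in order}
    for tid, items in transactions.items():
        # Keep only ordered items and sort them by the global order.
        kept = sorted((i for i in items if i in rank), key=rank.__getitem__)
        utils = [items[i] * unit_utility[i] for i in kept]
        remaining = sum(utils)
        for item, iutil in zip(kept, utils):
            remaining -= iutil
            lists[item].append((tid, iutil, remaining))
    return lists


utility_lists = build_utility_lists(TRANSACTIONS, UNIT_UTILITY, ORDER)
for item in ORDER:
    print(item, utility_lists[item])
```

Each list is indexed by item rather than by transaction, which is what makes the later joins a matter of intersecting transaction ids.
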
Our Mining Approach (cont): Join Utility Lists
- Mining high utility patterns by joining UtilityLists
  - Two k-patterns are joined into one (k+1)-pattern
  - Example: {e,b} with UL({e,b}) and {e,a} with UL({e,a}) give {e,b,a} with UL({e,b,a})
[Figure: UL({e,b}) and UL({e,a}) joined into UL({e,b,a})]

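The join step can be sketched as follows, using the construction rule of HUI-Miner [8] as commonly described; whether the improved UtilityList of this talk joins in exactly this way is an assumption. The three input lists are hypothetical example values for UL({e}), UL({b}), and UL({a}), with entries of the form (tid, iutil, rutil).

```python
def join(ul_px, ul_py, ul_p=None):
    """Join the utility lists of patterns Px and Py that share the prefix P.

    Entries are (tid, iutil, rutil). ul_p is the utility list of the shared
    prefix P, or None when Px and Py are single items (empty prefix).
    """
    py_by_tid = {tid: (iutil, rutil) for tid, iutil, rutil in ul_py}
    p_by_tid = {tid: iutil for tid, iutil, _ in ul_p} if ul_p is not None else {}
    joined = []
    for tid, iutil_x, _ in ul_px:
        if tid in py_by_tid:  # only transactions containing both patterns
            iutil_y, rutil_y = py_by_tid[tid]
            # The prefix utility is counted in both inputs, so subtract it once.
            iutil = iutil_x + iutil_y - p_by_tid.get(tid, 0)
            joined.append((tid, iutil, rutil_y))
    return joined


def upper_bound(ul):
    """Sum of iutil + rutil: no extension of the pattern can exceed this."""
    return sum(iutil + rutil for _, iutil, rutil in ul)


# HYPOTHETICAL utility lists for single items, in the order e, b, a.
UL_E = [("t2", 4, 14), ("t4", 4, 2)]
UL_B = [("t1", 2, 5), ("t2", 2, 9)]
UL_A = [("t2", 4, 5), ("t3", 4, 5)]

ul_eb = join(UL_E, UL_B)               # two 1-patterns -> 2-pattern {e,b}
ul_ea = join(UL_E, UL_A)               # two 1-patterns -> 2-pattern {e,a}
ul_eba = join(ul_eb, ul_ea, ul_p=UL_E)  # two 2-patterns -> 3-pattern {e,b,a}
print(ul_eb, ul_ea, ul_eba, upper_bound(ul_eba))
```

The sum of the iutil fields gives the exact utility of the joined pattern, and `upper_bound` is the kind of tight bound that lets a search drop a pattern and all its extensions when the bound falls below minUtil.
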
New Parallel Algorithm Based on Spark: Phps (Parallel high utility pattern mining based on Spark)
- Three phases, with the following key-value pairs in each phase
  - Phase I: (i, (u(i,tid), u(t,tid))); (i, twu(i))
  - Phase II: (i, (tid, iutil, rutil)); (i, List(tid, iutil, rutil, piutil)); (i, (List(,,,), iutilSum, rutilSum))
  - Phase III: (Pk-1, UL(Pk-1)); (Pk-2, (Pk-1, UL(Pk-1))); (Pk, List(,,,)); (Pk, UL(Pk))

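Two of the keying ideas in the phases can be sketched in plain Python: summing transaction utilities per item to get twu(i), and grouping (k-1)-patterns by their (k-2)-prefix so that patterns with the same key can be paired into k-patterns. This is a single-process illustration, not the Phps implementation; the function names are invented, the unit utilities of c, e, f, and g are hypothetical, and only four transactions are used, so the resulting TWU values are for this toy input only.

```python
from collections import defaultdict
from itertools import combinations

# Four example transactions: item -> purchased quantity.
TRANSACTIONS = {
    "t1": {"b": 1, "c": 2, "d": 1, "g": 1},
    "t2": {"a": 4, "b": 1, "c": 3, "d": 1, "e": 1},
    "t3": {"a": 4, "c": 2, "d": 1},
    "t4": {"c": 2, "e": 1, "f": 1},
}

# a, b, d as read from the utility table; c, e, f, g are HYPOTHETICAL values.
UNIT_UTILITY = {"a": 1, "b": 2, "c": 1, "d": 5, "e": 4, "f": 3, "g": 1}


def phase1_twu(transactions, unit_utility):
    """Emit (item, transaction utility) pairs, then sum them per item."""
    pairs = []
    for items in transactions.values():
        tu = sum(qty * unit_utility[i] for i, qty in items.items())
        pairs.extend((item, tu) for item in items)  # map side
    twu = defaultdict(int)
    for item, tu in pairs:                          # reduce side
        twu[item] += tu
    return dict(twu)


def phase3_candidates(patterns):
    """Key each (k-1)-pattern by its (k-2)-prefix and pair patterns per key.

    Patterns are tuples of items, listed in the global item order.
    """
    groups = defaultdict(list)
    for pattern in patterns:
        groups[pattern[:-1]].append(pattern)
    candidates = []
    for prefix, members in groups.items():
        for px, py in combinations(members, 2):
            candidates.append(prefix + (px[-1], py[-1]))
    return candidates


twu = phase1_twu(TRANSACTIONS, UNIT_UTILITY)
print(twu)
print(phase3_candidates([("e", "c"), ("e", "b"), ("e", "a"), ("c", "b")]))
```

In Spark terms, the per-item sum corresponds to a reduce-by-key over the emitted pairs, and the prefix grouping corresponds to re-keying the previous level's output before the join.
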
Experimental Evaluation
- 2 algorithms
  - Phps: our algorithm
  - PhpMR: the competitor
- 4 datasets

Dataset        #Items   #Trans.     Avg. trans. length
Chess          76       3,196       37
WebView-1      497      59,602      2.5
T10DI6N1KD1M   1,000    933,493     10
Chainstore     46,086   1,112,949   7.2

Experimental Evaluation: Running time with changing minUtil
[Figure: running time with changing minUtil]

Experimental Evaluation: Running time with each iteration
[Figure: running time with each iteration]

Conclusion and Future Work
- Conclusion
  - Phps: a parallel Eclat-like algorithm based on Spark
  - An improved vertical data structure
  - A three-phase parallel mining framework
  - An efficient algorithm
- Future Work
  - Hybrid search: BF + DF
  - More pruning in Phase I (filtering irrelevant items)
  - Algorithms parallelizing D2HUP
  - Algorithms on new parallel programming frameworks

References
[1] Y. Liu, W. Liao, and A. Choudhary. A fast high utility itemsets mining algorithm. In Proceedings of the Utility-Based Data Mining Workshop in conjunction with the 11th ACM SIGKDD, 2005, p253-262.
[2] Y.-C. Li, J.-S. Yeh, and C.-C. Chang. Isolated items discarding strategy for discovering high utility itemsets. Data & Knowledge Engineering, 2008, 64(1): 198-217.
[3] A. Erwin, R. P. Gopalan, and N. R. Achuthan. Efficient mining of high utility itemsets from large datasets. In Proceedings of PAKDD 2008, 2008, p554-561.
[4] J. W. Han, J. Pei, Y. W. Yin, et al. Mining frequent patterns without candidate generation. In Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data, 2000, p1-12.
[5] C. F. Ahmed, S. K. Tanbeer, B.-S. Jeong, et al. Efficient tree structures for high utility pattern mining in incremental databases. IEEE Transactions on Knowledge and Data Engineering, 2009, p1708-1721.
[6] V. S. Tseng, C.-W. Wu, B.-E. Shie, et al. UP-Growth: an efficient algorithm for high utility itemset mining. In Proceedings of the 16th ACM SIGKDD, 2010, p253-262.
[7] J. Liu, K. Wang, and B. Fung. Direct discovery of high utility itemsets without candidate generation. In IEEE 12th International Conference on Data Mining, 2012, p101-109.
[8] M. Liu, J. Qu. Mining high utility itemsets without candidate generation. In Proceedings of CIKM 2012, 2012, p55-64.
[9] Matei Zaharia. An architecture for fast and general data processing on large clusters. Technical Report No. UCB/EECS-2014-12, University of California at Berkeley.
[10] Jeffrey Dean, Sanjay Ghemawat. MapReduce: Simplified data processing on large clusters. In OSDI, 2004, p137-150.

Thank you! Questions?

IEEE DSC 2019 - IEEE International Conference on Data Science in Cyberspace
BDMC 2019 - Big Data Mining for Cyberspace
23 June 2019, 8:30 - 9:30
Workshop Chair: Zhaoquan Gu and Jing Qiu
http://www.ieee-dsc.org/2019/