Junqiang Liu, Rong Zhao, Xiangcai Yang, Yong Zhang, Xiaoning Jiang


1 Efficient Parallel Algorithm for Mining High Utility Patterns Based on Spark
Junqiang Liu, Rong Zhao, Xiangcai Yang, Yong Zhang, Xiaoning Jiang
School of Information and Electronic Engineering, Zhejiang Gongshang University, Hangzhou, China
23 June 2019

2 Content
Motivation
Problem Statement & Preliminaries
  High Utility Pattern Mining, Sequential Algorithms, Frameworks
Our Mining Approach
New Parallel Algorithm Based on Spark
Experimental Evaluation
Conclusion and Future Work
References

3 Motivation
High Utility Pattern (HUP) Mining vs Frequent Pattern (FP) Mining
  Utility = user's interest + statistical significance → HUP
  Support = statistical significance only → FP
HUP mining is much harder than FP mining
  Anti-monotonicity is satisfied for FP: support of a pattern ≤ support of its sub-pattern
  Anti-monotonicity is not satisfied for HUP: utility of a pattern ≤ ? utility of its sub-pattern
Parallelization to deal with the hardness of mining big data
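The contrast between support and utility can be checked on a toy database. The sketch below is only an illustration: the transactions are the four shown in the shopping example of this deck, the unit utilities of a, b and d are the ones shown there, and the unit utilities of c, e, f and g are assumed values. The function names are hypothetical.

```python
# Toy database: transaction id -> {item: purchased quantity}.
TRANSACTIONS = {
    "t1": {"b": 1, "c": 2, "d": 1, "g": 1},
    "t2": {"a": 4, "b": 1, "c": 3, "d": 1, "e": 1},
    "t3": {"a": 4, "c": 2, "d": 1},
    "t4": {"c": 2, "e": 1, "f": 1},
}
# Unit utility (profit per unit). The values for c, e, f and g are ASSUMED.
UNIT_UTILITY = {"a": 1, "b": 2, "c": 1, "d": 5, "e": 4, "f": 3, "g": 1}


def support(pattern):
    """Number of transactions that contain every item of the pattern."""
    return sum(1 for items in TRANSACTIONS.values() if set(pattern).issubset(items))


def utility(pattern):
    """Sum of quantity * unit utility over the transactions containing the pattern."""
    return sum(
        sum(items[i] * UNIT_UTILITY[i] for i in pattern)
        for items in TRANSACTIONS.values()
        if set(pattern).issubset(items)
    )


# Support never grows when a pattern is extended ...
print(support({"c"}), support({"c", "d"}))
# ... but utility can grow ({c} -> {c, d}) or shrink ({a} -> {a, b}).
print(utility({"c"}), utility({"c", "d"}))
print(utility({"a"}), utility({"a", "b"}))
```

With these assumed values, extending {c} to {c, d} raises the utility while extending {a} to {a, b} lowers it, so a pattern cannot be pruned just because its sub-pattern has low utility.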

4 High Utility Pattern Mining
Problem Statement & Preliminaries
High Utility Pattern Mining: What products purchased together have high profits?
  The utility of a set of products = the profits of the products in the transactions containing them, depending on quantity and price/cost
FP: What products are frequently purchased together?

Shopping Transactions
Tid   Items
t1    b:1, c:2, d:1, g:1
t2    a:4, b:1, c:3, d:1, e:1
t3    a:4, c:2, d:1
t4    c:2, e:1, f:1
...

Utility table
Item  Utility
a     1
b     2
c
d     5
...
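As a baseline for the problem statement, high utility patterns can be found by brute force on a small example. This is a minimal sketch, not the authors' method: it enumerates every itemset, which is only feasible for a handful of items. The unit utilities of c, e, f and g are assumed, and `brute_force_hups` is a hypothetical name.

```python
from itertools import combinations

# Transactions from the shopping example: tid -> {item: quantity}.
TRANSACTIONS = {
    "t1": {"b": 1, "c": 2, "d": 1, "g": 1},
    "t2": {"a": 4, "b": 1, "c": 3, "d": 1, "e": 1},
    "t3": {"a": 4, "c": 2, "d": 1},
    "t4": {"c": 2, "e": 1, "f": 1},
}
# Unit utilities; the values for c, e, f and g are ASSUMED.
UNIT_UTILITY = {"a": 1, "b": 2, "c": 1, "d": 5, "e": 4, "f": 3, "g": 1}


def utility_in_transaction(pattern, items):
    """u(X, t): 0 if t does not contain X, else the sum of quantity * unit utility."""
    if not set(pattern).issubset(items):
        return 0
    return sum(items[i] * UNIT_UTILITY[i] for i in pattern)


def utility(pattern):
    """u(X): utility of X summed over all transactions."""
    return sum(utility_in_transaction(pattern, items) for items in TRANSACTIONS.values())


def brute_force_hups(min_util):
    """Return {pattern: utility} for every itemset whose utility is >= min_util."""
    all_items = sorted(UNIT_UTILITY)
    result = {}
    for size in range(1, len(all_items) + 1):
        for pattern in combinations(all_items, size):
            u = utility(pattern)
            if u >= min_util:
                result[pattern] = u
    return result


print(brute_force_hups(20))
```

The exponential number of candidate itemsets is what the pruning strategies and the parallelization in the rest of the deck are meant to avoid.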

5 Well-known Sequential Mining Algorithms
Problem Statement & Preliminaries
Algorithm   References       Search Strategy      Candidates   Pruning Strategy
TwoPhase    [1] KDD          Breadth (Apriori)    With         TWU
CTU-PROL    [3] PAKDD
IHUP        [5] TKDE         Depth (FP-Growth)
UPGrowth    [6] KDD, TKDE
D2HUP       [7] ICDM, TKDE   Depth (OP)           Without      Tight bound
HUI-Miner   [8] CIKM         Depth (Eclat)                     Tight bounds
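TWU (transaction-weighted utility) pruning, listed in the table above, can be sketched as follows. The TWU of a pattern is the sum of the utilities of the whole transactions that contain it, so it is an upper bound on the pattern's utility and, unlike utility, it is anti-monotone. The unit utilities of c, e, f and g are assumed, and the function names are hypothetical.

```python
# Transactions from the shopping example: tid -> {item: quantity}.
TRANSACTIONS = {
    "t1": {"b": 1, "c": 2, "d": 1, "g": 1},
    "t2": {"a": 4, "b": 1, "c": 3, "d": 1, "e": 1},
    "t3": {"a": 4, "c": 2, "d": 1},
    "t4": {"c": 2, "e": 1, "f": 1},
}
# Unit utilities; the values for c, e, f and g are ASSUMED.
UNIT_UTILITY = {"a": 1, "b": 2, "c": 1, "d": 5, "e": 4, "f": 3, "g": 1}


def transaction_utility(items):
    """Utility of a whole transaction: sum of quantity * unit utility."""
    return sum(qty * UNIT_UTILITY[item] for item, qty in items.items())


def twu(pattern):
    """Sum of the utilities of all transactions that contain the pattern."""
    return sum(
        transaction_utility(items)
        for items in TRANSACTIONS.values()
        if set(pattern).issubset(items)
    )


def promising_items(min_util):
    """Items whose TWU reaches min_util; no other item can occur in a high utility pattern."""
    return sorted(item for item in UNIT_UTILITY if twu({item}) >= min_util)


print({item: twu({item}) for item in sorted(UNIT_UTILITY)})
print(promising_items(20))
```

With a threshold of 20 and these assumed values, f and g are discarded before any pattern containing them is considered.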

6 Distributed Computing Frameworks [9,10]
Problem Statement & Preliminaries
Spark / MapReduce Framework
Data are distributed over a cluster
  One split on one node
  Represented as <key, value> pairs: input, output, and interim results
Processing by a series of jobs
  A job is dispatched to where a data split resides, and executed in parallel
  A job is defined by a mapper and a reducer, and executed in two phases
Resilient Distributed Dataset (RDD): memory based
  Transformations / Actions on RDD
[Figure: cluster of servers (nodes), with a master and slaves]
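The job model described above can be imitated in a single process to make the data flow concrete. The sketch below is not Spark or Hadoop code: `run_job`, `item_mapper` and `count_reducer` are hypothetical names, and the two splits stand in for data held on two nodes. It counts in how many transactions each item occurs.

```python
from collections import defaultdict


def run_job(splits, mapper, reducer):
    """Emulate one map/reduce job over data splits, sequentially in one process."""
    # Map phase: each split is processed independently, as if on its own node.
    interim = []
    for split in splits:
        for record in split:
            interim.extend(mapper(record))
    # Shuffle: group the interim <key, value> pairs by key.
    groups = defaultdict(list)
    for key, value in interim:
        groups[key].append(value)
    # Reduce phase: one reducer call per key.
    return {key: reducer(key, values) for key, values in groups.items()}


def item_mapper(record):
    """<tid, items> -> one <item, 1> pair per item in the transaction."""
    _tid, items = record
    return [(item, 1) for item in items]


def count_reducer(_key, values):
    """<item, [1, 1, ...]> -> number of transactions containing the item."""
    return sum(values)


# Two splits, standing in for data distributed over two nodes.
SPLITS = [
    [("t1", ["b", "c", "d", "g"]), ("t2", ["a", "b", "c", "d", "e"])],
    [("t3", ["a", "c", "d"]), ("t4", ["c", "e", "f"])],
]

print(run_job(SPLITS, item_mapper, count_reducer))
```

In PySpark the same flow would roughly correspond to the `flatMap` and `reduceByKey` transformations on an RDD, followed by an action such as `collect`.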

7 Breadth-First Search, Improved Utility Lists
Our Mining Approach
Breadth-First Search
  adapting HUI-Miner, which is derived from Eclat and is Depth-First
Improved vertical data structure: UtilityList
  Ordering items, e, c, b, a, d, in ascending order of transaction utility
  <{e}, UL({e})>  <{b}, UL({b})>
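A utility list in the style of HUI-Miner [8] can be built as sketched below. Each entry is (tid, iutil, rutil): iutil is the utility of the item in that transaction, and rutil is the utility of the items that follow it in the chosen order. This sketch shows the basic structure only, not the improved UtilityList of the slides. The item order e, c, b, a, d is the one given above, items outside that order are treated as already pruned, and the unit utilities of c and e are assumed.

```python
# Transactions from the shopping example: tid -> {item: quantity}.
TRANSACTIONS = {
    "t1": {"b": 1, "c": 2, "d": 1, "g": 1},
    "t2": {"a": 4, "b": 1, "c": 3, "d": 1, "e": 1},
    "t3": {"a": 4, "c": 2, "d": 1},
    "t4": {"c": 2, "e": 1, "f": 1},
}
# Unit utilities; the values for c and e are ASSUMED.
UNIT_UTILITY = {"a": 1, "b": 2, "c": 1, "d": 5, "e": 4}
# Processing order of the items, as listed on the slide.
ORDER = ["e", "c", "b", "a", "d"]


def build_utility_lists(transactions, unit_utility, order):
    """Return {item: [(tid, iutil, rutil), ...]} for the items in `order`."""
    rank = {item: position for position, item in enumerate(order)}
    lists = {item: [] for item in order}
    for tid, items in transactions.items():
        # Keep only the ordered items and sort them by the processing order.
        kept = sorted((i for i in items if i in rank), key=rank.get)
        utils = [items[i] * unit_utility[i] for i in kept]
        remaining = sum(utils)
        for item, iutil in zip(kept, utils):
            remaining -= iutil  # utility of the items after `item` in this transaction
            lists[item].append((tid, iutil, remaining))
    return lists


for item, ul in build_utility_lists(TRANSACTIONS, UNIT_UTILITY, ORDER).items():
    print(item, ul)
```

Summing iutil over a list gives the utility of the item, and summing iutil + rutil gives an upper bound on the utility of any extension of it with later items.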

8 Our Mining Approach (cont)
Join Utility Lists
Mining high utility patterns by joining UtilityLists
  two k-patterns → (k+1)-pattern
  <{e,b}, UL({e,b})>  <{e,a}, UL({e,a})>  →  <{e,b,a}, UL({e,b,a})>
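The join of two utility lists can be sketched with the construction rule of HUI-Miner [8]: for patterns Px and Py sharing the prefix P, the entry of Pxy in a common transaction has iutil = iutil(Px) + iutil(Py) - iutil(P) and rutil = rutil(Py). This is a sketch of that standard rule, not of the improved structure in the slides. The three input lists are assumed example values, obtained from the shopping transactions with assumed unit utilities for c and e.

```python
# Utility lists as [(tid, iutil, rutil), ...]; example values under ASSUMED unit utilities.
UL_E = [("t2", 4, 14), ("t4", 4, 2)]
UL_B = [("t1", 2, 5), ("t2", 2, 9)]
UL_A = [("t2", 4, 5), ("t3", 4, 5)]


def join(ul_px, ul_py, ul_p=None):
    """Join UL(Px) and UL(Py) into UL(Pxy); ul_p is UL(P), or None when P is empty."""
    py_by_tid = {tid: (iutil, rutil) for tid, iutil, rutil in ul_py}
    p_by_tid = {tid: iutil for tid, iutil, _ in ul_p} if ul_p else {}
    joined = []
    for tid, iutil_x, _ in ul_px:
        if tid in py_by_tid:  # only transactions containing both Px and Py
            iutil_y, rutil_y = py_by_tid[tid]
            # The prefix utility is counted in both operands, so subtract it once.
            joined.append((tid, iutil_x + iutil_y - p_by_tid.get(tid, 0), rutil_y))
    return joined


ul_eb = join(UL_E, UL_B)            # two 1-patterns -> 2-pattern {e, b}
ul_ea = join(UL_E, UL_A)            # two 1-patterns -> 2-pattern {e, a}
ul_eba = join(ul_eb, ul_ea, UL_E)   # two 2-patterns with prefix {e} -> {e, b, a}
print(ul_eb, ul_ea, ul_eba)
```

No scan of the original transactions is needed for the join, which is what makes the vertical layout attractive for level-wise mining.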

9 Phps: Parallel high utility pattern mining based on Spark
New Parallel Algorithm Based on Spark
Three phases
Phase I:    <i, (u(i,tid), u(t,tid))>  <i, twu(i)>
Phase II:   <i, (tid, iutil, rutil)>  <i, List(tid, iutil, rutil, piutil)>  <i, (List(,,,), iutilSum, rutilSum)>
Phase III:  <Pk, UL(Pk)>  <Pk, List(,,,)>  <Pk-2, (Pk-1, UL(Pk-1))>  <Pk-1, UL(Pk-1)>
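The level-wise idea behind the third phase can be imitated in plain Python: each (k-1)-pattern is keyed by its (k-2)-prefix, patterns sharing a key are grouped, and pairs within a group are joined into k-patterns. This is a single-process sketch under assumptions, not the Phps implementation: it uses the basic (tid, iutil, rutil) lists and the HUI-Miner [8] join and bound, omits the piutil field, and starts from example lists derived with assumed unit utilities. The names `mine` and `ITEM_ULS` are hypothetical.

```python
from collections import defaultdict

# 1-pattern utility lists, in the item order e, c, b, a, d (ASSUMED example values).
ITEM_ULS = {
    ("e",): [("t2", 4, 14), ("t4", 4, 2)],
    ("c",): [("t1", 2, 7), ("t2", 3, 11), ("t3", 2, 9), ("t4", 2, 0)],
    ("b",): [("t1", 2, 5), ("t2", 2, 9)],
    ("a",): [("t2", 4, 5), ("t3", 4, 5)],
    ("d",): [("t1", 5, 0), ("t2", 5, 0), ("t3", 5, 0)],
}


def join(ul_px, ul_py, ul_p):
    """Join UL(Px) and UL(Py) into UL(Pxy), subtracting the shared prefix utility."""
    py_by_tid = {tid: (iutil, rutil) for tid, iutil, rutil in ul_py}
    p_by_tid = {tid: iutil for tid, iutil, _ in ul_p}
    return [
        (tid, iutil + py_by_tid[tid][0] - p_by_tid.get(tid, 0), py_by_tid[tid][1])
        for tid, iutil, _ in ul_px
        if tid in py_by_tid
    ]


def mine(min_util):
    """Breadth-first mining: return {pattern: utility} with utility >= min_util."""
    level = dict(ITEM_ULS)   # current (k-1)-patterns with their utility lists
    known = dict(ITEM_ULS)   # every list built so far, to look up prefix lists
    high = {}
    while level:
        groups = defaultdict(list)
        for pattern, ul in level.items():
            total = sum(iutil for _, iutil, _ in ul)
            if total >= min_util:
                high[pattern] = total
            # Key each (k-1)-pattern by its (k-2)-prefix.
            groups[pattern[:-1]].append((pattern, ul))
        next_level = {}
        for prefix, members in groups.items():
            prefix_ul = known.get(prefix, [])
            for idx, (px, ul_px) in enumerate(members):
                # Bound: if sum(iutil + rutil) < min_util, no extension of px qualifies.
                if sum(iutil + rutil for _, iutil, rutil in ul_px) < min_util:
                    continue
                for py, ul_py in members[idx + 1:]:
                    ul = join(ul_px, ul_py, prefix_ul)
                    if ul:
                        next_level[px + (py[-1],)] = ul
        known.update(next_level)
        level = next_level
    return high


print(mine(20))
```

Each group can be processed independently of the others, which is the property a Spark job would exploit by distributing the groups over the cluster.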

10 Experimental Evaluation
2 algorithms
  Phps: our algorithm
  PhpMR: the competitor
4 datasets
Dataset        #Items   #Trans.     Avg. Trans. Len.
Chess          76       3,196       37
WebView-1      497      59,602      2.5
T10DI6N1KD1M   1,000    933,493     10
Chainstore     46,086   1,112,949   7.2

11 Running time with changing minUtil
Experimental Evaluation
[Figure: running time with changing minUtil]

12 Running time with each iteration
Experimental Evaluation
[Figure: running time with each iteration]

13 Conclusion and Future Work
Conclusion
  Phps: a parallel Eclat-like algorithm based on Spark
  An improved vertical data structure
  A three-phase parallel mining framework
  An efficient algorithm
Future Work
  Hybrid Search: BF + DF
  More Pruning in Phase I (filtering irrelevant items)
  Algorithms parallelizing D2HUP
  Algorithms on new parallel programming frameworks

14 References
[1] Y. Liu, W. Liao, and A. Choudhary. A fast high utility itemsets mining algorithm. In Proceedings of the Utility-Based Data Mining Workshop in conjunction with the 11th ACM SIGKDD [C], 2005, p
[2] Y.-C. Li, J.-S. Yeh, and C.-C. Chang. Isolated items discarding strategy for discovering high utility itemsets [J]. Data & Knowledge Engineering, 2008, 64(1):
[3] A. Erwin, R. P. Gopalan, and N. R. Achuthan. Efficient mining of high utility itemsets from large datasets [A]. In Proceedings of PAKDD 2008 [C], 2008, p
[4] J. W. Han, J. Pei, Y. W. Yin, et al. Mining Frequent Patterns without Candidate Generation. In Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data, 2000, p1-12.
[5] C. F. Ahmed, S. K. Tanbeer, B.-S. Jeong, et al. Efficient tree structures for high utility pattern mining in incremental databases [J]. In IEEE Transactions on Knowledge and Data Engineering, 2009, p
[6] V. S. Tseng, C.-W. Wu, B.-E. Shie, et al. UP-Growth: an efficient algorithm for high utility itemset mining [A]. In Proceedings of the 16th ACM SIGKDD [C], 2010, p
[7] J. Liu, K. Wang, and B. Fung. Direct Discovery of High Utility Itemsets without Candidate Generation. In IEEE 12th International Conference on Data Mining, 2012, p
[8] M. Liu, J. Qu. Mining high utility itemsets without candidate generation. In Proceedings of CIKM 2012, 2012, p
[9] Matei Zaharia. An architecture for fast and general data processing on large clusters. Technical Report No. UCB/EECS , University of California at Berkeley.
[10] Jeffrey Dean, Sanjay Ghemawat. MapReduce: Simplified data processing on large clusters. In OSDI, 2004, p

15 Thank You! Questions?

16 IEEE DSC: IEEE International Conference on Data Science in Cyberspace
BDMC: Big Data Mining for Cyberspace
23 June, 2019, 8:30 - 9:30
Workshop Chair: Zhaoquan Gu and Jing Qiu

