USpan: An Efficient Algorithm for Mining High Utility Sequential Patterns. Authors: Junfu Yin, Zhigang Zheng, Longbing Cao. In: Proceedings of the 18th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining.

Presentation transcript:

USpan: An Efficient Algorithm for Mining High Utility Sequential Patterns. Authors: Junfu Yin, Zhigang Zheng, Longbing Cao. In: Proceedings of the 18th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD 2012), ACM. Presenters: 江怡蕙, 薛筑軒

Outline: Introduction, Related work, Problem Statement, USpan algorithm, Experiment, Conclusions & Discussions.

Outline: Introduction (Background, Definition, Challenges), Related work, Problem Statement, USpan algorithm, Experiment, Conclusions & Discussions.

Introduction. Sequential pattern mining has proven essential for handling order-based critical business problems, e.g., analyzing the structures and functions of molecular or DNA sequences.

Background. The selection of interesting sequences is generally based on the frequency/support framework: sequences with high frequency are treated as significant. Under this framework, the downward closure property (also known as the Apriori property) plays a fundamental role.

Definition: Utility. Internal utility = quantity; external utility = quality. High utility pattern mining selects patterns whose utility reaches a minimum utility threshold. Example: the utility of <(e)(a)> in sequence s2 is {8, 10}, one value per occurrence of the pattern.

Definition. The concept of sequence utility considers the quality and quantity associated with each item in a sequence, and the problem of mining high utility sequential patterns is defined on it. A complete lexicographic quantitative sequence tree (LQS-tree) is used to construct utility-based sequences, and two concatenation mechanisms, I-Concatenation and S-Concatenation, generate newly concatenated sequences.

Two pruning methods, width pruning and depth pruning, substantially reduce the search space in the LQS-tree; USpan traverses the LQS-tree and outputs all the high utility sequential patterns.

Outline: Introduction, Related work (Utility Itemset/Pattern Mining, Utility-based Sequential Pattern Mining), Problem Statement, USpan algorithm, Experiment, Conclusions & Discussions.

Utility Itemset/Pattern Mining. Mining high utility itemsets is much more challenging than discovering frequent itemsets, because the fundamental downward closure property of frequent itemset mining does not hold for utility itemsets. The addition of ordering information in sequences makes the problem fundamentally different from, and much more challenging than, mining utility itemsets.

Utility-based Sequential Pattern Mining. Mining frequent sequences yields many patterns; patterns with frequencies lower than the minimum support are filtered out.

Outline: Introduction, Related work, Problem Statement, USpan algorithm, Experiment, Conclusions & Discussions.

Sequence Utility Framework. I = {i1, i2, ..., in} is a set of distinct items. Each item ik ∈ I (1 ≤ k ≤ n) is associated with a quality (or external utility), denoted p(ik). A quantitative item, or q-item, is an ordered pair (i, q), where i ∈ I represents an item and q is a positive number representing its quantity (or internal utility).

A quantitative itemset, or q-itemset, consists of one or more q-items, denoted and defined as l = [(ij1, q1)(ij2, q2)...(ijn, qn)]. A quantitative sequence, or q-sequence, is an ordered list of q-itemsets, denoted and defined as s = <l1 l2 ... lm>. A q-sequence database S consists of tuples <sid, s>, where sid is a q-sequence identifier.

Sequence Utility Framework – Definitions. Example: (a, 4), [(a, 4)(e, 2)], and [(a, 4)(b, 1)(e, 2)] are all contained in (⊆) [(a, 4)(b, 1)(e, 2)]; but [(a, 2)(e, 2)] and [(a, 4)(c, 1)] are not contained in [(a, 4)(b, 1)(e, 2)] (the quantity of a differs in the former, and c does not appear in the latter).
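The containment rules above can be sketched as a minimal check. This is our own illustration, not the paper's code: q-items are represented as (item, quantity) tuples, and quantities must match exactly, which is why [(a, 2)(e, 2)] is rejected.

```python
# q-itemset containment with exact quantity matching (illustrative
# representation: a q-itemset is a list of (item, quantity) tuples).

def qitemset_contains(big, small):
    """True iff every q-item (item, quantity) of `small` occurs in `big`
    with the same item AND the same quantity."""
    return all(qitem in big for qitem in small)

big = [("a", 4), ("b", 1), ("e", 2)]
print(qitemset_contains(big, [("a", 4), ("e", 2)]))   # True
print(qitemset_contains(big, [("a", 2), ("e", 2)]))   # False: quantity of a differs
print(qitemset_contains(big, [("a", 4), ("c", 1)]))   # False: c does not appear
```
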

Sequence Utility Framework – Definitions. A q-sequence with 4 q-items in 3 q-itemsets is a 4-q-sequence with size 3; a sequence with 2 items in 2 itemsets is a 2-sequence with size 2 (the concrete example sequences were lost in transcription).

Sequence Utility Framework – Definitions. Let t = <(e)(a)>. t's utility in sequence s4 in Table 2 is v(t, s4) = {u( ), u(<(e, 2)(a, 4)>)} = {16, 10}. t's utility in S is v(t) = {u(t, s2), u(t, s4), u(t, s5)} = {{8, 10}, {16, 10}, {15, 7}}.

High Utility Sequential Pattern Mining

Definition 10 (High Utility Sequential Pattern). Because a sequence may have multiple utility values in the q-sequence context, we choose the maximum utility as the sequence's utility. The maximum utility of a sequence t is denoted and defined as umax(t) = Σ_{s∈S} max(v(t, s)). Sequence t is a high utility sequential pattern if and only if umax(t) ≥ ξ, where ξ is the user-specified minimum utility. Example: the utility of the sequence <(e)(a)> is umax(<(e)(a)>) = 10 + 16 + 15 = 41. If the minimum utility is ξ = 40, then s = <(e)(a)> is a high utility sequential pattern, since umax(s) = 41 ≥ ξ.
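Definition 10 can be sketched in a few lines. The representation of v(t) as one set of occurrence utilities per matching q-sequence is our own; the numbers are the slides' running example for t = <(e)(a)>.

```python
# umax(t): in every q-sequence the pattern may occur several times with
# different utilities, so take the maximum per q-sequence and sum them.

def umax(utility_sets):
    """utility_sets: one set of occurrence utilities per matching q-sequence."""
    return sum(max(s) for s in utility_sets)

v_t = [{8, 10}, {16, 10}, {15, 7}]   # v(t) for t = <(e)(a)> from the slides
print(umax(v_t))                      # 41 = 10 + 16 + 15

xi = 40                               # user-specified minimum utility
print(umax(v_t) >= xi)                # True: t is a high utility pattern
```
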

Outline: Introduction, Related work, Problem Statement, USpan algorithm (Lexicographic Q-Sequence Tree, Concatenations, Width Pruning, Depth Pruning, USpan Algorithm), Experiment, Conclusions & Discussions.

USpan Algorithm. USpan is composed of a lexicographic q-sequence tree, two concatenation mechanisms, and two pruning strategies.

Lexicographic Q-Sequence Tree. We adapt the concept of the lexicographic sequence tree. Given a k-sequence t, the operation of appending a new item to the end of t to form a (k+1)-sequence is called concatenation. If the size of t does not change, the operation is an I-Concatenation; if the size increases by one, it is an S-Concatenation. For example, I-Concatenating and S-Concatenating a sequence with item b append b to the last itemset and to a new itemset, respectively.

Lexicographic Q-Sequence Tree. Assume two k-sequences ta and tb are both concatenated from sequence t. Then ta < tb if (i) ta is I-Concatenated from t and tb is S-Concatenated from t, or (ii) both ta and tb are I-Concatenated or both are S-Concatenated from t, but the concatenated item in ta is alphabetically smaller than that in tb.
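The two concatenations and the ordering rule above can be sketched as follows. The list-of-itemsets representation and helper names are our own assumptions, not the paper's:

```python
# I-Concatenation vs. S-Concatenation on a sequence stored as a list of
# itemsets (each itemset a list of items).

def i_concatenate(seq, item):
    """Append `item` to the last itemset: length grows, size stays the same."""
    return seq[:-1] + [seq[-1] + [item]]

def s_concatenate(seq, item):
    """Append `item` as a new itemset: both length and size grow by one."""
    return seq + [[item]]

t = [["e"], ["a"]]                     # the sequence <(e)(a)>
print(i_concatenate(t, "b"))           # [['e'], ['a', 'b']]   i.e. <(e)(ab)>
print(s_concatenate(t, "b"))           # [['e'], ['a'], ['b']] i.e. <(e)(a)(b)>

# Ordering rule among children of the same node: I-concatenated children
# come before S-concatenated ones, ties broken alphabetically.
children = sorted(
    [("S", "b"), ("I", "c"), ("I", "b"), ("S", "a")],
    key=lambda c: (c[0] == "S", c[1]),  # False (I) sorts before True (S)
)
print(children)  # [('I', 'b'), ('I', 'c'), ('S', 'a'), ('S', 'b')]
```
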

Lexicographic Q-Sequence Tree. Definition 11 (Lexicographic Q-sequence Tree). A lexicographic q-sequence tree (LQS-Tree) T is a tree structure satisfying the following rules: each node in T is a sequence together with the utility of that sequence, and the root is empty; any node's child is either an I-Concatenated or an S-Concatenated sequence node of the node itself; all children of any node in T are listed in incremental alphabetical order.

Lexicographic Q-Sequence Tree. v(ea) = {{8, 10}, {16, 10}, {15, 7}} and umax(ea) = 41. "Can any of <(e)(a)>'s children's maximum utility be calculated by simply adding the highest utility of the q-items after <(e)(a)> to umax(ea)?" The answer is no.

Lexicographic Q-Sequence Tree. USpan searches the LQS-Tree depth-first, which raises three questions: How can we generate a node's children's utilities by concatenating the corresponding items? (Concatenations.) How can we avoid checking unpromising children? (Width pruning.) When should USpan stop searching deeper nodes? (Depth pruning.)

Concatenations. USpan stores each q-sequence in a utility matrix, whose entries are (utility, remaining utility) pairs.

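As a concrete sketch of the (utility, remaining utility) pairs, the following builds one row per q-item for a single q-sequence. The quality values in `p` and the data layout are illustrative assumptions, not the paper's example tables:

```python
# Utility matrix sketch: each q-item gets (utility, remaining utility),
# where "remaining" is the total utility of every q-item after it.

p = {"a": 2, "b": 5, "e": 1}   # hypothetical external utilities (qualities)

def utility_matrix(qseq):
    """qseq: list of q-itemsets, each a list of (item, quantity) pairs.
    Returns one (item, utility, remaining utility) row per q-item."""
    flat = [(item, q * p[item]) for itemset in qseq for item, q in itemset]
    total = sum(u for _, u in flat)
    rows, seen = [], 0
    for item, u in flat:
        seen += u
        rows.append((item, u, total - seen))
    return rows

for row in utility_matrix([[("e", 2)], [("a", 4), ("b", 1)]]):
    print(row)   # ('e', 2, 13), then ('a', 8, 5), then ('b', 5, 0)
```

The remaining-utility column is what makes the depth-pruning bound cheap to compute: it overestimates how much utility any extension of a pattern ending at that q-item could still collect.
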

Concatenations: I-Concatenation

Concatenations: S-Concatenation


Width Pruning

Depth Pruning

USpan Algorithm. The pseudocode (not reproduced in the transcript) is annotated with these steps:
// includes depth pruning strategy
// width pruning strategy
// generate candidates
// deal with I-Concatenation
// deal with S-Concatenation
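Putting those commented steps together, here is a heavily simplified, runnable sketch of the search loop. It assumes sequences of single-item itemsets (so only S-Concatenation appears), precomputed per-item utilities, an SWU-style width-pruning filter, and a remaining-utility depth bound; the function names and data layout are ours, not the paper's pseudocode, which works on utility matrices and also performs I-Concatenation.

```python
# Simplified USpan-style miner following the slide's annotated steps.

def occurrences(seq, pattern, start=0):
    """All (utility, end) pairs over matchings of `pattern` in seq[start:]."""
    if not pattern:
        return [(0, start)]
    out = []
    for i in range(start, len(seq)):
        item, u = seq[i]
        if item == pattern[0]:
            out += [(u + r, e) for r, e in occurrences(seq, pattern[1:], i + 1)]
    return out

def swu_filter(db, xi):
    # width pruning flavour: an item whose supporting sequences' total
    # utility (an SWU-style overestimate) is below xi cannot occur in any
    # high utility pattern, so drop it everywhere
    swu = {}
    for s in db:
        total = sum(u for _, u in s)
        for item in {i for i, _ in s}:
            swu[item] = swu.get(item, 0) + total
    return [[(i, u) for i, u in s if swu[i] >= xi] for s in db]

def mine(db, xi, pattern=(), results=None):
    if results is None:                       # first call: apply width pruning
        db, results = swu_filter(db, xi), {}
    per_seq = [occurrences(s, pattern) for s in db]
    if pattern:
        u = sum(max(o for o, _ in m) for m in per_seq if m)   # umax(pattern)
        if u >= xi:
            results[pattern] = u
        # depth pruning: best occurrence utility plus the utility of all
        # items still appendable after it, summed over matching sequences
        bound = sum(max(o + sum(ru for _, ru in s[e:]) for o, e in m)
                    for s, m in zip(db, per_seq) if m)
        if bound < xi:
            return results
    # candidate generation + S-Concatenation of each candidate item
    candidates = {i for s, m in zip(db, per_seq) if m
                  for _, e in m for i, _ in s[e:]}
    for item in sorted(candidates):
        mine(db, xi, pattern + (item,), results)
    return results

db = [[("e", 2), ("a", 8), ("b", 5)],        # items carry precomputed utility
      [("e", 6), ("a", 4)]]
print(mine(db, 15))                           # {('e', 'a'): 20, ('e', 'a', 'b'): 15}
```

With ξ = 15 the bound prunes, e.g., the subtree under ('b',) immediately (5 achievable utility in total), while <(e)(a)> and <(e)(a)(b)> survive.
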

Outline: Introduction, Related work, Problem Statement, USpan algorithm, Experiment (Settings, Results), Conclusions & Discussions.

Experimental Settings. Data sets: DS1: C10 T2.5 S4 I2.5 DB10k N1k; DS2: C8 T2.5 S6 I2.5 DB10k N10k (DS2 values in parentheses below). The average number of elements in a sequence is 10 (8). The average number of items in an element is 2.5 (2.5). The average maximal pattern consists of 4 (6) elements, each composed of 2.5 (2.5) items on average. Each data set contains 10k (10k) sequences. The number of distinct items is 1k (10k).

Experimental Settings. Data sets: DS3: online shopping transactions. There are 811 distinct products, 350,241 transactions, and 59,477 customers. The average number of elements in a sequence is 5, the maximum length of a customer's sequence is 82, and the most popular product has been ordered 2,176 times. DS4: mobile communication transactions. The data set is a 100,000-record mobile-call history with 67,420 customers. The maximum length of a sequence is …

Experimental Results – Execution Time & #Patterns

Experimental Results – Distribution in Terms of Length

Experimental Results – Pruning

Experimental Results – Scalability

Experimental Results – Utility-based vs. Frequent Patterns

Outline: Introduction, Related work, Problem Statement, USpan algorithm, Experiment, Conclusions & Discussions.

Conclusions. The paper provides a systematic statement of a generic framework for high utility sequential pattern mining and proposes an efficient algorithm, USpan, with two concatenation mechanisms (I-Concatenation, S-Concatenation) and two pruning strategies (width pruning, depth pruning). USpan can efficiently identify high utility sequences in large-scale data with low minimum utility.

Discussions. Strongest parts of this paper: USpan grows the tree by depth-first search and need not store the whole LQS-Tree in memory; the two proposed pruning strategies work well in the experiments; the tables only need to be calculated once, at the beginning. Weak points: each sequence needs a table to store its values, and all the tables are kept in memory; each tree node contains a lot of information.

Discussions. Possible improvements: design algorithms for even bigger data sets and better pruning strategies; shrink the number of tables, or the number of elements in each table. Possible extensions: other metrics of "utility"; items with positive and negative unit profits; time constraints (as in GSP). Possible applications: business decision making; analysis of experts' game records; in each case, "item" and "utility" must be specified first.

END & Thanks for your attention