Sequential PAttern Mining using A Bitmap Representation
Jay Ayres, Johannes Gehrke, Tomi Yiu and Jason Flannick
Dept. of Computer Science, Cornell University
Presenters: 0259636 林哲存, 0259639 林庭宇, 0159638 徐敏容
Outline
- Introduction
- The SPAM Algorithm
  - Lexicographic Tree for Sequences
  - Depth First Tree Traversal
  - Pruning
- Data Representation
  - Data Structure
  - Candidate Generation
- Experimental Evaluation
  - Synthetic data generation
  - Comparison With SPADE and PrefixSpan
  - Consideration of space requirements
- Conclusion
Introduction
Let I = {i1, i2, ..., in} be a set of items.
A subset X ⊆ I is called an itemset; |X| is the size of X.
A sequence s = (s1, s2, ..., sm) is an ordered list of itemsets, where si ⊆ I for i ∈ {1, ..., m}.
The length l of a sequence s = (s1, s2, ..., sm) is defined as l = |s1| + |s2| + ... + |sm|.
Introduction
Sa: a sequence. Example: Sa = ({a}, {b, c}).
sup_D(Sa): the support of Sa in database D, i.e. the number of sequences of D that contain Sa. For Sa = ({a}, {b, c}) in the example database, sup_D(Sa) = 2.
Given a support threshold minSup, a sequence Sa is called a frequent sequential pattern on D if sup_D(Sa) >= minSup.
Introduction
Sa = ({a}, {b, c})
Consider the sequence of customer 2: the size of this sequence is 2 and its length is 4.
The support of Sa is 2, or 0.67. If minSup <= 0.67, then Sa is deemed frequent.

Table 1: Dataset sorted by CID and TID
CID | TID | Itemset
1   | 1   | {a, b, d}
1   | 3   | {b, c, d}
1   | 6   | {b, c, d}
2   | 2   | {b}
2   | 4   | {a, b, c}
3   | 5   | {a, b}
3   | 7   | {b, c, d}

Table 2: Sequence for each customer
CID | Sequence
1   | ({a, b, d}, {b, c, d}, {b, c, d})
2   | ({b}, {a, b, c})
3   | ({a, b}, {b, c, d})
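These definitions can be checked directly in code. Below is a minimal Python sketch (ours, not from the paper) that counts the support of a candidate sequence over the database of Table 2; the helper names are our own.

```python
def is_subsequence(candidate, sequence):
    """True if `candidate` (a list of itemsets) occurs in order within
    `sequence`, each candidate itemset contained in a distinct,
    strictly later itemset of the customer's sequence."""
    pos = 0
    for itemset in candidate:
        # Advance to the next transaction that contains this itemset.
        while pos < len(sequence) and not itemset <= sequence[pos]:
            pos += 1
        if pos == len(sequence):
            return False
        pos += 1  # the next candidate itemset must match a later transaction
    return True

def support(candidate, database):
    """Number of customer sequences in `database` that contain `candidate`."""
    return sum(is_subsequence(candidate, seq) for seq in database)

# Table 2: one sequence of itemsets per customer (CID 1, 2, 3).
database = [
    [{'a', 'b', 'd'}, {'b', 'c', 'd'}, {'b', 'c', 'd'}],  # CID 1
    [{'b'}, {'a', 'b', 'c'}],                             # CID 2
    [{'a', 'b'}, {'b', 'c', 'd'}],                        # CID 3
]

sa = [{'a'}, {'b', 'c'}]
print(support(sa, database))  # 2, i.e. relative support 2/3 ~ 0.67
```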
Introduction
Contributions of SPAM:
- A novel depth-first search strategy that integrates a depth-first traversal of the search space with effective pruning mechanisms.
- A vertical bitmap representation of the database with efficient support counting.
- SPAM outperforms previous work by up to an order of magnitude.
The SPAM Algorithm: The Lexicographic Sequence Tree
Consider all sequences arranged in a sequence tree with the following structure: the root of the tree is labeled with null ({}).
Assume that there is a lexicographical ordering <= of the items I in the database. If item i occurs before item j in the ordering, then we denote this by i <=_I j.
This ordering can be extended to sequences by defining sa <= sb if sa is a subsequence of sb. If sa is not a subsequence of sb, then there is no relationship in this ordering.
The SPAM Algorithm: The Lexicographic Sequence Tree
Each sequence in the sequence tree can be considered as either a sequence-extended sequence (S-step) or an itemset-extended sequence (I-step).
[Figure: lexicographic sequence tree for I = {a, b}. The root {} has child ({a}); the S-step children of ({a}) are ({a}, {a}) and ({a}, {b}), its I-step child is ({a, b}); ({a}, {a}) in turn has S-step children ({a}, {a}, {a}) and ({a}, {a}, {b}) and I-step child ({a}, {a, b}); and so on.]
The SPAM Algorithm: The Lexicographic Sequence Tree
For example, let Sa = ({a, b, c}, {a, b}).
S-step (a sequence-extended sequence of Sa): ({a, b, c}, {a, b}) -> ({a, b, c}, {a, b}, {d}), ...
I-step (an itemset-extended sequence of Sa): ({a, b, c}, {a, b}) -> ({a, b, c}, {a, b, d})
The SPAM Algorithm: Depth First Tree Traversal
The tree is traversed in a standard depth-first manner.
If sup_D(S) >= minSup, store S and continue the DFS below S.
If sup_D(S) < minSup, there is no need to continue the DFS below S: by the Apriori principle, no extension of an infrequent sequence can be frequent.
The search space is huge, so pruning is essential.
The SPAM Algorithm: S-step Pruning
Suppose we are at node ({a}) in the tree, with S({a}) = {a, b, c, d} and I({a}) = {b, c, d}, and suppose that ({a}, {c}) and ({a}, {d}) are not frequent.
By the Apriori principle, the following are not frequent either:
({a}, {a}, {c}), ({a}, {b}, {c}), ({a}, {a, c}), ({a}, {b, c}), ({a}, {a}, {d}), ({a}, {b}, {d}), ({a}, {a, d}), ({a}, {b, d})
Hence, when we are at node ({a}, {a}) or ({a}, {b}), we do not have to perform the I-step or S-step using items c and d, i.e.:
S({a}, {a}) = S({a}, {b}) = {a, b}, I({a}, {a}) = {b}, and I({a}, {b}) = null.
The SPAM Algorithm: I-step Pruning
Consider the same node ({a}) described in the previous section. The possible itemset-extended sequences are ({a, b}), ({a, c}), and ({a, d}).
If ({a, c}) is not frequent, then ({a, b, c}) must also not be frequent by the Apriori principle.
Hence, I({a, b}) = {d}, S({a, b}) = {a, b}, and S({a, d}) = {a, b}.
Pseudocode for DFS with pruning
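A hedged Python reconstruction of the paper's DFS-Pruning procedure, reusing the `support` helper sketched earlier (SPAM itself performs this frequency test with the bitmap ANDs described in the next section, not by rescanning sequences). Each candidate S-step and I-step extension is tested once; only the frequent ones (s_temp, i_temp) are passed down to the children, which is exactly the pruning of the previous two slides.

```python
def dfs_pruning(node, s_candidates, i_candidates, database, min_sup, results):
    """node: the current sequence, a list of itemsets (sets of items).
    s_candidates: items considered for sequence extensions (S-steps).
    i_candidates: items considered for itemset extensions (I-steps)."""
    # Keep only the S-step extensions that are frequent; items failing the
    # test disappear from the candidate sets of every child (Apriori pruning).
    s_temp = [i for i in s_candidates
              if support(node + [{i}], database) >= min_sup]
    for i in s_temp:
        child = node + [{i}]
        results.append(child)
        dfs_pruning(child, s_temp, [j for j in s_temp if j > i],
                    database, min_sup, results)

    # Same test for the I-step extensions of this node.
    i_temp = [i for i in i_candidates
              if support(node[:-1] + [node[-1] | {i}], database) >= min_sup]
    for i in i_temp:
        child = node[:-1] + [node[-1] | {i}]
        results.append(child)
        dfs_pruning(child, s_temp, [j for j in i_temp if j > i],
                    database, min_sup, results)
```

Mining starts by calling `dfs_pruning([{i}], F1, [j for j in F1 if j > i], database, min_sup, results)` for every frequent item i, where F1 is the lexicographically sorted list of frequent items.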
Data Representation: Data Structure
Our algorithm uses a vertical bitmap representation of the data. A vertical bitmap is created for each item in the dataset, and each bitmap has a bit corresponding to each transaction in the dataset.
In the bitmap of a sequence, a bit is set to 1 if the transaction it represents contains the last itemset of the sequence and earlier transactions of the same customer contain all previous itemsets (i.e. the customer contains the sequence of itemsets).
If item i appears in transaction j, then the bit corresponding to transaction j of the bitmap for item i is set to one; otherwise, the bit is set to zero.
If the size of a sequence (the number of transactions of the customer) is between 2^k + 1 and 2^(k+1), we consider it as a 2^(k+1)-bit sequence. Example: if the size of a sequence is 3, then k = 1 and a 2^(1+1) = 4-bit slice is used.
Data Representation: Data Structure
Bitmaps for the dataset sorted by CID and TID (each customer is padded to a 4-bit slice; "-" marks padding bits):

CID | TID | {a} | {b} | {c} | {d}
1   | 1   |  1  |  1  |  0  |  1
1   | 3   |  0  |  1  |  1  |  1
1   | 6   |  0  |  1  |  1  |  1
-   | -   |  0  |  0  |  0  |  0
2   | 2   |  0  |  1  |  0  |  0
2   | 4   |  1  |  1  |  1  |  0
-   | -   |  0  |  0  |  0  |  0
-   | -   |  0  |  0  |  0  |  0
3   | 5   |  1  |  1  |  0  |  0
3   | 7   |  0  |  1  |  1  |  1
-   | -   |  0  |  0  |  0  |  0
-   | -   |  0  |  0  |  0  |  0
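A minimal sketch of how these vertical bitmaps could be built from Table 1. The layout (one Python integer per item, one bit per transaction slot, every customer padded to a 4-bit slice as in the figure above) is our illustrative choice, not the paper's actual in-memory structure.

```python
from collections import defaultdict

# Table 1 rows, already sorted by CID and TID: (CID, TID, itemset).
rows = [
    (1, 1, {'a', 'b', 'd'}), (1, 3, {'b', 'c', 'd'}), (1, 6, {'b', 'c', 'd'}),
    (2, 2, {'b'}),           (2, 4, {'a', 'b', 'c'}),
    (3, 5, {'a', 'b'}),      (3, 7, {'b', 'c', 'd'}),
]

SLICE = 4  # every customer here has at most 4 transactions -> 4-bit slices

# Assign each (CID, TID) a global bit position, customer by customer.
tids = defaultdict(list)
for cid, tid, _ in rows:
    tids[cid].append(tid)
positions = {}
for slice_index, cid in enumerate(sorted(tids)):
    for offset, tid in enumerate(tids[cid]):
        positions[(cid, tid)] = slice_index * SLICE + offset

# One bitmap per item: bit j is 1 iff transaction slot j contains the item.
bitmaps = defaultdict(int)
for cid, tid, itemset in rows:
    for item in itemset:
        bitmaps[item] |= 1 << positions[(cid, tid)]

for item in 'abcd':
    bits = format(bitmaps[item], '012b')[::-1]  # LSB (first transaction) first
    print(item, ' '.join(bits[k:k + SLICE] for k in range(0, 12, SLICE)))
# a 1000 0100 1000
# b 1110 1100 1100
# c 0110 0100 0100
# d 1110 0000 0100
```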
Data Representation: Candidate Generation
The S-step requires that we first transform the current sequence's bitmap: within each customer's bit slice, the first 1 and every bit before it are set to 0, and all bits after it are set to 1, to indicate that the new item can only come in a transaction after the last transaction in the current sequence. The transformed bitmap is then ANDed with the appended item's bitmap.
For the I-step, only the AND is necessary, since the appended item occurs within the same transaction as the last itemset in the current sequence.
S-step Process & S-step result process 1 0 0 0 0 1 0 0 0 1 1 1 0 0 1 1 0 1 1 0 ({a})s ({a}) {b} ({a},{b}) S-step process & result ({a, b}, {b, c, d}) 3 ({b}, {a, b, c}) 2 ({a, b, d}, {b, c, d}, {b, c, d}) 1 Sequence CID
I-step Process & result 0 1 0 0 0 0 0 0 1 1 1 0 0 1 1 0 {d} ({a},{b, d}) & result ({a},{b}) ({a, b}, {b, c, d}) 3 ({b}, {a, b, c}) 2 ({a, b, d}, {b, c, d}, {b, c, d}) 1 Sequence CID
Experimental Evaluation: Synthetic Data Generation
Datasets were generated using the IBM AssocGen program.
We also compared the performance of the algorithms as the minimum support was varied for several datasets of different sizes.
Experimental Evaluation: Comparison With SPADE and PrefixSpan
We compared the three algorithms on several small, medium, and large datasets for various minimum support values.
This set of tests shows that SPAM outperforms SPADE by about a factor of 2.5 on small datasets and by more than an order of magnitude on reasonably large datasets.
PrefixSpan outperforms SPAM slightly on very small datasets, but on large datasets SPAM outperforms PrefixSpan by over an order of magnitude.
Sample running time graph Figure 6: Varying support for small dataset #1 Figure 7: Varying support for small dataset #2
Sample running time graph Figure 8: Varying support for medium-sized dataset #1 Figure 9: Varying support for medium-sized dataset #2
Sample running time graph Figure 10: Varying support for large dataset #1 Figure 11: Varying support for large dataset #2
Experimental Evaluation: Comparison With SPADE and PrefixSpan
SPAM performs very well on large datasets, since its support counting reduces to fast bitwise operations.
PrefixSpan runs slightly faster on small datasets.
Overall, SPAM excels at finding the frequent sequences for many different types of large datasets.
Figure 12: Varying number of customers Figure 13: Varying number of transactions per customer
Figure 14: Varying number of items per transaction Figure 15: Varying average length of maximal sequences
Figure 16: Varying average length of transactions within maximal sequences Figure 17: Varying support with large number of customers in dataset
Experimental Evaluation: Consideration of Space Requirements
Because SPAM uses a depth-first traversal of the search space, it is quite space-inefficient in comparison to SPADE.
Using the vertical bitmap representation to store transactional data is inefficient when an item is not present in a transaction, since a zero must still be stored to represent that fact.
The representation of the data that SPADE uses is more efficient in this respect.
Experimental Evaluation: Consideration of Space Requirements
SPAM is less space-efficient than SPADE as long as 16T < N, where T is the total number of transactions and N is the total number of items across all of the transactions.
Thus the choice between SPADE and SPAM is clearly a space-time tradeoff.
Conclusion
We presented an algorithm to quickly find all frequent sequences in a list of transactions.
The algorithm utilizes a depth-first traversal of the search space combined with a vertical bitmap representation to store each sequence.
Experimental results demonstrated that our algorithm outperforms SPADE and PrefixSpan on large datasets.
Discussion
Strongest part of this paper:
- SPAM outperforms other algorithms on large datasets.
Weak points of this paper:
- Space efficiency may not be good.
- All data must be loaded into main memory.
Possible improvement:
- Use a linked list to improve the space efficiency of the vertical bitmap.
Possible extensions & applications:
- Forecasting in the stock market or financial investment.
Sequential PAttern Mining using A Bitmap Representation Q&A Sequential PAttern Mining using A Bitmap Representation Presenter 0259636 林哲存 0259639 林庭宇 0159638 徐敏容