Mining Sequential Patterns With Item Constraints


Mining Sequential Patterns With Item Constraints. Show-Jane Yen and Yue-Shi Lee. DaWaK 2004

Outline: Motivation, Introduction, Data Mining Language, Type I Query, Conclusion

Motivation. Mining sequential patterns means discovering sequential purchasing behavior. Finding all the sequential patterns in a large database is very time consuming, and users may be interested in only some items (for example: is s={A B C} a frequent sequence?). When traditional mining methods are applied, many sequential patterns that are uninteresting for the user's requirements can be generated. In this paper, users can specify the items of interest and the criteria of the sequential patterns to be discovered. An efficient data mining technique is also proposed to extract the sequential patterns according to the users' requests.

Introduction. A customer sequence is the list of all the transactions of a customer, ordered by increasing transaction time. The support for a sequence s (or an itemset i) = (the number of customer sequences that contain s (or i)) / (the total number of customer sequences). If the support for a sequence s (or an itemset i) satisfies the user-specified minimum support threshold, then s (or i) is called a frequent sequence (or a frequent itemset).
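The support definition above can be illustrated with a small sketch. The four customer sequences below are made-up toy data, not the paper's Table 1, and the helper names `contains` and `support` are hypothetical. Each customer sequence is assumed to be a list of Python sets, one per transaction.

```python
def contains(customer_sequence, pattern):
    """Return True if each itemset of `pattern` is a subset of a distinct
    transaction of `customer_sequence`, with the transactions in order."""
    pos = 0
    for itemset in pattern:
        # Advance to the next transaction that includes this itemset.
        while pos < len(customer_sequence) and not itemset <= customer_sequence[pos]:
            pos += 1
        if pos == len(customer_sequence):
            return False
        pos += 1  # the next itemset must match a strictly later transaction
    return True


def support(database, pattern):
    """Fraction of customer sequences that contain `pattern`."""
    return sum(contains(seq, pattern) for seq in database) / len(database)


# Toy database (an assumption for illustration only).
database = [
    [{"A"}, {"B", "C"}],
    [{"A", "B"}],
    [{"B"}, {"A"}],
    [{"A"}, {"C"}, {"B"}],
]
pattern = [{"A"}, {"B"}]
result = support(database, pattern)
print(result)  # 0.5
```

With a minimum support threshold of 40%, this toy pattern would count as a frequent sequence.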

Introduction. The length of an itemset X is the number of items in X. The length of a sequence s is the number of itemsets in s. An itemset of length k is called a k-itemset, and a frequent itemset of length k is called a frequent k-itemset. A sequence of length k is called a k-sequence, and a frequent sequence of length k is called a frequent k-sequence. A sequential pattern is a frequent sequence that is not contained in any other frequent sequence.

2 Data Mining Language and Database Transformation. The query syntax is: Mining <Sequential Patterns> From <CSD> With <{D1},{D2},…,{Dm}> Support <s%>. 1. <Sequential Patterns> is specified because the discovered knowledge is sequential patterns. 2. <CSD> specifies the name of the database that users query for sequential patterns. 3. <{D1},{D2},…,{Dm}> are the user-specified itemsets, ordered by increasing purchasing time. In addition, the notation "*" can appear in an itemset Di, where it denotes any itemset, and {Di} itself can be the notation "*", which represents any sequence. 4. The Support clause is followed by the user-specified minimum support s%.

2.2 Database transformation

2.3 Sequential bit-string operation. Suppose a customer sequence contains the two sequences S1 and S2. We present an operation called the sequential bit-string operation to check whether the sequence S1S2 is also contained in this customer sequence. Let the bit string for sequence S1 in customer sequence c be B1, and the bit string for sequence S2 be B2. Bit string B1 is scanned from left to right until a bit with value 1 is visited. We set this bit and all bits to its left to 0, set all bits to its right to 1 (whether they were 0 or 1), and assign the resultant bit string to a template Tb. Then the bit string for sequence S1S2 in c is obtained by performing the logical AND operation on bit strings Tb and B2. If the number of 1's in the bit string for sequence S1S2 is not zero, then S1S2 is contained in customer sequence c. Otherwise, customer sequence c does not contain S1S2. For example, consider Table 1. We want to check whether sequence {A}{C} is contained in customer sequence CID 1. The bit strings of items A and C in CID 1 are BA=011 and BC=111. The first 1 in BA is the second bit, so Tb=001. Tb AND BC = 001 AND 111 = 001, which is not zero, so CID 1 contains {A}{C}.
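A minimal sketch of the sequential bit-string operation follows, assuming bit strings are stored as Python strings of "0" and "1" characters of equal length. The function name `sequential_and` is hypothetical, and the string encoding is an illustrative choice rather than the paper's storage format.

```python
def sequential_and(b1: str, b2: str) -> str:
    """Bit string for sequence S1S2, given bit strings b1 (for S1) and b2 (for S2).

    Assumes b1 and b2 have the same length (one bit per transaction).
    """
    first = b1.find("1")
    if first == -1:
        # S1 never occurs, so S1S2 cannot occur either.
        return "0" * len(b2)
    # Template Tb: zeros up to and including the first 1 of b1, ones after it.
    template = "0" * (first + 1) + "1" * (len(b1) - first - 1)
    # Logical AND of Tb and b2, bit by bit.
    return "".join("1" if x == y == "1" else "0" for x, y in zip(template, b2))


# The slide's example: BA = 011 and BC = 111 in customer sequence CID 1.
result = sequential_and("011", "111")
print(result)         # 001
print("1" in result)  # True, so {A}{C} is contained in CID 1
```

The template keeps only those occurrences of S2 that happen strictly after the earliest occurrence of S1, which is what containment of S1S2 requires.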

Mining interesting sequential patterns. Type I query: if no notation "*" is specified in the With clause of a user's query, then the query checks whether the sequence following the With clause is a frequent sequence. Type II query: if the user would like to extract the sequential patterns that contain other sequences besides the sequences specified in the With clause, then the notation "*" has to be specified in the With clause.

3.1 Query processing for Type I query. Suppose the specified sequence in the With clause is S={D1}{D2}…{Dm}, where each Di is an itemset. The method to check whether sequence S is a frequent sequence is as follows. Step 1. Scan the bit-string database and find the number of customer sequences that contain the sequence S. 1.1. For each record in the bit-string database, check whether the customer sequence contains all items in sequence S; if it does, continue with 1.2. 1.2. Scan sequence S from left to right. For each itemset Di (1 ≤ i ≤ m) in sequence S, perform the logical AND operation on the bit strings for all items in Di. The resultant bit string is the bit string for itemset Di.

3.1 Query processing for Type I query. 1.3. If the bit string for itemset Di is not zero, then perform the sequential bit-string operation on the bit strings for Di and Di+1. The resultant bit string is the bit string for sequence {Di}{Di+1}. Then perform the sequential bit-string operation on the bit strings for sequence {Di}{Di+1} and itemset Di+2, and so on. While performing these operations, if a resultant bit string is zero, then we do not need to continue the process, because we can be sure that the customer sequence does not contain sequence S. If the final resultant bit string is not zero, then the customer sequence contains sequence S, and we increase the number of customer sequences that contain S.

3.1 Query processing for Type I query. Step 2. Determine whether the sequence S is a frequent sequence. Support of S = (the number of customer sequences that contain S) / (the total number of customer sequences).

3.1 Query processing for Type I query. Example: S={AE}{ACE}{CE}. CID 1: D1=001, D2=001, D3=001. The sequential bit-string operation on D1 and D2 gives 000. CID 2: D1=1010, D2=0010, D3=0011. The sequential bit-string operation on D1 and D2 gives 0010, and the operation on 0010 and D3 gives 0001. CID 3: X. CID 4: 000000. CID 5: 000000. Support = 1/5 = 20%.
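The Type I procedure can be sketched as a fold of the sequential bit-string operation over the itemset bit strings D1, …, Dm. The sketch below uses only the itemset bit strings the slide gives for CID 1 and CID 2. The other three customers are not encoded because their itemset bit strings are not listed; they are counted in the total only. All function and variable names are hypothetical.

```python
def sequential_and(b1, b2):
    """Sequential bit-string operation on equal-length '0'/'1' strings."""
    first = b1.find("1")
    if first == -1:
        return "0" * len(b2)
    template = "0" * (first + 1) + "1" * (len(b1) - first - 1)
    return "".join("1" if x == y == "1" else "0" for x, y in zip(template, b2))


def contains_sequence(itemset_bits):
    """itemset_bits: bit strings for D1..Dm within one customer sequence."""
    current = itemset_bits[0]
    for nxt in itemset_bits[1:]:
        if "1" not in current:
            return False  # stop early: the customer cannot contain S
        current = sequential_and(current, nxt)
    return "1" in current


# Itemset bit strings for S = {AE}{ACE}{CE}, as listed on the slide.
records = {
    1: ["001", "001", "001"],
    2: ["1010", "0010", "0011"],
}
total_customers = 5  # CID 1 to CID 5; CID 3, 4 and 5 do not contain S per the slide

matches = [cid for cid, bits in records.items() if contains_sequence(bits)]
support = len(matches) / total_customers
print(matches, support)  # [2] 0.2
```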

3.2 Query processing for Type II query. For a Type II query, the notation "*" is specified in the With clause. Example, Query 1: Mining <Sequential Patterns> From <CSD> With <*,{E},*,{A},*,{B},*> Support <40%>. In Query 1, the user would like to find all the sequential patterns that contain the sequence {E}{A}{B} in the customer sequence database (Table 1), with the minimum support threshold set to 40%.

3.2 Query processing for Type II query. Step 1. Find all the frequent (m+1)-sequences. Step 1.1. Scan the bit-string database; if all items in S are contained in a record, then output the items in this record and the bit string for each item into the 1-itemset database. The candidate (k+1)-itemsets are then generated, and the (k+1)-itemset database is scanned to find the frequent (k+1)-itemsets. The method to generate the candidate (k+1)-itemsets: for every two frequent k-itemsets A={a1, …, ak-1, r} and B={a1, …, ak-1, t}, the candidate (k+1)-itemset {a1, …, ak-1, r, t} can be generated. For each record in the k-itemset database, we use the frequent k-itemsets in this record and apply the above method to generate candidate (k+1)-itemsets. Suppose two frequent k-itemsets X and Y in a record generate the candidate (k+1)-itemset Z. We perform the AND operation on the bit strings for X and Y, and the resultant bit string is the bit string for Z. If this bit string is not zero, then output the candidate (k+1)-itemset Z and its bit string into the (k+1)-itemset database. Besides, we also output the frequent k-itemsets and their bit strings in each record into the frequent itemset database.
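The per-record join described in Step 1.1 can be sketched as below. A record is assumed to be a dict that maps each frequent k-itemset (a sorted tuple of items) to its bit string. The bit strings in the example are invented for illustration and are not taken from the paper's tables. The sketch covers only the join and the AND step; counting support across records is left out.

```python
from itertools import combinations


def join_itemsets(record):
    """Generate candidate (k+1)-itemsets for one record.

    record: dict mapping sorted k-itemset tuples to '0'/'1' bit strings.
    Returns a dict of candidate (k+1)-itemsets whose bit string is non-zero.
    """
    candidates = {}
    for (x, bits_x), (y, bits_y) in combinations(sorted(record.items()), 2):
        # Join only itemsets that share their first k-1 items.
        if x[:-1] != y[:-1]:
            continue
        # AND the two bit strings to get the bit string of the candidate.
        bits = "".join("1" if a == b == "1" else "0" for a, b in zip(bits_x, bits_y))
        if "1" in bits:
            candidates[x + (y[-1],)] = bits
    return candidates


# Hypothetical record with three frequent 1-itemsets.
record = {("B",): "0110", ("D",): "0100", ("F",): "0001"}
result = join_itemsets(record)
print(result)  # {('B', 'D'): '0100'}
```

A bit of the result is 1 only for transactions that contain both joined itemsets, so a zero result means the candidate never occurs in that customer sequence and need not be written to the (k+1)-itemset database.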

3.2 Query processing for Type II query. For example, in Table 2, the records that contain the sequence {E}{A}{B} in the With clause of Query 1 are CID 4 and CID 5. Hence, the 1-itemset database can be generated, which is shown in Table 3. Then the 1-itemset database is scanned to generate the frequent 2-itemsets and the 2-itemset database, which is shown in Table 4. Finally, we can generate the frequent itemsets {A}, {B}, {C}, {D}, {E}, {F} and {B, D}, and the frequent itemset database, which is shown in Table 5.

3.2 Query processing for Type II query. Step 1.2. Each frequent itemset (i.e., frequent 1-sequence) is given a unique number, and the frequent itemsets in the frequent itemset database are replaced with their numbers to form the 1-sequence database.

3.2 Query processing for Type II query. Step 1.3. Generate candidate 2-sequences, and scan the 1-sequence database to generate the 2-sequence database and find all the frequent 2-sequences. To generate the candidate 2-sequences, the itemset D1 is combined with each frequent 1-sequence f other than D1: 1. If a notation "*" appears before the itemset D1 in the With clause, then the candidate 2-sequence {f}{D1} is generated. If the notation "*" appears after the itemset D1, then the candidate 2-sequence {D1}{f} is generated. 2. If the reverse order of a candidate 2-sequence is contained in the specified sequence S, then this candidate 2-sequence can be pruned.

3.2 Query processing for Type II query. For example, in Query 1, the first itemset specified in the With clause is {E}, whose number is 5, and the notation "*" appears both before and after the itemset {E}. (Here E is the D1 above, and f is any frequent 1-sequence other than E.) The candidate 2-sequences are {1}{5}, {5}{1}, {2}{5}, {5}{2}, {3}{5}, {5}{3}, {4}{5}, {5}{4}, {6}{5}, {5}{6}, {7}{5} and {5}{7}. From these candidate 2-sequences, {1}{5} and {2}{5} can be pruned, because the reverse orders of the two sequences are contained in the specified sequence {5}{1}{2}.
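The candidate 2-sequence generation and pruning for Query 1 can be reproduced with a short sketch. It assumes that each frequent itemset is represented by its integer code, that a sequence is a tuple of codes, and that the codes in S are distinct. The function name and parameters are hypothetical.

```python
def candidate_2_sequences(frequent_codes, d1, star_before, star_after, spec):
    """Combine D1 with every other frequent 1-sequence f, then prune.

    frequent_codes: codes of the frequent 1-sequences
    d1: code of the first itemset in the With clause
    star_before / star_after: whether "*" appears before / after D1
    spec: the specified sequence S as a tuple of codes
    """
    candidates = []
    for f in frequent_codes:
        if f == d1:
            continue
        if star_before:
            candidates.append((f, d1))
        if star_after:
            candidates.append((d1, f))

    def reverse_in_spec(candidate):
        # True if S contains the two codes in the opposite order.
        a, b = candidate
        return a in spec and b in spec and spec.index(b) < spec.index(a)

    return [c for c in candidates if not reverse_in_spec(c)]


# Query 1: codes 1..7, D1 = {E} = 5, S = {E}{A}{B} = {5}{1}{2}.
result = candidate_2_sequences(range(1, 8), 5, True, True, (5, 1, 2))
print(len(result))  # 10 (12 candidates minus the pruned {1}{5} and {2}{5})
```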

3.2 Query processing for Type II query. After scanning the 1-sequence database (Table 6), the generated 2-sequence database is shown in Table 7, and the frequent 2-sequences are {5}{1}, {5}{2}, {4}{5}, {5}{3} and {5}{6}.

3.2 Query processing for Type II query. Step 1.4. Generate candidate 3-sequences, and scan the 2-sequence database to generate the 3-sequence database and find all the frequent 3-sequences. To generate candidate 3-sequences: for two frequent 2-sequences S1={D1}{r}, which is a sub-sequence of S, and S2={D1}{t} (or S1={D1}{r} and S2={t}{D1}), we can generate the candidate 3-sequences {D1}{r}{t} and {D1}{t}{r} (or {t}{D1}{r}).

3.2 Query processing for Type II query. The frequent 2-sequences are {5}{1}, {5}{2}, {4}{5}, {5}{3} and {5}{6}. The candidate 3-sequences are generated as follows: {5}{1} and {5}{2} give {5}{1}{2} and {5}{2}{1}; {5}{1} and {4}{5} give {4}{5}{1}; {5}{2} and {4}{5} give {4}{5}{2}; {5}{1} and {5}{3} give {5}{1}{3} and {5}{3}{1}; {5}{2} and {5}{3} give {5}{3}{2} and {5}{2}{3}; {5}{1} and {5}{6} give {5}{1}{6} and {5}{6}{1}; {5}{2} and {5}{6} give {5}{2}{6} and {5}{6}{2}.

3.2 Query processing for Type II query. After scanning each record in the 2-sequence database (Table 7), the frequent 3-sequences are {5}{1}{2}, {4}{5}{1}, {4}{5}{2}, {5}{3}{1}, {5}{3}{2}, {5}{1}{6} and {5}{2}{6}. 3-sequence database (columns: CID, 3-sequence, Bit-string). CID 4: 3-sequences {5}{1}{2}, {4}{5}{1}, {4}{5}{2}, {5}{3}{1}, {5}{3}{2}, {5}{1}{6}, {5}{2}{6}; bit strings 000010, 000110, 001010, 000110, 000010, 000001, 000001. CID 5: bit strings 000010, 000100.

3.2 Query processing for Type II query. Step 1.5. Frequent (h+1)-sequences (3≤h≤m) are generated in each iteration. In the (h-2)th iteration, we use the frequent h-sequences to generate candidate (h+1)-sequences, scan the h-sequence database to generate the (h+1)-sequence database, and find all the frequent (h+1)-sequences.

3.2 Query processing for Type II query. For any two frequent h-sequences S1={s1}{s2}…{sh-1}{r} and S2={s1}{s2}…{sh-1}{t}, in which {s1}{s2}…{sh-1} is a sub-sequence of S or {r} and {t} are contained in S, the candidate (h+1)-sequences {s1}{s2}…{sh-1}{r}{t} and {s1}{s2}…{sh-1}{t}{r} can be generated. Example: the frequent 3-sequences are {5}{1}{2}, {4}{5}{1}, {4}{5}{2}, {5}{3}{1}, {5}{3}{2}, {5}{1}{6} and {5}{2}{6}. The generated candidate 4-sequences are {5}{1}{2}{6}, {5}{1}{6}{2}, {4}{5}{1}{2}, {4}{5}{2}{1}, {5}{3}{1}{2} and {5}{3}{2}{1}. Of these, {4}{5}{2}{1} and {5}{3}{2}{1} are pruned.
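The join in Step 1.5 can be checked against the example with a short sketch. Sequences are assumed to be tuples of integer codes. The pruning rule used here (drop a candidate when two codes of S appear in it in the opposite order to S) is an assumption that generalizes the 2-sequence pruning rule; the slides do not state it for longer sequences, but it reproduces the two pruned candidates. All names are hypothetical.

```python
from itertools import combinations


def is_subsequence(small, big):
    """True if the codes of `small` appear in `big` in the same order."""
    it = iter(big)
    return all(code in it for code in small)


def respects_order(candidate, spec):
    """False if two codes of S occur in the candidate in reversed order."""
    positions = [spec.index(code) for code in candidate if code in spec]
    return positions == sorted(positions)


def join_sequences(frequent, spec):
    """Join frequent h-sequences sharing their first h-1 itemsets, then prune."""
    candidates = []
    for s1, s2 in combinations(frequent, 2):
        prefix, r, t = s1[:-1], s1[-1], s2[-1]
        if s2[:-1] != prefix or r == t:
            continue
        if is_subsequence(prefix, spec) or (r in spec and t in spec):
            candidates += [prefix + (r, t), prefix + (t, r)]
    return [c for c in candidates if respects_order(c, spec)]


frequent_3 = [(5, 1, 2), (4, 5, 1), (4, 5, 2), (5, 3, 1),
              (5, 3, 2), (5, 1, 6), (5, 2, 6)]
result = join_sequences(frequent_3, (5, 1, 2))
print(sorted(result))
# [(4, 5, 1, 2), (5, 1, 2, 6), (5, 1, 6, 2), (5, 3, 1, 2)]
```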

3.2 Query processing for Type II query. After pruning, the generated candidate 4-sequences are {4}{5}{1}{2}, {5}{3}{1}{2}, {5}{1}{6}{2} and {5}{1}{2}{6}. After scanning the 3-sequence database, the generated frequent 4-sequences are {4}{5}{1}{2}, {5}{3}{1}{2} and {5}{1}{2}{6}. If frequent (m+1)-sequences are generated, then Step 2 needs to be performed. Otherwise, Step 3 is performed directly.

3.2 Query processing for Type II query. Step 2. The frequent (m+n+1)-sequences (n≥1) that contain the specified sequence S are generated in each iteration. In the nth iteration, we use the frequent (m+n)-sequences to generate candidate (m+n+1)-sequences, and scan the (m+n)-sequence database and the 1-sequence database to generate the (m+n+1)-sequence database, in which each record contains the candidate (m+n+1)-sequences but not their bit strings, and find the frequent (m+n+1)-sequences.

3.2 Query processing for Type II query. The method to generate candidate (m+n+1)-sequences is as follows. For every two frequent (m+n)-sequences S1={s1}{s2}…{si}{r}{si+1}…{sm+n-1} and S2={s1}{s2}…{sj}{t}{sj+1}…{sm+n-1} (i≤j), in which {r} is not contained in S2 and {t} is not contained in S1, a candidate (m+n+1)-sequence {s1}{s2}…{r}…{t}…{sm+n-1} can be generated. To count its support, perform the sequential bit-string operations on the bit strings for the itemsets in the candidate (m+n+1)-sequence by scanning the 1-sequence database.

3.2 Query processing for Type II query. The frequent 4-sequences are {4}{5}{1}{2}, {5}{3}{1}{2} and {5}{1}{2}{6}. According to Step 2, the generated candidate 5-sequences are {4}{5}{3}{1}{2}, {4}{5}{1}{2}{6} and {5}{3}{1}{2}{6}. After scanning the 1-sequence database, the frequent 5-sequences are {4}{5}{3}{1}{2}, {4}{5}{1}{2}{6} and {5}{3}{1}{2}{6}. The only candidate 6-sequence is {4}{5}{3}{1}{2}{6}. After scanning the 1-sequence database, {4}{5}{3}{1}{2}{6} is also a frequent 6-sequence, and no candidate 7-sequence is generated. Hence, the algorithm for mining frequent sequences terminates.
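The Step 2 merge can be sketched as follows, under stated assumptions: sequences are tuples of distinct integer codes, and when the two extra itemsets {r} and {t} fall into the same gap (i = j) both orders are emitted. The slides do not specify that tie case, so it is a guess. Support counting is not shown. The names `merge` and `candidates` are hypothetical.

```python
from itertools import combinations


def merge(s1, s2):
    """Merge two sequences that differ by exactly one itemset each."""
    extra1 = [code for code in s1 if code not in s2]
    extra2 = [code for code in s2 if code not in s1]
    if len(extra1) != 1 or len(extra2) != 1:
        return set()
    r, t = extra1[0], extra2[0]
    i, j = s1.index(r), s2.index(t)
    common = s1[:i] + s1[i + 1:]
    if common != s2[:j] + s2[j + 1:]:
        return set()  # the shared itemsets must agree in order
    if i > j:
        r, t, i, j = t, r, j, i  # make r the itemset that is inserted first
    merged = {common[:i] + (r,) + common[i:j] + (t,) + common[j:]}
    if i == j:
        # Assumption: the order of r and t is undetermined, so keep both.
        merged.add(common[:i] + (t, r) + common[i:])
    return merged


def candidates(frequent):
    """All merged candidates from every pair of frequent sequences."""
    out = set()
    for s1, s2 in combinations(frequent, 2):
        out |= merge(s1, s2)
    return out


frequent_4 = [(4, 5, 1, 2), (5, 3, 1, 2), (5, 1, 2, 6)]
cand_5 = candidates(frequent_4)
print(sorted(cand_5))
# [(4, 5, 1, 2, 6), (4, 5, 3, 1, 2), (5, 3, 1, 2, 6)]
cand_6 = candidates(sorted(cand_5))
print(cand_6)  # {(4, 5, 3, 1, 2, 6)}
```

On the slide's data every pair has i < j, so the tie-breaking assumption is never exercised and the output matches the listed candidate 5-sequences and the single candidate 6-sequence.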

3.2 Query processing for Type II query. Step 3. For each frequent sequence, the code for each itemset in the frequent sequence is replaced with the itemset itself. If a frequent sequence is not contained in any other frequent sequence, then this frequent sequence is a sequential pattern. For the above example, the frequent sequences that satisfy the user requirement in Query 1 are {E}{A}{B}, {D}{E}{A}{B}, {E}{C}{A}{B}, {E}{A}{B}{F}, {D}{E}{C}{A}{B}, {D}{E}{A}{B}{F}, {E}{C}{A}{B}{F} and {D}{E}{C}{A}{B}{F}, and the sequential pattern is {D}{E}{C}{A}{B}{F}.
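Step 3 amounts to keeping only the frequent sequences that are not contained in any other frequent sequence. A minimal sketch on the Query 1 result follows. It relies on every itemset in this example being a single item, so a sequence can be written as a tuple of letters; with multi-item itemsets the containment test would need subset checks instead of equality. Function names are hypothetical.

```python
def is_contained(small, big):
    """True if the items of `small` appear in `big` in the same order."""
    it = iter(big)
    return all(item in it for item in small)


def sequential_patterns(frequent):
    """Frequent sequences that are not contained in any other frequent sequence."""
    return [s for s in frequent
            if not any(s != other and is_contained(s, other) for other in frequent)]


# The eight frequent sequences listed for Query 1, one letter per itemset.
frequent = [tuple(s) for s in
            ["EAB", "DEAB", "ECAB", "EABF", "DECAB", "DEABF", "ECABF", "DECABF"]]
patterns = sequential_patterns(frequent)
print(patterns)  # [('D', 'E', 'C', 'A', 'B', 'F')]
```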

Experimental results: our algorithm outperforms the PrefixSpan algorithm.

Conclusion. The steps for generating candidate sequences are not consistent. The notion is intuitive.