Mining Sequential Patterns With Item Constraints
Show-Jane Yen and Yue-Shi Lee
DaWaK 2004
Outline Motivation Introduction Data Mining Language Type1 Question conclusion
Motivation
Mining sequential patterns is to discover sequential purchasing behavior.
It is very time consuming to find all the sequential patterns in a large database, and users may be interested in only some items. A user may only want to know, for example: is s={A B C} a frequent sequence?
When traditional mining methods are applied, many sequential patterns that are uninteresting for the user's requirements can be generated.
In this paper, users can specify the items of interest and the criteria for the sequential patterns to be discovered. An efficient data mining technique is also proposed to extract the sequential patterns according to the users' requests.
Introduction
A customer sequence is the list of all the transactions of a customer, ordered by increasing transaction time.
The support for a sequence s (or an itemset i) = (the number of customer sequences that contain s (or i)) / (the total number of customer sequences).
If the support for a sequence s (or an itemset i) satisfies the user-specified minimum support threshold, then s (or i) is called a frequent sequence (or a frequent itemset).
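The support definition above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: a customer sequence is modeled as a list of item sets, the helper names `contains` and `support` are hypothetical, and the toy database is made up (it is not Table 1 of the slides).

```python
def contains(customer_sequence, pattern):
    """True if `pattern` (a list of itemsets) occurs, in order, in the customer sequence."""
    pos = 0
    for itemset in pattern:
        # Advance to the next transaction that includes every item of this itemset.
        while pos < len(customer_sequence) and not itemset <= customer_sequence[pos]:
            pos += 1
        if pos == len(customer_sequence):
            return False
        pos += 1  # later itemsets must match strictly later transactions
    return True


def support(database, pattern):
    """Fraction of customer sequences that contain the pattern."""
    return sum(contains(c, pattern) for c in database) / len(database)


# Made-up data for illustration only (not Table 1 of the slides).
toy_db = [
    [{"A"}, {"B", "C"}, {"C"}],
    [{"B"}, {"A", "C"}],
    [{"A"}, {"C"}],
    [{"C"}, {"A"}],
]
print(support(toy_db, [{"A"}, {"C"}]))  # 0.5
```

With a minimum support of 40%, the sequence {A}{C} would count as frequent in this toy database, since two of the four customer sequences contain it.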
Introduction
The length of an itemset X is the number of items in X. The length of a sequence s is the number of itemsets in s.
An itemset of length k is called a k-itemset, and a frequent itemset of length k is called a frequent k-itemset.
A sequence of length k is called a k-sequence, and a frequent sequence of length k is called a frequent k-sequence.
A sequential pattern is a frequent sequence that is not contained in any other frequent sequence.
2 Data Mining Language and Database Transformation
Mining <Sequential Patterns> From <CSD> With <{D1},{D2},…,{Dm}> Support <s%>
1. <Sequential Patterns> is specified because the discovered knowledge is sequential patterns.
2. <CSD> specifies the name of the database that the user queries for sequential patterns.
3. <{D1},{D2},…,{Dm}> are the user-specified itemsets, ordered by increasing purchasing time. The notation "*" can appear inside an itemset Di, where it denotes any itemset, and {Di} itself can be the notation "*", where it represents any sequence.
4. The Support clause is followed by the user-specified minimum support s%.
2.2 Database transformation
2.3 Sequential bit-string operation
Suppose a customer sequence contains the two sequences S1 and S2. The sequential bit-string operation checks whether the sequence S1S2 is also contained in this customer sequence.
Let the bit string for sequence S1 in customer sequence c be B1, and the bit string for sequence S2 be B2. B1 is scanned from left to right until a bit with value 1 is visited. We set this bit and all bits to its left to 0, set all bits to its right to 1 (regardless of whether they were 0 or 1), and assign the resulting bit string to a template Tb. The bit string for sequence S1S2 in c is then obtained by performing a logical AND operation on Tb and B2. If the number of 1s in the bit string for S1S2 is not zero, then S1S2 is contained in customer sequence c. Otherwise, c does not contain S1S2.
For example, consider Table 1. We want to check whether sequence {A}{C} is contained in customer sequence CID 1. The bit strings of items A and C in CID 1 are BA=011 and BC=111. Scanning BA gives the template Tb=001, and 001 AND 111 = 001, which is not zero, so {A}{C} is contained in CID 1.
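A minimal sketch of the sequential bit-string operation, assuming bit strings are stored as Python strings of "0" and "1" with the leftmost character as the first transaction. The function name `sequential_and` is hypothetical, and returning all zeros when B1 has no 1 bit is an assumption, since the slide does not cover that case.

```python
def sequential_and(b1: str, b2: str) -> str:
    """Bit string of S1S2, given the bit strings b1 (for S1) and b2 (for S2)."""
    first = b1.find("1")
    if first == -1:
        # Assumption: S1 never occurs, so S1S2 cannot occur either.
        return "0" * len(b2)
    # Template Tb: the first 1 and everything to its left become 0,
    # everything to its right becomes 1.
    template = "0" * (first + 1) + "1" * (len(b1) - first - 1)
    # Logical AND of the template with b2, position by position.
    return "".join("1" if t == "1" and b == "1" else "0" for t, b in zip(template, b2))


# The {A}{C} check for CID 1 from the slide: BA = 011, BC = 111.
print(sequential_and("011", "111"))  # 001
```

The result 001 contains a 1, which matches the slide's conclusion that {A}{C} is contained in CID 1.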
Mining interesting sequential patterns
Type I query: If no notation "*" is specified in the With clause, the query checks whether the sequence given in the With clause is a frequent sequence.
Type II query: If the user would like to extract the sequential patterns that contain other sequences in addition to the sequence specified in the With clause, then the notation "*" has to be specified in the With clause.
3.1 Query processing for Type I query
Suppose the sequence specified in the With clause is S={D1}{D2}…{Dm}, where each Di is an itemset. The method to check whether S is a frequent sequence is as follows.
Step 1. Scan the bit-string database and find the number of customer sequences that contain S.
1.1 For each record in the bit-string database, if the customer sequence contains all the items in S, continue with steps 1.2 and 1.3.
1.2 Scan S from left to right. For each itemset Di (1 ≤ i ≤ m) in S, perform the logical AND operation on the bit strings of all the items in Di. The resultant bit string is the bit string for itemset Di.
3.1 Query processing for Type I query
1.3 If the bit string for itemset Di is not zero, perform the sequential bit-string operation on the bit strings for Di and Di+1. The resultant bit string is the bit string for sequence {Di}{Di+1}. Then perform the sequential bit-string operation on the bit strings for sequence {Di}{Di+1} and itemset Di+2, and so on.
While performing these operations:
If a resultant bit string is zero, then we do not need to continue the process, because we can be sure that the customer sequence does not contain sequence S.
If the final resultant bit string is not zero, then the customer sequence contains S, and we increase the number of customer sequences that contain S.
3.1 Query processing for Type I query
Step 2. Determine whether S is a frequent sequence.
Support of S = (the number of customer sequences that contain S) / (the total number of customer sequences).
S is a frequent sequence if this support satisfies the user-specified minimum support threshold.
3.1 Query processing for Type I query
Example: S={AE}{ACE}{CE}
CID 1: D1=001, D2=001, D3=001. The sequential bit-string operation on D1 and D2 gives 000, so CID 1 does not contain S.
CID 2: D1=1010, D2=0010, D3=0011. The sequential bit-string operation on D1 and D2 gives 0010. The sequential bit-string operation on 0010 and D3 gives 0001, so CID 2 contains S.
CID 3: X
CID 4: 000000
CID 5: 000000
Support = 1/5 = 20%
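The Type I check can be sketched as follows, reusing the string representation of bit strings. The function names are hypothetical. The itemset bit strings for CID 1 and CID 2 are the ones given on the slide; the outcomes for CID 3 to CID 5 are copied from the slide rather than recomputed, because their item-level bit strings are not shown here.

```python
def sequential_and(b1, b2):
    """Sequential bit-string operation (section 2.3) on '0'/'1' strings."""
    first = b1.find("1")
    if first == -1:
        return "0" * len(b2)  # assumption: an all-zero input yields all zeros
    template = "0" * (first + 1) + "1" * (len(b1) - first - 1)
    return "".join("1" if t == "1" and b == "1" else "0" for t, b in zip(template, b2))


def and_bits(bit_strings):
    """Step 1.2: AND the bit strings of all items in one itemset."""
    return "".join("1" if set(column) == {"1"} else "0" for column in zip(*bit_strings))


def contains_sequence(itemset_bits):
    """Step 1.3: chain the sequential operation over the bit strings of D1, ..., Dm."""
    current = itemset_bits[0]
    for nxt in itemset_bits[1:]:
        if "1" not in current:
            return False  # early exit: the record cannot contain S
        current = sequential_and(current, nxt)
    return "1" in current


cid1 = contains_sequence(["001", "001", "001"])     # False
cid2 = contains_sequence(["1010", "0010", "0011"])  # True
# Outcomes for CID 3, CID 4 and CID 5 are taken from the slide.
contained = [cid1, cid2, False, False, False]
print(sum(contained) / len(contained))  # 0.2
```

The early exit in `contains_sequence` corresponds to the remark in step 1.3 that processing can stop as soon as an intermediate bit string is zero.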
3.2 Query processing for Type II query
For a Type II query, the notation "*" is specified in the With clause.
Example, Query 1: Mining <Sequential Patterns> From <CSD> With <*,{E},*,{A},*,{B},*> Support <40%>
In Query 1, the user would like to find all the sequential patterns that contain the sequence {E}{A}{B} in the customer sequence database (Table 1), with the minimum support threshold set to 40%.
3.2 Query processing for Type II query
Step 1. Find all the frequent (m+1)-sequences.
Step 1.1. Scan the bit-string database. If all the items in S are contained in a record, output the items in this record and the bit string for each item into the 1-itemset database.
Candidate (k+1)-itemsets are then generated, and the (k+1)-itemset database is scanned to find the frequent (k+1)-itemsets.
The method to generate the candidate (k+1)-itemsets: for every two frequent k-itemsets A={a1, …, ak-1, r} and B={a1, …, ak-1, t}, the candidate (k+1)-itemset {a1, …, ak-1, r, t} can be generated.
For each record in the k-itemset database, we use the frequent k-itemsets in this record and apply the above method to generate candidate (k+1)-itemsets. Suppose the two frequent k-itemsets X and Y in a record generate the candidate (k+1)-itemset Z. We perform the AND operation on the bit strings for X and Y, and the resultant bit string is the bit string for Z. If this bit string is not zero, then we output Z and its bit string into the (k+1)-itemset database. We also output the frequent k-itemsets in each record and their bit strings into the frequent itemset database.
3.2 Query processing for Type II query
For example, in Table 2, the records that contain the sequence {E}{A}{B} in the With clause of Query 1 are CID 4 and CID 5. Hence the 1-itemset database can be generated, as shown in Table 3. The 1-itemset database is then scanned to generate the frequent 2-itemsets and the 2-itemset database, which is shown in Table 4. Finally, we obtain the frequent itemsets {A}, {B}, {C}, {D}, {E}, {F} and {B, D}, and the frequent itemset database, which is shown in Table 5.
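The per-record candidate generation in Step 1.1 can be sketched as below. This is an illustration only: itemsets are modeled as sorted tuples, the function names are hypothetical, and the bit strings in the sample record are made up rather than taken from Tables 3 to 5.

```python
from itertools import combinations


def and_bits(x, y):
    """Bitwise AND of two equal-length '0'/'1' strings."""
    return "".join("1" if a == "1" and b == "1" else "0" for a, b in zip(x, y))


def join_itemsets(record):
    """Build the (k+1)-itemsets of one record from its frequent k-itemsets.

    `record` maps each frequent k-itemset (a sorted tuple) to its bit string.
    """
    next_level = {}
    for x, y in combinations(sorted(record), 2):
        if x[:-1] != y[:-1]:
            continue  # the two itemsets must share their first k-1 items
        z = x + (y[-1],)
        bits = and_bits(record[x], record[y])
        if "1" in bits:  # keep Z only if its bit string is not zero
            next_level[z] = bits
    return next_level


# Hypothetical bit strings for one record, for illustration only.
record = {("A",): "001000", ("B",): "010100", ("D",): "010010"}
print(join_itemsets(record))  # {('B', 'D'): '010000'}
```

Deciding which candidates are frequent still requires counting, across all records, how many of them contain each candidate; that step is not shown here.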
3.2 Query processing for Type II query
Step 1.2. Each frequent itemset (i.e., frequent 1-sequence) is given a unique number, and the frequent itemsets in the frequent itemset database are replaced with their numbers to form the 1-sequence database.
3.2 Query processing for Type II query
Step 1.3. Generate candidate 2-sequences, and scan the 1-sequence database to generate the 2-sequence database and find all the frequent 2-sequences.
Generating candidate 2-sequences:
For each frequent 1-sequence f except D1, the itemset D1 is combined with f to generate a candidate 2-sequence.
1. If a notation "*" appears before the itemset D1 in the With clause, then the candidate 2-sequence {f}{D1} is generated. If a notation "*" appears after the itemset D1, then the candidate 2-sequence {D1}{f} is generated.
2. If the reverse order of a candidate 2-sequence is contained in the specified sequence S, then this candidate 2-sequence can be pruned.
3.2 Query processing for Type II query
For example, in Query 1, the first itemset specified in the With clause is {E}, whose number is 5, and the notation "*" appears both before and after {E}. (Here {E} is the D1 above, and f is any frequent 1-sequence other than {E}.)
The candidate 2-sequences are {1}{5}, {5}{1}, {2}{5}, {5}{2}, {3}{5}, {5}{3}, {4}{5}, {5}{4}, {6}{5}, {5}{6}, {7}{5} and {5}{7}.
Of these, {1}{5} and {2}{5} can be pruned, because their reverse orders are contained in the specified sequence {5}{1}{2}.
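The generation and pruning rules of Step 1.3 can be reproduced with a short sketch. Sequences are modeled as tuples of itemset numbers; the function names and the `star_before` / `star_after` flags are hypothetical, introduced only to represent where "*" appears around D1 in the With clause.

```python
def is_subsequence(sub, seq):
    """True if the elements of `sub` appear in `seq` in the same order."""
    it = iter(seq)
    return all(x in it for x in sub)


def candidate_2_sequences(frequent_1, d1, star_before, star_after, spec):
    """Combine D1 with every other frequent 1-sequence, then prune."""
    candidates = []
    for f in frequent_1:
        if f == d1:
            continue
        if star_before:
            candidates.append((f, d1))  # {f}{D1}
        if star_after:
            candidates.append((d1, f))  # {D1}{f}
    # Prune a candidate when its reverse order is contained in S.
    return [c for c in candidates if not is_subsequence((c[1], c[0]), spec)]


# Query 1: D1 = {E} = 5, S = {5}{1}{2}, "*" on both sides of {E}.
result = candidate_2_sequences([1, 2, 3, 4, 5, 6, 7], 5, True, True, (5, 1, 2))
print(len(result))  # 10
```

The twelve candidates listed on the slide shrink to ten, with {1}{5} and {2}{5} removed, which agrees with the pruning described above.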
3.2 Query processing for Type II query
After scanning the 1-sequence database (Table 6), the generated 2-sequence database is shown in Table 7, and the frequent 2-sequences are {5}{1}, {5}{2}, {4}{5}, {5}{3} and {5}{6}.
3.2 Query processing for Type II query
Step 1.4. Generate candidate 3-sequences, and scan the 2-sequence database to generate the 3-sequence database and find all the frequent 3-sequences.
Generating candidate 3-sequences: for two frequent 2-sequences S1={D1}{r}, which is a sub-sequence of S, and S2={D1}{t}, we can generate the candidate 3-sequences {D1}{r}{t} and {D1}{t}{r}. For S1={D1}{r} and S2={t}{D1}, we can generate the candidate 3-sequence {t}{D1}{r}.
3.2 Query processing for Type II query
The frequent 2-sequences are {5}{1}, {5}{2}, {4}{5}, {5}{3} and {5}{6}.
The candidate 3-sequences are generated from pairs of frequent 2-sequences:
{5}{1}, {5}{2} → {5}{1}{2}, {5}{2}{1}
{5}{1}, {4}{5} → {4}{5}{1}
{5}{2}, {4}{5} → {4}{5}{2}
{5}{1}, {5}{3} → {5}{1}{3}, {5}{3}{1}
{5}{2}, {5}{3} → {5}{3}{2}, {5}{2}{3}
{5}{1}, {5}{6} → {5}{1}{6}, {5}{6}{1}
{5}{2}, {5}{6} → {5}{2}{6}, {5}{6}{2}
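The pairing rule of Step 1.4 can be sketched as follows, again with sequences as tuples of itemset numbers. The function names are hypothetical, and this is one reading of the rule: S1 must start with D1 and be a sub-sequence of S, while S2 is any other frequent 2-sequence that contains D1.

```python
def is_subsequence(sub, seq):
    """True if the elements of `sub` appear in `seq` in the same order."""
    it = iter(seq)
    return all(x in it for x in sub)


def candidate_3_sequences(frequent_2, d1, spec):
    candidates = set()
    for s1 in frequent_2:
        # S1 = {D1}{r} must be a sub-sequence of the specified sequence S.
        if s1[0] != d1 or not is_subsequence(s1, spec):
            continue
        r = s1[1]
        for s2 in frequent_2:
            if s2 == s1:
                continue
            if s2[0] == d1:      # S2 = {D1}{t}
                t = s2[1]
                candidates.add((d1, r, t))
                candidates.add((d1, t, r))
            elif s2[1] == d1:    # S2 = {t}{D1}
                candidates.add((s2[0], d1, r))
    return candidates


frequent_2 = [(5, 1), (5, 2), (4, 5), (5, 3), (5, 6)]
print(len(candidate_3_sequences(frequent_2, 5, (5, 1, 2))))  # 12
```

Using a set removes the duplicates that arise when both S1 and S2 are sub-sequences of S, as with {5}{1} and {5}{2}.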
3.2 Query processing for Type II query
After scanning each record in the 2-sequence database (Table 7), the frequent 3-sequences are {5}{1}{2}, {4}{5}{1}, {4}{5}{2}, {5}{3}{1}, {5}{3}{2}, {5}{1}{6} and {5}{2}{6}.
CID / 3-sequence / Bit-string
CID 4: 3-sequences {5}{1}{2}, {4}{5}{1}, {4}{5}{2}, {5}{3}{1}, {5}{3}{2}, {5}{1}{6}, {5}{2}{6}; bit strings 000010, 000110, 001010, 000110, 000010, 000001, 000001
CID 5: bit strings 000010, 000100
3.2 Query processing for Type II query
Step 1.5. Frequent (h+1)-sequences (3 ≤ h ≤ m) are generated in each iteration. In the (h-2)th iteration, we use the frequent h-sequences to generate candidate (h+1)-sequences, scan the h-sequence database to generate the (h+1)-sequence database, and find all the frequent (h+1)-sequences.
3.2 Query processing for Type II query
For any two frequent h-sequences S1={s1}{s2}…{sh-1}{r} and S2={s1}{s2}…{sh-1}{t}, in which {s1}{s2}…{sh-1} is a sub-sequence of S or {r} and {t} are contained in S, the candidate (h+1)-sequences {s1}{s2}…{sh-1}{r}{t} and {s1}{s2}…{sh-1}{t}{r} can be generated.
Example: the frequent 3-sequences are {5}{1}{2}, {4}{5}{1}, {4}{5}{2}, {5}{3}{1}, {5}{3}{2}, {5}{1}{6} and {5}{2}{6}.
The generated candidate 4-sequences are {5}{1}{2}{6}, {5}{1}{6}{2}, {4}{5}{1}{2}, {4}{5}{2}{1}, {5}{3}{1}{2} and {5}{3}{2}{1}.
Of these, {4}{5}{2}{1} and {5}{3}{2}{1} are pruned.
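A sketch of this join, with sequences as tuples of itemset numbers. The function names are hypothetical. The pruning test is an assumption: the slide only lists which candidates are pruned, and here a candidate is dropped when its itemsets that belong to S appear in an order that contradicts S, which reproduces the slide's result for this example.

```python
from itertools import combinations


def is_subsequence(sub, seq):
    """True if the elements of `sub` appear in `seq` in the same order."""
    it = iter(seq)
    return all(x in it for x in sub)


def join_h_sequences(frequent_h, spec):
    """Join frequent h-sequences that share their first h-1 itemsets."""
    candidates = set()
    for a, b in combinations(frequent_h, 2):
        if a[:-1] != b[:-1]:
            continue
        prefix, r, t = a[:-1], a[-1], b[-1]
        # Condition from the slide: the prefix is a sub-sequence of S,
        # or both {r} and {t} are contained in S.
        if is_subsequence(prefix, spec) or (r in spec and t in spec):
            candidates.add(prefix + (r, t))
            candidates.add(prefix + (t, r))
    return candidates


def respects_order(candidate, spec):
    """Assumed pruning rule: the itemsets of S inside the candidate keep S's order."""
    projection = tuple(x for x in candidate if x in spec)
    return is_subsequence(projection, spec)


frequent_3 = [(5, 1, 2), (4, 5, 1), (4, 5, 2), (5, 3, 1), (5, 3, 2), (5, 1, 6), (5, 2, 6)]
candidates = join_h_sequences(frequent_3, (5, 1, 2))
kept = {c for c in candidates if respects_order(c, (5, 1, 2))}
print(sorted(kept))
# [(4, 5, 1, 2), (5, 1, 2, 6), (5, 1, 6, 2), (5, 3, 1, 2)]
```

The join yields the six candidates of the example, and the order check removes {4}{5}{2}{1} and {5}{3}{2}{1}.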
3.2 Query processing for Type II query
After pruning, the remaining candidate 4-sequences are {4}{5}{1}{2}, {5}{3}{1}{2}, {5}{1}{6}{2} and {5}{1}{2}{6}.
After scanning the 3-sequence database, the frequent 4-sequences are {4}{5}{1}{2}, {5}{3}{1}{2} and {5}{1}{2}{6}.
If frequent (m+1)-sequences are generated, then Step 2 needs to be performed. Otherwise, Step 3 is performed directly.
3.2 Query processing for Type II query
Step 2. The frequent (m+n+1)-sequences (n ≥ 1) that contain the specified sequence S are generated in each iteration. In the nth iteration, we use the frequent (m+n)-sequences to generate candidate (m+n+1)-sequences, scan the (m+n)-sequence database and the 1-sequence database to generate the (m+n+1)-sequence database, in which each record contains the candidate (m+n+1)-sequences but not their bit strings, and find the frequent (m+n+1)-sequences.
3.2 Query processing for Type II query
The method to generate candidate (m+n+1)-sequences is as follows. For every two frequent (m+n)-sequences S1={s1}{s2}…{si}{r}{si+1}…{sm+n-1} and S2={s1}{s2}…{sj}{t}{sj+1}…{sm+n-1} (i ≤ j), in which {r} is not contained in S2 and {t} is not contained in S1, a candidate (m+n+1)-sequence {s1}{s2}…{r}…{t}…{sm+n-1} can be generated.
To check whether a record contains a candidate, perform the sequential bit-string operations on the bit strings for the itemsets in the candidate (m+n+1)-sequence by scanning the 1-sequence database.
3.2 Query processing for Type II query
The frequent 4-sequences are {4}{5}{1}{2}, {5}{3}{1}{2} and {5}{1}{2}{6}.
According to Step 2, the generated candidate 5-sequences are {4}{5}{3}{1}{2}, {4}{5}{1}{2}{6} and {5}{3}{1}{2}{6}. After scanning the 1-sequence database, the frequent 5-sequences are {4}{5}{3}{1}{2}, {4}{5}{1}{2}{6} and {5}{3}{1}{2}{6}.
The candidate 6-sequence is {4}{5}{3}{1}{2}{6}. After scanning the 1-sequence database, the frequent 6-sequence is also {4}{5}{3}{1}{2}{6}, and no candidate 7-sequence is generated. Hence, the algorithm for mining frequent sequences terminates.
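The Step 2 join can be sketched as below, with sequences as tuples of itemset numbers. The function name is hypothetical. One point is an assumption: when {r} and {t} fall into the same gap of the shared part (i = j), the slide does not say which order to use, so this sketch generates both orders.

```python
from itertools import combinations


def join_extended(frequent):
    """Merge pairs of sequences that differ by exactly one itemset each."""
    candidates = set()
    for a, b in combinations(sorted(frequent), 2):
        for i, r in enumerate(a):
            if r in b:
                continue  # {r} must not be contained in S2
            common = a[:i] + a[i + 1:]
            for j, t in enumerate(b):
                if t in a or common != b[:j] + b[j + 1:]:
                    continue  # {t} must not be in S1; the remainders must agree
                if i == j:
                    # Assumption: same gap, so emit both orders.
                    candidates.add(common[:i] + (r, t) + common[i:])
                    candidates.add(common[:i] + (t, r) + common[i:])
                else:
                    (g1, x), (g2, y) = sorted([(i, r), (j, t)])
                    candidates.add(common[:g1] + (x,) + common[g1:g2] + (y,) + common[g2:])
    return candidates


frequent_4 = [(4, 5, 1, 2), (5, 3, 1, 2), (5, 1, 2, 6)]
candidates_5 = join_extended(frequent_4)
print(sorted(candidates_5))
# [(4, 5, 1, 2, 6), (4, 5, 3, 1, 2), (5, 3, 1, 2, 6)]
print(join_extended(candidates_5))
# {(4, 5, 3, 1, 2, 6)}
```

Both results agree with the candidate 5-sequences and the single candidate 6-sequence of the example. Counting support for each candidate would still require the sequential bit-string operations over the 1-sequence database.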
3.2 Query processing for Type II query
Step 3. For each frequent sequence, the code for each itemset in the frequent sequence is replaced with the itemset itself. If a frequent sequence is not contained in any other frequent sequence, then this frequent sequence is a sequential pattern.
For the above example, the frequent sequences that satisfy the user requirement in Query 1 are {E}{A}{B}, {D}{E}{A}{B}, {E}{C}{A}{B}, {E}{A}{B}{F}, {D}{E}{C}{A}{B}, {D}{E}{A}{B}{F}, {E}{C}{A}{B}{F} and {D}{E}{C}{A}{B}{F}, and the sequential pattern is {D}{E}{C}{A}{B}{F}.
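Step 3 can be sketched as a filter that keeps only the frequent sequences not contained in another one, followed by decoding. The function names are hypothetical, and containment is simplified to matching itemset numbers one to one, which is enough for this example because every itemset involved is a single item. The number-to-item mapping is the one implied by the example, where {5}{1}{2} stands for {E}{A}{B}.

```python
def is_subsequence(sub, seq):
    """True if the elements of `sub` appear in `seq` in the same order."""
    it = iter(seq)
    return all(x in it for x in sub)


def sequential_patterns(frequent):
    """Keep the frequent sequences that are not contained in any other one."""
    return [
        s for s in frequent
        if not any(s != other and is_subsequence(s, other) for other in frequent)
    ]


# Mapping implied by the example: {5}{1}{2} is {E}{A}{B}, and so on.
codes = {1: "A", 2: "B", 3: "C", 4: "D", 5: "E", 6: "F"}
frequent = [
    (5, 1, 2),
    (4, 5, 1, 2), (5, 3, 1, 2), (5, 1, 2, 6),
    (4, 5, 3, 1, 2), (4, 5, 1, 2, 6), (5, 3, 1, 2, 6),
    (4, 5, 3, 1, 2, 6),
]
patterns = sequential_patterns(frequent)
decoded = ["".join("{" + codes[c] + "}" for c in p) for p in patterns]
print(decoded)  # ['{D}{E}{C}{A}{B}{F}']
```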
Experimental results
The proposed algorithm outperforms the PrefixSpan algorithm.
CONCLUSION
The steps for generating candidate sequences are not consistent.
The notion is intuitive.