1
Mining Sequential Patterns
Presenters: Qian Bai, Jiguo Jiang 15/11/2018 Qian Bai, Jigou Jiang
2
Mining Sequential Patterns
Introduction
The Algorithm
  AprioriAll, AprioriSome, DynamicSome
Performance
Conclusions
3
Introduction
  Background
  Problem Statement
  An Example
  Related Work
4
Background
Customer purchase patterns
  Buy a computer, then buy software
  Rent “Star Wars”, then “Empire Strikes Back”, and then “Return of the Jedi”
  Buy “fitted sheet, flat sheet and pillow cases”, followed by “comforter”, and then “drapes and ruffles”
Web access patterns
  Open one page, then open another
5
Background (Continue)
The sequential pattern mining problem was first introduced by Agrawal and Srikant.
Definition: Given a set of sequences, where each sequence consists of a list of elements and each element consists of a set of items, and given a user-specified min-support threshold, sequential pattern mining finds all frequent subsequences, i.e., the subsequences whose occurrence frequency in the set of sequences is no less than min-support.
6
Problem Statement
After reading the three papers about “Mining Sequential Patterns”, we focus on a database D of customer transactions.
Each transaction consists of the following fields:
  Customer-id
  Transaction-time
  Items purchased in the transaction
Note: No customer has more than one transaction with the same transaction time. We do not consider quantities of items bought in a transaction.
7
Problem Statement (Continue)
Terminology:
Itemset: a non-empty set of items. Examples: (30, 40, 50), (60)
Sequence: an ordered list of itemsets. Example: < (30, 40, 50) (60) >
Sequence length: the number of itemsets in a sequence.
Contained: a sequence (a1, a2, …, aN) is contained in another sequence (b1, b2, …, bM) if there exist integers i1 < i2 < … < iN such that a1 ⊆ b_i1, a2 ⊆ b_i2, …, aN ⊆ b_iN.
  < (30) (40 50) > is contained in < (70) (30 80) (…) >
  < (30) (50) > is NOT contained in < (30 50) >
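The containment test defined above can be sketched in Python. This is a minimal illustration, not code from the presentation: the function name `is_contained` and the choice of representing a sequence as a list of Python sets are assumptions made for the example.

```python
def is_contained(a, b):
    """Return True if sequence a is contained in sequence b.

    Both sequences are lists of itemsets (Python sets). A greedy scan matches
    each itemset of a to the earliest itemset of b that is a superset of it.
    """
    pos = 0
    for itemset in a:
        # Advance until an itemset of b contains the current itemset of a.
        while pos < len(b) and not itemset <= b[pos]:
            pos += 1
        if pos == len(b):
            return False
        pos += 1  # the next itemset of a must match a strictly later itemset of b
    return True


print(is_contained([{30}, {40, 70}], [{30}, {40, 70}, {90}]))  # True
print(is_contained([{30}, {50}], [{30, 50}]))                  # False
```

The second call returns False because 30 and 50 appear in the same itemset of the longer sequence, while the shorter sequence requires them in two different, ordered itemsets.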
8
Problem Statement (Continue)
Terminology (Continue):
Maximal sequence: a sequence is maximal in a set of sequences if it is not contained in any other sequence of the set.
Support: a customer supports a sequence s if s is contained in the customer-sequence for this customer. The support for a sequence is the fraction of total customers who support it.
Litemset (large itemset): an itemset satisfying the minimum support.
Large sequence: a sequence satisfying the minimum support constraint.
9
Problem Statement (Continue)
Given a database D of customer transactions, the problem of mining sequential patterns is to find the maximal sequences among all sequences that have a certain user-specified minimum support. Each such maximal sequence represents a sequential pattern.
10
An Example
A database sorted by Customer ID and Transaction Time

Customer ID   Transaction Time   Items Bought
1             June 25 93         30
1             June 30 93         90
2             June 10 93         10, 20
2             June 15 93         30
2             June 20 93         40, 60, 70
3             …                  30, 50, 70
4             …                  30
4             …                  40, 70
4             July …             90
5             June 12 93         90
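Turning a transaction table like this one into customer sequences amounts to sorting by customer and time, then grouping. The sketch below is illustrative only: the function name `to_customer_sequences`, the tuple layout, and reading "93" as the year 1993 are assumptions, and only customers 1, 2 and 5 are used as sample rows.

```python
from datetime import date
from itertools import groupby
from operator import itemgetter

# (customer_id, transaction_time, items); deliberately given out of order.
transactions = [
    (2, date(1993, 6, 20), {40, 60, 70}),
    (1, date(1993, 6, 30), {90}),
    (5, date(1993, 6, 12), {90}),
    (2, date(1993, 6, 10), {10, 20}),
    (1, date(1993, 6, 25), {30}),
    (2, date(1993, 6, 15), {30}),
]


def to_customer_sequences(rows):
    """Sort by (customer id, time) and collect each customer's itemsets in order."""
    ordered = sorted(rows, key=itemgetter(0, 1))
    return {
        customer: [items for _, _, items in group]
        for customer, group in groupby(ordered, key=itemgetter(0))
    }


sequences = to_customer_sequences(transactions)
print(sequences[2])  # three itemsets in time order: {10, 20}, {30}, {40, 60, 70}
```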
11
An Example (Continue)
Customer-Sequence Version of the Database

Customer ID   Customer Sequence
1             < (30) (90) >
2             < (10 20) (30) (40 60 70) >
3             < (30 50 70) >
4             < (30) (40 70) (90) >
5             < (90) >

Sequential patterns with support > 25%
  < (30) (90) >      (supported by customers 1 and 4)
  < (30) (40 70) >   (supported by customers 2 and 4)

Note: Patterns are not necessarily contiguous. Some sequences, such as < (30) > and < (30) (40) >, have minimum support but are not in the answer because they are not maximal.
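The support of the two patterns can be checked with a short Python sketch. It is an illustration under stated assumptions: the names `is_contained` and `support` are hypothetical, sequences are modelled as lists of sets, and the itemsets for customers 2 and 3 are taken from the transaction table of the previous slide.

```python
def is_contained(a, b):
    """True if each itemset of a is a subset of a distinct, ordered itemset of b."""
    pos = 0
    for itemset in a:
        while pos < len(b) and not itemset <= b[pos]:
            pos += 1
        if pos == len(b):
            return False
        pos += 1
    return True


def support(sequence, customer_sequences):
    """Fraction of customers whose customer sequence contains the given sequence."""
    supporters = sum(is_contained(sequence, cs) for cs in customer_sequences)
    return supporters / len(customer_sequences)


customers = [
    [{30}, {90}],                    # customer 1
    [{10, 20}, {30}, {40, 60, 70}],  # customer 2
    [{30, 50, 70}],                  # customer 3
    [{30}, {40, 70}, {90}],          # customer 4
    [{90}],                          # customer 5
]

print(support([{30}, {90}], customers))      # 0.4
print(support([{30}, {40, 70}], customers))  # 0.4
print(support([{30}], customers))            # 0.8
```

Both patterns are supported by two of the five customers (40%), which is above the 25% threshold. The single-itemset sequence < (30) > has higher support but is contained in both patterns, so it is not maximal.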
12
Related Work
Differences between association rule mining in a customer transaction database and sequential pattern mining
Association rule mining:
  Finds what items are bought together
  Finds intra-transaction patterns
  Patterns are unordered sets of items
Sequential pattern mining:
  Finds what items are bought in different transactions
  Finds inter-transaction patterns
  Patterns are ordered lists of sets of items
13
Algorithm
Sort phase
  Sort the database with customer-id as the major key and transaction-time as the minor key
Litemset phase
  Scan the database to find the set of all litemsets (large 1-sequences) L1 based on the given minimum support
  Map the large itemsets to a set of contiguous integers by treating litemsets as single entities
  Example: {30} {40} {70} {40 70} {90} can be mapped to {1} {2} {3} {4} {5}
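A brute-force sketch of the litemset phase follows. It is not the method used in the papers, only an illustration of the definition: `find_litemsets` and `min_customers` are hypothetical names, every subset of every transaction is enumerated (exponential in transaction size), and each customer is counted at most once per itemset. The data is the five-customer example, where a minimum support of 25% means at least two customers.

```python
from collections import Counter
from itertools import combinations


def find_litemsets(customer_sequences, min_customers):
    """Return itemsets (as sorted tuples) supported by at least min_customers customers."""
    counts = Counter()
    for sequence in customer_sequences:
        seen = set()  # itemsets this customer supports, counted once per customer
        for transaction in sequence:
            items = sorted(transaction)
            for size in range(1, len(items) + 1):
                seen.update(combinations(items, size))
        counts.update(seen)
    return sorted(itemset for itemset, n in counts.items() if n >= min_customers)


customers = [
    [{30}, {90}],
    [{10, 20}, {30}, {40, 60, 70}],
    [{30, 50, 70}],
    [{30}, {40, 70}, {90}],
    [{90}],
]

litemsets = find_litemsets(customers, min_customers=2)
# Map each litemset to an integer id; the numbering itself is arbitrary.
litemset_ids = {itemset: i for i, itemset in enumerate(litemsets, start=1)}
print(litemsets)  # [(30,), (40,), (40, 70), (70,), (90,)]
```

The five litemsets agree with the example on the slide. The integer ids differ from the slide's numbering only because the sketch numbers them in sorted order.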
14
Algorithm (Continue)
Transformation phase
  Replace each transaction by the set of litemsets that it contains
  Delete customer sequences that contain no litemset, but keep the total number of customers unchanged when computing support
  Example: given that (30), (90), (40), (70) and (40 70) are the litemsets

ID   Before Transformed     After Transformed
1    {(30) (90)}            …
2    {(10 20) (…}           {(40) (70) (40 70)}
3    {(50)}                 …
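The transformation step can be sketched as follows. This is an illustration, not the presenters' code: the function name `transform` is hypothetical, the integer ids follow the mapping example given for the litemset phase, and the sample customer sequences come from the five-customer example earlier in the presentation.

```python
# Litemset -> integer id, following the mapping example of the litemset phase.
litemset_ids = {
    frozenset({30}): 1,
    frozenset({40}): 2,
    frozenset({70}): 3,
    frozenset({40, 70}): 4,
    frozenset({90}): 5,
}


def transform(customer_sequence, ids):
    """Replace each transaction by the ids of the litemsets it contains.

    Transactions containing no litemset are dropped; an empty result means the
    whole customer sequence would be dropped from the transformed database.
    """
    transformed = []
    for transaction in customer_sequence:
        contained = {i for litemset, i in ids.items() if litemset <= transaction}
        if contained:
            transformed.append(contained)
    return transformed


print(transform([{10, 20}, {30}, {40, 60, 70}], litemset_ids))  # [{1}, {2, 3, 4}]
print(transform([{30, 50, 70}], litemset_ids))                  # [{1, 3}]
```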
15
Algorithm (Continue)
Sequence phase
  Find the frequent sequences
  Three algorithms: AprioriAll, AprioriSome, DynamicSome
Maximal phase
  Delete sequences that are subsequences of other large sequences
  This phase is combined with the sequence phase in the AprioriSome and DynamicSome algorithms
  Example: given the large sequences {1} {2} {3} {4} {1 2} {1 3} {1 2 3}, the maximal sequences are {4} and {1 2 3}
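The maximal phase can be illustrated with a few lines of Python. This is a simple quadratic sketch, not the procedure from the papers: the names `is_subsequence` and `maximal_sequences` are hypothetical, and each large sequence is modelled as a tuple of litemset ids.

```python
def is_subsequence(a, b):
    """True if the ids of a appear in b in the same order (not necessarily adjacent)."""
    remaining = iter(b)
    return all(x in remaining for x in a)


def maximal_sequences(large):
    """Keep only the sequences that are not contained in another large sequence."""
    return {
        s for s in large
        if not any(s != t and is_subsequence(s, t) for t in large)
    }


large = {(1,), (2,), (3,), (4,), (1, 2), (1, 3), (1, 2, 3)}
print(sorted(maximal_sequences(large)))  # [(1, 2, 3), (4,)]
```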
16
Algorithm AprioriAll
Main idea
  All subsequences of a frequent sequence must be frequent sequences too
  If a sequence is not frequent, then none of its supersequences can be frequent
Example
  If {1 2 3} is a frequent sequence, then {1} {2} {3} {1 2} {1 3} {2 3} must be frequent sequences
  If {1} is not a frequent sequence, then {1 2} {1 3} … are not frequent sequences
17
AprioriAll (Continue)
Step 1: k = 2
Step 2: form Ck using the Apriori-generate function
Step 3: scan the database and generate Lk from Ck based on the minimum support
Step 4: if Lk is not empty, set k = k + 1, then repeat steps 2 and 3
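The loop above can be sketched end to end in Python. This is a compact illustration under stated assumptions rather than a faithful implementation: the names `contains`, `support`, `generate` and `apriori_all` are hypothetical, sequences are tuples of litemset ids, L1 is obtained here by counting single ids directly, and the database is the five mapped customer sequences used in the AprioriSome example later in this presentation, with min_sup = 2 customers.

```python
def contains(candidate, customer):
    """True if the ids of candidate occur, in order, in strictly later transactions."""
    pos = 0
    for litemset_id in candidate:
        while pos < len(customer) and litemset_id not in customer[pos]:
            pos += 1
        if pos == len(customer):
            return False
        pos += 1
    return True


def support(candidate, db):
    """Number of customers whose sequence contains the candidate."""
    return sum(contains(candidate, customer) for customer in db)


def generate(prev):
    """Join sequences sharing all but the last id, then prune by the Apriori property."""
    joined = {p + (q[-1],) for p in prev for q in prev
              if p != q and p[:-1] == q[:-1]}
    return {c for c in joined
            if all(c[:i] + c[i + 1:] in prev for i in range(len(c)))}


def apriori_all(db, min_sup):
    """Return {k: Lk} for every k with a non-empty set of large k-sequences."""
    ids = {(i,) for customer in db for transaction in customer for i in transaction}
    large = {1: {c for c in ids if support(c, db) >= min_sup}}
    k = 2
    while large[k - 1]:
        candidates = generate(large[k - 1])
        large[k] = {c for c in candidates if support(c, db) >= min_sup}
        k += 1
    return {length: seqs for length, seqs in large.items() if seqs}


db = [
    [{1, 5}, {2}, {3}, {4}],
    [{1}, {3}, {4}, {3, 5}],
    [{1}, {2}, {3}, {4}],
    [{1}, {3}, {5}],
    [{4}, {5}],
]
large = apriori_all(db, min_sup=2)
print(sorted(large[3]))  # [(1, 2, 3), (1, 2, 4), (1, 3, 4), (1, 3, 5), (2, 3, 4)]
print(large[4])          # {(1, 2, 3, 4)}
```

Every large sequence of every length is counted here, including the non-maximal ones; the maximal phase would still have to be applied to the result.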
18
AprioriAll (Continue)
Apriori-generate
Join two sequences in Lk-1 to generate Ck
Step 1: for each pair of sequences in Lk-1 that share the same first k-2 litemsets, take the k-1 litemsets of the first sequence and append the last litemset of the other sequence
Step 2: delete every sequence in Ck that has a subsequence of length k-1 not in Lk-1
Example
  Given L3 = {1 2 3} {2 3 4} {1 2 4} {1 3 4} {1 3 5}
  Step 1: C4 = {1 2 3 4} {1 2 4 3} {1 3 4 5} {1 3 5 4}
  Step 2: C4 = {1 2 3 4}
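The join and prune steps can be traced on the L3 of this example with a short Python sketch. The function names `join_step` and `prune_step` are hypothetical, and sequences are modelled as tuples of litemset ids.

```python
def join_step(prev):
    """Join every ordered pair of distinct sequences that share all but the last id."""
    return {p + (q[-1],) for p in prev for q in prev
            if p != q and p[:-1] == q[:-1]}


def prune_step(candidates, prev):
    """Drop candidates having a subsequence (one id removed) that is not in prev."""
    return {c for c in candidates
            if all(c[:i] + c[i + 1:] in prev for i in range(len(c)))}


L3 = {(1, 2, 3), (2, 3, 4), (1, 2, 4), (1, 3, 4), (1, 3, 5)}
joined = join_step(L3)
C4 = prune_step(joined, L3)
print(sorted(joined))  # [(1, 2, 3, 4), (1, 2, 4, 3), (1, 3, 4, 5), (1, 3, 5, 4)]
print(C4)              # {(1, 2, 3, 4)}
```

For instance, (1, 3, 4, 5) is pruned because its subsequence (3, 4, 5) is not in L3.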
19
AprioriAll (Continue)
Example: min_sup = 3
Large sequences = {1 2 3} {1 4}
ID: 1 2 3 4 5
Mapping Seq.: ({1}{4}) ({1}{2 3}) ({1 2}{2 3}) ({1}{2 3}{4})
1-seq.: {1} {2} {3} {4}   Sup.: 5 3
2-seq.: {1 2} {1 3} {1 4} {2 3} {2 4} {3 4}   Sup.: 3 1
3-seq.: {1 2 3}   Sup.: 3
20
AprioriSome
Intuition: the subsequences of a frequent sequence will not be among the final maximal sequences, so counting them can be avoided by counting longer sequences first
Example: suppose {2 3} {3 4} {1 2} {1 2 3} are frequent sequences; then the final maximal sequences are {3 4} and {1 2 3}
21
AprioriSome (Continue)
Step 1: set C1 = L1, last = 1, k = 2
Step 2: forward phase
  Step 2.1: generate Ck from Lk-1 if it has been counted, otherwise from Ck-1
  Step 2.2: if k = next(last), scan the database to generate Lk based on the minimum support, and set last = k
  Step 2.3: if both Ck and Llast are not empty, increase k by 1 and repeat from step 2.1
Step 3: backward phase
  Step 3.1: decrease k by 1. If Lk is empty (it was not counted in the forward phase), delete the sequences in Ck that are contained in some Li with i > k, then scan the database again to generate Lk based on the given minimum support. If Lk is not empty, delete the sequences in Lk that are contained in some Li with i > k.
  Step 3.2: if k > 1, repeat from step 3.1
Step 4: take the union of the sequences remaining in all Lk
22
AprioriSome (Continue)
Efficiency: depends heavily on the next(k) function
Tradeoff between counting non-maximal sequences and counting extensions of small candidate sequences
A special case: next(k) = k + 1, in which every candidate length is counted
Example: the next value of k is chosen based on the ratio of the number of sequences in Lk to the number of candidates in Ck
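A ratio-based next function of the kind described above might look like the following sketch. The function name `next_length`, its signature and the threshold values are illustrative assumptions, not values stated in this presentation: the idea is only that a high hit ratio |Lk| / |Ck| justifies skipping more lengths.

```python
def next_length(k, num_large, num_candidates):
    """Pick the next sequence length to count from the hit ratio |Lk| / |Ck|.

    The thresholds below are illustrative assumptions. A low ratio means many
    candidates were not large, so the next length is counted right away; a high
    ratio means candidates are reliable, so several lengths can be skipped.
    """
    if num_candidates == 0:
        return k + 1
    hit = num_large / num_candidates
    if hit < 0.666:
        return k + 1
    if hit < 0.75:
        return k + 2
    if hit < 0.80:
        return k + 3
    if hit < 0.85:
        return k + 4
    return k + 5


print(next_length(2, 9, 20))  # 3
print(next_length(2, 9, 10))  # 7
```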
23
AprioriSome (Continue)
Example: next(k) = 2k, min_sup = 2
Answers: {1 2 3 4} {1 3 5} {4 5}

ID   Mapping Seq.
1    ({1 5}{2}{3}{4})
2    ({1}{3}{4}{3 5})
3    ({1}{2}{3}{4})
4    ({1}{3}{5})
5    ({4}{5})

1-seq.   Sup.
{1}      4
{2}      2
{3}      4
{4}      4
{5}      4

2-seq.   Sup.
{1 2}    2
{1 3}    4
{1 4}    3
{1 5}    2
{2 3}    2
{2 4}    2
{2 5}    0
{3 4}    3
{3 5}    2
{4 5}    2

Forward phase
3-seq. candidates (not counted): {1 2 3} {1 2 4} {1 3 4} {1 3 5} {2 3 4} {1 4 5} {3 4 5}
4-seq.       Sup.
{1 2 3 4}    2
{1 3 4 5}    1

Backward phase
3-seq.     Sup.
{1 3 5}    2
{3 4 5}    1
{1 4 5}    1
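The forward and backward phases can be traced on this example with a short Python sketch. It is an illustration under stated assumptions, not the authors' code: all function names are hypothetical, sequences are tuples of litemset ids, L1 is obtained by counting single ids directly, and the backward phase compares against the maximal sequences found so far (any longer large sequence is contained in one of them).

```python
def contains(candidate, customer):
    """True if the ids of candidate occur, in order, in strictly later transactions."""
    pos = 0
    for litemset_id in candidate:
        while pos < len(customer) and litemset_id not in customer[pos]:
            pos += 1
        if pos == len(customer):
            return False
        pos += 1
    return True


def count_large(candidates, db, min_sup):
    return {c for c in candidates
            if sum(contains(c, customer) for customer in db) >= min_sup}


def generate(prev):
    joined = {p + (q[-1],) for p in prev for q in prev
              if p != q and p[:-1] == q[:-1]}
    return {c for c in joined
            if all(c[:i] + c[i + 1:] in prev for i in range(len(c)))}


def is_subsequence(a, b):
    remaining = iter(b)
    return all(x in remaining for x in a)


def apriori_some(db, min_sup, next_k):
    ids = {(i,) for customer in db for transaction in customer for i in transaction}
    C = {1: ids}
    L = {1: count_large(ids, db, min_sup)}
    last, k = 1, 2
    # Forward phase: count only the lengths selected by next_k.
    while True:
        C[k] = generate(L[k - 1] if k - 1 in L else C[k - 1])
        if not C[k]:
            break
        if k == next_k(last):
            L[k] = count_large(C[k], db, min_sup)
            last = k
            if not L[k]:
                break
        k += 1
    # Backward phase: count skipped lengths, ignoring what is already covered.
    maximal = set()
    for j in range(max(C), 0, -1):
        if j not in L:
            todo = {c for c in C[j] if not any(is_subsequence(c, m) for m in maximal)}
            L[j] = count_large(todo, db, min_sup)
        maximal |= {c for c in L[j] if not any(is_subsequence(c, m) for m in maximal)}
    return maximal


db = [
    [{1, 5}, {2}, {3}, {4}],
    [{1}, {3}, {4}, {3, 5}],
    [{1}, {2}, {3}, {4}],
    [{1}, {3}, {5}],
    [{4}, {5}],
]
print(sorted(apriori_some(db, min_sup=2, next_k=lambda k: 2 * k)))
# [(1, 2, 3, 4), (1, 3, 5), (4, 5)]
```

With next(k) = 2k, length 3 is skipped in the forward phase; in the backward phase only the 3-sequences not contained in {1 2 3 4} are counted.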
24
DynamicSome
Intuition: same idea as AprioriSome
Differences between the two algorithms

AprioriSome                          DynamicSome
k = next(last)                       k = k + step
Ck generated from Lk-1 or Ck-1       Ck+step = otf-generate(Lk, Lstep, c)
Two phases: forward, backward        Three phases: forward, backward and intermediate
Initialize: L1                       Initialize: L1 to Lstep
25
DynamicSome (Continue)
Step 1: generate L1 to Lstep based on the Apriori algorithm
Step 2: forward phase
  Step 2.1: set k = step
  Step 2.2: scan the database to generate Ck+step using otf-generate(Lk, Lstep, c), and then generate Lk+step from Ck+step based on the given minimum support
  Step 2.3: if Lk+step is not empty, set k = k + step and repeat from step 2.2
Step 3: intermediate phase
  Generate all the missing Ck based on Lk-1 or Ck-1
Step 4: backward phase, which is the same as that of AprioriSome
26
DynamicSome (Continue)
On-the-fly candidate generation
Inputs: a customer sequence c = <c1 c2 … cn>, and the large sequences Lk and Lj
Xk = subseq(Lk, c)
For all sequences x in Xk: x.end = min{ j | x is contained in <c1 c2 … cj> }
Xj = subseq(Lj, c)
For all sequences x in Xj: x.start = max{ j | x is contained in <cj cj+1 … cn> }
Answer = join of Xk with Xj, with the join condition Xk.end < Xj.start
27
DynamicSome (Continue)
Example
c = <{1} {2} {3 7} {4}>
L2 = <1 2> <1 3> <3 4>

Seq.     End   Start
<1 2>    2     1
<1 3>    3     1
<3 4>    4     3

Only <1 2>.end = 2 is smaller than <3 4>.start = 3, thus result = <1 2 3 4>
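The on-the-fly generation can be reproduced on this example with a small Python sketch. It is an illustration only: the names `earliest_end`, `latest_start` and `otf_generate` are hypothetical, sequences are tuples of litemset ids, the customer sequence is a list of sets, and positions are 1-based as on the slide.

```python
def earliest_end(seq, c):
    """Smallest j such that seq is contained in <c1 ... cj>, or None if not contained."""
    pos = 0
    for litemset_id in seq:
        while pos < len(c) and litemset_id not in c[pos]:
            pos += 1
        if pos == len(c):
            return None
        pos += 1
    return pos


def latest_start(seq, c):
    """Largest j such that seq is contained in <cj ... cn>, or None if not contained."""
    end_in_reverse = earliest_end(seq[::-1], c[::-1])
    return None if end_in_reverse is None else len(c) - end_in_reverse + 1


def otf_generate(large_k, large_j, c):
    """Join x from large_k with y from large_j whenever x ends before y starts in c."""
    ends = {x: earliest_end(x, c) for x in large_k}
    starts = {y: latest_start(y, c) for y in large_j}
    return {x + y
            for x, end in ends.items() if end is not None
            for y, start in starts.items() if start is not None and end < start}


c = [{1}, {2}, {3, 7}, {4}]
L2 = {(1, 2), (1, 3), (3, 4)}
print(otf_generate(L2, L2, c))  # {(1, 2, 3, 4)}
```

Scanning the reversed sequence greedily gives the latest possible start, which mirrors how the forward greedy scan gives the earliest possible end.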
28
DynamicSome (Continue)
Example: step = 2, min_sup = 2
Answers: {1 2 3 4} {1 3 5} {4 5}

1-seq.   Sup.
{1}      4
{2}      2
{3}      4
{4}      4
{5}      4

2-seq.   Sup.
{1 3}    4
{1 2}    2
{1 4}    3
{1 5}    2
{2 3}    2
{2 4}    2
{2 5}    0
{3 4}    3
{3 5}    2
{4 5}    2

4-seq.       Sup.
<1 2 3 4>    2
<1 3 4 5>    1

3-seq.: <1 2 3> <1 2 4> <1 3 4> <1 3 5> <3 4 5>
  <1 2 3>, <1 2 4> and <1 3 4> are contained in <1 2 3 4> and are not counted
  <1 3 5>   Sup. 2
  <3 4 5>   Sup. 1
29
Performance
30
Performance (Continue)
Note: The result of DynamicSome was not plotted for low values of minimum support since it generated too many candidates and ran out of memory.
31
Performance (Continue)
32
Performance (Continue)
33
Performance (Continue)
34
Conclusions
The problem of mining sequential patterns from a database of customer transactions was introduced, and three algorithms for solving this problem were presented.
Two of the algorithms, AprioriSome and AprioriAll, have comparable performance, although AprioriSome performs a little better for the lower values of the minimum support.
Scale-up experiments show that both AprioriSome and AprioriAll scale linearly with the number of customer transactions.
Questions?