Download presentation
Presentation is loading. Please wait.
Published byBrandon Flynn Modified over 9 years ago
1
Pattern-Growth Methods for Sequential Pattern Mining Iris Zhang 2003-5-14
2
Outline Sequential pattern mining Apriori-like methods –GSP Pattern-growth methods –FreeSpan –PrefixSpan Performance analysis Conclusions
3
Motivation Sequential pattern mining: Finding time-related frequent patterns Most data and applications are time-related –Customer shopping patterns, telephone calling patterns –Natural disasters (e.g., earthquake, hurricane) –Disease and treatment –Stock market fluctuation –Weblog click stream analysis –DNA sequence analysis
4
Concepts Let I={i 1,i 2,…,i n } be a set of all items Itemset is a subset of items Sequence is an ordered list of itemset. itemsets are called elements. The number of items in the sequence is its length –e.g. A sequence = is called subsequence of =, denoted , if there exist integers 1 j 1 <j 2 <…<j n m such that a 1 b j1, a 2 b j2,…,a n b jn –e.g. is subsequence of
5
Concepts (con’t) Sequence database is a set of tuples, sid is a sequence_id, and s is a sequence. A tuple is said to contain a sequence if is a subsequence of s Support of is the number of tuples in the database containing If the support of no less than a threshold, it is called sequential pattern – is a sequential pattern given support threshold min_sup =2 SIDsequence 10 20 30 40
6
Problem definition Given a sequence database and min_sup threshold, the problem of sequential pattern mining is to find the complete set of sequential patterns in the database
7
Apriori-like methods Apriori property: If a sequence S is not frequent, then every super-sequence of S is not frequent –e.g. is infrequent, so do, GSP (Generalized Sequential Pattern) algorithm –Level-by-level do Generate candidate sequences Use Apriori property to prune candidates Scan database to collect support counts
8
GSP Mining Process 1 st scan: 8 cand. 6 length-1 seq. pat. 2 nd scan: 51 cand. 19 length-2 seq. pat. 10 cand. not in DB at all 3 rd scan: 46 cand. 19 length-3 seq. pat. 20 cand. not in DB at all 4 th scan: 8 cand. 6 length-4 seq. pat. 5 th scan: 1 cand. 1 length-5 seq. pat. … … … … Cand. cannot pass sup. threshold Cand. not in DB at all
9
Bottlenecks of Apriori-Like Methods Potentially huge set of candidate sequences –1,000 frequent length-1 sequences generate length-2 candidates Multiple scans of database Difficulties at mining long sequential patterns –Exponential number of short candidates –A length-100 sequential pattern needs candidate sequences
10
Pattern-growth methods A divide-and-conquer approach –Recursively project a sequence database into a set of smaller databases –Mine each projected database to find the subset of patterns Algorithms –FreeSpan: Frequent Pattern-Projected Sequential Pattern Mining –PrefixSpan: Prefix-Projected Sequential Pattern Mining
11
FreeSpan Example: given a sequence database S and min_support = 2 Step 1: find length-1 sequential patterns and list them in support descending order –f_list = a:4,b:4,c:4,d:3,e:3,f:3 SIDSequence 10 20 30 40
12
FreeSpan (con’t) Step 2: divide search space. The complete set of seq. pat. can be partitioned into 6 disjoint subsets: –ones only contain item a –ones contain item b but no items after b in f_list –ones contain item c but no items after c in f_list –ones contain item d but no items after d in f_list –ones contain item e but no items after e in f_list –ones contain item f find subsets of sequential patterns. They can be mined by constructing projected databases and mining each recursively
13
FreeSpan (con’t) Finding Seq. Patterns containing item b but no items after b in f_list – -projected database:,,, –Find all the length-2 seq. pat. containing item b but no items after b in f_list : :4, :2, :2 –Further partition and mining SIDSequence 10 20 30 40
14
From FreeSpan to PrefixSpan Freespan: –Projection-based: No candidate sequence needs to be generated –But, projection can be performed at any point in the sequence, and the projected sequences may not shrink much. For example, the size of f-projected database is the same as the original sequence database PrefixSpan –Projection-based –But only prefix-based projection: less projections and quickly shrinking sequences
15
PrefixSpan-concepts Suppose all items in an element are listed alphabetically. Given a sequence =, = (m n) Prefix: is the prefix of iff (1) e’ i =e i (i m-1) (2) e’ m e m (3) all items in (e m - e’ m ) are alphabetically after those in e’ m. –e.g. =, =, ’= Postfix: sequence =, = is called the postfix of w.r.t. prefix , where e’’ m =(e m -e’ m ), denoted as = . –e.g. = is the postfix of w.r.t. prefix
16
PrefixSpan-concepts (con’t) Projected database: let be a sequential pattern in S. -projected database, denoted s| , is the collection of postfixes of sequences in S w.r.t. prefix Support count in projected database: let be a sequential pattern in S, be a sequence having prefix . The support count of in -projected database is the number of sequence in s| such that .
17
PrefixSpan-process Step 1: find length-1 sequential patterns – :4, :4, :4, :3, :3, :3 Step 2: divide search space. The complete set of seq. pat. can be partitioned into 6 subsets: –ones having prefix ; –… –ones having prefix ; find subsets of sequential patterns. They can be mined by constructing projected databases and mining each recursively SIDSequence 10 20 30 40
18
PrefixSpan-Process (con’t) Finding Seq. Patterns with Prefix – -projected database:,,, –Find all the length-2 seq. pat. having prefix : :2, :4, :2, :4, :2, :2 –Further partition into 6 subsets Having prefix ; … Having prefix ; SIDSequence 10 20 30 40
19
Completeness of PrefixSpan SIDsequence 10 20 30 40 Length-1 sequential patterns,,,,, … prefix -projected database … prefix -projected database Length-2 seq. pan,,,,, prefix -proj. db prefix, …, …
20
Efficiency of PrefixSpan No candidate sequence needs to be generated Projected databases keep shrinking Major cost of PrefixSpan: constructing projected databases –Can be improved by bi-level projections and pseudo- projections
21
Optimization Techniques in PrefixSpan Single-level vs. bi-level projection –Bi-level projection with 3-way checking may reduce the number and size of projected databases Physical projection vs. pseudo-projection –Pseudo-projection may reduce the effort of projection when the projected database fits in main memory
22
S-matrix for sequence database Length-1 sequential patterns:,,,,, All length-2 sequential patterns are found in S-matrix S-matrix fedcba 1(2, 0, 1)(1, 1, 1)(1, 2, 1)(2, 2, 0)(2, 1, 1)f 0(1, 1, 0)(1, 2, 0) (1, 2, 1)e 0(1, 3, 0)(2, 2, 0)(2, 1, 1)d 3(3, 3, 2)(4, 2, 1)c 1(4, 2, 2)b 2a happens twice happens 4 times happens twice happens once
23
S-matrix for -projected database -projected database: –,, frequent items:,, S-matrix: a0 c(1, 0, 1)1 (_c) ( , 2, )( , 1, ) ac(_c) No a(_c), no count Lead to pattern SIDSequence 10 20 30 40
24
Scaling-up by Bi-level Projection Partition search space based on length-2 sequential patterns Only form projected databases and pursue recursive mining over bi-level projected databases
25
Benefits of Bi-level Projection More patterns are found in each shoot Much less projections –In the example, there are 53 patterns. –53 level-by-level projections –22 bi-level projections
26
3-way Apriori Checking Using Apriori heuristic to prune items in projected databases a2 b(4, 2, 2)1 c(4, 2, 1)(3, 3, 2)3 d(2, 1, 1)(2, 2, 0)(1, 3, 0)0 e(1, 2, 1)(1, 2, 0) (1, 1, 0)0 f(2, 1, 1)(2, 2, 0)(1, 2, 1)(1, 1, 1)(2, 0, 1)1 abcdef cannot be a pattern w.r.t. min_support=2 exclude d from -projected database
27
Pseudo-projection Major cost of PrefixSpan: projection –Postfixes of sequences often appear repeatedly in recursive projected databases When the projected database fit in memory, use pointers to form projections –Pointer to the sequence –Offset of the postfix s= s| : (, 2) s| : (, 4)
28
Pseudo-Projection vs. Physical Projection Pseudo-projection avoids physically copying postfixes –Efficient when database fits in main memory –Not efficient when database cannot fit in main memory Disk-based random accessing is very costly Suggested Approach: –Integration of physical and pseudo-projection –Swapping to pseudo-projection when the data set fits in memory
29
Experiments Synthetic datasets were generated using procedure described in R.Agrawal and R.Srikant. Mining sequential patterns. In Proc. 1995 ICDE’95 –number of items 1000 –number of sequences in the data set 10,000 –average number of items within elements 8 –average number of elements in a sequence 8
30
Experiments (con’t) Comparing PrefixSpan with GSP and FreeSpan in large databases –GSP (IBM Almaden, Srikant & Agrawal EDBT’96) –FreeSpan (J. Han J. Pei, B. Mortazavi-Asi, Q. Chen, U. Dayal, M.C. Hsu, KDD’00) –Prefix-Span-1 (single-level projection) –Prefix-Span-2 (bi-level projection) Comparing effects of pseudo-projection Comparing I/O cost and scalability
31
PrefixSpan Is Faster Than GSP and FreeSpan
32
Effect of Pseudo-Projection for projected database fit in memory
33
I/O Cost: When It Cannot Fit in Memory
34
Scalability (When DB Is Large) min_sup=0.2%
35
Conclusions Both PrefixSpan and FreeSpan are pattern-growth methods which perform better than Apriori-like methods for sequential pattern mining problem PrefixSpan is more elegant than FreeSpan –Apriori heuristic is integrated into bi-level projection in PrefixSpan –Pseudo-projection substantially enhances the performance of the memory-based processing
36
References J. Han, J. Pei, B. Mortazavi-Asl, Q. Chen, U. Dayal, and M.-C. Hsu. FreeSpan: Frequent pattern- projected sequential pattern mining. KDD'00, pages 355-359. J. Pei, J. Han, B. Mortazavi-Asl, H. Pinto, Q. Chen, U. Dayal, and M.-C. Hsu. PrefixSpan: Mining sequential patterns efficiently by prefix-projected pattern growth. ICDE'01, pages 215-224. R. Srikant and R. Agrawal. Mining sequential patterns: Generalizations and performance improvements. EDBT'96, pages 3-17.
37
Q&A
38
Thanks
Similar presentations
© 2025 SlidePlayer.com. Inc.
All rights reserved.