Pattern-Growth Methods for Sequential Pattern Mining Iris Zhang
Outline Sequential pattern mining Apriori-like methods –GSP Pattern-growth methods –FreeSpan –PrefixSpan Performance analysis Conclusions
Motivation Sequential pattern mining: Finding time-related frequent patterns Most data and applications are time-related –Customer shopping patterns, telephone calling patterns –Natural disasters (e.g., earthquake, hurricane) –Disease and treatment –Stock market fluctuation –Weblog click stream analysis –DNA sequence analysis
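Concepts Let $I = \{i_1, i_2, \ldots, i_n\}$ be the set of all items. An itemset is a subset of items. A sequence is an ordered list of itemsets; its itemsets are called elements. The number of items in a sequence is its length –e.g. <a(abc)(ac)d(cf)> has 5 elements and length 9. A sequence $\alpha = \langle a_1 a_2 \ldots a_n \rangle$ is called a subsequence of $\beta = \langle b_1 b_2 \ldots b_m \rangle$, denoted $\alpha \sqsubseteq \beta$, if there exist integers $1 \le j_1 < j_2 < \ldots < j_n \le m$ such that $a_1 \subseteq b_{j_1}, a_2 \subseteq b_{j_2}, \ldots, a_n \subseteq b_{j_n}$ –e.g. <a(bc)dc> is a subsequence of <a(abc)(ac)d(cf)>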
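The subsequence test above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `parse` helper, the single-letter item convention, and the function names are hypothetical and not from the slides. Greedy left-to-right matching is enough here, because matching each element as early as possible never rules out a later match.

```python
def parse(text):
    """Parse 'a(abc)(ac)d(cf)' into a list of frozensets (assumes single-letter items)."""
    elements, i = [], 0
    while i < len(text):
        if text[i] == "(":
            j = text.index(")", i)
            elements.append(frozenset(text[i + 1:j]))
            i = j + 1
        else:
            elements.append(frozenset(text[i]))
            i += 1
    return elements


def length(seq):
    """Length of a sequence = total number of items over all elements."""
    return sum(len(element) for element in seq)


def is_subsequence(alpha, beta):
    """True if each element of alpha is a subset of a distinct, later element of beta."""
    j = 0
    for a in alpha:
        # Advance to the first remaining element of beta that contains a.
        while j < len(beta) and not a <= beta[j]:
            j += 1
        if j == len(beta):
            return False
        j += 1  # the next element of alpha must match strictly later
    return True


s = parse("a(abc)(ac)d(cf)")
print(length(s))                              # 9
print(is_subsequence(parse("a(bc)dc"), s))    # True
print(is_subsequence(parse("da"), s))         # False: no a occurs after d
```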
Concepts (cont’d) A sequence database is a set of tuples <sid, s>, where sid is a sequence_id and s is a sequence. A tuple <sid, s> is said to contain a sequence $\alpha$ if $\alpha$ is a subsequence of s. The support of $\alpha$ is the number of tuples in the database containing $\alpha$. If the support of $\alpha$ is no less than a threshold min_sup, $\alpha$ is called a sequential pattern –e.g. <(ab)c> is a sequential pattern given support threshold min_sup = 2
SID  Sequence
10   <a(abc)(ac)d(cf)>
20   <(ad)c(bc)(ae)>
30   <(ef)(ab)(df)cb>
40   <eg(af)cbc>
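Support counting follows directly from the containment test. The sketch below assumes the four-sequence running example of the PrefixSpan paper (taken here to be the database on these slides); `parse`, `is_subsequence`, and `support` are illustrative names.

```python
def parse(text):
    """Parse 'a(abc)(ac)d(cf)' into a list of frozensets (assumes single-letter items)."""
    elements, i = [], 0
    while i < len(text):
        if text[i] == "(":
            j = text.index(")", i)
            elements.append(frozenset(text[i + 1:j]))
            i = j + 1
        else:
            elements.append(frozenset(text[i]))
            i += 1
    return elements


def is_subsequence(alpha, beta):
    """Greedy containment test: elements of alpha map to later and later elements of beta."""
    j = 0
    for a in alpha:
        while j < len(beta) and not a <= beta[j]:
            j += 1
        if j == len(beta):
            return False
        j += 1
    return True


# Assumed running example: sid -> sequence.
DB = {
    10: parse("a(abc)(ac)d(cf)"),
    20: parse("(ad)c(bc)(ae)"),
    30: parse("(ef)(ab)(df)cb"),
    40: parse("eg(af)cbc"),
}


def support(pattern, db=DB):
    """Number of tuples <sid, s> whose sequence s contains the pattern."""
    return sum(is_subsequence(pattern, s) for s in db.values())


print(support(parse("(ab)c")))   # 2, contained in sequences 10 and 30
print(support(parse("g")))       # 1, so <g> is not a pattern for min_sup = 2
```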
Problem definition Given a sequence database and min_sup threshold, the problem of sequential pattern mining is to find the complete set of sequential patterns in the database
Apriori-like methods Apriori property: if a sequence S is not frequent, then no super-sequence of S is frequent –e.g. if <hb> is infrequent, so are <hab> and <(ah)b> GSP (Generalized Sequential Pattern) algorithm –Level by level, do: generate candidate sequences; use the Apriori property to prune candidates; scan the database to collect support counts
GSP Mining Process
–1st scan: 8 cand., 6 length-1 seq. pat.
–2nd scan: 51 cand., 19 length-2 seq. pat., 10 cand. not in DB at all
–3rd scan: 46 cand., 19 length-3 seq. pat., 20 cand. not in DB at all
–4th scan: 8 cand., 6 length-4 seq. pat.
–5th scan: 1 cand., 1 length-5 seq. pat.
Figure legend: cand. cannot pass sup. threshold; cand. not in DB at all
Bottlenecks of Apriori-Like Methods Potentially huge set of candidate sequences –1,000 frequent length-1 sequences generate $1000 \times 1000 + \frac{1000 \times 999}{2} = 1{,}499{,}500$ length-2 candidates Multiple scans of the database Difficulty mining long sequential patterns –Exponential number of short candidates –A length-100 sequential pattern needs $\sum_{i=1}^{100} \binom{100}{i} = 2^{100} - 1 \approx 10^{30}$ candidate sequences
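The candidate blow-up is easy to check numerically. In the sketch below (function names are illustrative), n frequent items give n × n candidates of the form <xy> plus n(n − 1)/2 candidates of the form <(xy)>, and a length-k pattern has 2^k − 1 non-empty sub-patterns.

```python
from itertools import combinations
from math import comb


def length2_candidates(items):
    """Enumerate length-2 candidates built from frequent items."""
    seq_ext = [f"<{x}{y}>" for x in items for y in items]          # x then y, x == y allowed
    set_ext = [f"<({x}{y})>" for x, y in combinations(items, 2)]   # x and y in one element
    return seq_ext + set_ext


def n_length2_candidates(n):
    """Closed form for the number of length-2 candidates from n frequent items."""
    return n * n + n * (n - 1) // 2


def n_subpatterns(k):
    """Number of non-empty sub-patterns of a length-k pattern."""
    return sum(comb(k, i) for i in range(1, k + 1))


print(len(length2_candidates("abcdef")))   # 51
print(n_length2_candidates(1000))          # 1499500
print(n_subpatterns(100) == 2**100 - 1)    # True
```

With 6 frequent items this gives the 51 candidates of a second GSP scan, and 2^100 − 1 is about 1.27 × 10^30.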
Pattern-growth methods A divide-and-conquer approach –Recursively project a sequence database into a set of smaller databases –Mine each projected database to find the subset of patterns Algorithms –FreeSpan: Frequent Pattern-Projected Sequential Pattern Mining –PrefixSpan: Prefix-Projected Sequential Pattern Mining
FreeSpan Example: given a sequence database S and min_support = 2 Step 1: find length-1 sequential patterns and list them in support-descending order –f_list = a:4, b:4, c:4, d:3, e:3, f:3
SID  Sequence
10   <a(abc)(ac)d(cf)>
20   <(ad)c(bc)(ae)>
30   <(ef)(ab)(df)cb>
40   <eg(af)cbc>
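Step 1 amounts to one scan that counts each item once per sequence. The sketch below assumes the four-sequence running example of the PrefixSpan paper; `parse` and `f_list` are illustrative names, and ties in support are broken alphabetically as an assumption.

```python
from collections import Counter


def parse(text):
    """Parse 'a(abc)(ac)d(cf)' into a list of frozensets (assumes single-letter items)."""
    elements, i = [], 0
    while i < len(text):
        if text[i] == "(":
            j = text.index(")", i)
            elements.append(frozenset(text[i + 1:j]))
            i = j + 1
        else:
            elements.append(frozenset(text[i]))
            i += 1
    return elements


# Assumed running example database.
DB = [parse(s) for s in ("a(abc)(ac)d(cf)", "(ad)c(bc)(ae)", "(ef)(ab)(df)cb", "eg(af)cbc")]


def f_list(db, min_sup):
    """Frequent items with their supports, in support-descending order."""
    counts = Counter()
    for seq in db:
        counts.update(set().union(*seq))   # each item counts once per sequence
    frequent = [(item, n) for item, n in counts.items() if n >= min_sup]
    return sorted(frequent, key=lambda pair: (-pair[1], pair[0]))


print(f_list(DB, 2))
# [('a', 4), ('b', 4), ('c', 4), ('d', 3), ('e', 3), ('f', 3)]
```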
FreeSpan (cont’d) Step 2: divide the search space. The complete set of seq. pat. can be partitioned into 6 disjoint subsets: –ones containing only item a –ones containing item b but no items after b in f_list –ones containing item c but no items after c in f_list –ones containing item d but no items after d in f_list –ones containing item e but no items after e in f_list –ones containing item f Find each subset of sequential patterns by constructing the corresponding projected database and mining it recursively
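One way to see why the six subsets are disjoint and complete: every pattern belongs to exactly the subset named by its lowest-ranked item in f_list. A minimal sketch, with `subset_of` as a hypothetical helper and patterns written as plain strings:

```python
F_LIST = ["a", "b", "c", "d", "e", "f"]   # support-descending order from Step 1


def subset_of(pattern, f_list=F_LIST):
    """Return the f_list item naming the subset a pattern falls into.

    The pattern is a string such as '(ab)dc'; parentheses are ignored,
    and only frequent items are expected to appear.
    """
    items = [ch for ch in pattern if ch.isalpha()]
    return max(items, key=f_list.index)


print(subset_of("aa"))       # a
print(subset_of("(ab)"))     # b
print(subset_of("(ab)dc"))   # d
```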
FreeSpan (cont’d) Finding seq. patterns containing item b but no items after b in f_list –<b>-projected database: <a(ab)a>, <aba>, <(ab)b>, <ab> –Find all the length-2 seq. pat. containing item b but no items after b in f_list: <ab>:4, <ba>:2, <(ab)>:2 –Further partition and mining
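A rough sketch of this FreeSpan-style projection: keep the sequences that contain the item, and strip every item ranked after it in f_list. The database is assumed to be the four-sequence running example of the PrefixSpan paper; `project_on_item` and the other helper names are illustrative, not from the FreeSpan paper.

```python
def parse(text):
    """Parse 'a(abc)(ac)d(cf)' into a list of frozensets (assumes single-letter items)."""
    elements, i = [], 0
    while i < len(text):
        if text[i] == "(":
            j = text.index(")", i)
            elements.append(frozenset(text[i + 1:j]))
            i = j + 1
        else:
            elements.append(frozenset(text[i]))
            i += 1
    return elements


def fmt(seq):
    """Render a sequence back to the compact text form."""
    return "".join(
        "".join(sorted(e)) if len(e) == 1 else "(" + "".join(sorted(e)) + ")"
        for e in seq
    )


def is_subsequence(alpha, beta):
    """Greedy containment test."""
    j = 0
    for a in alpha:
        while j < len(beta) and not a <= beta[j]:
            j += 1
        if j == len(beta):
            return False
        j += 1
    return True


DB = [parse(s) for s in ("a(abc)(ac)d(cf)", "(ad)c(bc)(ae)", "(ef)(ab)(df)cb", "eg(af)cbc")]
F_LIST = ["a", "b", "c", "d", "e", "f"]


def project_on_item(db, f_list, item):
    """Sequences containing `item`, with all items after `item` in f_list removed."""
    allowed = set(f_list[: f_list.index(item) + 1])
    projected = []
    for seq in db:
        if any(item in element for element in seq):
            kept = [element & allowed for element in seq]
            projected.append([element for element in kept if element])  # drop emptied elements
    return projected


b_db = project_on_item(DB, F_LIST, "b")
print([fmt(s) for s in b_db])   # ['a(ab)a', 'aba', '(ab)b', 'ab']
for candidate in ("ab", "ba", "(ab)"):
    print(candidate, sum(is_subsequence(parse(candidate), s) for s in b_db))
```

Counting the three candidates in the projected database gives 4, 2, and 2.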
From FreeSpan to PrefixSpan FreeSpan: –Projection-based: no candidate sequence needs to be generated –But projection can be performed at any point in the sequence, and the projected sequences may not shrink much. For example, the f-projected database is the same size as the original sequence database PrefixSpan: –Projection-based –But only prefix-based projection: fewer projections and quickly shrinking sequences
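PrefixSpan-concepts Suppose all items in an element are listed alphabetically. Given sequences $\alpha = \langle e_1 e_2 \ldots e_n \rangle$ and $\beta = \langle e'_1 e'_2 \ldots e'_m \rangle$ ($m \le n$) Prefix: $\beta$ is a prefix of $\alpha$ iff (1) $e'_i = e_i$ for $i \le m-1$; (2) $e'_m \subseteq e_m$; (3) all items in $(e_m - e'_m)$ are alphabetically after those in $e'_m$ –e.g. for $\alpha$ = <a(abc)(ac)d(cf)>, both <aa> and <a(ab)> are prefixes, but <ab> is not Postfix: the sequence $\gamma = \langle e''_m e_{m+1} \ldots e_n \rangle$ is called the postfix of $\alpha$ w.r.t. prefix $\beta$, where $e''_m = (e_m - e'_m)$, denoted $\gamma = \alpha / \beta$ –e.g. <(_c)(ac)d(cf)> is the postfix of $\alpha$ w.r.t. prefix <a(ab)>, where (_c) means that c shares an element with the last item of the prefix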
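The three prefix conditions and the postfix construction translate almost literally into code. The sketch below is illustrative only: the helper names and the `(_x)` rendering convention are assumptions, and items are single letters.

```python
def parse(text):
    """Parse 'a(abc)(ac)d(cf)' into a list of frozensets (assumes single-letter items)."""
    elements, i = [], 0
    while i < len(text):
        if text[i] == "(":
            j = text.index(")", i)
            elements.append(frozenset(text[i + 1:j]))
            i = j + 1
        else:
            elements.append(frozenset(text[i]))
            i += 1
    return elements


def fmt(seq):
    """Render a sequence back to the compact text form."""
    return "".join(
        "".join(sorted(e)) if len(e) == 1 else "(" + "".join(sorted(e)) + ")"
        for e in seq
    )


def is_prefix(beta, alpha):
    """Check conditions (1)-(3) of the prefix definition."""
    m = len(beta)
    if m == 0 or m > len(alpha):
        return False
    if beta[:m - 1] != alpha[:m - 1]:        # (1) first m-1 elements identical
        return False
    if not beta[-1] <= alpha[m - 1]:         # (2) last element is a subset
        return False
    leftover = alpha[m - 1] - beta[-1]       # (3) leftovers sort after the prefix items
    return all(x > max(beta[-1]) for x in leftover)


def postfix(alpha, beta):
    """Return (partial element e''_m, remaining elements) of alpha w.r.t. prefix beta."""
    if not is_prefix(beta, alpha):
        raise ValueError("beta is not a prefix of alpha")
    m = len(beta)
    return alpha[m - 1] - beta[-1], alpha[m:]


def fmt_postfix(partial, rest):
    """Render a postfix; a non-empty partial element is shown as (_items)."""
    head = "(_" + "".join(sorted(partial)) + ")" if partial else ""
    return head + fmt(rest)


alpha = parse("a(abc)(ac)d(cf)")
print(is_prefix(parse("ab"), alpha))                    # False: a would be left over before b
print(fmt_postfix(*postfix(alpha, parse("a"))))         # (abc)(ac)d(cf)
print(fmt_postfix(*postfix(alpha, parse("aa"))))        # (_bc)(ac)d(cf)
print(fmt_postfix(*postfix(alpha, parse("a(ab)"))))     # (_c)(ac)d(cf)
```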
PrefixSpan-concepts (cont’d) Projected database: let $\alpha$ be a sequential pattern in S. The $\alpha$-projected database, denoted $S|_\alpha$, is the collection of postfixes of sequences in S w.r.t. prefix $\alpha$ Support count in projected database: let $\alpha$ be a sequential pattern in S and $\beta$ be a sequence having prefix $\alpha$. The support count of $\beta$ in the $\alpha$-projected database is the number of sequences $\gamma$ in $S|_\alpha$ such that $\beta \sqsubseteq \alpha \cdot \gamma$
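For a single-item prefix, the projected database can be built by cutting each sequence at the first occurrence of the item. The sketch below assumes the four-sequence running example of the PrefixSpan paper; `project` and the `(_x)` rendering are illustrative conventions rather than the authors' code.

```python
def parse(text):
    """Parse 'a(abc)(ac)d(cf)' into a list of frozensets (assumes single-letter items)."""
    elements, i = [], 0
    while i < len(text):
        if text[i] == "(":
            j = text.index(")", i)
            elements.append(frozenset(text[i + 1:j]))
            i = j + 1
        else:
            elements.append(frozenset(text[i]))
            i += 1
    return elements


def fmt_postfix(partial, rest):
    """Render a postfix; a non-empty partial first element is shown as (_items)."""
    head = "(_" + "".join(sorted(partial)) + ")" if partial else ""
    body = "".join(
        "".join(sorted(e)) if len(e) == 1 else "(" + "".join(sorted(e)) + ")"
        for e in rest
    )
    return head + body


DB = [parse(s) for s in ("a(abc)(ac)d(cf)", "(ad)c(bc)(ae)", "(ef)(ab)(df)cb", "eg(af)cbc")]


def project(db, item):
    """Postfixes of all sequences containing `item`, w.r.t. the single-item prefix <item>."""
    projected = []
    for seq in db:
        for k, element in enumerate(seq):
            if item in element:
                # Items sorted after `item` in the same element form the partial element.
                partial = frozenset(x for x in element if x > item)
                projected.append((partial, seq[k + 1:]))
                break   # only the first occurrence matters
    return projected


print([fmt_postfix(p, r) for p, r in project(DB, "a")])
# ['(abc)(ac)d(cf)', '(_d)c(bc)(ae)', '(_b)(df)cb', '(_f)cbc']
```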
PrefixSpan-process Step 1: find length-1 sequential patterns –<a>:4, <b>:4, <c>:4, <d>:3, <e>:3, <f>:3 Step 2: divide the search space. The complete set of seq. pat. can be partitioned into 6 subsets: –ones having prefix <a>; –… –ones having prefix <f> Find each subset of sequential patterns by constructing the corresponding projected database and mining it recursively
SID  Sequence
10   <a(abc)(ac)d(cf)>
20   <(ad)c(bc)(ae)>
30   <(ef)(ab)(df)cb>
40   <eg(af)cbc>
PrefixSpan-Process (cont’d) Finding seq. patterns with prefix <a> –<a>-projected database: <(abc)(ac)d(cf)>, <(_d)c(bc)(ae)>, <(_b)(df)cb>, <(_f)cbc> –Find all the length-2 seq. pat. having prefix <a>: <aa>:2, <ab>:4, <(ab)>:2, <ac>:4, <ad>:2, <af>:2 –Further partition into 6 subsets: having prefix <aa>; … having prefix <af>
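The recursive process can be sketched compactly. The version below is a simplified illustration, not the authors' implementation: it keeps, for every sequence, only the index of the element where the current pattern's leftmost match ends, and tries two kinds of growth, a sequence extension (new element) and an itemset extension (new item in the last element). The database is assumed to be the four-sequence running example of the PrefixSpan paper, and all names are illustrative.

```python
def parse(text):
    """Parse 'a(abc)(ac)d(cf)' into a list of frozensets (assumes single-letter items)."""
    elements, i = [], 0
    while i < len(text):
        if text[i] == "(":
            j = text.index(")", i)
            elements.append(frozenset(text[i + 1:j]))
            i = j + 1
        else:
            elements.append(frozenset(text[i]))
            i += 1
    return elements


def fmt(seq):
    """Render a sequence back to the compact text form."""
    return "".join(
        "".join(sorted(e)) if len(e) == 1 else "(" + "".join(sorted(e)) + ")"
        for e in seq
    )


DB = [parse(s) for s in ("a(abc)(ac)d(cf)", "(ad)c(bc)(ae)", "(ef)(ab)(df)cb", "eg(af)cbc")]


def prefixspan(db, min_sup):
    """Return {pattern text: support} for all sequential patterns in db."""
    patterns = {}

    def grow(pattern, proj):
        # proj holds (sid, pos): the pattern's last element matched at db[sid][pos].
        last = pattern[-1] if pattern else frozenset()
        s_ext, i_ext = {}, {}          # item -> list of (sid, leftmost position)
        for sid, pos in proj:
            seq = db[sid]
            seen = set()
            for k in range(pos + 1, len(seq)):        # sequence extension: strictly later element
                for x in seq[k] - seen:
                    seen.add(x)
                    s_ext.setdefault(x, []).append((sid, k))
            seen = set()
            for k in range(max(pos, 0), len(seq)):    # itemset extension: element holding last + x
                if last and last <= seq[k]:
                    for x in seq[k] - seen:
                        if x > max(last):             # grow the last element in alphabetical order
                            seen.add(x)
                            i_ext.setdefault(x, []).append((sid, k))
        for ext, join in ((s_ext, False), (i_ext, True)):
            for x, new_proj in sorted(ext.items()):
                if len(new_proj) >= min_sup:          # one entry per supporting sequence
                    if join:
                        new = pattern[:-1] + [last | {x}]
                    else:
                        new = pattern + [frozenset([x])]
                    patterns[fmt(new)] = len(new_proj)
                    grow(new, new_proj)

    grow([], [(sid, -1) for sid in range(len(db))])
    return patterns


result = prefixspan(DB, 2)
print({p: result[p] for p in ("aa", "ab", "(ab)", "ac", "ad", "af")})
# {'aa': 2, 'ab': 4, '(ab)': 2, 'ac': 4, 'ad': 2, 'af': 2}
```

No candidate is generated that does not occur in some projected database; every extension considered comes from scanning the postfixes themselves.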
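Completeness of PrefixSpan [Figure: search-space partition tree. The sequence database yields the length-1 sequential patterns <a>, <b>, <c>, <d>, <e>, <f>, each with its own projected database (<a>-projected database, …, <f>-projected database). The <a>-projected database yields the length-2 seq. pat. <aa>, <ab>, <(ab)>, <ac>, <ad>, <af>, each again with its own projected database (<aa>-proj. db, …, <af>-proj. db), and so on.]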
Efficiency of PrefixSpan No candidate sequence needs to be generated Projected databases keep shrinking Major cost of PrefixSpan: constructing projected databases –Can be improved by bi-level projections and pseudo-projections
Optimization Techniques in PrefixSpan Single-level vs. bi-level projection –Bi-level projection with 3-way checking may reduce the number and size of projected databases Physical projection vs. pseudo-projection –Pseudo-projection may reduce the effort of projection when the projected database fits in main memory
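S-matrix for sequence database Length-1 sequential patterns: <a>, <b>, <c>, <d>, <e>, <f> All length-2 sequential patterns are found in the S-matrix
S-matrix (the cell in row x, column y holds the supports of <yx>, <xy>, <(yx)>; the diagonal cell of x holds the support of <xx>):
     a          b          c          d          e          f
a    2
b    (4, 2, 2)  1
c    (4, 2, 1)  (3, 3, 2)  3
d    (2, 1, 1)  (2, 2, 0)  (1, 3, 0)  0
e    (1, 2, 1)  (1, 2, 0)  (1, 2, 0)  (1, 1, 0)  0
f    (2, 1, 1)  (2, 2, 0)  (1, 2, 1)  (1, 1, 1)  (2, 0, 1)  1
–<aa> happens twice
–In the cell of row c, column a: <ac> happens 4 times, <ca> happens twice, <(ac)> happens once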
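The S-matrix can be recomputed by brute force from the database, which is a handy cross-check of the table. The sketch below assumes the four-sequence running example of the PrefixSpan paper; a real implementation would fill the matrix in a single scan rather than one containment test per cell, and all names here are illustrative.

```python
def parse(text):
    """Parse 'a(abc)(ac)d(cf)' into a list of frozensets (assumes single-letter items)."""
    elements, i = [], 0
    while i < len(text):
        if text[i] == "(":
            j = text.index(")", i)
            elements.append(frozenset(text[i + 1:j]))
            i = j + 1
        else:
            elements.append(frozenset(text[i]))
            i += 1
    return elements


def is_subsequence(alpha, beta):
    """Greedy containment test."""
    j = 0
    for a in alpha:
        while j < len(beta) and not a <= beta[j]:
            j += 1
        if j == len(beta):
            return False
        j += 1
    return True


DB = [parse(s) for s in ("a(abc)(ac)d(cf)", "(ad)c(bc)(ae)", "(ef)(ab)(df)cb", "eg(af)cbc")]


def s_matrix(db, items):
    """Lower-triangular S-matrix keyed by (row item, column item)."""
    def sup(text):
        return sum(is_subsequence(parse(text), s) for s in db)

    matrix = {}
    for i, x in enumerate(items):
        matrix[x, x] = sup(x + x)                        # diagonal: support of <xx>
        for y in items[:i]:
            # (support of <yx>, support of <xy>, support of <(yx)>)
            matrix[x, y] = (sup(y + x), sup(x + y), sup("(" + y + x + ")"))
    return matrix


m = s_matrix(DB, "abcdef")
print(m["a", "a"])   # 2
print(m["c", "a"])   # (4, 2, 1)
print(m["e", "c"])   # (1, 2, 0)
```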
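S-matrix for <ab>-projected database <ab>-projected database: –<(_c)(ac)d(cf)>, <(_c)(ae)>, <c> Frequent items: <a>, <c>, <(_c)>
S-matrix:
        a          c          (_c)
a       0
c       (1, 0, 1)  1
(_c)    (∅, 2, ∅)  (∅, 1, ∅)  ∅
∅ marks entries that are not counted: (_c) can only occur at the start of a postfix, so no <a(_c)> exists The cell (∅, 2, ∅) leads to pattern <a(bc)a>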
Scaling-up by Bi-level Projection Partition search space based on length-2 sequential patterns Only form projected databases and pursue recursive mining over bi-level projected databases
Benefits of Bi-level Projection More patterns are found in each shot Far fewer projections –In the example, there are 53 patterns –53 level-by-level projections –22 bi-level projections
3-way Apriori Checking Use the Apriori heuristic to prune items in projected databases
     a          b          c          d          e          f
a    2
b    (4, 2, 2)  1
c    (4, 2, 1)  (3, 3, 2)  3
d    (2, 1, 1)  (2, 2, 0)  (1, 3, 0)  0
e    (1, 2, 1)  (1, 2, 0)  (1, 2, 0)  (1, 1, 0)  0
f    (2, 1, 1)  (2, 2, 0)  (1, 2, 1)  (1, 1, 1)  (2, 0, 1)  1
–<cd> has support 1, so <acd> cannot be a pattern w.r.t. min_support = 2
–Exclude d from the <ac>-projected database
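The pruning idea can be illustrated with a much-simplified check that only looks at sequence extensions: an item x is worth keeping for a length-2 prefix <pq> only if both <px> and <qx> are frequent. The supports below are read off the S-matrix on the slide; `items_to_keep` is a hypothetical helper and covers only part of the full 3-way check.

```python
# Supports of length-2 sequences starting with a or c, read off the S-matrix.
SUP2 = {
    "aa": 2, "ab": 4, "ac": 4, "ad": 2, "ae": 1, "af": 2,
    "ca": 2, "cb": 3, "cc": 3, "cd": 1, "ce": 1, "cf": 1,
}


def items_to_keep(prefix, candidates, min_sup, sup2=SUP2):
    """Keep item x only if <px> is frequent for every item p of the prefix.

    Simplification: handles sequence extensions <prefix x> only and treats
    the prefix as a string of single-item elements.
    """
    return [
        x for x in candidates
        if all(sup2.get(p + x, 0) >= min_sup for p in prefix)
    ]


print(items_to_keep("ac", "abcdef", 2))   # ['a', 'b', 'c']
```

Here d is dropped because <cd> has support 1, and e and f fail for the same kind of reason, so they need not be copied into the <ac>-projected database.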
Pseudo-projection Major cost of PrefixSpan: projection –Postfixes of sequences often appear repeatedly in recursive projected databases When the projected database fits in memory, use pointers to form projections –Pointer to the sequence –Offset of the postfix –e.g. for s = <a(abc)(ac)d(cf)>: s|<a> is (pointer to s, 2), i.e. <(abc)(ac)d(cf)>; s|<ab> is (pointer to s, 4), i.e. <(_c)(ac)d(cf)>
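A pseudo-projection stores only a reference to the original sequence plus the offset where the postfix starts. The sketch below flattens a sequence into (item, element index) pairs and uses 1-based item offsets, which reproduces the offsets 2 and 4 for s = <a(abc)(ac)d(cf)>. It handles prefixes made of single-item elements only, and all names are illustrative assumptions.

```python
def flatten(text):
    """'a(abc)' -> [('a', 0), ('a', 1), ('b', 1), ('c', 1)] as (item, element index) pairs."""
    flat, element, i = [], 0, 0
    while i < len(text):
        if text[i] == "(":
            j = text.index(")", i)
            flat.extend((x, element) for x in text[i + 1:j])
            i = j + 1
        else:
            flat.append((text[i], element))
            i += 1
        element += 1
    return flat


def pseudo_project(flat, prefix):
    """1-based offset of the postfix after the leftmost match of `prefix`, or None.

    `prefix` is a string of single-item elements, e.g. 'ab' for <ab>.
    """
    pos, last_element = -1, -1
    for item in prefix:
        pos = next(
            (k for k in range(pos + 1, len(flat))
             if flat[k][0] == item and flat[k][1] > last_element),
            None,
        )
        if pos is None:
            return None
        last_element = flat[pos][1]
    return pos + 2   # first item after the match, counted from 1


def render(flat, offset):
    """Materialize the postfix that starts at a 1-based offset (for display only)."""
    start = offset - 1
    groups = []                       # [(element index, [items])]
    for item, element in flat[start:]:
        if groups and groups[-1][0] == element:
            groups[-1][1].append(item)
        else:
            groups.append((element, [item]))
    out = []
    for idx, (element, items) in enumerate(groups):
        body = "".join(items)
        if idx == 0 and start > 0 and flat[start - 1][1] == element:
            out.append("(_" + body + ")")          # element shared with the prefix
        elif len(items) > 1:
            out.append("(" + body + ")")
        else:
            out.append(body)
    return "".join(out)


s = flatten("a(abc)(ac)d(cf)")
print(pseudo_project(s, "a"), render(s, pseudo_project(s, "a")))     # 2 (abc)(ac)d(cf)
print(pseudo_project(s, "ab"), render(s, pseudo_project(s, "ab")))   # 4 (_c)(ac)d(cf)
```

The recursion then passes (sequence id, offset) pairs around instead of copied postfixes; `render` exists only to show what each pair stands for.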
Pseudo-Projection vs. Physical Projection Pseudo-projection avoids physically copying postfixes –Efficient when the database fits in main memory –Not efficient when the database cannot fit in main memory: disk-based random access is very costly Suggested approach: –Integration of physical and pseudo-projection –Swapping to pseudo-projection when the data set fits in memory
Experiments Synthetic datasets were generated using the procedure described in R. Agrawal and R. Srikant. Mining sequential patterns. In Proc. ICDE’95 –number of items: 1,000 –number of sequences in the data set: 10,000 –average number of items within elements: 8 –average number of elements in a sequence: 8
Experiments (cont’d) Comparing PrefixSpan with GSP and FreeSpan in large databases –GSP (IBM Almaden, Srikant & Agrawal, EDBT’96) –FreeSpan (J. Han, J. Pei, B. Mortazavi-Asl, Q. Chen, U. Dayal, M.-C. Hsu, KDD’00) –PrefixSpan-1 (single-level projection) –PrefixSpan-2 (bi-level projection) Comparing effects of pseudo-projection Comparing I/O cost and scalability
PrefixSpan Is Faster Than GSP and FreeSpan
Effect of Pseudo-Projection When the Projected Database Fits in Memory
I/O Cost: When It Cannot Fit in Memory
Scalability (When DB Is Large) min_sup=0.2%
Conclusions Both PrefixSpan and FreeSpan are pattern-growth methods, which perform better than Apriori-like methods on the sequential pattern mining problem PrefixSpan is more elegant than FreeSpan –The Apriori heuristic is integrated into bi-level projection in PrefixSpan –Pseudo-projection substantially enhances the performance of memory-based processing
References
J. Han, J. Pei, B. Mortazavi-Asl, Q. Chen, U. Dayal, and M.-C. Hsu. FreeSpan: Frequent pattern-projected sequential pattern mining. KDD'00.
J. Pei, J. Han, B. Mortazavi-Asl, H. Pinto, Q. Chen, U. Dayal, and M.-C. Hsu. PrefixSpan: Mining sequential patterns efficiently by prefix-projected pattern growth. ICDE'01.
R. Srikant and R. Agrawal. Mining sequential patterns: Generalizations and performance improvements. EDBT'96, pages 3-17.
Q&A
Thanks