Pattern-Growth Methods for Sequential Pattern Mining Iris Zhang 2003-5-14.

Pattern-Growth Methods for Sequential Pattern Mining Iris Zhang 2003-5-14

Outline Sequential pattern mining Apriori-like methods –GSP Pattern-growth methods –FreeSpan –PrefixSpan Performance analysis Conclusions

Motivation Sequential pattern mining: Finding time-related frequent patterns Most data and applications are time-related –Customer shopping patterns, telephone calling patterns –Natural disasters (e.g., earthquake, hurricane) –Disease and treatment –Stock market fluctuation –Weblog click stream analysis –DNA sequence analysis

Concepts Let I={i 1,i 2,…,i n } be a set of all items Itemset is a subset of items Sequence is an ordered list of itemset. itemsets are called elements. The number of items in the sequence is its length –e.g. A sequence  = is called subsequence of  =, denoted , if there exist integers 1  j 1 <j 2 <…<j n  m such that a 1  b j1, a 2  b j2,…,a n  b jn –e.g. is subsequence of

Concepts (con’t) Sequence database is a set of tuples, sid is a sequence_id, and s is a sequence. A tuple is said to contain a sequence  if  is a subsequence of s Support of  is the number of tuples in the database containing  If the support of  no less than a threshold, it is called sequential pattern – is a sequential pattern given support threshold min_sup =2 SIDsequence 10 20 30 40

Problem definition Given a sequence database and min_sup threshold, the problem of sequential pattern mining is to find the complete set of sequential patterns in the database

Apriori-like methods Apriori property: If a sequence S is not frequent, then every super-sequence of S is not frequent –e.g. is infrequent, so do, GSP (Generalized Sequential Pattern) algorithm –Level-by-level do Generate candidate sequences Use Apriori property to prune candidates Scan database to collect support counts

GSP Mining Process 1 st scan: 8 cand. 6 length-1 seq. pat. 2 nd scan: 51 cand. 19 length-2 seq. pat. 10 cand. not in DB at all 3 rd scan: 46 cand. 19 length-3 seq. pat. 20 cand. not in DB at all 4 th scan: 8 cand. 6 length-4 seq. pat. 5 th scan: 1 cand. 1 length-5 seq. pat. … … … … Cand. cannot pass sup. threshold Cand. not in DB at all

Bottlenecks of Apriori-Like Methods Potentially huge set of candidate sequences –1,000 frequent length-1 sequences generate length-2 candidates Multiple scans of database Difficulties at mining long sequential patterns –Exponential number of short candidates –A length-100 sequential pattern needs candidate sequences

Pattern-growth methods A divide-and-conquer approach –Recursively project a sequence database into a set of smaller databases –Mine each projected database to find the subset of patterns Algorithms –FreeSpan: Frequent Pattern-Projected Sequential Pattern Mining –PrefixSpan: Prefix-Projected Sequential Pattern Mining

FreeSpan Example: given a sequence database S and min_support = 2 Step 1: find length-1 sequential patterns and list them in support descending order –f_list = a:4,b:4,c:4,d:3,e:3,f:3 SIDSequence 10 20 30 40

FreeSpan (con’t) Step 2: divide search space. The complete set of seq. pat. can be partitioned into 6 disjoint subsets: –ones only contain item a –ones contain item b but no items after b in f_list –ones contain item c but no items after c in f_list –ones contain item d but no items after d in f_list –ones contain item e but no items after e in f_list –ones contain item f find subsets of sequential patterns. They can be mined by constructing projected databases and mining each recursively

FreeSpan (con’t) Finding Seq. Patterns containing item b but no items after b in f_list – -projected database:,,, –Find all the length-2 seq. pat. containing item b but no items after b in f_list : :4, :2, :2 –Further partition and mining SIDSequence 10 20 30 40

From FreeSpan to PrefixSpan Freespan: –Projection-based: No candidate sequence needs to be generated –But, projection can be performed at any point in the sequence, and the projected sequences may not shrink much. For example, the size of f-projected database is the same as the original sequence database PrefixSpan –Projection-based –But only prefix-based projection: less projections and quickly shrinking sequences

PrefixSpan-concepts Suppose all items in an element are listed alphabetically. Given a sequence  =,  = (m  n) Prefix:  is the prefix of  iff (1) e’ i =e i (i  m-1) (2) e’ m  e m (3) all items in (e m - e’ m ) are alphabetically after those in e’ m. –e.g.  =,  =,  ’= Postfix: sequence  =,  = is called the postfix of  w.r.t. prefix , where e’’ m =(e m -e’ m ), denoted as  = .  –e.g.  = is the postfix of  w.r.t. prefix

PrefixSpan-concepts (con’t) Projected database: let  be a sequential pattern in S.  -projected database, denoted s| , is the collection of postfixes of sequences in S w.r.t. prefix  Support count in projected database: let  be a sequential pattern in S,  be a sequence having prefix . The support count of  in  -projected database is the number of sequence  in s|  such that . 

PrefixSpan-process Step 1: find length-1 sequential patterns – :4, :4, :4, :3, :3, :3 Step 2: divide search space. The complete set of seq. pat. can be partitioned into 6 subsets: –ones having prefix ; –… –ones having prefix ; find subsets of sequential patterns. They can be mined by constructing projected databases and mining each recursively SIDSequence 10 20 30 40

PrefixSpan-Process (con’t) Finding Seq. Patterns with Prefix – -projected database:,,, –Find all the length-2 seq. pat. having prefix : :2, :4, :2, :4, :2, :2 –Further partition into 6 subsets Having prefix ; … Having prefix ; SIDSequence 10 20 30 40

Completeness of PrefixSpan SIDsequence 10 20 30 40 Length-1 sequential patterns,,,,, … prefix -projected database … prefix -projected database Length-2 seq. pan,,,,, prefix -proj. db prefix, …, …

Efficiency of PrefixSpan No candidate sequence needs to be generated Projected databases keep shrinking Major cost of PrefixSpan: constructing projected databases –Can be improved by bi-level projections and pseudo- projections

Optimization Techniques in PrefixSpan Single-level vs. bi-level projection –Bi-level projection with 3-way checking may reduce the number and size of projected databases Physical projection vs. pseudo-projection –Pseudo-projection may reduce the effort of projection when the projected database fits in main memory

S-matrix for sequence database Length-1 sequential patterns:,,,,, All length-2 sequential patterns are found in S-matrix S-matrix fedcba 1(2, 0, 1)(1, 1, 1)(1, 2, 1)(2, 2, 0)(2, 1, 1)f 0(1, 1, 0)(1, 2, 0) (1, 2, 1)e 0(1, 3, 0)(2, 2, 0)(2, 1, 1)d 3(3, 3, 2)(4, 2, 1)c 1(4, 2, 2)b 2a happens twice happens 4 times happens twice happens once

S-matrix for -projected database -projected database: –,, frequent items:,, S-matrix: a0 c(1, 0, 1)1 (_c) ( , 2,  )( , 1,  )  ac(_c) No a(_c), no count Lead to pattern SIDSequence 10 20 30 40

Scaling-up by Bi-level Projection Partition search space based on length-2 sequential patterns Only form projected databases and pursue recursive mining over bi-level projected databases

Benefits of Bi-level Projection More patterns are found in each shoot Much less projections –In the example, there are 53 patterns. –53 level-by-level projections –22 bi-level projections

3-way Apriori Checking Using Apriori heuristic to prune items in projected databases a2 b(4, 2, 2)1 c(4, 2, 1)(3, 3, 2)3 d(2, 1, 1)(2, 2, 0)(1, 3, 0)0 e(1, 2, 1)(1, 2, 0) (1, 1, 0)0 f(2, 1, 1)(2, 2, 0)(1, 2, 1)(1, 1, 1)(2, 0, 1)1 abcdef cannot be a pattern w.r.t. min_support=2 exclude d from -projected database

Pseudo-projection Major cost of PrefixSpan: projection –Postfixes of sequences often appear repeatedly in recursive projected databases When the projected database fit in memory, use pointers to form projections –Pointer to the sequence –Offset of the postfix s= s| : (, 2) s| : (, 4)

Pseudo-Projection vs. Physical Projection Pseudo-projection avoids physically copying postfixes –Efficient when database fits in main memory –Not efficient when database cannot fit in main memory Disk-based random accessing is very costly Suggested Approach: –Integration of physical and pseudo-projection –Swapping to pseudo-projection when the data set fits in memory

Experiments Synthetic datasets were generated using procedure described in R.Agrawal and R.Srikant. Mining sequential patterns. In Proc. 1995 ICDE’95 –number of items 1000 –number of sequences in the data set 10,000 –average number of items within elements 8 –average number of elements in a sequence 8

Experiments (con’t) Comparing PrefixSpan with GSP and FreeSpan in large databases –GSP (IBM Almaden, Srikant & Agrawal EDBT’96) –FreeSpan (J. Han J. Pei, B. Mortazavi-Asi, Q. Chen, U. Dayal, M.C. Hsu, KDD’00) –Prefix-Span-1 (single-level projection) –Prefix-Span-2 (bi-level projection) Comparing effects of pseudo-projection Comparing I/O cost and scalability

PrefixSpan Is Faster Than GSP and FreeSpan

Effect of Pseudo-Projection for projected database fit in memory

I/O Cost: When It Cannot Fit in Memory

Scalability (When DB Is Large) min_sup=0.2%

Conclusions Both PrefixSpan and FreeSpan are pattern-growth methods which perform better than Apriori-like methods for sequential pattern mining problem PrefixSpan is more elegant than FreeSpan –Apriori heuristic is integrated into bi-level projection in PrefixSpan –Pseudo-projection substantially enhances the performance of the memory-based processing

References J. Han, J. Pei, B. Mortazavi-Asl, Q. Chen, U. Dayal, and M.-C. Hsu. FreeSpan: Frequent pattern- projected sequential pattern mining. KDD'00, pages 355-359. J. Pei, J. Han, B. Mortazavi-Asl, H. Pinto, Q. Chen, U. Dayal, and M.-C. Hsu. PrefixSpan: Mining sequential patterns efficiently by prefix-projected pattern growth. ICDE'01, pages 215-224. R. Srikant and R. Agrawal. Mining sequential patterns: Generalizations and performance improvements. EDBT'96, pages 3-17.

Thanks

Pattern-Growth Methods for Sequential Pattern Mining Iris Zhang 2003-5-14.

Similar presentations

Presentation on theme: "Pattern-Growth Methods for Sequential Pattern Mining Iris Zhang 2003-5-14."— Presentation transcript:

Similar presentations

About project

Feedback

Log in

Auth with social network:

Pattern-Growth Methods for Sequential Pattern Mining Iris Zhang 2003-5-14.

Similar presentations

Presentation on theme: "Pattern-Growth Methods for Sequential Pattern Mining Iris Zhang 2003-5-14."— Presentation transcript:

Similar presentations

About project

Feedback