
PrefixSpan: Mining Sequential Patterns Efficiently by Prefix-Projected Pattern Growth
Jiawei Han, Jian Pei, Helen Pinto, Behzad Mortazavi-Asl, Qiming Chen, Umeshwar Dayal, Mei-Chun Hsu
17th International Conference on Data Engineering (ICDE), Heidelberg, Germany, April 2001

Mining Sequential Patterns by Pattern-Growth: The PrefixSpan Approach
IEEE Transactions on Knowledge and Data Engineering (TKDE), Vol. 16, No. 10, October 2004

Advisor: Professor Hsin-Hsi Chen
Reporter: Clarence Min-Chi Hsieh
Natural Language Processing Laboratory, Dept. of Computer Science and Info. Engineering, NTU
2005/07/19

Reporter: Clarence Min-Chi Hsieh. Copyright © Natural Language Processing Lab., NTU, 2005

Slide 2: Outline
- Abstract
- Introduction
- PrefixSpan
- Performance
- Conclusions

Slide 3: Abstract
- Sequential pattern mining is a difficult problem because one may need to examine a combinatorially explosive number of possible subsequence patterns.
- The general idea of the method is to integrate the mining of frequent sequences with that of frequent patterns, and to use projected sequence databases to confine the search and the growth of subsequence fragments.

Slide 4: Introduction
- An Apriori-like algorithm generates a huge set of candidate sequences
  - Suppose there are 1000 frequent sequences of length 1
  - 1000×1000 + (1000×999)/2 = 1,499,500 candidate sequences of length 2
- Many database scans are needed during mining
  - Sequential pattern {(abc)(abc)(abc)(abc)(abc)}
  - The Apriori-based method must scan the database at least 15 times
- Mining long sequential patterns is difficult
  - Suppose there is only a single sequence of length 100 and min_sup = 1
  - Length-1 candidate sequences: 100; length-2: 14,950; …
  - Total = 2^100 − 1 ≈ 10^30
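The candidate counts above can be checked with a few lines of Python. This is only an arithmetic sketch. The function name `length2_candidates` is made up for illustration, and reading the 2^100 − 1 total as a sum of binomial coefficients over all candidate lengths is an assumption that happens to reproduce the slide's figure.

```python
from math import comb

def length2_candidates(n: int) -> int:
    """Length-2 candidates built from n frequent items:
    n*n ordered pairs <xy> plus n*(n-1)/2 same-element pairs <(xy)>."""
    return n * n + n * (n - 1) // 2

# Assumed reading of the slide's total: one candidate per non-empty choice
# of positions in a single length-100 sequence, summed over all lengths.
total_for_length_100 = sum(comb(100, k) for k in range(1, 101))

print(length2_candidates(1000))            # 1499500
print(length2_candidates(100))             # 14950
print(total_for_length_100 == 2**100 - 1)  # True
```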

Slide 5: Introduction (Cont.)
Sequence, Elements, Subsequence and Sequential Pattern

A sequence database:

SID  Sequence
10   <a(abc)(ac)d(cf)>
20   <(ad)c(bc)(ae)>
30   <(ef)(ab)(df)cb>
40   <eg(af)cbc>

- A sequence: <(ef)(ab)(df)cb>
- Elements: items within an element are listed alphabetically
- <a(bc)dc> is a subsequence of <a(abc)(ac)d(cf)>
- Given support threshold min_sup = 2, <(ab)c> is a sequential pattern
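A minimal sketch of the subsequence test and support count, assuming the four-sequence running example of the PrefixSpan paper as the database. The helpers `parse`, `is_subsequence` and `support` are hypothetical names introduced here, not part of the original slides.

```python
def parse(text):
    """Parse 'a(abc)d' into a list of elements: [{'a'}, {'a','b','c'}, {'d'}]."""
    elements, i = [], 0
    while i < len(text):
        if text[i] == "(":
            j = text.index(")", i)
            elements.append(frozenset(text[i + 1:j]))
            i = j + 1
        else:
            elements.append(frozenset(text[i]))
            i += 1
    return elements

def is_subsequence(pattern, sequence):
    """Greedy check: each pattern element must be a subset of some
    sequence element, and the matched elements must appear in order."""
    pos = 0
    for element in pattern:
        while pos < len(sequence) and not element <= sequence[pos]:
            pos += 1
        if pos == len(sequence):
            return False
        pos += 1
    return True

def support(pattern, database):
    """Number of sequences in the database that contain the pattern."""
    return sum(is_subsequence(pattern, seq) for seq in database.values())

# Assumed example database (SID -> sequence).
SDB = {
    10: parse("a(abc)(ac)d(cf)"),
    20: parse("(ad)c(bc)(ae)"),
    30: parse("(ef)(ab)(df)cb"),
    40: parse("eg(af)cbc"),
}

print(is_subsequence(parse("a(bc)dc"), SDB[10]))  # True
print(support(parse("(ab)c"), SDB))               # 2
```

With min_sup = 2, a support of 2 is enough for <(ab)c> to count as a sequential pattern.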

Slide 6: PrefixSpan
- PrefixSpan-1
  - Single-level projection
- PrefixSpan-2
  - Bi-level projection
  - Uses the S-matrix
- PrefixSpan with pseudo-projection

Slide 7: PrefixSpan (Cont.)
Definition: Prefix and Postfix (Projection)
- <a>, <aa>, <a(ab)> and <a(abc)> are prefixes of sequence <a(abc)(ac)d(cf)>
- Given sequence <a(abc)(ac)d(cf)>:

Prefix  Postfix / Projection
<a>     <(abc)(ac)d(cf)>
<aa>    <(_bc)(ac)d(cf)>
<ab>    <(_c)(ac)d(cf)>
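The postfix of a sequence with respect to a prefix can be sketched as follows. The function names are hypothetical, and the sketch assumes the leftmost match of the prefix, with leftover items of the last matched element written as a `(_...)` element.

```python
def parse(text):
    """Parse 'a(abc)d' into [('a',), ('a', 'b', 'c'), ('d',)]."""
    out, i = [], 0
    while i < len(text):
        if text[i] == "(":
            j = text.index(")", i)
            out.append(tuple(sorted(text[i + 1:j])))
            i = j + 1
        else:
            out.append((text[i],))
            i += 1
    return out

def fmt(element, partial=False):
    """Render one element; a partial element is written as '(_xy)'."""
    body = ("_" if partial else "") + "".join(element)
    return body if len(body) == 1 else f"({body})"

def postfix(sequence, prefix):
    """Text form of the postfix of `sequence` w.r.t. `prefix`,
    or None if `prefix` does not occur in `sequence`."""
    pos = -1
    for element in prefix:
        pos += 1
        while pos < len(sequence) and not set(element) <= set(sequence[pos]):
            pos += 1
        if pos == len(sequence):
            return None
    # Items after the prefix's last item in the same element stay in the postfix.
    rest = tuple(x for x in sequence[pos] if x > max(prefix[-1]))
    parts = [fmt(rest, partial=True)] if rest else []
    parts += [fmt(e) for e in sequence[pos + 1:]]
    return "<" + "".join(parts) + ">"

s = parse("a(abc)(ac)d(cf)")
print(postfix(s, parse("a")))   # <(abc)(ac)d(cf)>
print(postfix(s, parse("aa")))  # <(_bc)(ac)d(cf)>
print(postfix(s, parse("ab")))  # <(_c)(ac)d(cf)>
```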

Slide 8: PrefixSpan-1
- Step 1. Find length-1 sequential patterns
  - Scan the DB once to find all frequent items in sequences
- Step 2. Divide the search space
  - The complete set of sequential patterns is partitioned into subsets according to their prefixes
- Step 3. Find subsets of sequential patterns
  - The subsets of sequential patterns can be mined by constructing the corresponding projected databases and mining each recursively
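The three steps can be sketched as one recursive function. This is an illustrative sketch, not the authors' implementation. It assumes the four-sequence running example of the PrefixSpan paper, and it represents each projected database as a list of (sequence index, end position) pairs instead of copying postfixes. The name `prefixspan` and the data layout are choices made here.

```python
from collections import defaultdict

def prefixspan(db, min_sup):
    """db: list of sequences, each a list of sorted item tuples (elements).
    Returns {pattern: support}; a pattern is a tuple of item tuples."""
    patterns = {}

    def mine(prefix, projected):
        # projected: (sequence index, index of the element where the
        # leftmost match of `prefix` ends); -1 means nothing matched yet.
        last = set(prefix[-1]) if prefix else set()
        seq_ext = defaultdict(list)   # x -> entries for prefix followed by <x>
        elem_ext = defaultdict(list)  # x -> entries for last element grown by x
        for sid, end in projected:
            seq = db[sid]
            seen = set()
            for j in range(end + 1, len(seq)):
                for x in seq[j]:
                    if x not in seen:
                        seen.add(x)
                        seq_ext[x].append((sid, j))
            if prefix:
                seen = set()
                for j in range(end, len(seq)):
                    if last <= set(seq[j]):
                        for x in seq[j]:
                            if x > max(last) and x not in seen:
                                seen.add(x)
                                elem_ext[x].append((sid, j))
        for x, entries in sorted(elem_ext.items()):
            if len(entries) >= min_sup:
                grown = prefix[:-1] + (tuple(sorted(last | {x})),)
                patterns[grown] = len(entries)
                mine(grown, entries)
        for x, entries in sorted(seq_ext.items()):
            if len(entries) >= min_sup:
                grown = prefix + ((x,),)
                patterns[grown] = len(entries)
                mine(grown, entries)

    mine((), [(sid, -1) for sid in range(len(db))])
    return patterns

# Assumed example database (SIDs 10, 20, 30, 40 in order).
SDB = [
    [("a",), ("a", "b", "c"), ("a", "c"), ("d",), ("c", "f")],
    [("a", "d"), ("c",), ("b", "c"), ("a", "e")],
    [("e", "f"), ("a", "b"), ("d", "f"), ("c",), ("b",)],
    [("e",), ("g",), ("a", "f"), ("c",), ("b",), ("c",)],
]
result = prefixspan(SDB, 2)
print(len(result))                               # 53
print(result[(("a",), ("b", "c"), ("a",))])      # 2, i.e. <a(bc)a>
```

Each recursive call only looks at the part of a sequence after the current prefix, which is how the search stays confined to the projected database.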

Slide 9: PrefixSpan-1 (Example)

Sequence_id  Sequence
10           <a(abc)(ac)d(cf)>
20           <(ad)c(bc)(ae)>
30           <(ef)(ab)(df)cb>
40           <eg(af)cbc>

min_support = 2
L1: <a>:4, <b>:4, <c>:4, <d>:3, <e>:3, <f>:3

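Slide 10: PrefixSpan-1 (Example) (Cont.)

Prefix  Projected (Postfix) Database
<a>     <(abc)(ac)d(cf)>, <(_d)c(bc)(ae)>, <(_b)(df)cb>, <(_f)cbc>
<b>     <(_c)(ac)d(cf)>, <(_c)(ae)>, <(df)cb>, <c>
<c>     <(ac)d(cf)>, <(bc)(ae)>, <b>, <bc>
<d>     <(cf)>, <c(bc)(ae)>, <(_f)cb>
<e>     <(_f)(ab)(df)cb>, <(af)cbc>
<f>     <(ab)(df)cb>, <cbc>

L1: <a>:4, <b>:4, <c>:4, <d>:3, <e>:3, <f>:3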

Slide 11: PrefixSpan-1 (Example) (Cont.)
<a>-projected database: <(abc)(ac)d(cf)>, <(_d)c(bc)(ae)>, <(_b)(df)cb>, <(_f)cbc>
- Scanning the <a>-projected database once: a:2, b:4, c:4, d:2, e:1, f:2, (_b):2, (_c):1, (_d):1, (_f):1
- L2: <aa>:2, <ab>:4, <(ab)>:2, <ac>:4, <ad>:2, <af>:2
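A small sketch of the single scan over the <a>-projected database. The postfixes are assumed from the running example of the PrefixSpan paper and are written as a pair: the items that share an element with the prefix's last item, then the later elements. The function `scan` is a hypothetical helper. It also counts an item as (_x) when a later element contains both the prefix's last item and x, so it finds one infrequent occurrence of (_e) via (ae) in sequence 20, which does not change the frequent set.

```python
from collections import Counter

# <a>-projected database: (items left in the prefix's element, later elements)
A_PROJECTED = [
    ("",  ["abc", "ac", "d", "cf"]),  # from sequence 10
    ("d", ["c", "bc", "ae"]),         # from sequence 20
    ("b", ["df", "c", "b"]),          # from sequence 30
    ("f", ["c", "b", "c"]),           # from sequence 40
]

def scan(projected, last_item):
    """Count, per item, how many postfixes support a sequence extension (x)
    or a same-element extension (_x) of a prefix ending in `last_item`."""
    seq_counts, elem_counts = Counter(), Counter()
    for partial, rest in projected:
        seq_counts.update(set().union(*rest))
        same_element = set(partial)
        for element in rest:
            if last_item in element:
                same_element.update(x for x in element if x > last_item)
        elem_counts.update(same_element)
    return seq_counts, elem_counts

seq_counts, elem_counts = scan(A_PROJECTED, "a")
min_sup = 2
frequent = {f"<a{x}>": n for x, n in sorted(seq_counts.items()) if n >= min_sup}
frequent.update(
    {f"<(a{x})>": n for x, n in sorted(elem_counts.items()) if n >= min_sup}
)
print(frequent)
# {'<aa>': 2, '<ab>': 4, '<ac>': 4, '<ad>': 2, '<af>': 2, '<(ab)>': 2}
```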

Slide 13: PrefixSpan-1 (Example) (Cont.)
<ab>-projected database: <(_c)(ac)d(cf)>, <(_c)(ae)>, <c>
- Scanning the <ab>-projected database once: a:2, c:2, d:1, f:1, (_c):2
- L3: <aba>:2, <abc>:2, <a(bc)>:2

Slide 14: PrefixSpan-1 (Example) (Cont.)

Prefix   Projected (Postfix) Database
<a(bc)>  <(ac)d(cf)>, <(ae)>

- Scanning the <a(bc)>-projected database once: a:2, c:1, d:1, f:1
- L4: <a(bc)a>:2

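Slide 15: PrefixSpan-1 (Example) (Cont.)

Prefix  Sequential Patterns
<a>     <a>, <aa>, <ab>, <a(bc)>, <a(bc)a>, <aba>, <abc>, <(ab)>, <(ab)c>, <(ab)d>, <(ab)f>, <(ab)dc>, <ac>, <aca>, <acb>, <acc>, <ad>, <adc>, <af>
<b>     <b>, <ba>, <bc>, <(bc)>, <(bc)a>, <bd>, <bdc>, <bf>
<c>     <c>, <ca>, <cb>, <cc>
<d>     <d>, <db>, <dc>, <dcb>
<e>     <e>, <ea>, <eab>, <eac>, <eacb>, <eb>, <ebc>, <ec>, <ecb>, <ef>, <efb>, <efc>, <efcb>
<f>     <f>, <fb>, <fbc>, <fc>, <fcb>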

Slide 16: Completeness of PrefixSpan-1

SDB:

SID  Sequence
10   <a(abc)(ac)d(cf)>
20   <(ad)c(bc)(ae)>
30   <(ef)(ab)(df)cb>
40   <eg(af)cbc>

- Length-1 sequential patterns: <a>, <b>, <c>, <d>, <e>, <f>
  - Having prefix <a>: <a>-projected database
    - Length-2 sequential patterns: <aa>, <ab>, <(ab)>, <ac>, <ad>, <af>
      - Having prefix <aa>: <aa>-proj. db, …
      - …
      - Having prefix <af>: <af>-proj. db, …
  - Having prefix <b>: <b>-projected database, …
  - Having prefix <c>, …, <f>: …

Slide 17: Analysis
- PrefixSpan does not need to generate any candidate sequences
- Projected databases keep shrinking
- The major cost of PrefixSpan is the construction of projected databases

Slide 18: PrefixSpan-2
- Step 1. Find length-1 sequential patterns
  - Scan the DB once to find all frequent items in sequences
- Step 2. Construct the triangular matrix M (S-matrix)
  - The S-matrix is filled by scanning the DB a second time
- Step 3. Construct α-projected databases
  - For each length-2 sequential pattern α, construct the α-projected DB
- Step 4. Mine each projected DB recursively

Slide 19: PrefixSpan-2 (Example)

Sequence_id  Sequence
10           <a(abc)(ac)d(cf)>
20           <(ad)c(bc)(ae)>
30           <(ef)(ab)(df)cb>
40           <eg(af)cbc>

min_support = 2
L1: <a>:4, <b>:4, <c>:4, <d>:3, <e>:3, <f>:3

Slide 20: PrefixSpan-2 (Example) (Cont.)

S-matrix for the example database (sequences 10, 20, 30, 40):

     a         b         c         d         e         f
a    2
b    (4,2,2)   1
c    (4,2,1)   (3,3,2)   3
d    (2,1,1)   (2,2,0)   (1,3,0)   0
e    (1,2,1)   (1,2,0)   (1,2,0)   (1,1,0)   0
f    (2,1,1)   (2,2,0)   (1,2,1)   (1,1,1)   (2,0,1)   1

- A diagonal cell for item x holds the support of <xx>
- An off-diagonal cell for column x and row y holds three counters: the supports of <xy>, <yx> and <(xy)>
- Callouts on the slide mark four patterns that happen 4 times, 3 times, 1 time and 1 time (the pattern labels did not survive in this transcript)
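A sketch of how the S-matrix counters can be computed, assuming the four-sequence running example of the PrefixSpan paper. For clarity it recounts each length-2 pattern with a containment test, whereas the slides describe filling the matrix in a single additional scan. The names `contains` and `s_matrix` are hypothetical.

```python
from itertools import combinations

# Assumed example database (SIDs 10, 20, 30, 40 in order).
SDB = [
    [{"a"}, {"a", "b", "c"}, {"a", "c"}, {"d"}, {"c", "f"}],
    [{"a", "d"}, {"c"}, {"b", "c"}, {"a", "e"}],
    [{"e", "f"}, {"a", "b"}, {"d", "f"}, {"c"}, {"b"}],
    [{"e"}, {"g"}, {"a", "f"}, {"c"}, {"b"}, {"c"}],
]

def contains(sequence, pattern):
    """True if `pattern` (a list of item sets) is a subsequence of `sequence`."""
    pos = 0
    for element in pattern:
        while pos < len(sequence) and not element <= sequence[pos]:
            pos += 1
        if pos == len(sequence):
            return False
        pos += 1
    return True

def s_matrix(db, items):
    """Diagonal cell (x, x): support of <xx>.
    Cell (x, y) with x < y: supports of (<xy>, <yx>, <(xy)>)."""
    def sup(pattern):
        return sum(contains(seq, pattern) for seq in db)

    matrix = {(x, x): sup([{x}, {x}]) for x in items}
    for x, y in combinations(items, 2):
        matrix[(x, y)] = (sup([{x}, {y}]), sup([{y}, {x}]), sup([{x, y}]))
    return matrix

M = s_matrix(SDB, "abcdef")
print(M[("a", "a")])  # 2
print(M[("a", "b")])  # (4, 2, 2)
print(M[("e", "f")])  # (2, 0, 1)
```

Every counter that reaches min_sup identifies a frequent length-2 pattern, so the length-2 patterns are read off the matrix without building the length-1 projected databases.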

Slide 21: PrefixSpan-2 (Example) (Cont.)
<ab>-projected database: <(_c)(ac)d(cf)>, <(_c)a>, <c>
Local length-1 sequential patterns: <a>, <c>, <(_c)>

S-matrix of the <ab>-projected database:

       a         c         (_c)
a      0
c      (1,0,1)   1
(_c)   (∅,2,∅)   (∅,1,∅)   ∅

- The counter 2 in cell (∅,2,∅) leads to pattern <a(bc)a>
- There is no hope of forming (_ac), so there is no need to count it

Slide 22: Benefits of Bi-level Projection
- Far fewer projections
  - In this example there are 53 patterns
  - 53 level-by-level projections
  - 22 bi-level projections

Slide 23: Speed-up by Pseudo-Projection
- Major cost of PrefixSpan: projection
  - Postfixes of sequences often appear repeatedly in recursive projected databases
- When the (projected) database can be held in main memory, use pointers to form projections
  - Pointer to the sequence
  - Offset of the postfix
- Example: s = <a(abc)(ac)d(cf)>
  - s|<a>: (pointer to s, 2), i.e. <(abc)(ac)d(cf)>
  - s|<ab>: (pointer to s, 4), i.e. <(_c)(ac)d(cf)>
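A minimal sketch of pseudo-projection, assuming the offset is the 1-based position of the postfix's first item when the sequence is read item by item. That reading reproduces the offsets 2 and 4 on the slide for s = <a(abc)(ac)d(cf)>. The function names are hypothetical.

```python
# s = <a(abc)(ac)d(cf)> as a list of elements (sorted item tuples)
S = [("a",), ("a", "b", "c"), ("a", "c"), ("d",), ("c", "f")]

def pseudo_project(sequence, prefix):
    """Return the 1-based item offset at which the postfix of `sequence`
    w.r.t. `prefix` starts, or None if `prefix` does not occur.
    Only this offset is stored; the postfix itself is never copied."""
    pos = -1
    for element in prefix:
        pos += 1
        while pos < len(sequence) and not set(element) <= set(sequence[pos]):
            pos += 1
        if pos == len(sequence):
            return None
    items_before = sum(len(e) for e in sequence[:pos])
    # +1 for 1-based numbering, +1 to step past the prefix's last item
    return items_before + sequence[pos].index(max(prefix[-1])) + 2

def read_postfix(sequence, offset):
    """Materialize the postfix items on demand from the stored offset."""
    flat = [x for element in sequence for x in element]
    return flat[offset - 1:]

print(pseudo_project(S, [("a",)]))          # 2
print(pseudo_project(S, [("a",), ("b",)]))  # 4
print(read_postfix(S, 4))                   # ['c', 'a', 'c', 'd', 'c', 'f']
```

A pseudo-projected database is then just a list of (sequence pointer, offset) pairs, which avoids copying the same postfix repeatedly across recursion levels.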
