PrefixSpan: Mining Sequential Patterns Efficiently by Prefix-Projected Pattern Growth Jiawei Han, Jian Pei, Helen Pinto, Behzad Mortazavi-Asl, Qiming Chen,

Presentation transcript:

PrefixSpan: Mining Sequential Patterns Efficiently by Prefix-Projected Pattern Growth
Jiawei Han, Jian Pei, Helen Pinto, Behzad Mortazavi-Asl, Qiming Chen, Umeshwar Dayal, Mei-Chun Hsu
17th International Conference on Data Engineering (ICDE), Heidelberg, Germany, April 2001

Mining Sequential Patterns by Pattern-Growth: The PrefixSpan Approach
IEEE Transactions on Knowledge and Data Engineering (TKDE), Vol. 16, No. 10, October 2004

Advisor: Professor Hsin-Hsi Chen
Reporter: Clarence Min-Chi Hsieh
Natural Language Processing Laboratory, Dept. of Computer Science and Info. Engineering, NTU
2005/07/19

Reporter: Clarence Min-Chi Hsieh. Copyright © Natural Language Processing Lab., NTU, 2005.

Slide 2: Outline
Abstract
Introduction
PrefixSpan
Performance
Conclusions

Slide 3: Abstract
Sequential pattern mining is a difficult problem, since one may need to examine a combinatorially explosive number of possible subsequence patterns.
The general idea of the method is to integrate the mining of frequent sequences with that of frequent patterns, and to use projected sequence databases to confine the search and the growth of subsequence fragments.

Slide 4: Introduction
An Apriori-like algorithm generates a huge set of candidate sequences
– With 1,000 frequent sequences of length 1, there are 1000×1000 + (1000×999)/2 = 1,499,500 length-2 candidate sequences
It needs many database scans during mining
– For the sequential pattern {(abc)(abc)(abc)(abc)(abc)}, an Apriori-based method must scan the database at least 15 times
It has difficulty mining long sequential patterns
– Suppose there is only a single sequence of length 100 and min_sup = 1
– Length-1 candidate sequences: 100; length-2: 14,950; …
– Total = 2^100 − 1 ≈ 10^30
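The candidate counts on this slide can be checked with a few lines of arithmetic. The sketch below is only a sanity check; the function name `length2_candidates` is an illustrative choice, not something from the slides.

```python
from math import comb

def length2_candidates(n: int) -> int:
    """Length-2 candidates built from n frequent items.

    n * n ordered pairs <xy> (x and y in separate elements, x == y allowed)
    plus C(n, 2) unordered pairs <(xy)> (both items in one element).
    """
    return n * n + comb(n, 2)

if __name__ == "__main__":
    print(length2_candidates(1000))  # 1499500
    print(length2_candidates(100))   # 14950
    # Total for the single length-100 sequence, as stated on the slide.
    print(f"{2**100 - 1:.3e}")       # on the order of 10^30
```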

Slide 5: Introduction (Cont.)
Sequence, Elements, Subsequence and Sequential Pattern
A sequence database:
SID   Sequence
10    <a(abc)(ac)d(cf)>
20    <(ad)c(bc)(ae)>
30    <(ef)(ab)(df)cb>
40    <eg(af)cbc>
A sequence: <(ef)(ab)(df)cb>
– Each parenthesized group is an element; items within an element are listed alphabetically
<a(bc)dc> is a subsequence of <a(abc)(ac)d(cf)>
Given support threshold min_sup = 2, <(ab)c> is a sequential pattern
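The definitions of subsequence and support can be illustrated with a minimal sketch. It assumes the four-sequence running example of the PrefixSpan paper is the database this slide refers to, and it encodes each element as an alphabetically sorted string of single-character items; the names `is_subsequence` and `support` are illustrative.

```python
# Assumed running example; each sequence is a list of elements,
# each element a sorted string of items, e.g. "abc" stands for (abc).
DB = [
    ["a", "abc", "ac", "d", "cf"],     # SID 10
    ["ad", "c", "bc", "ae"],           # SID 20
    ["ef", "ab", "df", "c", "b"],      # SID 30
    ["e", "g", "af", "c", "b", "c"],   # SID 40
]

def is_subsequence(pattern, sequence):
    """True if every pattern element is contained in a distinct element of
    `sequence`, in order. Greedy earliest matching is sufficient here."""
    pos = 0
    for p in pattern:
        while pos < len(sequence) and not set(p) <= set(sequence[pos]):
            pos += 1
        if pos == len(sequence):
            return False
        pos += 1  # the next pattern element must match a later element
    return True

def support(pattern, db):
    """Number of sequences in db that contain the pattern."""
    return sum(is_subsequence(pattern, s) for s in db)

if __name__ == "__main__":
    print(is_subsequence(["a", "bc", "d", "c"], DB[0]))  # True
    print(support(["ab", "c"], DB))                      # 2
```

With min_sup = 2, the pattern `["ab", "c"]`, written <(ab)c>, therefore qualifies as a sequential pattern.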

Slide 6: PrefixSpan
PrefixSpan-1
– Single-level projection
PrefixSpan-2
– Bi-level projection
– Uses an S-matrix
PrefixSpan using pseudo-projection

Slide 7: PrefixSpan (Cont.)
Definition – Prefix and Postfix (Projection)
<a>, <aa>, <a(ab)> and <a(abc)> are prefixes of sequence <a(abc)(ac)d(cf)>
Given sequence <a(abc)(ac)d(cf)>:
Prefix   Postfix / Projection
<a>      <(abc)(ac)d(cf)>
<aa>     <(_bc)(ac)d(cf)>
<ab>     <(_c)(ac)d(cf)>
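A small sketch of prefix projection may make the definition concrete. It assumes elements are encoded as sorted strings of single-character items and that the example sequence is <a(abc)(ac)d(cf)> from the PrefixSpan paper; the function name `project` and the `"_"` marker for a partial element are illustrative conventions.

```python
def project(sequence, prefix):
    """Return the postfix of `sequence` with respect to a non-empty `prefix`,
    or None if the prefix does not occur.

    The prefix is matched as early as possible. Items left over in the element
    where the prefix ends (those sorted after the prefix's last item) are kept
    as a leading partial element written "_xyz".
    """
    pos = 0
    for p in prefix:
        while pos < len(sequence) and not set(p) <= set(sequence[pos]):
            pos += 1
        if pos == len(sequence):
            return None
        pos += 1
    last_element = sequence[pos - 1]
    leftover = "".join(x for x in last_element if x > max(prefix[-1]))
    head = ["_" + leftover] if leftover else []
    return head + list(sequence[pos:])

if __name__ == "__main__":
    s = ["a", "abc", "ac", "d", "cf"]   # <a(abc)(ac)d(cf)>
    print(project(s, ["a"]))            # ['abc', 'ac', 'd', 'cf']
    print(project(s, ["a", "a"]))       # ['_bc', 'ac', 'd', 'cf']
    print(project(s, ["a", "b"]))       # ['_c', 'ac', 'd', 'cf']
```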

Slide 8: PrefixSpan-1
Step 1. Find length-1 sequential patterns
– Scan the DB once to find all frequent items in the sequences
Step 2. Divide the search space
– The complete set of sequential patterns is partitioned into subsets according to their prefixes, one subset per length-1 sequential pattern
Step 3. Find subsets of sequential patterns
– Each subset is mined by constructing the corresponding projected database and mining it recursively
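The three steps can be sketched as a short recursive routine. This is a simplified illustration rather than the authors' implementation: instead of copying postfixes, each projected database is kept as a list of (sequence id, position) pairs, and the database is assumed to be the four-sequence running example from the PrefixSpan paper with elements encoded as sorted strings. The names `prefixspan`, `mine`, `grow` and `first_match` are hypothetical.

```python
from collections import Counter

# Assumed running example (SIDs 10, 20, 30, 40); "abc" stands for element (abc).
DB = [
    ["a", "abc", "ac", "d", "cf"],
    ["ad", "c", "bc", "ae"],
    ["ef", "ab", "df", "c", "b"],
    ["e", "g", "af", "c", "b", "c"],
]

def prefixspan(db, min_sup):
    """Return {pattern: support}; a pattern is a tuple of elements."""
    patterns = {}

    def first_match(seq, start, needed):
        # Index of the first element at or after `start` containing all `needed` items.
        for j in range(start, len(seq)):
            if needed <= set(seq[j]):
                return j
        return None

    def mine(prefix, proj):
        # proj holds (sid, pos): pos is where the earliest match of prefix ends.
        last = set(prefix[-1]) if prefix else set()
        s_count, i_count = Counter(), Counter()
        for sid, pos in proj:
            seq = db[sid]
            # Items that can start a new element after the prefix.
            s_items = {x for el in seq[pos + 1:] for x in el}
            # Items that can join the prefix's last element, written (_x).
            i_items = set()
            if prefix:
                for el in seq[pos:]:
                    if last <= set(el):
                        i_items |= {x for x in el if x > max(last)}
            s_count.update(s_items)
            i_count.update(i_items)
        for x in sorted(s_count):
            if s_count[x] >= min_sup:
                grow(prefix + (x,), proj, {x}, 1)
        for x in sorted(i_count):
            if i_count[x] >= min_sup:
                grow(prefix[:-1] + (prefix[-1] + x,), proj, last | {x}, 0)

    def grow(new_prefix, proj, needed, shift):
        # Build the projected database of new_prefix, record it, then recurse.
        new_proj = []
        for sid, pos in proj:
            j = first_match(db[sid], pos + shift, needed)
            if j is not None:
                new_proj.append((sid, j))
        patterns[new_prefix] = len(new_proj)
        mine(new_prefix, new_proj)

    mine((), [(sid, -1) for sid in range(len(db))])
    return patterns

if __name__ == "__main__":
    found = prefixspan(DB, min_sup=2)
    print(len(found))                 # 53
    print(found[("a", "bc", "a")])    # 2, i.e. <a(bc)a>
```

On this database with min_sup = 2 the sketch finds 53 patterns, matching the count quoted later in the slides. No candidate sequences are generated: every extension that is counted actually occurs in some projected sequence.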

Slide 9: PrefixSpan-1 (Example)
min_support = 2
Sequence_id   Sequence
10            <a(abc)(ac)d(cf)>
20            <(ad)c(bc)(ae)>
30            <(ef)(ab)(df)cb>
40            <eg(af)cbc>
L1: <a>:4, <b>:4, <c>:4, <d>:3, <e>:3, <f>:3

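Slide 10: PrefixSpan-1 (Example) (Cont.)
Prefix   Projected (Postfix) Database
<a>      <(abc)(ac)d(cf)>, <(_d)c(bc)(ae)>, <(_b)(df)cb>, <(_f)cbc>
<b>      <(_c)(ac)d(cf)>, <(_c)(ae)>, <(df)cb>, <c>
<c>      <(ac)d(cf)>, <(bc)(ae)>, <b>, <bc>
<d>      <(cf)>, <c(bc)(ae)>, <(_f)cb>
<e>      <(_f)(ab)(df)cb>, <(af)cbc>
<f>      <(ab)(df)cb>, <cbc>
L1: <a>:4, <b>:4, <c>:4, <d>:3, <e>:3, <f>:3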

Slide 11: PrefixSpan-1 (Example) (Cont.)
Scanning the <a>-projected database once:
a:2, b:4, c:4, d:2, e:1, f:2
(_b):2, (_c):1, (_d):1, (_f):1
– L2: <aa>:2, <ab>:4, <(ab)>:2, <ac>:4, <ad>:2, <af>:2

Slide 12: PrefixSpan-1 (Example) (Cont.)
Prefix   Projected (Postfix) Database
<aa>     <(_bc)(ac)d(cf)>, <(_e)>
<ab>     <(_c)(ac)d(cf)>, <(_c)a>, <c>
<(ab)>   <(_c)(ac)d(cf)>, <(df)cb>
<ac>     <(ac)d(cf)>, <(bc)a>, <b>, <bc>
<ad>     <(cf)>, <(_f)cb>
<af>     <cb>

Slide 13: PrefixSpan-1 (Example) (Cont.)
<ab>-projected database: <(_c)(ac)d(cf)>, <(_c)a>, <c>
Scanning the <ab>-projected database once:
a:2, c:2, d:1, f:1, (_c):2
– L3: <aba>:2, <abc>:2, <a(bc)>:2

Slide 14: PrefixSpan-1 (Example) (Cont.)
Prefix    Projected (Postfix) Database
<a(bc)>   <(ac)d(cf)>, <a>
Scanning the <a(bc)>-projected database once:
a:2, c:1, d:1, f:1
– L4: <a(bc)a>:2

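Slide 15: PrefixSpan-1 (Example) (Cont.)
Prefix   Sequential Patterns
<a>      <a>, <aa>, <ab>, <a(bc)>, <a(bc)a>, <aba>, <abc>, <(ab)>, <(ab)c>, <(ab)d>, <(ab)f>, <(ab)dc>, <ac>, <aca>, <acb>, <acc>, <ad>, <adc>, <af>
<b>      <b>, <ba>, <bc>, <(bc)>, <(bc)a>, <bd>, <bdc>, <bf>
<c>      <c>, <ca>, <cb>, <cc>
<d>      <d>, <db>, <dc>, <dcb>
<e>      <e>, <ea>, <eab>, <eac>, <eacb>, <eb>, <ebc>, <ec>, <ecb>, <ef>, <efb>, <efc>, <efcb>
<f>      <f>, <fb>, <fbc>, <fc>, <fcb>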

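Slide 16: Completeness of PrefixSpan
SDB
SID   Sequence
10    <a(abc)(ac)d(cf)>
20    <(ad)c(bc)(ae)>
30    <(ef)(ab)(df)cb>
40    <eg(af)cbc>
Length-1 sequential patterns: <a>, <b>, <c>, <d>, <e>, <f>
– Having prefix <a>: <a>-projected database
  <(abc)(ac)d(cf)>, <(_d)c(bc)(ae)>, <(_b)(df)cb>, <(_f)cbc>
  Length-2 sequential patterns: <aa>, <ab>, <(ab)>, <ac>, <ad>, <af>
  – Having prefix <aa>: <aa>-proj. db …
  – Having prefix <af>: <af>-proj. db …
– Having prefix <b>: <b>-projected database …
– Having prefix <c>, …, <f>: …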

Slide 17: Analysis
PrefixSpan does not need to generate any candidate sequences
Projected databases keep shrinking
The major cost of PrefixSpan is the construction of projected databases

Slide 18: PrefixSpan-2
Step 1. Find length-1 sequential patterns
– Scan the DB once to find all frequent items in the sequences
Step 2. Construct the triangular matrix M (S-matrix)
– The S-matrix is filled in by scanning the DB a second time
Step 3. Construct the α-projected databases
– For each length-2 sequential pattern α, construct the α-projected DB
Step 4. Mine each projected DB recursively

Slide 19: PrefixSpan-2 (Example)
min_support = 2
Sequence_id   Sequence
10            <a(abc)(ac)d(cf)>
20            <(ad)c(bc)(ae)>
30            <(ef)(ab)(df)cb>
40            <eg(af)cbc>
L1: <a>:4, <b>:4, <c>:4, <d>:3, <e>:3, <f>:3

Slide 20: PrefixSpan-2 (Example) (Cont.)
S-matrix
     a         b         c         d         e         f
a    2
b    (4,2,2)   1
c    (4,2,1)   (3,3,2)   3
d    (2,1,1)   (2,2,0)   (1,3,0)   0
e    (1,2,1)   (1,2,0)   (1,2,0)   (1,1,0)   0
f    (2,1,1)   (2,2,0)   (1,2,1)   (1,1,1)   (2,0,1)   1
– The cell in row y, column x holds (support of <xy>, support of <yx>, support of <(xy)>); for example, cell (b, a) = (4,2,2) means <ab> happens 4 times, <ba> 2 times, and <(ab)> 2 times
– The diagonal cell for item x holds the support of <xx>
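The S-matrix entries can be reproduced with a brute-force sketch. Note the simplification: the slides fill the matrix in a single second scan of the database, whereas this illustration simply recounts the support of every length-2 pattern. The database is assumed to be the running example of the PrefixSpan paper, and `contains`, `support` and `s_matrix` are hypothetical helper names.

```python
from itertools import combinations

# Assumed running example; "abc" stands for element (abc).
DB = [
    ["a", "abc", "ac", "d", "cf"],
    ["ad", "c", "bc", "ae"],
    ["ef", "ab", "df", "c", "b"],
    ["e", "g", "af", "c", "b", "c"],
]

def contains(pattern, sequence):
    """Greedy check that `pattern` is a subsequence of `sequence`."""
    pos = 0
    for p in pattern:
        while pos < len(sequence) and not set(p) <= set(sequence[pos]):
            pos += 1
        if pos == len(sequence):
            return False
        pos += 1
    return True

def support(pattern, db):
    return sum(contains(pattern, s) for s in db)

def s_matrix(db, items):
    """Return (diag, cells): diag[x] is the support of <xx>, and
    cells[(x, y)] with x < y is (supp <xy>, supp <yx>, supp <(xy)>)."""
    diag = {x: support([x, x], db) for x in items}
    cells = {}
    for x, y in combinations(sorted(items), 2):
        cells[(x, y)] = (
            support([x, y], db),   # x in an earlier element than y
            support([y, x], db),   # y in an earlier element than x
            support([x + y], db),  # x and y in the same element
        )
    return diag, cells

if __name__ == "__main__":
    diag, cells = s_matrix(DB, "abcdef")
    print(cells[("a", "b")])  # (4, 2, 2)
    print(diag["c"])          # 3
```

Any triple entry or diagonal value of at least min_sup identifies a length-2 sequential pattern whose projected database is worth building in Step 3.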

Slide 21: PrefixSpan-2 (Example) (Cont.)
S-matrix
     a         b         c         d         e         f
a    2
b    (4,2,2)   1
c    (4,2,1)   (3,3,2)   3
d    (2,1,1)   (2,2,0)   (1,3,0)   0
e    (1,2,1)   (1,2,0)   (1,2,0)   (1,1,0)   0
f    (2,1,1)   (2,2,0)   (1,2,1)   (1,1,1)   (2,0,1)   1
<ab>-projected database
Local length-1 sequential patterns: <a>, <c>, <(_c)>
S-matrix of the <ab>-projected database:
        a          c          (_c)
a       0
c       (1,0,1)    1
(_c)    (∅,2,∅)    (∅,1,∅)    ∅
– The entry (∅,2,∅) in row (_c), column a leads to pattern <a(bc)a>
– There is no hope of forming (_ac), so there is no need to count it

Slide 22: Benefits of Bi-level Projection
Far fewer projections
– In this example there are 53 patterns
– Level-by-level projection needs 53 projections
– Bi-level projection needs 22 projections

Slide 23: Speed-up by Pseudo-Projection
Major cost of PrefixSpan: projection
– Postfixes of sequences often appear repeatedly in recursive projected databases
When the (projected) database can be held in main memory, use pointers to form projections
– Pointer to the sequence
– Offset of the postfix
Example: s = <a(abc)(ac)d(cf)>
– Prefix <a>: s|<a> = (pointer to s, 2), i.e. <(abc)(ac)d(cf)>
– Prefix <ab>: s|<ab> = (pointer to s, 4), i.e. <(_c)(ac)d(cf)>
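Pseudo-projection can be sketched as follows: rather than copying a postfix, a projection is stored as a reference to the original sequence plus a 1-based item offset. The example sequence <a(abc)(ac)d(cf)> is assumed from the PrefixSpan paper, elements are encoded as sorted strings, and `pseudo_project` and `postfix_items` are hypothetical names.

```python
def pseudo_project(sequence, prefix):
    """Return the 1-based item offset at which the postfix of `sequence`
    with respect to a non-empty `prefix` starts, or None if the prefix
    does not occur. No part of the sequence is copied."""
    pos = 0
    for p in prefix:
        while pos < len(sequence) and not set(p) <= set(sequence[pos]):
            pos += 1
        if pos == len(sequence):
            return None
        pos += 1
    j = pos - 1                                    # element where the prefix ends
    items_before = sum(len(el) for el in sequence[:j])
    last_item_index = sequence[j].index(max(prefix[-1]))
    return items_before + last_item_index + 2      # item right after the prefix

def postfix_items(sequence, offset):
    """Materialize the flattened postfix only when it is actually needed."""
    return "".join(sequence)[offset - 1:]

if __name__ == "__main__":
    s = ["a", "abc", "ac", "d", "cf"]              # <a(abc)(ac)d(cf)>
    projected = [(s, pseudo_project(s, ["a"])),
                 (s, pseudo_project(s, ["a", "b"]))]
    for seq, off in projected:
        print(off, postfix_items(seq, off))
    # 2 abcacdcf
    # 4 cacdcf
```

Each projected entry costs one reference and one integer, so a postfix that recurs in many recursive projections is never duplicated in memory.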

Slides 24-26: Performance [performance charts; figures not captured in the transcript]

Slide 27: Conclusions
PrefixSpan is a novel, scalable, and efficient sequential pattern mining method