Pattern-Growth Methods for Sequential Pattern Mining Iris Zhang 2003-5-14.


Pattern-Growth Methods for Sequential Pattern Mining Iris Zhang

Outline
  Sequential pattern mining
  Apriori-like methods
    – GSP
  Pattern-growth methods
    – FreeSpan
    – PrefixSpan
  Performance analysis
  Conclusions

Motivation
  Sequential pattern mining: finding time-related frequent patterns
  Most data and applications are time-related
    – Customer shopping patterns, telephone calling patterns
    – Natural disasters (e.g., earthquakes, hurricanes)
    – Disease and treatment
    – Stock market fluctuation
    – Weblog click stream analysis
    – DNA sequence analysis

Concepts
  Let I = {i1, i2, …, in} be the set of all items
  An itemset is a subset of items
  A sequence is an ordered list of itemsets; the itemsets are called elements
  The number of items in a sequence is its length
    – e.g. ⟨…⟩
  A sequence α = ⟨a1 a2 … an⟩ is called a subsequence of β = ⟨b1 b2 … bm⟩, denoted α ⊑ β, if there exist integers 1 ≤ j1 < j2 < … < jn ≤ m such that a1 ⊆ bj1, a2 ⊆ bj2, …, an ⊆ bjn
    – e.g. ⟨…⟩ is a subsequence of ⟨…⟩
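The subsequence test defined on this slide can be sketched in Python. This is a minimal sketch, assuming a sequence is represented as a list of sets (one set per element); the function name `is_subsequence` and the sample sequence are illustrative and do not come from the slides.

```python
def is_subsequence(alpha, beta):
    """Return True if alpha is a subsequence of beta.

    Both arguments are lists of sets. Each element of alpha is matched
    greedily to the earliest remaining element of beta that contains it,
    which yields indices j1 < j2 < ... < jn as in the definition.
    """
    j = 0
    for a in alpha:
        # Skip elements of beta that do not contain the itemset a.
        while j < len(beta) and not set(a) <= set(beta[j]):
            j += 1
        if j == len(beta):
            return False
        j += 1  # the next element of alpha must match strictly later
    return True


# Illustrative sequence <a(abc)(ac)d(cf)> (an assumption, not from the slides).
s = [{"a"}, {"a", "b", "c"}, {"a", "c"}, {"d"}, {"c", "f"}]
print(is_subsequence([{"a"}, {"b", "c"}, {"d"}, {"f"}], s))
print(is_subsequence([{"d"}, {"a"}], s))
```

Greedy matching is sufficient here: taking the earliest possible match for each element never rules out a later match.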

Concepts (cont'd)
  A sequence database is a set of tuples ⟨sid, s⟩, where sid is a sequence_id and s is a sequence
  A tuple ⟨sid, s⟩ is said to contain a sequence α if α is a subsequence of s
  The support of α is the number of tuples in the database containing α
  If the support of α is no less than a threshold min_sup, α is called a sequential pattern
    – e.g. ⟨…⟩ is a sequential pattern given support threshold min_sup = 2
  [Table: SID, sequence]
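Support counting can be sketched as follows. The table on this slide did not survive in the transcript, so the database `DB` below is an assumption: it is the four-sequence running example of the PrefixSpan paper listed in the References, whose item counts agree with the f_list given later in the deck. The function names are illustrative.

```python
# Assumed example database (SID -> sequence); each element is a set of items.
DB = {
    10: [{"a"}, {"a", "b", "c"}, {"a", "c"}, {"d"}, {"c", "f"}],
    20: [{"a", "d"}, {"c"}, {"b", "c"}, {"a", "e"}],
    30: [{"e", "f"}, {"a", "b"}, {"d", "f"}, {"c"}, {"b"}],
    40: [{"e"}, {"g"}, {"a", "f"}, {"c"}, {"b"}, {"c"}],
}


def contains(s, alpha):
    """True if sequence s contains alpha as a subsequence (greedy match)."""
    j = 0
    for a in alpha:
        while j < len(s) and not a <= s[j]:
            j += 1
        if j == len(s):
            return False
        j += 1
    return True


def support(db, alpha):
    """Number of tuples <sid, s> in db whose sequence s contains alpha."""
    return sum(1 for s in db.values() if contains(s, alpha))


def is_sequential_pattern(db, alpha, min_sup):
    """alpha is a sequential pattern if its support is at least min_sup."""
    return support(db, alpha) >= min_sup


print(support(DB, [{"a", "b"}, {"c"}]))          # support of <(ab)c>
print(is_sequential_pattern(DB, [{"g"}], 2))     # g occurs in one sequence only
```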

Problem definition
  Given a sequence database and a min_sup threshold, the problem of sequential pattern mining is to find the complete set of sequential patterns in the database

Apriori-like methods
  Apriori property: if a sequence S is not frequent, then every super-sequence of S is not frequent
    – e.g. if ⟨…⟩ is infrequent, so are ⟨…⟩, ⟨…⟩
  GSP (Generalized Sequential Pattern) algorithm
    – Level by level, do:
        Generate candidate sequences
        Use the Apriori property to prune candidates
        Scan the database to collect support counts

GSP Mining Process
  Scan   Candidates   Sequential patterns   Candidates not in DB at all
  1st    8            6 (length 1)
  2nd    51           19 (length 2)         10
  3rd    46           19 (length 3)         20
  4th    8            6 (length 4)
  5th    1            1 (length 5)
  Legend: candidates that cannot pass the support threshold; candidates not in the DB at all

Bottlenecks of Apriori-Like Methods
  Potentially huge set of candidate sequences
    – 1,000 frequent length-1 sequences generate 1,000 × 1,000 + (1,000 × 999)/2 = 1,499,500 length-2 candidates
  Multiple scans of the database
  Difficulties in mining long sequential patterns
    – Exponential number of short candidates
    – A length-100 sequential pattern needs 2^100 − 1 ≈ 10^30 candidate sequences
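The candidate counts behind this bottleneck can be checked with a few lines of arithmetic. This is a sketch under the usual GSP counting assumption: from n frequent items, every ordered pair ⟨x y⟩ (including x = y) and every unordered pair ⟨(x y)⟩ is a length-2 candidate. The function names are illustrative.

```python
from math import comb


def gsp_length2_candidates(n):
    """Length-2 candidates generated from n frequent length-1 sequences.

    n * n sequences of the form <x y> plus n * (n - 1) / 2 sequences
    of the form <(x y)>, where both items share one element.
    """
    return n * n + n * (n - 1) // 2


def short_candidates_for_long_pattern(k):
    """Sub-patterns of a pattern with k single-item elements: sum of C(k, i)."""
    return sum(comb(k, i) for i in range(1, k + 1))


print(gsp_length2_candidates(1000))   # 1499500
print(gsp_length2_candidates(6))      # 51
print(short_candidates_for_long_pattern(100) == 2**100 - 1)  # True
```

With 6 frequent items the formula gives 51, which matches the number of candidates in the 2nd scan on the GSP Mining Process slide.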

Pattern-growth methods
  A divide-and-conquer approach
    – Recursively project a sequence database into a set of smaller databases
    – Mine each projected database to find the subset of patterns
  Algorithms
    – FreeSpan: Frequent Pattern-Projected Sequential Pattern Mining
    – PrefixSpan: Prefix-Projected Sequential Pattern Mining

FreeSpan
  Example: given a sequence database S and min_support = 2
  Step 1: find length-1 sequential patterns and list them in support-descending order
    – f_list = a:4, b:4, c:4, d:3, e:3, f:3
  [Table: SID, Sequence]
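Step 1 can be sketched as a single counting pass. The database table is missing from the transcript, so `DB` below is an assumption: the running example of the PrefixSpan paper listed in the References, which reproduces the f_list on this slide. The tie-breaking rule (alphabetical order among equal supports) is also an assumption.

```python
from collections import Counter

# Assumed example database; each sequence is a list of elements (tuples of items).
DB = [
    [("a",), ("a", "b", "c"), ("a", "c"), ("d",), ("c", "f")],
    [("a", "d"), ("c",), ("b", "c"), ("a", "e")],
    [("e", "f"), ("a", "b"), ("d", "f"), ("c",), ("b",)],
    [("e",), ("g",), ("a", "f"), ("c",), ("b",), ("c",)],
]


def f_list(db, min_sup):
    """Frequent items with their supports, in support-descending order."""
    counts = Counter()
    for seq in db:
        # An item counts once per sequence, however often it occurs in it.
        counts.update({item for element in seq for item in element})
    frequent = [(item, n) for item, n in counts.items() if n >= min_sup]
    # Assumed tie-break: alphabetical order among items of equal support.
    return sorted(frequent, key=lambda pair: (-pair[1], pair[0]))


print(f_list(DB, 2))
```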

FreeSpan (cont'd)
  Step 2: divide the search space. The complete set of sequential patterns can be partitioned into 6 disjoint subsets:
    – those containing only item a
    – those containing item b but no items after b in f_list
    – those containing item c but no items after c in f_list
    – those containing item d but no items after d in f_list
    – those containing item e but no items after e in f_list
    – those containing item f
  Find the subsets of sequential patterns: they can be mined by constructing projected databases and mining each recursively
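The partition in Step 2 assigns every pattern to exactly one subset: the one named after the pattern's item that comes last in f_list. A minimal sketch, assuming patterns are lists of tuples of items and that the helper name `freespan_subset` is purely illustrative:

```python
F_LIST = ["a", "b", "c", "d", "e", "f"]  # f_list from Step 1


def freespan_subset(pattern, f_list=F_LIST):
    """Return the item that names the subset a pattern belongs to.

    A pattern falls into the subset of item x when it contains x but no
    item after x in f_list, i.e. x is its last item in f_list order.
    Items outside f_list raise KeyError in this sketch.
    """
    rank = {item: i for i, item in enumerate(f_list)}
    items = [item for element in pattern for item in element]
    return max(items, key=rank.__getitem__)


print(freespan_subset([("a",), ("a",)]))        # contains only item a
print(freespan_subset([("a", "b"), ("c",)]))    # contains c, nothing after c
```

Because each pattern has exactly one last item in f_list order, the six subsets are disjoint and together cover all patterns.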

FreeSpan (cont'd)
  Finding sequential patterns containing item b but no items after b in f_list
    – ⟨b⟩-projected database: ⟨…⟩, ⟨…⟩, ⟨…⟩, ⟨…⟩
    – Find all the length-2 sequential patterns containing item b but no items after b in f_list: ⟨…⟩:4, ⟨…⟩:2, ⟨…⟩:2
    – Further partition and mining
  [Table: SID, Sequence]

From FreeSpan to PrefixSpan
  FreeSpan
    – Projection-based: no candidate sequence needs to be generated
    – But projection can be performed at any point in the sequence, and the projected sequences may not shrink much. For example, the f-projected database is the same size as the original sequence database
  PrefixSpan
    – Projection-based
    – But only prefix-based projection: fewer projections and quickly shrinking sequences

PrefixSpan-concepts
  Suppose all items in an element are listed alphabetically
  Given sequences α = ⟨e1 e2 … en⟩ and β = ⟨e'1 e'2 … e'm⟩ (m ≤ n)
  Prefix: β is a prefix of α iff (1) e'i = ei for i ≤ m−1, (2) e'm ⊆ em, and (3) all items in (em − e'm) are alphabetically after those in e'm
    – e.g. α = ⟨…⟩, β = ⟨…⟩, β' = ⟨…⟩
  Postfix: the sequence γ = ⟨e''m em+1 … en⟩ is called the postfix of α w.r.t. prefix β, where e''m = (em − e'm); this is denoted α = β · γ
    – e.g. ⟨…⟩ is the postfix of α w.r.t. prefix ⟨…⟩
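The prefix and postfix definitions can be turned into code almost literally. This is a minimal sketch, assuming sequences are lists of tuples with items in alphabetical order and non-empty elements. The names `is_prefix` and `postfix`, the `"_"` marker for a partially consumed element, and the sample sequence are assumptions for illustration.

```python
def is_prefix(beta, alpha):
    """True if beta is a prefix of alpha under conditions (1) to (3)."""
    m = len(beta)
    if m == 0:
        return True
    if m > len(alpha):
        return False
    # (1) the first m-1 elements are identical
    if [set(e) for e in beta[:m - 1]] != [set(e) for e in alpha[:m - 1]]:
        return False
    last_b, last_a = set(beta[m - 1]), set(alpha[m - 1])
    # (2) the last element of beta is contained in the m-th element of alpha
    if not last_b <= last_a:
        return False
    # (3) the leftover items come alphabetically after those in beta's last element
    return all(x > max(last_b) for x in last_a - last_b)


def postfix(alpha, beta):
    """Postfix of alpha w.r.t. prefix beta, or None if beta is not a prefix."""
    if not is_prefix(beta, alpha):
        return None
    m = len(beta)
    if m == 0:
        return list(alpha)
    rest = tuple(sorted(set(alpha[m - 1]) - set(beta[m - 1])))
    head = [("_",) + rest] if rest else []  # "_" marks a partial element
    return head + list(alpha[m:])


# Illustrative sequence <a(abc)(ac)d(cf)> (an assumption, not from the slides).
alpha = [("a",), ("a", "b", "c"), ("a", "c"), ("d",), ("c", "f")]
print(postfix(alpha, [("a",), ("a",)]))
print(is_prefix([("a",), ("b",)], alpha))
```

In this sketch ⟨ab⟩ is not a prefix of the sample sequence: after removing b from (abc), the leftover item a does not come alphabetically after b, so condition (3) fails.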

PrefixSpan-concepts (cont'd)
  Projected database: let α be a sequential pattern in S. The α-projected database, denoted S|α, is the collection of postfixes of sequences in S w.r.t. prefix α
  Support count in projected database: let α be a sequential pattern in S and β be a sequence having prefix α. The support count of β in the α-projected database is the number of sequences γ in S|α such that β ⊑ α · γ
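For a length-1 prefix the projected database can be built with one pass over the data. This is a sketch under stated assumptions: `DB` is the running example of the PrefixSpan paper listed in the References (the slide tables are missing from the transcript), elements are tuples with items in alphabetical order, and `"_"` marks a partially consumed element. It handles single-item prefixes only.

```python
# Assumed example database; each sequence is a list of elements (tuples of items).
DB = [
    [("a",), ("a", "b", "c"), ("a", "c"), ("d",), ("c", "f")],
    [("a", "d"), ("c",), ("b", "c"), ("a", "e")],
    [("e", "f"), ("a", "b"), ("d", "f"), ("c",), ("b",)],
    [("e",), ("g",), ("a", "f"), ("c",), ("b",), ("c",)],
]


def project_on_item(db, item):
    """Postfixes of the sequences in db w.r.t. the length-1 prefix <item>."""
    projected = []
    for seq in db:
        for i, element in enumerate(seq):
            if item in element:
                # Items after `item` in the same element stay, marked as partial.
                rest = tuple(x for x in element if x > item)
                post = ([("_",) + rest] if rest else []) + list(seq[i + 1:])
                if post:
                    projected.append(post)
                break  # only the first occurrence defines the postfix
    return projected


for post in project_on_item(DB, "a"):
    print(post)
```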

PrefixSpan-process
  Step 1: find length-1 sequential patterns
    – ⟨a⟩:4, ⟨b⟩:4, ⟨c⟩:4, ⟨d⟩:3, ⟨e⟩:3, ⟨f⟩:3
  Step 2: divide the search space. The complete set of sequential patterns can be partitioned into 6 subsets:
    – those having prefix ⟨a⟩;
    – …
    – those having prefix ⟨f⟩;
  Find the subsets of sequential patterns: they can be mined by constructing projected databases and mining each recursively
  [Table: SID, Sequence]

PrefixSpan-process (cont'd)
  Finding sequential patterns with prefix ⟨…⟩
    – ⟨…⟩-projected database: ⟨…⟩, ⟨…⟩, ⟨…⟩, ⟨…⟩
    – Find all the length-2 sequential patterns having prefix ⟨…⟩: ⟨…⟩:2, ⟨…⟩:4, ⟨…⟩:2, ⟨…⟩:4, ⟨…⟩:2, ⟨…⟩:2
    – Further partition into 6 subsets
        Having prefix ⟨…⟩;
        …
        Having prefix ⟨…⟩;
  [Table: SID, Sequence]
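The whole process (find frequent extensions, partition by prefix, recurse on the projected database) fits in a short recursive function. The following is a sketch, not the authors' implementation: it keeps (sequence, position) pointers instead of copying postfixes, and `DB` is assumed to be the running example of the PrefixSpan paper listed in the References, since the slide tables are missing from the transcript.

```python
from collections import defaultdict

# Assumed example database; each sequence is a list of elements (sets of items).
DB = [
    [{"a"}, {"a", "b", "c"}, {"a", "c"}, {"d"}, {"c", "f"}],
    [{"a", "d"}, {"c"}, {"b", "c"}, {"a", "e"}],
    [{"e", "f"}, {"a", "b"}, {"d", "f"}, {"c"}, {"b"}],
    [{"e"}, {"g"}, {"a", "f"}, {"c"}, {"b"}, {"c"}],
]


def prefixspan(db, min_sup):
    """Return {pattern: support}; a pattern is a tuple of tuples of items."""
    patterns = {}

    def mine(prefix, pointers):
        # pointers: (sid, pos) pairs, where pos is the earliest element of
        # db[sid] at which the last element of `prefix` can be matched
        # (-1 stands for the empty prefix).
        last = set(prefix[-1]) if prefix else None
        s_ext = defaultdict(dict)  # item -> {sid: pos}, item starts a new element
        i_ext = defaultdict(dict)  # item -> {sid: pos}, item joins the last element
        for sid, pos in pointers:
            seq = db[sid]
            for q in range(pos + 1, len(seq)):
                for item in seq[q]:
                    s_ext[item].setdefault(sid, q)  # keeps the earliest q
            if last is not None:
                top = max(last)
                for q in range(pos, len(seq)):
                    if last <= seq[q]:
                        for item in seq[q]:
                            if item > top:  # grow elements alphabetically
                                i_ext[item].setdefault(sid, q)
        for item in sorted(s_ext):
            if len(s_ext[item]) >= min_sup:
                grown = prefix + ((item,),)
                patterns[grown] = len(s_ext[item])
                mine(grown, list(s_ext[item].items()))
        for item in sorted(i_ext):
            if len(i_ext[item]) >= min_sup:
                grown = prefix[:-1] + (prefix[-1] + (item,),)
                patterns[grown] = len(i_ext[item])
                mine(grown, list(i_ext[item].items()))

    mine((), [(sid, -1) for sid in range(len(db))])
    return patterns


result = prefixspan(DB, 2)
print(len(result))
print(result[(("a",), ("b",))], result[(("a", "b"),)])
```

On the assumed database with min_sup = 2 this finds 53 patterns, the number quoted on the Benefits of Bi-level Projection slide.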

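Completeness of PrefixSpan
  [Figure: the sequence database (SID, sequence) is divided by its length-1 sequential patterns into prefix-projected databases; each projected database yields length-2 sequential patterns and is divided again into further prefix-projected databases, and so on]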

Efficiency of PrefixSpan
  No candidate sequence needs to be generated
  Projected databases keep shrinking
  Major cost of PrefixSpan: constructing projected databases
    – Can be improved by bi-level projections and pseudo-projections

Optimization Techniques in PrefixSpan
  Single-level vs. bi-level projection
    – Bi-level projection with 3-way checking may reduce the number and size of projected databases
  Physical projection vs. pseudo-projection
    – Pseudo-projection may reduce the effort of projection when the projected database fits in main memory

S-matrix for sequence database
  Length-1 sequential patterns: ⟨a⟩, ⟨b⟩, ⟨c⟩, ⟨d⟩, ⟨e⟩, ⟨f⟩
  All length-2 sequential patterns are found in the S-matrix
  S-matrix:
         a          b          c          d          e          f
    a    2
    b    (4, 2, 2)  1
    c    (4, 2, 1)  (3, 3, 2)  3
    d    (2, 1, 1)  (2, 2, 0)  (1, 3, 0)  0
    e    (1, 2, 1)  (1, 2, 0)  (1, 2, 0)  (1, 1, 0)  0
    f    (2, 1, 1)  (2, 2, 0)  (1, 2, 1)  (1, 1, 1)  (2, 0, 1)  1
  Annotations: ⟨…⟩ happens twice; ⟨…⟩ happens 4 times; ⟨…⟩ happens twice; ⟨…⟩ happens once
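The S-matrix can be recomputed from a database by counting three kinds of length-2 sequences per item pair. The sketch below assumes that `DB` is the running example of the PrefixSpan paper listed in the References (the slide table is missing from the transcript), and that an off-diagonal cell in row y, column x holds the supports of ⟨x y⟩, ⟨y x⟩ and ⟨(x y)⟩ in that order. This reading is an interpretation that reproduces the values on the slide.

```python
# Assumed example database; each sequence is a list of elements (sets of items).
DB = [
    [{"a"}, {"a", "b", "c"}, {"a", "c"}, {"d"}, {"c", "f"}],
    [{"a", "d"}, {"c"}, {"b", "c"}, {"a", "e"}],
    [{"e", "f"}, {"a", "b"}, {"d", "f"}, {"c"}, {"b"}],
    [{"e"}, {"g"}, {"a", "f"}, {"c"}, {"b"}, {"c"}],
]


def support(db, alpha):
    """Number of sequences in db containing alpha as a subsequence."""
    def contains(s):
        j = 0
        for a in alpha:
            while j < len(s) and not a <= s[j]:
                j += 1
            if j == len(s):
                return False
            j += 1
        return True
    return sum(1 for s in db if contains(s))


def s_matrix(db, items):
    """Lower-triangular S-matrix over `items` (the frequent items, in order)."""
    matrix = {}
    for i, row in enumerate(items):
        matrix[(row, row)] = support(db, [{row}, {row}])   # <row row>
        for col in items[:i]:
            matrix[(row, col)] = (
                support(db, [{col}, {row}]),   # <col row>
                support(db, [{row}, {col}]),   # <row col>
                support(db, [{col, row}]),     # <(col row)>
            )
    return matrix


m = s_matrix(DB, ["a", "b", "c", "d", "e", "f"])
print(m[("b", "a")], m[("f", "e")], m[("c", "c")])
```

Counting the matrix entries that reach min_support = 2 gives 22 length-2 patterns, which matches the 22 bi-level projections quoted later in the deck.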

S-matrix for ⟨…⟩-projected database
  ⟨…⟩-projected database:
    – ⟨…⟩, ⟨…⟩, ⟨…⟩
  Frequent items: ⟨a⟩, ⟨c⟩, ⟨(_c)⟩
  S-matrix:
            a          c          (_c)
    a       0
    c       (1, 0, 1)  1
    (_c)    (∅, 2, ∅)  (∅, 1, ∅)  ∅
  Annotations: no a(_c), no count; lead to pattern ⟨…⟩
  [Table: SID, Sequence]

Scaling-up by Bi-level Projection
  Partition the search space based on length-2 sequential patterns
  Only form projected databases and pursue recursive mining over bi-level projected databases

Benefits of Bi-level Projection
  More patterns are found in each pass
  Far fewer projections
    – In the example, there are 53 patterns
    – 53 level-by-level projections
    – 22 bi-level projections

3-way Apriori Checking
  Use the Apriori heuristic to prune items in projected databases
         a          b          c          d          e          f
    a    2
    b    (4, 2, 2)  1
    c    (4, 2, 1)  (3, 3, 2)  3
    d    (2, 1, 1)  (2, 2, 0)  (1, 3, 0)  0
    e    (1, 2, 1)  (1, 2, 0)  (1, 2, 0)  (1, 1, 0)  0
    f    (2, 1, 1)  (2, 2, 0)  (1, 2, 1)  (1, 1, 1)  (2, 0, 1)  1
  ⟨…⟩ cannot be a pattern w.r.t. min_support = 2
  Exclude d from the ⟨…⟩-projected database

Pseudo-projection
  Major cost of PrefixSpan: projection
    – Postfixes of sequences often appear repeatedly in recursive projected databases
  When the projected database fits in memory, use pointers to form projections
    – Pointer to the sequence
    – Offset of the postfix
  Example: s = ⟨…⟩; s|⟨…⟩: (pointer to s, 2); s|⟨…⟩: (pointer to s, 4)
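Pseudo-projection replaces a copied postfix by a (pointer, offset) pair. The sketch below is illustrative and rests on assumptions: the sample sequence is the first sequence of the PrefixSpan paper's running example, offsets are 1-based item positions, and only prefixes made of single-item elements are handled. The function names are hypothetical.

```python
def flatten(seq):
    """(element_index, item) for every item of the sequence, in order."""
    return [(i, x) for i, element in enumerate(seq) for x in element]


def pseudo_project(seq, items):
    """Offset (1-based item position) where the postfix w.r.t. <x1 x2 ... xk>
    starts, or None if the prefix does not occur in seq."""
    flat = flatten(seq)
    pos, last_elem = 0, -1
    for x in items:
        for p in range(pos, len(flat)):
            e, y = flat[p]
            if y == x and e > last_elem:  # each item must match a later element
                pos, last_elem = p + 1, e
                break
        else:
            return None
    return pos + 1


def materialize(seq, offset):
    """Rebuild the postfix on demand; '_' marks a partially consumed element."""
    flat = flatten(seq)
    tail = flat[offset - 1:]
    if not tail:
        return []
    partial = offset >= 2 and flat[offset - 2][0] == tail[0][0]
    groups, current = [], None
    for e, x in tail:
        if e != current:
            groups.append([])
            current = e
        groups[-1].append(x)
    out = [tuple(g) for g in groups]
    if partial:
        out[0] = ("_",) + out[0]
    return out


# Assumed sample sequence <a(abc)(ac)d(cf)>.
s = [("a",), ("a", "b", "c"), ("a", "c"), ("d",), ("c", "f")]
print(pseudo_project(s, ["a"]), pseudo_project(s, ["a", "b"]))  # 2 4
print(materialize(s, 4))
```

Only the integer offset is stored per sequence, so recursive projections share the original data instead of copying postfixes.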

Pseudo-Projection vs. Physical Projection
  Pseudo-projection avoids physically copying postfixes
    – Efficient when the database fits in main memory
    – Not efficient when the database cannot fit in main memory: disk-based random access is very costly
  Suggested approach:
    – Integrate physical projection and pseudo-projection
    – Swap to pseudo-projection when the data set fits in memory

Experiments
  Synthetic datasets were generated using the procedure described in R. Agrawal and R. Srikant, "Mining sequential patterns", Proc. ICDE'95
    – Number of items: 1,000
    – Number of sequences in the data set: 10,000
    – Average number of items within elements: 8
    – Average number of elements in a sequence: 8

Experiments (cont'd)
  Comparing PrefixSpan with GSP and FreeSpan in large databases
    – GSP (IBM Almaden, Srikant & Agrawal, EDBT'96)
    – FreeSpan (J. Han, J. Pei, B. Mortazavi-Asl, Q. Chen, U. Dayal, M.-C. Hsu, KDD'00)
    – PrefixSpan-1 (single-level projection)
    – PrefixSpan-2 (bi-level projection)
  Comparing effects of pseudo-projection
  Comparing I/O cost and scalability

PrefixSpan Is Faster Than GSP and FreeSpan

Effect of Pseudo-Projection When the Projected Database Fits in Memory

I/O Cost: When It Cannot Fit in Memory

Scalability (When DB Is Large) min_sup=0.2%

Conclusions
  Both PrefixSpan and FreeSpan are pattern-growth methods, which perform better than Apriori-like methods on the sequential pattern mining problem
  PrefixSpan is more elegant than FreeSpan
    – The Apriori heuristic is integrated into bi-level projection in PrefixSpan
    – Pseudo-projection substantially enhances the performance of memory-based processing

References
  J. Han, J. Pei, B. Mortazavi-Asl, Q. Chen, U. Dayal, and M.-C. Hsu. FreeSpan: Frequent pattern-projected sequential pattern mining. KDD'00.
  J. Pei, J. Han, B. Mortazavi-Asl, H. Pinto, Q. Chen, U. Dayal, and M.-C. Hsu. PrefixSpan: Mining sequential patterns efficiently by prefix-projected pattern growth. ICDE'01.
  R. Srikant and R. Agrawal. Mining sequential patterns: Generalizations and performance improvements. EDBT'96, pages 3-17.

Q&A

Thanks