Lecture 11 Sequential Pattern Mining MW 4:00PM-5:15PM Dr. Jianjun Hu CSCE822 Data Mining and Warehousing University.

Lecture 11 Sequential Pattern Mining MW 4:00PM-5:15PM Dr. Jianjun Hu CSCE822 Data Mining and Warehousing University of South Carolina Department of Computer Science and Engineering

Roadmap
- Sequential pattern mining problem
- The challenges
- The Apriori-based algorithms
- FP-Growth-based algorithm
- SPADE algorithm
- Mining closed sequential patterns
- Mining constrained sequential patterns
10/22/2015

Sequence Databases & Sequential Patterns
- Transaction databases, time-series databases vs. sequence databases
- Frequent patterns vs. (frequent) sequential patterns
- Applications of sequential pattern mining
  - Customer shopping sequences: first buy a computer, then a CD-ROM, and then a digital camera, within 3 months
  - Medical treatments, natural disasters (e.g., earthquakes), science and engineering processes, stocks and markets, etc.
  - Telephone calling patterns, Weblog click streams
  - DNA sequences and gene structures
October 22, 2015 Data Mining: Concepts and Techniques

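What Is Sequential Pattern Mining?
- Given a set of sequences, find the complete set of frequent subsequences
- A sequence database is a table of (SID, sequence) rows (the example table is not preserved in this transcript)
- A sequence is an ordered list of elements. An element may contain a set of items. Items within an element are unordered, and we list them alphabetically
- The slide shows one sequence that is a subsequence of another (the example sequences are not preserved)
- Given support threshold min_sup = 2, a subsequence contained in at least 2 sequences of the database is a sequential pattern (the slide's example pattern is not preserved)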
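To make the definitions of subsequence and support concrete, here is a minimal Python sketch. The four-sequence toy database, the compact text notation, and the helper names parse, is_subsequence, and support are all assumptions chosen for illustration; the slide's own table is not available in this transcript.

```python
from typing import FrozenSet, List

Element = FrozenSet[str]   # one element = an unordered set of items
Seq = List[Element]        # a sequence = an ordered list of elements

def parse(text: str) -> Seq:
    """Parse a compact form like 'a(abc)(ac)d(cf)' into a list of item sets."""
    seq, i = [], 0
    while i < len(text):
        if text[i] == "(":
            j = text.index(")", i)
            seq.append(frozenset(text[i + 1:j]))
            i = j + 1
        else:
            seq.append(frozenset(text[i]))
            i += 1
    return seq

def is_subsequence(pattern: Seq, sequence: Seq) -> bool:
    """True if the pattern's elements are subsets of distinct elements of
    the sequence, in the same order (greedy earliest match)."""
    pos = 0
    for element in pattern:
        while pos < len(sequence) and not element <= sequence[pos]:
            pos += 1
        if pos == len(sequence):
            return False
        pos += 1
    return True

def support(pattern: Seq, database: List[Seq]) -> int:
    """Number of database sequences that contain the pattern."""
    return sum(is_subsequence(pattern, s) for s in database)

# Hypothetical toy database (SIDs 10, 20, 30, 40 in that order).
toy_db = [parse(s) for s in
          ["a(abc)(ac)d(cf)", "(ad)c(bc)(ae)", "(ef)(ab)(df)cb", "eg(af)cbc"]]

print(is_subsequence(parse("a(bc)dc"), toy_db[0]))
print(support(parse("(ab)c"), toy_db))
```

With min_sup = 2, a pattern counts as sequential pattern in this toy database whenever support returns at least 2.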

Challenges on Sequential Pattern Mining
- A huge number of possible sequential patterns are hidden in databases
- A mining algorithm should
  - find the complete set of patterns, when possible, satisfying the minimum support (frequency) threshold
  - be highly efficient and scalable, involving only a small number of database scans
  - be able to incorporate various kinds of user-specific constraints

Review: Apriori, a Candidate Generation-and-Test Approach
- Apriori pruning principle: if any itemset is infrequent, its supersets should not be generated or tested (Agrawal & Srikant'94)
- Method:
  - Initially, scan the DB once to get the frequent 1-itemsets
  - Generate length-(k+1) candidate itemsets from the length-k frequent itemsets
  - Test the candidates against the DB
  - Terminate when no frequent or candidate set can be generated
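The generate-and-test loop reviewed above can be sketched in a few lines of Python. This is a minimal illustration, not an optimized implementation: the function name apriori, the toy transactions, and the simple union-based join are assumptions made for the example.

```python
from itertools import combinations

def apriori(transactions, min_sup):
    """Level-wise frequent itemset mining; returns {frozenset: support count}."""
    transactions = [frozenset(t) for t in transactions]

    def count(candidates):
        # One pass over the database per level.
        return {c: sum(c <= t for t in transactions) for c in candidates}

    items = {frozenset([i]) for t in transactions for i in t}
    frequent = {c: n for c, n in count(items).items() if n >= min_sup}
    result, k = dict(frequent), 1
    while frequent:
        prev = set(frequent)
        # Join step: unions of two frequent k-itemsets that have size k+1.
        cands = {a | b for a in prev for b in prev if len(a | b) == k + 1}
        # Prune step: every k-subset of a candidate must itself be frequent.
        cands = {c for c in cands
                 if all(frozenset(s) in prev for s in combinations(c, k))}
        frequent = {c: n for c, n in count(cands).items() if n >= min_sup}
        result.update(frequent)
        k += 1
    return result

# Hypothetical toy transactions.
toy = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b", "c"}, {"a", "b", "c"}]
freq = apriori(toy, min_sup=3)
```

In this toy run every single item and every pair is frequent, while {a, b, c} is generated as a candidate but fails the support test.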

Review: Mining Frequent Patterns Without Candidate Generation
- Grow long patterns from short ones using local frequent items
  - “abc” is a frequent pattern
  - Get all transactions having “abc”: DB|abc (DB projection)
  - “d” is a local frequent item in DB|abc
  - So abcd is a frequent pattern
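The projection step DB|abc can be illustrated directly. The sketch below assumes a tiny hypothetical transaction list and two helper names, project and local_frequent_items; it shows only one growth step, not the full FP-growth algorithm or the FP-tree structure.

```python
from collections import Counter

def project(transactions, pattern):
    """DB|pattern: transactions containing every item of the pattern,
    with the pattern's own items removed."""
    p = set(pattern)
    return [set(t) - p for t in transactions if p <= set(t)]

def local_frequent_items(projected, min_sup):
    """Items that are frequent inside the projected database."""
    counts = Counter(item for t in projected for item in t)
    return {item: n for item, n in counts.items() if n >= min_sup}

# Hypothetical toy transactions, written as strings of single-letter items.
toy = ["abcd", "abcde", "abce", "abd", "bcd"]
db_abc = project(toy, "abc")
grown = local_frequent_items(db_abc, min_sup=2)
```

Each key of grown is a local frequent item, so appending it to "abc" gives a longer frequent pattern without generating or testing any other candidate.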

Review: Why Is FP-Growth the Winner?
- Divide-and-conquer:
  - decompose both the mining task and the DB according to the frequent patterns obtained so far
  - leads to focused search of smaller databases
- Other factors
  - no candidate generation, no candidate test
  - compressed database: FP-tree structure
  - no repeated scan of the entire database
  - basic operations are counting local frequent items and building sub-FP-trees, with no pattern search and matching

Sequential Pattern Mining Algorithms
- Concept introduction and an initial Apriori-like algorithm
  - Agrawal & Srikant. Mining sequential patterns, ICDE’95
- Apriori-based method: GSP (Generalized Sequential Patterns: Srikant & … EDBT’96)
- Pattern-growth methods: FreeSpan & PrefixSpan (Han et … Pei, et …)
- Vertical format-based mining: SPADE (… Learning’00)
- Constraint-based sequential pattern mining (SPIRIT: Garofalakis, Rastogi, Pei, Han, CIKM’02)
- Mining closed sequential patterns: CloSpan (Yan, Han & …)

The Apriori Property of Sequential Patterns
- A basic property: Apriori (Agrawal & Srikant’94)
  - If a sequence S is not frequent, then none of the super-sequences of S is frequent
  - E.g., if a sequence is infrequent, so are its super-sequences (the slide's example sequences and its sequence table, with min_sup = 2, are not preserved in this transcript)

GSP: Generalized Sequential Pattern Mining
- GSP (Generalized Sequential Pattern) mining algorithm
  - Proposed by Agrawal and Srikant, EDBT’96
- Outline of the method
  - Initially, every item in the DB is a candidate of length 1
  - For each level (i.e., sequences of length k):
    - scan the database to collect the support count for each candidate sequence
    - generate candidate length-(k+1) sequences from the length-k frequent sequences using Apriori
  - Repeat until no frequent sequence or no candidate can be found
- Major strength: candidate pruning by Apriori
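The outline above can be sketched as follows. This is a deliberately simplified version that assumes every element holds a single item, so a sequence is just a string of items; full GSP also handles itemset elements and time constraints, which are omitted here. The names gsp_simple and is_subseq and the toy database are assumptions for illustration.

```python
def is_subseq(pattern, sequence):
    """True if pattern's items appear in sequence in the same order."""
    it = iter(sequence)
    return all(item in it for item in pattern)

def gsp_simple(database, min_sup):
    """Level-wise sequential pattern mining for single-item elements.
    Returns {tuple of items: support count}."""
    def sup(candidate):
        return sum(is_subseq(candidate, s) for s in database)

    items = sorted({x for s in database for x in s})
    level = {(x,): sup((x,)) for x in items}
    level = {c: n for c, n in level.items() if n >= min_sup}
    result = dict(level)
    while level:
        prev = set(level)
        # Join: s1 and s2 combine when s1 without its first item
        # equals s2 without its last item.
        cands = {s1 + s2[-1:] for s1 in prev for s2 in prev
                 if s1[1:] == s2[:-1]}
        # Prune (Apriori): every subsequence one item shorter must be frequent.
        cands = {c for c in cands
                 if all(c[:i] + c[i + 1:] in prev for i in range(len(c)))}
        # Test: one database scan per level.
        level = {c: sup(c) for c in cands}
        level = {c: n for c, n in level.items() if n >= min_sup}
        result.update(level)
    return result

# Hypothetical toy sequence database.
toy = ["abc", "acb", "abcb", "bc"]
patterns = gsp_simple(toy, min_sup=2)
```

In this toy run the candidates bcb and cbc are pruned before any database scan, because their subsequences bb and cc are not frequent.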

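Finding Length-1 Sequential Patterns
- Examine GSP using an example (sequence table with min_sup = 2; the table is not preserved in this transcript)
- Initial candidates: all singleton sequences (eight of them; the list is not preserved)
- Scan the database once and count support for the candidates (the candidate/support table is not preserved)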

GSP: Generating Length-2 Candidates
- 51 length-2 candidates
- Without the Apriori property, 8*8 + 8*7/2 = 92 candidates
- Apriori prunes 44.57% of the candidates
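The candidate counts on this slide can be checked with a short calculation. The sketch assumes the usual reading of the formula: n*n ordered two-element candidates (including a repeated item) plus n*(n-1)/2 candidates in which two different items share one element. The figure of 6 frequent items is taken from the first-scan result reported for this example; the function name is hypothetical.

```python
def length2_candidates(n: int) -> int:
    """n*n ordered pairs <x y> plus n*(n-1)/2 same-element pairs <(x y)>."""
    return n * n + n * (n - 1) // 2

without_apriori = length2_candidates(8)   # all 8 items in the database
with_apriori = length2_candidates(6)      # only the 6 frequent items
pruned = 1 - with_apriori / without_apriori

print(without_apriori, with_apriori, f"{pruned:.2%}")
```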

The GSP Mining Process (min_sup = 2)
- 1st scan: 8 candidates, 6 length-1 sequential patterns
- 2nd scan: 51 candidates, 19 length-2 sequential patterns; 10 candidates not in the DB at all
- 3rd scan: 46 candidates, 19 length-3 sequential patterns; 20 candidates not in the DB at all
- 4th scan: 8 candidates, 6 length-4 sequential patterns
- 5th scan: 1 candidate, 1 length-5 sequential pattern
- Figure legend: candidates that cannot pass the support threshold; candidates not in the DB at all

Candidate Generate-and-Test: Drawbacks
- A huge set of candidate sequences is generated
  - Especially 2-item candidate sequences
- Multiple scans of the database are needed
  - The length of each candidate grows by one at each database scan
- Inefficient for mining long sequential patterns
  - A long pattern grows from short patterns
  - The number of short patterns is exponential in the length of the mined patterns

The SPADE Algorithm
- SPADE (Sequential PAttern Discovery using Equivalence classes), developed by Zaki, 2001
- A vertical-format sequential pattern mining method
- A sequence database is mapped to a large set of per-item entries (the entry format after "Item:" is not preserved in this transcript)
- Sequential pattern mining is performed by growing the subsequences (patterns) one item at a time by Apriori candidate generation
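A rough sketch of the vertical format may help here. It assumes the id-list layout commonly described for SPADE, in which each item maps to a set of (SID, EID) pairs, with SID the sequence id and EID the position of the element inside that sequence; the transcript itself does not preserve the entry format. The toy database and the names vertical, temporal_join, and support are hypothetical, and only the "followed later by" join is shown, not the same-element join or the equivalence-class decomposition.

```python
from collections import defaultdict

def vertical(database):
    """Map each item to its id-list: a set of (SID, EID) pairs."""
    idlists = defaultdict(set)
    for sid, sequence in database.items():
        for eid, element in enumerate(sequence, start=1):
            for item in element:
                idlists[item].add((sid, eid))
    return idlists

def temporal_join(left, right):
    """Id-list of 'left pattern, later followed by right item': keep the
    right item's (SID, EID) pairs that come after the earliest occurrence
    of the left pattern in the same sequence."""
    first = {}
    for sid, eid in left:
        first[sid] = min(eid, first.get(sid, eid))
    return {(sid, eid) for sid, eid in right
            if sid in first and eid > first[sid]}

def support(idlist):
    """Support = number of distinct sequences in the id-list."""
    return len({sid for sid, _ in idlist})

# Hypothetical toy database: each element is a string of single-letter items.
toy = {
    1: ["a", "abc", "ac", "d", "cf"],
    2: ["ad", "c", "bc", "ae"],
    3: ["ef", "ab", "df", "c", "b"],
    4: ["e", "g", "af", "c", "b", "c"],
}
ids = vertical(toy)
a_then_b = temporal_join(ids["a"], ids["b"])
b_then_a = temporal_join(ids["b"], ids["a"])
```

Once the id-lists are built, the support of a longer pattern comes from joining id-lists rather than rescanning the original sequences.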

The SPADE Algorithm (figure slide; the illustration is not preserved)

Bottlenecks of GSP and SPADE
- A huge set of candidates could be generated
  - 1,000 frequent length-1 sequences generate a huge number of length-2 candidates!
- Multiple scans of the database in mining
- Breadth-first search
- Mining long sequential patterns
  - Needs an exponential number of short candidates
  - A length-100 sequential pattern needs a very large number of candidate sequences (the exact count is not preserved in this transcript)
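The scale of these bottlenecks can be estimated with a quick calculation. Two assumptions are made here, since the transcript does not preserve the slide's numbers: length-2 candidates are counted as n*n + n*(n-1)/2, the same formula used for the 8-item example, and the candidates needed for a length-100 pattern of single items are counted as all of its non-empty subsequences obtained by deleting items.

```python
from math import comb

# Length-2 candidates generated from 1,000 frequent length-1 sequences.
n = 1000
length2 = n * n + n * (n - 1) // 2

# Non-empty sub-patterns of a length-100 pattern: choose which i items to keep.
subpatterns = sum(comb(100, i) for i in range(1, 101))

print(f"{length2:,}")
print(f"{subpatterns:.3e}")
```

The second sum equals 2**100 - 1, which is why a candidate-based method cannot realistically reach patterns of that length.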

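Prefix and Suffix (Projection)
- Several sequences are listed as prefixes of one example sequence (the sequences are not preserved in this transcript)
- Given a sequence, a table pairs each prefix with its suffix, i.e. the prefix-based projection (the table is not preserved)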
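As a rough illustration of prefix-based projection, the sketch below computes the suffix of a sequence with respect to a prefix. It assumes the simplest setting, where every element is a single item and a sequence is written as a string; the general case, where a suffix can begin partway through an itemset element, is not handled. The function name suffix and the example sequence are hypothetical.

```python
from typing import Optional

def suffix(sequence: str, prefix: str) -> Optional[str]:
    """Return what follows the earliest match of prefix inside sequence,
    or None if prefix is not a subsequence of sequence."""
    pos = 0
    for item in prefix:
        pos = sequence.find(item, pos)
        if pos == -1:
            return None
        pos += 1
    return sequence[pos:]

example = "abcacd"   # hypothetical sequence of single items
print(suffix(example, "a"))
print(suffix(example, "ac"))
print(suffix(example, "da"))
```

A projected database is then just the list of non-None suffixes obtained by applying this function to every sequence in the database.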

Mining Sequential Patterns by Prefix Projections
- Step 1: find the length-1 sequential patterns (six of them; the list is not preserved in this transcript)
- Step 2: divide the search space. The complete set of sequential patterns can be partitioned into 6 subsets, one for each length-1 pattern used as a prefix

Finding Sequential Patterns with a Given Prefix
- Only need to consider projections w.r.t. the prefix
  - The projected database for that prefix holds four suffix sequences (not preserved in this transcript)
- Find all the length-2 sequential patterns having that prefix (six of them; not preserved)
  - Further partition into 6 subsets, one for each length-2 prefix
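The recursive partitioning by prefix can be sketched as below. This is a simplified PrefixSpan-style miner that assumes single-item elements, so sequences are strings, and it copies suffixes physically; the name prefixspan and the toy database are assumptions for illustration.

```python
from collections import Counter

def prefixspan(database, min_sup):
    """Return {pattern: support} for sequences of single items (strings)."""
    patterns = {}

    def grow(prefix, projected):
        # Count each item at most once per projected suffix.
        counts = Counter(item for s in projected for item in set(s))
        for item, n in sorted(counts.items()):
            if n < min_sup:
                continue
            pattern = prefix + item
            patterns[pattern] = n
            # Project on the new prefix: keep what follows the first
            # occurrence of the item in each suffix that contains it.
            grow(pattern,
                 [s[s.index(item) + 1:] for s in projected if item in s])

    grow("", list(database))
    return patterns

# Hypothetical toy sequence database.
toy = ["abc", "acb", "abcb", "bc"]
patterns = prefixspan(toy, min_sup=2)
```

Every frequent pattern is found in exactly one branch, the one for its first item, which is the partitioning argument made on the slide.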

Completeness of PrefixSpan (diagram; the sequences are not preserved in this transcript)
- The sequence database SDB yields the length-1 sequential patterns
- Each length-1 pattern defines a projected database, which yields the length-2 sequential patterns having that prefix
- Each length-2 pattern in turn defines its own projected database, and so on

Efficiency of PrefixSpan
- No candidate sequence needs to be generated
- Projected databases keep shrinking
- Major cost of PrefixSpan: constructing projected databases
  - Can be improved by pseudo-projections
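Pseudo-projection avoids copying suffixes: a projected database is stored as pointers into the original sequences. The sketch below assumes single-item elements (sequences as strings) and represents each pointer as a hypothetical (sequence index, start offset) pair; it is an illustration of the idea, not the published implementation.

```python
from collections import Counter

def prefixspan_pseudo(database, min_sup):
    """PrefixSpan-style mining where each projected database is a list of
    (sequence index, start offset) pointers instead of copied suffixes."""
    patterns = {}

    def grow(prefix, pointers):
        counts = Counter()
        for idx, start in pointers:
            seq = database[idx]
            # Distinct items in the part of the sequence after the pointer.
            counts.update({seq[k] for k in range(start, len(seq))})
        for item, n in sorted(counts.items()):
            if n < min_sup:
                continue
            patterns[prefix + item] = n
            next_pointers = []
            for idx, start in pointers:
                pos = database[idx].find(item, start)
                if pos != -1:
                    # Advance the pointer past the matched item; no copy made.
                    next_pointers.append((idx, pos + 1))
            grow(prefix + item, next_pointers)

    grow("", [(i, 0) for i in range(len(database))])
    return patterns

# Hypothetical toy sequence database.
toy = ["abc", "acb", "abcb", "bc"]
patterns = prefixspan_pseudo(toy, min_sup=2)
```

The result is the same set of patterns as with physical projection; the saving is in memory and copying, and it applies when the database fits in main memory.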

Performance on Data Set C10T8S8I8 (figure slide; the chart is not preserved)

Performance on Data Set Gazelle (figure slide; the chart is not preserved)

Effect of Pseudo-Projection (figure slide; the chart is not preserved)

CloSpan: Mining Closed Sequential Patterns (by Han and Yan, UIUC)
- A closed sequential pattern s: there exists no superpattern s' such that s' ⊃ s and s' and s have the same support
- Motivation: reduces the number of (redundant) patterns while retaining the same expressive power
- Uses special subpattern and superpattern pruning to prune the redundant search space
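The definition of a closed pattern can be checked with a small post-filter. Note that this is not how CloSpan works: CloSpan prunes during the search, whereas the sketch below simply filters an already mined pattern set. It assumes single-item elements (patterns as strings); the names is_subseq and closed_patterns and the pattern table are hypothetical.

```python
def is_subseq(short, long):
    """True if the items of short appear in long in the same order."""
    it = iter(long)
    return all(x in it for x in short)

def closed_patterns(patterns):
    """Keep the patterns that have no proper super-pattern with equal support."""
    return {
        p: sup for p, sup in patterns.items()
        if not any(q != p and sup_q == sup and is_subseq(p, q)
                   for q, sup_q in patterns.items())
    }

# Hypothetical mined patterns with their supports.
mined = {"a": 3, "ab": 3, "abc": 2, "ac": 3, "acb": 2,
         "b": 4, "bc": 3, "c": 4, "cb": 2}
closed = closed_patterns(mined)
```

Here "a" is dropped because "ab" has the same support, and "cb" is dropped because "acb" has the same support; both can be recovered, with their supports, from the closed set.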

CloSpan: Performance Comparison with PrefixSpan (figure slide; the chart is not preserved)

Constraint-Based Seq.-Pattern Mining
- Constraint-based sequential pattern mining
  - Constraints: user-specified, for focused mining of desired patterns
  - How to explore efficient mining with constraints? Optimization
- Classification of constraints (several comparison and set operators were lost in extraction and are shown as "…")
  - Anti-monotone: e.g., value_sum(S) … 10
  - Monotone: e.g., count(S) > 5, S … {PC, digital_camera}
  - Succinct: e.g., length(S) … 10, S … {Pentium, MS/Office, MS/Money}
  - Convertible: e.g., value_avg(S) … 160, max(S)/avg(S) … 5
  - Inconvertible: e.g., avg(S) – median(S) = 0
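To show why an anti-monotone constraint helps the search, here is a minimal sketch. The constraint used, "the sum of item values is at most a budget", is a hypothetical example and is not claimed to be the slide's own; it is anti-monotone only under the stated assumption that item values are non-negative. The item values, the budget, and the function names are invented for illustration.

```python
def within_budget(sequence, value, budget):
    """True if the summed item values of the sequence stay within the budget."""
    return sum(value[item] for item in sequence) <= budget

def allowed_extensions(prefix, items, value, budget):
    """Extensions of prefix that still satisfy the constraint. Because item
    values are assumed non-negative, a rejected extension can never become
    valid by growing further, so its whole branch can be skipped."""
    return [prefix + item for item in items
            if within_budget(prefix + item, value, budget)]

value = {"a": 40, "b": 70, "c": 60}   # hypothetical item values
kept = allowed_extensions("a", "abc", value, budget=100)
```

A pattern-growth miner would recurse only into the extensions in kept, which is the sense in which the constraint is pushed into the mining process rather than applied afterwards.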

From Sequential Patterns to Structured Patterns
- Sets, sequences, trees, graphs, and other structures
  - Transaction DB: sets of items {{i1, i2, …, im}, …}
  - Seq. DB: sequences of sets (the notation is not preserved in this transcript)
  - Sets of sequences (the notation is not preserved)
  - Sets of trees: {t1, t2, …, tn}
  - Sets of graphs (mining for frequent subgraphs): {g1, g2, …, gn}
- Mining structured patterns in XML documents, biochemical structures, etc.

Episodes and Episode Pattern Mining
- Other methods for specifying the kinds of patterns
  - Serial episodes: A → B
  - Parallel episodes: A & B
  - Regular expressions: (A | B)C*(D → E)
- Methods for episode pattern mining
  - Variations of Apriori-like algorithms, e.g., GSP
  - Database projection-based pattern growth, similar to frequent pattern growth without candidate generation

Summary
- Association rule mining with the Apriori principle
- How to evaluate rules
- Sequential pattern mining
  - Apriori principle
  - Mining of sequential patterns based on pattern mining without candidate generation

Slides Credits
- Slides in this presentation are partially based on the work of
  - Han. Textbook
  - Tan. Textbook