October 2, 2015 Data Mining: Concepts and Techniques 1 Data Mining: Concepts and Techniques — Chapter 8 — 8.3 Mining sequence patterns in transactional.

Slides:



Advertisements
Similar presentations
Frequent Itemset Mining Methods. The Apriori algorithm Finding frequent itemsets using candidate generation Seminal algorithm proposed by R. Agrawal and.
Advertisements

LOGO Association Rule Lecturer: Dr. Bo Yuan
10 -1 Lecture 10 Association Rules Mining Topics –Basics –Mining Frequent Patterns –Mining Frequent Sequential Patterns –Applications.
ICDM'06 Panel 1 Apriori Algorithm Rakesh Agrawal Ramakrishnan Srikant (description by C. Faloutsos)
Data Mining Association Analysis: Basic Concepts and Algorithms
Rakesh Agrawal Ramakrishnan Srikant
IncSpan: Incremental Mining of Sequential Patterns in Large Databases Hong Cheng,Xifeng Yan,Jiawei Han University of Illinois at Urbana-Champaign.
1 IncSpan :Incremental Mining of Sequential Patterns in Large Database Hong Cheng, Xifeng Yan, Jiawei Han Proc Int. Conf. on Knowledge Discovery.
Chapter 5: Mining Frequent Patterns, Association and Correlations
Data Mining Association Analysis: Basic Concepts and Algorithms Introduction to Data Mining by Tan, Steinbach, Kumar © Tan,Steinbach, Kumar Introduction.
Data Mining Association Analysis: Basic Concepts and Algorithms Lecture Notes for Chapter 6 Introduction to Data Mining by Tan, Steinbach, Kumar © Tan,Steinbach,
Data Mining: Concepts and Techniques (2nd ed.) — Chapter 5 —
Our New Progress on Frequent/Sequential Pattern Mining We develop new frequent/sequential pattern mining methods Performance study on both synthetic and.
Multi-dimensional Sequential Pattern Mining
Data Mining Association Analysis: Basic Concepts and Algorithms
Sequential Pattern Mining
Data Mining Association Analysis: Basic Concepts and Algorithms
Sequence Databases & Sequential Patterns
Mining Sequential Patterns Dimitrios Gunopulos, UCR.
Business Systems Intelligence: 4. Mining Association Rules Dr. Brian Mac Namee (
1 Mining Association Rules in Large Databases Association rule mining Algorithms for scalable mining of (single-dimensional Boolean) association rules.
Association Analysis: Basic Concepts and Algorithms.
Chapter 4: Mining Frequent Patterns, Associations and Correlations
Data Mining: Concepts and Techniques 1 Mining Sequence Patterns in Transactional Databases CS240B --UCLA Notes by Carlo Zaniolo Based on those by J. Han.
The UNIVERSITY of Kansas EECS 800 Research Seminar Mining Biological Data Instructor: Luke Huan Fall, 2006.
Presented by Yaron Gonen. Outline Introduction Problems definition and motivation Previous work The CAMLS Algorithm Overview Main contributions Results.
Mining Sequences. Examples of Sequence Web sequence:  {Homepage} {Electronics} {Digital Cameras} {Canon Digital Camera} {Shopping Cart} {Order Confirmation}
Pattern-growth Methods for Sequential Pattern Mining: Principles and Extensions Jiawei Han (UIUC) Jian Pei (Simon Fraser Univ.)
1 1 Data Mining: Concepts and Techniques (3 rd ed.) — Chapter 6 — Jiawei Han, Micheline Kamber, and Jian Pei University of Illinois at Urbana-Champaign.
Pattern Recognition Lecture 20: Data Mining 3 Dr. Richard Spillman Pacific Lutheran University.
A Short Introduction to Sequential Data Mining
What Is Sequential Pattern Mining?
Ch5 Mining Frequent Patterns, Associations, and Correlations
Sequential PAttern Mining using A Bitmap Representation
Data Mining: Concepts and Techniques (3rd ed.) — Chapter 6 —
Data Mining Association Analysis: Basic Concepts and Algorithms Lecture Notes for Chapter 6 Introduction to Data Mining By Tan, Steinbach, Kumar Lecture.
Modul 7: Association Analysis. 2 Association Rule Mining  Given a set of transactions, find rules that will predict the occurrence of an item based on.
Efficient Data Mining for Calling Path Patterns in GSM Networks Information Systems, accepted 5 December 2002 SPEAKER: YAO-TE WANG ( 王耀德 )
1 Multi-dimensional Sequential Pattern Mining Helen Pinto, Jiawei Han, Jian Pei, Ke Wang, Qiming Chen, Umeshwar Dayal ~From: 10th ACM Intednational Conference.
Discovering RFM Sequential Patterns From Customers’ Purchasing Data 中央大學資管系 陳彥良 教授 Date: 2015/10/14.
Pattern-Growth Methods for Sequential Pattern Mining Iris Zhang
The UNIVERSITY of NORTH CAROLINA at CHAPEL HILL Sequential Pattern Mining COMP Seminar BCB 713 Module Spring 2011.
Lecture 11 Sequential Pattern Mining MW 4:00PM-5:15PM Dr. Jianjun Hu CSCE822 Data Mining and Warehousing University.
Sequential Pattern Mining
Jian Pei Jiawei Han Behzad Mortazavi-Asl Helen Pinto ICDE’01
Generalized Sequential Pattern Mining with Item Intervals Yu Hirate Hayato Yamana PAKDD2006.
Data Mining Association Rules: Advanced Concepts and Algorithms Lecture Notes Introduction to Data Mining by Tan, Steinbach, Kumar.
Data Mining Association Rules: Advanced Concepts and Algorithms
Outline Introduction – Frequent patterns and the Rare Item Problem – Multiple Minimum Support Framework – Issues with Multiple Minimum Support Framework.
1 Efficient Algorithms for Incremental Update of Frequent Sequences Minghua ZHANG Dec. 7, 2001.
CloSpan: Mining Closed Sequential Patterns in Large Datasets Xifeng Yan, Jiawei Han and Ramin Afshar Proceedings of 2003 SIAM International Conference.
1 Mining Sequential Patterns with Constraints in Large Database Jian Pei, Jiawei Han,Wei Wang Proc. of the 2002 IEEE International Conference on Data Mining.
Mining Sequential Patterns © Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 Slides are adapted from Introduction to Data Mining by Tan, Steinbach,
Data Mining Association Analysis: Basic Concepts and Algorithms Lecture Notes for Chapter 6 Introduction to Data Mining by Tan, Steinbach, Kumar © Tan,Steinbach,
PrefixSpan: Mining Sequential Patterns Efficiently by Prefix-Projected Pattern Growth Jiawei Han, Jian Pei, Helen Pinto, Behzad Mortazavi-Asl, Qiming Chen,
CS685 : Special Topics in Data Mining, UKY The UNIVERSITY of KENTUCKY Association Rule Mining CS 685: Special Topics in Data Mining Jinze Liu.
The UNIVERSITY of NORTH CAROLINA at CHAPEL HILL Association Rule Mining COMP Seminar BCB 713 Module Spring 2011.
1 Top Down FP-Growth for Association Rule Mining By Ke Wang.
Data Mining: Principles and Algorithms Mining Sequence Patterns
Sequential Pattern Mining
Data Mining: Concepts and Techniques
Data Mining: Concepts and Techniques
Association rule mining
Data Mining: Concepts and Techniques
Association Rule Mining
Data Mining: Concepts and Techniques — Chapter 8 — 8
Data Warehousing Mining & BI
FP-Growth Wenlong Zhang.
Association Rule Mining
Presentation transcript:

October 2, 2015 Data Mining: Concepts and Techniques 1 Data Mining: Concepts and Techniques — Chapter 8 — 8.3 Mining sequence patterns in transactional databases

October 2, 2015 Data Mining: Concepts and Techniques 2 Chapter 8. Mining Stream, Time- Series, and Sequence Data Mining data streams Mining time-series data Mining sequence patterns in transactional databases Mining sequence patterns in biological data

October 2, 2015 Data Mining: Concepts and Techniques 3 Sequence Databases & Sequential Patterns Transaction databases, time-series databases vs. sequence databases Frequent patterns vs. (frequent) sequential patterns Applications of sequential pattern mining Customer shopping sequences: First buy computer, then CD-ROM, and then digital camera, within 3 months. Medical treatments, natural disasters (e.g., earthquakes), science & eng. processes, stocks and markets, etc. Telephone calling patterns, Weblog click streams DNA sequences and gene structures

October 2, 2015 Data Mining: Concepts and Techniques 4 What Is Sequential Pattern Mining? Given a set of sequences, find the complete set of frequent subsequences A sequence database A sequence : An element may contain a set of items. Items within an element are unordered and we list them alphabetically. is a subsequence of Given support threshold min_sup =2, is a sequential pattern SIDsequence

October 2, 2015 Data Mining: Concepts and Techniques 5 Challenges on Sequential Pattern Mining A huge number of possible sequential patterns are hidden in databases A mining algorithm should find the complete set of patterns, when possible, satisfying the minimum support (frequency) threshold be highly efficient, scalable, involving only a small number of database scans be able to incorporate various kinds of user-specific constraints

October 2, 2015 Data Mining: Concepts and Techniques 6 Sequential Pattern Mining Algorithms Concept introduction and an initial Apriori-like algorithm Agrawal & Srikant. Mining sequential patterns, ICDE’95 Apriori-based method: GSP (Generalized Sequential Patterns: Srikant & EDBT’96) Pattern-growth methods: FreeSpan & PrefixSpan (Han et Pei, et Vertical format-based mining: SPADE Leanining’00) Constraint-based sequential pattern mining (SPIRIT: Garofalakis, Rastogi, Pei, Han, CIKM’02) Mining closed sequential patterns: CloSpan (Yan, Han &

October 2, 2015 Data Mining: Concepts and Techniques 7 The Apriori Property of Sequential Patterns A basic property: Apriori (Agrawal & Sirkant’94) If a sequence S is not frequent Then none of the super-sequences of S is frequent E.g, is infrequent  so do and SequenceSeq. ID Given support threshold min_sup =2

October 2, 2015 Data Mining: Concepts and Techniques 8 GSP—Generalized Sequential Pattern Mining GSP (Generalized Sequential Pattern) mining algorithm proposed by Agrawal and Srikant, EDBT’96 Outline of the method Initially, every item in DB is a candidate of length-1 for each level (i.e., sequences of length-k) do scan database to collect support count for each candidate sequence generate candidate length-(k+1) sequences from length-k frequent sequences using Apriori repeat until no frequent sequence or no candidate can be found Major strength: Candidate pruning by Apriori

October 2, 2015 Data Mining: Concepts and Techniques 9 Finding Length-1 Sequential Patterns Examine GSP using an example Initial candidates: all singleton sequences,,,,,,, Scan database once, count support for candidates SequenceSeq. ID min_sup =2 CandSup

October 2, 2015 Data Mining: Concepts and Techniques 10 GSP: Generating Length-2 Candidates 51 length-2 Candidates Without Apriori property, 8*8+8*7/2=92 candidates Apriori prunes 44.57% candidates

October 2, 2015 Data Mining: Concepts and Techniques 11 Candidate Generate-and-test: Drawbacks A huge set of candidate sequences generated. Especially 2-item candidate sequence. Multiple Scans of database needed. The length of each candidate grows by one at each database scan. Inefficient for mining long sequential patterns. A long pattern grow up from short patterns The number of short patterns is exponential to the length of mined patterns.

October 2, 2015 Data Mining: Concepts and Techniques 12 The SPADE Algorithm SPADE (Sequential PAttern Discovery using Equivalent Class) developed by Zaki 2001 A vertical format sequential pattern mining method A sequence database is mapped to a large set of Item: Sequential pattern mining is performed by growing the subsequences (patterns) one item at a time by Apriori candidate generation

October 2, 2015 Data Mining: Concepts and Techniques 13 The SPADE Algorithm

October 2, 2015 Data Mining: Concepts and Techniques 14 Bottlenecks of GSP and SPADE A huge set of candidates could be generated 1,000 frequent length-1 sequences generate s huge number of length-2 candidates! Multiple scans of database in mining Breadth-first search Mining long sequential patterns Needs an exponential number of short candidates A length-100 sequential pattern needs candidate sequences!

October 2, 2015 Data Mining: Concepts and Techniques 15 Prefix and Suffix (Projection),, and are prefixes of sequence Given sequence PrefixSuffix (Prefix-Based Projection)

October 2, 2015 Data Mining: Concepts and Techniques 16 Mining Sequential Patterns by Prefix Projections Step 1: find length-1 sequential patterns,,,,, Step 2: divide search space. The complete set of seq. pat. can be partitioned into 6 subsets: The ones having prefix ; … The ones having prefix SIDsequence

October 2, 2015 Data Mining: Concepts and Techniques 17 Finding Seq. Patterns with Prefix Only need to consider projections w.r.t. -projected database:,,, Find all the length-2 seq. pat. Having prefix :,,,,, Further partition into 6 subsets Having prefix ; … Having prefix SIDsequence

October 2, 2015 Data Mining: Concepts and Techniques 18 Completeness of PrefixSpan SIDsequence SDB Length-1 sequential patterns,,,,, -projected database Length-2 sequential patterns,,,,, Having prefix -proj. db … Having prefix -projected database … Having prefix Having prefix, …, …

October 2, 2015 Data Mining: Concepts and Techniques 19 Efficiency of PrefixSpan No candidate sequence needs to be generated Projected databases keep shrinking Major cost of PrefixSpan: constructing projected databases Can be improved by pseudo-projections

October 2, 2015 Data Mining: Concepts and Techniques 20 Speed-up by Pseudo-projection Major cost of PrefixSpan: projection Postfixes of sequences often appear repeatedly in recursive projected databases When (projected) database can be held in main memory, use pointers to form projections Pointer to the sequence Offset of the postfix s= s| : (, 2) s| : (, 4)

October 2, 2015 Data Mining: Concepts and Techniques 21 Pseudo-Projection vs. Physical Projection Pseudo-projection avoids physically copying postfixes Efficient in running time and space when database can be held in main memory However, it is not efficient when database cannot fit in main memory Disk-based random accessing is very costly Suggested Approach: Integration of physical and pseudo-projection Swapping to pseudo-projection when the data set fits in memory

October 2, 2015 Data Mining: Concepts and Techniques 22 Performance on Data Set C10T8S8I8

October 2, 2015 Data Mining: Concepts and Techniques 23 Performance on Data Set Gazelle

October 2, 2015 Data Mining: Concepts and Techniques 24 Effect of Pseudo-Projection

October 2, 2015 Data Mining: Concepts and Techniques 25 Ref: Mining Sequential Patterns R. Srikant and R. Agrawal. Mining sequential patterns: Generalizations and performance improvements. EDBT’96. H. Mannila, H Toivonen, and A. I. Verkamo. Discovery of frequent episodes in event sequences. DAMI:97. M. Zaki. SPADE: An Efficient Algorithm for Mining Frequent Sequences. Machine Learning, J. Pei, J. Han, H. Pinto, Q. Chen, U. Dayal, and M.-C. Hsu. PrefixSpan: Mining Sequential Patterns Efficiently by Prefix-Projected Pattern Growth. ICDE'01 (TKDE’04). J. Pei, J. Han and W. Wang, Constraint-Based Sequential Pattern Mining in Large Databases, CIKM'02. X. Yan, J. Han, and R. Afshar. CloSpan: Mining Closed Sequential Patterns in Large Datasets. SDM'03. J. Wang and J. Han, BIDE: Efficient Mining of Frequent Closed Sequences, ICDE'04. H. Cheng, X. Yan, and J. Han, IncSpan: Incremental Mining of Sequential Patterns in Large Database, KDD'04. J. Han, G. Dong and Y. Yin, Efficient Mining of Partial Periodic Patterns in Time Series Database, ICDE'99. J. Yang, W. Wang, and P. S. Yu, Mining asynchronous periodic patterns in time series data, KDD'00.