Mining Sequential Patterns Dimitrios Gunopulos, UCR.


Finding Frequent Sequential Patterns
- The problem: given a sequence of discrete events that may repeat, e.g. A B A C D A C E B A B C…
- Find patterns that repeat frequently, for example A followed by B (A->B), or A followed by C (A->C).
- The patterns should occur within a window W.
- Applications in telecommunication data, networks, biology.
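The windowed "A followed by B" idea can be illustrated with a minimal sketch. The function name `count_followed_by` and the counting convention (one count per occurrence of the first event that sees the second event within the next W positions) are assumptions made for illustration; the slides do not fix a particular counting rule.

```python
def count_followed_by(events, first, second, window):
    """Count positions i with events[i] == first such that `second`
    appears at some position j with i < j <= i + window.
    (Hypothetical helper; the counting rule is an assumption.)"""
    count = 0
    for i, event in enumerate(events):
        if event != first:
            continue
        # Look ahead at most `window` events for the second symbol.
        if second in events[i + 1 : i + 1 + window]:
            count += 1
    return count


# The event sequence from the slide, with an assumed window W = 2.
events = list("ABACDACEBABC")
ab = count_followed_by(events, "A", "B", window=2)
ac = count_followed_by(events, "A", "C", window=2)
```

With W = 2, this counts A->B twice and A->C three times in the slide's sequence.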

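Sequences
- Sequence: ((T=90F) -> (H=60%, P=1.1atm)), where the first itemset occurs at time t1 and the second at a later time t2.
- An attribute=value pair is an item; the items observed at the same time form an itemset.
- k-sequence: sequence with k items. T1 -> H2P1T3 -> P2 and P1T2 -> H4P2T5 are 5-sequences.
- S1 is a subsequence of S2 (S1 ⪯ S2): T1 -> P1T2 ⪯ H1T1 -> P2 -> H2P1T2, because T1 ⊆ H1T1 and P1T2 ⊆ H2P1T2.
- H1P1 -> T2 is not a subsequence of H1T1 -> P2 -> H2P1T2.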
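The subsequence relation on this slide can be checked mechanically. The sketch below assumes a sequence is represented as a Python list of sets (one set per itemset); the names `is_subsequence` and `size` are hypothetical.

```python
def size(sequence):
    """Number of items in a sequence (the k of a k-sequence)."""
    return sum(len(itemset) for itemset in sequence)


def is_subsequence(s1, s2):
    """Return True if the itemsets of s1 can be mapped, in order, to
    distinct itemsets of s2 that contain them (greedy left-to-right)."""
    pos = 0
    for itemset in s1:
        # Advance through s2 until an itemset contains the current one.
        while pos < len(s2) and not set(itemset) <= set(s2[pos]):
            pos += 1
        if pos == len(s2):
            return False
        pos += 1  # the next itemset of s1 must match strictly later
    return True


# The examples from the slide.
s2 = [{"H1", "T1"}, {"P2"}, {"H2", "P1", "T2"}]
contained = is_subsequence([{"T1"}, {"P1", "T2"}], s2)
not_contained = is_subsequence([{"H1", "P1"}, {"T2"}], s2)
k = size([{"T1"}, {"H2", "P1", "T3"}, {"P2"}])
```

The second check fails because no itemset of the longer sequence contains both H1 and P1.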

Sequential Patterns: The Problem
- Support or frequency of a sequence S, σ(S): the total number of times sequence S is encountered.
- User-specified minimum threshold: min_sup.
- S is frequent ⇔ σ(S) ≥ min_sup.
- S is a maximal frequent sequence ⇔ S is frequent and all of its proper supersequences are non-frequent.
- S is a minimal non-frequent sequence ⇔ S is non-frequent and all of its proper subsequences are frequent.
- The problem: given a database D and min_sup, find all frequent sequences in D.
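A minimal sketch of the frequency test follows. It assumes the database D is a list of sequences (each a list of sets) and that support counts the database sequences containing the pattern; that counting convention, the toy database, and the names `contains`, `support` and `is_frequent` are all assumptions for illustration.

```python
def contains(sequence, pattern):
    """True if `pattern` is a subsequence of `sequence` (greedy match)."""
    pos = 0
    for itemset in pattern:
        while pos < len(sequence) and not itemset <= sequence[pos]:
            pos += 1
        if pos == len(sequence):
            return False
        pos += 1
    return True


def support(database, pattern):
    """Number of database sequences that contain `pattern`."""
    return sum(contains(sequence, pattern) for sequence in database)


def is_frequent(database, pattern, min_sup):
    return support(database, pattern) >= min_sup


# Toy database (hypothetical data, not from the slides).
database = [
    [{"A"}, {"B"}, {"C"}],
    [{"A"}, {"C"}],
    [{"A", "B"}, {"C"}],
    [{"B"}, {"A"}],
]
sup_ac = support(database, [{"A"}, {"C"}])
sup_ab = support(database, [{"A"}, {"B"}])
```

With min_sup = 2, A->C is frequent here and A->B is not, since only the first sequence has B strictly after A.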

Example Database

Algorithms for Sequential Patterns
- Apriori, GSP [Srikant, Agrawal, EDBT 1996], [Mannila, Toivonen, Verkamo, DMKD 1997]
- SPADE, Parallel SPADE [Zaki, 2001]
- FreeSpan, PrefixSpan [Han et al, SIGKDD 2000], [Pei et al, ICDE 2001]
- Sequential Patterns with constraints [Garofalakis et al, VLDB 99]
- DFS-Mine [Tsoukatos and Gunopulos, SSTD 2001]

The Lattice Structure
Lemma: All subsequences of a frequent sequence are frequent.

SPADE ([Zaki, 2001])
- Lattice-based approach
- vertical id-list format
- enumerates all frequent sequences
- equivalence classes to decompose the problem: two k-sequences belong in the same prefix-based class if they have the same i-length prefix, so that each class fits in main memory
- generates a (k+1)-sequence by intersecting two k-sequences that have a common (k-1)-length prefix
- minimizes I/O cost with 2 database scans: frequent 1-sequences, frequent 2-sequences
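To make the vertical id-list idea concrete, here is a simplified sketch that covers only the step from single items to 2-sequences of the form x -> y. It is not SPADE itself: the names `build_id_lists`, `temporal_join` and `support` are hypothetical, the toy database is invented, and the join for longer sequences sharing a prefix is omitted.

```python
from collections import defaultdict


def build_id_lists(database):
    """Map each item to its id-list: (sid, eid) pairs where it occurs,
    with sid the sequence index and eid the itemset index."""
    id_lists = defaultdict(list)
    for sid, sequence in enumerate(database):
        for eid, itemset in enumerate(sequence):
            for item in itemset:
                id_lists[item].append((sid, eid))
    return id_lists


def temporal_join(list_x, list_y):
    """Id-list of 'x -> y': occurrences of y that come strictly after
    some occurrence of x in the same sequence."""
    first_x = {}
    for sid, eid in list_x:
        first_x[sid] = min(eid, first_x.get(sid, eid))
    return [(sid, eid) for sid, eid in list_y
            if sid in first_x and first_x[sid] < eid]


def support(id_list):
    """Number of distinct sequences in which the pattern occurs."""
    return len({sid for sid, _ in id_list})


# Toy database (hypothetical data): one list of itemsets per sequence.
database = [
    [{"A"}, {"B"}, {"C"}],
    [{"A"}, {"C"}],
    [{"A", "B"}, {"C"}],
    [{"B"}, {"A"}],
]
ids = build_id_lists(database)
a_then_c = temporal_join(ids["A"], ids["C"])
a_then_b = temporal_join(ids["A"], ids["B"])
```

Once the id-lists are built, the support of a candidate is obtained from the joined list alone, without rescanning the database.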

SPADE

DFS_MINE
- Depth-First-Search approach
- fast discovery of long maximal frequent patterns
- uses a minimal amount of memory
- some sequences are deduced to be frequent from the lattice
- candidate (k+1)-sequence: intersect a k-sequence with all frequent items (FreqItems)
- in main memory:
  – S.Useless: set of items that sequence S must not be intersected with
  – MaxFreqList: list of maximal frequent sequences
  – MinNonFreqList: list of minimal non-frequent sequences
- scan the database to determine the support of candidate sequences

BCDBCD CDCD In MinNonFreqList candidate BCDBCD ABCDEABCDE In MaxFreqList candidate MaxFreqList - MinNonFreqList Lemma: All subsequences of a frequent sequence are also frequent Supersequences S is inserted in MinNonFreqList if: S is not in MinNonFreqList S is not a supersequence of a sequence in MinNonFreqList S was scanned in database and was found to be non-frequent Supersequences of S in MinNonFreqList are removed.  S is inserted in MaxFreqList if: S is not in MaxFreqList S is not a subsequence of a sequence in MaxFreqList S was scanned in database and was found to be frequent Subsequences of S in MaxFreqList are removed.

Examining Candidate Sequences
- k-sequence S is intersected with all items I_j in FreqItems - S.Useless
- resulting sets: SET(S+I_j) for all I_j
- for each candidate sequence: check MinNonFreqList, then check MaxFreqList
- scan the database for all unknown sequences (if any) in SET(S+I_j) for all I_j (1 pass)
- update MaxFreqList, MinNonFreqList
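The checking order described above might look like the following sketch. It assumes sequences are lists of sets and that support is the number of database sequences containing a candidate; `classify`, `examine`, and the toy data are hypothetical, and updating the two lists afterwards is left to the caller.

```python
def is_subsequence(s1, s2):
    """True if s1 is a subsequence of s2 (greedy left-to-right match)."""
    pos = 0
    for itemset in s1:
        while pos < len(s2) and not itemset <= s2[pos]:
            pos += 1
        if pos == len(s2):
            return False
        pos += 1
    return True


def classify(candidate, max_freq, min_non_freq):
    """Decide a candidate from the two lists alone, if possible."""
    if any(is_subsequence(m, candidate) for m in min_non_freq):
        return "non-frequent"  # supersequence of a non-frequent sequence
    if any(is_subsequence(candidate, m) for m in max_freq):
        return "frequent"      # subsequence of a frequent sequence
    return "unknown"


def examine(candidates, max_freq, min_non_freq, database, min_sup):
    """Return each candidate's status and how many needed the scan."""
    status = [classify(c, max_freq, min_non_freq) for c in candidates]
    unknown = [i for i, s in enumerate(status) if s == "unknown"]
    counts = dict.fromkeys(unknown, 0)
    for sequence in database:  # a single pass serves all unknown candidates
        for i in unknown:
            if is_subsequence(candidates[i], sequence):
                counts[i] += 1
    for i in unknown:
        status[i] = "frequent" if counts[i] >= min_sup else "non-frequent"
    return status, len(unknown)


database = [
    [{"A"}, {"B"}, {"C"}],
    [{"A"}, {"C"}],
    [{"A", "B"}, {"C"}],
    [{"B"}, {"A"}],
]
candidates = [[{"A"}], [{"A"}, {"B"}, {"C"}], [{"B"}, {"C"}]]
status, scanned = examine(
    candidates,
    max_freq=[[{"A"}, {"C"}]],
    min_non_freq=[[{"A"}, {"B"}]],
    database=database,
    min_sup=2,
)
```

In this toy run only B->C has to be counted against the database; the other two candidates are decided from the lists.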

Generating sequences
- k-sequence S + I_j in FreqItems - S.Useless = candidate (k+1)-sequences
- A -> B -> C -> D + E:
  1. E -> A -> B -> C -> D
  2. AE -> B -> C -> D
  3. A -> E -> B -> C -> D
  4. A -> BE -> C -> D
  5. A -> B -> E -> C -> D
  6. A -> B -> CE -> D
  7. A -> B -> C -> E -> D
  8. A -> B -> C -> DE
  9. A -> B -> C -> D -> E
- A -> B -> C -> D + D:
  1. D -> A -> B -> C -> D
  2. AD -> B -> C -> D
  3. A -> D -> B -> C -> D
  4. A -> BD -> C -> D
  5. A -> B -> D -> C -> D
  6. A -> B -> CD -> D
  7. A -> B -> C -> D -> D
  8. A -> B -> C -> DD
  9. A -> B -> C -> D -> D
- Insert item I_j in all possible positions that follow its rightmost occurrence in the k-sequence S. If the item does not occur at all in the sequence, then it is inserted in all positions.
[Figure: insertion examples with repeated items A and D]
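The insertion rule can be sketched in a few lines. The representation (a list of sets), the names `generate_candidates` and `show`, and one interpretive choice are assumptions: here "positions that follow the rightmost occurrence" is read as positions strictly after the itemset holding that occurrence, so the item is never merged into an itemset that already contains it.

```python
def generate_candidates(sequence, item):
    """Insert `item` into every position after its rightmost occurrence,
    or into every position if it does not occur in `sequence`."""
    # Index of the rightmost itemset containing `item`, or -1 if absent.
    last = max((i for i, s in enumerate(sequence) if item in s), default=-1)
    candidates = []
    for i in range(last + 1, len(sequence) + 1):
        # New single-item itemset placed at position i (i == len appends).
        candidates.append(sequence[:i] + [{item}] + sequence[i:])
        # Item merged into the existing itemset at position i.
        if i < len(sequence):
            merged = sequence[i] | {item}
            candidates.append(sequence[:i] + [merged] + sequence[i + 1:])
    return candidates


def show(sequence):
    """Render a sequence in the slide's notation, e.g. 'AE->B->C->D'."""
    return "->".join("".join(sorted(itemset)) for itemset in sequence)


abcd = [{"A"}, {"B"}, {"C"}, {"D"}]
with_e = [show(c) for c in generate_candidates(abcd, "E")]
with_d = [show(c) for c in generate_candidates(abcd, "D")]
```

Under this reading, adding E yields the nine candidates in the order listed on the slide, while adding D yields only A->B->C->D->D.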

Useless Set of a sequence S
- After intersecting S with item I_j, I_j is inserted in S.Useless.
- When intersecting S with item I_j, all items I_k (k<j) are in S.Useless.
- S.Useless is 'inherited' by the (k+1)-sequences produced.
- A -> B + E: E -> A -> B, AE -> B, A -> E -> B, A -> BE, A -> B -> E
- A -> B + D: D -> A -> B, AD -> B, A -> D -> B, A -> BD, A -> B -> D
- D -> A -> B + E: E -> D -> A -> B, DE -> A -> B, D -> E -> A -> B, D -> AE -> B, D -> A -> E -> B, D -> A -> BE, D -> A -> B -> E
[Figure, Scenario 1 and Scenario 2: Sequence S extended with items A and B into SET(S+A) and SET(S+B), then into SET(S+A,A), SET(S+A,B) = SET(S+B,A) and SET(S+B,B); Sequence S extended with items D and E into SET(S+D), SET(S+E) and SET(S+D,E), with the labels "E not frequent" and "Bound to be not frequent"]
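One way to see what the Useless set buys is to enumerate which orders of item additions a depth-first search explores when each child inherits its parent's Useless set. The sketch below tracks only the items added along each path, not the sequences themselves; `extension_orders` is a hypothetical name, and the sketch ignores frequency-based pruning.

```python
def extension_orders(freq_items, depth):
    """List the item tuples added along each depth-first path when the
    Useless set is inherited by the sequences produced."""
    paths = []

    def dfs(path, inherited_useless):
        if len(path) == depth:
            paths.append(tuple(path))
            return
        useless = set(inherited_useless)  # the child starts from a copy
        for item in freq_items:
            if item in useless:
                continue
            # Children produced with `item` inherit the items marked so far.
            dfs(path + [item], useless)
            # After intersecting with `item`, it becomes useless here.
            useless.add(item)

    dfs([], set())
    return paths


orders = extension_orders(["A", "B"], depth=2)
```

For items A and B this explores (A, A), (A, B) and (B, B) but not (B, A), which matches SET(S+A,B) = SET(S+B,A) being generated only once.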

Open Problems
- Output-subexponential algorithms for maximal sequential patterns
- Efficient algorithms for finding episodes (approximate sequential patterns, edit distance)

Spatiotemporal Datasets
[Figures: Temperature Map US; Snow-ice-rain radar US; Snow-ice-rain radar NE; Precipitation radar Bay Area; Precipitation radar Lakes]

Mining Spatiotemporal Data
- CONQUEST [Stolorz et al, KDD 1995]
  – Patterns in global climate change
- SKICAT [Fayyad et al, 1996]
  – Image processing techniques and classification techniques to identify objects in satellite pictures
- GeoMiner [Han et al, 1997], MultiMediaMiner [Zaiane et al, 1998]
  – Data Cube structure. Mining of association and classification rules.
- DFS-Mine [Tsoukatos et al, 2001]
  – Discovery of spatiotemporal patterns

Open Problems
- Similarity models and indexing techniques for higher-dimensional time series
- Efficient trend detection/subsequence matching algorithms
- Algorithms to capture the data distribution when it changes over time
- New models for capturing the evolution of spatial phenomena over time