A Short Introduction to Sequential Data Mining


1 A Short Introduction to Sequential Data Mining
Koji IWANUMA, Hidetomo NABESHIMA
University of Yamanashi
The First Franco-Japanese Symposium on Knowledge Discovery in System Biology, September 17, Aix-en-Provence

2 Two Main Frameworks of Sequential Mining
Sequential pattern mining for multiple data sequences:
  Sequence ID   Purchase data record
  1             <bread, cheese>
  2             <(wheat, milk), bread, (berry, sausage)>
  3             <(bread, pumpkin, sausage)>
  4             <bread, cheese, sausage>
  5             <cheese>
Sequential pattern mining for a single data sequence:
  Data sequence: <S1 S2 S3 S4 S5 S6 S7 … … Sn>

3 What Is Sequential Pattern Mining?
Given a set of sequences, find the complete set of frequent subsequences.
A sequence: <(ef)(ab)(df)cb>
A sequence database:
  SID   sequence
  10    <a(abc)(ac)d(cf)>
  20    <(ad)c(bc)(ae)>
  30    <(ef)(ab)(df)cb>
  40    <eg(af)cbc>
An element may contain a set of items. Items within an element are unordered and we list them alphabetically.
<a(bc)dc> is a subsequence of <a(abc)(ac)d(cf)>.
Given support threshold min_sup = 2, <(ab)c> is a sequential pattern.
Source: J. Han and M. Kamber. Data Mining: Concepts and Techniques.
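The subsequence and support definitions on this slide can be checked mechanically. The following is a minimal Python sketch, not taken from the presentation: the helper names `parse`, `is_subsequence` and `support` are hypothetical, and items are assumed to be single characters, as in the slide's notation.

```python
def parse(text):
    """Parse the slide notation, e.g. 'a(abc)(ac)d(cf)', into a list of item sets.

    Assumption: every item is a single character.
    """
    elements, i = [], 0
    while i < len(text):
        if text[i] == "(":
            j = text.index(")", i)
            elements.append(frozenset(text[i + 1:j]))
            i = j + 1
        else:
            elements.append(frozenset(text[i]))
            i += 1
    return elements


def is_subsequence(pattern, sequence):
    """Greedy check: each pattern element must be a subset of some
    sequence element, and the matched elements must appear in order."""
    pos = 0
    for element in pattern:
        # Skip sequence elements that do not contain this pattern element.
        while pos < len(sequence) and not element <= sequence[pos]:
            pos += 1
        if pos == len(sequence):
            return False
        pos += 1  # the next pattern element must match strictly later
    return True


def support(pattern, database):
    """Number of data sequences that contain the pattern."""
    return sum(is_subsequence(pattern, seq) for seq in database.values())


# The sequence database from the slide.
database = {
    10: parse("a(abc)(ac)d(cf)"),
    20: parse("(ad)c(bc)(ae)"),
    30: parse("(ef)(ab)(df)cb"),
    40: parse("eg(af)cbc"),
}

print(is_subsequence(parse("a(bc)dc"), database[10]))  # True
print(support(parse("(ab)c"), database))               # 2
```

Sequences 10 and 30 contain <(ab)c>, so its support is 2, which meets min_sup = 2.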

4 Challenges on Sequential Pattern Mining
A huge number of possible sequential patterns are hidden in databases.
A mining algorithm should:
  find the complete set of patterns, when possible, satisfying the minimum support (frequency) threshold
  be highly efficient and scalable, involving only a small number of database scans
  be able to incorporate various kinds of user-specific constraints
Source: J. Han and M. Kamber. Data Mining: Concepts and Techniques.

5 Sequential Pattern Mining Algorithms for Multiple Data Sequences
Apriori-based method: GSP (Generalized Sequential Patterns: Srikant et al., EDBT’96)
Pattern-growth methods: FreeSpan & PrefixSpan (Han et al.; Pei et al.)
Vertical format-based mining: SPADE (Machine Learning’00)
Constraint-based sequential pattern mining (SPIRIT: Garofalakis, Rastogi, et al.; Pei, Han, et al., CIKM’02)
Mining closed sequential patterns: CloSpan (Yan, Han, et al.)
Source: J. Han and M. Kamber. Data Mining: Concepts and Techniques.
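To give a feel for the pattern-growth idea behind PrefixSpan, here is a minimal Python sketch. It is an illustration under simplifying assumptions, not the published algorithm in full: every element is a single item (no itemsets such as (ab)), the function name `prefixspan` and the toy data are hypothetical, and no optimizations are attempted.

```python
from collections import defaultdict


def prefixspan(sequences, min_sup):
    """Return {pattern_tuple: support} for sequences of single items.

    Simplified pattern-growth search: a frequent prefix is extended one
    item at a time, looking only at the suffixes (the "projected
    database") that follow the prefix in each sequence.
    """
    results = {}

    def grow(prefix, projected):
        # projected: list of (sequence index, start position) pairs.
        # For each item, record its first position in each suffix.
        first_pos = defaultdict(dict)
        for sid, start in projected:
            seq = sequences[sid]
            for pos in range(start, len(seq)):
                first_pos[seq[pos]].setdefault(sid, pos)
        for item in sorted(first_pos):
            occurrences = first_pos[item]
            if len(occurrences) >= min_sup:
                pattern = prefix + (item,)
                results[pattern] = len(occurrences)
                # Project again: keep what follows the first occurrence.
                grow(pattern, [(sid, pos + 1) for sid, pos in occurrences.items()])

    grow((), [(sid, 0) for sid in range(len(sequences))])
    return results


# Hypothetical toy data, not from the slides.
toy = [["a", "b", "c"], ["a", "c"], ["b", "c"]]
patterns = prefixspan(toy, min_sup=2)
print(patterns)
```

With min_sup = 2 the sketch finds a, b, c, <a c> and <b c>; <a b> occurs in only one sequence and is never extended, which is where the pruning comes from.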

6 Mining Sequential Patterns from a Very-Long Single Sequence
A series of daily newspaper articles, viewed as one long sequence:
< … typhoon … flood, landslide … typhoon … flood, landslide … >
Recurring pattern: <typhoon (flood, landslide)>

7 Sequential Pattern Mining Algorithms for a Single Data Sequence
Discovery of frequent episodes in event sequences, based on a sliding window system [Mannila 1998]:
  The frequency measure becomes anti-monotonic, but has a problem: an occurrence may be counted more than once (duplicate counting).
Asynchronous periodic pattern mining [Yang et al. 2000, Huang 2004]:
  No anti-monotonic frequency measure is investigated.
On-line approximation algorithm for mining frequent items, not for frequent subsequences:
  Lossy counting algorithm [Manku and Motwani, VLDB’02]
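The duplicate-counting problem of a window-based frequency can be seen in a few lines of Python. This is a minimal sketch under stated assumptions, not the cited algorithm: one event per time step, only windows lying entirely inside the sequence are counted (a simplification), and the names `contains_serial` and `window_frequency` are hypothetical.

```python
def contains_serial(window, episode):
    """True if the events of `episode` occur in `window` in the given order."""
    it = iter(window)
    # `event in it` advances the iterator until the event is found,
    # so each later event must appear after the earlier ones.
    return all(event in it for event in episode)


def window_frequency(events, episode, width):
    """Number of sliding windows of the given width that contain the episode.

    Assumption: only windows fully inside the sequence are counted.
    """
    return sum(
        contains_serial(events[start:start + width], episode)
        for start in range(len(events) - width + 1)
    )


events = ["x", "typhoon", "flood", "x"]
print(window_frequency(events, ("typhoon", "flood"), width=3))  # 2
print(window_frequency(events, ("typhoon",), width=3))          # 2
```

The sequence contains a single occurrence of typhoon followed by flood, yet two overlapping windows cover it, so the window-based frequency is 2. The measure is still anti-monotonic: any window containing the longer episode also contains each of its sub-episodes.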

8 Research in Our Laboratory
Sequential data mining from a very large single data sequence.
Main target: sequential textual data, especially newspaper-article corpora
  Objective: to generate a robust and useful large-scale corpus of event sequences.
  Application 1: topic tracking/detection in information retrieval.
  Application 2: automated content tracking on the Web.
  Application 3: semi-automatic creation of scenarios/stories.
Ordinary temporal data analysis: various log data in computer systems, genetic information, etc.

9 Technical Topics (1/2)
A new framework for extracting frequent subsequences from a single long data sequence, in the IEEE International Conference on Data Mining 2005 (ICDM 2005):
  A new rational frequency measure, which satisfies the Apriori (anti-monotonic) property and has no duplicate counting.
  A fast on-line algorithm for some limited cases.

10 Technical Topics (2/2)
Ongoing and future work:
  On-line rational filters based on confidence criteria and/or information gain, for eliminating redundant, valueless sequences from system output
  Methods for finding meta-structures embedded in the huge number of frequent sequences generated by a system
    A method using compression based on context-free grammar inference/learning
  A faster extraction algorithm based on a method for simultaneously searching multiple strings over compressed data

11 References
Jiawei Han and Micheline Kamber. Data Mining: Concepts and Techniques (Chapter 8).

12 Thanks for your attention!!

