Online Frequent Episode Mining. Xiang Ao 1, Ping Luo 1, Chengkai Li 2, Fuzhen Zhuang 1 and Qing He 1. 2015-9-18. X. Ao et al.

Presentation transcript:


Agenda: Introduction; Problem Formulation; Solution Framework; Experimental Results; Conclusions

Introduction: Frequent episode mining (FEM) techniques are widely used to analyze data sequences in many domains: manufacturing, telecommunication, finance, biology, news analysis, and system log analysis. (Figure: an event sequence, with events plotted against time stamps.) An episode (specifically, a serial episode in this paper) is a totally ordered set of events. E.g., D → A is an episode.

Introduction: FEM aims at identifying all frequent episodes, i.e., the episodes whose frequencies are larger than a user-specified threshold.

Introduction: FEM algorithms are usually time-consuming:
1. The anti-monotonicity property may fail to hold for episode frequency [Achar, 2012].
2. Testing whether an episode occurs in a sequence is an NP-complete problem [Tatti, 2011].
[Achar, 2012] A. Achar, S. Laxman, and P. Sastry, “A unified view of the apriori-based algorithms for frequent episode discovery,” KAIS, 2012.
[Tatti, 2011] N. Tatti and B. Cule, “Mining closed episodes with simultaneous events,” in KDD, 2011.

Introduction: Previous studies on FEM mostly process data offline in batch mode. (Diagram: an FEM algorithm outputs frequent episodes from historical data; on updated data it outputs updated frequent episodes, which may be different.)

Introduction: In this paper, we consider the online frequent episode mining (OFEM) problem. Newly emerging episodes may become valuable. Old episodes may become obsolete. Time-critical applications need efficient methods to find recent and frequent episodes.

Introduction: Examples of motivating applications: predictive maintenance and high-frequency trading. Characteristics: fast-growing data, recency effect, time-critical analysis.

Introduction: Challenges for an OFEM algorithm:
• Events that are infrequent at the current moment may become frequent in the future.
• The computation is intensive and generates a large number of episode occurrences.
• Efficiently mining all occurrences of episodes over the growing sequence is also a major challenge.

Introduction: Contributions of this paper:
• Propose an algorithm, MESELO (Mining frEquent Serial Episode via Last Occurrence), for online frequent episode mining.
• Design a data structure, the episode trie, to compactly store all minimal occurrences of episodes.
• Introduce the concept of last episode occurrence.
• Compare our method with state-of-the-art batch-mode FEM methods based on minimal occurrence.


Problem Formulation: Frequent episodes may change as the sequence keeps growing. Mining is performed over the valid sequence, where ∆ denotes the window size of the valid sequence.
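
As a rough illustration of a valid sequence of size ∆ over a growing sequence, the following minimal sketch keeps only the event sets of the most recent ∆ timestamps. It assumes integer timestamps that arrive in increasing order; the class name `ValidSequence` and the parameter name `big_delta` are hypothetical and do not come from the slides.

```python
from collections import deque


class ValidSequence:
    """Sliding window holding the event sets of the latest `big_delta` timestamps.

    Assumption (not from the slides): timestamps are integers and arrive in
    increasing order.
    """

    def __init__(self, big_delta):
        self.big_delta = big_delta
        self.window = deque()  # items are (timestamp, frozenset of events)

    def append(self, timestamp, events):
        """Add the event set of a new timestamp and drop expired ones."""
        self.window.append((timestamp, frozenset(events)))
        # Anything at or before `timestamp - big_delta` is outside the window.
        while self.window and self.window[0][0] <= timestamp - self.big_delta:
            self.window.popleft()

    def timestamps(self):
        return [t for t, _ in self.window]


vs = ValidSequence(big_delta=3)
for t, events in enumerate([{"A"}, {"B"}, {"A", "D"}, {"B"}, {"D"}], start=1):
    vs.append(t, events)
print(vs.timestamps())
```

After five appends with `big_delta=3`, only timestamps 3, 4 and 5 remain, so any episode whose occurrences all lie in timestamps 1 and 2 can no longer be counted.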


Solution Framework: A minimal occurrence is an occurrence of an episode whose time window does not contain any other occurrence of the same episode. A → B is a serial episode in the example; consider also another episode, D → D. A minimal episode occurrence is additionally bounded by a user-specified parameter, the maximal occurrence window δ. The support of A → B is 2 in the example.
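
The definition can be made concrete with a small brute-force sketch. The slide's example figure is not part of this transcript, so the sequence `seq` below is hypothetical, chosen only so that A → B has support 2. The sketch also assumes that events in a serial episode occur at strictly increasing integer timestamps and that the δ bound means `end - start < delta`; both are assumptions, and this is not the MESELO algorithm itself.

```python
def earliest_end(seq, episode, start):
    """Earliest end time of an occurrence of `episode` that starts at `start`.

    Returns None if the first event is absent at `start` or the episode
    cannot be completed. Events must occur at strictly increasing timestamps.
    """
    if episode[0] not in seq.get(start, ()):
        return None
    t = start
    times = sorted(seq)
    for event in episode[1:]:
        t = next((u for u in times if u > t and event in seq[u]), None)
        if t is None:
            return None
    return t


def minimal_occurrences(seq, episode, delta):
    """Windows (start, end) of minimal occurrences with end - start < delta."""
    candidates = []
    for start in sorted(seq):
        end = earliest_end(seq, episode, start)
        if end is not None:
            candidates.append((start, end))
    # A window is minimal if no candidate with a later start ends within it.
    minimal = [
        (s, e) for s, e in candidates
        if not any(s2 > s and e2 <= e for s2, e2 in candidates)
    ]
    return [(s, e) for s, e in minimal if e - s < delta]


# Hypothetical sequence: timestamp -> set of simultaneous events.
seq = {1: {"A"}, 2: {"B"}, 3: {"A", "D"}, 4: {"B"}, 5: {"D"}}
print(minimal_occurrences(seq, ("A", "B"), delta=3))
print(minimal_occurrences(seq, ("D", "D"), delta=3))
```

The support of an episode is then the number of windows returned. In the hypothetical sequence, A → B has two minimal occurrences, and D → D has one that disappears once δ drops to 2.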

Solution Framework: The concept of local time window. (Figure labels: valid sequence; δ − 1; frequent episodes; updated frequent episodes.)

Solution Framework: The concept of last episode occurrence. (Figure labels: last occurrence of A → B in the local time window; minimal but not last occurrence of A → B in the local time window; last minimal occurrence of A → B in the local time window.)
• In MESELO, only last minimal episode occurrences can be further expanded to new minimal episode occurrences.

Solution Framework: The concept of minimal occurrence starting at t_i and ending no later than t_j.
Definition (minimal episode occurrence starting at t_i and ending no later than t_j). Given a time window [t_i, t_j], we consider the set of all minimal episode occurrences whose start time is equal to t_i and whose end time is no later than t_j.
In the running example, the set for start time 5 is {(A, [5, 5]), (A → A, [5, 6]), (A → B, [5, 6]), (A → B → B, [5, 7]), (A → A → B, [5, 7])}.
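
The set in the running example can be reproduced with a brute-force enumeration. The slide's sequence is not included in this transcript, so the event sets at timestamps 5 to 7 below are inferred from the listed occurrences ({A} at 5, {A, B} at 6, {B} at 7) and should be read as an assumption. The function names are hypothetical, and the sketch does not use the last-occurrence idea that MESELO relies on.

```python
def occurs_within(seq, episode, lo, hi):
    """True if the serial episode occurs with all its timestamps in [lo, hi]."""
    t = lo - 1
    for event in episode:
        # Greedily take the earliest later timestamp that carries the event.
        t = next((u for u in range(t + 1, hi + 1) if event in seq.get(u, ())), None)
        if t is None:
            return False
    return True


def minimal_from(seq, ti, tj):
    """All minimal occurrences that start at ti and end no later than tj."""
    earliest = {}  # episode (tuple of events) -> earliest end time found
    stack = [((e,), ti) for e in seq.get(ti, ())]
    while stack:
        episode, end = stack.pop()
        if episode in earliest and earliest[episode] <= end:
            continue
        earliest[episode] = end
        # Extend with any event at a strictly later timestamp up to tj.
        for u in range(end + 1, tj + 1):
            for e in seq.get(u, ()):
                stack.append((episode + (e,), u))
    # Keep a window only if no occurrence starting after ti fits inside it.
    return {
        (episode, (ti, end))
        for episode, end in earliest.items()
        if not occurs_within(seq, episode, ti + 1, end)
    }


# Event sets inferred from the running example (an assumption).
seq = {5: {"A"}, 6: {"A", "B"}, 7: {"B"}}
for occurrence in sorted(minimal_from(seq, 5, 7)):
    print(occurrence)
```

Note that A → B also occurs in [5, 7], but that window contains the occurrence in [5, 6], so it is not minimal and is not reported.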

Solution Framework: (Figure labels: Δ; δ − 1; the sequence grows to k + 1.)

Solution Framework: The episode trie. An episode trie is used to represent the set of minimal episode occurrences that start at t_i. Each node p = p.event:p.time consists of two fields, p.event and p.time. p.event registers which event this node represents. p.time registers the occurrence timestamp. The event field of the root is associated with the empty string (labeled as “root”), and the time field of the root is equal to t_i. The event sequence along the path from the root to p denotes a minimal episode occurrence, and its occurrence window is [t_i, p.time]. E.g., (A → A, [5, 6]).
Legend: a non-last occurrence node denotes a minimal but not last occurrence; a last occurrence node denotes a last minimal occurrence.
• In MESELO, only last occurrence nodes can be further expanded to new minimal episode occurrences.
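
A minimal sketch of such a trie, built by hand for the five minimal occurrences that start at timestamp 5 in the running example. The class `TrieNode` and the helper `occurrences` are hypothetical names; the last / non-last flag on nodes is left out, and keying children by event is an assumption that relies on each episode having at most one minimal occurrence per start time.

```python
from dataclasses import dataclass, field


@dataclass
class TrieNode:
    event: str  # p.event: the event this node represents ("root" for the root)
    time: int   # p.time: the occurrence timestamp (t_i for the root)
    children: dict = field(default_factory=dict)  # event -> TrieNode

    def child(self, event, time):
        """Return the child for `event`, creating it at `time` if absent."""
        if event not in self.children:
            self.children[event] = TrieNode(event, time)
        return self.children[event]


def occurrences(root):
    """Yield (episode, (t_i, p.time)) for every non-root node p."""
    stack = [(root, ())]
    while stack:
        node, prefix = stack.pop()
        for ch in node.children.values():
            episode = prefix + (ch.event,)
            yield episode, (root.time, ch.time)
            stack.append((ch, episode))


root = TrieNode("root", 5)
a = root.child("A", 5)   # (A, [5, 5])
aa = a.child("A", 6)     # (A -> A, [5, 6])
ab = a.child("B", 6)     # (A -> B, [5, 6])
aa.child("B", 7)         # (A -> A -> B, [5, 7])
ab.child("B", 7)         # (A -> B -> B, [5, 7])

for occurrence in sorted(occurrences(root)):
    print(occurrence)
```

Shared prefixes are stored once: the five occurrences contain 11 events in total when written out separately, but the trie needs only 5 non-root nodes.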

Solution Framework: MESELO Algorithm. Basically:
Step 1: create a new episode trie, and update the superscript, from k to k+1, of each existing episode trie that still varies.
Step 2: transfer the episode trie out of main memory.

Solution Framework: (Figure labels: valid sequence; latest δ timestamps; before processing; after processing.) For more details, the proof of soundness and completeness of the algorithm, and the complexity analysis, please refer to the paper.


Experimental Results: Setup
Data sets: online mode and batch mode.
Environments:
Mining Server: 2.00 GHz Intel Xeon E G gigabytes memory, Windows 2008
Database Server: 2.00 GHz Intel Xeon E G gigabytes memory, Linux CentOS
100MB connection
Baselines:
Mode          Baseline
Online mode   BRUTE
Online mode   MESELO-BS (degradation of the MESELO algorithm)
Batch mode    PPS [ICDM’04]
Batch mode    MINEPI+ [Info. Sys.’08]
Batch mode    UP-Span [KDD’13]
Batch mode    DFS [DKE’13]

Experimental Results (1): Online mode data preparation
Data come from the China Stock Exchange Daily Trading list (denoted as Stock-1 to Stock-6), over 2,509 trading days from January 1st, 2004 to May 9th. We always select the leading stocks from each industry.

Table 1. Details of online mode data sets
Industry Name         # of Stocks   Dataset Name
Pharmaceuticals       1             Stock-1
Security              2             Stock-2
Electricity Power     4             Stock-3
Iron and Steel        6             Stock-4
Nonferrous-material   8             Stock-5
Estate                10            Stock-6

Build stock events from the daily closing price:
1. Calculate the increase ratio r of the price between two consecutive trading days.
2. Discretize the value of r into 4 levels: UH (r >= 3.5%), UL (0% ≤ r ≤ 3.5%), DL (−3.5% ≤ r < 0%), DH (r ≤ −3.5%).
3. Each stock then has exactly one of the four events every day.
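
The three steps above can be sketched as follows. The function names are hypothetical. The slide's ranges overlap at exactly ±3.5%, so this sketch assumes the boundary values go to the stronger level (UH at +3.5%, DH at −3.5%); that tie-break is an assumption, not something the slides state.

```python
def stock_event(prev_close, close):
    """Map the price change between two consecutive trading days to an event."""
    r = (close - prev_close) / prev_close  # increase ratio
    if r >= 0.035:
        return "UH"  # up, high (assumed to include exactly +3.5%)
    if r <= -0.035:
        return "DH"  # down, high (assumed to include exactly -3.5%)
    return "UL" if r >= 0 else "DL"  # up, low / down, low


def daily_events(closes):
    """One event per trading day after the first, for a single stock."""
    return [stock_event(p, c) for p, c in zip(closes, closes[1:])]


print(daily_events([100.0, 104.0, 105.0, 104.0, 99.0]))
```

With several stocks in a dataset, the events of all stocks on the same day would form the event set of that timestamp, for example by prefixing each event with a stock identifier.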

Experimental Results (2): Online mode experimental results
Comparison method: sequentially read the event set of each incoming time stamp and perform online frequent episode mining. Record the execution time at each time stamp and use the average value as the measure for the comparison.
Note: the average time over all time stamps depends only on δ.

Experimental Results (4): Batch mode data preparation

Table 2. Details of batch mode data sets
Dataset Name   Data Type
Retail         Market basket data from stores.
ChainStore     Market basket data from stores.
Kosarak        Click-stream data from web sites.
BMS            Click-stream data from web sites.

Note: The four datasets were originally built for sequential pattern mining. We follow the processing steps in [1] to convert them from the sequential pattern mining data form (horizontal) to the episode mining data form (vertical).

Tid   Events
1     A, B, D
2     B, E
3     A, F
…     …

[1] C.-W. Wu, Y.-F. Lin, P. S. Yu, and V. S. Tseng, “Mining high utility episodes in complex event sequences,” in KDD.
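
The slides do not spell out what the horizontal and vertical forms look like. Under the common reading, where a horizontal table maps each Tid to its events and a vertical table maps each event to the Tids where it appears, the conversion could be sketched as below. Treating each Tid as a timestamp and the name `to_vertical` are assumptions; the exact processing steps are those of [1].

```python
from collections import defaultdict


def to_vertical(rows):
    """Convert (tid, events) rows into an event -> sorted list of tids mapping.

    Assumption: each tid is used as the timestamp of its events.
    """
    vertical = defaultdict(list)
    for tid, events in sorted(rows, key=lambda row: row[0]):
        for event in sorted(events):
            vertical[event].append(tid)
    return dict(vertical)


rows = [(1, {"A", "B", "D"}), (2, {"B", "E"}), (3, {"A", "F"})]
for event, tids in sorted(to_vertical(rows).items()):
    print(event, tids)
```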

Experimental Results (5): Batch mode performance evaluations
Comparison method: vary min_sup and δ.
1. Fix δ and vary min_sup. (See Fig. 8)
2. Fix min_sup and vary δ. (See Fig. 9)
BMS has a shorter sequence length and, most importantly, fewer events per timestamp than the other datasets.


Conclusions:
• New problem: online frequent episode mining. It is especially useful for time-critical applications with growing sequences.
• Efficient online algorithm (i.e., MESELO). Experiments on real data sets show that MESELO is at least one order of magnitude faster than the other baselines.
• New concepts of last episode occurrence and episode trie. Minimal episode occurrences are detected efficiently, and all minimal episode occurrences are stored in a compact way.

Thanks! Q&A