Free-rider Episode Screening via Dual Partition Model

Free-rider Episode Screening via Dual Partition Model
Xiang Ao (1,2), Yang Liu (1,2), Zhen Huang (3), Luo Zuo (1,2), and Qing He (1,2)
(1) Institute of Computing Technology, CAS, Beijing, China
(2) University of Chinese Academy of Sciences, Beijing, China
(3) Tsinghua University, Beijing, China

Agenda Motivation Related Work The Proposed Model Experiments Conclusion

Motivation
Frequency-based pattern mining derives redundant and uninteresting output.
Customer basket transaction database (min_sup = 3/7):
Tid  Items
1    beer, milk, diaper, coke
2    beer, bread, diaper, coke
3    beer, coke, beef
4    diaper, pork, beer, coke
5    milk, bread, coke
6    milk, bread, beer, coke, diaper
7    bread, beef, milk, coke
Frequent patterns highlighted: (diaper, beer) and (milk, bread), then (diaper, beer, coke) and (milk, bread, coke).

Motivation
Frequency-based pattern mining derives overwhelmingly redundant and uninteresting output.
(diaper, beer): support = 4/7, length = 2
(diaper, beer, coke): support = 4/7, length = 3 (maximal, closed)
(milk, bread): support = 3/7, length = 2
(milk, bread, coke): support = 3/7, length = 3 (free-rider, maximal, closed)

Motivation
Framework: compare the pattern's observed support with its expected support.
Given a pattern X, $\mathit{Lift}(X) = \frac{\mathrm{support}(X)}{\mathrm{Expect\_sup}(X)}$. For $X = A \cup B$, $\mathrm{Expect\_sup}(X) = P(A) \times P(B)$.
For the pattern X = (diaper, beer), treating "diaper" and "beer" as independent, $\mathit{Lift}(X) = \frac{4/7}{4/7 \times 5/7} = 1.4$.
For the pattern Y = (diaper, beer, coke), treating "(diaper, beer)" and "coke" as independent, $\mathit{Lift}(Y) = \frac{4/7}{4/7 \times 7/7} = 1$.
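
The two Lift values above can be checked with a short script. This is a minimal sketch, not the authors' code: the helper names `support` and `lift` are hypothetical, and transaction 7 is taken as "bread, beef, milk, coke", the reading under which coke has support 7/7 as the slide's arithmetic requires.

```python
from fractions import Fraction

# Customer basket transaction database from the slide (Tid 1..7).
transactions = [
    {"beer", "milk", "diaper", "coke"},
    {"beer", "bread", "diaper", "coke"},
    {"beer", "coke", "beef"},
    {"diaper", "pork", "beer", "coke"},
    {"milk", "bread", "coke"},
    {"milk", "bread", "beer", "coke", "diaper"},
    {"bread", "beef", "milk", "coke"},
]


def support(itemset):
    """Fraction of transactions that contain every item of the itemset."""
    hits = sum(1 for t in transactions if set(itemset) <= t)
    return Fraction(hits, len(transactions))


def lift(part_a, part_b):
    """Observed support of A ∪ B divided by support(A) * support(B)."""
    observed = support(set(part_a) | set(part_b))
    expected = support(part_a) * support(part_b)
    return observed / expected


lift_x = lift({"diaper"}, {"beer"})          # X = (diaper, beer)
lift_y = lift({"diaper", "beer"}, {"coke"})  # Y = (diaper, beer, coke)
print(float(lift_x), float(lift_y))
```

A Lift of 1 for Y means that adding coke carries no information beyond (diaper, beer), which is what marks Y as a free-rider.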

Motivation
Frequent episode mining aims at discovering frequently occurring ordered sets of events in a single symbol (event) sequence.
[Figure: an event sequence; axes: Time stamps, Events]
An episode (specifically a serial episode in this paper) is a totally ordered set of events. E.g., b → d is an episode.
Application domains: manufacturing, telecommunication, finance, biology, system log analysis, news analysis.
In this work, our task is to screen uninteresting (free-rider) episodes from a given set of frequent episodes.
Problem statement of free-rider episode screening: given an event sequence S and a set of frequent episodes on S, rank the episodes by the Lift produced by the proposed model and filter out the episodes whose Lift values are less than a threshold min_lift.

Challenges
Challenges of free-rider episode screening:
1. It is non-trivial to define a good expected support of an episode.
   - The episode occurrences in a single sequence may depend on each other, so support is no longer a sum of independent variables.
   - Noise events may be mixed in before, after, or inside a real pattern.
2. It is non-trivial to compute the expected support of an episode efficiently.
   - The search space of partitions suffers from combinatorial explosion: it grows exponentially with the number of events in an episode.

Contributions
Contributions of this paper:
1. We propose a novel partition model, EDP (Episode Dual Partition).
   - Dual partition mechanism: non-prefix episodes are considered.
   - Novel definition of the expected support of an episode based on generated random sequences.
2. We propose an efficient automaton-based algorithm to compute the expected support.
3. We verify the effectiveness and efficiency of EDP on both synthetic and real-world datasets.

Agenda Motivation Related Work The Proposed Model Experiments Conclusion

PRT [Tatti, DMKD15]
Prefix graph partition: partition an episode into two consecutive sub-episodes and treat them as independent of each other.
Episode α: a → b → c → d
Automaton of α: H1 -a-> H2 -b-> H3 -c-> H4 -d-> H5
Partitions of α: [Figure: four partitions of the automaton states H1 to H5]
Limitation: it cannot take non-prefix sub-episodes into consideration, while noise events may be mixed inside real patterns.
Nikolaj Tatti, Ranking episodes using a partition model, DMKD 29(5):1312–1342, 2015

SkOPUS [Petitjean et al., DMKD16]
Partition-based alternative reordering: define an expected support by recombining reorderings derived from a binary sequential partition.
Episode α: a → b → c → d
One binary sequential partition (BSP) of α: {<a, c, d>, <b>}
Reordering alternatives based on this BSP: <b, a, c, d>, <a, b, c, d>, <a, c, b, d>, <a, c, d, b>
Expected support of α under this partition: the average support of these candidates in the original sequence.
Limitation: reordering candidates may distort the expected support, especially as the episode becomes longer.
Francois Petitjean et al., Mining top-k sequential patterns under leverage, DMKD 30(5):1086–1111, 2016
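
The reordering alternatives listed on this slide are the order-preserving merges of the two parts of the partition. The sketch below enumerates them and averages a support function over them; it illustrates the idea as described on the slide and is not the SkOPUS implementation. The names `interleavings` and `expected_support` and the caller-supplied `support_fn` are hypothetical.

```python
def interleavings(left, right):
    """All merges of two sequences that keep each sequence's internal order."""
    if not left:
        return [tuple(right)]
    if not right:
        return [tuple(left)]
    with_left = [(left[0],) + rest for rest in interleavings(left[1:], right)]
    with_right = [(right[0],) + rest for rest in interleavings(left, right[1:])]
    return with_left + with_right


def expected_support(left, right, support_fn):
    """Average support of all reordering candidates (support_fn is supplied by the caller)."""
    candidates = interleavings(left, right)
    return sum(support_fn(c) for c in candidates) / len(candidates)


# Binary sequential partition {<a, c, d>, <b>} of the episode a -> b -> c -> d.
candidates = interleavings(("a", "c", "d"), ("b",))
for candidate in candidates:
    print(" ".join(candidate))
```

For this partition the script prints four candidates, matching the four alternatives on the slide; the number of candidates grows quickly with episode length, which is one way to see why longer episodes are harder for this approach.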

Agenda Motivation Related Work The Proposed Model Experiments Conclusion

Episode Dual Partition Model
Step 1: Dual Partition Strategy
Partition α into an informative event set ($I_\alpha$) and a random event set ($\bar{I}_\alpha$).
Episode α: a → b → c → d, e.g. $I_\alpha = \{a, c\}$ and $\bar{I}_\alpha = \{b, d\}$.
The intuition is that real patterns arise from implicit generative rules: the events in $I_\alpha$ form the real pattern, while the others are noise.
Step 2: Generate Random Sequences
Given a dual partition $I_\alpha = \{a, c\}$, $\bar{I}_\alpha = \{b, d\}$ and the original event sequence, random sequences are generated by keeping the events in $I_\alpha$ fixed as in the original sequence and drawing the events in $\bar{I}_\alpha$ according to their independence probabilities.
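
Steps 1 and 2 can be sketched as follows. This is an illustration under stated assumptions, not the EDP implementation: the sequence is modelled as a list of event sets (one per timestamp), each random event is redrawn independently at every timestamp with a caller-supplied probability, and events outside the random set are left untouched. The names `dual_partitions`, `random_sequence`, and `prob` are hypothetical.

```python
import random
from itertools import combinations


def dual_partitions(episode):
    """Yield every (informative, random) split of the episode's events."""
    events = sorted(set(episode))
    for size in range(len(events) + 1):
        for chosen in combinations(events, size):
            informative = frozenset(chosen)
            yield informative, frozenset(events) - informative


def random_sequence(sequence, random_events, prob, rng):
    """Keep non-random events in place; redraw each random event per timestamp.

    Assumption: prob[e] is the independence probability of event e,
    e.g. its empirical frequency in the original sequence.
    """
    result = []
    for event_set in sequence:
        kept = {e for e in event_set if e not in random_events}
        drawn = {e for e in sorted(random_events) if rng.random() < prob[e]}
        result.append(kept | drawn)
    return result


partitions = list(dual_partitions("abcd"))
print(len(partitions))  # one split per subset of {a, b, c, d}

original = [{"a"}, {"b"}, {"c"}, {"d"}, {"b", "x"}]
sample = random_sequence(original, frozenset("bd"), {"b": 0.4, "d": 0.2}, random.Random(0))
```

Because every subset of the episode's events is a candidate informative set, the number of dual partitions doubles with each additional event.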

Episode Dual Partition Model
Step 3: Expected Support Definition
We define a novel expected support of an episode based on such random sequences.
Given a specific dual partition, the expected support of α is defined as the weighted sum of the observed support of α on each random sequence.
The expected support of an episode α is defined as the maximum expected support value that any dual partition can produce.
Finally, rank the episodes by Lift and filter out the episodes whose Lift is less than min_lift.
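
The definitions in Step 3 reduce to a weighted sum, a maximum, and a threshold. The sketch below shows only that arithmetic. All numbers are invented for illustration, the weights are assumed to be the probabilities of the random sequences (the slide does not say how they are chosen), and the function names are hypothetical.

```python
def partition_expected_support(weighted_supports):
    """Weighted sum of the supports observed on each random sequence."""
    return sum(weight * sup for weight, sup in weighted_supports)


def expected_support(per_partition):
    """Maximum expected support that any dual partition produces."""
    return max(partition_expected_support(ws) for ws in per_partition.values())


def screen(episodes, min_lift):
    """Rank episodes by Lift and drop those whose Lift is below min_lift.

    episodes maps a name to (observed support, expected support).
    """
    lifts = {name: obs / exp for name, (obs, exp) in episodes.items()}
    kept = [(name, value) for name, value in lifts.items() if value >= min_lift]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)


# Toy values: (weight, observed support) pairs per dual partition.
per_partition = {
    "I={a,c}": [(0.5, 4), (0.5, 2)],
    "I={}": [(0.25, 8), (0.75, 0)],
}
exp_sup = expected_support(per_partition)
ranked = screen({"abcd": (6, exp_sup), "bXd": (3, exp_sup)}, min_lift=1.5)
print(exp_sup, ranked)
```

Taking the maximum over partitions is the conservative choice: an episode keeps a high Lift only if no dual partition explains its observed support.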

Expected Support Calculation Algorithm
Generating multiple random sequences and detecting the episode on each of them may be time-consuming, so we propose an efficient algorithm based on automatons.
The algorithm tracks episode occurrences with automatons and does not generate specific random sequences.
We generate event sets sequentially as the timestamp increases and maintain a list of all possible active automatons at each timestamp.
We consider all possible event sets at each timestamp.
The time complexity of the algorithm is $O(2^{|\bar{I}_\alpha| + |\alpha|}\, n\, |\alpha|)$, and pruning allows early stopping.
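
For readers unfamiliar with automaton-based episode tracking, the sketch below shows a single automaton scanning one concrete sequence. It is far simpler than the algorithm on this slide, which maintains a list of active automatons over all possible event sets. It also assumes that support means the number of non-overlapped occurrences of a serial episode, which the slides do not state. The function name is hypothetical.

```python
def count_non_overlapped(sequence, episode):
    """Greedily count non-overlapped occurrences of a serial episode.

    The automaton state is the index of the next event it waits for.
    Events that do not match are skipped, so gaps are allowed.
    """
    state = 0
    count = 0
    for event in sequence:
        if event == episode[state]:
            state += 1
            if state == len(episode):
                # Accepting state reached: record the occurrence and restart.
                count += 1
                state = 0
    return count


sequence = "abXcdabcd"
print(count_non_overlapped(sequence, "abcd"))
print(count_non_overlapped(sequence, "bXd"))
```

The single pass costs one comparison per event; the exponential factor in the slide's complexity comes from considering every possible event set per timestamp rather than one fixed sequence.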

Agenda Motivation Related Work The Proposed Model Experiments Conclusion

Experiments - Three Datasets
SYN: a synthetic dataset in which known patterns are embedded.
STK: daily prices of 50 blue-chip stocks from the Chinese stock market over 25 years; an event is generated whenever a stock's price increases.
JMLR: abstracts of papers published in the Journal of Machine Learning Research; every word is treated as an event.

Experiments - Four Baselines
PRT [1]: a partition model.
SkOPUS [2]: a partition model.
EGH [3]: an independence model.
IND: the degraded version of EDP, which treats every event as a random event.
[1] Tatti, N.: Ranking episodes using a partition model. DMKD 29, 1312–1342 (2015)
[2] Petitjean, F., Li, T., Tatti, N., Webb, G.I.: Skopus: mining top-k sequential patterns under leverage. DMKD 30, 1086–1111 (2016)
[3] Laxman, S., Sastry, P.S., Unnikrishnan, K.P.: Discovering frequent episodes and learning hidden Markov models: a formal connection. IEEE TKDE 17, 1505–1517 (2005)

Experiments - Results on Data with Known Patterns
Item         Symbols  Gap or Not  Frequency
1st episode  abc      No          300
2nd episode  defg     Yes
Noise event  X        /           3000
Real patterns: 15 embedded patterns, i.e. the non-empty sub-episodes of abc and defg except the single events, e.g. ab, ac, bc, de, df, dg, ef, eg, fg, def, dfg, deg, efg.
Fake patterns:
1. Free-riders with noise: bcX, bXd, dXe, ...
2. Overlaps: acd, abcd, abd, ...
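
The count of 15 real patterns can be reproduced by enumerating the order-preserving sub-episodes of abc and defg with at least two events. A minimal sketch; the helper name `sub_episodes` is hypothetical.

```python
from itertools import combinations


def sub_episodes(episode, min_len=2):
    """Order-preserving sub-episodes with at least min_len events (the episode itself included)."""
    return [
        "".join(chosen)
        for length in range(min_len, len(episode) + 1)
        for chosen in combinations(episode, length)
    ]


real_patterns = sub_episodes("abc") + sub_episodes("defg")
print(len(real_patterns), real_patterns)
```

`itertools.combinations` emits elements in input order, so every generated string respects the order of the embedded episode: 4 patterns come from abc and 11 from defg.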

Experiments - Results on Data with Known Patterns

Experiments - Results on Data with Known Patterns
$p_{ind}(X)$ is the probability of the embedded high-frequency noise event.

Experiments - Results on STK Dataset
Table 4. The percentage of the 10 most frequent events in the top k episodes in STK.
Real patterns may be mixed with highly frequent but unrelated events, which turns the episode into a free-rider.
$\mathrm{Ratio} = \frac{\#(\text{most frequent 10 events in top } k)}{\#(\text{events in top } k)}$
A higher ratio may indicate a greater chance of free-rider episodes among the top k episodes.
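
The ratio can be computed as sketched below. The slide does not say whether events are counted with or without repetition; this sketch assumes every event position in the top k episodes is counted. The function name and the toy data are hypothetical, and the example uses the 2 most frequent events instead of 10 to stay small.

```python
from collections import Counter


def top_event_ratio(sequence, top_k_episodes, n_frequent=10):
    """Share of event positions in the top k episodes held by the most frequent events."""
    frequent = {event for event, _ in Counter(sequence).most_common(n_frequent)}
    events = [event for episode in top_k_episodes for event in episode]
    return sum(event in frequent for event in events) / len(events)


toy_sequence = list("aaabbbccd")
toy_top_k = [("a", "b"), ("c", "d"), ("a", "c")]
ratio = top_event_ratio(toy_sequence, toy_top_k, n_frequent=2)
print(ratio)
```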

Experiments - Results on STK Dataset
[Figure: top episodes on STK under PRT, SkOPUS, and EDP (our model); labels: a software corp., a securities corp., a mineral corp., a dairy corp.; marker: "Most frequent"]

Experiments - Results on JMLR Dataset

Experiments - Scalability
We distributed the EDP model across multiple processes and visualized its speedup.

Agenda Motivation Related Work The Proposed Model Experiments Conclusion

Conclusions
1. We propose a novel partition model, EDP (Episode Dual Partition), to rank interesting episodes and filter out the free-riders.
   - A dual partition mechanism: non-prefix episodes are considered.
   - A new definition of the expected support of an episode based on generated random sequences.
2. We propose an efficient automaton-based algorithm that computes the expected support without generating specific random sequences.
3. We verify the effectiveness and efficiency of EDP on both synthetic and real-world datasets, where it outperforms recent state-of-the-art methods.

Free-rider Episode Screening via Dual Partition Model Thanks for listening! Feel free to contact me at aoxiang@ict.ac.cn