Free-rider Episode Screening via Dual Partition Model
Xiang Ao (1,2), Yang Liu (1,2), Zhen Huang (3), Luo Zuo (1,2), and Qing He (1,2)
(1) Institute of Computing Technology, CAS, Beijing, China
(2) University of Chinese Academy of Sciences, Beijing, China
(3) Tsinghua University, Beijing, China
Agenda
- Motivation
- Related Work
- The Proposed Model
- Experiments
- Conclusion
Motivation
Frequency-based pattern mining derives redundant and uninteresting output.

Customer basket transaction database (min_sup = 3/7):

Tid  Items
1    beer, milk, diaper, coke
2    beer, bread, diaper, coke
3    beer, coke, beef
4    diaper, pork, beer, coke
5    milk, bread, coke
6    milk, bread, beer, coke, diaper
7    bread, beef, milk, coke

Frequent patterns highlighted on the slide: (diaper, beer), (milk, bread), (diaper, beer, coke), (milk, bread, coke).
Motivation
Frequency-based pattern mining derives overwhelmingly redundant and uninteresting output.

Pattern               Support  Length  Labels
diaper, beer          4/7      2
diaper, beer, coke    4/7      3       Maximal, Closed
milk, bread           3/7      2
milk, bread, coke     3/7      3       Maximal, Closed

Maximal and closed pattern mining both keep the length-3 patterns, although the appended coke is a free-rider: it occurs in every transaction.
Motivation
Framework: compare a pattern's observed support with its expected support.
Given a pattern $X$: $\mathit{Lift}(X) = \dfrac{\mathit{support}(X)}{\mathit{Expect\_sup}(X)}$
For $X = A \cup B$, with $A$ and $B$ treated as independent: $\mathit{Expect\_sup}(X) = P(A) \times P(B)$
- For the pattern X = (diaper, beer), treating "diaper" and "beer" as independent: $\mathit{Lift}(X) = \dfrac{4/7}{4/7 \times 5/7} = 1.4$
- For the pattern Y = (diaper, beer, coke), treating "(diaper, beer)" and "coke" as independent: $\mathit{Lift}(Y) = \dfrac{4/7}{4/7 \times 7/7} = 1$
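The two Lift values can be re-derived from the basket table. The sketch below is only an illustration: the names `DB`, `support`, and `lift` are hypothetical helpers, not part of the talk, and the database is the seven-transaction example with transaction 7 taken as bread, beef, milk, coke.

```python
from fractions import Fraction

# Basket database from the motivation slide (Tid -> items).
DB = {
    1: {"beer", "milk", "diaper", "coke"},
    2: {"beer", "bread", "diaper", "coke"},
    3: {"beer", "coke", "beef"},
    4: {"diaper", "pork", "beer", "coke"},
    5: {"milk", "bread", "coke"},
    6: {"milk", "bread", "beer", "coke", "diaper"},
    7: {"bread", "beef", "milk", "coke"},
}

def support(items):
    """Fraction of transactions that contain every item in `items`."""
    items = set(items)
    hits = sum(1 for basket in DB.values() if items <= basket)
    return Fraction(hits, len(DB))

def lift(part_a, part_b):
    """Lift of A ∪ B when A and B are treated as independent."""
    observed = support(set(part_a) | set(part_b))
    expected = support(part_a) * support(part_b)
    return observed / expected

lift_x = lift({"diaper"}, {"beer"})          # X = (diaper, beer)
lift_y = lift({"diaper", "beer"}, {"coke"})  # Y = (diaper, beer, coke)
print(lift_x, lift_y)
```

Exact fractions are used so that the results match the slide without rounding: 7/5 (that is, 1.4) for X and 1 for Y.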
Motivation
Frequent episode mining aims at discovering frequently occurring ordered sets of events in a single symbol (event) sequence.
An episode (specifically a serial episode in this paper) is a totally ordered set of events, e.g., b → d is an episode.
Application domains: manufacturing, telecommunication, finance, biology, system log analysis, news analysis.
[Figure: example event sequences; labels: Time stamps, Events.]
In this work, our task is to screen uninteresting (free-rider) episodes from a given set of frequent episodes.
Problem statement of free-rider episode screening: given an event sequence S and a set of frequent episodes on S, rank the episodes by the Lift produced by the proposed model and filter out the episodes whose Lift values are less than a threshold min_lift.
Challenges
Challenges of free-rider episode screening:
- It is non-trivial to define a good expected support of an episode.
  - The episode occurrences in a single sequence may depend on each other, so support is no longer a sum of independent variables.
  - Noise events may appear before, after, or inside a real pattern.
- It is non-trivial to compute the expected support of an episode efficiently.
  - The search space of partitions suffers from combinatorial explosion: it grows exponentially with the number of events in an episode.
Contributions
Contributions of this paper:
1. We propose a novel partition model, EDP (Episode Dual Partition).
   - A dual partition mechanism in which non-prefix sub-episodes are considered.
   - A novel definition of the expected support of an episode, based on generative random sequences.
2. We propose an efficient automaton-based algorithm to compute the expected support.
3. We verify the effectiveness and efficiency of EDP on both synthetic and real-world datasets.
PRT [Tatti, DMKD15]
Prefix graph partition: partition an episode into two consecutive sub-episodes and treat them as independent of each other.
Episode α: a → b → c → d
Automaton of α: H1 -a-> H2 -b-> H3 -c-> H4 -d-> H5
[Figure: partitions of α, drawn as four copies of the state chain H1 ... H5, one per partition.]
Limitation: PRT cannot take non-prefix sub-episodes into consideration, while noise events may be inserted inside real patterns.
Nikolaj Tatti, Ranking episodes using a partition model, DMKD 29(5):1312–1342, 2015
SkOPUS [Petitjean et al., DMKD16]
Partition-based alternative reordering: define the expected support through reordered compositions based on binary sequential partitions.
Episode α: a → b → c → d
One binary sequential partition (BSP) of α: {<a, c, d>, <b>}
Reordering alternatives based on this BSP: b a c d, a b c d, a c b d, a c d b
Expected support of α under this partition: the average support of these candidates in the original sequence.
Limitation: the reordering candidates may distort the expected support, especially as the episode becomes longer.
Francois Petitjean et al., Mining top-k sequential patterns under leverage, DMKD 30(5):1086–1111, 2016
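The four reordering alternatives listed for the BSP {<a, c, d>, <b>} are exactly the merges of the two parts that preserve each part's internal order. The sketch below reproduces them under that reading; `interleavings`, `expected_support`, and the `support_of` callback are hypothetical names, and this is not the SkOPUS implementation.

```python
def interleavings(left, right):
    """Yield every merge of two sequences that keeps each one's internal order."""
    if not left:
        yield tuple(right)
        return
    if not right:
        yield tuple(left)
        return
    # Either the next event comes from `left` ...
    for rest in interleavings(left[1:], right):
        yield (left[0],) + rest
    # ... or it comes from `right`.
    for rest in interleavings(left, right[1:]):
        yield (right[0],) + rest

def expected_support(left, right, support_of):
    """Average support over all reordering candidates of one partition.

    `support_of` is an assumed callback returning the support of a
    candidate episode in the original sequence.
    """
    candidates = list(interleavings(tuple(left), tuple(right)))
    return sum(support_of(c) for c in candidates) / len(candidates)

candidates = sorted(interleavings(("a", "c", "d"), ("b",)))
for candidate in candidates:
    print(" ".join(candidate))
```

The number of candidates grows quickly with the episode length, which is consistent with the limitation stated on the slide.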
Episode Dual Partition Model
Step 1: Dual Partition Strategy
Partition α into an informative event set ($I_\alpha$) and a random event set ($\bar{I}_\alpha$).
Episode α: a → b → c → d
Example: $I_\alpha = \{a, c\}$ and $\bar{I}_\alpha = \{b, d\}$
The intuition behind this partition is the assumption that real patterns arise from implicit generative rules: the events in $I_\alpha$ form the real pattern, while the others are noise.
Step 2: Generate Random Sequences
Given a dual partition $I_\alpha = \{a, c\}$, $\bar{I}_\alpha = \{b, d\}$ and the original event sequence, random sequences are generated by keeping the events in $I_\alpha$ exactly as in the original sequence and drawing the events in $\bar{I}_\alpha$ independently according to their probabilities.
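Steps 1 and 2 can be sketched as follows. Several points are assumptions rather than details from the talk: the sequence is modelled as a list of event sets (one per timestamp), every split of the episode's events is enumerated (including the two trivial ones), and each random event is redrawn at every timestamp with its empirical per-timestamp frequency. The function names are hypothetical.

```python
import random
from itertools import combinations

def dual_partitions(episode):
    """Yield (informative, random) splits of the episode's distinct events."""
    events = list(dict.fromkeys(episode))  # distinct events, order preserved
    for size in range(len(events) + 1):
        for informative in combinations(events, size):
            rand = tuple(e for e in events if e not in informative)
            yield informative, rand

def random_sequence(sequence, random_events, rng):
    """Keep every other event in place and redraw the random events.

    Assumption: a random event occurs at a timestamp independently with
    probability equal to the fraction of timestamps that contain it.
    """
    n = len(sequence)
    prob = {e: sum(e in slot for slot in sequence) / n for e in random_events}
    result = []
    for slot in sequence:
        kept = {e for e in slot if e not in prob}                  # fixed events
        drawn = {e for e, p in prob.items() if rng.random() < p}   # redrawn noise
        result.append(kept | drawn)
    return result

original = [{"a"}, {"b"}, {"c", "x"}, {"d"}, {"a", "b"}, {"c"}]
sample = random_sequence(original, ("b", "d"), random.Random(0))
print(len(list(dual_partitions("abcd"))), sample)
```

An episode with four distinct events already has 16 splits, which illustrates why enumerating partitions becomes expensive for longer episodes.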
Episode Dual Partition Model
Step 3: Expected Support Definition
We define a novel expected support of an episode based on these random sequences.
- Given a specific dual partition, the expected support of α is the weighted sum of the observed support of α on each random sequence.
- The expected support of an episode α is the maximum expected support value that any dual partition can produce.
Episodes are then ranked by Lift, and episodes whose Lift is less than min_lift are filtered out.
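The definition and the final screening step can be written down directly once per-sequence supports are available. In the sketch below the input layout, the function names, and all numbers are hypothetical; the weights are assumed to be the probabilities of the random sequences, and expected supports are assumed to be positive.

```python
def expected_support(partition_samples):
    """Maximum over dual partitions of the weighted sum of observed supports.

    `partition_samples` maps a partition label to (weight, support) pairs,
    one pair per random sequence generated under that partition.
    """
    return max(
        sum(weight * sup for weight, sup in samples)
        for samples in partition_samples.values()
    )

def screen(episodes, min_lift):
    """Rank episodes by Lift and drop those whose Lift is below `min_lift`.

    `episodes` maps an episode to (observed_support, expected_support).
    """
    lifts = {ep: obs / exp for ep, (obs, exp) in episodes.items()}
    ranked = sorted(lifts.items(), key=lambda kv: kv[1], reverse=True)
    return [(ep, value) for ep, value in ranked if value >= min_lift]

# Made-up numbers, for illustration only.
samples = {
    "{a,c} | {b,d}": [(0.5, 4), (0.5, 2)],
    "{a} | {b,c,d}": [(0.25, 8), (0.75, 0)],
}
exp_sup = expected_support(samples)
kept = screen({"a->b": (6, exp_sup), "b->X": (5, 5.0)}, min_lift=1.5)
print(exp_sup, kept)
```

Taking the maximum over partitions gives the most favourable explanation of the episode by noise, so an episode keeps a high Lift only if no partition accounts for its observed support.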
Expected Support Calculation Algorithm
Generating multiple random sequences and detecting the episode on each of them may be time-consuming, so we propose an efficient algorithm based on automata.
- The algorithm tracks episode occurrences with automata and does not generate specific random sequences.
- We sequentially generate event sets as the timestamp increases and maintain a list of all possible active automata at each timestamp.
- We consider all possible event sets at each timestamp.
- The time complexity of the algorithm is $O(2^{|\bar{I}_\alpha| + |\alpha|}\, n\, |\alpha|)$, but pruning allows early stopping.
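To make the automaton idea concrete, the sketch below tracks a serial episode on one fixed sequence of event sets. It is deliberately simpler than the algorithm on the slide, which works over all possible event sets without materialising any sequence; that expectation computation is not reproduced here. Counting non-overlapped occurrences is an assumption, and `count_occurrences` is a hypothetical name.

```python
def count_occurrences(sequence, episode):
    """Count non-overlapped occurrences of a serial episode.

    `sequence` is a list of event sets, one per timestamp. Consecutive
    events of the episode must fall on strictly increasing timestamps.
    """
    active = set()  # states of live automata: number of events matched so far
    count = 0
    for slot in sequence:
        # Advance every live automaton whose next expected event is present.
        advanced = {s + 1 if episode[s] in slot else s for s in active}
        # Spawn a fresh automaton when the first event of the episode occurs.
        if episode[0] in slot:
            advanced.add(1)
        if len(episode) in advanced:
            # An automaton reached its final state: record the occurrence
            # and discard the others so that occurrences do not overlap.
            count += 1
            active = set()
        else:
            active = advanced
    return count

sequence = [{"a"}, {"b", "x"}, {"a"}, {"x"}, {"b"}, {"b"}]
print(count_occurrences(sequence, ("a", "b")))
```

Automata in the same state are merged by storing states in a set, so at most one entry per state is kept at each timestamp.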
Experiments - Three Datasets
- SYN: a synthetic dataset in which known patterns are embedded.
- STK: daily prices of 50 blue-chip stocks from the Chinese stock market over 25 years; an event is generated whenever a stock's price increases.
- JMLR: abstracts of papers published in the Journal of Machine Learning Research; every word is considered an event.
Experiments - Four Baselines
- PRT [1]: a partition model
- SkOPUS [2]: a partition model
- EGH [3]: an independence model
- IND: the degraded version of EDP, which treats every event as a random event
[1] Tatti, N.: Ranking episodes using a partition model. DMKD 29, 1312–1342 (2015)
[2] Petitjean, F., Li, T., Tatti, N., Webb, G.I.: Skopus: mining top-k sequential patterns under leverage. DMKD 30, 1086–1111 (2016)
[3] Laxman, S., Sastry, P.S., Unnikrishnan, K.P.: Discovering frequent episodes and learning hidden Markov models: a formal connection. IEEE TKDE 17, 1505–1517 (2005)
Experiments - Results on Data with Known Patterns
Synthetic data:

Item          Symbols  Gap or Not  Frequency
1st episode   abc      No          300
2nd episode   defg     Yes
Noise event   X        /           3000

Real patterns: the 15 embedded patterns, i.e., the non-empty sub-episodes of abc and defg excluding single events, e.g., ab, ac, bc, de, df, dg, ef, eg, fg, def, dfg, deg, efg.
Fake patterns:
1. Free-riders with noise: bcX, bXd, dXe, ...
2. Overlaps: acd, abcd, abd, ...
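The count of 15 real patterns can be checked by enumerating the order-preserving sub-episodes of abc and defg that contain at least two events. The helper name `sub_episodes` is hypothetical.

```python
from itertools import combinations

def sub_episodes(episode, min_len=2):
    """All order-preserving sub-episodes with at least `min_len` events."""
    return [
        "".join(sub)
        for size in range(min_len, len(episode) + 1)
        for sub in combinations(episode, size)
    ]

real_patterns = sub_episodes("abc") + sub_episodes("defg")
print(len(real_patterns), real_patterns)
```

The enumeration gives 4 patterns from abc and 11 from defg, including the two full episodes themselves.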
Experiments - Results on Data with Known Patterns
$p_{ind}(X)$ is the probability of the embedded high-frequency noise event X.
Experiments - Results on STK Dataset
Table 4. The percentage of the most frequent 10 events in the top k episodes in STK.
Real patterns may be mixed with highly frequent but unrelated events, which turns the episode into a free-rider.
$\mathit{Ratio} = \dfrac{\#(\text{most frequent 10 events in top } k)}{\#(\text{events in top } k)}$
A higher ratio may indicate that free-rider episodes are more likely to appear among the top k episodes.
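One way to compute this ratio is sketched below. It assumes the counts are taken over event slots (an event appearing in several top-k episodes is counted each time), which the slide does not specify; `frequent_event_ratio` and the toy data are hypothetical.

```python
from collections import Counter

def frequent_event_ratio(sequence_events, top_episodes, n_frequent=10):
    """Share of event slots in the top-k episodes that are filled by the
    `n_frequent` most frequent events of the sequence."""
    frequent = {e for e, _ in Counter(sequence_events).most_common(n_frequent)}
    slots = [e for episode in top_episodes for e in episode]
    return sum(e in frequent for e in slots) / len(slots)

# Toy data: "a" and "b" are the two most frequent events.
events = list("aaaabbbcd")
top_k = [("a", "c"), ("b", "a"), ("c", "d")]
print(frequent_event_ratio(events, top_k, n_frequent=2))
```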
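Experiments - Results on STK Dataset
[Figure: top episodes on STK in three panels: PRT Results, SkOPUS Results, and EDP Results (our model). Annotations on the events: a software corp., a securities corp., a mineral corp., a dairy corp.; one annotation reads "Most frequent".]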
Experiments - Results on JMLR Dataset
Experiments - Scalability
We distributed the EDP computation across multiple processes and plotted the resulting speedup.
Conclusions
- We propose a novel partition model, EDP (Episode Dual Partition), to rank interesting episodes and filter out the free-riders.
  - A dual partition mechanism in which non-prefix sub-episodes are considered.
  - A new definition of the expected support of an episode, based on generative random sequences.
- We propose an efficient automaton-based algorithm that computes the expected support without generating specific random sequences.
- We verify the effectiveness and efficiency of EDP on both synthetic and real-world datasets, where it outperforms recent state-of-the-art methods.
Free-rider Episode Screening via Dual Partition Model
Thanks for listening!
Feel free to contact me at aoxiang@ict.ac.cn