Mining Probabilistic Sequential Patterns
Name: Yip Chi Kin
Date: 18-05-2006



Studied Papers
[YWY03] Periodic Patterns
[CWC94] Pattern Prediction
[YWY01] Meta-Pattern Model
[YWY03a] Probabilistic Model: Surprising Patterns
[YWY03b] Probabilistic Model: Distance Penalty

Main Aspects
․ Time-series dataset
․ Noise and repeats
․ Multi-sequence
․ Pattern mining: Frequent (non-position), Regular (cyclic position), Probabilistic, Significant, Approximate

Periodic Patterns
․ min_rep
․ max_dis
․ valid pattern
․ valid subsequence

Sequence: (a1, a2, a2, a1, a3, a2, a1, a4, a2, a1, a4, a4, a1, a1, a3, a2, a1, a4, a2, a1, a5, a2, a1, a2, a2)

With max_dis = 4 and min_rep = 3, the valid pattern (a1, *, a2) is a 2-pattern of period 3, and the sequence above contains a valid subsequence for min_rep = 3, max_dis = 4.
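The min_rep/max_dis check is mechanical enough to sketch in code. The following Python is a minimal illustration, assuming a pattern is a tuple whose None entries are don't-care positions; the function names and the run-merging rule are ours, not the paper's exact algorithm.

    def matches(segment, pattern):
        # A period-long segment fits the pattern if every non-None position agrees.
        return all(p is None or s == p for s, p in zip(segment, pattern))

    def valid_subsequences(seq, pattern, min_rep, max_dis):
        # Collect runs of contiguous repetitions of the pattern, keep runs with
        # at least min_rep repetitions, then stitch together runs whose gap is
        # at most max_dis events.
        period, n = len(pattern), len(seq)
        runs, i = [], 0
        while i + period <= n:
            if matches(seq[i:i + period], pattern):
                start, reps = i, 0
                while i + period <= n and matches(seq[i:i + period], pattern):
                    reps += 1
                    i += period
                if reps >= min_rep:
                    runs.append((start, reps))
            else:
                i += 1
        merged = []
        for start, reps in runs:
            if merged and start - (merged[-1][0] + merged[-1][1] * period) <= max_dis:
                prev_start, prev_reps = merged.pop()
                merged.append((prev_start, prev_reps + reps))
            else:
                merged.append((start, reps))
        return merged

    seq = ["a1","a2","a2","a1","a3","a2","a1","a4","a2","a1","a4","a4",
           "a1","a1","a3","a2","a1","a4","a2","a1","a5","a2","a1","a2","a2"]
    print(valid_subsequences(seq, ("a1", None, "a2"), min_rep=3, max_dis=4))

On the slide's sequence this finds a run of three repetitions, then a gap of four events, then a run of four repetitions, merged into a single valid subsequence.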

Meta-pattern Model
․ Don't care positions
․ Predefined meta-patterns

[Slide graphic: several weekly refill strings built from the symbols below, e.g. r - r - r - □ r - r …]
r : a refill order of flu medicine in the corresponding week
- : no flu medicine replenishment in that week
r (marked) : noise/distortion of the pattern
□ : a position eliminated from the sequence

A meta-pattern is a "pattern about patterns", e.g. (r, -, -, r, -).

Probabilistic Model
․ CWC model: exact match, prefix rules, weight of evidence (threshold = 1.96), significant patterns, pattern prediction, combining certainties
․ YWY model (STAMP and InfoMiner algorithms): information gain, surprising patterns, Apriori property, relevant position counting

Information Gain
Sequence: (a1, a1, a2, a2, a2, a3, a3, a3, a3, a3)

Event | Probability | Information gain
a1 | 0.2 | I(a1) = −log_3(0.2) = 1.465
a2 | 0.3 | I(a2) = −log_3(0.3) = 1.096
a3 | 0.5 | I(a3) = −log_3(0.5) = 0.631

(the log base is the number of distinct events, here 3)
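As a quick check of the table, the measure is just a logarithm with base equal to the alphabet size; a few lines of Python reproduce the values (the helper name is ours):

    import math

    probs = {"a1": 0.2, "a2": 0.3, "a3": 0.5}

    def information(event):
        # I(a) = -log_m Pr(a), with m = number of distinct events (here 3),
        # so rarer events carry more information.
        return -math.log(probs[event], len(probs))

    for e in probs:
        print(e, round(information(e), 3))   # a1 1.465, a2 1.096, a3 0.631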

YWY model ․ InfoMiner
Sequence (10 periods of length 4):
(a1, a3, a4, a5, a1, a4, a3, a3, a2, a6, a3, a2, a1, a4, a3, a3, a1, a3, a3, a5, a1, a3, a4, a5, a2, a3, a3, a5, a1, a3, a4, a5, a2, a6, a3, a5, a2, a6, a2, a2)

Event | Repetition
a1 | 6
a2 | 7
a3 | 13
a4 | 5
a5 | 6
a6 | 3

YWY model ․ InfoMiner
MIG threshold: min_gain = 4.5

Pattern | Information gain
(a1, *, *, *) | I(a1) × 5 = 5.30

max_info = I(a1) + I(a6) + I(a4) + I(a5) = 4.73
min_rep = ⌈min_gain / max_info⌉ = ⌈4.5 / 4.73⌉ = 1

YWY model ․ InfoMiner
Projected subsequence (on the a1-prefix):
(a1, a3, a4, a5, a1, a4, a3, a3, a1, a4, a3, a3, a1, a3, a3, a5, a1, a3, a4, a5, a1, a3, a4, a5)

max_info = I(a1) + I(a4) + I(a4) + I(a5) = 4.44
min_rep = ⌈4.5 / 4.44⌉ = 2

Pattern | Information gain
(a1, a3, *, *) | [I(a1) + I(a3)] × 3 = 5.19
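The bound that drives this pruning is simple: a pattern can only reach min_gain if it repeats at least ⌈min_gain / max_info⌉ times, and projecting on a prefix shrinks max_info, which tightens the bound. A sketch (names ours):

    import math

    def min_rep_bound(min_gain, max_info):
        # Fewest repetitions that could still accumulate min_gain when each
        # period contributes at most max_info.
        return math.ceil(min_gain / max_info)

    print(min_rep_bound(4.5, 4.73))   # 1 on the whole sequence
    print(min_rep_bound(4.5, 4.44))   # 2 after projecting on the a1-prefix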

YWY model ․ STAMP
Generalized Information Gain (GIG)
Given: I(a1) = 1.1, I(a2) = 1.2, I(a3) = 1.3; period = 3

Sequence: a2, a3, a1 | a4, a1, a1 | a2, a3, a4 | a2, a3, a1 | a6, a2, a1 | a2, a3, a1 | a2, a3, a7

GIG of (a2, *, *) = (5 − 1) × 1.2 − 1.2 − 1.2 = 2.4
GIG of (*, a3, *) = (5 − 1) × 1.3 = 5.2
GIG of (a2, a3, *) = (5 − 1) × 2.5 − 1.2 − 1.3 − 1.2 = 6.3

YWY model ․ STAMP
Optimal Information Surplus (OIS)
Given: I(a2) = 1.1, period = 3

Event sequence:
a1 a2 a7 a4 a9 a2 a4 a2 a2 a4 a9 a7 a4 a2 a2 a6 a9 a2 a4 a2 a1 a4 a9 a7 a6 a6 a2

[Slide table: for each occurrence of a2, the gain (+1.1), the distance-based loss, and the running OIS per period position]

If 3 < distance ≤ 6 then loss = −1.1
If 6 < distance ≤ 9 then loss = −1.1 × 2 = −2.2
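The two loss cases follow one rule: each whole period that passes without a recurrence costs I(event) once. A sketch of that reading, inferred from the slide's two ranges rather than taken from the paper:

    import math

    def distance_loss(info, distance, period):
        # One I(event) of loss per whole period skipped between occurrences:
        # distance <= period costs nothing, each further period costs info.
        return -info * (math.ceil(distance / period) - 1)

    print(distance_loss(1.1, 5, 3))   # -1.1  (3 < distance <= 6)
    print(distance_loss(1.1, 8, 3))   # -2.2  (6 < distance <= 9)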

YWY model ․ STAMP
Maximum Information Gain (MIG)
Given: I(a2) = 1.1, I(a4) = 1.2, I(a6) = 1.3; period = 3

[Slide table: OIS values per event (a1, a2, a4, a6, a7, a9) and period position over the same event sequence as above]

MIG(a2) position #1 = 0
MIG(a2) position #2 = 1.1 × 3 = 3.3
MIG(a2) position #3 = 1.1 × 4 = 4.4
MIG(a4) position #1 = 1.2 × 5 = 6.0
MIG(a6) position #1 = 1.3 × 1 = 1.3
MIG(a6) position #2 = 0

YWY model ․ STAMP / InfoMiner
MIG counting
Given: minimum information gain = 3.5

At each position of the period, only events whose MIG reaches 3.5 remain as candidates: position #1 keeps a4 (MIG = 6.0), position #2 keeps nothing (a2's 3.3 falls short), and position #3 keeps a2 (MIG = 4.4). Combining the surviving candidates yields the complex pattern (a4, *, a2).
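The per-position filtering can be sketched directly. Assuming the MIG values above and threshold 3.5 (the dictionary layout is ours):

    mig = {
        1: {"a2": 0.0, "a4": 6.0, "a6": 1.3},
        2: {"a2": 3.3, "a6": 0.0},
        3: {"a2": 4.4},
    }
    min_gain = 3.5

    # Keep, per position, the best event clearing the threshold; '*' otherwise.
    pattern = tuple(
        max((e for e in mig[pos] if mig[pos][e] >= min_gain),
            key=lambda e: mig[pos][e], default="*")
        for pos in sorted(mig)
    )
    print(pattern)   # ('a4', '*', 'a2')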

CWC model
Each locomotive in a train has a number of wheels and a number of windows:
Wheels: … 4-wheel, 3-wheel, 4-wheel, 2-wheel, 2-wheel, …, 3-wheel, 4-wheel
Windows: … 3-window, 2-window, 3-window, 1-window, 3-window, …, 1-window, 1-window

Pairing each locomotive's wheel count with the next locomotive's window count:
4-wheel → 2-window (1st locomotive → 2nd locomotive)
3-wheel → 3-window (2nd locomotive → 3rd locomotive)
4-wheel → 1-window (3rd locomotive → 4th locomotive)
2-wheel → 3-window (4th locomotive → 5th locomotive)
…

CWC model
[Slide table: contingency counts of locomotives by number of wheels (2, 3, 4) and number of windows (1, 2, 3), with row and column totals]

Weight of evidence (e.g. 4-wheel → 2-window):
W = log [ Pr(wheels = Four | windows = Two) / Pr(wheels = Four | windows ≠ Two) ] = 0.94
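The weight of evidence is computable from any contingency table. Since the slide's cell counts did not survive, the counts below are hypothetical, chosen only to show the calculation; the function name is ours.

    import math

    def weight_of_evidence(n_ce, n_c, n_e, n):
        # W = log[ Pr(c | e) / Pr(c | not e) ] from contingency counts:
        # n_ce rows with both c and e, n_c rows with c, n_e rows with e, n total.
        p_given_e = n_ce / n_e
        p_given_not_e = (n_c - n_ce) / (n - n_e)
        return math.log(p_given_e / p_given_not_e)

    # Hypothetical counts for c = 4-wheel, e = 2-window:
    print(round(weight_of_evidence(n_ce=12, n_c=20, n_e=15, n=50), 2))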

CWC model
Construction of rules (e.g. 4-wheel → 2-window)
Rule format: If (condition) then (conclusion) with certainty (weight)
Rule#1: 4-wheel(p) → 2-window(p+1), certainty = 0.94
If a locomotive has four wheels, then with certainty 0.94 the locomotive one position later in the sequence has two windows.

Positions #58, #59, #60 — according to Rule#6, Rule#8, and Rule#10 in [CWC94]:
W(windows = One / windows ≠ One | medium-funnel, 2-strip, 1-window) = 2.28 − 1.38 = 0.90

Sequential Pattern
Rule#3: 2-wheels(p) → 3-windows(p+1), certainty = 1.94
Rule#5: 4-wheels(p) → 3-windows(p+2), certainty = 0.91
(the step 4-wheels → 2-wheels itself carries certainty 0)

Maximum of combined certainties = 1.85
Hence the sequential pattern is (4 wheels → 2 wheels → 3 windows): significant, although not frequent.

Optimal Algorithm
[Slide diagram: 1-itemset weights (W¹₁₁, W¹₁₂, W¹₃₁, …) and 2-itemset weights (W²₁₁, W²₁₃, W²₂₁, W²₃₂, …) held in stacks in descending order of weight; patterns are grown by pulling events from the stacks and matching them, with a weight comparison such as W¹₁₂ − W¹₁₁ − W¹₁₂ < W²₃₂ − … deciding which branch to pursue]

Comparative Performance — CWC model vs YWY model
․ Probabilistic sequence type: synchronous vs asynchronous
․ Sequence profile: multi-sequence vs single sequence
․ Calculation of information gain: weight of evidence vs distance counting
․ Mining target: significant patterns vs surprising patterns
․ Construction in the mining process: rule-based vs Apriori property

Comparative Performance — CWC model vs YWY model
․ Format of pattern: multi-profile vs fixed period
․ Threshold: bounded chi-square at 5% vs information gain
․ Mining limitation: singular profile vs predefined length
․ Constraint: short patterns vs small min_rep, max_dis
․ Applications: widespread vs specific areas (e.g. DNA)

Code Rearrange
CWC model itemsets:
{ (a2 b4 c1 d3), (a3 b3 c2 d3), (a4 b2 c2 d2), (a2 b3 c3 d4), (a4 b4 c2 d3) }
e.g. a2 = 2-wheel, b4 = 4-window, c1 = 1-funnel, d3 = 3-strip

YWY model sequence:
(C1, F3, S3, A2, C3, F3, S2, A1, C3, F2, S1, A2, C2, F2, S3, A1, …)
rearranged as
(F3, C1, S3, A2, F3, C3, S2, A1, F2, C3, S1, A2, F2, C2, S3, A1, …)
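The rearrangement itself is just a fixed attribute order applied per object. A minimal sketch using the slide's wheel/window/funnel/strip codes (the dictionaries are illustrative):

    trains = [
        {"wheel": 2, "window": 4, "funnel": 1, "strip": 3},
        {"wheel": 3, "window": 3, "funnel": 2, "strip": 3},
    ]
    order = ["wheel", "window", "funnel", "strip"]
    prefix = {"wheel": "a", "window": "b", "funnel": "c", "strip": "d"}

    # Lay each object's attribute codes out in a fixed order, turning a
    # multi-attribute stream into one event sequence.
    sequence = [f"{prefix[attr]}{t[attr]}" for t in trains for attr in order]
    print(sequence)   # ['a2', 'b4', 'c1', 'd3', 'a3', 'b3', 'c2', 'd3']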

Customer Behavior
Four parallel time series, coded at a common time granularity:
Stock index time series: S1 S2 S3 S1 S2 S1 S2 S3 S1 S2 …
Fund performances: F3 F3 F2 F2 F3 F1 F1 F3 F1 F2 …
Customer asset: C3 C1 C2 C1 C1 C1 C1 C3 C1 C2 …
Takes B/H/S actions: A2 A3 A2 A1 A3 A2 A3 A1 A3 A2 …

e.g. C1 = Saving $20000; F1 = Up > 5%/week, F2 = Down > 5%/week; S1 = Up > 1%/day, S2 = Down > 1%/day; A1 = Buy product, A2 = Sell product

Possible Patterns
Perfect pattern (the full grid above repeats exactly):
S: S1 S2 S3 S1 S2 S1 S2 S3 S1 S2
F: F3 F3 F2 F2 F3 F1 F1 F3 F1 F2
C: C3 C1 C2 C1 C1 C1 C1 C3 C1 C2
A: A2 A3 A2 A1 A3 A2 A3 A1 A3 A2

Approximate patterns (don't-care event = ×):
S: S1 S2 S3 S1 S2 S1 × S3 S1 S2
F: F3 F3 F2 × F3 F1 F1 F3 F1 F2
C: × C1 × C1 C1 C1 C1 C3 C1 C2
A: A2 A3 A2 A1 A3 A2 × × A3 A2

Multi-sequence
(daily codes as in the grid above)
While the stock index is down more than 1%/day and the fund's price stays between 5% down/week and 5% up/week, a customer holding $10000 ≤ Saving ≤ $20000 buys the product.
If the stock index fell by less than 1% yesterday and the customer holds less than $10000 today, he will sell the product tomorrow.

Multi-sequence Pattern
Synchronous sequential pattern (daily codes, don't-care = ×):
S: × × S3 S1 S2 S1 S2 S3 S1 S2
F: F3 F3 × × F3 F1 F1 F3 F1 F2
C: C3 C1 × C1 C1 C1 C1 C3 C1 C2
A: A2 A3 A2 A1 A3 A2 × A1 A3 A2

Today the stock index is down more than 1%/day and the fund's price is between 5% down/week and 5% up/week; the next day the customer spends less than $10000, and he sells the product two days later.

Non-pattern (contrasted with the synchronous pattern and sequential pattern above):
C: C3 C2 × C1 C1 × C1 C3 C1 C2
F: F3 F3 F2 × × × F1 F3 F1 F2
S: S1 S2 × × S2 S1 S2 S3 S1 S2
A: × × A2 A1 A3 A2 A1 A1 A3 A2
Timeline: t, t+1, …, t+5, …, t+9, …

Pattern Format
Valid patterns (the slide draws each as a small tree diagram):
․ Tree pattern: A1, D2, A2, C3, B2
․ Weak pattern: B3, D2
․ Standard pattern: B2, C1
․ Strong pattern: A1, D3, A3, C3, B2, A1, A2, C2, D2, B1, B1, C3, D2, B3

Further Problems
․ Multi-sequence
․ Edit distance
․ Speed-up algorithms
․ Idle time in time series

Conclusions
․ Time-series applications
․ Multi-sequence patterns
․ Approximate matching with noise (wild-cards)