Mining Probabilistic Sequential Patterns Name: Yip Chi Kin Date: 18-05-2006.


1 Mining Probabilistic Sequential Patterns Name: Yip Chi Kin Date: 18-05-2006

2 Studied Papers
․ [YWY03] Periodic Patterns
․ [CWC94] Pattern Prediction
․ [YWY01] Meta-Pattern Model
․ [YWY03a] Probabilistic Model: Surprising Pattern
․ [YWY03b] Probabilistic Model: Distance Penalty

3 Main Aspects
․ Time series Dataset
․ Noise and Repeats
․ Multi-sequence
․ Patterns Mining: Frequent (Non-position), Regular (Cyclic Position), Probabilistic, Significant, Approximate

4 Periodic Patterns
․ min_rep
․ max_dis
․ valid pattern
․ valid subsequence
Sequence: (a1, a2, a2, a1, a3, a2, a1, a4, a2, a1, a4, a4, a1, a1, a3, a2, a1, a4, a2, a1, a5, a2, a1, a2, a2)
max_dis = 4, min_rep = 3
Valid pattern: (a1, *, a2) is a 2-pattern of period 3
Valid subsequence with min_rep = 3 and max_dis = 4
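
A minimal Python sketch of the check on this slide, not the published algorithm. It assumes `*` (written `None` in the code) is a don't-care position, that a valid segment is a run of at least min_rep contiguous matches, and that the disturbance between two segments is the number of events between them. The function names are hypothetical.

```python
from typing import List, Optional, Tuple

Pattern = Tuple[Optional[str], ...]  # None marks a don't-care position


def matches_at(seq: List[str], start: int, pattern: Pattern) -> bool:
    """True if one full period of `pattern` matches `seq` beginning at `start`."""
    if start + len(pattern) > len(seq):
        return False
    return all(p is None or seq[start + i] == p for i, p in enumerate(pattern))


def valid_segments(seq: List[str], pattern: Pattern, min_rep: int) -> List[Tuple[int, int]]:
    """Greedy left-to-right scan for runs of >= min_rep contiguous matches.

    Returns (start_index, repetitions) pairs with 0-based start indices.
    """
    segments = []
    i = 0
    while i < len(seq):
        reps = 0
        while matches_at(seq, i + reps * len(pattern), pattern):
            reps += 1
        if reps >= min_rep:
            segments.append((i, reps))
            i += reps * len(pattern)  # continue after the segment
        else:
            i += 1
    return segments


def is_valid_subsequence(segments: List[Tuple[int, int]], period: int, max_dis: int) -> bool:
    """True if every gap between consecutive segments has at most max_dis events."""
    for (s1, r1), (s2, _) in zip(segments, segments[1:]):
        if s2 - (s1 + r1 * period) > max_dis:
            return False
    return bool(segments)


seq = ("a1 a2 a2 a1 a3 a2 a1 a4 a2 a1 a4 a4 a1 "
       "a1 a3 a2 a1 a4 a2 a1 a5 a2 a1 a2 a2").split()
pattern: Pattern = ("a1", None, "a2")
segments = valid_segments(seq, pattern, min_rep=3)
print(segments, is_valid_subsequence(segments, period=3, max_dis=4))
```

On the slide's sequence the scan finds one segment of 3 repetitions and one of 4, separated by 4 events, which is within max_dis = 4.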

5 Meta-pattern Model
․ Don't care positions
․ Predefined meta-patterns
[Figure: weekly refill sequences built from r and -, shown with noise and with eliminated positions]
r: refill order of flu medicine in the corresponding week
-: no flu medicine replenishment in that week
r (marked): noise/distortion of patterns
□: position eliminated in the sequence
Pattern about pattern: r - - r -

6 Probabilistic Model
․ CWC model
․ STAMP Algorithm
․ InfoMiner Algorithm
․ YWY model
Keywords: Exact Match, Information Gain, Surprising Pattern, Apriori Property, Prefix, Rules, Significant, Weight of evidence, Threshold = 1.96, Predict Pattern, Relevant Position Counting, Combine Certainties

7 Information Gain
Sequence: (a1, a1, a2, a2, a2, a3, a3, a3, a3, a3)
Event   Probability   Information gain
a1      0.2           I(a1) = -log_3(0.2) = 1.465
a2      0.3           I(a2) = -log_3(0.3) = 1.096
a3      0.5           I(a3) = -log_3(0.5) = 0.631
(log base 3 = number of distinct events)
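
The table can be reproduced with a few lines of Python. This is a sketch that assumes the probability of an event is its relative frequency in the sequence and that the log base is the number of distinct events; the function name `information` is hypothetical.

```python
import math
from collections import Counter


def information(seq):
    """Map each event to -log_base(p), with base = number of distinct events."""
    counts = Counter(seq)
    base = len(counts)
    n = len(seq)
    return {event: -math.log(c / n, base) for event, c in counts.items()}


seq = ["a1"] * 2 + ["a2"] * 3 + ["a3"] * 5
info = information(seq)
for event in sorted(info):
    print(event, round(info[event], 3))
```

Rare events carry more information than frequent ones, which is why a1 scores highest here.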

8 YWY model ․ InfoMiner
Sequence: (a1, a3, a4, a5, a1, a4, a3, a3, a2, a6, a3, a2, a1, a4, a3, a3, a1, a3, a3, a5, a1, a3, a4, a5, a2, a3, a3, a5, a1, a3, a4, a5, a2, a6, a3, a5, a2, a6, a2, a2)
Pattern template: (*, *, *, *)
Event / Repetition table
Events: a1, a2, a2, a3, a3, a3, a4, a4, a5, a6
Repetitions: 5, 3, 2, 1, 4, 4, 1, 2, 2, 4

9 YWY model ․ InfoMiner
MIG threshold: min_gain = 4.5
Pattern          Information Gain
(a1, *, *, *)    I(a1) × 5 = 5.30
max_info = I(a1) + I(a6) + I(a4) + I(a5) = 4.73
min_rep = ⌈min_gain / max_info⌉ = ⌈4.5 / 4.73⌉ = 1

10 YWY model ․ InfoMiner
Projected subsequence: (a1, a3, a4, a5, a1, a4, a3, a3, a1, a4, a3, a3, a1, a3, a3, a5, a1, a3, a4, a5, a1, a3, a4, a5)
max_info = I(a1) + I(a4) + I(a4) + I(a5) = 4.44
min_rep = ⌈min_gain / max_info⌉ = ⌈4.5 / 4.44⌉ = 2
Pattern          Information Gain
(a1, a3, *, *)   [I(a1) + I(a3)] × 3 = 5.19
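
The max_info and min_rep figures on the two InfoMiner slides can be checked with the sketch below. It rests on assumptions that happen to reproduce the slide values: information is -log base 6 (six distinct events) of the relative frequency in the full 40-event sequence, max_info sums the largest information seen at each of the 4 positions, and min_rep is the ceiling of min_gain / max_info. All function names are hypothetical.

```python
import math
from collections import Counter

SEQ = ("a1 a3 a4 a5 a1 a4 a3 a3 a2 a6 a3 a2 a1 a4 a3 a3 a1 a3 a3 a5 "
       "a1 a3 a4 a5 a2 a3 a3 a5 a1 a3 a4 a5 a2 a6 a3 a5 a2 a6 a2 a2").split()
PERIOD = 4
MIN_GAIN = 4.5


def information(seq):
    """-log_base(relative frequency), base = number of distinct events."""
    counts = Counter(seq)
    return {e: -math.log(c / len(seq), len(counts)) for e, c in counts.items()}


def split_periods(seq, period):
    """Cut the sequence into consecutive rows of `period` events."""
    return [seq[i:i + period] for i in range(0, len(seq) - period + 1, period)]


def max_info(rows, info):
    """Sum, over positions, of the largest information of any event seen there."""
    return sum(max(info[row[pos]] for row in rows) for pos in range(len(rows[0])))


def min_rep(min_gain, bound):
    """Smallest repetition count that can still reach min_gain under `bound`."""
    return math.ceil(min_gain / bound)


info = information(SEQ)
rows = split_periods(SEQ, PERIOD)
full_bound = max_info(rows, info)

# Projected subsequence: keep only the periods that start with a1.
projected = [row for row in rows if row[0] == "a1"]
proj_bound = max_info(projected, info)

print(round(full_bound, 2), min_rep(MIN_GAIN, full_bound))
print(round(proj_bound, 2), min_rep(MIN_GAIN, proj_bound))
```

The computed bounds differ from the slide in the last digit because the slide sums already-rounded information values.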

11 YWY model ․ STAMP
Generalized Information Gain (GIG)
Given: I(a1) = 1.1, I(a2) = 1.2, I(a3) = 1.3
Sequence (period 3): a2, a3, a1 | a4, a1, a1 | a2, a3, a4 | a2, a3, a1 | a6, a3, a1 | a2, a3, a1 | a2, a3, a7
Pattern        Period 1   2           3     4     5      6     7     GIG
(a2, *, *)     1.2        -1.2        1.2   1.2   -1.2   1.2   1.2   2.4
(*, a3, *)     1.3        -1.3        1.3   1.3   1.3    1.3   1.3   5.2
(a2, a3, *)    2.5        -1.2 -1.3   2.5   2.5   -1.2   2.5   2.5   6.3
GIG of (a2, *, *) = -1.2 + 1.2 + 1.2 - 1.2 + 1.2 + 1.2 = 2.4
GIG of (*, a3, *) = -1.3 + 1.3 + 1.3 + 1.3 + 1.3 + 1.3 = 5.2
GIG of (a2, a3, *) = (5 - 1) × 2.5 - 1.2 - 1.3 - 1.2 = 6.3
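
The three GIG sums can be reproduced with the sketch below. It follows the slide's arithmetic only: each matching period after the first gains I(pattern), and each non-matching period loses the information of the pattern events it misses. The fifth period is taken as (a6, a3, a1), the reading under which all three sums on the slide hold. The function name `gig` is hypothetical.

```python
INFO = {"a1": 1.1, "a2": 1.2, "a3": 1.3}

# Periods of length 3; the fifth period is assumed to be (a6, a3, a1).
PERIODS = [
    ("a2", "a3", "a1"),
    ("a4", "a1", "a1"),
    ("a2", "a3", "a4"),
    ("a2", "a3", "a1"),
    ("a6", "a3", "a1"),
    ("a2", "a3", "a1"),
    ("a2", "a3", "a7"),
]


def gig(pattern, periods, info):
    """(matches - 1) * I(pattern) minus the information of every missed event.

    `None` in the pattern is a don't-care position.
    """
    pattern_info = sum(info[e] for e in pattern if e is not None)
    matches = 0
    loss = 0.0
    for row in periods:
        missed = [e for e, seen in zip(pattern, row) if e is not None and e != seen]
        if missed:
            loss += sum(info[e] for e in missed)
        else:
            matches += 1
    return (matches - 1) * pattern_info - loss


gig_a2 = round(gig(("a2", None, None), PERIODS, INFO), 1)
gig_a3 = round(gig((None, "a3", None), PERIODS, INFO), 1)
gig_a2a3 = round(gig(("a2", "a3", None), PERIODS, INFO), 1)
print(gig_a2, gig_a3, gig_a2a3)
```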

12 YWY model ․ STAMP
Optimal Information Surplus (OIS)
Given: I(a2) = 1.1, Period = 3
Sequence (positions 1 to 27): a1 a2 a7 | a4 a9 a2 | a4 a2 a2 | a4 a9 a7 | a4 a2 a2 | a6 a9 a2 | a4 a2 a1 | a4 a9 a7 | a6 a6 a2
Position of a2   2    6      8     9     14     15    18    20    27
Distance         -    4      2     1     5      1     3     2     7
Loss             -    -1.1   0     0     -1.1   0     0     0     -2.2
Gain             -    1.1    1.1   1.1   1.1    1.1   1.1   1.1   1.1
OIS              0    0      1.1   2.2   2.2    3.3   4.4   5.5   4.4
If 3 < distance ≤ 6 then loss = -1.1. If 6 < distance ≤ 9 then loss = -1.1 × 2 = -2.2.
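
A sketch of the OIS bookkeeping for a2, following the loss rule stated on the slide. The generalisation of that rule to `ceil(distance / period) - 1` missed periods and the clamp at zero are assumptions; the clamp does not change any value for a2. The function name `ois` is hypothetical.

```python
import math

SEQ = ("a1 a2 a7 a4 a9 a2 a4 a2 a2 a4 a9 a7 a4 a2 a2 "
       "a6 a9 a2 a4 a2 a1 a4 a9 a7 a6 a6 a2").split()


def ois(seq, event, info, period):
    """Return (positions, surplus values) for every occurrence of `event`.

    Positions are 1-based. The first occurrence starts at 0; each later one
    gains `info` and loses `info` per period skipped since the previous one.
    """
    positions = [i + 1 for i, e in enumerate(seq) if e == event]
    values = []
    current = 0.0
    for k, pos in enumerate(positions):
        if k > 0:
            distance = pos - positions[k - 1]
            missed = math.ceil(distance / period) - 1
            current = max(0.0, current + info - missed * info)
        values.append(round(current, 1))
    return positions, values


positions, values = ois(SEQ, "a2", info=1.1, period=3)
print(positions)
print(values)
```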

13 YWY model ․ STAMP
Maximum Information Gain (MIG)
Given: I(a2) = 1.1, I(a4) = 1.2, I(a6) = 1.3, Period = 3
Sequence: a1 a2 a7 | a4 a9 a2 | a4 a2 a2 | a4 a9 a7 | a4 a2 a2 | a6 a9 a2 | a4 a2 a1 | a4 a9 a7 | a6 a6 a2
Event   OIS
a1      0, 0
a2      0, 0, 1.1, 2.2, 2.2, 3.3, 4.4, 5.5, 4.4
a4      0, 1.2, 2.4, 3.6, 4.8, 6.0
a6      0, 0, 1.3
a7      0, 0, 0
a9      0, 0, 0, 0
MIG(a2) position #2 = 1.1 × 3 = 3.3
MIG(a2) position #3 = 1.1 × 4 = 4.4
MIG(a4) position #1 = 1.2 × 5 = 6.0
MIG(a2) position #1 = 0
MIG(a6) position #1 = 1.3 × 1 = 1.3
MIG(a6) position #2 = 0

14 YWY model ․ STAMP
MIG counting
Given: Minimum Information Gain = 3.5
Complex pattern: (a4, *, a2)
[Table: MIG of a2, a4 and a6 at positions #1, #2 and #3]
․ STAMP
․ InfoMiner
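
The MIG values and the resulting complex pattern can be sketched as follows. The code assumes MIG(event, position) = I(event) × (occurrences at that position - 1), which matches the products listed on the MIG slide, and that a position keeps an event when its MIG reaches the threshold. Names such as `mig_table` are hypothetical.

```python
from collections import Counter

SEQ = ("a1 a2 a7 a4 a9 a2 a4 a2 a2 a4 a9 a7 a4 a2 a2 "
       "a6 a9 a2 a4 a2 a1 a4 a9 a7 a6 a6 a2").split()
INFO = {"a2": 1.1, "a4": 1.2, "a6": 1.3}
PERIOD = 3
MIN_GAIN = 3.5


def mig_table(seq, info, period):
    """Map (event, position in period) to I(event) * (occurrences - 1)."""
    counts = Counter((e, i % period + 1) for i, e in enumerate(seq) if e in info)
    return {key: round(info[key[0]] * (n - 1), 1) for key, n in counts.items()}


table = mig_table(SEQ, INFO, PERIOD)

# Keep, per position, the events whose MIG reaches the threshold.
survivors = {pos: [e for (e, p), g in table.items() if p == pos and g >= MIN_GAIN]
             for pos in range(1, PERIOD + 1)}

# In this example each position has at most one survivor.
pattern = tuple(survivors[pos][0] if survivors[pos] else "*"
                for pos in range(1, PERIOD + 1))
print(table)
print(pattern)
```

Pairs that never occur, such as a2 at position #1, are simply absent from the table, which corresponds to the MIG of 0 on the slide.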

15 CWC model
Number of wheels: … 4-wheel 3-wheel 4-wheel 2-wheel 2-wheel … 3-wheel 4-wheel
Number of windows: 3-window 2-window 3-window 1-window 3-window … 1-window 1-window
4-wheel → 2-window (1st locomotive → 2nd locomotive)
3-wheel → 3-window (2nd locomotive → 3rd locomotive)
4-wheel → 1-window (3rd locomotive → 4th locomotive)
2-wheel → 3-window (4th locomotive → 5th locomotive)
… …

16 CWC model
Contingency table for number of windows and number of wheels
          1-window   2-window   3-window   Total
2-wheel   3          4          7          14
3-wheel   7          1          3          11
4-wheel   12         19         2          33
Total     22         24         12         58
Weight of evidence (e.g. 4-wheel → 2-window):
W = log2 [ Pr(number of wheels = Four | number of windows = Two) / Pr(number of wheels = Four | number of windows ≠ Two) ] = log2 [ (19/24) / (14/34) ] = 0.94
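
A sketch that recomputes the weight of evidence from the contingency table. The base-2 logarithm is an assumption; it is the base under which the table yields the 0.94 quoted on the slide. The function name `weight_of_evidence` is hypothetical.

```python
import math

# Rows: number of wheels. Columns: number of windows of the next locomotive.
TABLE = {
    "2-wheel": {"1-window": 3, "2-window": 4, "3-window": 7},
    "3-wheel": {"1-window": 7, "2-window": 1, "3-window": 3},
    "4-wheel": {"1-window": 12, "2-window": 19, "3-window": 2},
}


def weight_of_evidence(table, wheels, windows):
    """log2 of Pr(wheels | windows) over Pr(wheels | not windows)."""
    hit = table[wheels][windows]
    row_total = sum(table[wheels].values())
    col_total = sum(row[windows] for row in table.values())
    grand_total = sum(sum(row.values()) for row in table.values())
    p_given = hit / col_total
    p_given_not = (row_total - hit) / (grand_total - col_total)
    return math.log2(p_given / p_given_not)


w_4wheel_2window = weight_of_evidence(TABLE, "4-wheel", "2-window")
w_2wheel_3window = weight_of_evidence(TABLE, "2-wheel", "3-window")
print(round(w_4wheel_2window, 2), round(w_2wheel_3window, 2))
```

A positive weight means the window count makes the wheel count more likely than it would otherwise be; the same table gives about 1.94 for 2-wheel → 3-window.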

17 CWC model
Construction of Rules (e.g. 4-wheel → 2-window)
Rule format: If (condition) then (conclusion) with certainty (weight)
Rule#1: 4-wheel(p) → 2-window(p+1), Certainty = 0.94
If a locomotive has four wheels, then it is with certainty 0.94 that the locomotive located one position later in the sequence has two windows.
Positions #58, #59, #60: according to Rule#6, Rule#8 and Rule#10 in [CWC94],
W(number of windows = One / number of windows ≠ One | medium-funnel, 2-strip, 1-window) = 1.25 + 1.03 - 1.38 = 0.90

18 Sequential Pattern
Sequence positions: p, p+1, p+2
4 wheels (p) → 2 wheels (p+1): Certainty = 0
Rule#3: 2 wheels(p) → 3 windows(p+1), certainty = 1.94
Rule#5: 4 wheels(p) → 3 windows(p+2), certainty = 0.91
Maximum of certainties = 0 + 0.91 + 1.94 = 2.85
Hence, the sequential pattern is (4 wheels → 2 wheels → 3 windows)
․ Significant
․ Not frequent

19 Optimal Algorithm
[Figure: stacks of weighted itemsets processed by Event Pull and Event Matching]
․ 1-itemset of weight in descending order
․ 2-itemset of weight in descending order
․ Descending weights of stack
․ 3-itemset Pattern
e.g. W 1 31 + W 1 12 - W 1 11 - W 1 12 < W 2 32 - W 2 11

20 Comparative Performance
Aspect                              CWC model             YWY model
Probabilistic Sequence Type         Synchronous           Asynchronous
Sequence Profile                    Multi-sequence        Single sequence
Calculation of Information Gain     Weight of evidence    Distance counting
Mining Targets                      Significant Pattern   Surprising Pattern
Construction in mining process      Rule-based            Apriori Property

21 Comparative Performance
Aspect               CWC model          YWY model
Format of Pattern    Multi-profile      Fixed Period
Threshold Bounded    Chi-square 5%      Information gain
Mining Limitation    Singular Profile   Predefined Length
Constraint           Short Pattern      Small min_rep, max_dis
Applications         Widespread         Specific Area (e.g. DNA)

22 Code Rearrange
CWC model / YWY model
(C1, F3, S3, A2, C3, F3, S2, A1, C3, F2, S1, A2, C2, F2, S3, A1, …)
{ (a2 b4 c1 d3), (a3 b3 c2 d3), (a4 b2 c2 d2), (a2 b3 c3 d4), (a4 b4 c2 d3) }
(F3, C1, S3, A2, F3, C3, S2, A1, F2, C3, S1, A2, F2, C2, S3, A1, …)
e.g. a2 = 2-wheel, b4 = 4-window, c1 = 1-funnel, d3 = 3-strip

23 Customer Behavior
Series: Stock index time series, Fund performances, Customer asset, Takes B/H/S Actions
Time Granularities
C (customer asset):    C3 C1 C2 C1 C1 C1 C1 C3 C1 C2 …
F (fund performance):  F3 F3 F2 F2 F3 F1 F1 F3 F1 F2 …
S (stock index):       S1 S2 S3 S1 S2 S1 S2 S3 S1 S2 …
A (B/H/S action):      A2 A3 A2 A1 A3 A2 A3 A1 A3 A2 …
e.g.
C1 = Saving $20000
F1 = Up > 5%/week, F2 = Down > 5%/week
S1 = Up > 1%/day, S2 = Down > 1%/day
A1 = Buy product, A2 = Sell product

24 Possible Patterns
Perfect Pattern
C:  C3 C1 C2 C1 C1 C1 C1 C3 C1 C2 …
F:  F3 F3 F2 F2 F3 F1 F1 F3 F1 F2 …
S:  S1 S2 S3 S1 S2 S1 S2 S3 S1 S2 …
A:  A2 A3 A2 A1 A3 A2 A3 A1 A3 A2 …
Approximate Patterns (don't care event = ×)
C:  ×  C1 ×  C1 C1 C1 C1 C3 C1 C2 …
F:  F3 F3 F2 ×  F3 F1 F1 F3 F1 F2 …
S:  S1 S2 S3 S1 S2 S1 ×  S3 S1 S2 …
A:  A2 A3 A2 A1 A3 A2 ×  ×  A3 A2 …

25 Multi-sequence
Series: Stock index time series, Fund performances, Customer asset, Takes B/H/S Actions
Codes for daily
C:  C3 C1 C2 C1 C1 C1 C1 C3 C1 C2 …
F:  F3 F3 F2 F2 F3 F1 F1 F3 F1 F2 …
S:  S1 S2 S3 S1 S2 S1 S2 S3 S1 S2 …
A:  A2 A3 A2 A1 A3 A2 A3 A1 A3 A2 …
While the stock index is down more than 1%/day and the fund performs between 5% Down/wk and 5% Up/wk, a customer with $10000 ≤ Saving ≤ $20000 buys the product.
If the stock index fell by less than 1% on the previous day and the customer has less than $10000 today, he will sell the product tomorrow.

26 Multi-sequence Pattern
Series: Stock index time series, Fund performances, Customer asset, Takes B/H/S Actions
Codes for daily
C:  C3 C1 ×  C1 C1 C1 C1 C3 C1 C2 …
F:  F3 F3 ×  ×  F3 F1 F1 F3 F1 F2 …
S:  ×  ×  S3 S1 S2 S1 S2 S3 S1 S2 …
A:  A2 A3 A2 A1 A3 A2 ×  A1 A3 A2 …
Synchronous Sequential Pattern
Today the stock index is down more than 1%/day and the fund performs between 5% Down/wk and 5% Up/wk; the customer spends less than $10000 the next day and sells the product two days later.

27 Non-pattern, Synchronous pattern, Sequential Pattern
C:  C3 C2 ×  C1 C1 ×  C1 C3 C1 C2 …
F:  F3 F3 F2 ×  ×  ×  F1 F3 F1 F2 …
S:  S1 S2 ×  ×  S2 S1 S2 S3 S1 S2 …
A:  ×  ×  A2 A1 A3 A2 A1 A1 A3 A2 …
Time axis: t, t+1, t+5, t+9, …

28 Pattern Format
Valid Patterns
[Figure: example patterns drawn over attribute codes A, B, C, D]
Tree pattern: A1 D2 A2 C3 B2
Weak pattern: B3 D2
Standard pattern: B2 C1
Strong pattern: A1 D3 A3 C3 B2 A1 A2 C2 D2 B1 B1 C3 D2 B3

29 Further Problems
․ Multi-sequence
․ Edit Distance
․ Speed-up Algorithms
․ Idle time in time series

30 Conclusions
․ Time-series Applications
․ Multi-sequence Pattern
․ Approximate Matching: Noise (Wild-card)

