Published by Aileen Newton. Modified over 9 years ago.
1
Mining Patterns in Long Sequential Data with Noise Wei Wang, Jiong Yang, Philip S. Yu ACM SIGKDD Explorations Newsletter Volume 2, Issue 2 (December 2000) Special issue on “Scalable data mining algorithms”
2
Outline
Introduction
Injection of noise
–Asynchronous Patterns
–Meta Patterns
Over-population of uninteresting patterns
Conclusions
3
Introduction
Pattern discovery in time series data, or in data with some inherent physical structure → mining patterns in long sequential data
Applications:
–Bio-medical study: chromosomes as sequences of amino acids
–Performance analysis: system-monitoring applications
–Client profile: user profiles can be built from the patterns discovered in trace logs
4
Introduction (cont.)
The noise to be tolerated can take different forms, depending on the type of application and the user’s interests:
–Injection of noise.
–Over-population of uninteresting patterns.
5
Injection of noise
Two models are proposed to accommodate the insertion of random noise and to characterize changes of behavior.
–Asynchronous Patterns
–Meta Patterns
6
Asynchronous Patterns
Earlier work on mining periodic patterns assumed that disturbance appears only as "missing occurrences", not as the more general "insertion of random noise events".
–"Smith reads the newspaper every morning" is a periodic pattern. A skipped morning is a missing occurrence, and the pattern stays synchronized.
–Inventory replenishment of cold medicine: the refill time shifts to the 3rd week of the month (no longer the beginning of the month). This is an insertion of random noise events, and the pattern becomes asynchronous.
7
Asynchronous Patterns (cont.)
Valid segment: must consist of at least min_rep contiguous repetitions of the pattern; each piece of disturbance may be at most max_dis long.
Valid subsequence: a set of non-overlapping valid segments.
Longest valid subsequence: the valid subsequence with the most overall repetitions.
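The definitions above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the algorithm from the paper: the function names (`matches_at`, `valid_segments`, `longest_valid_subsequence`) are hypothetical, `None` stands for "*", a segment is taken to be a maximal run of back-to-back matches, and the disturbance between two segments is taken to be the number of symbols between the end of one and the start of the next.

```python
from typing import List, Optional, Sequence, Tuple

Pattern = Sequence[Optional[str]]  # None plays the role of "*"
Segment = Tuple[int, int]          # (start position, number of repetitions)


def matches_at(seq: Sequence[str], pattern: Pattern, start: int) -> bool:
    """True if one full repetition of the pattern matches seq at `start`."""
    if start + len(pattern) > len(seq):
        return False
    return all(p is None or seq[start + k] == p for k, p in enumerate(pattern))


def valid_segments(seq: Sequence[str], pattern: Pattern, min_rep: int) -> List[Segment]:
    """Maximal runs of contiguous repetitions with at least min_rep repetitions."""
    period = len(pattern)
    covered = set()  # match positions already absorbed into an earlier run
    segments = []
    for start in range(len(seq)):
        if start in covered or not matches_at(seq, pattern, start):
            continue
        reps, pos = 0, start
        while matches_at(seq, pattern, pos):
            covered.add(pos)
            reps += 1
            pos += period
        if reps >= min_rep:
            segments.append((start, reps))
    return segments


def longest_valid_subsequence(segments: List[Segment], period: int,
                              max_dis: int) -> Tuple[int, List[Segment]]:
    """Chain non-overlapping segments whose gaps are <= max_dis,
    maximizing the overall number of repetitions (simple O(n^2) DP)."""
    segs = sorted(segments)
    best: List[Tuple[int, List[Segment]]] = []  # best chain ending at segs[i]
    for i, (start, reps) in enumerate(segs):
        total, chain = reps, [segs[i]]
        for j in range(i):
            prev_start, prev_reps = segs[j]
            gap = start - (prev_start + prev_reps * period)
            if 0 <= gap <= max_dis and best[j][0] + reps > total:
                total, chain = best[j][0] + reps, best[j][1] + [segs[i]]
        best.append((total, chain))
    return max(best, key=lambda t: t[0], default=(0, []))


# Toy sequence (made up for illustration): three repetitions, two noise symbols, two repetitions.
toy = list("abxabxabx" + "zz" + "abxabx")
toy_segments = valid_segments(toy, ("a", "b", None), min_rep=2)
```

With `max_dis=2` the two toy segments chain into one subsequence of 5 repetitions; with `max_dis=1` the noise gap is too long and only the first segment survives.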
8
D1, D2, ..., D19 are the 19 matches of (d1,*,*).
–If min_rep = 5, then S1, S2 and S4 are valid segments; S3 is not.
–S2 and S4 can form a valid subsequence whose overall number of repetitions is 10.
–S1 and S4 are both valid segments, but the disturbance between them (dis = 9) exceeds max_dis = 6.
–Both X and Y are extendible: given the current position i and an ending position j (j < i), j is extendible if j ≥ i - max_dis - 1.
–X dominates Y at position 20 (X dominates Y iff the number of repetitions of X ≥ that of Y).
9
Asynchronous Patterns (cont.)
Distance-based pruning of candidate patterns
–Given a symbol d and a period l, if C_{d,l} ≥ min_rep, then d may participate in some valid pattern of period l.
–For example, given {A,B,D,A,C,A,A,C,A,A,A,B,A,A,C} and min_rep = 3, A and C may belong to a valid pattern, but B and D cannot.
Apriori property of complex patterns
–A valid segment of a pattern is also a valid segment of any pattern with fewer symbols specified.
–For example, a valid segment for (d1,d2,*) is also one for (d1,*,*).
Extendibility and subsequence dominance
For example, if (d1,*,*,*) and (d2,*,*,*) are valid, then three candidate 2-patterns can be generated: (d1,d2,*,*), (d1,*,d2,*), (d1,*,*,d2).
Similarly, (d1,d2,d3,*) can become a candidate 3-pattern only if (d1,d2,*,*), (d1,*,d3,*) and (*,d2,d3,*) are all valid.
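The candidate-generation examples on this slide can be reproduced with a small sketch. The helper names (`two_pattern_candidates`, `sub_patterns`, `passes_apriori`) are hypothetical, `None` stands for "*", and the sketch assumes that valid patterns are stored with the same alignment as the candidates being checked.

```python
from typing import List, Optional, Set, Tuple

Pattern = Tuple[Optional[str], ...]  # None plays the role of "*"


def two_pattern_candidates(first: str, second: str, period: int) -> List[Pattern]:
    """Keep `first` in position 0 and place `second` in each remaining slot."""
    candidates = []
    for pos in range(1, period):
        slots: List[Optional[str]] = [None] * period
        slots[0] = first
        slots[pos] = second
        candidates.append(tuple(slots))
    return candidates


def sub_patterns(pattern: Pattern) -> List[Pattern]:
    """All patterns obtained by replacing exactly one symbol with '*'."""
    return [pattern[:i] + (None,) + pattern[i + 1:]
            for i, symbol in enumerate(pattern) if symbol is not None]


def passes_apriori(candidate: Pattern, valid: Set[Pattern]) -> bool:
    """A candidate survives only if every one-symbol-smaller sub-pattern is valid."""
    return all(sub in valid for sub in sub_patterns(candidate))


# The slide's example: valid 2-patterns that allow (d1, d2, d3, *) as a candidate.
valid_2 = {("d1", "d2", None, None), ("d1", None, "d3", None), (None, "d2", "d3", None)}
candidate_3 = ("d1", "d2", "d3", None)
```

Here `passes_apriori(candidate_3, valid_2)` holds; dropping any one of the three 2-patterns from `valid_2` prunes the candidate.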
10
Meta patterns
Let S = {a, b, c, …} be a set of literals.
–Basic pattern: each component of the pattern is either a literal or a “*”.
  Biweekly replenishment: P1 = (r:[1,1],*:[2,2])
  Triweekly replenishment: P2 = (r:[1,1],*:[2,3])
–Meta-pattern: may have patterns or meta-patterns as its components.
  Two-level periodicity: (P1:[1,24],*:[25,25],P2:[26,52])
  Three components: P1, *, P2
11
Meta patterns (cont.)
((r:[1,1],*:[2,2]):[1,24],*:[25,25],(r:[1,1],*:[2,3]):[26,52])
–Length of a component: the component (r:[1,1],*:[2,2]):[1,24] has length 24
–Span of the meta-pattern: 52
–Abbreviation: ((r,*):[1,24],*,(r,*:[2,3]):[26,52])
–Level of a meta-pattern: the maximum level of its components + 1
  The level of a basic pattern is 1. For instance, (r,*:[2,3]) is level 1.
  P1 = ((r,*):[1,24],*,(r,*:[2,3]):[26,52]) is level 2.
  The components of a meta-pattern do not have to be of the same level. For instance, (P1:[1,260],*:[261,300]) is level 3 (P1 is level 2).
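The level and span rules can be checked with a small recursive sketch. The `Component` and `MetaPattern` classes are a hypothetical representation chosen for illustration, not a data structure from the paper; a component holds either a literal (including "*") or a nested pattern, together with its position range.

```python
from dataclasses import dataclass
from typing import List, Union


@dataclass
class Component:
    item: Union[str, "MetaPattern"]  # a literal, "*", or a nested (meta-)pattern
    start: int
    end: int

    def length(self) -> int:
        """Number of positions the component occupies, e.g. [1,24] -> 24."""
        return self.end - self.start + 1


@dataclass
class MetaPattern:
    components: List[Component]


def level(pattern: MetaPattern) -> int:
    """Basic patterns are level 1; otherwise max level of the components + 1."""
    nested = [level(c.item) for c in pattern.components
              if isinstance(c.item, MetaPattern)]
    return 1 + max(nested, default=0)


def span(pattern: MetaPattern) -> int:
    """Last position covered by any component."""
    return max(c.end for c in pattern.components)


# The slide's examples.
biweekly = MetaPattern([Component("r", 1, 1), Component("*", 2, 2)])
triweekly = MetaPattern([Component("r", 1, 1), Component("*", 2, 3)])
p1 = MetaPattern([Component(biweekly, 1, 24),
                  Component("*", 25, 25),
                  Component(triweekly, 26, 52)])
top = MetaPattern([Component(p1, 1, 260), Component("*", 261, 300)])
```

Under this representation `level(triweekly)` is 1, `level(p1)` is 2, `level(top)` is 3, and `span(p1)` is 52, which agrees with the values on the slide.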
12
Meta patterns (cont.)
Figure 2(a): with min_rep = 3 and max_dis = 4, a meta-pattern ((a,b,*):[1,19],*:[20,21],(b,c):[22,27],*:[28,30],(a,b,*):[31,49],*:[50,51],(b,c):[52,57],*:[58,60])
Figure 2(b): many patterns/meta-patterns may collocate or overlap in any given portion of a sequence. For example, both (a,b,a,*) and (a,*) are valid within the subsequence.
13
Meta patterns (cont.)
How to identify the “proper” candidates?
–Component location property: provides substantial inter-level pruning when high-level candidates are generated from valid low-level meta-patterns.
–Apriori property: provides pruning power when mining meta-patterns of the same level.
A valid low-level meta-pattern may serve as a component of a higher-level meta-pattern only if its presence in the symbol sequence exhibits cyclic behavior, and that cyclic behavior follows the same periodicity as the higher-level meta-pattern a sufficient number of times (i.e., at least min_rep times).
For example, X1 can serve as a component of the higher-level meta-pattern X2:
X1 = ((a,b,*):[1,19],*:[20,21])
X2 = ((a,b,*):[1,19],*:[20,21],(b,c):[22,27],*:[28,30],X1:[31,150])
14
Meta patterns (cont.)
15
In Figure 3, the pruning effects provided by the component location property and the Apriori property are indicated by dashed arrows and solid arrows, respectively.
16
Over-population of uninteresting patterns
In some applications, the number of occurrences does not reflect the significance of a pattern.
–Computational biology: gene expressions
–Web server load: a high workload on all servers may occur much less frequently than other states.
–Earthquakes: a big earthquake is much more significant even though it occurs much less frequently than smaller ones.
17
Over-population of uninteresting patterns (cont.)
Information gain: a measure of how likely a pattern is to occur, or the amount of "surprise" when the pattern actually occurs.
Information model: for a given minimum information gain threshold, let Ψ be the set of patterns that satisfy this threshold.
Support model: to find all patterns in Ψ, the minimum support threshold has to be set very low, so too many patterns are discovered.
18
Information gain
Let E = {a_1, a_2, ..., a_n} be a set of distinct events. The event sequence is a sequence of events in E.
Information carried by an event a_i (a_i ∈ E): I(a_i) = -log_{|E|} Prob(a_i)
–|E|: the number of events in E
–Prob(a_i): the probability that a_i occurs = Num(a_i)/N
Information gain of a pattern P in an event sequence D: G(P) = I(P) × (Support(P) - 1)
19
Information gain (cont.)
I(a_i) = -log_{|E|} Prob(a_i)
G(P) = I(P) × (Support(P) - 1)
E = {a_1, a_2, ..., a_6}, so |E| = 6; N = 40
Support((a_2, a_6, *, *)) = Repetition((a_2, a_6, *, *)) = 3
G((a_2, a_6, *, *)) = (I(a_2) + I(a_6)) × (Support((a_2, a_6, *, *)) - 1) = (0.90 + 1.45) × (3 - 1) = 2.35 × 2 = 4.70
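The arithmetic of this example can be re-run with a short sketch. The function names are hypothetical, and the occurrence counts are assumptions: the slide does not give Num(a_2) or Num(a_6), so the values 8 and 3 below were picked only because, with six events and a sequence of 40 symbols, they reproduce the slide's 0.90 and 1.45. I(P) is taken as the sum of the information of the pattern's non-"*" events, as in the slide's calculation.

```python
import math
from typing import Optional, Sequence


def information(count: int, total: int, alphabet_size: int) -> float:
    """I(a) = -log_{|E|} Prob(a), with Prob(a) = Num(a) / N."""
    return -math.log(count / total, alphabet_size)


def information_gain(pattern: Sequence[Optional[str]], counts: dict,
                     total: int, alphabet_size: int, support: int) -> float:
    """G(P) = I(P) * (Support(P) - 1); None stands for '*' and carries no information."""
    info = sum(information(counts[event], total, alphabet_size)
               for event in pattern if event is not None)
    return info * (support - 1)


# Assumed counts (not given on the slide): chosen to match I(a2) = 0.90 and I(a6) = 1.45.
assumed_counts = {"a2": 8, "a6": 3}
i_a2 = information(assumed_counts["a2"], total=40, alphabet_size=6)
i_a6 = information(assumed_counts["a6"], total=40, alphabet_size=6)
gain = information_gain(("a2", "a6", None, None), assumed_counts,
                        total=40, alphabet_size=6, support=3)
```

Without rounding the two information values first, the gain comes out slightly below the slide's 4.70, because the slide rounds I(a_2) and I(a_6) to two decimals before multiplying.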
20
Over-population of uninteresting patterns (cont.)
Two traces:
–Scour is a web search engine specialized for multimedia content.
–The IBM Intranet traces cover 160 critical nodes (e.g., file servers, routers) in the IBM T. J. Watson Intranet.
For example, the pattern (node_a_fail, *, node_b_saturated, *) has the eighth highest information gain. This pattern means that a short time after a router (node_a) fails, the CPU on another node (node_b) is saturated. After a thorough investigation, we found that node_b is a file server: after node_a fails, all requests for some files are sent to node_b, which causes the bottleneck.
21
Conclusion
This paper discusses three recent research advances in mining patterns in time series data in the presence of noise.
–J. Yang, W. Wang, and P. Yu. Mining asynchronous periodic patterns in time series data. Proc. ACM SIGKDD Int. Conf. on Knowledge Discovery and Data Mining (SIGKDD), pp. 275-279, 2000.
–J. Yang, W. Wang, and P. Yu. Meta-patterns: revealing hierarchy of periodic patterns. IBM Research Report, 2001.
–J. Yang, W. Wang, and P. Yu. InfoMiner: mining significant periodic patterns with rare events in time series data. IBM Research Report, 2001.