1 Finding Periodic Partial Patterns in Time Series Database Huiping Cao Apr. 30, 2003
2 Outline Problem Definition Mining partial periodicity for some given period(s) – single period – multi periods Mining partial periodicity when no period length is given in advance Conclusion & Future work
3 Problem definition Time series S = D_1, D_2, ..., D_n, where D_i is the set of features for time instant i. Partial pattern s = s_1 ... s_p. Here, each s_i is defined over (2^L - {∅}) ∪ {*}, where L is the underlying set of features and * is the “don’t care” character. – |s|: pattern length – L-length of s: number of positions s_i that contain letters from L. – subpattern of a pattern s: a pattern s’ = s’_1 ... s’_p such that |s| = |s’| and s’_i ⊆ s_i for every position i where s’_i ≠ *. – E.g.: for s = a*{a,c}de, |s| = 5 and the L-length is 4 (s is also called a 4-pattern); a*{a,c}** and **cde are both subpatterns of s.
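The definitions above can be sketched in Python. This is only an illustration: the representation (a list whose entries are `None` for * or a set of letters) and the names `l_length` and `is_subpattern` are assumptions, not part of the original slides.

```python
# Hypothetical representation: a pattern is a list whose entries are
# either None (the "don't care" symbol *) or a non-empty set of letters.
STAR = None

def l_length(pattern):
    """Number of positions that hold letters rather than *."""
    return sum(1 for pos in pattern if pos is not STAR)

def is_subpattern(sub, pattern):
    """True if both patterns have the same length and every non-*
    position of `sub` is a subset of the same position of `pattern`."""
    if len(sub) != len(pattern):
        return False
    for s_sub, s_pat in zip(sub, pattern):
        if s_sub is STAR:
            continue
        if s_pat is STAR or not s_sub <= s_pat:
            return False
    return True

s = [{"a"}, STAR, {"a", "c"}, {"d"}, {"e"}]                      # a*{a,c}de
print(len(s), l_length(s))                                       # |s| and L-length
print(is_subpattern([{"a"}, STAR, {"a", "c"}, STAR, STAR], s))   # a*{a,c}**
print(is_subpattern([STAR, STAR, {"c"}, {"d"}, {"e"}], s))       # **cde
```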
4 Problem definition frequency_count(s) in sequence S = D_1, D_2, ..., D_n – frequency_count(s) = |{i | 0 ≤ i < m, and s is true in D_{i|s|+1}, D_{i|s|+2}, ..., D_{i|s|+|s|}}|. confidence(s) = frequency_count(s)/m – m: maximum number of periods of length |s| contained in the time series (m|s| ≤ n < (m+1)|s|). – E.g.: in a{b,c}baebaced, freq_count(a*b) = 2 and conf(a*b) = 2/3. period segment: a segment of the form D_{i|s|+1}, ..., D_{i|s|+|s|}, where 0 ≤ i < m. – A pattern s = s_1 ... s_p is true in a period segment if, for each position i, either s_i is * or all the letters in s_i occur in the i-th feature set of the segment. Pattern “a*b” is true in segment “acb”, but not in “bcb”. frequent partial periodic pattern s: – confidence(s) ≥ min_conf, where min_conf is a user-specified threshold.
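A minimal sketch of frequency_count and confidence, assuming the series is a list of feature sets and a pattern is a list of sets with `None` standing for *. The helper names `is_true_in` and `frequency_count` are illustrative only.

```python
def is_true_in(pattern, segment):
    """A pattern is true in a segment if every non-* position is
    contained in the feature set at the same offset."""
    return all(pos is None or pos <= feats
               for pos, feats in zip(pattern, segment))

def frequency_count(pattern, series):
    """Return (count, m): the number of whole period segments in which
    the pattern is true, and the number m of whole segments."""
    p = len(pattern)
    m = len(series) // p          # m|s| <= n < (m+1)|s|
    count = sum(is_true_in(pattern, series[i * p:(i + 1) * p])
                for i in range(m))
    return count, m

# The slide's example series a{b,c}baebaced, one feature set per instant.
series = [set(x) for x in ["a", "bc", "b", "a", "e", "b", "a", "c", "e", "d"]]
a_star_b = [{"a"}, None, {"b"}]
count, m = frequency_count(a_star_b, series)
print(count, m, count / m)
```

The trailing instant `d` does not fill a whole segment, so it is ignored, which is why m is 3 rather than 4.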
5 Problem definition Input: – A time series S – Specified period(s) – m: the ratio of the length of S to the length of the patterns must be at least m – min_conf Goal: – Discover all the frequent patterns for one period or for several periods
6 Mining partial periodicity for some given period(s) For single period For multi periods Derivation of all partial patterns – Max-subpattern tree – Derivation of frequent patterns from max-subpattern tree
7 Mining partial periodicity for single period Notation: – F1: the set of frequent 1-patterns of period p. For example, with p = 3, a**, *{b,c}*, **g are all in F1. Single-period Apriori – Step 1: Find F1. Accumulate the frequency count of each 1-pattern over every whole period segment; keep those whose frequency count ≥ min_conf*m. – Step 2: Find all frequent i-patterns of period p (2 ≤ i ≤ p) using the Apriori property. Terminate when the candidate i-pattern set is empty. Step 1 scans the source data once, and step 2 needs to scan the source data up to p-1 times in the worst case.
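Step 1 above (finding F1 in one scan) could look roughly like this. The sketch assumes that a 1-pattern is identified by an (offset, letter) pair and that the series is a list of feature sets; both are simplifications chosen for illustration, and `frequent_one_patterns` is a hypothetical name.

```python
from collections import Counter

def frequent_one_patterns(series, p, min_conf):
    """One scan over the whole period segments: count every
    (offset, letter) pair and keep those meeting min_conf * m."""
    m = len(series) // p
    counts = Counter()
    for i in range(m):
        for offset in range(p):
            for letter in series[i * p + offset]:
                counts[(offset, letter)] += 1
    f1 = {key: c for key, c in counts.items() if c >= min_conf * m}
    return f1, m

series = [set(x) for x in ["a", "bc", "b", "a", "e", "b", "a", "c", "e", "d"]]
f1, m = frequent_one_patterns(series, p=3, min_conf=0.6)
print(sorted(f1.items()))
```

Step 2 would then join frequent (i-1)-patterns into candidate i-patterns and rescan the data for each level, which is where the up to p-1 extra scans come from.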
8 Mining partial periodicity for single period The advantage of the Apriori property in mining partial patterns is not as obvious as in mining association rules. – In mining association rules, the number of frequent i-itemsets shrinks quickly as i increases because frequent i-itemsets are sparse. – In mining periodic patterns, the number of frequent i-patterns does not shrink quickly as i increases because of the strong correlation between the frequencies of patterns and those of their subpatterns.
9 Max-subpattern hit set method -single period The candidate max-pattern, C_max, is the maximal pattern generated from F1. – E.g., F1 = {a****, *b***, **c**}, C_max = abc** A subpattern of C_max is hit in a period segment S_i of S if it is the maximal subpattern of C_max that is true in S_i. – E.g., C_max = abc**, S_i = abdef, then ab*** is the hit subpattern. The complete set of partial periodic patterns can be derived from the frequency counts of all the hit maximal subpatterns of C_max.
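As an illustration of the two notions above, the following sketch builds C_max from F1 and computes the hit subpattern of one segment. It assumes F1 is given as (offset, letter) pairs and patterns are lists of sets with `None` for *; `build_cmax` and `hit_subpattern` are hypothetical names.

```python
def build_cmax(f1, p):
    """Merge all frequent 1-patterns into the candidate max-pattern."""
    cmax = [set() for _ in range(p)]
    for offset, letter in f1:
        cmax[offset].add(letter)
    return [pos if pos else None for pos in cmax]

def hit_subpattern(cmax, segment):
    """Maximal subpattern of C_max that is true in the segment:
    keep, per position, the letters of C_max present in the segment."""
    hit = []
    for pos, feats in zip(cmax, segment):
        common = pos & feats if pos is not None else set()
        hit.append(common or None)
    return hit

f1 = [(0, "a"), (1, "b"), (2, "c")]            # a****, *b***, **c**
cmax = build_cmax(f1, p=5)                     # abc**
segment = [set(ch) for ch in "abdef"]
print(hit_subpattern(cmax, segment))           # ab***
```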
10 Max-subpattern hit set method -single period (algorithm 2) Scan S once to find F1. Form the candidate C_max. Scan S once again. For each period segment, if its hit max-subpattern is nonempty, add it to the max-subpattern tree (introduced later). Derive frequent patterns from the max-subpattern tree (introduced later). Only two scans of the source data.
11 Max-subpattern tree The max-subpattern tree is used to facilitate deriving the set of frequent patterns in algorithm 2. It is rooted at C_max. Each subpattern of C_max with one non-* letter missing is a direct child node of the root. A node w with more than two non-* letters may have a set of children, each of which is a subpattern of w with one more non-* letter missing. Each node has a “count” field.
12 Max-subpattern tree [Figure: max-subpattern tree rooted at C_max = a{b1,b2}*d*. Children of the root: *{b1,b2}*d* (link ~a), ab2*d* (link ~b1), ab1*d* (link ~b2), a{b1,b2}*** (link ~d). Next level: *b2*d*, *b1*d*, *{b1,b2}***, a**d*, ab2***, ab1***.] Each node is linked to only one parent; e.g., a**d* is linked to ab2*d* but not to ab1*d* (the missing link is marked by a green dashed line).
13 Insertion in the max-subpattern tree Insert a max-subpattern w found during the scan of S into the max-subpattern tree T. Step 1: Starting from the root of the tree, find the corresponding node by checking the missing non-* letters in order. – E.g., the node *b1*d* has two letters, a and b2, missing from C_max = a{b1,b2}*d*. It can be found by following the ~a link to *{b1,b2}*d*, then the ~b2 link to *b1*d*.
14 Insertion in the max-subpattern tree Step 2: If the node w is found, increase its count by 1. Otherwise, create a new node w with count 1, together with any missing ancestor nodes (only those on the path to w, each with count 0), and insert them into the corresponding places in the tree. – E.g., if the first max-subpattern found is *b1*d*, we create the node *b1*d* with count 1 and two ancestor nodes with count 0: w1 = a{b1,b2}*d* (the root) and w2 = *{b1,b2}*d*, reached by the ~a link of w1. The node *b1*d* is the child of w2 reached by the ~b2 link.
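The two insertion steps can be sketched as follows. This is an assumed design, not the authors' implementation: each child link is keyed by the missing (position, letter) pair, and the missing letters are visited in a fixed order (by position, then alphabetically) so that every pattern has exactly one path from the root. `Node`, `missing_letters` and `insert` are hypothetical names.

```python
class Node:
    def __init__(self):
        self.count = 0
        self.children = {}   # missing (position, letter) -> child Node

def missing_letters(cmax, w):
    """Letters of C_max absent from w, in a fixed order."""
    missing = []
    for i, (c_pos, w_pos) in enumerate(zip(cmax, w)):
        if c_pos is None:
            continue
        for letter in sorted(c_pos - (w_pos or set())):
            missing.append((i, letter))
    return missing

def insert(root, cmax, w):
    """Follow (or create, with count 0) the links for each missing
    letter, then increase the count of the node reached."""
    node = root
    for key in missing_letters(cmax, w):
        if key not in node.children:
            node.children[key] = Node()
        node = node.children[key]
    node.count += 1
    return node

cmax = [{"a"}, {"b1", "b2"}, None, {"d"}, None]    # a{b1,b2}*d*
root = Node()
insert(root, cmax, [None, {"b1"}, None, {"d"}, None])   # *b1*d*
w2 = root.children[(0, "a")]                        # *{b1,b2}*d*, via ~a
print(root.count, w2.count, w2.children[(1, "b2")].count)
```

Inserting *b1*d* into an empty tree creates the root and *{b1,b2}*d* with count 0 and the target node with count 1, as in the slide's example.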
15 Derivation of frequent patterns from max-subpattern tree The set of frequent 1-patterns, F1, is derived in the first scan. The set of frequent k-patterns (k > 1) is derived from the max-subpattern tree as follows: – for i = 2 to |F1|: derive candidate patterns of L-length i from the frequent patterns of L-length i-1 by “(i+1)-way join”; scan tree T to find the frequency counts of these candidates and eliminate the non-frequent ones (note: the frequency count of a node is the sum of its own count and the counts of all its reachable ancestors). If the derived set of frequent i-patterns is empty, return.
16 Example: The set of reachable ancestors of a node w in a max-subpattern tree T is the set of all nodes in T that are proper super-patterns of w. Let min_conf*m = 45. We check the level-2 node w = *b2*d*.
17 Example: – reachable ancestor set(w) = {root, *{b1,b2}*d*, ab2*d*} – frequency_count(w) = count(w) + the counts of its reachable ancestors = 68 > 45 – so w is frequent. [Figure: the max-subpattern tree from slide 12, rooted at a{b1,b2}*d*.]
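The counting rule in this example (own count plus counts of all reachable ancestors) can be sketched without the tree, using a flat list of hit patterns and their counts. The counts below are made up for illustration and are not the values from the slide's figure; `frequency_count` and `is_subpattern` are hypothetical names.

```python
def is_subpattern(sub, pattern):
    """True if every non-* position of `sub` is contained in `pattern`."""
    return len(sub) == len(pattern) and all(
        s is None or (p is not None and s <= p)
        for s, p in zip(sub, pattern)
    )

def frequency_count(candidate, hit_counts):
    """Sum the counts of all stored hits that are super-patterns of
    `candidate` (the node itself plus its reachable ancestors)."""
    return sum(c for hit, c in hit_counts if is_subpattern(candidate, hit))

A, B1, B2, D = {"a"}, {"b1"}, {"b2"}, {"d"}
B12 = {"b1", "b2"}
# Made-up hit counts, NOT the values from the slide's figure.
hit_counts = [
    ([A, B12, None, D, None], 10),      # root a{b1,b2}*d*
    ([None, B12, None, D, None], 0),    # *{b1,b2}*d*
    ([A, B2, None, D, None], 32),       # ab2*d*
    ([A, B1, None, D, None], 40),       # ab1*d*
    ([None, B2, None, D, None], 8),     # *b2*d*
]
w = [None, B2, None, D, None]
print(frequency_count(w, hit_counts))   # compared against min_conf * m
```

With these invented counts, ab1*d* contributes nothing because it is not a super-pattern of *b2*d*.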
18 Max-subpattern hit set method -multi period Scan S once; for every period p_j, find F1(p_j) and form C_max(p_j). Scan S again; for every period p_j, do the same as step 2 of the single-period algorithm. Only two scans of the source data.
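A rough sketch of the first shared scan, assuming one counter per period and (offset, letter) keys for 1-patterns. The function name `frequent_one_patterns_multi` and the data layout are illustrative assumptions.

```python
from collections import Counter

def frequent_one_patterns_multi(series, periods, min_conf):
    """Single pass over the series that updates the 1-pattern counts
    of every period at once."""
    counts = {p: Counter() for p in periods}
    segments = {p: len(series) // p for p in periods}
    for t, feats in enumerate(series):
        for p in periods:
            if t < segments[p] * p:           # skip the incomplete tail
                for letter in feats:
                    counts[p][(t % p, letter)] += 1
    return {
        p: {key for key, c in counts[p].items() if c >= min_conf * segments[p]}
        for p in periods
    }

series = [set(ch) for ch in "abababab"]
print(frequent_one_patterns_multi(series, periods=[2, 3], min_conf=0.9))
```

The second scan would be shared in the same way, computing one hit subpattern per period for each completed segment.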
19 Experiment Synthetic data – 100k, 550k; |F1| = 12; p = 50. The HitSet method outperforms the Apriori method. The performance gain grows as the L-length increases.
20 Mining partial periodicity when no period is given in advance In the algorithms presented so far, the period length is given in advance. – How can frequent partial patterns be found when the periods are unknown? PPD (Partial Periodic Detection) algorithm – Filter step: scan the data once to find the candidate periods automatically (introduced later). – Mining step: apply Han’s algorithm to discover frequent partial patterns for the periods found in the filter step.
21 Filter step Assume the size of the time series is N. Step 1: Create a binary vector of size N for every letter in the alphabet. – 1 for every occurrence of the corresponding letter – 0 for every other letter. E.g.: for abcdabebadfcacdcfcaa, N = 20, binary vector(a) = 10001000100010000011 and binary vector(b) = 01000101000000000000.
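Step 1 can be illustrated with a few lines of Python, assuming the series is a plain string with one letter per time instant; `binary_vectors` is a hypothetical helper name.

```python
def binary_vectors(series):
    """One 0/1 vector per letter: 1 where the letter occurs, else 0."""
    return {
        letter: [1 if ch == letter else 0 for ch in series]
        for letter in sorted(set(series))
    }

vectors = binary_vectors("abcdabebadfcacdcfcaa")
for letter in "ab":
    print(letter, "".join(map(str, vectors[letter])))
```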
22 Filter step Step 2: Calculate the circular autocorrelation function for every binary vector. – Autocorrelation means self-correlation, i.e., discovering correlations among the elements of the same vector v for every possible period length (1, ..., N). – The function value is the dot product of v and v shifted circularly by a lag k. We denote by v(k) the vector v shifted by k. Circular means that an element pushed past position N by the shift wraps around to the beginning of the vector. E.g., if v = 1001, then v shifted by 2 is 0110. – For each k from 1 to N, compute the autocorrelation function r(k) = v · v(k). E.g., for v = 1001, r(2) = (1001) · (0110) = 0.
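A direct, unoptimized sketch of the circular shift and of r(k) as defined above. The function names are illustrative, and this version takes time proportional to N for each lag.

```python
def circular_shift(v, k):
    """Shift v to the right by k positions, wrapping around."""
    k %= len(v)
    return v[-k:] + v[:-k] if k else list(v)

def autocorrelation(v, k):
    """r(k): dot product of v with v shifted circularly by k."""
    return sum(a * b for a, b in zip(v, circular_shift(v, k)))

v = [1, 0, 0, 1]
print(circular_shift(v, 2))     # the slide's 0110
print(autocorrelation(v, 2))    # the slide's r(2)
```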
23 – Example: abcdabebadfcacdcfcaa, N = 20, binary vector(a) = 10001000100010000011. The first value of the autocorrelation vector is the dot product of the binary vector with itself, which is 6. The peak at position 5 implies that there is probably a period of length 4, and the value 3 at this position is an estimate of the frequency count of this period. Identifying all such peaks gives a set of candidate periods. Given min_conf c, extract the frequent periods: if the value for period p is greater than or equal to cN/p, then p is a frequent period.
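The threshold test r(p) ≥ cN/p can be sketched for the letter a of the example series. The value c = 0.5 and the cap of N/2 on the period length are assumptions chosen for illustration, and `frequent_periods` is a hypothetical name.

```python
def autocorrelation(v, k):
    """Circular autocorrelation of v at lag k."""
    n = len(v)
    return sum(v[i] * v[(i + k) % n] for i in range(n))

def frequent_periods(v, min_conf, max_period):
    """Keep every period p whose autocorrelation value reaches cN/p."""
    n = len(v)
    return [p for p in range(1, max_period + 1)
            if autocorrelation(v, p) >= min_conf * n / p]

series = "abcdabebadfcacdcfcaa"
vec_a = [1 if ch == "a" else 0 for ch in series]
print(autocorrelation(vec_a, 0), autocorrelation(vec_a, 4))
print(frequent_periods(vec_a, min_conf=0.5, max_period=len(vec_a) // 2))
```

Because the threshold cN/p falls as p grows, longer lags pass more easily; the periods kept here are only candidates for the mining step.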
24 Experiment: Test data: – Real data: Wal-Mart stores and power consumption data. – Synthetic data: Machine Learning Repository. Different runs over different portions of the data sets showed that the execution time is linearly proportional to the size of the time series as well as to the size of the alphabet.
25 Conclusion & future work Introduced algorithms for mining frequent partial patterns in time series databases – when the periods are given – when the periods are unknown. Future work: how to solve the same problem in a data stream setting under constraints such as: – one pass – limited memory
26 References: J. Han, G. Dong, Y. Yin. Efficient Mining of Partial Periodic Patterns in Time Series Database. In ICDE99. C. Berberidis, I. Vlahavas, W. G. Aref, et al. On the Discovery of Weak Periodicities in Large Time Series. In PKDD02.