Mining Time-Series Databases Mohamed G. Elfeky
Introduction
A time-series database is a database that contains data for each point in time.
Examples: weather data, stock prices.
What to Mine?
Full Periodic Patterns: every point in time contributes to the cyclic behavior of the time-series in each period, e.g., describing the weekly stock price pattern considering all the days of the week.
Partial Periodic Patterns: describe the behavior of the time-series at some but not all points in time, e.g., discovering that stock prices are high every Saturday and low every Tuesday.
Mining Partial Periodic Patterns
Problem Definition
Methods: Apriori, Max-Subpattern Hit Set
Jiawei Han, Guozhu Dong, and Yiwen Yin – ICDE98
Problem Definition
The time-series is S = D_1 D_2 … D_n.
A pattern is s = s_1 … s_p, where each s_i is either a subset of the feature set L or the letter *.
|s| = p is the period of the pattern s.
The L-length of s is the number of s_i that are not *. If s has L-length j, it is called a j-pattern.
A subpattern of s is a pattern s' = s'_1 … s'_p such that, for each position i, s'_i is either * or a subset of s_i.
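The definitions above can be made concrete with a small sketch. The representation is an assumption made only for illustration: a pattern is a tuple of frozensets, one per position, and the empty set stands for *. The helper names pat, l_length, and is_subpattern are hypothetical.

```python
def pat(*slots):
    """Build a pattern; each slot is a string of letters, '' stands for '*'."""
    return tuple(frozenset(s) for s in slots)

def l_length(s):
    """Number of positions of s that are not '*'."""
    return sum(1 for slot in s if slot)

def is_subpattern(sub, s):
    """True if sub has the same period as s and every position of sub
    is '*' (the empty set) or a subset of the same position of s."""
    return len(sub) == len(s) and all(a <= b for a, b in zip(sub, s))

s = pat("a", "", "ac", "d", "e")                       # a*{a,c}de
print(len(s), l_length(s))                             # 5 4
print(is_subpattern(pat("a", "", "ac", "", ""), s))    # True  (a*{a,c}**)
print(is_subpattern(pat("", "", "c", "d", "e"), s))    # True  (**cde)
print(is_subpattern(pat("b", "", "", "", ""), s))      # False (b****)
```

Because the empty set is a subset of every set, the * case needs no special handling in is_subpattern.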
Problem Definition (Cont.)
Each segment of the form D_{i|s|+1} … D_{i|s|+|s|} is called a period segment.
A period segment matches s if, for each position j, either s_j is * or s_j is a subset of D_{i|s|+j}.
The frequency count of s in a time-series S is the number of period segments of S that match s.
The confidence of s is its frequency count divided by the maximum number of periods of length |s| in S, that is, floor(n/|s|).
A pattern is called frequent if its confidence is not less than a minimum threshold.
Problem Definition (Example)
The pattern a*{a,c}de has length 5 and L-length 4, so it is a 4-pattern.
The patterns a*{a,c}** and **cde are subpatterns of the above pattern.
In the series a{b,c}baebaced, the pattern a*b, whose period is 3, has frequency count 2. Its confidence is 2/3, where 3 is the maximum number of periods of length 3.
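The frequency count and confidence in this example can be checked with a short sketch. It assumes, for illustration only, that the series is a list of frozensets of features and that a pattern is a tuple of frozensets with the empty set standing for *; the function names are hypothetical.

```python
from fractions import Fraction

def segments(series, p):
    """Split the series into its floor(n/p) complete period segments."""
    m = len(series) // p
    return [series[i * p:(i + 1) * p] for i in range(m)]

def matches(segment, s):
    """A segment matches s if every position of s is '*' or a subset of the data."""
    return all(slot <= d for slot, d in zip(s, segment))

def frequency_count(series, s):
    return sum(matches(seg, s) for seg in segments(series, len(s)))

def confidence(series, s):
    # Assumes the series holds at least one complete period.
    return Fraction(frequency_count(series, s), len(series) // len(s))

# The series a{b,c}baebaced and the pattern a*b from the example.
series = [frozenset(x) for x in ["a", "bc", "b", "a", "e", "b", "a", "c", "e", "d"]]
s = tuple(frozenset(x) for x in ["a", "", "b"])
print(frequency_count(series, s), confidence(series, s))  # 2 2/3
```

The trailing d of the series does not belong to any complete segment of length 3, so it is ignored.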
Apriori Method
Apriori Property: each subpattern of a frequent pattern of period p is itself a frequent pattern of period p.
Method:
1. Find F_1, the set of frequent 1-patterns of period p.
2. Find all frequent i-patterns of period p, for i from 2 to p, based on the idea of Apriori, and terminate when the candidate i-pattern set is empty.
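A minimal level-wise sketch of this method follows. It rests on assumptions that are not in the slides: a pattern is stored as a frozenset of (position, letter) items, so level k counts letters rather than positions, and the function name apriori_periodic is hypothetical.

```python
from itertools import combinations

def apriori_periodic(series, p, min_conf):
    """Level-wise search for frequent partial periodic patterns of period p."""
    m = len(series) // p
    # Each period segment becomes a set of (position, letter) items.
    segs = [frozenset((j, x) for j in range(p) for x in series[i * p + j])
            for i in range(m)]

    def frequent(cands):
        return {c for c in cands
                if sum(c <= seg for seg in segs) >= min_conf * m}

    # Step 1: frequent 1-patterns.
    level = frequent({frozenset([item]) for seg in segs for item in seg})
    result = set(level)
    k = 2
    while level:
        # Join step: combine frequent (k-1)-patterns into k-pattern candidates.
        cands = {a | b for a in level for b in level if len(a | b) == k}
        # Prune step (Apriori property): every (k-1)-subpattern must be frequent.
        cands = {c for c in cands
                 if all(frozenset(sub) in level for sub in combinations(c, k - 1))}
        level = frequent(cands)
        result |= level
        k += 1
    return result

series = [frozenset(x) for x in ["a", "bc", "b", "a", "e", "b", "a", "c", "e", "d"]]
patterns = apriori_periodic(series, 3, 0.6)
print(len(patterns))  # 5
```

On the example series with period 3 and threshold 0.6, the sketch finds a**, *c*, **b, ac*, and a*b; the candidate acb is pruned because *cb is not frequent.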
Max-Subpattern Hit Set Method
Definitions
Algorithm
Implementation Data Structure
Definitions
A candidate max-pattern C_max is the maximal pattern that can be generated from F_1 (the set of frequent 1-patterns).
Example: if F_1 = {a***, *b**, *c**, **d*}, then C_max = a{b,c}d*.
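Forming the candidate max-pattern amounts to merging the frequent 1-patterns position by position. The sketch below assumes each frequent 1-pattern is given as a (position, letter) pair with 0-based positions; the names candidate_max_pattern and show are hypothetical.

```python
def candidate_max_pattern(f1, p):
    """Union the letters of the frequent 1-patterns at each position."""
    slots = [set() for _ in range(p)]
    for pos, letter in f1:
        slots[pos].add(letter)
    return tuple(frozenset(s) for s in slots)

def show(pattern):
    """Render a pattern in the slide notation, e.g. a{b,c}d*."""
    out = []
    for slot in pattern:
        if not slot:
            out.append("*")
        elif len(slot) == 1:
            out.append(next(iter(slot)))
        else:
            out.append("{" + ",".join(sorted(slot)) + "}")
    return "".join(out)

f1 = [(0, "a"), (1, "b"), (1, "c"), (2, "d")]   # a***, *b**, *c**, **d*
cmax = candidate_max_pattern(f1, 4)
print(show(cmax))  # a{b,c}d*
```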
Definitions (Cont.)
A subpattern of C_max is hit in a period segment S_i if it is the maximal subpattern of C_max in S_i.
Example: for C_max = a{b,c}d* and S_i = a{b,c}ce, the hit subpattern is a{b,c}**.
The hit set H is the set of all hit subpatterns of C_max in S.
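Under the assumption that patterns and segments are both tuples of frozensets (the empty set standing for *), the hit subpattern is simply the position-wise intersection of C_max with the segment. The function name hit_subpattern is hypothetical.

```python
def hit_subpattern(cmax, segment):
    """Maximal subpattern of cmax that matches the segment."""
    return tuple(c & d for c, d in zip(cmax, segment))

cmax = tuple(frozenset(x) for x in ["a", "bc", "d", ""])      # a{b,c}d*
segment = tuple(frozenset(x) for x in ["a", "bc", "c", "e"])  # a{b,c}ce
expected = tuple(frozenset(x) for x in ["a", "bc", "", ""])   # a{b,c}**
print(hit_subpattern(cmax, segment) == expected)  # True
```

The third position becomes * because d does not occur there in the segment, and the fourth stays * because C_max has no letter at that position.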
Algorithm
1. Scan S once to find F_1 and form the candidate max-pattern C_max.
2. Scan S again and, for each period segment, add its max-subpattern to the hit set with count 1 if it does not exist yet, or increase its count by 1 otherwise.
3. Derive the frequent patterns from the hit set.
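The three steps can be sketched end to end as follows. This is an illustration under assumptions, not the authors' implementation: patterns are frozensets of (position, letter) items, the hit set is a plain Counter rather than a tree, and step 3 enumerates every subpattern of C_max, which is only practical when C_max is small. The name mine_hit_set is hypothetical.

```python
from collections import Counter
from itertools import combinations

def mine_hit_set(series, p, min_conf):
    m = len(series) // p
    segs = [series[i * p:(i + 1) * p] for i in range(m)]

    # Step 1: first scan, count every (position, letter) to get F1 and C_max.
    c1 = Counter((j, x) for seg in segs for j, d in enumerate(seg) for x in d)
    cmax = frozenset(item for item, n in c1.items() if n >= min_conf * m)

    # Step 2: second scan, record the max-subpattern hit by each segment.
    hits = Counter()
    for seg in segs:
        hit = frozenset((j, x) for (j, x) in cmax if x in seg[j])
        if hit:
            hits[hit] += 1   # Counter starts a missing key at 0

    # Step 3: a subpattern's frequency is the total count of hits containing it.
    frequent = {}
    letters = sorted(cmax)
    for k in range(1, len(letters) + 1):
        for combo in combinations(letters, k):
            c = frozenset(combo)
            n = sum(cnt for h, cnt in hits.items() if c <= h)
            if n >= min_conf * m:
                frequent[c] = n
    return cmax, hits, frequent

series = [frozenset(x) for x in ["a", "bc", "b", "a", "e", "b", "a", "c", "e", "d"]]
cmax, hits, frequent = mine_hit_set(series, 3, 0.6)
print(len(hits), len(frequent))  # 3 5
```

Only two passes over the series are needed; everything in step 3 works on the hit set alone.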
Implementation Data Structure
Max-Subpattern Tree
The root node is C_max.
A child node is a subpattern of its parent node with one non-* letter missing. The link is labeled by this letter.
A node with only 2 non-* letters has no children, since its children would be 1-patterns, which are already covered by F_1.
Each node has a count field that registers its number of hits.
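Max-Subpattern Tree (Example)
[Figure: max-subpattern tree with root a{b,c}d*. Its children are *{b,c}d* (link a), acd* (link b), abd* (link c), and a{b,c}** (link d). The third level contains *cd*, *bd*, a*d*, ab**, and ac**.]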
Max-Subpattern Tree (Construction)
1. Find w, the max-subpattern in the current segment.
2. Search for w in the tree, starting from the root and following the path that corresponds to the missing non-* letters in order.
3. If the node w is found, increase its count by 1. Otherwise, create a new node w (with count 1) and its missing ancestors on the followed path (with count 0).
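The insertion steps can be sketched as below. The Node class, the insert function, and the encoding of patterns as frozensets of (position, letter) items are assumptions made for illustration; "in order" is taken here to mean sorted order of the missing items.

```python
class Node:
    def __init__(self, pattern):
        self.pattern = pattern   # frozenset of (position, letter) items
        self.count = 0           # number of hits registered at this node
        self.children = {}       # missing item -> child Node

def insert(root, w):
    """Register one hit of max-subpattern w, a subset of root.pattern."""
    node = root
    # Follow the links labeled by the letters of C_max missing from w.
    for item in sorted(root.pattern - w):
        child = node.children.get(item)
        if child is None:
            # Missing nodes on the path are created with count 0.
            child = Node(node.pattern - {item})
            node.children[item] = child
        node = child
    node.count += 1
    return node

root = Node(frozenset({(0, "a"), (1, "b"), (1, "c"), (2, "d")}))  # a{b,c}d*
w = frozenset({(1, "c"), (2, "d")})                               # *cd*
leaf = insert(root, w)
mid = root.children[(0, "a")]                                     # *{b,c}d*
print(root.count, mid.count, leaf.count)  # 0 0 1
```

Inserting *cd* into a tree that holds only the root creates the path a{b,c}d*, then *{b,c}d* via link a, then *cd* via link b, with counts 0, 0, and 1.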
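Max-Subpattern Tree (Construction)
[Figure: inserting the max-subpattern *cd* creates the path a{b,c}d* (count 0), then *{b,c}d* via link a (count 0), then *cd* via link b (count 1).]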
Max-Subpattern Tree (Traversal)
After the second scan, the tree contains all the max-subpatterns of the time-series.
The tree must now be traversed to compute the confidence value of each subpattern.
Max-Subpattern Tree (Traversal)
The frequency count of each node is the sum of its count and those of all its reachable ancestors.
For example: the frequency count of *cd* is 78. The frequency count of a*d* is 105.
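One way to read "reachable ancestors" is: every node whose set of missing letters is a subset of the letters missing from the pattern in question. The sketch below follows that reading. The nested-dict tree, the function names, and the hit counts are all hypothetical, so the totals differ from the 78 and 105 quoted above, which come from the figure.

```python
from itertools import combinations

def insert(root, cmax, w, n=1):
    """Add n hits of max-subpattern w, creating missing nodes with count 0."""
    node = root
    for item in sorted(cmax - w):
        node = node["children"].setdefault(item, {"count": 0, "children": {}})
    node["count"] += n

def lookup(root, missing):
    """Count stored at the node reached by the given missing items, else 0."""
    node = root
    for item in sorted(missing):
        node = node["children"].get(item)
        if node is None:
            return 0
    return node["count"]

def frequency_count(root, cmax, pattern):
    """Own count plus the counts of all reachable ancestors (superpatterns)."""
    missing = sorted(cmax - pattern)
    return sum(lookup(root, subset)
               for k in range(len(missing) + 1)
               for subset in combinations(missing, k))

a, b, c, d = (0, "a"), (1, "b"), (1, "c"), (2, "d")
cmax = frozenset({a, b, c, d})                 # a{b,c}d*
root = {"count": 0, "children": {}}
# Hypothetical hit counts, for illustration only.
insert(root, cmax, cmax, 2)                    # a{b,c}d*
insert(root, cmax, frozenset({b, c, d}), 3)    # *{b,c}d*
insert(root, cmax, frozenset({a, c, d}), 1)    # acd*
insert(root, cmax, frozenset({c, d}), 4)       # *cd*
insert(root, cmax, frozenset({a, b}), 5)       # ab**
print(frequency_count(root, cmax, frozenset({c, d})))  # 10
print(frequency_count(root, cmax, frozenset({a, b})))  # 7
```

With these counts, *cd* collects 2 + 3 + 1 + 4 = 10 from a{b,c}d*, *{b,c}d*, acd*, and itself, while ab** collects 2 + 5 = 7 because abd* and a{b,c}** registered no hits.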