1 Finding Periodic Partial Patterns in Time Series Database Huiping Cao Apr. 30, 2003

2 Outline
Problem definition
Mining partial periodicity for some given period(s)
– single period
– multiple periods
Mining partial periodicity when no period length is given in advance
Conclusion & future work

3 Problem definition
Time series: S = D_1, D_2, ..., D_n, where D_i is the set of features at time instant i.
Partial pattern: s = s_1 ... s_p, where each s_i is drawn from (2^L − {∅}) ∪ {*}; L is the underlying set of features and * is the "don't care" character.
– |s|: pattern length.
– L-length of s: the number of positions s_i that contain letters from L.
– Subpattern of a pattern s: a pattern s' = s'_1 ... s'_p such that |s'| = |s| and s'_i ⊆ s_i for every position i where s'_i ≠ *.
– E.g.: for s = a*{a,c}de, |s| = 5 and the L-length is 4 (s is also called a 4-pattern); a*{a,c}** and **cde are both subpatterns of s.

4 Problem definition
frequency_count(s) in a sequence S = D_1, D_2, ..., D_n:
– frequency_count(s) = |{i | 0 ≤ i < m, and s is true in D_{i·|s|+1}, D_{i·|s|+2}, ..., D_{i·|s|+|s|}}|.
confidence(s) = frequency_count(s) / m
– m: the maximum number of periods of length |s| contained in the time series (m·|s| ≤ n < (m+1)·|s|).
– E.g.: in a{b,c}baebaced, frequency_count(a*b) = 2 and confidence(a*b) = 2/3.
Period segment: a segment of the form D_{i·|s|+1}, D_{i·|s|+2}, ..., D_{i·|s|+|s|}, where 0 ≤ i < m.
– A pattern s = s_1 ... s_p is true in a period segment if, for each position i, either s_i is * or all the letters in s_i occur in the i-th feature set of the segment. Pattern "a*b" is true in segment "acb" but not in "bcb".
Frequent partial periodic pattern s:
– confidence(s) ≥ min_conf, a user-specified threshold.
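These definitions map directly onto code. Below is a minimal Python sketch (the representation and helper names are mine, not from the paper): a time series is a list of feature sets, and a pattern is a list in which None stands for * and any other entry is the set of required letters.

```python
# Minimal sketch of the definitions above; representation and names are
# illustrative, not from the paper.

def pattern_true(pattern, segment):
    """s is true in a period segment if at each position s_i is '*' or all
    letters of s_i occur in the segment's feature set at that position."""
    return all(p is None or p <= d for p, d in zip(pattern, segment))

def frequency_count(pattern, series):
    p = len(pattern)
    m = len(series) // p                     # max number of whole periods in S
    return sum(pattern_true(pattern, series[i * p:(i + 1) * p])
               for i in range(m))

def confidence(pattern, series):
    return frequency_count(pattern, series) / (len(series) // len(pattern))

# The slide's example: in a{b,c}baebaced, conf(a*b) = 2/3.
S = [{'a'}, {'b', 'c'}, {'b'}, {'a'}, {'e'}, {'b'}, {'a'}, {'c'}, {'e'}, {'d'}]
assert frequency_count([{'a'}, None, {'b'}], S) == 2
```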

5 Problem definition
Input:
– a time series S
– the specified period(s)
– m: the length of S must be at least m times the pattern length
– min_conf
Goal:
– discover all the frequent partial periodic patterns for the given period(s).

6 Mining partial periodicity for some given period(s)
For a single period
For multiple periods
Derivation of all partial patterns
– Max-subpattern tree
– Derivation of frequent patterns from the max-subpattern tree

7 Mining partial periodicity for a single period
Notation:
– F1: the set of frequent 1-patterns of period p. For example, with p = 3, patterns such as a**, *{b,c}*, and **g may be in F1.
Single-period Apriori:
– Find the frequent 1-patterns F1: accumulate the frequency count for each 1-pattern over the whole period segments, then select the 1-patterns whose frequency count ≥ min_conf × m.
– Find all frequent i-patterns of period p (2 ≤ i ≤ p) using the Apriori property; terminate when the candidate i-pattern set is empty.
Step 1 scans the source data once; step 2 may scan the source data up to p − 1 times in the worst case. A sketch of the procedure follows below.
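A compact sketch of this single-period Apriori, reusing frequency_count from the sketch above. Two simplifications are mine, not the paper's: a pattern is encoded as a frozenset of (position, letter) pairs, so levels count letters rather than non-* positions, and counting rescans the series per candidate instead of accumulating all counts in a single pass.

```python
from itertools import combinations

def single_period_apriori(series, p, min_conf):
    """Hedged sketch: a pattern is a frozenset of (position, letter) pairs."""
    m = len(series) // p
    threshold = min_conf * m

    def count(pat):                          # adapt to the list representation
        as_list = [None] * p
        for i, letter in pat:
            as_list[i] = (as_list[i] or set()) | {letter}
        return frequency_count(as_list, series)

    # Step 1: frequent 1-patterns F1 (a single scan in the real algorithm).
    ones = {frozenset([(i, letter)])
            for seg in range(m) for i in range(p)
            for letter in series[seg * p + i]}
    levels = [{f1 for f1 in ones if count(f1) >= threshold}]

    # Step 2: level-wise growth; the Apriori property prunes candidates
    # containing an infrequent subpattern. Stop on an empty candidate set.
    while levels[-1]:
        cands = {prev | one for prev in levels[-1] for one in levels[0]
                 if len(prev | one) == len(prev) + 1}
        cands = {c for c in cands
                 if all(frozenset(sub) in levels[-1]
                        for sub in combinations(c, len(c) - 1))}
        levels.append({c for c in cands if count(c) >= threshold})
    return [pat for level in levels for pat in level]
```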

8 Mining partial periodicity for a single period
The benefit of the Apriori property in mining partial patterns is not as pronounced as in mining association rules.
– In association rule mining, the number of frequent i-itemsets shrinks quickly as i increases, because frequent itemsets are sparse.
– In periodic pattern mining, the number of frequent i-patterns does not shrink quickly as i increases, because of the strong correlation between the frequencies of patterns and their subpatterns.

9 Max-subpattern hit set method (single period)
The candidate max-pattern C_max is the maximal pattern generated from F1.
– E.g., if F1 = {a****, *b***, **c**}, then C_max = abc**.
A subpattern of C_max is hit in a period segment S_i of S if it is the maximal subpattern of C_max that is true in S_i.
– E.g., for C_max = abc** and S_i = abdef, ab*** is the hit subpattern.
The complete set of partial periodic patterns can be derived from the frequency counts of all the hit maximal subpatterns of C_max.
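Extracting the hit subpattern of a segment is a position-wise intersection with C_max; a small sketch in the same representation as before (the helper name is mine):

```python
def hit_subpattern(cmax, segment):
    """Maximal subpattern of C_max true in the segment: at each position keep
    the C_max letters that occur there; an empty intersection becomes '*'."""
    return [None if p is None else (p & d or None)
            for p, d in zip(cmax, segment)]

# The slide's example: C_max = abc**, segment = abdef -> hit subpattern ab***.
cmax = [{'a'}, {'b'}, {'c'}, None, None]
seg = [{'a'}, {'b'}, {'d'}, {'e'}, {'f'}]
assert hit_subpattern(cmax, seg) == [{'a'}, {'b'}, None, None, None]
```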

10 Max-subpattern hit set method (single period): Algorithm 2
Scan S once to find the frequent 1-patterns F1.
Form the candidate pattern C_max.
Scan S once again. For each period segment, if its hit subpattern is nonempty, add it to the max-subpattern tree (introduced below).
Derive the frequent patterns from the max-subpattern tree (introduced below).
Only two scans over the source data are needed.

11 Max-subpattern tree
The max-subpattern tree is used in Algorithm 2 to facilitate the derivation of the set of frequent patterns.
Rooted at C_max.
Each subpattern of C_max with one non-* letter missing is a direct child of the root.
A node w may have a set of children, each of which is a subpattern of w with one more non-* letter missing.
Each node has a "count" field.

12 Max-subpattern tree
[Figure: an example max-subpattern tree rooted at C_max = a{b1,b2}*d*. Following the ~a, ~b1, ~b2, and ~d links, the root's children are *{b1,b2}*d*, ab2*d*, ab1*d*, and a{b1,b2}***; the next level contains nodes such as *b2*d*, *b1*d*, *{b1,b2}**, a**d*, ab2***, and ab1***. Each node carries a count.]
A node is linked to only one parent; e.g., a**d* is linked to ab2*d* but not to ab1*d* (the missing link is marked by a dashed line in the figure).

13 Insertion into the max-subpattern tree
Insert a max-subpattern w found during the scan of S into the max-subpattern tree T.
Step 1: Starting from the root of the tree, find the corresponding node by checking the missing non-* letters in order.
– E.g., the max-subpattern *b1*d* has two letters, a and b2, missing from C_max = a{b1,b2}*d*. Its node is found by following the ~a link to *{b1,b2}*d* and then the ~b2 link to *b1*d*.

14 Insertion into the max-subpattern tree
Step 2: If the node w is found, increase its count by 1. Otherwise, create a new node w with count 1, create any missing ancestor nodes on the path to w with count 0, and insert them into the corresponding places in the tree.
– E.g., if the first max-subpattern found is *b1*d*, we create the node *b1*d* with count 1 and two ancestor nodes with count 0: w1 = a{b1,b2}*d* (the root) and w2 = *{b1,b2}*d*, following the ~a link of w1. The node *b1*d* is w2's child, following the ~b2 link.
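A sketch of the tree and this insertion logic, in the same representation as before (all names are mine). A node is reached by removing its missing non-* letters in a fixed canonical order, matching the ~a-then-~b2 walk in the example, and absent ancestors are created with count 0 along the way:

```python
class Node:
    def __init__(self):
        self.count = 0
        self.children = {}        # removed (position, letter) -> child Node

class MaxSubpatternTree:
    def __init__(self, cmax):
        # canonical order of C_max's non-* letters; the ~links follow it
        self.letters = [(i, letter) for i, s in enumerate(cmax)
                        if s is not None for letter in sorted(s)]
        self.root = Node()        # the root represents C_max itself

    def insert(self, hit):
        """Insert one hit max-subpattern; walking the missing letters in
        canonical order creates absent ancestor nodes with count 0."""
        missing = [(i, letter) for (i, letter) in self.letters
                   if hit[i] is None or letter not in hit[i]]
        node = self.root
        for key in missing:       # e.g. ~a, then ~b2, for *b1*d*
            node = node.children.setdefault(key, Node())
        node.count += 1
```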

15 Derivation of frequent patterns from the max-subpattern tree
The set of frequent 1-patterns F1 is derived in the first scan.
The sets of frequent k-patterns (k > 1) are derived from the max-subpattern tree as follows:
– for i = 2 to |F1|:
  derive the candidate patterns with L-length i from the frequent patterns with L-length i − 1 by an "(i+1)-way join";
  scan the tree T to find the frequency counts of these candidate patterns and eliminate the non-frequent ones (note: the frequency count of a pattern is the sum of the counts of its own node and of all its reachable ancestors);
  if the derived frequent i-pattern set is empty, return.

16 Example
The set of reachable ancestors of a node w in a max-subpattern tree T is the set of all nodes in T that are proper super-patterns of w.
Let min_conf × m = 45. We check the level-2 node w = *b2*d*.

17 Example (cont.)
– reachable ancestor set(w) = {root a{b1,b2}*d*, *{b1,b2}*d*, ab2*d*}
– frequency_count(w) = count(*b2*d*) + count(a{b1,b2}*d*) + count(*{b1,b2}*d*) + count(ab2*d*) = 68 > 45
– so w is frequent.
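The count derivation in this example can be a traversal from the root that sums the counts of w's node and of every node whose missing-letter set is contained in w's, i.e. its reachable ancestors (super-patterns); a sketch against the tree class above:

```python
def tree_frequency_count(tree, pattern):
    """Frequency count of a candidate pattern: the count of its own node plus
    the counts of all reachable ancestors (nodes that are super-patterns,
    i.e. whose set of missing letters is a subset of the pattern's)."""
    missing = {(i, letter) for (i, letter) in tree.letters
               if pattern[i] is None or letter not in pattern[i]}
    total, stack = 0, [(frozenset(), tree.root)]
    while stack:
        node_missing, node = stack.pop()
        if node_missing <= missing:          # node is a super-pattern of w
            total += node.count
            stack.extend((node_missing | {key}, child)
                         for key, child in node.children.items())
    return total
```

For w = *b2*d* this sums the counts of *b2*d*, the root a{b1,b2}*d*, *{b1,b2}*d*, and ab2*d*, which the slide's example gives as 68.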

18 Max-subpattern hit set method (multiple periods)
Scan S once; for each period p_j, find F1(p_j) and form C_max(p_j).
Scan S again; for each period p_j, build the max-subpattern tree as in the single-period algorithm.
Only two scans over the source data are needed.

19 Experiments
Synthetic data: 100K and 550K time instants, |F1| = 12, p = 50.
The hit-set method outperforms the Apriori method.
The performance gain grows as the L-length increases.

20 Mining partial periodicity when no period is given in advance
The previous algorithms assume the period length is given in advance.
– How can frequent partial patterns be found when the periods are unknown?
PPD (Partial Periodic Detection) algorithm:
– Filter step: scan the data once to find the possible periods automatically (introduced below).
– Mining step: apply Han's algorithm to discover the frequent partial patterns for the periods obtained in the first step.

21 Filter step
Assume the size of the time series is N.
Step 1: Create a binary vector of size N for every letter in the alphabet:
– 1 at every occurrence of the corresponding letter,
– 0 everywhere else.
E.g., for abcdabebadfcacdcfcaa, N = 20:
binary_vector(a) = 10001000100010000011
binary_vector(b) = 01000101000000000000
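A one-line sketch of step 1, assuming (as in the example) one letter per time instant; a set-valued series would test membership instead. The function name is mine:

```python
def binary_vector(series, letter):
    """1 at every occurrence of the letter, 0 everywhere else."""
    return [1 if ch == letter else 0 for ch in series]

S = "abcdabebadfcacdcfcaa"
assert binary_vector(S, 'a') == [1,0,0,0, 1,0,0,0, 1,0,0,0, 1,0,0,0, 0,0,1,1]
```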

22 Filter step
Step 2: Calculate the circular autocorrelation function for every binary vector.
– Autocorrelation means self-correlation, i.e., discovering correlations among the elements of the same vector v for every possible period length (1, ..., N).
– The function value is the dot product between v and v shifted circularly by a lag k. We denote by v(k) the vector shifted by k. Circular means that a point shifted past position N wraps around to the beginning of the vector. E.g., if v = 1001, then v shifted by 2 is 0110.
– For each k from 1 to N, compute the autocorrelation function r(k) = v · v(k). E.g., for v = 1001, r(2) = (1001) · (0110) = 0.

23 Filter step (cont.)
– Example: for abcdabebadfcacdcfcaa, N = 20 and binary_vector(a) = 10001000100010000011. The first value of the autocorrelation vector is the dot product of the binary vector with itself, which is 6. The peak identified at position 5 (lag 4) implies that there is probably a period of length 4, and the value of 3 at this position is an estimate of the frequency count of this period.
Identifying all such peaks yields a set of candidate periods.
Given min_conf c, extract the frequent periods: if the autocorrelation value for period p is greater than or equal to cN/p, then p is a frequent period.
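A sketch of the circular autocorrelation and the period filter, reusing binary_vector from the sketch above (function names are mine; r is indexed from lag 0, so the slide's "position 5" is r[4]):

```python
def circular_autocorrelation(v):
    """r[k] = dot product of v with v circularly shifted by lag k."""
    n = len(v)
    return [sum(v[i] * v[(i + k) % n] for i in range(n)) for k in range(n)]

def candidate_periods(v, min_conf):
    """Keep lag p as a frequent period if r[p] >= min_conf * N / p."""
    n, r = len(v), circular_autocorrelation(v)
    return [p for p in range(1, n) if r[p] >= min_conf * n / p]

r = circular_autocorrelation(binary_vector(S, 'a'))
assert r[0] == 6   # dot product with itself: 6 occurrences of 'a'
assert r[4] == 3   # the peak at lag 4: candidate period 4, estimated count 3
```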

24 Experiments
Test data:
– Real data: Wal-Mart store data and power consumption data.
– Synthetic data: Machine Learning Repository.
Different runs over different portions of the data sets showed that the execution time is linearly proportional to the size of the time series as well as to the size of the alphabet.

25 Conclusion & future work
Introduced algorithms for mining frequent partial periodic patterns in time series databases:
– with given period(s),
– with unknown periods.
Future work: solving the same problem in a data stream setting under constraints such as:
– one-pass processing,
– limited memory.

26 References
J. Han, G. Dong, and Y. Yin. Efficient Mining of Partial Periodic Patterns in Time Series Database. In ICDE 1999.
C. Berberidis, I. Vlahavas, W. G. Aref, et al. On the Discovery of Weak Periodicities in Large Time Series. In PKDD 2002.