Supervised Time Series Pattern Discovery through Local Importance Mustafa Gokce Baydogan* George Runger* Eugene Tuv† * Arizona State University † Intel Corporation 10/14/2012 INFORMS Annual Meeting 2012, Phoenix
Outline Time series classification Problem definition Motivation Supervised Time Series Pattern Discovery through Local Importance (TS-PD) Computational experiments and results Conclusions and future work
Time Series Classification Time series classification is a supervised learning problem. The input consists of a set of training examples and associated class labels; each example is formed by one or more time series. The goal is to predict the class of a new (test) series.
Motivations People measure things, and things (with rare exceptions) change over time. Time series are everywhere: ECG heartbeats, stock prices, ...
Motivations Other types of data can be converted to time series; everything is about the representation. Example: recognizing words. An example word "Alexandria" from the dataset of word profiles for George Washington's manuscripts: a word can be represented by two time series created by tracing the profiles over and under the word. Images from E. Keogh, "A quick tour of the datasets for VLDB 2008," VLDB, 2008.
Challenges How can we handle warping in time series? The four observed peaks are related to a certain event in the manufacturing process and indicate a problem. The time of the peaks may change (TRANSLATION: two peaks are observed earlier for the blue series), and the problem may occur over a shorter time interval (DILATION).
Approaches Instance-based methods predict based on the similarity to the training time series. They require a similarity (distance) measure: Euclidean distance, ..., Dynamic Time Warping (DTW). The DTW distance is known to be a strong solution [1]; it handles translations and dilations by matching observations. Feature-based methods predict a test instance based on a model trained on extracted feature vectors. They require feature extraction methods and a supervised learner (e.g., decision tree, support vector machine) trained on the extracted features.
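The warping-invariant matching that DTW performs can be sketched with the classic dynamic program (a minimal illustration, not the optimized implementations cited in the talk):

```python
import numpy as np

def dtw_distance(x, y):
    """Classic dynamic-programming DTW distance between 1-D series x and y."""
    n, m = len(x), len(y)
    # D[i, j] = cost of the best warping path aligning x[:i] with y[:j]
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = (x[i - 1] - y[j - 1]) ** 2
            # step pattern: match, insertion, deletion
            D[i, j] = cost + min(D[i - 1, j - 1], D[i - 1, j], D[i, j - 1])
    return np.sqrt(D[n, m])

# A translated peak is cheap to align under DTW but not under Euclidean distance.
a = np.array([0, 0, 1, 2, 1, 0, 0], dtype=float)
b = np.array([0, 1, 2, 1, 0, 0, 0], dtype=float)  # same peak, shifted left
print(dtw_distance(a, a))                          # 0.0
print(dtw_distance(a, b) < np.linalg.norm(a - b))  # True: DTW absorbs the shift
```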
Instance-based methods Advantages: accurate; do not require setting many parameters. Disadvantages: may not be suitable for real-time applications [3]. DTW is a variation of the shortest-path problem and is quadratic in the series length n; the LB_Keogh lower bound [8] is computable in O(n) and prunes most full DTW computations. Not scalable with a large number of training samples and variables: there is no model, so each test series is compared to all (or some) training series. Requires storage of the training time series, so it is not suitable for resource-limited environments (e.g., sensors). Performance degrades with long time series and short features of interest.
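The LB_Keogh pruning idea can be sketched as follows: build an envelope around the query within the warping window, and accumulate cost only where a candidate leaves the envelope. This is an illustrative sketch of the bound, not the cited implementation:

```python
import numpy as np

def lb_keogh(query, candidate, r):
    """LB_Keogh lower bound on DTW(query, candidate) with warping window r.
    Computable in O(n * r); candidates whose bound already exceeds the best
    distance so far can be discarded without running the full DTW."""
    n = len(query)
    lb = 0.0
    for i, c in enumerate(candidate):
        lo, hi = max(0, i - r), min(n, i + r + 1)
        U = query[lo:hi].max()   # upper envelope of the query at position i
        L = query[lo:hi].min()   # lower envelope of the query at position i
        if c > U:
            lb += (c - U) ** 2
        elif c < L:
            lb += (c - L) ** 2
    return np.sqrt(lb)

q = np.array([0., 0., 1., 2., 1., 0., 0.])
c = np.array([0., 1., 2., 1., 0., 0., 0.])  # shifted peak: stays inside the envelope
spike = q.copy(); spike[3] = 5.0            # leaves the envelope at one point
print(lb_keogh(q, c, r=1))      # 0.0: no evidence against this candidate
print(lb_keogh(q, spike, r=1))  # 3.0: positive lower bound, may be pruned
```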
Feature-based methods Time series are represented by the generated features. Shape-based features: mean, variance, slope, ... Wavelet features: coefficients, ... Global features (such as the global mean/variance) provide a compact representation of the series, but local features are important: features from time series segments (intervals), e.g., the interval mean.
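Extracting shape-based features from sliding intervals can be sketched as below (a minimal illustration; the interval length and step are placeholder values, and the slope is taken as the least-squares linear trend):

```python
import numpy as np

def interval_features(series, interval_len, step):
    """Shape-based features (mean, variance, slope) over sliding intervals.
    Returns one flat feature vector per series."""
    feats = []
    for start in range(0, len(series) - interval_len + 1, step):
        seg = series[start:start + interval_len]
        t = np.arange(interval_len)
        slope = np.polyfit(t, seg, 1)[0]  # least-squares linear trend of the segment
        feats.extend([seg.mean(), seg.var(), slope])
    return np.array(feats)

x = np.sin(np.linspace(0, 4 * np.pi, 60))
fv = interval_features(x, interval_len=6, step=3)
print(fv.shape)  # 19 intervals x 3 features each
```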
Feature-based methods Advantages: fast; robust to noise; allow fusion of domain knowledge through domain-specific features (e.g., linear predictive coding (LPC) features for speech recognition). Disadvantages: problems in handling warping; the cardinality of the feature set may vary.
Time Series Pattern Discovery through Local Importance (TS-PD) Identifying the regions of a time series important to classification is required for interpretability and for good classification with appropriate approaches (matching the patterns). Local importance is a measure that evaluates the potential descriptiveness of a certain segment (interval) of the time series.
TS-PD Local Importance Time series are represented by interval features (currently shape-based: mean, variance and slope, at different scales); any features, including application-specific ones, can be added to the representation. A tree-based ensemble (Random Forest) is trained on this representation -> RFint. A permutation-based approach (based on the out-of-bag idea) evaluates the descriptiveness of each interval.
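Training RFint on the interval representation can be sketched as follows. This is an illustrative sketch on synthetic data (scikit-learn's `RandomForestClassifier` stands in for the ensemble; the bump location, interval length and step are invented for the demo):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)

def interval_features(series, interval_len=6, step=3):
    """Mean, variance and slope of each sliding interval (shape-based features)."""
    feats = []
    for s in range(0, len(series) - interval_len + 1, step):
        seg = series[s:s + interval_len]
        slope = np.polyfit(np.arange(interval_len), seg, 1)[0]
        feats += [seg.mean(), seg.var(), slope]
    return feats

# Synthetic two-class data: class-1 series carry a bump in the middle.
X_series = rng.normal(size=(100, 40))
y = rng.integers(0, 2, size=100)
X_series[y == 1, 15:25] += 2.0

X = np.array([interval_features(s) for s in X_series])
rf_int = RandomForestClassifier(n_estimators=200, oob_score=True,
                                random_state=0).fit(X, y)
print(rf_int.oob_score_)  # OOB accuracy on the interval representation
```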
TS-PD Local Importance Train RFint, then test it on out-of-bag (OOB) samples with the features of one interval permuted. For an OOB sample (e.g., let time series 1 be of class 1), the local importance of an interval is defined as the drop in correct OOB votes after permuting that interval's features.
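The permutation idea can be sketched as below. For brevity this sketch scores a held-out set instead of tracking per-tree OOB membership, and the feature layout (two columns per "interval") is invented for the demo:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic data: only the "interval" holding columns 4-5 is discriminative.
X = rng.normal(size=(300, 8))
y = (X[:, 4] + X[:, 5] > 0).astype(int)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_tr, y_tr)
base = rf.score(X_te, y_te)

def local_importance(cols):
    """Accuracy drop when the features of one interval are permuted."""
    Xp = X_te.copy()
    for c in cols:
        Xp[:, c] = rng.permutation(Xp[:, c])
    return base - rf.score(Xp, y_te)

# Treat each pair of columns as one interval and score them all.
imps = [local_importance([c, c + 1]) for c in range(0, 8, 2)]
print(imps)  # the interval holding columns 4-5 dominates
```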
Local Importance
TS-PD Distance-based features Find the important intervals for each time series, sample intervals from these regions, and search for similarity over all time series for each specific region (Euclidean distance in our case). The minimum distance of a pattern to a time series is used as a feature for classification.
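The minimum-distance feature can be sketched as a sliding-window search (a minimal illustration; the embedded bump and its location are invented for the demo):

```python
import numpy as np

def min_pattern_distance(pattern, series):
    """Minimum Euclidean distance between a pattern and any equal-length
    window of the series: the distance-based feature used for classification."""
    w = len(pattern)
    dists = [np.linalg.norm(series[s:s + w] - pattern)
             for s in range(len(series) - w + 1)]
    return min(dists)

series = np.zeros(30)
series[12:17] = [0., 1., 2., 1., 0.]      # embedded bump
pattern = np.array([0., 1., 2., 1., 0.])  # reference pattern from an important interval
print(min_pattern_distance(pattern, series))  # 0.0: an exact occurrence is found
```

Because the search slides over all positions, the feature is invariant to translation of the pattern within the series.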
TS-PD Classification In the feature set, each row is a time series and each column is a pattern; each entry is the distance between the pattern and the most similar region of that time series. This is basically a kernel based on the distances to the patterns. A tree-based ensemble (Random Forest) is trained on this feature set -> RFts. Scalable, with a variable importance measure.
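Building the distance feature matrix and training RFts can be sketched on synthetic data (an illustrative sketch: the reference patterns here are hand-picked, whereas in TS-PD they come from the locally important intervals):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(1)

def min_pattern_distance(pattern, series):
    w = len(pattern)
    return min(np.linalg.norm(series[s:s + w] - pattern)
               for s in range(len(series) - w + 1))

# Synthetic series: class-1 series contain a bump at a random (translated) position.
n, length = 120, 40
X_series = rng.normal(scale=0.3, size=(n, length))
y = rng.integers(0, 2, size=n)
bump = np.array([0., 1., 2., 1., 0.])
for i in np.where(y == 1)[0]:
    s = rng.integers(0, length - 5)
    X_series[i, s:s + 5] += bump

# Rows: time series; columns: patterns; entries: minimum distances.
patterns = [bump, np.zeros(5)]
F = np.array([[min_pattern_distance(p, s) for p in patterns] for s in X_series])

rf_ts = RandomForestClassifier(n_estimators=200, oob_score=True,
                               random_state=0).fit(F, y)
print(rf_ts.oob_score_)  # high despite the random translations of the bump
```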
TS-PD Interpretability Variable importance [9] enables interpretability: find the most important features from the RF and visualize them.
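Ranking patterns by importance can be sketched with scikit-learn's impurity-based importances (an illustrative sketch on a toy matrix where only the first column carries the class signal):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)

# Toy feature matrix: only column 0 (one "pattern") carries the class signal.
X = rng.normal(size=(200, 5))
y = (X[:, 0] > 0).astype(int)

rf = RandomForestClassifier(n_estimators=300, random_state=0).fit(X, y)

# Impurity-based variable importance, sorted for inspection/visualization.
ranking = np.argsort(rf.feature_importances_)[::-1]
print(ranking[0])  # 0: the informative pattern tops the ranking
```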
TS-PD Experiments 43 datasets from UCR database
TS-PD Experiments Parameters Interval length and sliding window (6 and 3 time units): set small enough that the probability of missing a pattern is decreased. Number of locally important intervals used as reference patterns (10 intervals): depends on the dataset characteristics; if the features of interest are long, a larger setting is preferred, and the interval length also has an effect. RF is not affected by this setting if it is set large enough, because of the embedded feature selection: irrelevant patterns are easily identified, and correlated patterns are handled by building trees on random feature subspaces. Number of trees for both RFs, RFint and RFts (2000 trees): easily set based on the OOB error rates; if there is no concern about computation time, a larger setting is preferred.
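Choosing the number of trees from OOB error rates can be sketched as follows (an illustrative sketch on synthetic data; the candidate tree counts are placeholders, not the talk's settings):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 10))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# OOB error stabilizes as trees are added; pick a tree count where the
# curve has flattened instead of tuning by cross-validation.
oob_err = {}
for n_trees in (10, 100, 500):
    rf = RandomForestClassifier(n_estimators=n_trees, oob_score=True,
                                random_state=0).fit(X, y)
    oob_err[n_trees] = 1.0 - rf.oob_score_
print(oob_err)
```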
TS-PD Experiments Two types of NN classifiers with DTW: NNDTWNoWin (no warping window) and NNBestDTW, which searches for the best warping window based on the training data.
TS-PD Example Extending TS-PD to MTS (multivariate time series) classification: a gesture recognition task [12]. Acceleration of the hand on the x, y and z axes; classify the gestures (8 different types).
TS-PD Example
TS-PD Conclusion TS-PD identifies regions of interest and provides a visualization tool for understanding the underlying relations. It is a fast approach to detect the local information related to classification. It handles warping partially: translations are handled, but the distance-based features do not guarantee handling of dilations. It provides a kernel based on local distances, is interpretable, and delivers fast classification results. For reproducibility of the results, the code of TS-PD is available at http://www.mustafabaydogan.com/supervised-time-series-pattern-discovery-through-local-importance-tspd.html
Thanks! Questions and Comments?
References