A Time Series Representation Framework Based on Learned Patterns Mustafa Gokce Baydogan (Boğaziçi University), George Runger (Arizona State University), Didem Yamak (DeVry University) CIDSE, Arizona State University, 10/11/2013
Outline Time series data mining Motivation Representing time series Measuring similarity Learning a pattern-based representation Pattern (relationship) discovery Learned pattern similarity (LPS) Computational experiments and results Conclusions and future work
Time series data mining What is a time series? A numeric (nominal) time series is a sequence of observations of a numeric (nominal) property over time Example: the output of an electrocardiography (ECG) recorder, with time represented on the x-axis and voltage represented on the y-axis
Time series data mining Motivations People measure things, and things (with rare exceptions) change over time Time series are everywhere ECG Heartbeat Stock Images from E. Keogh. A decade of progress in indexing and mining large time series databases. In VLDB, page 1268, 2006.
Time series data mining Motivations Other types of data can be converted to time series. Everything is about the representation. Example: Recognizing words An example word “Alexandria” from the dataset of word profiles for George Washington's manuscripts. A word can be represented by two time series created by moving over and under the word Images from E. Keogh. A quick tour of the datasets for VLDB 2008. In VLDB, 2008.
Time series data mining Tasks Clustering Classification Rule Discovery Motif Discovery Query by Content Anomaly Detection All tasks require a representation Most of them require a similarity measure
Time series classification A supervised learning problem aimed at labeling temporally structured univariate (or multivariate) sequences of fixed (or variable) length.
Challenges Local patterns are important Translations and dilations (warping) The time of the peaks may change (two peaks are observed earlier for the blue series) The four observed peaks are related to a certain event in the manufacturing process The problem occurred over a shorter time interval Indication of a problem
Challenges Time series are usually noisy Multivariate time series (MTS) Relation of patterns within the series and interactions between series may be important High-dimensionality
Bag-of-words Originated from document classification approaches Bag-of-words is also referred to as Bag-of-features in computer vision Bag-of-instances in multiple instance learning (MIL) Bag-of-frames in audio and speech recognition Used for many computer vision problems “This is a book not a pencil”
Earlier work Time series classification A Bag-of-Features Framework to Classify Time Series* Works on univariate time series Segments subsequences of random length from random locations Extracts simple features (mean, slope, variance) over intervals Trains a supervised learner on the subsequence representation to generate class probability estimates (CPE) for each subsequence Aggregates the CPE of subsequences for each series to generate a time series representation in a BoF framework Fast and provides the best results with few parameters *Mustafa Gokce Baydogan, George Runger, Eugene Tuv, "A Bag-of-Features Framework to Classify Time Series," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 35, no. 11, pp. 2796-2802, Nov. 2013
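The feature extraction step above can be sketched as follows. This is a minimal illustration, not the published implementation: the function names, the equal-width interval split, and the parameters `min_len` and `n_intervals` are assumptions introduced here.

```python
import random
from statistics import mean, pvariance

def slope(values):
    """Least-squares slope of the values against time indices 0..n-1."""
    n = len(values)
    if n < 2:
        return 0.0
    t_bar = (n - 1) / 2
    v_bar = mean(values)
    num = sum((t - t_bar) * (v - v_bar) for t, v in enumerate(values))
    den = sum((t - t_bar) ** 2 for t in range(n))
    return num / den

def interval_features(values):
    """Mean, slope and (population) variance of one interval."""
    return (mean(values), slope(values), pvariance(values))

def random_subsequence_features(series, min_len, n_intervals, rng):
    """Pick a random subsequence and describe it interval by interval.

    Assumes min_len >= n_intervals so every interval has at least one point.
    Points left over after the equal-width split are ignored.
    """
    length = rng.randint(min_len, len(series))
    start = rng.randint(0, len(series) - length)
    sub = series[start:start + length]
    width = length // n_intervals
    feats = []
    for i in range(n_intervals):
        feats.extend(interval_features(sub[i * width:(i + 1) * width]))
    return start, length, feats

rng = random.Random(0)
series = [float(v % 7) for v in range(40)]
start, length, feats = random_subsequence_features(series, 8, 4, rng)
```

Each subsequence yields three features per interval; a supervised learner would then be trained on these rows.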
Earlier work Time series classification Multivariate Time Series Classification with Learned Discretization* Works on both univariate and multivariate series Trains a supervised learner on the observed values to learn a symbolic representation for time series Does not generate features explicitly Has similarities to the work to be described today Aggregates the symbols using a bag-of-words (BoW) representation Without generation of motifs (2-mers, k-mers, etc.) Considers relationships across multiple variables in a straightforward manner (for multivariate series) Simple, fast, and performs better than existing approaches without setting many parameters *Mustafa Gokce Baydogan, George Runger, Eugene Tuv, "Multivariate Time Series Classification with Learned Discretization," submitted to Data Mining and Knowledge Discovery (received major revision on August 7th, 2013)
Approaches for time series analysis Time series representation To reduce high dimensionality and noise To capture trends, shapes and patterns Provides more information than the exact values of each time series data point Time series similarity To capture and reflect the underlying similarity Important for a variety of DM tasks such as similarity search, classification, clustering, etc.
Time series representation But the method of trees is different from that used previously for time series * Allows lower bounding for similarity computations
Time series similarity Popular (No parameter) Intuitive Fast computation Performs POORLY Very popular (No parameter) Handles warping (Accurate) Hard to beat May perform POORLY (long series with noise) Handles warping (Accurate) Too many parameters to tune Computationally not efficient
A popular similarity measure Dynamic time warping (DTW) Strong solution known for time series problems in a variety of domains The sequences are “warped” non-linearly in the time dimension to measure similarity independent of certain non-linear variations in the time dimension Alignment of time series by DTW recognizes the similarity of the series better
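For reference, the textbook unconstrained DTW recurrence can be sketched as below. This is the standard dynamic program with a squared pointwise cost, chosen here for illustration; the slides do not say which local cost or window was used in the experiments.

```python
def dtw_distance(a, b):
    """Unconstrained DTW between two numeric sequences (squared local cost)."""
    inf = float("inf")
    n, m = len(a), len(b)
    # cost[i][j] = best cumulative cost of aligning a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = (a[i - 1] - b[j - 1]) ** 2
            cost[i][j] = d + min(cost[i - 1][j],      # repeat b[j-1]
                                 cost[i][j - 1],      # repeat a[i-1]
                                 cost[i - 1][j - 1])  # advance both
    return cost[n][m] ** 0.5

def euclidean_distance(a, b):
    """Point-by-point distance for equal-length sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

early = [0, 1, 2, 1, 0, 0]   # peak occurs one step earlier
late = [0, 0, 1, 2, 1, 0]
```

On these two series the same peak is shifted by one time step: Euclidean distance penalizes the shift, while DTW aligns the peaks and reports no difference.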
Representations based on trees Overview of regression trees A regression tree is a kind of additive model of the form $m(x) = \sum_{i} k_i \, I(x \in D_i)$, where each $k_i$ is a constant and the $D_i$ are the disjoint partitions defined by the tree Models of this type are sometimes called piecewise constant regression models: they partition the predictor space into a set of regions and fit a constant value within each region Find the $D_i$ that minimize the SSE of $m(x)$ in a recursive manner
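One step of the recursive partitioning can be sketched as an exhaustive search for the single-predictor threshold that minimizes the SSE of a piecewise constant fit. The function names are illustrative, and a full tree would apply this step recursively to each side.

```python
def sse(values):
    """Sum of squared errors around the mean (the best constant fit)."""
    if not values:
        return 0.0
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values)

def best_split(x, y):
    """Return (threshold, total_sse) of the best binary split on x.

    Returns (None, sse(y)) when no split improves on a single constant.
    """
    pairs = sorted(zip(x, y))
    best = (None, sse(list(y)))
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # cannot separate identical predictor values
        threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [p[1] for p in pairs[:i]]
        right = [p[1] for p in pairs[i:]]
        total = sse(left) + sse(right)
        if total < best[1]:
            best = (threshold, total)
    return best

threshold, error = best_split([1, 2, 3, 4], [1.0, 1.0, 5.0, 5.0])
```

Here the response is constant on each side of 2.5, so that split leaves zero residual error.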
Previous work on Tree-based time series representation A regression tree-based approach has been used to learn a representation (Geurts, 2001) A simple piecewise constant model Your data matrix
A new representation approach Predicting (forecasting) a segment Time series segment of length L Your data matrix Forecast ∆=50 (gap) time units forward
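The pairing of a predictor segment with a response segment shifted forward by a gap can be sketched as below. The function name and the 0-based indexing are assumptions made for this illustration; `seg_len` plays the role of L and `gap` the role of ∆.

```python
def predictor_response_pairs(series, seg_len, gap):
    """Pair each segment with the segment starting `gap` time units later."""
    pairs = []
    last_start = len(series) - seg_len - gap
    for start in range(last_start + 1):
        predictor = series[start:start + seg_len]
        response = series[start + gap:start + gap + seg_len]
        pairs.append((predictor, response))
    return pairs

series = list(range(128))          # stand-in for a series of 128 observations
pairs = predictor_response_pairs(series, seg_len=60, gap=50)
```

With these values the first predictor covers positions 0 to 59 and its response covers positions 50 to 109.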
A new representation approach Learned patterns Time series is 128 units long Predictor segment 1-60 Response segment 51-111
A new representation approach Multiple segments Extract all possible segments of length L<T Example: L=16 (segment length), T=27 (time series length) Series 1 Series n Series N Concatenate over all time series
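Extracting every segment and stacking the segments of all series can be sketched as follows. The function names are illustrative; a series of length T yields T-L+1 segments of length L.

```python
def segment_matrix(series, seg_len):
    """All contiguous segments of length seg_len, one per row."""
    return [series[s:s + seg_len] for s in range(len(series) - seg_len + 1)]

def stack_segments(dataset, seg_len):
    """Concatenate the segment rows of every series in the dataset."""
    rows = []
    for series in dataset:
        rows.extend(segment_matrix(series, seg_len))
    return rows

one_series = list(range(27))                  # T = 27
rows = segment_matrix(one_series, 16)         # L = 16
stacked = stack_segments([one_series, one_series, one_series], 16)
```

For T=27 and L=16 each series contributes 12 rows of 16 values, so three series give 36 rows.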
A new representation approach based on regression trees Build J trees with depth D Selection of a random predictor column Introduces multiple random ∆ values Works well for regression (P. Geurts, D. Ernst, and L. Wehenkel. Extremely randomized trees. Machine Learning, 63(1):3-42, 2006.)
A pattern-based representation Tree #1 Tree #2 Tree #3 … Tree #J Aggregate the information over all trees for prediction (i.e. denoising) Each terminal node defines a basis Pattern-based representation (a vector of size R times J*) *Assuming each tree has R terminal nodes
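A rough sketch of how a vector of size R times J could be built is shown below. It assumes that each entry counts how many segments of a series land in a given terminal node, which the slide does not state explicitly. The tuple encoding of trees and the hand-written stumps are hypothetical stand-ins for learned regression trees.

```python
def leaf_index(tree, segment):
    """Walk a tree encoded as (column, threshold, left, right) or an int leaf id."""
    node = tree
    while not isinstance(node, int):
        column, threshold, left, right = node
        node = left if segment[column] <= threshold else right
    return node

def represent(series, trees, seg_len, leaves_per_tree):
    """Count, per tree and terminal node, the segments that fall into it."""
    counts = [0] * (leaves_per_tree * len(trees))
    for s in range(len(series) - seg_len + 1):
        segment = series[s:s + seg_len]
        for j, tree in enumerate(trees):
            counts[j * leaves_per_tree + leaf_index(tree, segment)] += 1
    return counts

# Two hypothetical depth-1 trees (J = 2, R = 2), not learned from data.
trees = [(0, 0.5, 0, 1), (1, 0.5, 0, 1)]
vector = represent([0, 1, 0, 1], trees, seg_len=2, leaves_per_tree=2)
```

The resulting vector has one block of R counts per tree, and every block sums to the number of segments in the series.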
Similarity measure Learned Pattern Similarity (LPS) Time series is represented by a vector of size R times J* Penalizes the number of mismatches Series with mismatching observations in the patterns are different Robust to noise Implicitly works on the discrete values Robust to warping Representation learning handles the problem of warping *Assuming each tree has R terminal nodes
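One way to read "penalizes the number of mismatches" is as a sum of absolute differences between two count vectors. This is an assumption made for illustration; the slide does not give the exact formula, and the function name is hypothetical.

```python
def mismatch_dissimilarity(rep_a, rep_b):
    """Sum of absolute count differences between two representations."""
    if len(rep_a) != len(rep_b):
        raise ValueError("representations must have the same length")
    return sum(abs(a - b) for a, b in zip(rep_a, rep_b))

d_same = mismatch_dissimilarity([2, 1, 1, 2], [2, 1, 1, 2])
d_diff = mismatch_dissimilarity([2, 1, 1, 2], [2, 1, 0, 3])
```

Identical representations score zero, and each segment that lands in a different terminal node adds to the total.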
Similarity measure Learned Pattern Similarity (LPS) The computations are similar to Euclidean distance Fast Allows for bounding schemes Early abandon Similarity search: find the reference time series that is most similar to the query series Keep a record of the best distance found so far Stop computing the distance for a reference series if the current distance is larger than the best-so-far Known to improve the testing time (query time) significantly
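Early abandoning in a nearest-neighbor search can be sketched as below. The absolute-difference distance and the function names are assumptions for this illustration; the same idea applies to any distance that accumulates non-negative terms.

```python
def early_abandon_distance(query, reference, best_so_far):
    """Accumulate the distance; return None once it reaches best_so_far."""
    total = 0
    for q, r in zip(query, reference):
        total += abs(q - r)
        if total >= best_so_far:
            return None  # this reference cannot beat the current best
    return total

def nearest_reference(query, references):
    """Index and distance of the closest reference (first one wins ties)."""
    best_index, best = None, float("inf")
    for i, reference in enumerate(references):
        d = early_abandon_distance(query, reference, best)
        if d is not None:
            best_index, best = i, d
    return best_index, best

references = [[0, 3, 3, 0], [2, 1, 0, 3], [2, 2, 2, 2]]
index, distance = nearest_reference([2, 1, 1, 2], references)
```

The third reference is abandoned before its last element is read, because its running total already matches the best distance found so far.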
Learned Pattern Similarity (LPS) Experiments 45 univariate time series datasets from UCR database* Compared to popular NN classifiers with different distance measures Euclidean DTW (Constrained and unconstrained version) SpADe Sparse Spatial Sample Kernels (SSSK) Addition of difference series Taking trend information into consideration A multivariate time series extension Parameters Cross-validation to set parameters for each dataset Segment length (L) (0.25, 0.5, 0.75) factor of time series length Depth of trees (4,6,8) Number of trees=150 Not important if set large enough *http://www.cs.ucr.edu/~eamonn/time_series_data/
Univariate datasets Health Energy Robotics Astronomy Astronomy Chemistry Gesture recognition
LPS Sensitivity analysis Illustration over 6 datasets (L=0.5xT) Multiple depth (D) and number of trees (J) levels
Results better than or comparable to DTW-based approaches Average error rates over 10 replications Scatter plots of error rates: LPS versus DTW with no window, DTW with best window (constrained), SpADe, and LPS w/o difference series LPS performs better than Euclidean distance and SSSK (not shown)
LPS Computational complexity Training complexity is O(JNTD) Linear in time series length and number of training series Memory efficient S is not generated explicitly. Only two columns are used at each split decision Testing complexity: Tree traversal to generate the representation -> O(TJD) Similarity computation (worst case) -> O(NJ2^D) StarLightCurves dataset (N=1000, T=1024)
LPS Multivariate time series While training, randomly select one univariate time series and a target segment Complexity does not change More trees with larger depth may be required
LPS Multivariate time series uWaveGestureLibrary Gesture recognition task Acceleration of hand on x, y and z axis Classify gestures (8 different types of gestures) Same parameters result in error rate of 0.022
LPS Conclusions and future work A new approach for time series representation Captures relations between and within the series Features learned within the algorithm (not pre-specified) Handles nominal and missing values Handles warping by representation learning Scalable (also allows for parallel implementation) Training complexity is linear in time series length and number of training series Training took at most 6 minutes over 45 datasets (single thread, J=150, D=8, N=1800, T=750) There is still room for improving the implementation SpADe did not return a result within a week of run time Our similarity search takes less than a millisecond Fast and accurate results with few parameters
LPS Conclusions and future work Proposed representation has some relations to deep learning This approach can be extended to many data mining tasks (for both univariate and multivariate time series and images) such as Denoising (in progress) Forecasting (in progress) Anomaly detection (in progress) Clustering (in progress) Indexing … LPS package is provided on http://www.mustafabaydogan.com/learned-pattern-similarity-lps.html
Questions and Comments? Thanks! LPS package is provided on http://www.mustafabaydogan.com/learned-pattern-similarity-lps.html