k-Shape: Efficient and Accurate Clustering of Time Series
John Paparrizos and Luis Gravano
SIGMOD 2015
Time Series: Sequentially Collected Observations
- Time series are ubiquitous and abundant in many disciplines
  - Medicine: Electrocardiogram (ECG)
  - Engineering: Human gaits [Weyand et al., Journal of Applied Physiology 2000]
[Figure: ECG trace (voltage over time) annotated with atria activation, ventricle activation, and recovery wave; gait trace (ground force over time)]
Time-Series Analysis: Similar Challenges Across Tasks
- Popular time-series analysis tasks: querying, classification, and clustering (our focus)
- Manipulation of time series is challenging and time consuming
- Operations handle data distortions, noise, missing data, high dimensionality, …
- Each domain has different requirements and needs
- Choice of distance measure is critical for effective time-series analysis
[Figure: the three tasks illustrated: querying (a query series), classification (is a new time series Class A or Class B?), and clustering (input and output)]
Shape-Based Clustering: Group Series with Similar Patterns
- Input: a set of time series
- Output: k time-series clusters
- Group time series into clusters based on their shape similarity (i.e., regardless of any differences in amplitude and phase)
- Ignore differences in phase: offer shift invariance
- Ignore differences in amplitude: offer scaling and translation invariances
[Figure: example funnel, cylinder, and bell clusters (amplitude over time); pairs of time series illustrating the phase and amplitude cases, annotated with distances 483, 37, 694, and 17]
Shape-Based Clustering: k-Means
- Objective: find the partition P* that minimizes the within-cluster sum of squared distances between time series and centroids
- k-Means finds a locally optimal solution by iteratively performing two steps:
  - Assignment step: assigns time series to clusters of their nearest centroids
  - Refinement step: updates centroids to reflect changes in cluster membership
- Centroid computation finds the time series that minimizes the sum of squared distances to all other time series in the cluster
- Requirements: a distance measure and a centroid computation method
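The objective and the centroid definition stated on this slide can be written compactly as below. The notation is assumed here for illustration and is not taken from the slide: P = {p_1, ..., p_k} is a partition of the input series into k clusters, c_j is the centroid of cluster p_j, x_i is a time series, and dist is whichever distance measure is plugged in.

```latex
P^{*} = \operatorname*{arg\,min}_{P} \sum_{j=1}^{k} \sum_{\vec{x}_i \in p_j} \mathrm{dist}(\vec{c}_j, \vec{x}_i)^{2}
\qquad
\vec{c}_j = \operatorname*{arg\,min}_{\vec{w}} \sum_{\vec{x}_i \in p_j} \mathrm{dist}(\vec{w}, \vec{x}_i)^{2}
```

With dist set to the Euclidean distance, the second expression is solved by the arithmetic mean of the cluster members, which is why the two requirements (a distance measure and a centroid computation method) have to be chosen together.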
Shape-Based Clustering: Existing Approaches
- Choice of distance measures:
  - Euclidean Distance (ED)
  - Dynamic Time Warping (DTW)
- Choice of centroid computation methods:
  - Arithmetic mean of the coordinates of time series (AVG)
  - Non-linear alignment and averaging filters (NLAAF)
  - Prioritized shape averaging (PSA)
  - Dynamic time warping barycenter averaging (DBA)
- Issues with existing approaches:
  - Cannot scale as they rely on expensive methods (e.g., DTW)
[Figure: panels labeled ED and DTW]
k-Shape: A Novel Instantiation of k-Means for Efficient Shape-Based Clustering
- k-Shape accounts for the shapes of time series during clustering
- k-Shape is a scale-, translate-, and shift-invariant clustering method
  - Distance measure: a normalized version of the cross-correlation measure
  - Centroid computation: a novel method based on that distance measure
[Figure: correlation over possible shifts; input time series with our centroid and the k-Means centroid (amplitude over time)]
k-Shape’s Scale-, Translate-, and Shift-Invariant Distance Measure
- Cross-correlation measure: keep one sequence static and slide the other to compute the correlation (i.e., inner product) for each shift
- Intuition: determine the shift that exhibits the maximum correlation
[Figure: two time series (amplitude over time) and their correlation across shifts]
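The sliding procedure described on this slide can be sketched directly, without any normalization yet. This is a minimal illustration, assuming equal-length series and zero contribution from samples that slide past either end; the function names `cross_correlation` and `best_shift` and the sign convention for shifts are choices made here, not taken from the slides.

```python
def cross_correlation(x, y):
    """Return the inner product of x with y at every shift s in -(m-1)..(m-1).

    Shift s moves y right by s samples (left when s is negative); samples
    that slide past either end contribute nothing.
    """
    m = len(x)
    if len(y) != m:
        raise ValueError("this sketch assumes equal-length series")
    cc = []
    for s in range(-(m - 1), m):  # 2m - 1 possible shifts
        total = 0.0
        for i in range(m):
            j = i - s  # index of the y sample that sits under x[i] after the shift
            if 0 <= j < m:
                total += x[i] * y[j]
        cc.append(total)
    return cc


def best_shift(x, y):
    """Return (shift, correlation) for the shift with the largest correlation."""
    cc = cross_correlation(x, y)
    k = max(range(len(cc)), key=cc.__getitem__)
    return k - (len(x) - 1), cc[k]


x = [0, 0, 1, 2, 1, 0, 0, 0]
y = [0, 0, 0, 0, 1, 2, 1, 0]  # the same bump, two samples later
print(best_shift(x, y))  # (-2, 6.0): moving y two samples left aligns it with x
```

The double loop costs O(m^2) per pair of series, which is the kind of naïve implementation that makes the measure look slow.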
k-Shape’s Scale-, Translate-, and Shift-Invariant Distance Measure
- Key issue: normalization of the time series and of the cross-correlation sequence
- Shape-Based Distance Measure (SBD):
  - Z-normalize the time series (i.e., mean = 0 and std = 1)
  - Divide the cross-correlation sequence by the geometric mean of the autocorrelations of the individual time series
- A naïve implementation makes SBD appear as slow as DTW
- SBD becomes very efficient with the use of the Fast Fourier Transform
- Example: for perfectly aligned time series of length 1024, cross-correlation should be maximized at shift 1024; inadequate normalizations produce misleading results
[Figure: time series (amplitude over time) and cross-correlation sequences over shifts, with panels marked ✖ and ✔]
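The two normalizations and the FFT speed-up can be combined in a short NumPy sketch. It assumes equal-length series with at least two samples and nonzero variance, and it takes the distance to be one minus the maximum normalized cross-correlation, which is how the k-Shape paper defines SBD. The function names (`z_normalize`, `ncc_c`, `sbd`) and the returned shift convention are illustrative choices, not the authors' code.

```python
import numpy as np


def z_normalize(x):
    """Rescale a series to mean 0 and standard deviation 1."""
    x = np.asarray(x, dtype=float)
    return (x - x.mean()) / x.std()


def ncc_c(x, y):
    """Coefficient-normalized cross-correlation for all 2m - 1 shifts, via FFT."""
    m = len(x)
    n_fft = 1 << (2 * m - 1).bit_length()  # power of two >= 2m - 1, avoids wrap-around
    spectrum = np.fft.rfft(x, n_fft) * np.conj(np.fft.rfft(y, n_fft))
    cc = np.fft.irfft(spectrum, n_fft)
    # Negative shifts sit at the end of the circular result; reorder so that
    # index 0 is shift -(m - 1) and index m - 1 is shift 0.
    cc = np.concatenate((cc[-(m - 1):], cc[:m]))
    # Geometric mean of the two zero-shift autocorrelations.
    return cc / np.sqrt(np.dot(x, x) * np.dot(y, y))


def sbd(x, y):
    """Return (distance, shift); shift is how far to move y to best match x."""
    x, y = z_normalize(x), z_normalize(y)
    ncc = ncc_c(x, y)
    k = int(np.argmax(ncc))
    return 1.0 - ncc[k], k - (len(x) - 1)


x = np.array([0, 0, 1, 2, 1, 0, 0, 0], dtype=float)
y = 3.0 * np.array([0, 0, 0, 0, 1, 2, 1, 0], dtype=float) + 5.0  # scaled, offset, delayed
dist, shift = sbd(x, y)
```

In this example the scaling by 3 and the offset of 5 are removed by z-normalization, and the two-sample delay is absorbed by the shift search, so `dist` stays small even though the raw values differ widely.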
k-Shape: Coefficient Normalization
k-Shape’s Time-Series Centroid Computation Method
- Centroid computation of k-means is problematic for misaligned time series
- k-Shape performs two steps in its centroid computation:
  - First step: align time series towards a reference sequence
  - Second step: compute the time series that maximizes the sum of squared correlations to all other time series in the cluster
[Figure: input time series, our centroid, and the k-Means centroid (amplitude over time)]
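The two steps can be sketched as follows. The slide does not say how the maximization is solved; this sketch assumes the standard linear-algebra route, in which the unit-norm vector maximizing a sum of squared inner products is the top eigenvector of the corresponding Gram matrix, with a centering matrix Q restricting the search to zero-mean candidates. The helper names, the zero-padded shifting, and the sign fix at the end are illustrative choices, and the cluster members are assumed to be z-normalized already.

```python
import numpy as np


def shift_zero_pad(y, s):
    """Move y right by s samples (left if s < 0), filling the gap with zeros."""
    out = np.zeros_like(y)
    if s > 0:
        out[s:] = y[:-s]
    elif s < 0:
        out[:s] = y[-s:]
    else:
        out[:] = y
    return out


def extract_shape(members, reference):
    """Align members to the reference, then return a z-normalized centroid."""
    X = np.asarray(members, dtype=float)
    ref = np.asarray(reference, dtype=float)
    m = X.shape[1]

    # Step 1: shift every member to the lag that maximizes its
    # cross-correlation with the reference (no shift for an all-zero reference).
    aligned = np.empty_like(X)
    for i, x in enumerate(X):
        if ref.any():
            s = int(np.argmax(np.correlate(ref, x, mode="full"))) - (m - 1)
        else:
            s = 0
        aligned[i] = shift_zero_pad(x, s)

    # Step 2: maximize sum_i (x_i . mu)^2 over zero-mean, unit-norm mu.
    Q = np.eye(m) - np.ones((m, m)) / m  # centering matrix
    M = Q @ aligned.T @ aligned @ Q
    _, eigvecs = np.linalg.eigh(M)       # eigenvalues come back in ascending order
    mu = eigvecs[:, -1]                  # eigenvector of the largest eigenvalue

    # An eigenvector's sign is arbitrary; orient it like the aligned members.
    if mu @ aligned.mean(axis=0) < 0:
        mu = -mu
    return (mu - mu.mean()) / mu.std()


def z_norm(x):
    x = np.asarray(x, dtype=float)
    return (x - x.mean()) / x.std()


members = [z_norm(v) for v in ([0, 0, 1, 2, 1, 0, 0, 0],
                               [0, 0, 0, 1, 2, 1, 0, 0],
                               [0, 1, 2, 1, 0, 0, 0, 0])]
centroid = extract_shape(members, reference=members[0])
```

In this example the three members are the same bump at different positions; after alignment the centroid keeps a single sharp peak, whereas a plain arithmetic mean of the unaligned members would smear it.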
k-Shape’s Algorithm
Experimental Settings
- Datasets: largest public collection of annotated time-series datasets (UCR Archive), with 48 real and synthetic datasets
- Baselines for distance measures (Dist):
  - Euclidean Distance (ED): efficient yet accurate measure
  - Dynamic Time Warping (DTW): the currently best performing, but expensive, distance measure
  - Constrained Dynamic Time Warping (cDTW): constrained version of DTW with improved accuracy and efficiency
- Baselines for scalable clustering methods:
  - k-AVG+Dist: k-means with a distance measure Dist
  - KSC: k-means with pairwise scaling and shifting in distance and centroid
  - k-DBA: k-means with DTW and suitable centroid computation (i.e., DBA)
- Evaluation metrics:
  - For runtime: CPU time utilization
  - For distance measures: one nearest neighbor classification accuracy
  - For clustering methods: Rand Index
  - For statistical analysis: Wilcoxon test and Friedman test
- Citations on this slide: [Keogh et al. 2011] [Faloutsos et al., SIGMOD 1994] [Keogh, VLDB 2002] [Keogh, VLDB 2002]
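For reference, the Rand Index used for the clustering methods is the fraction of pairs of time series on which two labelings agree, that is, pairs placed together in both or apart in both. A minimal sketch, with the hypothetical function name `rand_index` and plain Python lists as inputs:

```python
from itertools import combinations


def rand_index(labels_true, labels_pred):
    """Fraction of item pairs that both labelings treat the same way."""
    if len(labels_true) != len(labels_pred):
        raise ValueError("both labelings must cover the same items")
    agree = 0
    pairs = 0
    for i, j in combinations(range(len(labels_true)), 2):
        same_true = labels_true[i] == labels_true[j]
        same_pred = labels_pred[i] == labels_pred[j]
        agree += same_true == same_pred  # both together or both apart
        pairs += 1
    return agree / pairs


print(rand_index([0, 0, 1, 1], [1, 1, 0, 0]))  # 1.0: same grouping, labels renamed
```

The measure depends only on which items are grouped together, so the cluster numbering produced by a clustering method does not need to match the class labels.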
SBD Against State-of-the-Art Distance Measures
Results are relative to ED. LB subscript: Keogh’s lower bounding; 5, 10, and opt superscripts: 5%, 10%, and optimal % of time-series length.

Distance Measure | > | = | < | Better | Average Accuracy | Runtime
DTW / DTW_LB | 29 | 2 | 17 | ✔ | 0.788 | 15573x / 6040x
cDTW^opt | 31 | | 15 | | 0.814 | 2873x / 322x
cDTW^5 | 34 | 3 | 11 | | 0.809 | 1558x / 122x
cDTW^10 | 33 | 1 | 14 | | 0.804 | 2940x / 364x
SBD_NoFFT / SBD_NoPow2 / SBD | 30 | 12 | 6 | | 0.795 | 224x / 8.7x / 4.4x

- Accuracy:
  - SBD significantly outperforms ED
  - SBD is as competitive as DTW and cDTW
- Efficiency: SBD is one and two orders of magnitude faster than cDTW and DTW, respectively
k-Shape Against Scalable Clustering Methods
Results are relative to k-AVG+ED.

Algorithm | > | = | < | Better / Worse | Rand Index | Runtime
k-AVG+SBD | 32 | 1 | 15 | ✖ | 0.745 | 3.6x
k-AVG+DTW | 10 | | 38 | ✔ | 0.584 | 3444x
KSC | 22 | | 26 | | 0.636 | 448x
k-DBA | 18 | | 30 | | 0.733 | 3892x
k-Shape+DTW | 19 | | 28 | | 0.698 | 4175x
k-Shape | 36 | | 11 | | 0.772 | 12.4x

- Accuracy:
  - k-Shape is the only scalable method that outperforms k-AVG+ED
  - k-Shape outperforms both KSC and k-DBA
- Efficiency: k-Shape is one and two orders of magnitude faster than KSC and k-DBA, respectively
Shape-Based Time-Series Clustering: Conclusion
- k-Shape outperforms all state-of-the-art partitional, hierarchical, and spectral clustering approaches except one
  - That method achieves similar performance, but it is two orders of magnitude slower than k-Shape, and its distance measure requires tuning, unlike k-Shape's
- Overall, k-Shape is a domain-independent, accurate, and scalable approach for time-series clustering
Thank you!