Accelerating Dynamic Time Warping Clustering with a Novel Admissible Pruning Strategy. Nurjahan Begum, Liudmila Ulanova, Jun Wang¹, Eamonn Keogh. University of California, Riverside; ¹University of Texas at Dallas
Outline Introduction, Related Work & Background Density Peaks (DP) Clustering Algorithm Pruning Using DTW Boundings Going Anytime: Distance Computation-Ordering Heuristic Experimental Evaluation
Problem Description The problem this work addresses is robustly clustering large time series datasets with invariance to irrelevant data. Four requirements: accuracy; invariance to irrelevant data; scalability (efficiency and interruptibility); robustness to parameter settings.
Accuracy: Using DTW For most time series data mining algorithms, the quality of the output depends almost exclusively on the distance measure used. A consensus has emerged that Dynamic Time Warping (DTW) is the best measure in most domains, almost always outperforming the Euclidean Distance (ED). Do DTW and ED converge as data sizes increase? – Not for clustering!
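A minimal pure-Python sketch of the two measures, for illustration only. The function names, the squared-difference point cost, and the optional `window` argument (a warping constraint) are assumptions; the slides do not specify an implementation.

```python
import math


def euclidean(a, b):
    """Euclidean Distance (ED) between two equal-length series."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def dtw(a, b, window=None):
    """Dynamic Time Warping distance via the classic O(n*m) dynamic program.

    window limits how far the warping path may stray from the diagonal;
    None means unconstrained (an assumption of this sketch).
    """
    n, m = len(a), len(b)
    w = max(n, m) if window is None else max(window, abs(n - m))
    inf = float("inf")
    # cost[i][j] = cheapest cumulative cost of aligning a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(max(1, i - w), min(m, i + w) + 1):
            d = (a[i - 1] - b[j - 1]) ** 2
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return math.sqrt(cost[n][m])
```

With a = [0, 0, 1, 2, 1, 0, 0] and b = [0, 1, 2, 1, 0, 0, 0] (the same bump shifted by one step), `dtw(a, b)` is 0.0 while `euclidean(a, b)` is 2.0. With `window=0` the warping path is forced onto the diagonal and DTW reduces to ED.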
Invariance to Irrelevant Data: Using DP It has been suggested that successful clustering of time series requires the ability to ignore some data objects. Anomalous objects are themselves unclusterable, and they interfere with the clustering of clusterable data. Density Peaks (DP), in contrast to clustering algorithms such as K-means, can ignore anomalous objects.
Efficiency: Pruning Using Both Bounds Both DTW and DP are slow. – CPU constrained, not I/O constrained. In some problems (notably similarity search), lower-bound pruning is the main technique used to produce speedup, and its effectiveness tends to improve on large datasets. It is not effective in clustering, which needs the distance between all pairs, or at least all distances within a certain range. Also, because DTW is not a metric, it is hard to build an index to speed it up. This work exploits both lower and upper bounds of DTW in the framework of DP.
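The slides do not name the two bounds. As an illustration, the sketch below assumes an LB_Keogh-style envelope as the lower bound and the Euclidean distance as the upper bound, for equal-length series. The function names and the `window` parameter are hypothetical. The lower bound holds for DTW constrained to the same window and using the same squared-difference point cost.

```python
import math


def euclidean_upper_bound(q, c):
    """ED of two equal-length series.

    The diagonal is one valid warping path, so DTW can never exceed ED.
    """
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(q, c)))


def lb_keogh(q, c, window):
    """LB_Keogh-style lower bound for DTW constrained to +/- window.

    Builds an upper/lower envelope around c and charges q only for the
    parts that fall outside the envelope.
    """
    total = 0.0
    n = len(c)
    for i, x in enumerate(q):
        lo, hi = max(0, i - window), min(n, i + window + 1)
        upper, lower = max(c[lo:hi]), min(c[lo:hi])
        if x > upper:
            total += (x - upper) ** 2
        elif x < lower:
            total += (lower - x) ** 2
    return math.sqrt(total)
```

Both bounds cost O(n) per pair (for a small window), against O(n²) for unconstrained DTW, which is why deciding a pair from its bounds alone saves CPU time.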
Interruptibility: Going Anytime What if the pruning is still not sufficient? - User interruption - This work further adapts the proposed method into an anytime algorithm. Anytime algorithms can return a valid solution to a problem even if interrupted before completion. Desirable properties: small setup time; best-so-far answer; monotonicity and diminishing returns.
Robustness to Parameter Settings: Using DP Many clustering algorithms require the user to set many parameters. DP requires only two. Moreover, both are relatively intuitive, and the results are not particularly sensitive to the values chosen.
Internal Logic & Required Parameters of DP The DP algorithm assumes that cluster centers are surrounded by neighbors of lower local density and are at a relatively large distance from any point with a higher local density. For a point i, the Local Density ρ_i is the number of points that are closer to it than some cutoff distance d_c; the Distance from Points of Higher Density δ_i is the minimum distance from point i to any point of higher density. The DP algorithm requires two pre-set parameters: the cutoff distance d_c, and the number of clusters k (which can be determined in a knee-finding manner). See Rodriguez, A., & Laio, A. Clustering by Fast Search and Find of Density Peaks. Science, 344(6191), for more!
Four Phases of DP (1) Local Density Calculation; (2) Distance to Higher Density Points Computation; (3) Cluster Center Selection; (4) Cluster Assignment
Phase 1: Local Density Calculation
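A minimal sketch of this phase, assuming a precomputed all-pairs distance matrix `dist` (a list of lists) and the cutoff distance passed as `dc`. The function name is hypothetical.

```python
def local_density(dist, dc):
    """rho[i] = number of other points closer to point i than the cutoff dc."""
    n = len(dist)
    return [
        sum(1 for j in range(n) if j != i and dist[i][j] < dc)
        for i in range(n)
    ]


# Toy 1-D example: two tight groups and one far-away outlier.
pts = [0, 1, 2, 10, 11, 12, 50]
dist = [[abs(a - b) for b in pts] for a in pts]
rho = local_density(dist, 1.5)  # [1, 2, 1, 1, 2, 1, 0]
```

The outlier at 50 has density 0, which is what later lets it be recognized as an unlikely cluster center.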
Phase 2: Distance to Higher Density Points Computation
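A sketch of this phase under stated assumptions: `dist` is a full distance matrix, density ties are broken by index (a choice made here, not taken from the slides), and the densest point receives its maximum distance to any other point, following the convention of the original DP paper. The function name is hypothetical.

```python
def nn_higher_density(dist, rho):
    """For each point, distance to (and index of) its nearest neighbor
    among the points of higher density.

    Returns (delta, nn); nn is -1 for the densest point.
    """
    n = len(rho)
    # Rank points from densest to sparsest; ties broken by index (assumption).
    order = sorted(range(n), key=lambda i: (-rho[i], i))
    delta = [0.0] * n
    nn = [-1] * n
    top = order[0]
    delta[top] = max(dist[top][j] for j in range(n) if j != top)
    for rank in range(1, n):
        i = order[rank]
        j = min(order[:rank], key=lambda c: dist[i][c])
        delta[i], nn[i] = dist[i][j], j
    return delta, nn
```

On the toy points [0, 1, 2, 10, 11, 12, 50] with ρ = [1, 2, 1, 1, 2, 1, 0], this gives δ = [1, 49, 1, 1, 10, 1, 38]: the two group middles and the outlier stand out with large δ.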
Phase 3: Cluster Center Selection The cluster centers are selected using a simple heuristic: points with higher values of ρ_i × δ_i are more likely to be centers.
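The heuristic can be sketched as a top-k selection on the product. Taking exactly the k largest products, and breaking ties by index, are assumptions of this sketch; the function name is hypothetical.

```python
def select_centers(rho, delta, k):
    """Return the indices of the k points with the largest rho * delta."""
    ranked = sorted(range(len(rho)), key=lambda i: (-rho[i] * delta[i], i))
    return ranked[:k]
```

For example, with ρ = [1, 2, 1, 1, 2, 1, 0] and δ = [1, 49, 1, 1, 10, 1, 38], the products are [1, 98, 1, 1, 20, 1, 0], so `select_centers(rho, delta, 2)` returns [1, 4]. The last point has a large δ but zero density, so its product is 0 and it is never picked as a center.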
Phase 4: Cluster Assignment
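In the original DP formulation, each non-center point inherits the cluster of its nearest neighbor of higher density. A minimal sketch, assuming `nn[i]` holds that neighbor's index (-1 for the densest point) and that density ties were broken by index when `nn` was built. The function name is hypothetical, and the sketch labels every object rather than setting outliers aside.

```python
def assign_clusters(rho, nn, centers):
    """Propagate cluster labels from centers down the density ranking."""
    label_of_center = {c: k for k, c in enumerate(centers)}
    labels = [-1] * len(rho)
    # Visit points from densest to sparsest so that nn[i] is already labeled.
    for i in sorted(range(len(rho)), key=lambda i: (-rho[i], i)):
        if i in label_of_center:
            labels[i] = label_of_center[i]
        elif nn[i] >= 0:
            labels[i] = labels[nn[i]]
    return labels
```

A single pass suffices because every point's higher-density neighbor is visited before the point itself.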
Why DP? It can ignore outliers. It can handle datasets whose clusters have arbitrary shapes. It has few user-set parameters, with low sensitivity to them. It is amenable to distance-computation pruning and to conversion into an anytime algorithm.
Pruning Using DTW Bounds The proposed algorithm, TADPole (Time-series Anytime DP), requires distance computations in two phases: Phase 1, local density computation; Phase 2, distance to higher density points computation (NN distance computation).
Pruning in the Local Density Computation Phase
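One way bounds can settle a pair without computing DTW, sketched under assumptions: `lb` and `ub` are precomputed matrices of lower and upper bounds on the true distance, and `dist_fn(i, j)` is a hypothetical callback that computes the exact (expensive) distance. The slides do not spell out the exact case analysis, so this may differ from the authors' version.

```python
def local_density_pruned(n, dc, lb, ub, dist_fn):
    """Local densities with bound-based pruning.

    Returns (rho, computed), where computed counts exact distance calls.
    """
    rho = [0] * n
    computed = 0
    for i in range(n):
        for j in range(i + 1, n):
            if ub[i][j] < dc:
                within = True        # certainly inside the cutoff
            elif lb[i][j] >= dc:
                within = False       # certainly outside the cutoff
            else:
                within = dist_fn(i, j) < dc  # bounds are inconclusive
                computed += 1
            if within:
                rho[i] += 1
                rho[j] += 1
    return rho, computed
```

Because the bounds are admissible, the densities are identical to the brute-force result; only the number of exact computations changes.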
Pruning in the NN Distance Computation Phase
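A hedged sketch of how both bounds could prune the nearest-neighbor search, not necessarily the authors' procedure. It assumes at least two objects, precomputed bound matrices `lb` and `ub`, a hypothetical exact-distance callback `dist_fn(i, j)`, and density ties broken by index. As a further assumption, the densest object is given its largest upper bound instead of an exact maximum distance.

```python
def nn_distance_pruned(rho, lb, ub, dist_fn):
    """Nearest higher-density neighbor search with bound-based pruning.

    Returns (delta, nn, computed); computed counts exact distance calls.
    """
    n = len(rho)
    order = sorted(range(n), key=lambda i: (-rho[i], i))
    delta, nn, computed = [0.0] * n, [-1] * n, 0
    top = order[0]
    # Assumption: no exact distances are spent on the densest object.
    delta[top] = max(ub[top][j] for j in range(n) if j != top)
    for rank in range(1, n):
        i = order[rank]
        candidates = order[:rank]
        # The true NN distance cannot exceed the smallest upper bound.
        threshold = min(ub[i][j] for j in candidates)
        best = float("inf")
        for j in sorted(candidates, key=lambda c: lb[i][c]):
            if lb[i][j] > threshold:
                break  # this and all remaining candidates are provably farther
            d = dist_fn(i, j)
            computed += 1
            if d < best:
                best, nn[i] = d, j
                threshold = min(threshold, d)
        delta[i] = best
    return delta, nn, computed
```

Visiting candidates in ascending lower-bound order means the first candidate that fails the test ends the search for that object.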
Multidimensional Time Series Clustering Independent calculation → Summation
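Read literally, "independent calculation, then summation" can be sketched as below. The function name and the `measure` parameter (any per-dimension function such as DTW, a lower bound, or an upper bound) are assumptions of this sketch.

```python
def independent_sum(x, y, measure):
    """Apply measure to each dimension separately and add up the results.

    x and y are multidimensional series given as lists of 1-D series,
    one per dimension.
    """
    return sum(measure(xd, yd) for xd, yd in zip(x, y))
```

The same helper works for the bounds: a sum of per-dimension lower bounds is a lower bound on the summed distance, and likewise for upper bounds, so the pruning logic carries over unchanged.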
Multidimensional Time Series Clustering
Pruning Effectiveness: Baselines Brute force: the all-pairs distance matrix is computed. Oracle (post-hoc): only the necessary distance computations are performed. In the local density calculation phase, only the distance computations that contribute to the actual density of an object are counted; in the NN distance calculation phase, only the actual NN distances are counted.
Pruning Effectiveness: Illustration Dataset: StarLightCurves
Going Anytime: Which Phase is Amenable? TADPole requires distance computations in two phases. Phase 1, local density computation: not amenable to anytime ordering; it is treated as setup time. Phase 2, NN distance computation: amenable to anytime ordering!
Going Anytime: Contestants Oracle: at each step of the algorithm, this ordering cheats by choosing the object that maximizes the current Rand Index. Top-to-bottom, left-to-right? Too brittle to “luck”! Random ordering: less brittle to luck. The proposed heuristic: ρ × ub
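A sketch of how the ρ × ub ordering might drive the anytime phase. Several things here are assumptions rather than statements from the slides: `delta_ub[i]` is taken to be an upper bound on object i's NN distance, objects are refined in descending order of the product, and `budget` stands in for a user interruption. All names are hypothetical.

```python
def anytime_order(rho, delta_ub):
    """Objects sorted by descending rho * ub, ties broken by index."""
    return sorted(range(len(rho)), key=lambda i: (-rho[i] * delta_ub[i], i))


def refine_anytime(rho, delta_ub, exact_delta_fn, budget):
    """Replace upper bounds with exact NN distances for the first `budget`
    objects in the heuristic order and return the best-so-far delta list.
    """
    delta = list(delta_ub)
    for i in anytime_order(rho, delta_ub)[:budget]:
        delta[i] = exact_delta_fn(i)
    return delta
```

The intuition behind descending order is that objects with a large product are the likeliest cluster centers, so pinning down their exact δ first should change the clustering the most per unit of computation.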
Going Anytime: Effectiveness Illustration
Clustering Quality & Efficiency Evaluation TADPole is at least an order of magnitude faster than the rival methods.
Parameter Sensitivity Evaluation Performed on Symbols dataset with k = 6
Conclusions & Comments Pruning using both bounds Anytime algorithm More borrowing than originating!