Accelerating Dynamic Time Warping Clustering with a Novel Admissible Pruning Strategy. Nurjahan Begum, Liudmila Ulanova, Jun Wang, Eamonn Keogh.


Accelerating Dynamic Time Warping Clustering with a Novel Admissible Pruning Strategy. Nurjahan Begum, Liudmila Ulanova, Jun Wang¹, Eamonn Keogh. University of California, Riverside; ¹University of Texas at Dallas

Outline Introduction, Related Work & Background Density Peaks (DP) Clustering Algorithm Pruning Using DTW Bounds Going Anytime: Distance Computation-Ordering Heuristic Experimental Evaluation


Problem Description This work addresses the robust clustering of large time series datasets, with four desiderata: Accuracy Invariance to irrelevant data Scalability (efficiency, interruptibility) Robustness to parameter settings

Accuracy: The Use of DTW For most time series data mining algorithms, the quality of the output depends almost exclusively on the distance measure used. A consensus has emerged that DTW is the best measure in most domains, almost always outperforming the Euclidean Distance (ED). Do DTW and ED converge for increasing data sizes? – Not for clustering!
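As a point of reference, the two measures can be sketched in a few lines. This is a minimal illustration, not the authors' optimized code: real implementations add a warping window and early abandoning.

```python
import numpy as np

def euclidean(x, y):
    """Euclidean distance between two equal-length series."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    return float(np.sqrt(np.sum((x - y) ** 2)))

def dtw(x, y):
    """Classic O(n*m) dynamic-programming DTW, no warping window."""
    n, m = len(x), len(y)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = (x[i - 1] - y[j - 1]) ** 2
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return float(np.sqrt(D[n, m]))
```

On a pair of series that differ only by a small shift, DTW recovers the alignment (distance 0) while ED is penalized at every shifted point; this invariance is what the accuracy claim rests on.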

Invariance to Irrelevant Data: The Use of DP It has been suggested that the successful clustering of time series requires the ability to ignore some data objects. Anomalous objects are themselves unclusterable, and they interfere with the clustering of the clusterable data. DP, in contrast to clustering algorithms such as K-means, can ignore anomalous objects.

Efficiency: Pruning Using Both Bounds Both DTW and DP are slow. – The task is CPU-bound, not I/O-bound. In some problems (notably similarity search), lower-bound pruning is the main technique used to produce speedup, and its effectiveness tends to improve on larger datasets. It is less effective for clustering, which needs the distances between all pairs, or at least all distances within a certain range. Moreover, because DTW is non-metric, it is hard to build an index to speed it up. This work exploits both the lower and upper bounds of DTW within the DP framework.
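The bounds themselves can be sketched as follows. The transcript does not spell out which bounds are used; this sketch assumes the standard LB_Keogh lower bound. (The Euclidean distance is always a valid upper bound on DTW for equal-length series, since the diagonal alignment is one admissible warping path.)

```python
import numpy as np

def lb_keogh(q, c, r):
    """LB_Keogh: lower bound on DTW under a Sakoe-Chiba band of radius r.
    Sums how far each point of q falls outside the envelope of c."""
    q, c = np.asarray(q, float), np.asarray(c, float)
    total = 0.0
    for i, qi in enumerate(q):
        window = c[max(0, i - r): i + r + 1]
        lo, hi = window.min(), window.max()
        if qi > hi:
            total += (qi - hi) ** 2    # above the upper envelope
        elif qi < lo:
            total += (lo - qi) ** 2    # below the lower envelope
    return float(np.sqrt(total))
```

With r = 0 the envelope collapses to c itself and LB_Keogh equals ED; for any r, lb_keogh(q, c, r) ≤ DTW(q, c) ≤ ED(q, c), the sandwich that the pruning strategy exploits.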

Interruptibility: Going Anytime What if the pruning is still not sufficient, or the user interrupts? This work further adapts the proposed method into an anytime algorithm. Anytime algorithms are algorithms that can return a valid solution to a problem even if interrupted before completion. Desirable properties: a small setup time; a best-so-far answer; monotonicity and diminishing returns.

Robustness to Parameter Settings: The Use of DP Many clustering algorithms require the user to set many parameters; DP requires only two. Moreover, they are relatively intuitive and not particularly sensitive to user choice.

Outline Introduction, Related Work & Background Density Peaks (DP) Clustering Algorithm Pruning Using DTW Bounds Going Anytime: Distance Computation-Ordering Heuristic Experimental Evaluation

Internal Logic & Required Parameters of DP The DP algorithm assumes that cluster centers are surrounded by neighbors of lower local density, and lie at a relatively large distance from any point of higher local density. For a point i, the Local Density ρᵢ is the number of points closer to it than some cutoff distance d_c; the Distance from Points of Higher Density δᵢ is the minimum distance from point i to any point of higher density. The DP algorithm requires two pre-set parameters: The cutoff distance d_c The number of clusters k (which can be determined in a knee-finding manner) See Rodriguez, A., & Laio, A. Clustering by Fast Search and Find of Density Peaks. Science, 344(6191), for more!

Four Phases of DP Local Density Calculation Distance to Higher Density Points Computation Cluster Center Selection Cluster Assignment

Phase 1: Local Density Calculation

Phase 2: Distance to Higher Density Points Computation

Phase 3: Cluster Center Selection The cluster centers are selected using a simple heuristic: points with higher values of ρᵢ × δᵢ are more likely to be centers.

Phase 4: Cluster Assignment
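The four phases above can be sketched end-to-end from a precomputed all-pair distance matrix. This is a minimal illustration of DP itself, without TADPole's pruning, and it assumes the densest point ends up among the selected centers (which holds in typical use):

```python
import numpy as np

def density_peaks(D, d_c, k):
    """Minimal Density Peaks clustering over an all-pair distance matrix D.
    Returns (center indices, cluster labels)."""
    n = D.shape[0]
    # Phase 1: local density = number of points within the cutoff d_c
    rho = (D < d_c).sum(axis=1) - 1              # subtract 1 for the point itself
    # Phase 2: delta = distance to the nearest point of higher (or equal) density
    order = np.argsort(-rho)                     # indices by decreasing density
    delta = np.zeros(n)
    nn_higher = np.zeros(n, dtype=int)
    delta[order[0]] = D[order[0]].max()          # densest point gets the max distance
    for pos in range(1, n):
        i = order[pos]
        earlier = order[:pos]                    # all denser-or-equal points
        j = earlier[np.argmin(D[i, earlier])]
        delta[i], nn_higher[i] = D[i, j], j
    # Phase 3: centers = the k points with the largest rho * delta
    centers = np.argsort(-(rho * delta))[:k]
    # Phase 4: each remaining point inherits the label of its nearest
    # higher-density neighbor, swept in decreasing-density order
    labels = np.full(n, -1)
    labels[centers] = np.arange(k)
    for i in order:
        if labels[i] < 0:
            labels[i] = labels[nn_higher[i]]
    return centers, labels
```

Note that only the density ordering and the per-point nearest higher-density neighbor are needed for assignment; this is what makes the later bound-based pruning possible.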

Why DP? Capability of ignoring outliers. Capability of handling datasets whose clusters form arbitrary shapes. Few user-set parameters, with low sensitivity to their values. Amenability to distance-computation pruning and to conversion into an anytime algorithm.

Outline Introduction, Related Work & Background Density Peaks (DP) Clustering Algorithm Pruning Using DTW Bounds Going Anytime: Distance Computation-Ordering Heuristic Experimental Evaluation

Pruning Using DTW Bounds The proposed algorithm, TADPole (Time-series Anytime DP), requires distance computations in the following two phases: Phase 1: local density computation Phase 2: distance to higher density points computation (NN distance computation)

Pruning in the Local Density Computation Phase
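The figure for this slide is not in the transcript; the idea, sketched below with hypothetical lb/ub/dtw callables, is that the density count only needs to know whether each pairwise distance falls below d_c. Whenever the upper bound is already below d_c, or the lower bound is already above it, the expensive DTW call can be skipped admissibly:

```python
def density_with_pruning(objects, d_c, lb, ub, dtw):
    """Local density counting with admissible bound-based pruning.
    lb/ub must satisfy lb(a, b) <= dtw(a, b) <= ub(a, b)."""
    n = len(objects)
    rho = [0] * n
    full_computations = 0
    for i in range(n):
        for j in range(i + 1, n):
            a, b = objects[i], objects[j]
            if ub(a, b) < d_c:
                within = True                    # certainly within the cutoff
            elif lb(a, b) > d_c:
                within = False                   # certainly outside the cutoff
            else:
                within = dtw(a, b) < d_c         # bounds inconclusive: compute
                full_computations += 1
            if within:
                rho[i] += 1
                rho[j] += 1
    return rho, full_computations
```

The pruning is admissible: the resulting densities are exactly those of the brute-force computation, only cheaper.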

Pruning in the NN Distance Computation Phase
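A sketch of this phase with hypothetical helper signatures: while scanning the higher-density candidates, any candidate whose lower bound cannot beat the best-so-far NN distance is skipped without computing DTW:

```python
def nn_among_higher(x, candidates, lb, dtw):
    """NN distance from object x to its higher-density candidates,
    skipping the exact DTW whenever the lower bound cannot beat best-so-far."""
    best, nn, pruned = float("inf"), None, 0
    for c in candidates:
        if lb(x, c) >= best:
            pruned += 1                          # admissibly skipped
            continue
        d = dtw(x, c)
        if d < best:
            best, nn = d, c
    return nn, best, pruned
```

As in similarity search, a tight lower bound and a good early best-so-far together determine how much gets pruned.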

Multidimensional Time Series Clustering Independent calculation per dimension → summation of the per-dimension distances

Multidimensional Time Series Clustering
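The "independent calculation → summation" scheme can be sketched as follows, where dtw_1d stands for any single-dimension DTW function:

```python
def dtw_independent(X, Y, dtw_1d):
    """'Independent' multidimensional DTW: warp each dimension separately
    and sum the per-dimension distances. X and Y are lists of 1-D series,
    one series per dimension."""
    assert len(X) == len(Y), "both objects need the same number of dimensions"
    return sum(dtw_1d(xd, yd) for xd, yd in zip(X, Y))
```

Because the total is a plain sum, per-dimension lower and upper bounds also sum into valid bounds on the multidimensional distance, so the same pruning machinery applies.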

Pruning Effectiveness: Baselines Brute force: the entire all-pair distance matrix is computed. Oracle (post-hoc): only the necessary distance computations are performed. Local density phase: only the computations contributing to an object's actual density. NN distance phase: only the actual NN distances.

Pruning Effectiveness: Illustration Dataset: StarLightCurves

Outline Introduction, Related Work & Background Density Peaks (DP) Clustering Algorithm Pruning Using DTW Bounds Going Anytime: Distance Computation-Ordering Heuristic Experimental Evaluation

Going Anytime: Which Phase is Amenable? TADPole requires distance computations in the following two phases: Phase 1: local density computation - not amenable to anytime ordering: it constitutes the setup time Phase 2: NN distance computation - amenable to anytime ordering!

Going Anytime: Contestants Oracle: at each step, this ordering cheats by choosing the object that maximizes the current Rand Index. A fixed top-to-bottom, left-to-right ordering? Too dependent on luck! Random ordering: less dependent on luck. The proposed heuristic: order by ρ × ub.
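The proposed heuristic can be sketched as a simple ordering function; ub_nn here stands for a per-object upper bound on the NN distance, a name assumed for this illustration:

```python
def anytime_order(rho, ub_nn):
    """Order objects for the anytime NN phase by rho * (upper bound on the
    NN distance), largest first: dense objects whose NN distance is still
    very uncertain are resolved early, since they can move the answer most."""
    scores = [(r * u, i) for i, (r, u) in enumerate(zip(rho, ub_nn))]
    return [i for _, i in sorted(scores, reverse=True)]
```

The point of the heuristic is that an interruption after processing the first few objects already pins down the distances that matter most to the clustering.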

Going Anytime: Effectiveness Illustration

Outline Introduction, Related Work & Background Density Peaks (DP) Clustering Algorithm Pruning Using DTW Bounds Going Anytime: Distance Computation-Ordering Heuristic Experimental Evaluation

Clustering Quality & Efficiency Evaluation TADPole is at least an order of magnitude faster than the rival methods.

Parameter Sensitivity Evaluation Performed on the Symbols dataset with k = 6

Conclusions & Comments Pruning using both bounds. An anytime algorithm. More borrowing than originating!