k-Shape: Efficient and Accurate Clustering of Time Series


k-Shape: Efficient and Accurate Clustering of Time Series. John Paparrizos and Luis Gravano, SIGMOD 2015.

Time Series: Sequentially Collected Observations

Time series are ubiquitous and abundant in many disciplines:
- Medicine: electrocardiogram (ECG), i.e., voltage over time, showing the atria activation, the ventricle activation, and the recovery wave
- Engineering: human gaits, i.e., ground force over time [Weyand et al., Journal of Applied Physiology 2000]

Time-Series Analysis: Similar Challenges Across Tasks

Popular time-series analysis tasks: querying, classification, and clustering (our focus).
- Manipulation of time series is challenging and time-consuming
- Operations must handle data distortions, noise, missing data, high dimensionality, …
- Each domain has different requirements and needs
- The choice of distance measure is critical for effective time-series analysis

[Figure: querying matches a query series against a database; classification assigns a new time series to Class A or B; clustering groups input series into output clusters]

Shape-Based Clustering: Group Series with Similar Patterns

Input: a set of time series. Output: k time-series clusters (e.g., a cylinder cluster, a bell cluster, and a funnel cluster).

Group time series into clusters based on their shape similarity, i.e., regardless of any differences in amplitude and phase:
- Ignore differences in amplitude, i.e., offer scaling and translation invariances
- Ignore differences in phase, i.e., offer shift invariance

[Figure: ignoring phase reduces the distance between two similarly shaped series from 483 to 37; ignoring amplitude reduces it from 694 to 17]
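The scaling and translation invariances above are typically obtained by z-normalizing each series before comparison. A minimal sketch in plain NumPy (names are illustrative, not from the paper):

```python
import numpy as np

def zscore(x):
    """Z-normalize a series: subtract the mean, divide by the standard deviation."""
    x = np.asarray(x, dtype=float)
    return (x - x.mean()) / x.std()

x = np.sin(np.linspace(0, 2 * np.pi, 100))
y = 5 * x + 20  # same shape, different amplitude (scale) and offset (translation)

print(np.linalg.norm(x - y))                  # large raw Euclidean distance
print(np.linalg.norm(zscore(x) - zscore(y)))  # ~0: scale and offset removed
```

Z-normalization handles amplitude; shift invariance additionally requires comparing the series across alignments, which the cross-correlation measure below provides.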

Shape-Based Clustering: k-Means

Objective: find the partition P* that minimizes the within-cluster sum of squared distances between time series and centroids.

k-Means finds a locally optimal solution by iteratively performing two steps:
- Assignment step: assign each time series to the cluster of its nearest centroid
- Refinement step: update the centroids to reflect the changes in cluster membership

The centroid computation finds the time series that minimizes the sum of squared distances to all other time series in the cluster.

Requirements: a distance measure and a centroid computation method.
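The two-step loop can be sketched generically, with the two requirements (distance measure and centroid method) passed in as parameters. This is an illustrative sketch, not the authors' implementation:

```python
import numpy as np

def k_means(X, k, dist, centroid_fn, n_iter=100, seed=0):
    """Generic k-means over an (n, m) array of n time series of length m.

    dist(a, b): distance between two series.
    centroid_fn(members): centroid of a cluster's member series.
    """
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    labels = np.full(len(X), -1)
    for _ in range(n_iter):
        # Assignment step: each series goes to its nearest centroid.
        new_labels = np.array([np.argmin([dist(x, c) for c in centroids])
                               for x in X])
        if np.array_equal(new_labels, labels):
            break  # converged: no membership changes
        labels = new_labels
        # Refinement step: recompute centroids from the new memberships.
        for j in range(k):
            members = X[labels == j]
            if len(members):
                centroids[j] = centroid_fn(members)
    return labels, centroids
```

With `dist` as the Euclidean distance and `centroid_fn` as the arithmetic mean this is ordinary k-means; k-Shape keeps the same loop but swaps in both pieces.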

Shape-Based Clustering: Existing Approaches

Choice of distance measure:
- Euclidean Distance (ED)
- Dynamic Time Warping (DTW)

Choice of centroid computation method:
- Arithmetic mean of the coordinates of the time series (AVG)
- Non-linear alignment and averaging filters (NLAAF)
- Prioritized shape averaging (PSA)
- Dynamic time warping barycenter averaging (DBA)

Issue with existing approaches: they cannot scale, as they rely on expensive methods (e.g., DTW).

k-Shape: A Novel Instantiation of k-Means for Efficient Shape-Based Clustering

k-Shape accounts for the shapes of time series during clustering: it is a scale-, translation-, and shift-invariant clustering method.
- Distance measure: a normalized version of the cross-correlation measure
- Centroid computation: a novel method based on that distance measure

[Figure: the correlation across all possible shifts of two series; for the same input, the k-Shape centroid preserves the shape of the series, whereas the k-means centroid does not]

k-Shape’s Scale-, Translation-, and Shift-Invariant Distance Measure

Cross-correlation measure: keep one sequence static and slide the other over it, computing the correlation (i.e., the inner product) at each shift.

Intuition: determine the shift that exhibits the maximum correlation.
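The slide-and-correlate idea can be written down naively as follows (an illustrative sketch; the shift convention is one common choice). This loop is O(m²); the FFT mentioned on the next slide brings it down to O(m log m):

```python
import numpy as np

def cross_correlation_all_shifts(x, y):
    """Inner product of a static x with y slid across every shift
    s in [-(m-1), m-1]; returns 2m - 1 correlation values."""
    m = len(x)
    cc = np.empty(2 * m - 1)
    for i, s in enumerate(range(-(m - 1), m)):
        if s >= 0:
            cc[i] = np.dot(x[s:], y[:m - s])   # y shifted right by s
        else:
            cc[i] = np.dot(x[:m + s], y[-s:])  # y shifted left by |s|
    return cc

x = np.array([0., 0., 1., 2., 1.])
y = np.array([1., 2., 1., 0., 0.])  # same bump, two positions earlier
cc = cross_correlation_all_shifts(x, y)
print(cc.argmax() - (len(x) - 1))   # 2: maximum correlation at shift +2
```

The shift at the argmax is exactly the "shift that exhibits the maximum correlation" from the slide's intuition.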

k-Shape’s Scale-, Translation-, and Shift-Invariant Distance Measure (cont.)

Key issue: the time series and the cross-correlation sequence must be properly normalized; inadequate normalizations produce misleading results. (Example: for two perfectly aligned time series of length 1024, the cross-correlation should be maximized at shift 1024.)

Shape-Based Distance (SBD):
- Z-normalize the time series (i.e., mean = 0 and std = 1)
- Divide the cross-correlation sequence by the geometric mean of the autocorrelations of the two individual time series

A naïve implementation makes SBD appear as slow as DTW; SBD becomes very efficient with the use of the Fast Fourier Transform.
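Putting the two normalizations and the FFT together, SBD can be sketched as follows (illustrative NumPy code, not the authors' implementation; zero-padding the FFT to a power of two is a standard efficiency trick):

```python
import numpy as np

def zscore(x):
    x = np.asarray(x, dtype=float)
    return (x - x.mean()) / x.std()

def sbd(x, y):
    """Shape-Based Distance sketch: 1 - max of the coefficient-normalized
    cross-correlation, computed in O(m log m) with the FFT."""
    x, y = zscore(x), zscore(y)
    m = len(x)
    # Zero-pad to a power of two >= 2m - 1 so the circular correlation
    # contains all 2m - 1 linear cross-correlation values.
    n = 1 << (2 * m - 2).bit_length()
    cc = np.fft.irfft(np.fft.rfft(x, n) * np.conj(np.fft.rfft(y, n)), n)
    cc = np.concatenate((cc[-(m - 1):], cc[:m]))  # shifts -(m-1) .. m-1
    # Geometric mean of the lag-zero autocorrelations = ||x|| * ||y||.
    ncc = cc / (np.linalg.norm(x) * np.linalg.norm(y))
    return 1.0 - ncc.max()
```

Because of the z-normalization and the coefficient normalization, `sbd` ranges over [0, 2] and is 0 for any scaled and translated copy of the same shape.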

k-Shape: Coefficient Normalization

k-Shape’s Time-Series Centroid Computation Method

The centroid computation of k-means is problematic for misaligned time series. k-Shape performs two steps in its centroid computation:
- First step: align the time series towards a reference sequence
- Second step: compute the time series that maximizes the sum of squared correlations to all other time series in the cluster

[Figure: for the same input, the k-Shape centroid preserves the shape of the series, whereas the k-means centroid does not]
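The two steps can be sketched as follows. The maximization in the second step reduces to an eigenvector problem (a Rayleigh-quotient maximization over mean-centered data); this sketch assumes that reduction, and the helper names are illustrative. A natural choice of reference sequence is the cluster's current centroid:

```python
import numpy as np

def zscore(x):
    x = np.asarray(x, dtype=float)
    return (x - x.mean()) / x.std()

def align_to(y, ref):
    """Step 1: shift y to the position that maximizes its
    cross-correlation with the reference sequence (zero-padded)."""
    m = len(y)
    cc = np.correlate(zscore(ref), zscore(y), 'full')
    s = int(cc.argmax()) - (m - 1)
    out = np.zeros(m)
    if s >= 0:
        out[s:] = y[:m - s]
    else:
        out[:m + s] = y[-s:]
    return out

def shape_extraction(members, ref):
    """Step 2: the series maximizing the sum of squared correlations to
    the aligned members is the top eigenvector of the centered
    correlation matrix of the aligned members."""
    m = members.shape[1]
    A = np.array([align_to(x, ref) for x in members])
    Q = np.eye(m) - np.ones((m, m)) / m        # centering matrix
    M = Q @ (A.T @ A) @ Q
    centroid = np.linalg.eigh(M)[1][:, -1]     # eigenvector, largest eigenvalue
    # Eigenvectors have arbitrary sign; keep the orientation nearer the data.
    if np.linalg.norm(A[0] - centroid) > np.linalg.norm(A[0] + centroid):
        centroid = -centroid
    return zscore(centroid)
```

Unlike the arithmetic mean, this centroid is computed after alignment, so shifted copies of the same pattern reinforce rather than cancel each other.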

k-Shape’s Algorithm
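The full algorithm plugs the SBD measure and the shape-extraction centroid into the k-means loop. A compact, self-contained sketch (illustrative code under the assumptions above, not the authors' implementation; the optional `init` parameter is an addition for reproducible runs):

```python
import numpy as np

def zscore(x):
    x = np.asarray(x, dtype=float)
    return (x - x.mean()) / x.std()

def ncc(x, y):
    """Coefficient-normalized cross-correlation across all shifts."""
    xn, yn = zscore(x), zscore(y)
    return np.correlate(xn, yn, 'full') / (np.linalg.norm(xn) * np.linalg.norm(yn))

def sbd(x, y):
    return 1.0 - ncc(x, y).max()

def align_to(y, ref):
    """Shift y (zero-padded) to its best-correlating position against ref."""
    m = len(y)
    s = int(ncc(ref, y).argmax()) - (m - 1)
    out = np.zeros(m)
    if s >= 0:
        out[s:] = y[:m - s]
    else:
        out[:m + s] = y[-s:]
    return out

def shape_extraction(members, ref):
    """Centroid = top eigenvector of the centered correlation matrix."""
    m = members.shape[1]
    A = np.array([align_to(x, ref) for x in members])
    Q = np.eye(m) - np.ones((m, m)) / m
    c = np.linalg.eigh(Q @ (A.T @ A) @ Q)[1][:, -1]
    if np.linalg.norm(A[0] - c) > np.linalg.norm(A[0] + c):
        c = -c
    return zscore(c)

def k_shape(X, k, n_iter=20, seed=0, init=None):
    rng = np.random.default_rng(seed)
    X = np.array([zscore(x) for x in X])
    if init is not None:
        centroids = np.array(init, dtype=float)
    else:
        centroids = X[rng.choice(len(X), size=k, replace=False)].copy()
    labels = np.full(len(X), -1)
    for _ in range(n_iter):
        # Assignment step: nearest centroid under SBD.
        new = np.array([np.argmin([sbd(x, c) for c in centroids]) for x in X])
        if np.array_equal(new, labels):
            break
        labels = new
        # Refinement step: shape extraction per cluster.
        for j in range(k):
            if np.any(labels == j):
                centroids[j] = shape_extraction(X[labels == j], centroids[j])
    return labels, centroids
```

Every step is either an O(m log m)-capable correlation (here done naively for clarity) or a single eigendecomposition per cluster, which is where the method's scalability advantage over DTW-based clustering comes from.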

Experimental Settings

Datasets: the largest public collection of annotated time-series datasets (UCR Archive) – 48 real and synthetic datasets [Keogh et al. 2011]

Baselines for distance measures (Dist):
- Euclidean Distance (ED): efficient, yet accurate, measure [Faloutsos et al., SIGMOD 1994]
- Dynamic Time Warping (DTW): the currently best performing, but expensive, distance measure [Keogh, VLDB 2002]
- Constrained Dynamic Time Warping (cDTW): constrained version of DTW with improved accuracy and efficiency [Keogh, VLDB 2002]

Baselines for scalable clustering methods:
- k-AVG+Dist: k-means with the distance measure Dist
- KSC: k-means with pairwise scaling and shifting in the distance and centroid computations
- k-DBA: k-means with DTW and a suitable centroid computation (i.e., DBA)

Evaluation metrics:
- Runtime: CPU time utilization
- Distance measures: one-nearest-neighbor classification accuracy
- Clustering methods: Rand Index
- Statistical analysis: Wilcoxon test and Friedman test
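Of these metrics, the Rand Index is perhaps the least familiar: it is the fraction of object pairs on which two clusterings agree, i.e., pairs that are either grouped together in both or separated in both. A small sketch (illustrative code):

```python
from itertools import combinations

def rand_index(pred, truth):
    """Fraction of pairs treated the same way by both labelings:
    grouped together in both, or separated in both."""
    pairs = list(combinations(range(len(pred)), 2))
    agree = sum((pred[i] == pred[j]) == (truth[i] == truth[j])
                for i, j in pairs)
    return agree / len(pairs)

print(rand_index([0, 0, 1, 1], [1, 1, 0, 0]))  # 1.0: same partition, renamed labels
```

Because it compares pair relationships rather than raw labels, the Rand Index is insensitive to how cluster identifiers are numbered.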

SBD Against State-of-the-Art Distance Measures

Results are relative to ED over the 48 datasets. The LB subscript denotes Keogh's lower bounding; the 5, 10, and opt superscripts denote 5%, 10%, and the optimal percentage of the time-series length.

| Distance measure (variants)  | >  | =  | <  | Signif. better | Accuracy | Runtime (per variant) |
|------------------------------|----|----|----|----------------|----------|-----------------------|
| DTW / DTW_LB                 | 29 | 2  | 17 | ✔              | 0.788    | 15573x / 6040x        |
| cDTW^opt / cDTW^opt_LB       | 31 | 2  | 15 |                | 0.814    | 2873x / 322x          |
| cDTW^5 / cDTW^5_LB           | 34 | 3  | 11 |                | 0.809    | 1558x / 122x          |
| cDTW^10 / cDTW^10_LB         | 33 | 1  | 14 |                | 0.804    | 2940x / 364x          |
| SBD_NoFFT / SBD_NoPow2 / SBD | 30 | 12 | 6  |                | 0.795    | 224x / 8.7x / 4.4x    |

Accuracy: SBD significantly outperforms ED and is as competitive as DTW and cDTW.
Efficiency: SBD is one and two orders of magnitude faster than cDTW and DTW, respectively.

k-Shape Against Scalable Clustering Methods

Results are relative to k-AVG+ED over the 48 datasets.

| Algorithm   | >  | = | <  | Signif. better | Signif. worse | Rand Index | Runtime |
|-------------|----|---|----|----------------|---------------|------------|---------|
| k-AVG+SBD   | 32 | 1 | 15 | ✖              |               | 0.745      | 3.6x    |
| k-AVG+DTW   | 10 | 0 | 38 |                | ✔             | 0.584      | 3444x   |
| KSC         | 22 | 0 | 26 |                |               | 0.636      | 448x    |
| k-DBA       | 18 | 0 | 30 |                |               | 0.733      | 3892x   |
| k-Shape+DTW | 19 | 1 | 28 |                |               | 0.698      | 4175x   |
| k-Shape     | 36 | 1 | 11 |                |               | 0.772      | 12.4x   |

Accuracy: k-Shape is the only scalable method that outperforms k-AVG+ED, and it outperforms both KSC and k-DBA.
Efficiency: k-Shape is one and two orders of magnitude faster than KSC and k-DBA, respectively.

Shape-Based Time-Series Clustering: Conclusion

k-Shape outperforms all state-of-the-art partitional, hierarchical, and spectral clustering approaches except one; that method achieves similar performance, but it is two orders of magnitude slower than k-Shape, and its distance measure requires tuning, unlike k-Shape's.

Overall, k-Shape is a domain-independent, accurate, and scalable approach for time-series clustering.

Thank you!