Presentation transcript:

Fair Use Agreement This agreement covers the use of all slides on this CD-ROM; please read carefully. You may freely use these slides for teaching, if you send me an email telling me the class number/university in advance, and my name and email address appear on the first slide (if you are using all or most of the slides) or on each slide (if you are just taking a few slides). You may freely use these slides for a conference presentation, if you send me an email telling me the conference name in advance, and my name appears on each slide you use. You may not use these slides for tutorials, or in a published work (tech report/conference paper/thesis/journal etc.). If you wish to do this, email me first; it is highly likely I will grant you permission. (c) Eamonn Keogh, eamonn@cs.ucr.edu

Clustering of Streaming Time Series is Meaningless Jessica Lin, Eamonn Keogh Computer Science & Engineering Department University of California - Riverside Riverside, CA 92521 {jessica,eamonn,wagner}@cs.ucr.edu

“If you publish this paper, you will get in bad trouble…” Anonymous reviewer of this paper, as submitted to SIAM’s SDM workshop on clustering (from which it was rejected).

Outline of Talk
- What Does it Mean to be Meaningless?
- Clustering Time Series: Whole Clustering, Subsequence Clustering
- Demonstrating that Subsequence Clustering is Meaningless
- (Not) Finding Rules in Time Series: A Case Study
- Conclusions

What Does it Mean to be Meaningless? An algorithm is meaningless if its output is independent of its input. With the exception of random number generators, meaningless algorithms are useless.

Time Series Data Mining: Clustering, Classification, Query by Content, Motif Discovery

Time Series Clustering Whole Clustering: The notion of clustering here is similar to that of conventional clustering of discrete objects. Given a set of individual time series data, the objective is to group similar time series into the same cluster. Subsequence Clustering: Given a single time series, individual time series (subsequences) are extracted with a sliding window. Clustering is then performed on the extracted time series.

Whole Clustering Whole Clustering: The notion of clustering here is similar to that of conventional clustering of discrete objects. Given a set of individual time series data, the objective is to group similar time series into the same cluster.

Subsequence Clustering (STS) Subsequence Clustering: Given a single time series, individual time series (subsequences) are extracted with a sliding window. Clustering is then performed on the extracted time series.
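
To make this procedure concrete, here is a minimal sketch of sliding-window subsequence extraction followed by k-means. It is illustrative only, not the authors' code; the window length `w`, the number of clusters `k`, and the use of NumPy and scikit-learn are all assumptions.

```python
# Illustrative sketch of subsequence (STS) clustering; window length and k
# are arbitrary choices, not values from the paper.
import numpy as np
from sklearn.cluster import KMeans

def sliding_window_subsequences(ts, w):
    """Extract every length-w subsequence of a 1-D series with a step-1 sliding window."""
    return np.array([ts[i:i + w] for i in range(len(ts) - w + 1)])

def sts_cluster_centers(ts, w=32, k=3, seed=0):
    """Run k-means on the extracted subsequences and return the k cluster centers."""
    subs = sliding_window_subsequences(ts, w)
    km = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(subs)
    return km.cluster_centers_
```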

Why do Subsequence Clustering? Finding association rules in time series, anomaly detection in time series, indexing of time series, and classifying time series. Clustering of streaming time series has also been proposed as a knowledge discovery tool in its own right.

Subsequence Clustering is Meaningless!! Subsequence clustering is meaningless, and the more than 100 papers that use it as a subroutine make zero contribution.

Measuring the meaningfulness of clustering [Figure: two copies of the “Bears” dataset]

Measuring the meaningfulness of clustering [Figure: K-means is run on each copy of “Bears”]

Measuring the meaningfulness of clustering [Figure: the cluster centers found on each copy of “Bears”]

Intuitively, the cluster_distance(Bears, Bears) is the distance between each red dot, and the green dot closest to it

Measuring the meaningfulness of clustering [Figure: the “Bulls” dataset and the “Bears” dataset]

Measuring the meaningfulness of clustering [Figure: K-means is run on “Bulls” and on “Bears”]

Measuring the meaningfulness of clustering [Figure: cluster centers; panels labeled “Bears”, “Bears”]

Measuring the meaningfulness of clustering Intuitively, the cluster_distance(Bears, Bulls) is the distance between each red dot, and the green dot closest to it

within_set_X_distance means the cluster_distance between multiple runs of clustering on “Bears” and another copy of “Bears”, i.e. cluster_distance(Bears, Bears). between_set_X_and_Y_distance means the cluster_distance between multiple runs of clustering on “Bears” and “Bulls”, i.e. cluster_distance(Bears, Bulls).
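
The cluster_distance illustrated in the figures above can be written down directly: for each cluster center from one run, take the distance to the nearest center from the other run. The summation and the Euclidean metric below are my assumptions; the paper's exact formulation may normalize differently.

```python
import numpy as np

def cluster_distance(centers_a, centers_b):
    """For each center in A, the Euclidean distance to its nearest center in B, summed.
    This mirrors the slide's intuition: each red dot to the closest green dot."""
    centers_a, centers_b = np.asarray(centers_a), np.asarray(centers_b)
    return sum(np.min(np.linalg.norm(centers_b - a, axis=1)) for a in centers_a)
```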

So clustering_meaningfulness(X, Y) should be close to zero if X and Y are two different datasets, say monkeys and motorcycles. If clustering_meaningfulness(X, Y) is close to one, it tells us that the clustering algorithm thinks that X and Y are the same thing!
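
A sketch of the meaningfulness measure, assuming it is the ratio of the average within-set distance to the average between-set distance (my reconstruction from the slides, not a verbatim copy of the paper's formula), and reusing the cluster_distance helper sketched above. `runs_x` and `runs_y` are lists of cluster-center sets from repeated clustering runs on X and Y.

```python
import numpy as np

def clustering_meaningfulness(runs_x, runs_y):
    """within_set_X_distance / between_set_X_and_Y_distance, averaged over runs.
    Close to 0: runs on X agree with each other far more than with runs on Y.
    Close to 1: the clustering algorithm cannot tell X and Y apart.
    Requires at least two runs in runs_x."""
    within = np.mean([cluster_distance(a, b)
                      for i, a in enumerate(runs_x)
                      for j, b in enumerate(runs_x) if i != j])
    between = np.mean([cluster_distance(a, b)
                       for a in runs_x for b in runs_y])
    return within / between
```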

An Experiment Let's take two different datasets, cluster them using both Whole Clustering and Subsequence Clustering, and measure the clustering_meaningfulness. Let us use K-means clustering, and try a range of values for K and for time series lengths. We will use random walk, and S&P stock market data…
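
One way the subsequence-clustering half of this experiment could be driven, reusing the helpers sketched above. The random-walk generator is shown; the S&P series would be loaded from a file in practice, and all parameter values here are illustrative rather than the paper's.

```python
import numpy as np

def random_walk(n, seed=0):
    """A simple random walk: cumulative sum of standard normal steps."""
    rng = np.random.default_rng(seed)
    return np.cumsum(rng.standard_normal(n))

w, k, n_runs = 32, 3, 3
x = random_walk(10_000, seed=1)   # dataset X (random walk)
y = random_walk(10_000, seed=2)   # stand-in for dataset Y (e.g. S&P prices)

# Subsequence (STS) clustering: several k-means runs on each series.
runs_x = [sts_cluster_centers(x, w, k, seed=s) for s in range(n_runs)]
runs_y = [sts_cluster_centers(y, w, k, seed=s) for s in range(n_runs)]
print(clustering_meaningfulness(runs_x, runs_y))  # the paper finds this is near 1 for STS clustering
```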

clustering_meaningfulness(random walk, S&P)

An Implication of the Experiment Suppose I am a consultant, and people pay me to do subsequence clustering on their data. One day five customers submit five different datasets for me to cluster: Rainfall in Ireland over the last century. The value of Yahoo's stock over the last decade. The price of butter in Brazil since WWII. George Bush's popularity in the polls. The mean length of the cheetah's leg over the last million years. I could cluster any one of these datasets, give the same results to all the customers, and they would never know!

Maybe the problem is with K-Means… …no, it is true for any clustering algorithm

What the #@&* is going on? Let us take a look at the cluster centers created by subsequence clustering… Sine waves!!!

For subsequence clustering, no matter what the input, the output is a set of (out of phase) sine waves

Why Sine Waves? Slutsky's Theorem (informally stated): Any time series will converge to a sine wave after repeated applications of moving window smoothing. Evgeny Slutsky (1880–1948)
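
The informal statement above is easy to check empirically. Here is a small sketch with assumed parameters: repeatedly apply a moving-average smoother to a noisy series and watch it turn into a slow, sine-like oscillation.

```python
import numpy as np

def moving_average(x, w=8):
    """Simple length-w moving-average smoother."""
    return np.convolve(x, np.ones(w) / w, mode="valid")

rng = np.random.default_rng(0)
signal = rng.standard_normal(2_000)   # start from white noise
for _ in range(50):                   # repeated smoothing passes
    signal = moving_average(signal)
# Plotting `signal` (e.g. with matplotlib) now shows a smooth, roughly
# sinusoidal oscillation rather than anything resembling the original noise.
```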

In our paper we have much more to say about why subsequence clustering does not and cannot give meaningful results. However, let us address the argument on the following slide instead…

A Counter Argument to our Claim “Since many papers have been published which use time series subsequence clustering as a subroutine, and these papers produce successful results, time series subsequence clustering must be a meaningful operation.”

(Not) Finding rules in time series G. Das, K.-I. Lin, H. Mannila, G. Renganathan, and P. Smyth (1998). Rule Discovery from Time Series. In Proc. of the 4th KDD. Extended by: Mori, T. & Uehara, K. (2001). Extraction of Primitive Motion and Discovery of Association Rules from Human Motion. Cotofrei, P. & Stoffel, K. (2002). Classification Rules + Time = Temporal Rules. Fu, T. C., Chung, F. L., Ng, V. & Luk, R. (2001). Pattern Discovery from Stock Time Series Using Self-Organizing Maps. Harms, S. K., Deogun, J. & Tadesse, T. (2002). Discovering Sequential Association Rules with Constraints and Time Lags in Multiple Sequences. Hetland, M. L. & Sætrom, P. (2002). Temporal Rules Discovery Using Genetic Programming and Specialized Hardware. Jin, X., Lu, Y. & Shi, C. (2002). Distribution Discovery: Local Analysis of Temporal Rules. Yairi, T., Kato, Y. & Hori, K. (2001). Fault Detection by Mining Association Rules in House-keeping Data. Tino, P., Schittenkopf, C. & Dorffner, G. (2000). Temporal Pattern Recognition in Noisy Non-stationary Time Series Based on Quantization into Symbolic Streams. And many more.

A Simple Experiment... Our reimplementation of the rule-discovery algorithm. The punch line is… "if stock rises then falls greatly, follow a smaller rise, then we can expect to see within 20 time units, a pattern of rapid decrease followed by a leveling out." Our data is random walk!

The Bottom Line All the researchers that are finding rules in time series are fooling themselves

What we are NOT Claiming:
- Clustering of time series is meaningless
- Sliding windows are always a bad thing
- Clustering of discrete sequences with sliding windows is flawed
- People are deliberately publishing results that they know are meaningless

Conclusions I Subsequence clustering of time series, as defined in our paper, and as used by dozens of researchers, is completely meaningless.

Conclusions II If the best data miners in the world can fool themselves, fool the reviewers, and fool the community at large, it suggests that the data mining community is doing very poor evaluation of its "contributions". We need to play devil's advocate with our own work!

Note We very much welcome feedback on this work. If you would like us to try a dataset or algorithm, just let us know.

Questions? Thanks to Christos Faloutsos, Frank Höppner, Howard Hamilton, Daniel Barbara, Magnus Lie Hetland, Hongyuan Zha, Sergio Focardi, Shoji Hirano, Shusaku Tsumoto, and Zbigniew Struzik for their comments. Special thanks to Michalis Vlachos for pointing to Slutsky's work. Datasets and code used in this paper can be found at: www.cs.ucr.edu/~eamonn/TSDMA/index.html