1
Fair Use Agreement This agreement covers the use of all slides on this CD-ROM; please read it carefully. You may freely use these slides for teaching, if you send me an email in advance telling me the class number/university. My name and address must appear on the first slide (if you are using all or most of the slides), or on each slide (if you are just taking a few slides). You may freely use these slides for a conference presentation, if you send me an email in advance telling me the conference name. My name must appear on each slide you use. You may not use these slides for tutorials, or in a published work (tech report/conference paper/thesis/journal etc.). If you wish to do this, email me first; it is highly likely I will grant you permission. (c) Eamonn Keogh
2
Clustering of Streaming Time Series is Meaningless
Jessica Lin, Eamonn Keogh
Computer Science & Engineering Department, University of California, Riverside
Riverside, CA
3
“If you publish this paper, you will get in bad trouble…”
Anonymous reviewer of this paper, as submitted to SIAM’s SDM workshop on clustering (from where it was rejected).
4
Outline of Talk

What Does it Mean to be Meaningless?
Clustering Time Series
Whole Clustering
Subsequence Clustering
Demonstrating that Subsequence Clustering is Meaningless
(Not) Finding Rules in Time Series: A Case Study
Conclusions
5
What Does it Mean to be Meaningless?
An algorithm is meaningless if its output is independent of its input. With the exception of random number generators, meaningless algorithms are useless.
6
Time Series Data Mining
Clustering
Classification
Query by Content
Motif Discovery
7
Time Series Clustering
Whole Clustering: The notion of clustering here is similar to that of conventional clustering of discrete objects. Given a set of individual time series data, the objective is to group similar time series into the same cluster. Subsequence Clustering: Given a single time series, individual time series (subsequences) are extracted with a sliding window. Clustering is then performed on the extracted time series.
8
Whole Clustering Whole Clustering: The notion of clustering here is similar to that of conventional clustering of discrete objects. Given a set of individual time series data, the objective is to group similar time series into the same cluster.
9
Subsequence Clustering (STS)
Subsequence Clustering: Given a single time series, individual time series (subsequences) are extracted with a sliding window. Clustering is then performed on the extracted time series.
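As a concrete sketch of the extraction step (a minimal illustration, not the authors' code), pulling out subsequences with a sliding window of width w and step size 1 might look like this:

```python
import numpy as np

def sliding_window_subsequences(ts, w):
    """Extract every length-w subsequence of ts with a sliding window (step size 1)."""
    return np.array([ts[i:i + w] for i in range(len(ts) - w + 1)])

# Example: a length-120 series yields 120 - 20 + 1 = 101 subsequences of length 20.
ts = np.cumsum(np.random.randn(120))  # a random-walk time series
subs = sliding_window_subsequences(ts, 20)
print(subs.shape)  # (101, 20)
```

These extracted subsequences are the objects that an STS clustering algorithm (e.g. K-means) is then run on.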
10
Why do Subsequence Clustering?
Finding association rules in time series
Anomaly detection in time series
Indexing of time series
Classifying time series
Clustering of streaming time series has also been proposed as a knowledge discovery tool in its own right.
11
Subsequence Clustering is Meaningless!!
Subsequence clustering is meaningless, and the more than 100 papers that use it as a subroutine make zero contribution.
12
Measuring the meaningfulness of clustering
[Figure: two plots, each labeled "Bears"]
13
Measuring the meaningfulness of clustering
[Figure: the two "Bears" plots, with K-Means run on each]
14
Measuring the meaningfulness of clustering
[Figure: the two "Bears" plots with the K-Means cluster centers found on each]
15
Intuitively, the cluster_distance(Bears, Bears) is the distance between each red dot and the green dot closest to it. [Figure: the two sets of cluster centers plotted together as red and green dots]
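This distance can be sketched in code (one plausible reading of the slide's description; the paper may differ in details such as averaging or symmetrizing over both directions):

```python
import numpy as np

def cluster_distance(centers_a, centers_b):
    """For each center in centers_a ("red dots"), find the nearest center in
    centers_b ("green dots") and sum those nearest-neighbor distances."""
    return sum(min(np.linalg.norm(np.asarray(a) - np.asarray(b)) for b in centers_b)
               for a in centers_a)

# Tiny example with 2-D cluster centers:
red   = [(0.0, 0.0), (1.0, 1.0)]
green = [(0.0, 0.0), (2.0, 2.0)]
print(cluster_distance(red, green))  # 0 + sqrt(2), about 1.414
```

If clustering is meaningful, two runs on the same data should produce nearly coincident centers, making this distance small.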
16
Measuring the meaningfulness of clustering
[Figure: two plots, labeled "Bulls" and "Bears"]
17
Measuring the meaningfulness of clustering
[Figure: the "Bulls" and "Bears" plots, with K-Means run on each]
18
Measuring the meaningfulness of clustering
[Figure: the two plots with the K-Means cluster centers found on each]
19
Measuring the meaningfulness of clustering
Intuitively, the cluster_distance(Bears, Bulls) is the distance between each red dot and the green dot closest to it. [Figure: the two sets of cluster centers plotted together as red and green dots]
20
cluster_distance(Bears, Bears) cluster_distance(Bears, Bulls)
within_set_X_distance means the cluster_distance between multiple runs of clustering on "Bears" and another copy of "Bears".
between_set_X_and_Y_distance means the cluster_distance between multiple runs of clustering on "Bears" and "Bulls".
[Figure: the two distances illustrated on a plot of the cluster centers]
21
So clustering_meaningfulness(X, Y) should be close to zero if X and Y are two different datasets, say monkeys and motorcycles. If clustering_meaningfulness(X, Y) is close to one, it tells us that the clustering algorithm thinks that X and Y are the same thing!
22
An Experiment

Let's take two different datasets, and cluster them using
Whole Clustering
Subsequence Clustering
and measure the clustering_meaningfulness. Let us use K-means clustering, and try a range of values for K and of time series lengths. We will use random walk, and S&P stock market data…
23
clustering_meaningfulness(random walk, S&P)
24
An Implication of the Experiment
Suppose I am a consultant, and people pay me to do subsequence clustering on their data. One day, five customers submit five different datasets for me to cluster:
Rainfall in Ireland over the last century.
The value of Yahoo's stock over the last decade.
The price of butter in Brazil since WWII.
George Bush's popularity in the polls.
The mean length of the cheetah's leg over the last million years.
I could cluster any one of these datasets, give the same results to all five customers, and they would never know!
25
Maybe the problem is with K-Means…
…no, it is true for any clustering algorithm
26
What the #@&* is going on?
Let us take a look at the cluster centers created by subsequence clustering… Sine waves!!!
27
For subsequence clustering, no matter what the input, the output is a set of (out-of-phase) sine waves
28
Why Sine Waves? Slutsky’s Theorem (informally stated)
Any time series will converge to a sine wave after repeated applications of moving-window smoothing. (Evgeny Slutsky)
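A small numerical illustration of this effect (an assumption-laden sketch: it uses a circular moving average to avoid boundary artifacts, a simple linear trend as the input series, and the fraction of spectral energy held by the single strongest frequency as a crude "how sinusoidal is it?" score):

```python
import numpy as np

def circular_moving_average(x, w=5):
    """One pass of moving-window smoothing, with wrap-around boundaries."""
    return np.mean([np.roll(x, s) for s in range(-(w // 2), w // 2 + 1)], axis=0)

def dominant_fraction(x):
    """Fraction of (non-DC) spectral energy held by the strongest single frequency.
    A pure sine wave scores 1.0."""
    power = np.abs(np.fft.rfft(x - x.mean())[1:]) ** 2
    return power.max() / power.sum()

ts = np.linspace(-1.0, 1.0, 512, endpoint=False)  # input series; here a plain trend

smoothed = ts.copy()
for _ in range(300):  # repeated applications of moving-window smoothing
    smoothed = circular_moving_average(smoothed)

print(dominant_fraction(ts), dominant_fraction(smoothed))
# The score rises: the heavily smoothed series is closer to a single sine wave.
```

The intuition: each smoothing pass attenuates every frequency, but attenuates the lowest nonzero frequency least, so after many passes one sinusoidal component dominates.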
29
In our paper we have much more to say about why subsequence clustering does not and cannot give meaningful results. However, let us address the argument on the following slide instead…
30
A Counter Argument to our Claim
“Since many papers have been published which use time series subsequence clustering as a subroutine, and these papers produce successful results, time series subsequence clustering must be a meaningful operation.”
31
(Not) Finding rules in time series
G. Das, K.-I. Lin, H. Mannila, G. Renganathan & P. Smyth (1998). Rule Discovery from Time Series. In Proc. of the 4th KDD.
Extended by:
Mori, T. & Uehara, K. (2001). Extraction of Primitive Motion and Discovery of Association Rules from Human Motion.
Cotofrei, P. & Stoffel, K. (2002). Classification Rules + Time = Temporal Rules.
Fu, T. C., Chung, F. L., Ng, V. & Luk, R. (2001). Pattern Discovery from Stock Time Series Using Self-Organizing Maps.
Harms, S. K., Deogun, J. & Tadesse, T. (2002). Discovering Sequential Association Rules with Constraints and Time Lags in Multiple Sequences.
Hetland, M. L. & Sætrom, P. (2002). Temporal Rules Discovery Using Genetic Programming and Specialized Hardware.
Jin, X., Lu, Y. & Shi, C. (2002). Distribution Discovery: Local Analysis of Temporal Rules.
Yairi, T., Kato, Y. & Hori, K. (2001). Fault Detection by Mining Association Rules in House-keeping Data.
Tino, P., Schittenkopf, C. & Dorffner, G. (2000). Temporal Pattern Recognition in Noisy Non-stationary Time Series Based on Quantization into Symbolic Streams.
and many more
32
A Simple Experiment...

Our reimplementation of the rule-discovery algorithm finds rules such as: “if the stock rises then falls greatly, followed by a smaller rise, then we can expect to see, within 20 time units, a pattern of rapid decrease followed by a leveling out.”

The punch line is… our data is random walk!
34
The Bottom Line

All the researchers who are finding rules in time series are fooling themselves.
35
What we are NOT Claiming
Clustering of time series is meaningless
Sliding windows are always a bad thing
Clustering of discrete sequences with sliding windows is flawed
People are deliberately publishing results that they know are meaningless
36
Conclusions I Subsequence clustering of time series, as defined in our paper, and as used by dozens of researchers, is completely meaningless.
37
Conclusions II If the best data miners in the world can fool themselves, fool the reviewers, and fool the community at large, it suggests that the data mining community is doing very poor evaluation of its “contributions”. We need to play devil's advocate with our own work!
38
Note We very much welcome feedback on this work. If you would like us to try a dataset or algorithm, just let us know.
39
Questions? Thanks to Christos Faloutsos, Frank Höppner, Howard Hamilton, Daniel Barbara, Magnus Lie Hetland, Hongyuan Zha, Sergio Focardi, Shoji Hirano, Shusaku Tsumoto, and Zbigniew Struzik for their comments. Special thanks to Michalis Vlachos for pointing us to Slutsky's work. Datasets and code used in this paper can be found at…