Fair Use Agreement
This agreement covers the use of all slides on this CD-ROM; please read carefully.
You may freely use these slides for teaching, if:
You send me an email telling me the class number/university in advance.
My name and email address appear on the first slide (if you are using all or most of the slides), or on each slide (if you are just taking a few slides).
You may freely use these slides for a conference presentation, if:
You send me an email telling me the conference name in advance.
My name appears on each slide you use.
You may not use these slides for tutorials, or in a published work (tech report/conference paper/thesis/journal etc.). If you wish to do this, email me first; it is highly likely I will grant you permission.
(c) Eamonn Keogh

Towards Parameter Free Data Mining
Eamonn Keogh, Stefano Lonardi, Chotirat Ann Ratanamahatana
Computer Science & Engineering Department, University of California, Riverside

Outline of Talk
Why parameter-laden data mining algorithms are troublesome…
A parameter-free/lite approach
Empirical evaluation:
Parameter-free/lite clustering
Parameter-free/lite anomaly detection
Parameter-free/lite classification
Conclusions

Parameters: Anecdotal Observations
A data mining paper I once reviewed had 14 parameters…
More than 5 parameters seems the norm for most papers…
Some algorithms have only 4 parameters, but the algorithm is tested on synthetic data created by an algorithm with 3 or 4 parameters…
Is a surfeit of parameters a bad thing?

Parameter-laden algorithms are often hard to implement
Almost by definition, parameter-laden algorithms tend to be hard to implement. But don't we want people to use our work? This fact (combined with general laziness) probably accounts for the general dearth of strawman comparisons in most papers. See Keogh & Kasetty, SIGKDD 2002.

Parameter-laden algorithms make it hard to reproduce others' results
We re-implement many other papers (63 for this paper alone). Yet we are rarely able to reproduce others' results, even when using the same datasets. One reason is that we typically don't know how someone set the 5, 6, 7, 8, 9 parameters used in the original paper. Are irreproducible results scientifically meaningful?

Parameter-laden algorithms make it hard not to overfit
Overfitting is a problem in data mining. If we have 5 to 10 parameters to set, avoiding overfitting is nearly impossible. This makes it hard to evaluate the contribution of a paper. So you got 95% accuracy on the "bull/bear" classification problem. Did you…
Spend 2 minutes adjusting the parameters? (I am somewhat impressed.)
Spend 2 weeks adjusting the parameters? (I am not impressed.)
Yet often we don't know which! (See Pedro Domingos' POE papers.)

Parameter-laden algorithms are nearly useless to the non-specialist
If we ask a Cardiologist / CEO / Lawyer / School Administrator to use an algorithm with 10 parameters, then the best we can hope for is that they give up in frustration!

The Central Claim of this Paper
A very simple (12 lines of code) parameter-free algorithm can outperform many (most/all?) other algorithms, for many data mining tasks.

Quick Review
The Kolmogorov complexity K(x) of a string x is defined as the length of the shortest program capable of producing x on a universal computer, such as a Turing machine. The conditional Kolmogorov complexity K(x|y) of x given y is defined as the length of the shortest program that computes x when y is given as an auxiliary input. The function K(xy) is the length of the shortest program that outputs x concatenated with y.

The Similarity Metric
Ming Li, Xin Chen, Xin Li, Bin Ma, Paul M. B. Vitányi: "The similarity metric." SODA 2003.
The measure of Li et al. is the optimal similarity measure, but it is intractable; a cheap approximation replaces Kolmogorov complexity K with compressed size C, where C(x) means the size of x compressed by an off-the-shelf compression algorithm like WinZip, StuffIt, etc.
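As published by Li et al. (the slide shows these equations as images, and may render them in a slightly different but equivalent form), the optimal measure is the normalized information distance, and the practical stand-in is its compression-based approximation:

\[ d(x, y) = \frac{\max\{K(x \mid y),\, K(y \mid x)\}}{\max\{K(x),\, K(y)\}} \qquad \text{(optimal, but intractable)} \]

\[ NCD(x, y) = \frac{C(xy) - \min\{C(x),\, C(y)\}}{\max\{C(x),\, C(y)\}} \qquad \text{(cheap approximation)} \]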

The (very) Minor Extension proposed here
The optimal similarity measure is intractable, the approximation above is cheap, and the one proposed here is cheaper still. Note we flipped the numerator and denominator to go from similarity to dissimilarity.
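Concretely, the cheaper approximation, the Compression-Based Distance Measure (CDM) computed by the code on the next slide, is:

\[ CDM(x, y) = \frac{C(xy)}{C(x) + C(y)} \]

CDM approaches 1 when x and y are unrelated, and drops toward roughly 0.5 as they become more similar (if x = y, then C(xx) is close to C(x), so the ratio is near one half).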

Compression Based Distance Measure

function dist = CDM(A, B)
save A.txt A -ASCII              % Save variable A as A.txt
zip('A.zip', 'A.txt');           % Compress A.txt
A_file = dir('A.zip');           % Get file information
save B.txt B -ASCII              % Save variable B as B.txt
zip('B.zip', 'B.txt');           % Compress B.txt
B_file = dir('B.zip');           % Get file information
A_n_B = [A; B];                  % Concatenate A and B
save A_n_B.txt A_n_B -ASCII      % Save A_n_B.txt
zip('A_n_B.zip', 'A_n_B.txt');   % Compress A_n_B.txt
A_n_B_file = dir('A_n_B.zip');   % Get file information
% Return CDM dissimilarity
dist = A_n_B_file.bytes / (A_file.bytes + B_file.bytes);
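A hypothetical call (the file names are illustrative; A and B are numeric column vectors):

A = load('ecg1.txt');   % hypothetical data file, one value per line
B = load('ecg2.txt');
d = CDM(A, B);          % near 1 for unrelated series, smaller for similar ones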

But wait! Isn't the choice of compression algorithm itself a parameter? No! We are trying to approximate the optimal Kolmogorov compression, so the best compression algorithm to use is simply the one that gives the smallest file size.

The clustering achieved by CDM on 16,300 symbols from the mitochondrial DNA of 12 primates, and one "outlier" species.
[Figure: dendrogram over Baboon, Barbary Ape, Chimpanzee, Pygmy Chimpanzee, Human, Gorilla, Orangutan, Sumatran Orangutan, Gibbon, Capuchin, Malayan Flying Lemur, Ring-Tailed Lemur, and Oyster, with taxonomic levels labeled: primates, anthropoids, catarrhines, hominoids, greater apes, panines, pongines, cercopithecoids, prosimians; one taxonomy level is marked as controversial.]
"This is the correct clustering for these species." - Sang-Hee Lee (noted physical anthropologist)
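The workflow behind figures like this one is only a few lines; a minimal sketch (assuming the CDM function above, a cell array seqs of column-vector sequences, and MATLAB's Statistics Toolbox; the paper's exact linkage choice is not reproduced here):

n = numel(seqs);
D = zeros(n);                                % pairwise CDM dissimilarity matrix
for i = 1:n
    for j = i+1:n
        D(i,j) = CDM(seqs{i}, seqs{j});
        D(j,i) = D(i,j);                     % enforce symmetry
    end
end
tree = linkage(squareform(D), 'complete');   % agglomerative hierarchical clustering
dendrogram(tree);                            % draw the dendrogram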

[Figure, two panels. One panel: the clustering achieved on the text from various Yahoo portals (Jan 15th, 2004) for Argentina, Mexico, Spain, Brazil, Catalan, Italy, Denmark, Norway, Sweden, Germany, and USA; the smallest webpage had 1,615 characters, excluding white spaces. The other panel: the clustering achieved on the text of the first fifty chapters of Genesis in Danish, Norwegian, German, French, Dutch, Italian, Latin, English, and Maori.]

But DNA strings and natural language text are inherently discrete. Much of data mining is on real-valued data, like images, video and time series. Yeah, what about time series?

What about Time Series? Part I
In the example below, 1 and 3 are very similar ECGs…

What about Time Series? Part II
[Figure: a raw time series is discretized ("SAXIFY!") into the symbol string baabccbc over the alphabet {a, b, c}.]
In the SAX representation, every bit has about the same importance…
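A minimal SAX sketch (not the authors' exact code; the function name saxify and the hard-coded three-symbol alphabet are illustrative): z-normalize the series, reduce it with a piecewise aggregate approximation (PAA), then map each segment mean to a symbol using breakpoints chosen to make the symbols roughly equiprobable under a normal distribution.

function sax_string = saxify(ts, n_segments)
% Discretize a time series into a SAX string over the alphabet {a, b, c}.
ts = (ts - mean(ts)) / std(ts);               % z-normalize
seg_len = floor(length(ts) / n_segments);     % assumes length divides evenly
breakpoints = [-0.43 0.43];                   % equiprobable cuts for N(0,1), alphabet size 3
symbols = 'abc';
sax_string = blanks(n_segments);
for i = 1:n_segments
    segment_mean = mean(ts((i-1)*seg_len+1 : i*seg_len));         % PAA value
    sax_string(i) = symbols(sum(segment_mean > breakpoints) + 1); % map to symbol
end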

How well does CDM time series clustering work?
We compared to 51 other measures (we re-implemented every time series measure from the SIGKDD, SIGMOD, ICDM, ICDE, VLDB, ICML, SSDM, PKDD and PAKDD conferences in the last decade).
We optimized the parameters of the above measures.
We tested on diverse problems:
Homogeneous datasets: all ECGs, etc.
Heterogeneous datasets: a mixed bag of 18 diverse datasets.
We used a formal metric to score the clusterings (but here we will just show a few pictures).

CDM does best (by a huge margin) on the heterogeneous data problems

CDM does best (by a huge margin) on the homogeneous data problems.
[Figure, two panels: "CDM's clustering" and "The second best clustering (of 51 approaches)", on ECG data from the MIT-BIH Noise Stress Test Database (nstdb), record 118e6; the Long Term ST Database (ltstdb); and the BIDMC Congestive Heart Failure Database (chfdb), records chf02 and chf15.]

OK, you can cluster stuff, but what about other data mining problems, like classification, motif discovery, and anomaly detection? Yes, what about anomaly detection?

Parameter-Free Anomaly Detection The connection between the notion of similarity and anomalies is inherent in the English idiom. When confronted with an anomalous object or occurrence, people usually exclaim "I have never seen anything like it!" or "It is like nothing you have ever seen".

A Parameter-Free Anomaly Detection Algorithm

function loc_of_anomaly = kolmogorov_anomaly(data)
loc_of_anomaly = 1;
while size(data,1) > 2
    if CDM(data(1:floor(end/2),:), data) < CDM(data(ceil(end/2):end,:), data)
        loc_of_anomaly = loc_of_anomaly + size(data,1) / 2;
        data = data(ceil(end/2):end,:);
    else
        data = data(1:floor(end/2),:);
    end
end
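The logic: whichever half of the data compresses better against the whole sequence is the more "typical" half, so the search recurses into the other half, homing in on the anomaly by binary search. A hypothetical call (the file name is illustrative):

data = load('ecg118e06.txt');     % one reading per line
loc = kolmogorov_anomaly(data);   % approximate row index of the anomaly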

A problem from SIGKDD 2003…
A) Our approach.
B) Support Vector Machine (SVM) based approach (6 parameters).
C) Immunology (IMM) inspired approach (5 parameters).
D) Association Rule (AR) based approach (5 parameters).
E) TSA-tree wavelet based approach (3 parameters).
For each experiment we spent one hour of CPU time, and one hour of human time, trying to find the best parameters, and reported only the best results.

What happens when we change the anomaly?
A) Our approach.
B) Support Vector Machine (SVM) based approach (6 parameters).
C) Immunology (IMM) inspired approach (5 parameters).
D) Association Rule (AR) based approach (5 parameters).
E) TSA-tree wavelet based approach (3 parameters).

A small excerpt from dataset 118e06 from the MIT-BIH Noise Stress Test Database. The full dataset is 21,600 data points long. Here, we show only a subsection containing the two most interesting events detected by our algorithm. The gray markers are independent annotations by a cardiologist indicating Premature Ventricular Contractions.

Aerospace Corp Problems
[Figure: the results of using our algorithm on datasets L-1u, L-1n, L-1t, and L-1v from the Aerospace Corp collection. The bolder the line, the stronger the anomaly.]
Note that because of the way we plotted these, there is a tendency to locate the beginning of the anomaly as opposed to the most anomalous part.

[Figure: a multi-dimensional time series of an actor drawing a gun, annotated: hand resting at side; hand above holster; aiming at target; actor misses holster; briefly swings gun at target, but does not aim; laughing and flailing hand.]
The parameter-free algorithm generalizes to multi-dimensional time series, without changing a single line of code!

What about classification? Yes, what about classification?

Parameter-Free (Structural) Time Series Classification
Error rates:

Dataset          Euclidean   DTW       CDM
ECG: signal 1        -       16.25 %    6.25 %
ECG: signal 2        -       11.25 %    7.50 %
Gun: 2 classes       -        5.00 %    0.00 %
Gun: 4 classes       -       12.5 %     5.00 %
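For reference, a minimal sketch of how CDM plugs into one-nearest-neighbor classification (the function name, and the cell-array layout of the training set, are assumptions; the paper's exact evaluation protocol is not reproduced here):

function label = one_nn_cdm(query, train_set, train_labels)
% Classify query with the label of its nearest training series under CDM.
best_dist = inf;
label = train_labels(1);
for i = 1:numel(train_set)
    d = CDM(query, train_set{i});   % CDM function from the earlier slide
    if d < best_dist
        best_dist = d;
        label = train_labels(i);
    end
end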

What are the limitations of parameter-free algorithms? The data cannot be arbitrarily small, but the amazing clustering results shown here used only 1,000 data points per object!

Conclusions
Parameter-laden algorithms are bad!
They are often hard to implement.
They make it hard to reproduce others' results.
They make it difficult to judge the significance of a contribution.
They make it hard not to overfit.
They are next to useless to the non-specialist.
Parameter-free/lite algorithms are good!
They require little coding effort.
They make it trivial for others to reproduce your results.
They make it nearly impossible to overfit.
They have a real chance of being used by the non-specialist.
You should use a parameter-free/lite approach before doing anything more complex!

Questions? All datasets and code used in this talk can be found at www.cs.ucr.edu/~eamonn/TSDMA/index.html