
1 UCR CS 235 Data Mining Winter 2011 TA: Abdullah Mueen Eamonn Keogh

2 Important Note All information about grades, homeworks, projects, etc. will be given out at the next meeting. Someone give me a 15-minute warning before the end of this class.

3 What Is Data Mining? Data mining (knowledge discovery from data): the extraction of interesting (non-trivial, implicit, previously unknown, and potentially useful) patterns or knowledge from huge amounts of data. Alternative names: knowledge discovery (mining) in databases (KDD), knowledge extraction, data/pattern analysis, data archeology, data dredging, information harvesting, business intelligence, etc. "Data mining" is not: databases / (deductive) query processing, expert systems, or small ML/statistical programs.

4 What is Data Mining? An example from yesterday's newspapers

5 What is Data Mining? Example from the NBA
Play-by-play information recorded by teams: who is on the court, who shoots, results. Coaches want to know what works best: plays that work well against a given team, good/bad player matchups. Advanced Scout (from IBM Research) is a data mining tool to answer these questions. [Example: Starks + Houston + Ward playing.]

6 What is Data Mining? Example from Keogh/Mueen
Beet Leafhopper (Circulifer tenellus). [Figure: the insect's stylet penetrates the plant membrane; a voltage source, input resistor, and conductive glue connect the insect and the soil near the plant, so the voltage reading records feeding behavior. Approximately 14.4 minutes of insect telemetry, with marked instances at 3,664 and 9,036.]

7 All these examples show…
Lots of raw data in → some data mining → facts, rules, or patterns out.

8 Knowledge Discovery in Databases: Process
[Figure: the KDD pipeline. Data → Selection → Target Data → Preprocessing ("cleaning") → Preprocessed Data → Transformation/Conversion (aggregation, abstraction) → Data Mining → Patterns → Interpretation/Evaluation (visualization, statistical analysis) → Knowledge ("There exists a planet at…").] Adapted from: U. Fayyad et al. (1995), "From Knowledge Discovery to Data Mining: An Overview," Advances in Knowledge Discovery and Data Mining, U. Fayyad et al. (Eds.), AAAI/MIT Press

9 Data Mining: Confluence of Multiple Disciplines
Database Systems Statistics Data Mining Machine Learning Visualization Algorithm Other Disciplines

10 Course Outline Introduction: What is data mining? Data mining process
What makes it a new and unique discipline? Relationship between data warehousing, on-line analytical processing, and data mining. Data mining tasks: clustering, classification, rule learning, etc. Data mining process: task identification, data preparation/cleansing. Association rule mining: problem description, algorithms. Classification: Bayesian, nearest neighbor, linear classifiers, tree-based approaches. Prediction: regression, neural networks. Clustering: distance-based approaches, density-based approaches. Anomaly detection: distance based, density based, model based. Similarity search.

11 Data Mining: Classification Schemes
General functionality Descriptive data mining Predictive data mining Different views, different classifications Kinds of data to be mined Kinds of knowledge to be discovered Kinds of techniques utilized Kinds of applications adapted

12 Data Mining: History of the Field
Knowledge Discovery in Databases workshops started ‘89 Now a conference under the auspices of ACM SIGKDD IEEE conference series started 2001 Key founders / technology contributors: Usama Fayyad, JPL (then Microsoft, then his own company, Digimine, now Yahoo! Research labs, now CEO at Open Insights) Gregory Piatetsky-Shapiro (then GTE, now his own data mining consulting company, Knowledge Stream Partners) Rakesh Agrawal (IBM Research) The term “data mining” has been around since at least 1983 – as a pejorative term in the statistics community

13 Data Mining: The big players

14 A data mining problem… Wei Wang - School of Life Science, Fudan University, China Wei Wang - Nonlinear Systems Laboratory, Department of Mechanical Engineering, MIT Wei Wang - University of Maryland Baltimore County Wei Wang - University of Naval Engineering Wei Wang - ThinkIT Speech Lab, Institute of Acoustics, Chinese Academy of Sciences Wei Wang - Rutgers University, New Brunswick, NJ, USA Wei Wang - Purdue University Indianapolis Wei Wang - INRIA Sophia Antipolis, Sophia Antipolis, France Wei Wang - Institute of Computational Linguistics, Peking University Wei Wang - National University of Singapore Wei Wang - Nanyang Technological University, Singapore Wei Wang - Computer and Electronics Engineering, University of Nebraska Lincoln, NE, USA Wei Wang - The University of New South Wales, Australia Wei Wang - Language Weaver, Inc. Wei Wang - The Chinese University of Hong Kong, Mechanical and Automation Engineering Wei Wang - Center for Engineering and Scientific Computation, Zhejiang University, China Wei Wang - Fudan University, Shanghai, China Wei Wang - University of North Carolina at Chapel Hill

15 What Can Data Mining Do? Classify Cluster Summarize
Classify (categorical, regression). Cluster. Summarize (summary statistics, summary rules). Link analysis / model dependencies (association rules). Sequence analysis (time-series analysis, sequential associations). Detect deviations.

16 Why is Data Mining Hard? Scalability High Dimensionality
Heterogeneous and complex data. Data ownership and distribution. Non-traditional analysis. Overfitting. Privacy issues.

17 Scale of Data
Organization | Scale of Data
Walmart | ~20 million transactions/day
Google | ~8.2 billion Web pages
Yahoo | ~10 GB Web data/hr
NASA satellites | ~1.2 TB/day
NCBI GenBank | ~22 million genetic sequences
France Telecom | 29.2 TB
UK Land Registry | 18.3 TB
AT&T Corp | 26.2 TB
"The great strength of computers is that they can reliably manipulate vast amounts of data very quickly. Their great weakness is that they don't have a clue as to what any of that data actually means" (S. Cass, IEEE Spectrum, Jan 2004)

18 What Can Data Mining Do? Classify Cluster Summarize
Classify (categorical, regression). Cluster. Summarize (summary statistics, summary rules). Link analysis / model dependencies (association rules). Sequence analysis (time-series analysis, sequential associations). Detect deviations.

19 The Classification Problem
(informal definition) Given a collection of annotated data, in this case five instances of Katydids and five of Grasshoppers, decide what type of insect the unlabeled example is. Katydid or Grasshopper?

20 The Classification Problem
spam Given a collection of annotated data… Spam or ?

21 The Classification Problem
Spanish Given a collection of annotated data… Polish Spanish or Polish?

22 The Classification Problem
Stinging Nettle Given a collection of annotated data… False Nettle Stinging Nettle or False Nettle?

23 The Classification Problem
Greek: Gunopulos, Papadopoulos, Kollios, Dardanos. Irish: Keogh, Gough, Greenhaugh, Hadleigh. Given a collection of annotated data… Tsotras: Greek or Irish?

24 The Classification Problem
(informal definition) Given a collection of annotated data, in this case five instances of Katydids and five of Grasshoppers, decide what type of insect the unlabeled example is. Katydid or Grasshopper?

25 For any domain of interest, we can measure features
Color {Green, Brown, Gray, Other} Has Wings? Abdomen Length Thorax Length Antennae Length Mandible Size Spiracle Diameter Leg Length

26 We can store features in a database.
My_Collection
Insect ID | Abdomen Length | Antennae Length | Insect Class
1 | 2.7 | 5.5 | Grasshopper
2 | 8.0 | 9.1 | Katydid
3 | 0.9 | 4.7 |
4 | 1.1 | 3.1 |
5 | 5.4 | 8.5 |
6 | 2.9 | 1.9 |
7 | 6.1 | 6.6 |
8 | 0.5 | 1.0 |
9 | 8.3 | |
10 | 8.1 | | Katydids
The classification problem can now be expressed as: given a training database (My_Collection), predict the class label of a previously unseen instance. Previously unseen instance = 11 | 5.1 | 7.0 | ???????
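The training table above can be sketched as a small Python structure. This is only an illustration: the class labels for rows 3-8 were lost in the transcript and are filled in here using the antenna-length rule quoted later in the lecture ("longer than 5.5 → Katydid"); rows 9 and 10 are omitted because their antennae values are missing.

```python
# My_Collection as (abdomen_length, antennae_length, insect_class) rows.
# Labels for rows 3-8 are reconstructed, not from the original slide.
training = [
    (2.7, 5.5, "Grasshopper"),
    (8.0, 9.1, "Katydid"),
    (0.9, 4.7, "Grasshopper"),
    (1.1, 3.1, "Grasshopper"),
    (5.4, 8.5, "Katydid"),
    (2.9, 1.9, "Grasshopper"),
    (6.1, 6.6, "Katydid"),
    (0.5, 1.0, "Grasshopper"),
]

# Instance 11: features known, class unknown -- the thing to predict.
unseen = (5.1, 7.0)
```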

27 Grasshoppers and Katydids
[Scatterplot: Antenna Length (y-axis) vs. Abdomen Length (x-axis), with Grasshoppers and Katydids forming two groups.]

28 Grasshoppers and Katydids
We will also use this larger dataset as a motivating example… [Scatterplot: Antenna Length vs. Abdomen Length.] Each of these data objects is called an exemplar, a (training) example, an instance, or a tuple.

29 We will return to the previous slide in two minutes
We will return to the previous slide in two minutes. In the meantime, we are going to play a quick game. I am going to show you some classification problems which were shown to pigeons! Let us see if you are as smart as a pigeon!

30 Pigeon Problem 1
[Figure: bar-pair examples of class A, such as (3, 4), (1.5, 5), (6, 8), (2.5, 5), and examples of class B.]

31 Pigeon Problem 1
Examples of class A; examples of class B. What class is this object? What about this one, A or B?

32 This is a B! Pigeon Problem 1
Examples of class A; examples of class B. Here is the rule: if the left bar is smaller than the right bar, it is an A; otherwise it is a B.

33 Pigeon Problem 2 Oh! This one's hard! Examples of class A
Examples of class B Even I know this one

34 Pigeon Problem 2 Examples of class A Examples of class B
The rule is as follows: if the two bars are equal in size, it is an A; otherwise it is a B. So this one is an A.

35 Pigeon Problem 3
[Figure: bar-pair examples of class A and class B, such as (6, 6), (4, 4), (5, 6).] This one is really hard! What is this, A or B?

36 Pigeon Problem 3 It is a B! Examples of class A Examples of class B
The rule is as follows: if the square of the sum of the two bars is less than or equal to 100, it is an A; otherwise it is a B.

37 Why did we spend so much time with this game?
Because we wanted to show that almost all classification problems have a geometric interpretation, check out the next 3 slides…

38 Pigeon Problem 1 Here is the rule again.
[Plot: Left Bar vs. Right Bar.] Examples of class A; examples of class B. Here is the rule again: if the left bar is smaller than the right bar, it is an A; otherwise it is a B.

39 Pigeon Problem 2
[Plot: Left Bar vs. Right Bar.] Examples of class A; examples of class B. Let me look it up… here it is… the rule is: if the two bars are equal in size, it is an A; otherwise it is a B.

40 Pigeon Problem 3 [Plot: Left Bar vs. Right Bar, axes 0-100.] Examples of class A; examples of class B. The rule again: if the square of the sum of the two bars is less than or equal to 100, it is an A; otherwise it is a B.

41 Grasshoppers and Katydids
[Scatterplot: Antenna Length vs. Abdomen Length.]

42 Katydids and Grasshoppers
We can "project" the previously unseen instance (11 | 5.1 | 7.0 | ???????) into the same space as the database. We have now abstracted away the details of our particular problem. It will be much easier to talk about points in space. [Scatterplot: Antenna Length vs. Abdomen Length.]

43 Simple Linear Classifier
R. A. Fisher. If the previously unseen instance is above the line, then the class is Katydid; else the class is Grasshopper. [Scatterplot: a straight line separating the Katydids from the Grasshoppers.]
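The decision rule above can be sketched in a few lines of Python. This is a minimal illustration, not Fisher's actual fitting procedure: the slope and intercept below are made-up values that happen to separate the My_Collection points from the earlier slide.

```python
def classify(antenna, abdomen, slope, intercept):
    """Simple linear classifier: instances above the line are Katydids.

    In practice slope/intercept would be fit from the training data
    (e.g. by Fisher's linear discriminant); here they are illustrative.
    """
    if antenna > slope * abdomen + intercept:
        return "Katydid"
    return "Grasshopper"

# The unseen instance (abdomen 5.1, antennae 7.0) lands above the
# illustrative line antenna = -abdomen + 10, so it is called a Katydid.
print(classify(7.0, 5.1, slope=-1.0, intercept=10.0))
```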

44 The simple linear classifier is defined for higher dimensional spaces…

45 … we can visualize it as being an n-dimensional hyperplane

46 It is interesting to think about what would happen in this example if we did not have the 3rd dimension…

47 We can no longer get perfect accuracy with the simple linear classifier…
We could try to solve this problem by using a simple quadratic classifier or a simple cubic classifier… However, as we will later see, this is probably a bad idea…

48 Which of the “Pigeon Problems” can be solved by the Simple Linear Classifier?
Perfect. Useless. Pretty good. [The three pigeon problems replotted in the Left Bar vs. Right Bar space.] Problems that can be solved by a linear classifier are called linearly separable.

49 A Famous Problem Virginica R. A. Fisher’s Iris Dataset. 3 classes
50 of each class. The task is to classify Iris plants into one of 3 varieties using the Petal Length and Petal Width. [Images: Iris Setosa, Iris Versicolor, Iris Virginica.]

50 Virginica Setosa Versicolor
We can generalize the piecewise linear classifier to N classes by fitting N − 1 lines. In this case we first learned the line to (perfectly) discriminate between Setosa and Virginica/Versicolor, then we learned to approximately discriminate between Virginica and Versicolor. If petal width > – (0.325 * petal length) then class = Virginica; elseif petal width…

51 We have now seen one classification algorithm, and we are about to see more. How should we compare them? Predictive accuracy. Speed and scalability: time to construct the model; time to use the model; efficiency in disk-resident databases. Robustness: handling noise, missing values and irrelevant features, streaming data. Interpretability: understanding and insight provided by the model.

52 Predictive Accuracy I How do we estimate the accuracy of our classifier? We can use K-fold cross validation: we divide the dataset into K equal-sized sections. The algorithm is tested K times, each time leaving out one of the K sections from building the classifier, but using it to test the classifier instead. Accuracy = Number of correct classifications / Number of instances in our database. [The My_Collection table is shown again, divided into K = 5 folds.]
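The K-fold procedure just described might be sketched as follows. This is a minimal illustration, not the lecture's code: `train_fn` and `predict_fn` are hypothetical hooks for whatever classifier is being evaluated.

```python
import random

def k_fold_accuracy(data, train_fn, predict_fn, k=5, seed=0):
    """Estimate accuracy by K-fold cross validation.

    data: list of (features, label) pairs.
    train_fn(train_rows) -> model; predict_fn(model, features) -> label.
    """
    data = data[:]
    random.Random(seed).shuffle(data)
    # Deal the instances round-robin into k roughly equal folds.
    folds = [data[i::k] for i in range(k)]
    correct = 0
    for i in range(k):
        test = folds[i]
        train = [row for j, fold in enumerate(folds) if j != i for row in fold]
        model = train_fn(train)
        correct += sum(predict_fn(model, feats) == label for feats, label in test)
    # Every instance is tested exactly once.
    return correct / len(data)
```

Any classifier from the rest of the lecture (linear, nearest neighbor, …) can be plugged in through the two hooks.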

53 Predictive Accuracy II
Using K-fold cross validation is a good way to set any parameters we may need to adjust in (any) classifier. We can do K-fold cross validation for each possible setting, and choose the model with the highest accuracy. Where there is a tie, we choose the simpler model. Actually, we should probably penalize the more complex models, even if they are more accurate, since more complex models are more likely to overfit (discussed later). [Figure: three decision boundaries of increasing complexity, with accuracy = 94%, 100%, and 100%.]

54 Predictive Accuracy III
Accuracy = Number of correct classifications / Number of instances in our database. Accuracy is a single number; we may be better off looking at a confusion matrix. This gives us additional useful information… [Table: rows are the true labels (Cat, Dog, Pig), columns are the labels each instance was classified as; off-diagonal counts show which classes get confused with which.]
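Building such a confusion matrix is a one-liner over (true, predicted) pairs. A minimal sketch; the Cat/Dog/Pig labels follow the slide, but the example counts below are made up for illustration.

```python
from collections import Counter

def confusion_matrix(true_labels, predicted_labels):
    """Tally (true, predicted) pairs -- richer than a single accuracy number."""
    return Counter(zip(true_labels, predicted_labels))

cm = confusion_matrix(["Cat", "Cat", "Dog", "Pig"],
                      ["Cat", "Dog", "Dog", "Pig"])
# cm[("Cat", "Dog")] counts cats that were misclassified as dogs.
```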

55 Speed and Scalability I
We need to consider the time and space requirements for the two distinct phases of classification. Time to construct the classifier: in the case of the simple linear classifier, the time taken to fit the line; this is linear in the number of instances. Time to use the model: in the case of the simple linear classifier, the time taken to test which side of the line the unlabeled instance is on; this can be done in constant time. As we shall see, some classification algorithms are very efficient in one aspect, and very poor in the other.

56 Speed and Scalability II
For learning with small datasets, this is the whole picture However, for data mining with massive datasets, it is not so much the (main memory) time complexity that matters, rather it is how many times we have to scan the database. This is because for most data mining operations, disk access times completely dominate the CPU times. For data mining, researchers often report the number of times you must scan the database.

57 Robustness I We need to consider what happens when we have:
Noise: for example, a person's age could have been mistyped as 650 instead of 65; how does this affect our classifier? (This is important only for building the classifier; if the instance to be classified is noisy we can do nothing.) Missing values: for example, suppose we want to classify an insect, but we only know the abdomen length (X-axis), and not the antennae length (Y-axis); can we still classify the instance?

58 Robustness II We need to consider what happens when we have:
Irrelevant features: for example, suppose we want to classify people as either Suitable_Grad_Student or Unsuitable_Grad_Student, and it happens that scoring more than 5 on a particular test is a perfect indicator for this problem… If we also use "hair_length" as a feature, how will this affect our classifier?

59 Robustness III We need to consider what happens when we have:
Streaming data: for many real-world problems, we don't have a single fixed dataset. Instead, the data continuously arrives, potentially forever… (stock market, weather data, sensor data, etc.) Can our classifier handle streaming data?

60 Interpretability Some classifiers offer a bonus feature: the structure of the learned classifier tells us something about the domain. As a trivial example, if we try to classify people's health risks based on just their height and weight, we could gain the following insight (based on the observation that a single linear classifier does not work well, but two linear classifiers do): there are two ways to be unhealthy, being obese and being too skinny. [Plot: Weight vs. Height, with two linear boundaries.]

61 Nearest Neighbor Classifier
Evelyn Fix, Joe Hodges. If the nearest instance to the previously unseen instance is a Katydid, then the class is Katydid; else the class is Grasshopper. [Scatterplot: Antenna Length vs. Abdomen Length, Katydids and Grasshoppers.]

62 We can visualize the nearest neighbor algorithm in terms of a decision surface…
Note that we don't actually have to construct these surfaces; they are simply the implicit boundaries that divide the space into regions "belonging" to each instance. This division of space is called a Dirichlet tessellation (or Voronoi diagram, or Thiessen regions).

63 The nearest neighbor algorithm is sensitive to outliers…
The solution is to…

64 We can generalize the nearest neighbor algorithm to the K- nearest neighbor (KNN) algorithm.
We measure the distance to the nearest K instances, and let them vote. K is typically chosen to be an odd number. K = 1 K = 3
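The voting scheme can be sketched as follows. A toy illustration, not the lecture's code; Euclidean distance is assumed (later slides relax this).

```python
import math
from collections import Counter

def knn_classify(training, query, k=3):
    """K-nearest-neighbor vote.

    training: list of (point, label) pairs, where point is a tuple of
    feature values. k is typically chosen to be an odd number so that
    a two-class vote cannot tie.
    """
    # Sort the training instances by Euclidean distance to the query...
    by_dist = sorted(training, key=lambda pl: math.dist(pl[0], query))
    # ...then let the k nearest instances vote.
    votes = Counter(label for _, label in by_dist[:k])
    return votes.most_common(1)[0][0]
```

With k = 1 this reduces to the plain nearest neighbor classifier of the earlier slide.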

65 The nearest neighbor algorithm is sensitive to irrelevant features…
Suppose the following is true: if an insect's antenna is longer than 5.5, it is a Katydid; otherwise it is a Grasshopper. Using just the antenna length we get perfect classification! Suppose, however, we add in an irrelevant feature, for example the insect's mass. Using both the antenna length and the insect's mass with the 1-NN algorithm, we get the wrong classification!

66 How do we mitigate the nearest neighbor algorithms sensitivity to irrelevant features?
Use more training instances Ask an expert what features are relevant to the task Use statistical tests to try to determine which features are useful Search over feature subsets (in the next slide we will see why this is hard)

67 Why searching over feature subsets is hard
Suppose you have the following classification problem, with 100 features, where it happens that Features 1 and 2 (the X and Y below) give perfect classification, but all 98 of the other features are irrelevant… Using all 100 features will give poor results, but so will using only Feature 1, and so will using only Feature 2! Of the 2^100 − 1 possible subsets of the features, only one really works.

68 [Figure: the lattice of subsets of features {1, 2, 3, 4}, from the singletons {1}, {2}, {3}, {4} through the pairs and triples up to {1,2,3,4}.] Forward Selection. Backward Elimination. Bi-directional Search.
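As a sketch of why these searches help: greedy forward selection examines on the order of n² subsets rather than all 2^n − 1. This is only an illustration; the `evaluate` scoring function is a hypothetical stand-in for, say, cross-validated accuracy.

```python
def forward_selection(features, evaluate):
    """Greedy forward selection over feature subsets.

    Start with no features; repeatedly add the single feature that most
    improves evaluate(subset); stop when no addition helps.
    """
    selected = []
    best_score = evaluate(selected)
    improved = True
    while improved:
        improved = False
        for f in features:
            if f in selected:
                continue
            score = evaluate(selected + [f])
            if score > best_score:
                best_score, best_feature = score, f
                improved = True
        if improved:
            selected.append(best_feature)
    return selected
```

Backward elimination is the mirror image (start with all features, greedily drop), and bi-directional search interleaves the two.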

69 The nearest neighbor algorithm is sensitive to the units of measurement
X axis measured in centimeters, Y axis measured in dollars: the nearest neighbor to the pink unknown instance is red. X axis measured in millimeters, Y axis measured in dollars: the nearest neighbor to the pink unknown instance is blue. One solution is to normalize the units to pure numbers. Typically the features are Z-normalized to have a mean of zero and a standard deviation of one: x' = (x − mean(x)) / std(x).
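The Z-normalization step might look like this, a sketch using only the standard library (`pstdev` is the population standard deviation; it assumes the feature is not constant).

```python
import statistics

def z_normalize(values):
    """Rescale one feature to mean 0 and standard deviation 1,
    so the choice of units (mm vs cm, dollars vs cents) no longer matters."""
    mu = statistics.mean(values)
    sigma = statistics.pstdev(values)
    return [(v - mu) / sigma for v in values]
```

Note that measuring the same feature in centimeters or millimeters now yields identical normalized values, so the nearest neighbor no longer flips.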

70 We can speed up the nearest neighbor algorithm by "throwing away" some data. This is called data editing. Note that this can sometimes improve accuracy! We can also speed up classification with indexing. One possible approach: delete all instances that are surrounded by members of their own class.

71 Up to now we have assumed that the nearest neighbor algorithm uses the Euclidean distance; however, this need not be the case… Max (p = inf), Manhattan (p = 1), Weighted Euclidean, Mahalanobis.
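The Max, Manhattan, and Euclidean distances are all instances of the Lp (Minkowski) family, which can be sketched as:

```python
def minkowski(x, y, p):
    """Lp distance between points x and y.

    p = 1 is Manhattan, p = 2 is Euclidean, p = float('inf') is Max.
    """
    diffs = [abs(a - b) for a, b in zip(x, y)]
    if p == float("inf"):
        return max(diffs)
    return sum(d ** p for d in diffs) ** (1 / p)
```

(Weighted Euclidean scales each coordinate difference before summing; Mahalanobis additionally accounts for correlations between features via the covariance matrix.)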

72 …In fact, we can use the nearest neighbor algorithm with any distance/similarity function
For example, is "Faloutsos" Greek or Irish? We could compare the name "Faloutsos" to a database of names using string edit distance… edit_distance(Faloutsos, Keogh) = 8, edit_distance(Faloutsos, Gunopulos) = 6. Hopefully, the similarity of the name (particularly the suffix) to other Greek names would mean the nearest neighbor is also a Greek name. [Table — ID, Name, Class: 1 Gunopulos (Greek), 2 Papadopoulos, 3 Kollios, 4 Dardanos, 5 Keogh (Irish), 6 Gough, 7 Greenhaugh, 8 Hadleigh.] Specialized distance measures exist for DNA strings, time series, images, graphs, videos, sets, fingerprints, etc…

73 Edit Distance Example: Peter and Piotr
How similar are the names "Peter" and "Piotr"? Assume the following cost function: substitution 1 unit, insertion 1 unit, deletion 1 unit. Then D(Peter, Piotr) is 3. It is possible to transform any string Q into string C using only substitution, insertion, and deletion. Assume that each of these operators has a cost associated with it. The similarity between two strings can be defined as the cost of the cheapest transformation from Q to C. (Note that for now we have ignored the issue of how we can find this cheapest transformation.) Peter → Piter (substitution, i for e) → Pioter (insertion, o) → Piotr (deletion, e). [Figure: a family tree of variants of the name Peter: Piotr, Pyotr, Petros, Pietro, Pedro, Pierre, Piero.]
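The cheapest transformation can in fact be found with the classic dynamic-programming recurrence; a sketch (not the lecture's code), using the unit costs assumed above:

```python
def edit_distance(q, c):
    """Levenshtein distance: cheapest sequence of unit-cost substitutions,
    insertions, and deletions turning string q into string c."""
    # prev[j] = cost of turning the first i-1 chars of q into the first j of c.
    prev = list(range(len(c) + 1))
    for i, qc in enumerate(q, 1):
        curr = [i]  # turning i chars of q into the empty string: i deletions
        for j, cc in enumerate(c, 1):
            cost = 0 if qc == cc else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution (or match)
        prev = curr
    return prev[-1]
```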

74 Dear SIR, I am Mr. John Coleman and my sister is Miss Rose Colemen, we are the children of late Chief Paul Colemen from Sierra Leone. I am writing you in absolute confidence primarily to seek your assistance to transfer our cash of twenty one Million Dollars ($21, ) now in the custody of a private Security trust firm in Europe the money is in trunk boxes deposited and declared as family valuables by my late father as a matter of fact the company does not know the content as money, although my father made them to under stand that the boxes belongs to his foreign partner.

75 This mail is probably spam
This mail is probably spam. The original message has been attached along with this report, so you can recognize or block similar unwanted mail in future. See for more details. Content analysis details: (12.20 points, 5 required) NIGERIAN_SUBJECT2 (1.4 points) Subject is indicative of a Nigerian spam FROM_ENDS_IN_NUMS (0.7 points) From: ends in numbers MIME_BOUND_MANY_HEX (2.9 points) Spam tool pattern in MIME boundary URGENT_BIZ (2.7 points) BODY: Contains urgent matter US_DOLLARS_ (1.5 points) BODY: Nigerian scam key phrase ($NN,NNN,NNN.NN) DEAR_SOMETHING (1.8 points) BODY: Contains 'Dear (something)' BAYES_ (1.6 points) BODY: Bayesian classifier says spam probability is 30 to 40% [score: ]

76 Acknowledgements Some of the material used in this lecture is drawn from other sources: Chris Clifton Jiawei Han Dr. Hongjun Lu (Hong Kong Univ. of Science and Technology) Graduate students from Simon Fraser Univ., Canada, notably Eugene Belchev, Jian Pei, and Osmar R. Zaiane Graduate students from Univ. of Illinois at Urbana-Champaign Dr. Bhavani Thuraisingham (MITRE Corp. and UT Dallas)

77

78 Making Good Figures We are going to see many figures this quarter
I personally feel that making good figures is very important to a paper's chance of acceptance.

79 Fig. 1. Sequence graph example
[Figure: the original sequence graph, and a redrawn version captioned "Fig. 1. A sample sequence graph. The line thickness encodes relative entropy."] What's wrong with this figure? Let me count the ways… None of the arrows line up with the "circles". The "circles" are all different sizes and aspect ratios. The (normally invisible) white bounding box around the numbers breaks the arrows in many places. The figure caption has almost no information. On the right is my redrawing of the figure with PowerPoint; it took me 300 seconds. This figure is an insult to reviewers. It says, "we expect you to spend an unpaid hour to review our paper, but we don't think it worthwhile to spend 5 minutes to make clear figures".

80 Fig. 1. Sequence graph example
Note that there are figures drawn seven hundred years ago that have much better symmetry and layout. Peter Damian, Paulus Diaconus, and others, various saints' lives: Netherlands, S. or France, N. W.; 2nd quarter of the 13th century. Let us see some more examples of poor figures, then see some principles that can help.

81 This figure wastes 80% of the space it takes up.
In any case, it could be replaced by a short English sentence: "We found that for selectivity ranging from 0 to 0.05, the four methods did not differ by more than 5%." Why did they bother with the legend, since you can't tell the four lines apart anyway? [The offending figure comes from a paper on mm-GNAT, an index structure for arbitrary Lp norms.]

82 This figure wastes almost a quarter of a page.
The ordering on the X-axis is arbitrary, so the figure could be replaced with the sentence "We found the average performance was 198 with a standard deviation of 11.2". The paper in question had 5 similar plots, wasting an entire page.

83 The figure below takes up 1/6 of a page, but it only reports 3 numbers.
An Energy-Efficient Data Collection Framework for Wireless Sensor Networks by Exploiting Spatiotemporal Correlation Chong Liu, Student Member, IEEE, Kui Wu, Member, IEEE, and Jian Pei, Senior Member, IEEE

84 The figure below takes up 1/6 of a page, but it only reports 2 numbers!
Actually, it really only reports one number! Only the relative times really matter, so they could have written “We found that FTW is 1007 times faster than the exact calculation, independent of the sequence length”.

85 Both figures below describe the classification of time series motions…
[Original figure:] It is not obvious from this figure which algorithm is best; the caption has almost zero information; you need to read the text very carefully to understand the figure. [Redesign by Keogh:] At a glance we can see that the accuracy is very high. We can also see that DTW tends to win when the… The data is plotted in Figure 5. Note that any correctly classified motions must appear in the upper left (gray) triangle: in this region our algorithm wins; in this region DTW wins. Figure 5. Each of our 100 motions plotted as a point in 2 dimensions. The X value is set to the distance to the nearest neighbor from the same class, and the Y value is set to the distance to the nearest neighbor from any other class. Chuanjun Li, B. Prabhakaran and S.Q. Zheng, "Similarity Measure for Multi-Attribute Data," Proc. ICASSP IEEE International Conference on Acoustics, Speech, and Signal Processing, Philadelphia, PA, USA, March 18-23, 2005, II-1152.

86 This should be a bar chart; the four items are unrelated.
(In any case, this should probably be a table, not a figure.)

87 This pie chart takes up a lot of space to communicate two numbers (better as a table, or as simple text): people that have heard of Pacman vs. people that have not. From "A Database Architecture For Real-Time Motion Retrieval".

88 Principles to make Good Figures
Think about the point you want to make: should it be done with words, a table, or a figure? If a figure, what kind? Color helps (but you cannot depend on it). Linking helps (sometimes called brushing). Direct labeling helps. Meaningful captions help. Minimalism helps (omit needless elements). Finally, taking great care, taking pride in your work, helps.

89 Title: Miscellany with various astronomical, calendrical, medical, and philosophical texts. Origin: England. Date: 14th century. Language: Latin. Script: Gothic. Artists: perhaps the de Foxton Master (see Scott 1996). Decoration: 1 large initial in red and blue with red and blue pen-flourishing (f. 33); large initials in blue with red pen-flourishing or in red with blue pen-flourishing; smaller initials in blue with red pen-flourishing and in red with blue pen-flourishing; rubrics, paraphs, and underlining in red. Dimensions in mm: 230 x 170 (165 x 130), in two columns.

90 Direct labeling helps It removes one level of indirection, and allows the figures to be self-explaining (see Edward Tufte, Visual Explanations, Chapter 4). Figure 10. Stills from a video sequence; the right hand is tracked, and converted into a time series: A) Hand at rest. B) Hand moving above holster. C) Hand moving down to grasp gun. D) Hand moving to shoulder level. E) Aiming gun.

91 Linking helps interpretability I
What is linking? Linking is connecting the same data in two views by using the same color (or thickness, etc.). In the figures below, color links the data in the pie chart with data in the scatterplot. How did we get from here to here? It is not clear from the above figure; see the next slide for a suggested fix. [Figure: a pie chart (Fish, Fowl, Neither, Both) and a scatterplot of the same data.]

92 Linking helps interpretability II
In this figure, the color of the arrows inside the fish link to the colors of the arrows on the time series. This tells us exactly how we go from a shape to a time series. Note that there are other links, for example in II, you can tell which fish is which based on color or link thickness linking. Minimalism helps: In this case, numbers on the X-axis do not mean anything, so they are deleted.

93 A nice example of linking
© Sinauer

94 Do we need all the numbers to annotate the X and Y axis?
[Figure: Detection Rate vs. False Alarm Rate for EBEL, ABEL, DROP1, and DROP2, from "Incremental exemplar learning schemes for classification on embedded devices".] Direct labeling helps. Note that the line thicknesses differ by powers of 2, so even in a B/W printout you can tell the four lines apart. Don't cover the data with the labels! You are implicitly saying "the results are not that important". Do we need all the numbers to annotate the X and Y axes? Can we remove the text "With Ranking"? Minimalism helps: delete the "With Ranking", the X-axis numbers, the grid…

95 Covering the data with the labels is a common sin

96 Color helps - Direct labeling helps - Meaningful captions help
These two images, which are both used to discuss an anomaly detection algorithm, illustrate many of the points discussed in previous slides: color helps, direct labeling helps, meaningful captions help. The images should be as self-contained as possible, to avoid forcing the reader to look back to the text for clarification multiple times. Note that while Figure 6 uses color to highlight the anomaly, it also uses line thickness (hard to see in PowerPoint); thus this figure also works well in B/W printouts.

97 Sometime between next week and the end of the quarter
Find a poor figure in a data mining paper. Create a fixed version of it. Present one to three slides about it at the beginning of a class. Before you start to work on the poor figure, run it by me.

