CS 235 Nearest Neighbor Classification

Slides:



Advertisements
Similar presentations
Debugging ACL Scripts.
Advertisements

CATHERINE AND ANNIE Python: Part 3. Intro to Loops Do you remember in Alice when you could use a loop to make a character perform an action multiple times?
Prepared by Abdullah Mueen and Eamonn Keogh
Chapter 6 Structured Data 1. A data structure, or record, in COBOL is a method of combining several variables into one larger variable. – Example: 2.
Classification Dr Eamonn Keogh Computer Science & Engineering Department University of California - Riverside Riverside,CA Who.
This material in not in your text (except as exercises) Sequence Comparisons –Problems in molecular biology involve finding the minimum number of edit.
Three kinds of learning
These slides are based on Tom Mitchell’s book “Machine Learning” Lazy learning vs. eager learning Processing is delayed until a new instance must be classified.
Introduction to a Programming Environment
1 Project 5: Median. 2 The median of a collection of numbers is the member for which there are an equal number less than or equal and greater than or.
JQuery Page Slider. Our goal is to get to the functionality of the Panic Coda web site.Panic Coda web site.
Foundations of Software Testing Chapter 5: Test Selection, Minimization, and Prioritization for Regression Testing Last update: September 3, 2007 These.
Matlab Workshop 1/10/07 Lesson 1: Matlab as a graphing calculator.
Problem of the Day  Why are manhole covers round?
Change in your CAD Project File - it happens all the time in robotics.
Semi-Supervised Time Series Classification Li Wei Eamonn Keogh University of California, Riverside {wli,
Using Lists Games Programming in Scratch. Games Programming in Scratch Extension – Using Lists Learning Objectives Create a temporary data store (list)
CS 106 Introduction to Computer Science I 03 / 02 / 2007 Instructor: Michael Eckmann.
Info Read SEGY Wavelet estimation New Project Correlate near offset far offset Display Well Tie Elog Strata Geoview Hampson-Russell References Create New.
The test data is now online:. Everyone has their own two datasets (small and large). You can find your dataset by looking at the code below.
Data Science Credibility: Evaluating What’s Been Learned
Creativity of Algorithms & Simple JavaScript Commands
Presented by Yuting Liu
CS Fall 2016 (Shavlik©), Lecture 5
Games Programming in Scratch
Prof. Mark Glauser Created by: David Marr
Slides by Eamonn Keogh (UC Riverside)
Week 13: Searching and Sorting
CS005 Introduction to Programming
CS005 Introduction to Programming
What to do when a test fails
OptiSystem applications: BER analysis of BPSK with RS encoding
CS1371 Introduction to Computing for Engineers
CS 235 Decision Tree Classification
Aditya P. Mathur Purdue University
8/28/15 Today I will explain the steps to the scientific method
ROUNDING AND SIGNIFICANT FIGURES
Macrosystems EDDIE: Getting Started + Troubleshooting Tips
Eamonn Keogh CS005 Introduction to Programming: Matlab Eamonn Keogh
Project 2 datasets are now online.
Lesson 1: Introduction to Trifacta Wrangler
Lesson 1: Introduction to Trifacta Wrangler
We understand classification algorithms in terms of the expressiveness or representational power of their decision boundaries. However, just because your.
The Five Stages of Writing
Instructor: Shengyu Zhang
CS005 Introduction to Programming
CMSC201 Computer Science I for Majors Lecture 19 – Recursion
Properties of the Real Numbers Part I
07/12/2018 Starter L.O. To be able to Solve a quadratic by factorising
The datasets have 100 instances, and 100 features
Lectures on Graph Algorithms: searching, testing and sorting
ARRAYS 1 GCSE COMPUTER SCIENCE.
Do While (condition is true) … Loop
Lesson 1 – Chapter 1C Trifacta Interface Navigation
Nested Loops & The Step Parameter
Sorting "There's nothing in your head the sorting hat can't see. So try me on and I will tell you where you ought to be." -The Sorting Hat, Harry Potter.
Ensemble learning.
Building Java Programs
Project 2, The Final Project
Seating “chart” Front Back 4 rows 5 rows 5 rows 4 rows 2 rows 2 rows
CMSC201 Computer Science I for Majors Final Exam Information
MTM for volunteers 2019.
Review of Previous Lesson
Evaluating Classifiers
of “Redistribution” (i.e., “Borrowing and Carrying”)
Assignment 1: Classification by K Nearest Neighbors (KNN) technique
CSCE 206 Lab Structured Programming in C
Simple Case Studies Using MASS
Software Development Techniques
Sign Language Recognition With Unsupervised Feature Learning
Presentation transcript:

CS 235 Nearest Neighbor Classification Dear Students, please complete the exercises in this document. They will not be graded or collected, but follow-ups assignments that expand on this example will be. We are going to do Nearest Neighbor Classification. Important note: Matlab has (a very nice) built in Nearest Neighbor Classification tool, but we are not going to use it! I want us to see more explicitly what is happening, and we are going to need to modify our code later. To be clear, I have run a bunch of examples below, but I expect you to start by re-running the code yourself to confirm all my results. All the file you need are here: http://www.cs.ucr.edu/~eamonn/CS235/

CS 235 Nearest Neighbor Classification: Warm up We are going to use a standard data format, used by 95% of the datasets out there. The class labels will be distinct integers starting at ‘1’. So, for example ‘1’ might be spam, ‘2’ might be work-emails etc. The class labels will be in the first column (this is inconsistent with some examples in my slides). The rest of the columns are the features. This instance is in class 1 This instance is in class 2

Data Format Each of the datasets comes in two parts, a TRAIN partition and a TEST partition. For example, for the synthetic control dataset we have two files, synthetic_control_TEST and synthetic_control_TRAIN The two files will be in the same format, but are generally of different sizes. The files are in the standard ASCII format that can be read directly by most tools/languages. For example, to read the two synthetic control dataset s into Matlab, we can type… >> TRAIN = load('synthetic_control_TRAIN'); >> TEST  = load('synthetic_control_TEST' ); …at the command line. There is one data instance per row. The first value in the row is the class label (an integer between 1 and the number of classes). The rest of the row are the data values, and individual time series. This instance is in class 1 This instance is in class 2

What is the data we are going to classify What is the data we are going to classify? In a sense, it does not matter, at least for the code we have to write. However, they are time series objects. Think of heartbeats, or 60 days of the value of your stocks, etc In this case, the time series classes correspond to 6 different situations that can occur during an hour of manufacturing some auto parts. For example, class 3 corresponds to a steady increasing of temperature, but class 2 corresponds to an oscillating temperature (with each item having a slightly different period) etc. It is always a good idea to plot some data as a sanity check, as I have done. TRAIN = load('synthetic_control_TRAIN'); subplot(2,3,1) for i = 1: 300 if TRAIN(i,1) == 1 subplot(2,3,1);, plot(TRAIN(i,2:end));, hold on; elseif TRAIN(i,1) == 2 subplot(2,3,2);, plot(TRAIN(i,2:end));, hold on; elseif TRAIN(i,1) == 3 subplot(2,3,3);, plot(TRAIN(i,2:end));, hold on; elseif TRAIN(i,1) == 4 subplot(2,3,4);, plot(TRAIN(i,2:end));, hold on; elseif TRAIN(i,1) == 5 subplot(2,3,5);, plot(TRAIN(i,2:end));, hold on; elseif TRAIN(i,1) == 6 subplot(2,3,6);, plot(TRAIN(i,2:end));, hold on; end

Minor point For many datasets, you could swap the feature columns, and it would make no difference. For example, if we have {GPA, AGE, SAT}, it would make no difference if we rearranged them to {AGE, SAT, GPA}. But for time series, the columns are arranged in a meaningful order, we cannot permute them. {GPA, AGE, SAT}

What is the default error-rate for this problem? >> [class count] = mode(TRAIN(:,1)); >> 1-count/size(TRAIN(:,1),1) ans = 0.8333 The default error rate is 0.8333 or 83.33% Which means…. The default accuracy rate is 0.1667 or 16.67% (do you understand why the code above is computing the default error rate?)

1NN Classification You should run this simple piece of matlab code. function UCR_time_series_test %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% (C) Eamonn Keogh %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% TRAIN = load('synthetic_control_TRAIN'); % Only these two lines need to be changed to test a different dataset. % TEST  = load('synthetic_control_TEST' ); % Only these two lines need to be changed to test a different dataset. % %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% TRAIN_class_labels = TRAIN(:,1);     % Pull out the class labels. TRAIN(:,1) = [];                     % Remove class labels from training set. TEST_class_labels = TEST(:,1);       % Pull out the class labels. TEST(:,1) = [];                      % Remove class labels from testing set. correct = 0;  % Initialize the number we got correct for i = 1 : length(TEST_class_labels) % Loop over every instance in the test set        classify_this_object = TEST(i,:);    this_objects_actual_class = TEST_class_labels(i);    predicted_class = Classification_Algorithm(TRAIN,TRAIN_class_labels, classify_this_object);    if predicted_class == this_objects_actual_class        correct = correct + 1; % we got one more correct    end;    disp([int2str(i), ' out of ', int2str(length(TEST_class_labels)), ' done']) % Report progress end; %%%%%%%%%%%%%%%%% Create Report %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% disp(['The dataset you tested has ', int2str(length(unique(TRAIN_class_labels))), ' classes']) disp(['The training set is of size ', int2str(size(TRAIN,1)),', and the test set is of size ',int2str(size(TEST,1)),'.']) disp(['The time series are of length ', int2str(size(TRAIN,2))]) disp(['The error rate was ',num2str((length(TEST_class_labels)-correct )/length(TEST_class_labels))]) %%%%%%%%%%%%%%%%% End Report %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% % Here is a sample classification algorithm, it is the simple (yet very competitive) one-nearest % neighbor using the Euclidean distance. %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% function predicted_class = Classification_Algorithm(TRAIN,TRAIN_class_labels,unknown_object) best_so_far = inf;  for i = 1 : length(TRAIN_class_labels)      compare_to_this_object = TRAIN(i,:);      distance = sqrt(sum((compare_to_this_object - unknown_object).^2)); % Euclidean distance         if distance < best_so_far           predicted_class = TRAIN_class_labels(i);      best_so_far = distance;     end end; 1NN Classification You should run this simple piece of matlab code. We will go over this in class, but… You need to understand this code! You will be modifying it extensively. >> UCR_time_series_test 1 out of 300 done 2 out of 300 done … 299 out of 300 done 300 out of 300 done The dataset you tested has 6 classes The training set is of size 300, and the test set is of size 300. The time series are of length 60 The error rate was 0.12

>> UCR_time_series_test 1 out of 300 done … 300 out of 300 done The dataset you tested has 6 classes The training set is of size 300, and the test set is of size 300. The time series are of length 60 The error rate was 0.12 Our training set had 300 items, and we got an error rate of 0.12 What would happen if we had a smaller training set? To see, we just need to add two new lines of code. We first shuffle the data, then we keep only the first 200 datapoints. Add 2 lines Our training set now has 200 items, and our error rate went up to 0.136 You might get a little different results, because of the random shuffle (run the code a few times…. >> UCR_time_series_test 1 out of 300 done … 300 out of 300 done The dataset you tested has 6 classes The training set is of size 200, and the test set is of size 300. The time series are of length 60 The error rate was 0.136

As shown below, lets try an even smaller dataset, only 20% of the original size. You should run the code ten times, to get a feel for the variability (recall, each run is a random subset of the full training data) Our training set now has 60 items, and our error rate went up to 0.23667 You might get a little different results, because of the random shuffle (run the code a few times…. >> UCR_time_series_test 1 out of 300 done … 300 out of 300 done The dataset you tested has 6 classes The training set is of size 60, and the test set is of size 300. The time series are of length 60 The error rate was 0.23667

Old (full original) New (truncated) (return to the default code I gave you) We have a full hour of data for each object. What would happen if we only had the first 30 minutes? We can find out by adding these two lines of code to lines 4 and 5 of the code I gave you. TRAIN(:,32:end) =[]; TEST(:,32:end) =[]; To see what difference it makes to the error-rate, see next slide. Old (full original) New (truncated)

We went from this error rate with 60 minutes >> UCR_time_series_test 1 out of 300 done … 300 out of 300 done The dataset you tested has 6 classes The training set is of size 300, and the test set is of size 300. The time series are of length 60 The error rate was 0.12 We went from this error rate with 60 minutes .. to this error rate with only 30 minutes Let’s cut out more data, and look only at the first ten minutes! The error rate is now much higher. TRAIN(:,12:end) =[]; TEST(:,12:end) =[]; >> UCR_time_series_test 1 out of 300 done … 300 out of 300 done The dataset you tested has 6 classes The training set is of size 300, and the test set is of size 300. The time series are of length 30 The error rate was 0.1433 >> UCR_time_series_test 1 out of 300 done … 300 out of 300 done The dataset you tested has 6 classes The training set is of size 300, and the test set is of size 300. The time series are of length 10 The error rate was 0.2533

(Red Passion Flower Butterfly) I want you to appreciate the generality of the simple code/algorithm you have seen. You have classified industrial traces. However, many kinds of data can be represented as a time series / row of features, so the code you have can classify fish, butterfly's, faces, handwriting, gestures, golf swings, leaves, types of stars etc. 100 200 300 400 500 600 700 800 900 Heliconius melpomene (The Postman) Heliconius erato (Red Passion Flower Butterfly) Acer circinatum (Oregon Vine Maple)

Assignment The code I gave you assumes two data files, the train and the test. I want you to modify it, to do leave-one-out cross-validation. To be clear, leave-one-out cross-validation does not use the TEST set, only the TRAIN set. You can write it from scratch, or you can modify my code. In either case, I can’t imagine this will take you more than 20 or so lines of code. If you find yourself writing more than 50 lines of code, stop! Talk to me, you have the wrong idea. You are going to be using this code for more, graded assignments, so make sure it works correctly and is well commented. function error_rate = leave_one_out_cross_validation TRAIN = load('synthetic_control_TRAIN'); % % TEST = load('synthetic_control_TEST' ); Commented out, we don’t use this for cross validation