
Dear Reader, This is a slightly expanded version of the classification1 slides. It is expanded, starting on page 50, to allow a brief review of some concepts and to show some new motivating examples. I just wanted to avoid forcing you to keep swapping between files if you are following along on a tablet ;-)

2- Syllable Classification. Recall this slide? We can deactivate (knock-out, KO) genes in mice, and see what happens to their songs… 1- Syllable Extraction, 2- Syllable Classification. [Figure: labeled syllable sequences (e.g. …1311…, …12521…) for a normal mouse vs. a P53-KO mouse]

Somehow I figured out that the first sound was a ‘1’, and the second was a ‘2’. How can we do this?

The Classification Problem (informal definition): Given a collection of annotated data (in this case, 5 instances of Symbol 1 and 5 of Symbol 2), decide what type of sound the unlabeled example is. Symbol 1 or Symbol 2?

Data Mining/Machine Learning. Machine learning explores the study and construction of algorithms that can learn from data. Basic Idea: Instead of trying to create a very complex program to do X, use a (relatively) simple program that can learn to do X. Example: Instead of trying to program a car to drive (If light(red) && NOT(pedestrian) || speed(X) <= 12 && .. ), create a program that watches humans drive, and learns how to drive*. *Currently, self driving cars do a bit of both.

Why Machine Learning I. Why do machine learning instead of just writing an explicit program? It is often much cheaper, faster and more accurate. It may be possible to teach a computer something that we are not sure how to program. For example: We could explicitly write a program to tell if a person is obese: If (weight_kg / (height_m * height_m)) > 30, printf("Obese"). We would find it hard to write a program to tell if a person is sad. However, we could easily obtain 1,000 photographs of sad people / not-sad people, and ask a machine learning algorithm to learn to tell them apart.
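The explicit obesity rule above really is a one-liner; a minimal sketch in Python (the function name and the sample weights/heights are illustrative, not from the slides):

```python
def is_obese(weight_kg: float, height_m: float) -> bool:
    """Hand-written explicit rule: BMI = weight / height^2, obese if BMI > 30."""
    bmi = weight_kg / (height_m * height_m)
    return bmi > 30

print(is_obese(95, 1.7))  # BMI is about 32.9, so True
print(is_obese(70, 1.8))  # BMI is about 21.6, so False
```

By contrast, there is no comparably crisp rule for "sad", which is exactly why we would hand the photographs to a learning algorithm instead.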

The Classification Problem (informal definition): Given a collection of annotated data (in this case, 5 instances of Katydids and 5 of Grasshoppers), decide what type of insect the unlabeled example is. Katydid or Grasshopper?

The Classification Problem (informal definition): Given a collection of annotated data (in this case, 3 instances of Canadian and 3 of American), decide what type the unlabeled example is. Canadian or American?

For any domain of interest, we can measure features: Color {Green, Brown, Gray, Other}, Has Wings?, Abdomen Length, Thorax Length, Antennae Length, Mandible Size, Spiracle Diameter, Leg Length.

Sidebar 1: In data mining, we usually don’t have a choice of what features to measure. The data is not usually collected with data mining in mind. The features we really want may not be available: Why? ____________________ We typically have to use (a subset of) whatever data we are given.

Sidebar 2: In data mining, we can sometimes generate new features. For example: Feature X = Abdomen Length / Antennae Length.

We can store features in a database (My_Collection):

Insect ID | Abdomen Length | Antennae Length | Insect Class
        1 |            2.7 |             5.5 | Grasshopper
        2 |            8.0 |             9.1 | Katydid
        3 |            0.9 |             4.7 |
        4 |            1.1 |             3.1 |
        5 |            5.4 |             8.5 |
        6 |            2.9 |             1.9 | Grasshopper
        7 |            6.1 |             6.6 | Katydid
        8 |            0.5 |             1.0 |
        9 |            8.3 |                 |
       10 |            8.1 |             4.7 | Katydids

The classification problem can now be expressed as: Given a training database (My_Collection), predict the class label of a previously unseen instance. previously unseen instance = 11, 5.1, 7.0, ???????

[Scatter plot: Antenna Length vs. Abdomen Length for Grasshoppers and Katydids]

We will also use this larger dataset as a motivating example… [Scatter plot: Antenna Length vs. Abdomen Length for Grasshoppers and Katydids] Each of these data objects is called an exemplar, a (training) example, an instance, or a tuple.

We will return to the previous slide in two minutes. In the meantime, we are going to play a quick game. I am going to show you some classification problems which were shown to pigeons! Let us see if you are as smart as a pigeon!

Pigeon Problem 1. Examples of class A: (3, 4), (1.5, 5), (6, 8), (2.5, 5). Examples of class B: (5, 2.5), (5, 2), (8, 3), (4.5, 3).

Pigeon Problem 1. Examples of class A: (3, 4), (1.5, 5), (6, 8), (2.5, 5). Examples of class B: (5, 2.5), (5, 2), (8, 3), (4.5, 3). What class is this object? (8, 1.5) What about this one, A or B? (4.5, 7)

Pigeon Problem 1. This, (8, 1.5), is a B! Here is the rule: If the left bar is smaller than the right bar, it is an A; otherwise it is a B.

Pigeon Problem 2. Examples of class A: (4, 4), (5, 5), (6, 6), (3, 3). Examples of class B: (5, 2.5), (2, 5), (5, 3), (2.5, 3). Oh! This one, (8, 1.5), is hard! Even I know this one: (7, 7).

Pigeon Problem 2. Examples of class A: (4, 4), (5, 5), (6, 6), (3, 3). Examples of class B: (5, 2.5), (2, 5), (5, 3), (2.5, 3). The rule is as follows: if the two bars are of equal size, it is an A; otherwise it is a B. So this one, (7, 7), is an A.

Pigeon Problem 3. Examples of class A: (4, 4), (1, 5), (6, 3), (3, 7). Examples of class B: (6, 6), (5, 6), (7, 5), (4, 8). This one is really hard! What is this, A or B? (7, 7)

Pigeon Problem 3. It is a B! The rule is as follows: if the square of the sum of the two bars is less than or equal to 100, it is an A; otherwise it is a B.

Why did we spend so much time with this stupid game? Because we wanted to show that almost all classification problems have a geometric interpretation, check out the next 3 slides…

Pigeon Problem 1. [Scatter plot: Left Bar vs. Right Bar] Examples of class A: (3, 4), (1.5, 5), (6, 8), (2.5, 5). Examples of class B: (5, 2.5), (5, 2), (8, 3), (4.5, 3). Here is the rule again: If the left bar is smaller than the right bar, it is an A; otherwise it is a B.

Pigeon Problem 2. [Scatter plot: Left Bar vs. Right Bar] Examples of class A: (4, 4), (5, 5), (6, 6), (3, 3). Examples of class B: (5, 2.5), (2, 5), (5, 3), (2.5, 3). Let me look it up… here it is… the rule is: if the two bars are of equal size, it is an A; otherwise it is a B.

Pigeon Problem 3. [Scatter plot: Left Bar vs. Right Bar, axes 10 to 100] Examples of class A: (4, 4), (1, 5), (6, 3), (3, 7). Examples of class B: (5, 6), (7, 5), (4, 8), (7, 7). The rule again: if the square of the sum of the two bars is less than or equal to 100, it is an A; otherwise it is a B.

[Scatter plot: Antenna Length vs. Abdomen Length for Grasshoppers and Katydids]

We can “project” the previously unseen instance (11, 5.1, 7.0, ???????) into the same space as the database. We have now abstracted away the details of our particular problem. It will be much easier to talk about points in space. [Scatter plot: Antenna Length vs. Abdomen Length, Katydids and Grasshoppers]

Simple Linear Classifier (R. A. Fisher, 1890-1962). If the previously unseen instance is above the line, then class is Katydid; else class is Grasshopper. [Scatter plot with a linear decision boundary separating Katydids from Grasshoppers]
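As a concrete (if simplified) sketch, the code below learns a linear decision rule by placing the boundary halfway between the two class centroids. This is not Fisher's actual discriminant, just about the simplest linear classifier one could write; the training points are taken from the insect table above.

```python
def train_linear(katydids, grasshoppers):
    """Learn a linear rule: the boundary is the hyperplane through the
    midpoint of the two class centroids, perpendicular to the line joining them."""
    def mean(points):
        n = len(points)
        return [sum(p[i] for p in points) / n for i in range(len(points[0]))]
    m_k, m_g = mean(katydids), mean(grasshoppers)
    w = [a - b for a, b in zip(m_k, m_g)]          # normal vector of the boundary
    mid = [(a + b) / 2 for a, b in zip(m_k, m_g)]  # a point on the boundary
    b = -sum(wi * mi for wi, mi in zip(w, mid))
    return w, b

def classify(w, b, x):
    score = sum(wi * xi for wi, xi in zip(w, x)) + b
    return "Katydid" if score > 0 else "Grasshopper"

katydids = [(8.0, 9.1), (5.4, 8.5), (6.1, 6.6)]      # (abdomen, antennae)
grasshoppers = [(2.7, 5.5), (0.9, 4.7), (1.1, 3.1)]
w, b = train_linear(katydids, grasshoppers)
print(classify(w, b, (5.1, 7.0)))  # the previously unseen instance: Katydid
```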

Simple Quadratic Classifier, Simple Cubic Classifier, Simple Quartic Classifier, Simple Quintic Classifier, Simple….. If the previously unseen instance is above the line, then class is Katydid; else class is Grasshopper.

The simple linear classifier is defined for higher dimensional spaces…

… we can visualize it as being an (n−1)-dimensional hyperplane

It is interesting to think about what would happen in this example if we did not have the 3rd dimension…

We can no longer get perfect accuracy with the simple linear classifier… We could try to solve this problem by using a simple quadratic classifier or a simple cubic classifier… However, as we will later see, this is probably a bad idea…

Which of the “Pigeon Problems” can be solved by the Simple Linear Classifier? [Problem 1: Perfect. Problem 2: Useless. Problem 3: Pretty Good.] Problems that can be solved by a linear classifier are called linearly separable.

Revisiting Sidebar 2: What would happen if we created a new feature Z, where Z = abs(X.value − Y.value)? [Plot of the data in the new feature: all blue points are perfectly aligned, so we can only see one]

A Famous Problem: R. A. Fisher’s Iris Dataset. 3 classes, 50 of each class. The task is to classify Iris plants into one of 3 varieties using the Petal Length and Petal Width. Iris Setosa, Iris Versicolor, Iris Virginica.

We can generalize the piecewise linear classifier to N classes, by fitting N-1 lines. In this case we first learned the line to (perfectly) discriminate between Setosa and Virginica/Versicolor, then we learned to approximately discriminate between Virginica and Versicolor. Setosa Versicolor Virginica If petal width > 3.272 – (0.325 * petal length) then class = Virginica Elseif petal width…

We have now seen one classification algorithm, and we are about to see more. How should we compare them?
Predictive accuracy.
Speed and scalability: time to construct the model, time to use the model, efficiency in disk-resident databases.
Robustness: handling noise, missing values and irrelevant features, streaming data.
Interpretability: understanding and insight provided by the model.

Predictive Accuracy I. How do we estimate the accuracy of our classifier? We can use Hold Out data. We divide the dataset into 2 partitions, called train and test. We build our models on train, and see how well we do on test. [In the insect database, instances 1-5 form the train partition and instances 6-10 form the test partition]

Predictive Accuracy II. How do we estimate the accuracy of our classifier? We can use K-fold cross validation. We divide the dataset into K equal-sized sections. The algorithm is tested K times, each time leaving out one of the K sections from building the classifier, but using it to test the classifier instead. Accuracy = Number of correct classifications / Number of instances in our database. (K = 5 in the insect database example.)
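A minimal sketch of K-fold cross validation. The interleaved fold split and the toy always-predict-the-majority model below are illustrative choices, not from the slides:

```python
def k_fold_accuracy(data, labels, train_fn, predict_fn, k=5):
    """K-fold cross validation: train on K-1 folds, test on the held-out fold."""
    n = len(data)
    folds = [list(range(i, n, k)) for i in range(k)]  # simple interleaved split
    correct = 0
    for fold in folds:
        train_idx = [i for i in range(n) if i not in fold]
        model = train_fn([data[i] for i in train_idx],
                         [labels[i] for i in train_idx])
        for i in fold:  # test only on the instances left out of training
            if predict_fn(model, data[i]) == labels[i]:
                correct += 1
    return correct / n

# Toy model: always predict the majority class of the training labels.
def train(data, labels):
    return max(set(labels), key=labels.count)

def predict(model, x):
    return model

data = list(range(10))
labels = ["A"] * 8 + ["B"] * 2
print(k_fold_accuracy(data, labels, train, predict, k=5))  # 0.8
```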

The Default Rate. How accurate can we be if we use no features? The answer is called the Default Rate: the size of the most common class, over the size of the full dataset. Examples: I want to predict the sex of some pregnant friends’ babies. The most common class is ‘boy’, so I will always say ‘boy’. I do just a tiny bit better than random guessing. I want to predict the sex of the nurse that will give me a flu shot next week. The most common class is ‘female’, so I will say ‘female’.

Predictive Accuracy III. Using K-fold cross validation is a good way to set any parameters we may need to adjust in (any) classifier. We can do K-fold cross validation for each possible setting, and choose the model with the highest accuracy. Where there is a tie, we choose the simpler model. Actually, we should probably penalize the more complex models, even if they are more accurate, since more complex models are more likely to overfit (discussed later). [Three classifiers of increasing complexity: Accuracy = 94%, 99%, 100%]

Predictive Accuracy III. Accuracy = Number of correct classifications / Number of instances in our database. Accuracy is a single number; we may be better off looking at a confusion matrix. This gives us additional useful information…

                     True label is…
Classified as a…     Cat   Dog   Pig
Cat                  100
Dog                    9    90     1
Pig                   45    10
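A confusion matrix is easy to accumulate from (true, predicted) label pairs; a minimal sketch, with invented toy labels:

```python
from collections import defaultdict

def confusion_matrix(true_labels, predicted_labels):
    """counts[predicted][true]: rows are what the classifier said, columns are the truth."""
    counts = defaultdict(lambda: defaultdict(int))
    for t, p in zip(true_labels, predicted_labels):
        counts[p][t] += 1
    return counts

true = ["Cat", "Cat", "Dog", "Dog", "Pig", "Pig"]
pred = ["Cat", "Cat", "Dog", "Pig", "Pig", "Dog"]
m = confusion_matrix(true, pred)
print(m["Cat"]["Cat"])  # 2: both Cats classified correctly
print(m["Pig"]["Dog"])  # 1: one Dog was classified as a Pig
```

Off-diagonal cells are exactly the "additional useful information": they show which classes get confused with which.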

Speed and Scalability I. We need to consider the time and space requirements for the two distinct phases of classification: Time to construct the classifier: in the case of the simple linear classifier, the time taken to fit the line; this is linear in the number of instances. Time to use the model: in the case of the simple linear classifier, the time taken to test which side of the line the unlabeled instance is on. This can be done in constant time. As we shall see, some classification algorithms are very efficient in one aspect, and very poor in the other.

Speed and Scalability II. For learning with small datasets, this is the whole picture. However, for data mining with massive datasets, it is not so much the (main memory) time complexity that matters; rather, it is how many times we have to scan the database. This is because for most data mining operations, disk access times completely dominate the CPU times. For data mining, researchers often report the number of times you must scan the database.

Robustness I. We need to consider what happens when we have: Noise. For example, a person’s age could have been mistyped as 650 instead of 65; how does this affect our classifier? (This is important only for building the classifier; if the instance to be classified is noisy we can do nothing.) Missing values. For example, suppose we want to classify an insect, but we only know the abdomen length (X-axis), and not the antennae length (Y-axis); can we still classify the instance?

Robustness II. We need to consider what happens when we have: Irrelevant features. For example, suppose we want to classify people as either Suitable_Grad_Student or Unsuitable_Grad_Student, and it happens that scoring more than 5 on a particular test is a perfect indicator for this problem… If we also use “hair_length” as a feature, how will this affect our classifier?

Robustness III. We need to consider what happens when we have: Streaming data. For many real world problems, we don’t have a single fixed dataset. Instead, the data continuously arrives, potentially forever… (stock market, weather data, sensor data etc.) Can our classifier handle streaming data?

Interpretability. Some classifiers offer a bonus feature. The structure of the learned classifier tells us something about the domain. As a trivial example, if we try to classify people’s health risks based on just their height and weight, we could gain the following insight (based on the observation that a single linear classifier does not work well, but two linear classifiers do): there are two ways to be unhealthy, being obese and being too skinny. [Scatter plot: Weight vs. Height]

Review begins here

Pigeon Problems Revisited: “Pigeons (Columba livia) as Trainable Observers of Pathology and Radiology Breast Cancer Images”

Fig 2. Examples of benign (left) and malignant (right) breast specimens stained with hematoxylin and eosin, at different magnifications. Levenson RM, Krupinski EA, Navarro VM, Wasserman EA (2015) Pigeons (Columba livia) as Trainable Observers of Pathology and Radiology Breast Cancer Images. PLoS ONE 10(11): e0141357. doi:10.1371/journal.pone.0141357 http://journals.plos.org/plosone/article?id=info:doi/10.1371/journal.pone.0141357

Review: Given a labeled dataset, we can use a machine learning algorithm (in this case a linear classifier) to automatically learn a program (in this case, a simple IF statement) to classify the data. [Plot: the Simple Linear Classifier’s decision boundary] The decision boundary is a good way to think about algorithms. If the previously unseen instance is above the line, then class is Katydid; else class is Grasshopper.

Review: Even if we only think about the simple polynomial classifiers (linear classifier | quadratic classifier | cubic classifier | etc.), we are often forced to make choices. Should we use height and weight, or just weight? Should we use a linear classifier or a quadratic classifier? We could make choices in many ways, but in most cases, we can just try many possibilities with K-fold cross validation, and pick the best. Let’s review K-fold cross validation in the next slide… [Three classifiers of increasing complexity: Accuracy = 94%, 99%, 100%]

Predictive Accuracy review. How do we estimate the accuracy of our classifier? We can use K-fold cross validation. We divide the dataset into K equal-sized sections. The algorithm is tested K times, each time leaving out one of the K sections from building the classifier, but using it to test the classifier instead. Accuracy = Number of correct classifications / Number of instances in our database. (K = 5 in the insect database example.)

Review: We have now seen one classification algorithm, and we are about to see more. How should we compare them?
Predictive accuracy.
Speed and scalability: time to construct the model, time to use the model, efficiency in disk-resident databases.
Robustness: handling noise, missing values and irrelevant features, streaming data.
Interpretability: understanding and insight provided by the model.

Review ends here. Remember this example? Here someone just gave us the features. Let’s see an example where we have to think about what features we need…

Western Pipistrelle (Parastrellus hesperus). Photo by Michael Durham.

We can easily measure two features of bat calls: their characteristic frequency and their call duration. Why not measure the loudness? [Plot: Characteristic frequency vs. Call duration of Western pipistrelle calls]

How well would the linear classifier work here? (next slide)

Nearest Neighbor Classifier (Evelyn Fix, 1904-1965; Joe Hodges, 1922-2000). If the nearest instance to the previously unseen instance is a Katydid, then class is Katydid; else class is Grasshopper. [Scatter plot: Antenna Length vs. Abdomen Length]
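The whole algorithm fits in a few lines; a minimal 1-NN sketch in Python, using Euclidean distance and a few (abdomen, antennae) points from the insect table above:

```python
import math

def nearest_neighbor(training, query):
    """training: list of ((features...), label) pairs.
    Return the label of the training instance closest to the query."""
    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    return min(training, key=lambda item: dist(item[0], query))[1]

training = [((2.7, 5.5), "Grasshopper"), ((8.0, 9.1), "Katydid"),
            ((0.9, 4.7), "Grasshopper"), ((5.4, 8.5), "Katydid")]
print(nearest_neighbor(training, (5.1, 7.0)))  # Katydid
```

Note that "training" is just remembering the data; all the work happens at classification time.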

We can visualize the nearest neighbor algorithm in terms of a decision surface… Note that we don’t actually have to construct these surfaces; they are simply the implicit boundaries that divide the space into regions “belonging” to each instance. This division of space is called a Dirichlet Tessellation (or Voronoi diagram, or Thiessen regions).

Breadfruit (Artocarpus altilis) Hawaii 2015

The nearest neighbor algorithm is sensitive to outliers… The solution is to…

We can generalize the nearest neighbor algorithm to the K-nearest neighbor (KNN) algorithm. We measure the distance to the nearest K instances, and let them vote. K is typically chosen to be an odd number. [Examples with K = 1 and K = 3] Note: this “K” has nothing to do with the “K” in K-fold cross validation!
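A sketch of the K-NN generalization. The toy data below is invented: it plants one stray “B” inside the “A” cluster, to show how K = 3 voting can overrule a single outlier that would fool 1-NN:

```python
import math
from collections import Counter

def knn(training, query, k=3):
    """K-NN: the k closest training instances vote; an odd k avoids ties."""
    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    neighbors = sorted(training, key=lambda item: dist(item[0], query))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

train_set = [((1.0, 1.0), "A"), ((1.2, 0.8), "A"), ((0.9, 1.1), "A"),
             ((5.0, 5.0), "B"), ((1.0, 0.95), "B")]  # last point: an outlier in A territory
print(knn(train_set, (1.0, 0.9), k=1))  # "B": the single nearest point is the outlier
print(knn(train_set, (1.0, 0.9), k=3))  # "A": with 3 voters, the As win 2-1
```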

The nearest neighbor algorithm is sensitive to irrelevant features… Suppose the following is true: if an insect’s antenna is longer than 5.5, it is a Katydid; otherwise it is a Grasshopper. Using just the antenna length, we get perfect classification!

Suppose, however, we add in an irrelevant feature, for example the insect’s mass. Using both the antenna length and the insect’s mass with the 1-NN algorithm, we get the wrong classification!

How do we mitigate the nearest neighbor algorithm’s sensitivity to irrelevant features? Use more training instances. Ask an expert what features are relevant to the task. Use statistical tests to try to determine which features are useful. Search over feature subsets (in the next slide we will see why this is hard).

Why searching over feature subsets is hard: Suppose you have the following classification problem, with 100 features, where it happens that Features 1 and 2 (the X and Y below) give perfect classification, but all 98 of the other features are irrelevant… Using all 100 features will give poor results, but so will using only Feature 1, and so will using only Feature 2! Of the 2^100 − 1 possible subsets of the features, only one really works.

[Lattice of feature subsets: {1}, {2}, {3}, {4}; {1,2}, {1,3}, {1,4}, {2,3}, {2,4}, {3,4}; {1,2,3}, {1,2,4}, {1,3,4}, {2,3,4}; {1,2,3,4}] Forward Selection, Backward Elimination, Bi-directional Search.

The nearest neighbor algorithm is sensitive to the units of measurement. With the X axis measured in centimeters and the Y axis measured in dollars, the nearest neighbor to the pink unknown instance is red. With the X axis measured in millimeters and the Y axis measured in dollars, the nearest neighbor to the pink unknown instance is blue. One solution is to normalize the units to pure numbers. Typically the features are Z-normalized to have a mean of zero and a standard deviation of one: X = (X − mean(X)) / std(X).
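A sketch of Z-normalization. The two invented lists below hold the same four measurements, once in millimeters and once in centimeters, to show that after normalization the choice of unit no longer matters:

```python
import math

def z_normalize(values):
    """Rescale to mean 0, standard deviation 1: x -> (x - mean) / std."""
    mean = sum(values) / len(values)
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
    return [(v - mean) / std for v in values]

mm = [190.5, 242.6, 195.0, 240.0]   # same lengths in millimeters...
cm = [19.05, 24.26, 19.50, 24.00]   # ...and in centimeters
print(z_normalize(mm))  # identical lists after normalization
print(z_normalize(cm))
```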

We can speed up the nearest neighbor algorithm by “throwing away” some data. This is called data editing. Note that this can sometimes improve accuracy! (The paper I asked you to read will make that clearer.) One possible approach: delete all instances that are surrounded by members of their own class. We can also speed up classification with indexing.

Up to now we have assumed that the nearest neighbor algorithm uses the Euclidean Distance; however, this need not be the case… Other options include Max (p = inf), Manhattan (p = 1), Weighted Euclidean, and Mahalanobis.
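The Euclidean, Manhattan, and Max distances are all members of the Minkowski (Lp) family; a small sketch (the weighted and Mahalanobis variants are omitted here):

```python
def minkowski(a, b, p):
    """L_p distance: p=1 is Manhattan, p=2 is Euclidean;
    the max-norm is the limit as p goes to infinity."""
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1 / p)

a, b = (0.0, 0.0), (3.0, 4.0)
print(minkowski(a, b, 1))  # 7.0 (Manhattan)
print(minkowski(a, b, 2))  # 5.0 (Euclidean)
print(max(abs(x - y) for x, y in zip(a, b)))  # 4.0 (the p = inf case)
```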

…In fact, we can use the nearest neighbor algorithm with any distance/similarity function. For example, is “Faloutsos” Greek or Irish? We could compare the name “Faloutsos” to a database of names using string edit distance… edit_distance(Faloutsos, Keogh) = 8; edit_distance(Faloutsos, Gunopulos) = 6. Hopefully, the similarity of the name (particularly the suffix) to other Greek names would mean the nearest neighbor is also a Greek name.

ID | Name         | Class
 1 | Gunopulos    | Greek
 2 | Papadopoulos | Greek
 3 | Kollios      | Greek
 4 | Dardanos     | Greek
 5 | Keogh        | Irish
 6 | Gough        | Irish
 7 | Greenhaugh   | Irish
 8 | Hadleigh     | Irish

Specialized distance measures exist for DNA strings, time series, images, graphs, videos, sets, fingerprints etc…

Edit Distance Example. How similar are the names “Peter” and “Piotr”? Assume the following cost function: Substitution 1 unit, Insertion 1 unit, Deletion 1 unit. Then D(Peter, Piotr) is 3. It is possible to transform any string Q into string C using only Substitution, Insertion and Deletion. Assume that each of these operators has a cost associated with it. The similarity between two strings can be defined as the cost of the cheapest transformation from Q to C. (Note that for now we have ignored the issue of how we can find this cheapest transformation.) Peter → Piter (Substitution, i for e) → Pioter (Insertion, o) → Piotr (Deletion, e). [Variants of the name: Piotr, Pyotr, Petros, Pietro, Pedro, Pierre, Piero, Peter]
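The cost of the cheapest transformation can be found with dynamic programming; a compact sketch of the standard Levenshtein algorithm, using the unit costs from the slide:

```python
def edit_distance(q, c):
    """Levenshtein distance: unit-cost substitution, insertion, deletion."""
    prev = list(range(len(c) + 1))  # distance from "" to each prefix of c
    for i, qc in enumerate(q, 1):
        curr = [i]
        for j, cc in enumerate(c, 1):
            cost = 0 if qc == cc else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution (or free match)
        prev = curr
    return prev[-1]

print(edit_distance("Peter", "Piotr"))          # 3, as on the slide
print(edit_distance("Faloutsos", "Keogh"))      # 8
print(edit_distance("Faloutsos", "Gunopulos"))  # 6
```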

The Classification Problem (informal definition): Given a collection of annotated data (in this case, 3 instances of Canadian and 3 of American), decide what type the unlabeled example is. Canadian or American?

Blame Canada

For any domain of interest, we can measure features: Color {Green, Brown, Gray, Other}, Has Wings?, Abdomen Length, Thorax Length, Antennae Length, Mandible Size, Spiracle Diameter, Leg Length.

What features can we cheaply measure from coins? 1. Diameter 2. Thickness 3. Weight 4. Electrical Resistance 5. ? Probably not color or other optical features.

The Ideal Case. In the best case, we would find a single feature that would strongly separate the coins. Diameter is clearly such a feature for the simpler case of pennies vs. quarters. [Histogram of diameters: pennies nominally 19.05 mm, quarters nominally 24.26 mm, with a decision threshold between them]

Usage. Once we learn the threshold, we no longer need to keep the data. When an unknown coin comes in, we measure the feature of interest, and see which side of the decision threshold it lands on: IF diameter(unknown_coin) < 22, coin_type = ‘penny’; ELSE coin_type = ‘quarter’; END. [Histogram with the decision threshold between the two coin types]

Let us revisit the original problem of classifying Canadian vs. American Quarters. Which of our features (if any) are useful? 1. Diameter 2. Thickness 3. Weight 4. Electrical Resistance. I measured these features for 50 Canadian and 50 American quarters….

Feature 1, Diameter: Here I have 99% blue on the right side, but the left side is about 50/50 green/blue.

Feature 2, Thickness: Here I have all green on the left side, but the right side is about 50/50 green/blue.

Feature 3, Weight: The weight feature seems very promising. It is not perfect, but the left side is about 92% blue, and the right side about 92% green.

Feature 4, Electrical Resistance: The electrical resistance feature seems promising. Again, it is not perfect, but the left side is about 89% blue, and the right side about 89% green.

We can try all possible pairs of features: {Diameter, Weight}, {Diameter, Electrical Resistance}, {Thickness, Weight}, {Thickness, Electrical Resistance}, {Weight, Electrical Resistance}… [Scatter plot of {Diameter, Thickness} (features 1, 2)] This combination does not work very well.

[Scatter plot of {Diameter, Weight} (features 1, 3)]

For brevity, some combinations are omitted Let us jump to the last combination…

[Scatter plot of {Weight, Electrical Resistance} (features 3, 4)]

We can also try all possible triples of features: {Diameter, Thickness, Weight}, {Diameter, Thickness, Electrical Resistance}, etc. [3D scatter plot of {Diameter, Thickness, Weight} (features 1, 2, 3)] This combination does not work that well.

[Lattice of all subsets of the four features Diameter (1), Thickness (2), Weight (3), Electrical Resistance (4): {1}, {2}, {3}, {4}; {1,2}, {1,3}, {1,4}, {2,3}, {2,4}, {3,4}; {1,2,3}, {1,2,4}, {1,3,4}, {2,3,4}; {1,2,3,4}]

Given a set of N features, there are 2^N − 1 feature subsets we can test. In this case, we can test all of them (exhaustive search), but in general, this is not possible. 10 features = 1,023 subsets; 20 features = 1,048,575; 100 features = 1,267,650,600,228,229,401,496,703,205,375. We typically resort to greedy search. Greedy Forward Selection: Initial state: the empty set (no features). Operators: add a single feature. Evaluation Function: K-fold cross validation.

The Default Rate (review). How accurate can we be if we use no features? The answer is called the Default Rate: the size of the most common class, over the size of the full dataset. Examples: I want to predict the sex of some pregnant friends’ babies. The most common class is ‘boy’, so I will always say ‘boy’. I do just a tiny bit better than random guessing. I want to predict the sex of the nurse that will give me a flu shot next week. The most common class is ‘female’, so I will say ‘female’.

Greedy Forward Selection. Initial state: the empty set (no features). Operators: add a feature. Evaluation Function: K-fold cross validation. [Search lattice over subsets of features 1-4; current subset: {}]

Greedy Forward Selection. Initial state: the empty set (no features). Operators: add a feature. Evaluation Function: K-fold cross validation. [Search lattice; path so far: {} → {3}]

Greedy Forward Selection. Initial state: the empty set (no features). Operators: add a single feature. Evaluation function: k-fold cross-validation.
[Lattice of feature subsets of {1,2,3,4}, accuracy axis 20–100; path so far: {} → {3} → {3,4}]

Greedy Forward Selection. Initial state: the empty set (no features). Operators: add a single feature. Evaluation function: k-fold cross-validation.
[Lattice of feature subsets of {1,2,3,4}, accuracy axis 20–100; final path: {} → {3} → {3,4} → {1,3,4}]
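The evaluation function used at every node of this search is k-fold cross-validation. The index bookkeeping can be sketched as follows (a minimal generator; real libraries also shuffle the data first):

```python
def k_fold_splits(n, k):
    # Partition indices 0..n-1 into k contiguous folds; each fold
    # serves once as the test set while the rest form the training set.
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n))
        yield train, test
        start += size

splits = list(k_fold_splits(10, 5))
```

For each candidate feature subset, we would train on each `train` slice, test on the matching `test` slice, and average the k accuracies.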

Setting Parameters and Overfitting
You need to classify widgets; you get a training set.
Model selection: you could use a linear classifier or nearest neighbor…
Parameter selection (also called parameter tuning, or tweaking):
Nearest neighbor: you could use 1NN, 3NN, 5NN…; you could use Euclidean distance, L1, L∞, Mahalanobis…; you could do some data editing; you could do some feature weighting; you could…
"Linear classifier": you could use a constant classifier, a linear classifier, a quadratic classifier; you could…

Overfitting Overfitting occurs when a statistical model describes random error or noise instead of the underlying relationship. Overfitting generally occurs when a model is excessively complex, such as having too many parameters relative to the number of observations. A model which has been overfit will generally have poor predictive performance, as it can exaggerate minor fluctuations in the data.

Suppose we need to solve a classification problem. We are not sure if we should use the simple linear classifier or the simple quadratic classifier. How do we decide which to use? We do cross-validation or leave-one-out and choose the best one.

Simple linear classifier gets 81% accuracy; simple quadratic classifier gets 99% accuracy.
[Two scatter plots of the training data, axes 10–100, showing the linear and quadratic decision boundaries]

Simple linear classifier gets 96% accuracy; simple quadratic classifier gets 97% accuracy.

This problem is greatly exacerbated by having too little data. Simple linear classifier gets 90% accuracy; simple quadratic classifier gets 95% accuracy.

What happens as we have more and more training examples?
The accuracy for all models goes up!
The chance of making a mistake (choosing the wrong model) goes down.
Even if we make a mistake, it will not matter too much, because we would learn a degenerate quadratic, which is basically a straight line.
With little data: simple linear 70% accuracy, simple quadratic 90% accuracy.
With more data: simple linear 90% accuracy, simple quadratic 95% accuracy.
With lots of data: simple linear 99.999999% accuracy, simple quadratic 99.999999% accuracy.

One Solution: Charge a Penalty for Complex Models
For example, for the simple polynomial classifier, we could "charge" 1% for every increase in the degree of the polynomial.
Simple linear classifier gets 90.5% accuracy, minus 0, equals 90.5%.
Simple quadratic classifier gets 97.0% accuracy, minus 1, equals 96.0%.
Simple cubic classifier gets 97.05% accuracy, minus 2, equals 95.05%.
[Three scatter plots on 1–10 axes showing the linear, quadratic, and cubic fits, with accuracies 90.5%, 97.0%, and 97.05%]
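The penalty scheme above amounts to a one-line scoring rule; applied to the slide's numbers it picks the quadratic classifier, not the slightly-more-accurate cubic:

```python
def penalized_score(accuracy, degree):
    # Charge one percentage point for every increase in polynomial
    # degree beyond linear (degree 1), as in the example above.
    return accuracy - 1.0 * (degree - 1)

# degree -> raw accuracy (%) from the slide
raw = {1: 90.5, 2: 97.0, 3: 97.05}
scores = {d: penalized_score(a, d) for d, a in raw.items()}
best_degree = max(scores, key=scores.get)
```

The 1%-per-degree charge is ad hoc; the next slide points to more principled penalties such as Minimum Description Length.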

One Solution: Charge a Penalty for Complex Models
For example, for the simple polynomial classifier, we could charge 1% for every increase in the degree of the polynomial.
There are more principled ways to charge penalties. In particular, there is a technique called Minimum Description Length (MDL). We will revisit this after we have seen more classifiers.