K Nearest Neighbors and Instance-Based Methods
Villanova University Machine Learning Project
Learning by Analogy: Case-Based Reasoning
Case-based systems are a significant branch of artificial intelligence in their own right. A case-based system has two major components:
- A case base: a growing set of cases, analogous to either a knowledge base or a training set
- A problem solver: a case retriever and a case reasoner; it may also have a case installer
Case-Based Retrieval
Cases are described as a set of features. Retrieval uses methods such as (see the sketch after this list):
- Nearest neighbor: compare all features of the new case to every case in the case base and choose the closest match
- Indexed: compute and store indices with each case and retrieve cases with matching indices
- Domain-based model clustering: the case base is organized into a domain model; insertion is harder, but retrieval is easier
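As a minimal sketch of nearest-neighbor retrieval: the case base below, its feature names (echoing the glass example on the next slide), and the distance function are illustrative assumptions, not part of the original slides.

```python
import math

# Hypothetical case base: each case is a dict of feature values plus a stored outcome.
case_base = [
    ({"Na": 13.0, "K": 0.06}, "building windows"),
    ({"Na": 14.4, "K": 0.58}, "containers"),
]

def distance(a, b):
    """Euclidean distance over the shared feature names."""
    return math.sqrt(sum((a[f] - b[f]) ** 2 for f in a))

def retrieve(new_case):
    """Compare the new case to every stored case; return the closest match."""
    return min(case_base, key=lambda case: distance(case[0], new_case))

print(retrieve({"Na": 13.2, "K": 0.10}))  # -> the nearest stored case and its outcome
```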
Examples
- Glass classification in Weka: features are the values for Na, K, etc.
- Text classification ("documents like this one"): features are the word frequencies in the document
Simple Case-Based Reasoning Example
A frequency matrix for diagnosing system problems is a simple case-based example:
- Representation is a matrix of observed symptoms and causes
- Each case is an entry in a cell of the matrix
- The critic is the actual outcome of the case
- The learner adds an entry to the appropriate cells
- The performer matches symptoms and chooses possible causes
Car Diagnosis
A frequency matrix for car diagnosis. Columns are candidate causes: Battery dead, Out of gas, Alternator bad, Battery bad. Rows are symptoms, with observed cases recorded in the cells:
- Car won't start: case 2, case 3
- Car stalls at stoplights: case 4, case 5
- Car misfires in rainy weather: (no cases yet)
- Lights won't come on: (no cases yet)
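A minimal sketch of how such a frequency matrix could be represented. The dictionary layout, the helper names, and the assignment of cases to specific causes are illustrative assumptions, not from the slides.

```python
from collections import defaultdict

# Frequency matrix: symptom -> cause -> list of observed cases.
matrix = defaultdict(lambda: defaultdict(list))

def learner(symptom, cause, case_id):
    """The critic supplied the actual cause; record the case in its cell."""
    matrix[symptom][cause].append(case_id)

def performer(symptom):
    """Match the symptom; rank possible causes by how often each occurred."""
    cells = matrix[symptom]
    return sorted(cells, key=lambda cause: len(cells[cause]), reverse=True)

learner("Car won't start", "Battery dead", "case 2")   # cause assignment assumed
learner("Car won't start", "Out of gas", "case 3")     # cause assignment assumed
print(performer("Car won't start"))  # causes ordered by observed frequency
```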
Case-Based Reasoning
- Definition of relevant features is critical: we need the ones which influence outcomes, at the right level of granularity
- The reasoner can be a complex planning and what-if reasoning system, or a simple query for missing data
- It only really becomes a "learning" system if there is a case installer as well; it can then grow cumulatively
K-Nearest Neighbor
- All of the training instances together form the trained system
- For a new case, determine the "distance" to each training instance, typically Euclidean distance, Manhattan distance, or a weighted distance metric
- Use the k nearest instances to determine the class (a minimal implementation follows)
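A minimal sketch of the whole procedure; the toy data and function names are assumptions for illustration.

```python
import math
from collections import Counter

def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn_classify(query, training_data, k=3):
    """Find the k training instances nearest the query; majority vote on class."""
    neighbors = sorted(training_data, key=lambda inst: euclidean(inst[0], query))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

# Toy training set: (feature vector, class label)
training_data = [((1.0, 1.0), "A"), ((1.2, 0.9), "A"),
                 ((5.0, 5.1), "B"), ((4.8, 5.3), "B")]
print(knn_classify((1.1, 1.0), training_data, k=3))  # -> "A"
```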
Distance Measures
- Euclidean: the length of the straight line between two points
- Manhattan: "block distance"; the shortest path between two points using only 90-degree turns
- Weighted: a variant of Euclidean giving more weight to some dimensions
The three metrics are sketched below.
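A sketch of the three metrics; the weight vector in the weighted variant is an illustrative assumption.

```python
import math

def euclidean(a, b):
    """Straight-line distance: square root of summed squared differences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    """Block distance: sum of absolute differences along each axis."""
    return sum(abs(x - y) for x, y in zip(a, b))

def weighted_euclidean(a, b, weights):
    """Euclidean variant: each dimension's squared difference is scaled by a weight."""
    return math.sqrt(sum(w * (x - y) ** 2 for x, y, w in zip(a, b, weights)))

print(euclidean((0, 0), (3, 4)))                         # 5.0
print(manhattan((0, 0), (3, 4)))                         # 7
print(weighted_euclidean((0, 0), (3, 4), (1.0, 0.25)))   # sqrt(9 + 4) ~= 3.61
```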
Example
[Figure: instances plotted on Feature 1 vs. Feature 2, with two query points marked "?" to be classified by their nearest neighbors.]
KNN: What Value for k?
There is a tradeoff between looking at more neighbors and fewer:
- Larger k: ignores noise better, with less risk of an outlier distorting the decision, but is computationally more expensive
- Smaller k: faster, and does not risk forcing distant neighbors into the decision
A common approach: start with k = 1, then 3, etc., until accuracy drops. Weka has a capability to do this automatically (a sketch of the search follows).
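A sketch of that search using leave-one-out accuracy on toy data; the stopping rule and helper names are assumptions for illustration, not Weka's exact procedure.

```python
import math
from collections import Counter

def knn_classify(query, data, k):
    """Majority vote among the k instances nearest the query (Euclidean)."""
    dist = lambda a, b: math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    nearest = sorted(data, key=lambda inst: dist(inst[0], query))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

def loo_accuracy(data, k):
    """Leave-one-out accuracy: classify each instance against all the others."""
    hits = sum(knn_classify(f, data[:i] + data[i + 1:], k) == label
               for i, (f, label) in enumerate(data))
    return hits / len(data)

data = [((1.0, 1.0), "A"), ((1.2, 0.9), "A"), ((5.0, 5.1), "B"), ((4.8, 5.3), "B")]
best_k, best_acc = 1, loo_accuracy(data, 1)
for k in range(3, len(data), 2):      # try k = 3, 5, ... until accuracy drops
    acc = loo_accuracy(data, k)
    if acc < best_acc:
        break
    best_k, best_acc = k, acc
print(best_k, best_acc)
```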
KNN Advantages
- Incremental: each new instance for which we get feedback can be added to the training data
- Training is very fast (lazy!)
- All information is retained
- Can learn quite complex relationships
KNN Disadvantages
- Uses a lot of storage, since all instances are retained
- Slow at query time
- Sensitive to irrelevant features
- Does not create a general model which can be examined
KNN in Weka
- The basic KNN classifier in Weka is IBk (Instance-Based, k neighbors), under the Lazy category
- The default k value is 1; it is settable in the Choose window (right-click)
- Setting crossValidate to True will use hold-one-out cross-validation to choose the best k between 1 and the value set in the parameters
- windowSize can be used to set a limit on the number of training cases; new cases replace the oldest. A value of zero (the default) means no limit
A rough non-Weka analogue of these settings is sketched below.
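The slides use Weka's GUI; purely as a hedged analogue of the same ideas, here is a sketch using scikit-learn's KNeighborsClassifier (scikit-learn is not Weka, and nothing here is Weka's own API). windowSize has no direct scikit-learn counterpart.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, LeaveOneOut
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Analogue of IBk's default: k = 1.
knn = KNeighborsClassifier(n_neighbors=1).fit(X, y)

# Rough analogue of crossValidate=True: hold-one-out search for the best k.
search = GridSearchCV(KNeighborsClassifier(),
                      param_grid={"n_neighbors": range(1, 11)},
                      cv=LeaveOneOut())
search.fit(X, y)
print(search.best_params_)  # the k with the best hold-one-out accuracy
```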
IBk Outputs
- IBk gives us the same output sections as J48
- However, under Classifier Model we see only: "IB1 instance-based classifier using 1 nearest neighbour(s) for classification"
- IBk does not show us anything comparable to J48's decision tree; an instance-based method could only show the entire database of examples
- For KNN we will be most interested in what happens with new examples