Nearest Neighbors CSC 576: Data Mining

Today… Measures of Similarity Distance Measures Nearest Neighbors

Similarity and Dissimilarity Measures Used by a number of data mining techniques: Nearest neighbors Clustering …

How to measure “proximity”? Proximity: similarity or dissimilarity between two objects. Similarity: a numerical measure of the degree to which two objects are alike, usually in the range [0,1], where 0 = no similarity and 1 = complete similarity. Dissimilarity: a measure of the degree to which two objects are different.
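When a [0,1] similarity score is needed but only a distance is available, one common convention is to transform the distance. A minimal sketch (the function name is mine, not from the slides):

# Sketch: convert a dissimilarity (distance) d >= 0 into a similarity in [0, 1].
def similarity_from_distance(d):
    return 1.0 / (1.0 + d)   # d = 0 -> 1.0 (identical); large d -> close to 0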

Feature Space Abstract n-dimensional space Each instance is plotted in a feature space One axis for each descriptive feature Difficult to visualize when # of features > 3

As the differences between the values of the descriptive features grow, so too does the distance between the points in the feature space that represent these instances.

Distance Metric: a function metric(a, b) that returns the distance between two instances a and b. Mathematically, it must satisfy four criteria: Non-negativity: metric(a, b) ≥ 0. Identity: metric(a, b) = 0 ⟺ a = b. Symmetry: metric(a, b) = metric(b, a). Triangle Inequality: metric(a, c) ≤ metric(a, b) + metric(b, c).

Dissimilarities between Data Objects A common measure for the proximity between two objects is the Euclidean Distance: d(x, y) = √( Σ_{k=1..n} (x_k − y_k)² ), where x and y are two data objects with n dimensions. In high school, we typically used this for calculating the distance between two objects when there were only two dimensions; it is defined for one dimension, two dimensions, three dimensions, …, any n-dimensional space.
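A minimal Python sketch of the same formula (the function and variable names are my own):

import math

# Euclidean (L2) distance between two n-dimensional points x and y.
def euclidean_distance(x, y):
    return math.sqrt(sum((xk - yk) ** 2 for xk, yk in zip(x, y)))

# euclidean_distance([5.00, 2.50], [2.75, 7.50])  ->  about 5.4829 (matches the next slide)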

Example d(id12, id5) = √( (5.00 − 2.75)² + (2.50 − 7.50)² ) = √30.0625 ≈ 5.4829

Dissimilarities between Data Objects Typically the Euclidean Distance is used as a first choice when applying Nearest Neighbors and Clustering. Other distance metrics are generalized by the Minkowski distance metric: d(x, y) = ( Σ_{k=1..n} |x_k − y_k|^r )^(1/r), where x and y are two data objects with n dimensions and r is a parameter.

Dissimilarities between Data Objects Minkowski Distance Metric: r = 1 gives the L1 norm (“Manhattan” or “taxicab” distance); r = 2 gives the L2 norm (Euclidean Distance). The larger the value of r, the more emphasis is placed on features with large differences in values, because these differences are raised to the power of r. The r parameter should not be confused with the number of attributes/dimensions n.
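A short sketch of the general form (names are mine; r = 1 and r = 2 reproduce the two special cases above):

def minkowski_distance(x, y, r):
    return sum(abs(xk - yk) ** r for xk, yk in zip(x, y)) ** (1.0 / r)

# minkowski_distance([5.00, 2.50], [2.75, 7.50], r=1)  ->  7.25      (Manhattan / L1)
# minkowski_distance([5.00, 2.50], [2.75, 7.50], r=2)  ->  ~5.4829   (Euclidean / L2)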

Distance Matrix Once a distance metric is chosen, the proximity between all of the objects in the dataset can be computed and represented in a distance matrix of pairwise distances between points.

Distance Matrix The slide shows two example matrices for the same points: one of L1 norm (“Manhattan”) distances and one of L2 norm (Euclidean) distances.
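A small numpy sketch of building such a matrix (the points below are illustrative, not the slide's data):

import numpy as np

# Pairwise Euclidean distance matrix for a small set of 2-D points.
points = np.array([[0.0, 0.0],
                   [3.0, 4.0],
                   [6.0, 8.0]])

diffs = points[:, None, :] - points[None, :, :]    # shape (n, n, 2): all pairwise differences
dist_matrix = np.sqrt((diffs ** 2).sum(axis=-1))   # entry [i, j] = distance between points i and j
print(dist_matrix)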

Using Weights So far, all attributes have been treated equally when computing proximity. In some situations, some features are more important than others; that decision is up to the analyst. The Minkowski distance definition can be modified to include weights:
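The weighted formula on the slide did not survive the transcript; the standard weighted Minkowski distance multiplies each feature's contribution by a weight w_k, as sketched here (this reconstruction is my assumption):

# Weighted Minkowski distance: d(x, y) = ( sum_k w_k * |x_k - y_k|^r )^(1/r)
def weighted_minkowski(x, y, w, r):
    return sum(wk * abs(xk - yk) ** r for wk, xk, yk in zip(w, x, y)) ** (1.0 / r)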

Eager Learner Models So far in this course, we've performed prediction by: downloading or constructing a dataset, learning a model, and using the model to classify/predict test instances. These are sometimes called eager learners: designed to learn a model that maps the input attributes to the class label as soon as training data becomes available.

Lazy Learner Models Opposite strategy: delay the process of modeling the training data until it is necessary to classify/predict a test instance. Example: Nearest neighbors

Nearest Neighbors k-nearest neighbors k = parameter, chosen by analyst For a given test instance, use the k “closest” points (nearest neighbors) for performing classification “closest” points: defined by some proximity metric, such as Euclidean Distance

Algorithm Can’t have a CS class without pseudocode!

Requires three things: the set of stored records; a distance metric to compute the distance between records; and the value of k, the number of nearest neighbors to retrieve. To classify an unknown record: compute its distance to the training records; identify the k nearest neighbors; use the class labels of the nearest neighbors to determine the class label of the unknown record (e.g., by taking a majority vote).
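Those steps, sketched as plain Python (my own minimal rendering, not the slide's pseudocode):

import math
from collections import Counter

def knn_classify(test_point, training_records, k):
    # training_records: list of (features, label) pairs
    def dist(a, b):
        return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))
    # 1. Compute the distance from the test point to every training record,
    # 2. keep the k nearest neighbors,
    nearest = sorted(training_records, key=lambda rec: dist(test_point, rec[0]))[:k]
    # 3. take a majority vote over their class labels.
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]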

Definition of Nearest Neighbor The k-nearest neighbors of a given example x are the k points that are closest to x. Classification changes depending on the chosen k Majority Voting Tie Scenario: Randomly choose classification? For binary problems, usually an odd k is used to avoid ties.

Voronoi Tessellations and Decision Boundaries When k-NN is searching for the nearest neighbor, it is partitioning the abstract feature space into a Voronoi tessellation Each region belongs to an instance Contains all the points in the space whose distance to that instance is less than the distance to any other instance

Decision Boundary: the boundary between regions of the feature space in which different target levels will be predicted. Generate the decision boundary by aggregating the neighboring regions that make the same prediction.

One of the great things about nearest neighbor algorithms is that we can add in new data to update the model very easily.

What’s up with the top-right instance? Is it noise? The decision boundary is likely not ideal because of id13. k-NN is a set of local models, each defined by a single instance, and is therefore sensitive to noise! How to mitigate noise? Choose a higher value of k.

Different Values of k Which is the ideal value of k? Setting k to a high value is riskier with an imbalanced dataset: the majority target level begins to dominate the feature space. (The slide compares decision boundaries for k = 1, k = 3, k = 5, and k = 15 against the ideal decision boundary.) Choose k by running evaluation experiments on a training or validation set.

Choosing the right k If k is too small, sensitive to noise points in the training data Susceptible to overfitting If k is too large, neighborhood may include points from other classes Susceptible to misclassification
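One way to run those evaluation experiments, sketched with scikit-learn (the library and the stand-in dataset are my choices, not the slides'):

from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)   # stand-in dataset for illustration

# Try several k values and keep the one with the best cross-validated accuracy.
scores = {}
for k in [1, 3, 5, 7, 9, 15]:
    model = KNeighborsClassifier(n_neighbors=k)
    scores[k] = cross_val_score(model, X, y, cv=5).mean()

best_k = max(scores, key=scores.get)
print(scores, "best k:", best_k)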

Computational Issues? Computation can be costly if the number of training examples is large. Efficient indexing techniques are available to reduce the amount of computation needed when finding the nearest neighbors of a test example. Sorting training instances?
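For instance, a k-d tree index such as scikit-learn's KDTree avoids comparing the test example against every training record (scikit-learn is my example here, not mentioned on the slides):

import numpy as np
from sklearn.neighbors import KDTree

rng = np.random.default_rng(0)
X_train = rng.random((10000, 3))    # illustrative training data

tree = KDTree(X_train)              # build the index once
dist, ind = tree.query(rng.random((1, 3)), k=5)   # distances and indices of the 5 nearest neighbors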

Majority Voting: every neighbor has the same impact on the classification. Distance-weighted voting: far-away neighbors have a weaker impact on the classification.
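One common weighting scheme (a choice of mine; the slide's exact formula did not survive the transcript) weights each neighbor's vote by the inverse squared distance:

from collections import defaultdict

# Distance-weighted vote: each of the k neighbors contributes weight 1 / d^2.
# neighbors: list of (distance, label) pairs for the k nearest neighbors.
def weighted_vote(neighbors, eps=1e-9):
    weights = defaultdict(float)
    for d, label in neighbors:
        weights[label] += 1.0 / (d ** 2 + eps)   # eps guards against d == 0
    return max(weights, key=weights.get)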

Scaling Issues Attributes may have to be scaled to prevent distance measures from being dominated by one of the attributes. Example, with three dimensions: height of a person may vary from 1.5 m to 1.8 m; weight of a person may vary from 90 lb to 300 lb; income of a person may vary from $10K to $1M. Income will dominate if these variables aren't standardized.

Standardization Treat all features “equally” so that one feature doesn’t dominate the others Common treatment to all variables: Standardize each variable: Mean = 0 Standard Deviation = 1 Normalize each variable: Max = 1 Min = 0
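Both treatments, sketched column-wise with numpy (the data values are made up to echo the previous slide's example):

import numpy as np

X = np.array([[1.6,  150.0,  25000.0],
              [1.8,  300.0, 950000.0],
              [1.5,   90.0,  10000.0]])   # made-up height / weight / income rows

# Standardize: mean 0, standard deviation 1 per column (z-scores).
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

# Normalize: rescale each column to the range [0, 1] (min-max scaling).
X_norm = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))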

Query / test instance to classify: Salary = 56000 Age = 35

Final Thoughts on Nearest Neighbors Nearest-neighbors classification is part of a more general technique called instance-based learning Use specific instances for prediction, rather than a model Nearest-neighbors is a lazy learner Performing the classification can be relatively computationally expensive No model is learned up-front

Classifier Comparison Eager Learners (Decision Trees, SVMs): model building is potentially slow; classifying a test instance is fast. Lazy Learners (Nearest Neighbors): model building is fast (because there is none!); classifying a test instance is slow.

Classifier Comparison Eager Learners (Decision Trees, SVMs): find a global model that fits the entire input space. Lazy Learners (Nearest Neighbors): classification decisions are made locally (for small k values) and are more susceptible to noise.

Footnotes In many cases, the initial dataset is not needed once similarities and dissimilarities have been computed “transforming the data into a similarity space”

References Fundamentals of Machine Learning for Predictive Data Analytics, 1st edition, Kelleher et al.; Data Science from Scratch, 1st edition, Grus; Data Mining and Business Analytics in R, 1st edition, Ledolter; An Introduction to Statistical Learning, 1st edition, James et al.; Discovering Knowledge in Data, 2nd edition, Larose et al.; Introduction to Data Mining, 1st edition, Tan et al.