Nearest Neighbors CSC 576: Data Mining
Today… Measures of Similarity Distance Measures Nearest Neighbors
Similarity and Dissimilarity Measures Used by a number of data mining techniques: Nearest neighbors Clustering …
How to measure “proximity”? Proximity: similarity or dissimilarity between two objects Similarity: numerical measure of the degree to which two objects are alike Usually in range [0,1] 0 = no similarity 1 = complete similarity Dissimilarity: measure of the degree in which two objects are different
Feature Space Abstract n-dimensional space Each instance is plotted in a feature space One axis for each descriptive feature Difficult to visualize when # of features > 3
As the differences between the values of the descriptive features grows, so too does the distance between the points in the feature space that represent these instances.
Distance Metric Distance Metric: A function that returns the distance between two instances a and b. Must, mathematically have the criteria: Non-negativity: 𝑚𝑒𝑡𝑟𝑖𝑐 𝑎,𝑏 ≥0 Identity: 𝑚𝑒𝑡𝑟𝑖𝑐 𝑎,𝑏 =0⟺𝑎=𝑏 Symmetry: 𝑚𝑒𝑡𝑟𝑖𝑐 𝑎,𝑏 =𝑚𝑒𝑡𝑟𝑖𝑐(𝑏,𝑎) Triangular Inequality: 𝑚𝑒𝑡𝑟𝑖𝑐 𝑎,𝑐 ≤𝑚𝑒𝑡𝑟𝑖𝑐 𝑎,𝑏 +𝑚𝑒𝑡𝑟𝑖𝑐(𝑏,𝑐)
Dissimilarities between Data Objects A common measure for the proximity between two objects is the Euclidean Distance: x and y are two data objects n dimensions In high school, we typically used this for calculating the distance between two objects, when there were only two dimensions. Defined for one-dimension, two-dimensions, three-dimensions, …, any n-dimensional space
Example 𝑑 𝑖 𝑑 12 ,𝑖 𝑑 5 = 5.00−2.75 2 + 2.50−7.50 2 = 30.0625 =5.4829
Dissimilarities between Data Objects Typically the Euclidean Distance is used as a first choice when applying Nearest Neighbors and Clustering x and y are two data objects n dimensions Other distance metrics: Generalized by the Minkowski distance metric: r is a parameter
Dissimilarities between Data Objects Minkowski Distance Metric: r = 1 L1 norm “Manhattan”, “taxicab” Distance r = 2 Euclidean Distance L2 norm The larger the value of r the more emphasis is placed on the features with large differences in values because these differences are raised to the power of r. The r parameter should not be confused with the number of attributes/dimensions n
Distance Matrix Once a distance metric is chosen, the proximity between all of the objects in the dataset can be computed Can be represented in a distance matrix Pairwise distances between points
Distance Matrix L1 norm distance. “Manhattan” Distance. L2 norm distance. Euclidean Distance.
Using Weights So far, all attributes treated equally when computing proximity In some situations: some features are more important than others Decision up to the analyst Modified Minkowski distance definition to include weights:
Eager Learner Models So far in this course, we’re performed prediction by: Downloading or constructing a dataset Learning a model Using model to classify/predict test instances Sometimes called eager learners: Designed to learn a model that maps the input attributes to the class label, as soon as training data becomes available.
Lazy Learner Models Opposite strategy: Delay process of modeling the training data, until it is necessary to classify/predict a test instance. Example: Nearest neighbors
Nearest Neighbors k-nearest neighbors k = parameter, chosen by analyst For a given test instance, use the k “closest” points (nearest neighbors) for performing classification “closest” points: defined by some proximity metric, such as Euclidean Distance
Algorithm Can’t have a CS class without pseudocode!
Requires three things The set of stored records Distance Metric to compute distance between records The value of k, the number of nearest neighbors to retrieve To classify an unknown record: Compute distance to other training records Identify k nearest neighbors Use class labels of nearest neighbors to determine the class label of unknown record (e.g., by taking majority vote)
Definition of Nearest Neighbor The k-nearest neighbors of a given example x are the k points that are closest to x. Classification changes depending on the chosen k Majority Voting Tie Scenario: Randomly choose classification? For binary problems, usually an odd k is used to avoid ties.
Voronoi Tessellations and Decision Boundaries When k-NN is searching for the nearest neighbor, it is partitioning the abstract feature space into a Voronoi tessellation Each region belongs to an instance Contains all the points in the space whose distance to that instance is less than the distance to any other instance
Decision Boundary: the boundary between regions of the feature space in which different target levels will be predicted. Generate the decision boundary by aggregating the neighboring regions that make the same prediction.
One of the great things about nearest neighbor algorithms is that we can add in new data to update the model very easily.
What’s up with the top-right instance? Is it noise? The decision boundary is likely not ideal because of id13. k-NN is a set of local models, each defined by a single instance Sensitive to noise! How to mitigate noise? Choose a higher value of k.
Different Values of k Which is the ideal value of k? Setting k to a high value is riskier with an imbalanced dataset. Majority target level begins to dominate the feature space. k=1 k=3 k=5 k=15 Ideal decision boundary Choose k by running evaluation experiments on a training or validation set.
Choosing the right k If k is too small, sensitive to noise points in the training data Susceptible to overfitting If k is too large, neighborhood may include points from other classes Susceptible to misclassification
Computational Issues? Computation can be costly if the number of training examples is large. Efficient indexing techniques are available to reduce the amount of computations needed when finding the nearest neighbors of a test example Sorting training instances?
Majority Voting Every neighbor has the same impact on the classification: Distance-weighted: Far away neighbors have a weaker impact on the classification.
Scaling Issues Attributes may have to be scaled to prevent distance measures from being dominated by one of the attributes Example, with three dimensions: height of a person may vary from 1.5m to 1.8m weight of a person may vary from 90lb to 300lb income of a person may vary from $10K to $1M Income will dominate if these variables aren’t standardized.
Standardization Treat all features “equally” so that one feature doesn’t dominate the others Common treatment to all variables: Standardize each variable: Mean = 0 Standard Deviation = 1 Normalize each variable: Max = 1 Min = 0
Query / test instance to classify: Salary = 56000 Age = 35
Final Thoughts on Nearest Neighbors Nearest-neighbors classification is part of a more general technique called instance-based learning Use specific instances for prediction, rather than a model Nearest-neighbors is a lazy learner Performing the classification can be relatively computationally expensive No model is learned up-front
Classifier Comparison Eager Learners Lazy Learners Decision Trees, SVMs Model Building: potentially slow Classifying Test Instance: fast Nearest Neighbors Model Building: fast (because there is none!) Classifying Test Instance: slow
Classifier Comparison Eager Learners Lazy Learners Decision Trees, SVMs finding a global model that fits the entire input space Nearest Neighbors classification decisions are made locally (small k values), and are more susceptible to noise
Footnotes In many cases, the initial dataset is not needed once similarities and dissimilarities have been computed “transforming the data into a similarity space”
References Fundamentals of Machine Learning for Predictive Data Analytics, 1st Edition, Kelleher et al. Data Science from Scratch, 1st Edition, Grus Data Mining and Business Analytics in R, 1st edition, Ledolter An Introduction to Statistical Learning, 1st edition, James et al. Discovering Knowledge in Data, 2nd edition, Larose et al. Introduction to Data Mining, 1st edition, Tam et al.