When Is “Nearest Neighbor” Meaningful? Authors: Kevin Beyer, Jonathan Goldstein, Raghu Ramakrishnan, Uri Shaft. Presentation by: Vuk Malbasa. For CIS664, Prof. Vasilis Megalooekonomou
Overview: Introduction; Examples of when NN is useful and when it is not; Conditions under which NN is not useful; Application of results; Meaningful applications of high-dimensional NN; Experimental studies; Conclusions
Introduction: Nearest neighbor (NN) is a technique in which an unseen example is assumed to have properties similar to those of the already classified point closest to it. The examples on the left are cases where using NN is obviously useful. Are there cases where this technique is not useful?
Examples [Figures: a query point at the center of a circle; a query point with a histogram of its distances to the other points]
Conditions under which NN is not useful [Figure labels: D_min, D_max, (1+ε)D_min] Definition: A nearest neighbor query is unstable for a given ε if the distance from the query point to most data points is less than (1 + ε) times the distance from the query point to its nearest neighbor. It can be shown that, under certain conditions, for any fixed ε > 0 the probability that a query is unstable converges to 1 as dimensionality rises.
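The definition above can be checked directly on a finite data set. The following is a minimal sketch, not the authors' code: the function names, the use of Euclidean distance, and the reading of "most" as "more than half" (the `most` parameter) are all assumptions made for illustration.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length sequences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def unstable_fraction(query, data, eps):
    """Fraction of data points lying within (1 + eps) * D_min of the query."""
    dists = [euclidean(query, p) for p in data]
    d_min = min(dists)
    # "<=" is used so that the nearest neighbor itself is always counted.
    return sum(d <= (1 + eps) * d_min for d in dists) / len(dists)


def is_unstable(query, data, eps, most=0.5):
    """Assumption: 'most data points' is read as 'more than the fraction most'."""
    return unstable_fraction(query, data, eps) > most
```

For example, with the query at the origin and data points at distances 1, 1.05, and about 14.1, the query counts as unstable for ε = 0.1 (two of three points fall inside the enlarged radius) but not for ε = 0.01.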
Conditions under which NN is not useful: If, for a given scenario (a set of data points and a set of query points), equation (1) is satisfied, then NN is not useful. Stated differently: if, as the dimensionality m of the data increases, the variance of the distance distribution scaled by the overall magnitude of the distance converges to zero, then NN is meaningless. (1)
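Written out in the notation of the Beyer et al. paper, the condition and its consequence take the following form. The symbols are introduced here because they do not appear in the slide text: d_m is the distance function in m dimensions, P_m a data point, Q_m a query point, p a positive constant, and DMIN_m and DMAX_m the distances from the query to its nearest and farthest data points.

```latex
\[
\lim_{m \to \infty}
\operatorname{Var}\!\left(
  \frac{\bigl(d_m(P_m, Q_m)\bigr)^{p}}
       {\mathbb{E}\bigl[\bigl(d_m(P_m, Q_m)\bigr)^{p}\bigr]}
\right) = 0
\tag{1}
\]
\[
\text{then, for every } \varepsilon > 0:\qquad
\lim_{m \to \infty}
P\bigl[\mathrm{DMAX}_m \le (1 + \varepsilon)\,\mathrm{DMIN}_m\bigr] = 1 .
\]
```

The second line says that, in the limit, every data point lies within a factor (1 + ε) of the nearest neighbor, which is the instability defined earlier.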
Application of results, Example 1: The data distribution and the query distribution are IID in all dimensions; all appropriate moments are finite; the query point is chosen independently of the data points. In this case queries are unstable.
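A small simulation illustrates Example 1. This is a sketch under stated assumptions rather than the experiment from the paper: the Uniform(0, 1) distribution, Euclidean distance, the sample size, and the function name `contrast` are all chosen here for illustration.

```python
import math
import random


def contrast(m, n_points=200, seed=0):
    """Ratio D_max / D_min for one query against IID Uniform(0, 1) data in m dimensions."""
    rng = random.Random(seed)
    # The query is drawn independently of the data, from the same distribution.
    query = [rng.random() for _ in range(m)]
    dists = []
    for _ in range(n_points):
        point = [rng.random() for _ in range(m)]
        dists.append(math.dist(query, point))
    return max(dists) / min(dists)


if __name__ == "__main__":
    for m in (2, 10, 50, 200):
        print(m, round(contrast(m), 3))
```

As m grows, the ratio approaches 1: the farthest point is barely farther away than the nearest one, so a small ε already captures most of the data.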
Application of results, Example 2: Same as the previous example, but all dimensions of both query points and data points are completely dependent (value for dimension 1 = value for dimension 2 = …). In this case queries are not unstable and NN is meaningful.
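The reason is that with identical coordinates the distance is just sqrt(m) times a one-dimensional distance, so the ratio D_max / D_min does not depend on m. A minimal sketch, assuming Uniform(0, 1) values and Euclidean distance (the name `dependent_contrast` is hypothetical):

```python
import math
import random


def dependent_contrast(m, n_points=200, seed=0):
    """Ratio D_max / D_min when every dimension carries the same value."""
    rng = random.Random(seed)
    q = rng.random()
    query = [q] * m  # all m coordinates of the query are equal
    dists = []
    for _ in range(n_points):
        v = rng.random()
        dists.append(math.dist(query, [v] * m))  # equals sqrt(m) * |q - v|
    return max(dists) / min(dists)
```

With the same seed, `dependent_contrast(1)` and `dependent_contrast(100)` agree up to floating-point rounding, so adding dimensions does not erode the contrast.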
Application of results, Example 3: Every dimension is unique, but all dimensions are correlated with all other dimensions, and the variance of each additional dimension increases. First, independent variables $U_1, \dots, U_m$ are generated such that $U_i \sim \mathrm{Uniform}(0, \sqrt{i})$. Then $X_1 = U_1$ and, for $2 \le i \le m$, $X_i = U_i + X_{i-1}/2$. In this case queries are unstable.
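The construction in Example 3 translates almost line for line into code. This is an illustrative sketch; the function name `example3_point` and the use of Python's `random` module are assumptions, not part of the presentation.

```python
import math
import random


def example3_point(m, rng):
    """Generate one m-dimensional point following the Example 3 recurrence."""
    # U_i ~ Uniform(0, sqrt(i)) for i = 1..m, drawn independently.
    u = [rng.uniform(0.0, math.sqrt(i)) for i in range(1, m + 1)]
    x = [u[0]]  # X_1 = U_1
    for i in range(1, m):
        # X_i = U_i + X_{i-1} / 2, written with 0-based indices.
        x.append(u[i] + x[i - 1] / 2)
    return x


if __name__ == "__main__":
    print(example3_point(5, random.Random(0)))
```

Each coordinate depends on the previous one through the X_{i-1}/2 term, which is what makes all dimensions correlated while each one still adds fresh randomness.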
Meaningful applications of high-dimensional NN: The query point matches one of the data points exactly. The query point falls within some small distance of one of the data points (this becomes increasingly difficult as dimensionality rises). The data is grouped into several clusters with a fixed maximum distance ε, and the query point falls within one of these clusters. (If the query point isn’t required to fall within some cluster, then the query is unstable.) Implicit low dimensionality (the underlying dimensionality of the data is low regardless of the actual dimensionality).
Experimental Studies: The conditions in (1) describe what happens as dimensionality approaches infinity. Experiments are needed to observe the rate of this convergence.
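One way to observe the rate empirically is to estimate the scaled variance of the distance for increasing m. The sketch below is not the experimental setup of the paper: it assumes IID Uniform(0, 1) data and queries, Euclidean distance, and a hypothetical helper named `relative_variance`, with the exponent `p` defaulting to 1.

```python
import math
import random
import statistics


def relative_variance(m, n_pairs=500, p=1, seed=0):
    """Empirical variance of d^p / E[d^p] over random (query, data point) pairs."""
    rng = random.Random(seed)
    powered = []
    for _ in range(n_pairs):
        query = [rng.random() for _ in range(m)]
        point = [rng.random() for _ in range(m)]
        powered.append(math.dist(query, point) ** p)
    mean = statistics.fmean(powered)
    return statistics.pvariance([d / mean for d in powered])


if __name__ == "__main__":
    for m in (1, 2, 5, 10, 20, 50, 100):
        print(m, relative_variance(m))
```

Printing the values for a range of m shows how quickly the scaled variance shrinks for this particular workload; other distributions may converge at a different rate or not at all.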
Conclusions: Query instability is an indication of a meaningless query. While there are situations where high-dimensional NN queries are meaningful, they are very specific and differ from the “independent dimensions” basis. The distinction in distance decreases fastest in the first 20 dimensions.
Conclusions: Make sure that the distribution of distances between query points and data points allows for enough contrast. When evaluating an NN processing technique, test it on meaningful workloads.
Thanks!