k-Nearest Neighbors Search in High Dimensions Tomer Peled Dan Kushnir Tell me who your neighbors are, and I'll know who you are
Outline Problem definition and flavors Algorithms overview - low dimensions Curse of dimensionality (d>10..20) Enchanting the curse Locality Sensitive Hashing (high dimension approximate solutions) l2 extension Applications (Dan)
Nearest Neighbor Search Problem definition: Given a set P of n points in R^d and some distance metric, find the nearest neighbor p of a query q in P. The solution strategy is: select features (a feature space R^d), select a distance metric (for example l1 or l2), and build a data structure for fast near-neighbor queries. Scalability with n and with d is important.
Applications: Classification, Clustering, Segmentation, Indexing, Dimension reduction (e.g. LLE). [Figure: a query point q in a Weight vs. Color feature space.]
Naïve solution: no preprocessing. Given a query point q, go over all n points and compare in R^d; query time = O(nd). Keep this baseline in mind.
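For reference, a minimal brute-force sketch of this O(nd) query (illustrative Python, not from the slides):

```python
import numpy as np

def knn_brute_force(P, q, k=1):
    """Return indices of the k points of P (n x d array) closest to the query q (length-d vector)."""
    dists = np.linalg.norm(P - q, axis=1)   # n l2 distances, O(nd) work
    return np.argsort(dists)[:k]

# Usage: P = np.random.rand(1000, 16); q = np.random.rand(16); knn_brute_force(P, q, k=5)
```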
Common solution: use a data structure for acceleration. Scalability with n and with d is important.
When to use nearest neighbors: among high-level algorithms, parametric methods estimate a probability distribution, while non-parametric methods perform density estimation; nearest neighbors belong to the non-parametric family and suit complex models, sparse data, and high dimensions, assuming no prior knowledge about the underlying probability structure.
Nearest Neighbor: for a query q, return the closest point, argmin over p_i in P of dist(q, p_i).
(r, ε)-Nearest Neighbor: if some point p1 has dist(q, p1) ≤ r, return any point within (1 + ε)·r of q; i.e. r2 = (1 + ε)·r1.
Outline Problem definition and flavors Algorithms overview - low dimensions Curse of dimensionality (d>10..20) Enchanting the curse Locality Sensitive Hashing (high dimension approximate solutions) l2 extension Applications (Dan)
The simplest solution: "Lion in the desert" (repeatedly halve the region until the lion is trapped).
Quadtree: split each dimension in two, repeat recursively, and stop when each cell holds no more than 1 data point. Extends to the k-dimensional case (see the sketch below).
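A toy sketch of this construction in 2-D (midpoint splits; assumes distinct points and is illustrative only):

```python
def build_quadtree(points, xmin, ymin, xmax, ymax):
    """Recursively split the cell at its midpoint until each leaf holds at most one point."""
    if len(points) <= 1:
        return {"points": points, "children": None}          # leaf
    xm, ym = (xmin + xmax) / 2, (ymin + ymax) / 2
    quads = [[], [], [], []]
    for (x, y) in points:
        quads[(x >= xm) * 2 + (y >= ym)].append((x, y))      # pick one of 4 quadrants
    bounds = [(xmin, ymin, xm, ym), (xmin, ym, xm, ymax),
              (xm, ymin, xmax, ym), (xm, ym, xmax, ymax)]
    return {"points": None,
            "children": [build_quadtree(q, *b) for q, b in zip(quads, bounds)]}
```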
Quadtree - structure. [Figure: the split point (X1, Y1) sends points to children by the tests P < X1 / P ≥ X1 and P < Y1 / P ≥ Y1; extends to the k-dimensional case.]
Quadtree - Query. In many cases this works: descend the tree using the tests P < X1 / P ≥ X1, P < Y1 / P ≥ Y1 and examine the query's cell. [Figure; extends to the k-dimensional case.]
Quadtree – Pitfall 1. In some cases it doesn't: the nearest neighbor may sit in a different cell than the query, so the descent alone misses it. [Figure; extends to the k-dimensional case.]
Quadtree – Pitfall 1 (cont.). In some cases nothing works. [Figure: X and Y axes; extends to the k-dimensional case.]
Quadtree – Pitfall 2. A query may need to examine O(2^d) neighboring cells, so the query time can be exponential in the number of dimensions. [Figure: X and Y axes; extends to the k-dimensional case.]
Space-partition based algorithms could be improved; see "Multidimensional Access Methods" / Volker Gaede, O. Gunther.
Outline Problem definition and flavors Algorithms overview - low dimensions Curse of dimensionality (d>10..20) Enchanting the curse Locality Sensitive Hashing (high dimension approximate solutions) l2 extension Applications (Dan)
Curse of dimensionality: query time or space is O(min(n·d, n^d)), where n·d is the naive scan. For d > 10..20 these structures are worse than a sequential scan for most geometric distributions, so techniques specific to high dimensions are needed. Data structures scale poorly with the dimension: proved in theory (Barkol & Rabani 2000; Beame & Vee 2002) and in practice ("A Quantitative Analysis and Performance Study for Similarity-Search Methods in High-Dimensional Spaces" / R. Weber, H. Schek, and S. Blott). Enough points n are required for learning a distribution in R^d (for n ~ e^d the curse of dimensionality is meaningless).
Curse of dimensionality – some intuition: bounding the region around a query touches 2 cells in 1-D, 2^2 in 2-D, 2^3 in 3-D, and 2^d cells in d dimensions.
Outline Problem definition and flavors Algorithms overview - low dimensions Curse of dimensionality (d>10..20) Enchanting the curse Locality Sensitive Hashing (high dimension approximate solutions) l2 extension Applications (Dan) The compromise: approximate nearest neighbors, returned with (high) probability.
Preview General Solution – Locality sensitive hashing Implementation for Hamming space Generalization to l1 & l2
Hash function: a key is a number giving access to a storage location.
Hash function: data item → hash function → key → bin/bucket. "A function that maps a data item to a numeric value by use of a transformation. A hash function can convert a number that has meaning to a user, such as a key or other identifier, into a value for the location of that data in a structure such as a table." (Dyson, Dictionary of Networking)
Hash function example: X is a number in the range 0..n; the hash "X modulo 3" maps it to a storage address in 0..2. Usually we would like related data items to be stored in the same bin.
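A two-line sketch of the "X modulo 3" example (note how an ordinary hash does not keep related items together, which is what LSH will fix):

```python
buckets = {0: [], 1: [], 2: []}
for x in [3, 7, 9, 12, 14]:
    buckets[x % 3].append(x)      # bucket = storage address in 0..2
print(buckets)                    # {0: [3, 9, 12], 1: [7], 2: [14]}
```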
Recall: (r, ε)-Nearest Neighbor. If dist(q, p1) ≤ r, we may return any p2 with dist(q, p2) ≤ (1 + ε)·r; r2 = (1 + ε)·r1.
Locality sensitive hashing: a hash family is (r, ε, P1, P2)-sensitive, with r2 = (1 + ε)·r1, if Pr[I(p) = I(q)] is "high" (≥ P1) when p is close to q (dist ≤ r) and "low" (≤ P2) when p is far from q (dist ≥ (1 + ε)·r).
Preview General Solution – Locality sensitive hashing Implementation for Hamming space Generalization to l1 & l2 Why Hamming space It’s easier It’s the same as L1 It’s almost the same as L2
Hamming Space: the Hamming space is the set of 2^N binary strings of length N; the Hamming distance (a.k.a. signal distance) is the number of differing digits, equivalently the Manhattan distance between two vertices of an N-dimensional hypercube, where N is the length of the words. Richard W. Hamming, "Error-detecting and error-correcting codes", Bell System Technical Journal 29(2):147-160, 1950. http://en.wikipedia.org/wiki/Hamming_distance
Hamming Space – example: for 010100001111 and 010010000011 the Hamming distance is SUM(X1 XOR X2), here 4.
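The same distance as code (popcount of the XOR):

```python
def hamming(x1: str, x2: str) -> int:
    """Hamming distance = number of differing bits = popcount(x1 XOR x2)."""
    return bin(int(x1, 2) ^ int(x2, 2)).count("1")

print(hamming("010100001111", "010010000011"))   # -> 4
```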
L1 to Hamming Space Embedding: the axes are positive integers; let C = Max(axis value). Represent each coordinate as a zero-padded unary value of C digits and concatenate all coordinates into one d' = C·d bit string. Example from the figure: a point p with coordinates 2 and 8 and C = 11 becomes 11000000000 11111111000.
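A sketch of this unary embedding (the coordinate values and C come from the example above); the l1 distance between points then equals the Hamming distance between their embeddings:

```python
def unary_embed(point, C):
    """Each integer coordinate v in [0, C] becomes v ones followed by C - v zeros."""
    return "".join("1" * v + "0" * (C - v) for v in point)

print(unary_embed((2, 8), C=11))   # -> '1100000000011111111000'
```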
Hash function: for p ∈ H^d', define L hash functions G_j(p) = p|I_j, j = 1..L, each sampling k digits of p (here k = 3) at a random index set I_j. Store p in the bucket indexed by p|I_j; there are 2^k buckets. In the example, sampling 3 bits of 11000000000 11111111000 gives the bucket label 101.
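A minimal sketch of one bit-sampling hash G_j (the helper name and seed are illustrative):

```python
import random

def make_bit_sampler(d_prime, k, seed=None):
    """Return G_j(p) = p restricted to k randomly chosen bit positions I_j."""
    rng = random.Random(seed)
    I_j = sorted(rng.sample(range(d_prime), k))
    return lambda p: "".join(p[i] for i in I_j)

g = make_bit_sampler(d_prime=22, k=3, seed=0)
print(g("1100000000011111111000"))   # a 3-bit bucket label (depends on the sampled positions)
```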
Construction: every point p is inserted into all L hash tables, into the buckets G_1(p), G_2(p), ..., G_L(p).
Query: q is hashed into all L tables; the union of the buckets G_1(q), ..., G_L(q) is the candidate set, which is then verified by exact distance (see the sketch below).
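Putting construction and query together, a small sketch of the whole bit-sampling LSH index (class and parameter names are illustrative, not from the slides):

```python
import random
from collections import defaultdict

class HammingLSH:
    def __init__(self, d_prime, k, L, seed=0):
        rng = random.Random(seed)
        self.index_sets = [sorted(rng.sample(range(d_prime), k)) for _ in range(L)]
        self.tables = [defaultdict(list) for _ in range(L)]

    def _key(self, p, j):
        return "".join(p[i] for i in self.index_sets[j])     # G_j(p) = p|I_j

    def insert(self, p):
        for j, table in enumerate(self.tables):
            table[self._key(p, j)].append(p)                 # store p in L buckets

    def query(self, q):
        candidates = set()
        for j, table in enumerate(self.tables):
            candidates.update(table.get(self._key(q, j), ()))
        # verify candidates by exact Hamming distance
        return sorted(candidates, key=lambda p: sum(a != b for a, b in zip(p, q)))
```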
Alternative intuition – random projections: passing a threshold (TH) on one axis partitions the space into 2 half-spaces. [Figure: the point p with coordinates 2 and 8 and its unary encoding 11000000000 11111111000.]
Alternative intuition – random projections: each threshold is analogous to sampling one digit of the unary Hamming representation (#thresholds = #sampled digits), so sampling k digits is analogous to randomly partitioning the space with k thresholds.
Alternative intuition – random projections (cont.). [Figure: the same point p and two thresholds, each splitting the space into 2.]
Alternative intuition – random projections: k = 3 thresholds label the regions 000, 001, ..., 111, i.e. 2^3 buckets; here p falls into bucket 101.
k samplings
Repeating
Repeating L times
Secondary hashing: the 2^k primary buckets (empty ones, e.g. 011, are skipped) are mapped by a simple hash into M buckets of size B with M·B = α·n, where α is the memory-utilization parameter (e.g. α = 2). This supports tuning dataset size vs. storage volume.
The above hashing is locality-sensitive: the probability that p and q land in the same bucket decays with their distance, and the decay gets sharper as k grows. [Plots: Pr vs. Distance(q, p_i) for k = 1 and k = 2.] Adopted from Piotr Indyk's slides.
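The standard form of this collision probability, assuming the k bits are sampled uniformly at random:

```latex
\Pr\big[G_j(p) = G_j(q)\big] = \left(1 - \frac{\mathrm{Ham}(p,q)}{d'}\right)^{k},
\qquad
P_1 = \Big(1 - \tfrac{r}{d'}\Big)^{k},
\quad
P_2 = \Big(1 - \tfrac{(1+\varepsilon)\,r}{d'}\Big)^{k}.
```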
Preview General Solution – Locality sensitive hashing Implementation for Hamming space Generalization to l2
Direct L2 solution New hashing function Still based on sampling Using mathematical trick P-stable distribution for Lp distance Gaussian distribution for L2 distance
Central limit theorem: a weighted sum of Gaussians, v1·X1 + v2·X2 + … + vn·Xn, is again a (weighted) Gaussian. Generalized Central Limit Theorem: the only possible non-trivial limit of normalized sums of independent identically distributed terms is stable. http://academic2.american.edu/~jpnolan/stable/stable.html (Image taken from http://www.statisticalengineering.com/central_limit_theorem.htm; see 1.6 "Sums of stable random variables" in Stable Distributions - Models for Heavy Tailed Data / John P. Nolan.)
Central limit theorem: with v1..vn real numbers and X1..Xn independent identically distributed (i.i.d.) random variables, the sum v1·X1 + v2·X2 + … + vn·Xn is again a random variable of the same stable family; for standard Gaussian Xi its standard deviation is the l2 norm of (v1..vn).
Central limit theorem: so a dot product with such a random vector sketches the norm, since its distribution depends on the vector only through its norm.
Norm → Distance: since a·v1 - a·v2 = a·(v1 - v2), the dot product is calculated once per vector (n dot products) instead of calculating all pairwise distances (n^2); the distance sketch then reduces to a subtraction of scalars.
Norm → Distance. [Diagram: feature vector 1 and feature vector 2 are each reduced to one dot product; their distance sketch is the difference of the two scalars.]
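A sketch of this "one dot product per vector" idea with a single Gaussian direction (illustrative; the actual scheme uses many such projections, one per hash function):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 64
a = rng.normal(size=d)            # i.i.d. N(0,1) entries (2-stable)
V = rng.normal(size=(1000, d))    # n feature vectors
sketch = V @ a                    # one dot product per vector: O(nd) total
# pairwise distance sketch: abs(sketch[i] - sketch[j]) behaves like ||v_i - v_j||_2 * |N(0,1)|
```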
The full hashing: h(v) = ⌊(a·v + b) / w⌋. In the example, a = (22, 77, 42) are d random numbers, v = (34, 82, 21) is the features vector, b is a phase drawn from Random[0, w], and w is the discretization step; a·v = 22·34 + 77·82 + 42·21 = 7944.
The full hashing, continued: with phase b = 34 and discretization step w = 100, the value 7944 + 34 = 7978 falls into the cell [7900, 8000). [Axis: 7800 7900 8000 8100 8200.]
The full hashing, annotations: +34 is the phase b ∈ Random[0, w]; 100 is the discretization step w; 7944 is the dot product from the previous slide.
The full hashing, in general: h_{a,b}(v) = ⌊(a·v + b) / w⌋, where a is a d-dimensional vector whose entries are i.i.d. from a p-stable distribution, v is the features vector, b is a phase drawn from Random[0, w], and w is the discretization step.
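A minimal sketch of this p-stable hash family for p = 2 (Gaussian); the function names and the choice w = 4 are illustrative assumptions:

```python
import numpy as np

def make_l2_hash(d, w, rng):
    """h_{a,b}(v) = floor((a.v + b) / w), with a ~ N(0, I) and b ~ Uniform[0, w)."""
    a = rng.normal(size=d)
    b = rng.uniform(0, w)
    return lambda v: int(np.floor((a @ v + b) / w))

rng = np.random.default_rng(0)
h = make_l2_hash(d=16, w=4.0, rng=rng)
v = rng.normal(size=16)
print(h(v), h(v + 0.01 * rng.normal(size=16)))   # nearby vectors tend to share a bucket
```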
Generalization: p-stable distributions. The Central Limit Theorem gives the Gaussian (normal) distribution, which is 2-stable (l2); the Generalized Central Limit Theorem gives p-stable distributions for lp, 0 < p ≤ 2, e.g. the Cauchy distribution for l1.
P-stable summary: works for the (r, ε)-Nearest Neighbor problem, generalizes to 0 < p ≤ 2, and improves the query time from O(d·n^(1/(1+ε))·log n) to O(d·n^(1/(1+ε)^2)·log n) (latest results, reported in email by Alexander Andoni).
Parameter selection: aim for, e.g., 90% success probability with the best query-time performance (for Euclidean space).
Parameter selection: a single projection hits an ε-Nearest Neighbor with Pr = p1; k projections hit it with Pr = p1^k; all L hashings fail to collide with Pr = (1 - p1^k)^L. To ensure a collision with probability at least 1 - δ (e.g. 1 - δ ≥ 90%), require 1 - (1 - p1^k)^L ≥ 1 - δ. (For Euclidean space: accept neighbors, reject non-neighbors; see the sketch below.)
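A small sketch of solving that constraint for the number of tables L (assuming 0 < p1 < 1; the numbers in the usage line are illustrative):

```python
import math

def min_tables(p1, k, delta):
    """Smallest L with 1 - (1 - p1**k)**L >= 1 - delta."""
    return math.ceil(math.log(delta) / math.log(1.0 - p1 ** k))

print(min_tables(p1=0.9, k=18, delta=0.1))   # tables needed for >= 90% collision probability
```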
Parameter selection, continued. [Plot: query time vs. k, split into candidate extraction and candidate verification, where Tc is the time to compute the distance function for all candidates in the hash buckets; k is chosen at the minimum of the total.]
Pros. & Cons. Better query time than spatial data structures; scales well to higher dimensions and larger data sizes (sub-linear dependence); predictable running time. Extra storage overhead; inefficient for data with distances concentrated around the average; works best for Hamming distance (although it can be generalized to Euclidean space); in secondary storage, linear scan is pretty much all we can do (for high dimensions); requires the radius r to be fixed in advance. From Piotr Indyk's slides.
Conclusion: ...but in the end everything depends on your data set. Try it at home: visit http://web.mit.edu/andoni/www/LSH/index.html, email Alex Andoni (Andoni@mit.edu), and test it on your own data (C code under Red Hat Linux).
LSH - Applications: Searching video clips in databases ("Hierarchical, Non-Uniform Locality Sensitive Hashing and Its Application to Video Identification", Yang, Ooi, Sun). Searching image databases (see the following). Image segmentation (see the following). Image classification ("Discriminant Adaptive Nearest Neighbor Classification", T. Hastie, R. Tibshirani). Texture classification (see the following). Clustering (see the following). Embedding and manifold learning (LLE, and many others). Compression - vector quantization. Search engines ("LSH Forest: Self-Tuning Indexes for Similarity Search", M. Bawa, T. Condie, P. Ganesan). Genomics ("Efficient Large-Scale Sequence Comparison by Locality-Sensitive Hashing", J. Buhler). In short: whenever K-Nearest Neighbors (KNN) are needed.
Motivation: A variety of procedures in learning require KNN computation. KNN search is a computational bottleneck. LSH provides a fast approximate solution to the problem. LSH requires hash-function construction and parameter tuning ("sensitive" in the sense that the hashes find the right neighbors and reject wrong ones with the prescribed probabilities). For some application spaces this requires good training, as we shall see.
Outline Fast Pose Estimation with Parameter Sensitive Hashing G. Shakhnarovich, P. Viola, and T. Darrell. Finding sensitive hash functions. Mean Shift Based Clustering in High Dimensions: A Texture Classification Example B. Georgescu, I. Shimshoni, and P. Meer Tuning LSH parameters. LSH data structure is used for algorithm speedups.
Fast Pose Estimation with Parameter Sensitive Hashing / G. Shakhnarovich, P. Viola, and T. Darrell. The problem: given an image x, what are the parameters θ in this image, i.e. the angles of the joints, the orientation of the body, etc.?
Ingredients: an input query image with unknown angles (parameters); a database of human poses with known angles; an image feature extractor (an edge detector); a distance metric in feature space, d_x; and a distance metric in angle space, d_θ.
Example-based learning: construct a database of example images with their known angles; given a query image, run your favorite feature extractor, compute the KNN from the database, and use these KNN to compute the average angles for the query. [Pipeline: input query → find KNN in the database of examples → output the average angles of the KNN.]
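A minimal sketch of this example-based estimate using a plain (unweighted) k-NN average; the paper refines it with the LWR step described below:

```python
import numpy as np

def estimate_angles(query_feat, db_feats, db_angles, k=5):
    """db_feats: (n, d) example features; db_angles: (n, m) their known angles."""
    idx = np.argsort(np.linalg.norm(db_feats - query_feat, axis=1))[:k]
    return db_angles[idx].mean(axis=0)        # average the angles of the k nearest examples
```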
The algorithm flow: input query → feature extraction → processed query → PSH (LSH) against the database of examples → LWR (regression) → output match.
The image features: multi-scale edge histograms. The feature space is the concatenation of the histograms of all the different windows; the windows are of size 8, 16, 32. [Figure: two example windows A and B.]
PSH: the basic assumption. There are two metric spaces here: the feature space and the parameter space. We want similarity to be measured in the angle space, whereas LSH works in the feature space. Assumption: the feature space is closely related to the parameter space.
Insight: Manifolds. A manifold is a space in which every point has a neighborhood resembling a Euclidean space, but the global structure may be complicated: curved. For example, lines are 1-D manifolds, planes are 2-D manifolds, etc.
Is this magic? [Figure: the mapping between the feature space and the parameter space (angles) around a query q.] The parameter space is of lower dimension.
Parameter Sensitive Hashing (PSH). The trick: estimate the performance of different hash functions on examples, and select those that are sensitive to the pose parameters θ. The hash functions are applied in feature space, but the KNN are valid in angle space.
PSH as a classification problem: label pairs of examples as similar / non-similar according to their angles; define hash functions h on the feature space; predict the labeling of similar / non-similar pairs using h; compare the labelings; if the labeling by h is good accept h, else change h.
Labels: +1 (similar, within r = 0.25) and -1 (non-similar). Note that the positive and negative examples must be chosen with care, since there are many more negative pairs than positive ones.
A binary hash function. [Plot: a threshold on one feature (x axis) against θ (y axis).] Observe that the y axis is θ, so if two points are similar in θ space they fall into the same bucket, and if they are not similar in θ space they fall into different bins (this is exactly the big assumption that θ and the features are related).
[Figure: the hash threshold T.] Remember that for a k-bit sensitive hash function the collision probability for a true neighbor is p1^k and the false-positive probability is p2^k; with L tables the overall chance of finding the neighbor becomes 1 - (1 - p1^k)^L.
Local Weighted Regression (LWR). Given a query image, PSH returns the KNN; LWR uses them to compute a weighted average of the estimated angles for the query. The estimator g weights the neighbors' parameters: we look for a β that minimizes the error against the neighbors' parameters θ_i in parameter space while weighting each neighbor by its distance in feature space; for neighbors close in feature space the weight is high, so g must produce a value close to θ_i, keeping the distance in parameter space small.
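A zeroth-order sketch of this locally-weighted step (the Gaussian kernel and bandwidth h are illustrative assumptions, not the paper's exact estimator):

```python
import numpy as np

def lwr_estimate(query_feat, nn_feats, nn_angles, h=1.0):
    """Weight each neighbor's angles by a kernel of its feature-space distance to the query."""
    d = np.linalg.norm(nn_feats - query_feat, axis=1)
    w = np.exp(-(d / h) ** 2)                 # closer in feature space -> larger weight
    return (w[:, None] * nn_angles).sum(axis=0) / w.sum()
```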
Results. Synthetic data were generated: 13 angles (1 for the rotation of the torso, 12 for joints); 150,000 images; nuisance parameters added: clothing, illumination, facial expression. (How can you measure the rotation of the torso around the vertical axis?)
From 1,775,000 example pairs, 137 out of 5,123 meaningful features were selected (how??), with 18-bit hash functions (k) and 150 hash tables (L). Recall: P1 is the probability of a positive hash, P2 the probability of a bad hash, and B the maximal number of points in a bucket that we are willing to search. Without feature selection, 40 bits and 1,000 hash tables would have been needed. These are slightly better results than those in the actual paper. There is a large number of examples, so the space is nicely sampled, and taking only 137 of 5,123 features is a considerable dimension reduction. For this problem r = 0.25, R = 0.5 and ε = 1; the authors don't say how they selected the 137 features (my guess is that the optimal threshold was non-zero on 137 features only). k = 18 bits and L = 150 tables follow Indyk's recommendation, and with ε = 1 the query cost behaves like d·N^(1/2). Test on 1,000 synthetic examples: PSH searched only 3.4% of the data per query (the 3.4% refers to the whole structure of the hash table).
Results – real data 800 images. Processed by a segmentation algorithm. 1.3% of the data were searched.
Results – real data Note the improvement by LWR
Interesting mismatches: note the difficult background; note the error correction by LWR, but also cases where it is worse than the top match (column 4). Column 3 might come from a mismatch between the two distance metrics (features vs. parameter space), or because the manifold is not sampled densely enough.
Fast pose estimation - summary: a fast way to compute the angles of a human body figure; moving from one representation space to another; training a sensitive hash function; smart KNN averaging.
Food for Thought: The basic assumption may be problematic (distance metric, representations). The training set should be dense. Texture and clutter. In general, some features are more important than others and should be weighted, or some dependency between the data points should be assumed.
Food for Thought: Point Location in Different Spheres (PLDS). Given: n spheres in R^d, centered at P = {p1,…,pn} with radii {r1,…,rn}. Goal: preprocess the points in P so that, given a query q, we can find a point pi whose sphere covers the query q. [Figure: q, pi, ri.] Courtesy of Mohamad Hegaze.
Mean-Shift Based Clustering in High Dimensions: A Texture Classification Example / B. Georgescu, I. Shimshoni, and P. Meer. Motivation: clustering high-dimensional data using local density measurements (e.g. in feature space). Statistical curse of dimensionality: sparseness of the data. Computational curse of dimensionality: expensive range queries. The LSH parameters should be adjusted for optimal performance. (In image segmentation we work in a feature space that can be high-dimensional.)
Outline Mean-shift in a nutshell + examples. Our scope: Mean-shift in high dimensions – using LSH. Speedups: Finding optimal LSH parameters. Data-driven partitions into buckets. Additional speedup by using LSH data structure.
Mean-Shift in a Nutshell. [Figure: a point and its bandwidth window; the window mean defines the shift toward a mode.]
KNN in mean-shift: the bandwidth should be inversely proportional to the density in the region: high density, small bandwidth; low density, large bandwidth. The bandwidth of a point is based on its kth nearest neighbor (adaptive mean-shift vs. non-adaptive); see the sketch below.
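A brute-force sketch of that adaptive bandwidth (exactly the kind of computation the LSH structure is meant to accelerate; k and the l2 norm are illustrative choices):

```python
import numpy as np

def adaptive_bandwidths(X, k=100):
    """h_i = distance from x_i to its k-th nearest neighbor (X is an (n, d) array, k < n)."""
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)   # (n, n) pairwise distances
    return np.sort(D, axis=1)[:, k]          # column 0 is the zero self-distance
```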
Mean-shift. [Equation slide: the mean-shift update, the kernel-weighted mean of the points inside the window minus the current point.]
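A minimal sketch of the mean-shift iteration behind that equation, using a flat kernel and a fixed bandwidth (illustrative only):

```python
import numpy as np

def mean_shift_point(x, X, h, max_iter=100, tol=1e-6):
    """Repeatedly move x to the mean of the data points within bandwidth h."""
    for _ in range(max_iter):
        window = X[np.linalg.norm(X - x, axis=1) <= h]   # the range query LSH accelerates
        if len(window) == 0:
            break
        new_x = window.mean(axis=0)                      # the mean-shift step
        if np.linalg.norm(new_x - x) < tol:
            break
        x = new_x
    return x                                             # (approximate) mode of the density
```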
Image segmentation algorithm. Input: data in 5D (3 color + 2 spatial x,y) or 3D (1 gray + 2 spatial x,y); the resolution is controlled by the bandwidths hs (spatial) and hr (color). Apply filtering. Mean-shift: A Robust Approach Towards Feature Space Analysis, D. Comaniciu et al., TPAMI '02.
Image segmentation algorithm. Filtering: we apply mean-shift and then assign each pixel the value of its nearest mode, z_i = (x_i^s, y_c^g). Segmentation: all z_i that are close enough are assigned to the same cluster. [Figures: mean-shift trajectories; original, filtered, and segmented images.]
Filtering examples. [Figures: original and filtered squirrel; original and filtered baboon.] Mean-shift: A Robust Approach Towards Feature Space Analysis, D. Comaniciu et al., TPAMI '02.
Segmentation examples Mean-shift: A Robust Approach Towards Feature Space Analysis. D. Comaniciu et. al. TPAMI 02’
Mean-shift in high dimensions. Computational curse of dimensionality: expensive range queries, implemented with LSH. Statistical curse of dimensionality: sparseness of the data, handled with a variable bandwidth.
LSH-based data structure: choose L random partitions, each consisting of K pairs (d_k, v_k); for each point we check, for every pair, on which side of the cut value v_k its d_k-th coordinate falls. This partitions the data into cells.
Choosing the optimal K and L: for a query q, compute the smallest number of distances to points in its buckets. Each cell created by a partition l defines a square in the rectangle (remember that there may be several partitions over the same coordinate axis). C_intersect lies close to the center of C_union, so each query q returns some of the NN of q; when L is above some threshold the neighborhood of q is covered, but other nuisance points get in as well. The N_cl formula arises because the K cuts give roughly K/d + 1 regions per coordinate; raising to the power d gives the cell volume, and dividing the n points by this bucket volume gives the expected number of points per cell.
Choosing optimal K and L: determine accurately the KNN distances (bandwidths) for m randomly-selected data points, choose an error threshold e, and require that the optimal K and L produce approximate distances within that threshold.
Choosing optimal K and L: for each K, estimate the error for all L in one run and find the minimal L(K) satisfying the constraint; then minimize the running time t(K, L(K)). In other words, for each K we find the error for all L, apply the threshold on the error to pick L(K), and then measure the running time for the chosen points (which reflects how many extra distances were actually computed) and pick the K and L that give the minimum running time. [Plots: approximation error for (K, L); L(K) for e = 0.05; running time t[K, L(K)].]
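A sketch of that search loop (error_fn and time_fn stand for the measured approximation error and query time, candidate_L is assumed sorted ascending; all names are illustrative):

```python
def tune_k_l(candidate_K, candidate_L, error_fn, time_fn, eps=0.05):
    """For each K take the smallest L meeting the error threshold, then pick the fastest pair."""
    best = None
    for K in candidate_K:
        L_ok = next((L for L in candidate_L if error_fn(K, L) <= eps), None)
        if L_ok is None:
            continue
        t = time_fn(K, L_ok)
        if best is None or t < best[0]:
            best = (t, K, L_ok)
    return best   # (time, K, L) or None
```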
Data-driven partitions: in the original LSH, cut values are drawn uniformly at random from the range of the data. Suggestion: randomly select a point from the data and use one of its coordinates as the cut value. With data-driven cuts many cells contain a small number of points, whereas with the uniform distribution many points pile up in a large number of cells. [Plot: points/bucket distribution, uniform vs. data driven.]
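A one-cut sketch of the data-driven suggestion (illustrative):

```python
import numpy as np

def data_driven_cut(X, rng):
    """Pick a random coordinate of a random data point as the cut value for a random axis."""
    axis = rng.integers(X.shape[1])
    value = X[rng.integers(X.shape[0]), axis]
    return axis, value

# Usage: axis, value = data_driven_cut(X, np.random.default_rng(0))
```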
Additional speedup: the LSH data structure itself is used for an additional speedup of the mean-shift iterations. [Figure slide.]
Speedup results: 65,536 points, 1,638 points sampled, k = 100. Note that the speedup does not even depend on the dimension.
Food for thought. [Figure: behavior in low dimension vs. high dimension.]
A thought for food… Choose K and L by sample learning, or take the traditional values; can one estimate K and L without sampling? Does it help to know the data dimensionality or the data manifold? Intuitively, the dimensionality implies the number of hash functions needed; the catch is that efficient dimensionality learning itself requires KNN. To demonstrate the intrinsic-dimensionality issue, think of data spread on a line manifold living in 3D: one projection is enough to capture it, i.e. project only in one direction. The traditional choice is based on Indyk's estimation, which assumes a roughly Gaussian distribution of the data.
Summary: LSH trades some accuracy for a large gain in complexity. Applications that involve massive data in high dimensions need LSH's fast performance. Extensions of LSH to different spaces (PSH). Learning the LSH parameters and hash functions for different applications.
Conclusion: ...but in the end everything depends on your data set. Try it at home: visit http://web.mit.edu/andoni/www/LSH/index.html, email Alex Andoni (Andoni@mit.edu), and test it on your own data (C code under Red Hat Linux).
Thanks Ilan Shimshoni (Haifa). Mohamad Hegaze (Weizmann). Alex Andoni (MIT). Mica and Denis.