Similarity Search in High Dimensions via Hashing
Paper by: Aristides Gionis, Piotr Indyk, Rajeev Motwani
Approximate similarity search is intended for high-dimensional data. The idea behind the approach is that, since the selection of features and the choice of distance metric are rather heuristic, determining an approximate nearest neighbor should suffice for most practical purposes.
The basic idea is to hash the points from the database so that the probability of collision is much higher for objects that are close to each other than for those that are far apart. The need arises from the so-called 'curse of dimensionality' in large databases: when the dimension is high and an exact answer is required, all the searching techniques reduce to little more than a linear search.
The similarity search problem involves finding the nearest (most similar) object to a given query in a given collection of objects. Typically the objects of interest are represented as points in $\mathbb{R}^d$, and a distance metric is used to measure the similarity of the objects. The basic problem is then to perform indexing or similarity searching for query objects.
The difficulty is that the present methods are not entirely satisfactory for large d. The approach rests on the observation that most applications do not need the exact answer, and it gives the user a time-quality trade-off. These statements assume that searching for approximate answers is faster than finding exact ones.
The technique is to use locality-sensitive hashing (LSH) instead of space-sensitive hashing. The idea is to hash the points using several hash functions so that, for each function, the probability of collision is much higher for objects that are close to each other than for those that are far apart. One can then determine near neighbors by hashing the query point and retrieving the elements stored in the buckets that contain it.
Earlier work on LSH (locality-sensitive hashing) achieved a worst-case time of $O(dn^{1/\epsilon})$ for approximate nearest neighbor search over an n-point database. In the presented paper, the new technique improves the worst-case running time to $O(dn^{1/(1+\epsilon)})$, which is a significant improvement: with $\epsilon = 1$, for example, the bound drops from $O(dn)$ to $O(d\sqrt{n})$.
Preliminaries
$l_p^d$ denotes the space $\mathbb{R}^d$ under the $l_p$ norm, i.e., the length of a vector $(x_1, \ldots, x_d)$ is defined as $(|x_1|^p + \ldots + |x_d|^p)^{1/p}$. Further, $d(p,q)$ denotes the distance between the points p and q in $l_p^d$. We use $H^d$ to represent the Hamming metric space of dimension d, and $d_H(p,q)$ to denote the Hamming distance.
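The two distances in the preliminaries can be written out directly. The following is a minimal sketch; the function names `lp_norm`, `lp_distance`, and `hamming_distance` are illustrative choices and do not come from the paper.

```python
def lp_norm(x, p):
    """Length of vector x under the l_p norm: (sum |x_i|^p)^(1/p)."""
    return sum(abs(xi) ** p for xi in x) ** (1.0 / p)


def lp_distance(a, b, p):
    """Distance d(a, b) in l_p^d, i.e. the l_p norm of a - b."""
    if len(a) != len(b):
        raise ValueError("points must have the same dimension")
    return lp_norm([ai - bi for ai, bi in zip(a, b)], p)


def hamming_distance(u, v):
    """Hamming distance d_H(u, v): number of positions where u and v differ."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same dimension")
    return sum(ui != vi for ui, vi in zip(u, v))
```

For instance, `lp_distance((0, 0), (3, 4), 1)` measures the l_1 distance and `lp_distance((0, 0), (3, 4), 2)` the Euclidean one.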
The nearest neighbor search problem is to find the point in the given database that is closest to a query q; the definition generalizes to K-nearest neighbor search (KNNS), where the K > 1 nearest points are returned. The approximate version of the problem generalizes in the same way to finding K (> 1) approximate nearest neighbors. Here we wish to find K points $p_1, \ldots, p_K$ such that the distance of $p_i$ to the query q is at most $(1+\epsilon)$ times the distance from the i-th nearest point to q.
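The approximate K-nearest-neighbor condition can be checked mechanically against a brute-force scan. The sketch below is only an illustration of the definition: the helper name `is_approx_knn` is hypothetical, and the l_1 distance is used as the default metric by assumption.

```python
def l1(p, q):
    """l_1 (Manhattan) distance between two points."""
    return sum(abs(a - b) for a, b in zip(p, q))


def is_approx_knn(result, points, q, eps, dist=l1):
    """Return True if the i-th returned point lies within (1 + eps) times
    the distance from q to its true i-th nearest neighbor in `points`."""
    exact = sorted(dist(p, q) for p in points)  # true neighbor distances
    return all(
        dist(p, q) <= (1 + eps) * exact[i]
        for i, p in enumerate(result)
    )
```

With `eps = 0` the check accepts only answers whose distances match the exact nearest neighbors; larger values of `eps` admit progressively looser answers.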
The Algorithm
Two assumptions are made about the data. The distance is measured in the $l_1$ norm, and all the coordinates of the points in P are positive integers.
Locality Sensitive Hashing
The new algorithm is in many respects more natural than the earlier ones: it does not require that a bucket store only one point, it has a better running time, and its analysis is generalized to the case of secondary memory.
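Let C be the largest coordinate in all points in P. Then we can embed P into the Hamming cube $H^{d'}$ with $d' = Cd$ by transforming each point $p = (x_1, \ldots, x_d)$ into the binary vector $v(p) = \mathrm{Unary}_C(x_1) \ldots \mathrm{Unary}_C(x_d)$, where $\mathrm{Unary}_C(x)$ denotes the unary representation of x, i.e., a sequence of x ones followed by C − x zeroes. This embedding preserves distances: the $l_1$ distance between p and q equals the Hamming distance between v(p) and v(q).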
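The embedding can be illustrated with a few lines of code. This is a minimal sketch with illustrative function names; it uses the usual convention of x ones followed by C − x zeroes, although the distance-preserving property does not depend on which symbol comes first.

```python
def unary(x, C):
    """Unary code of x with length C: x ones followed by C - x zeroes."""
    if not 0 <= x <= C:
        raise ValueError("coordinate must lie in [0, C]")
    return [1] * x + [0] * (C - x)


def embed(point, C):
    """Concatenate the unary codes of all coordinates (length C * d)."""
    bits = []
    for x in point:
        bits.extend(unary(x, C))
    return bits


def l1(p, q):
    """l_1 distance between two integer points."""
    return sum(abs(a - b) for a, b in zip(p, q))


def hamming(u, v):
    """Number of positions in which two bit vectors differ."""
    return sum(a != b for a, b in zip(u, v))


p, q, C = (1, 3), (2, 0), 3
vp, vq = embed(p, C), embed(q, C)
# The l_1 distance of the points equals the Hamming distance of their codes.
assert l1(p, q) == hamming(vp, vq)
```

Each coordinate contributes exactly |x − y| differing bits, because two unary codes differ only in the positions between the smaller and the larger value.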
For an integer l, choose l subsets $I_1, \ldots, I_l$ of $\{1, \ldots, d'\}$. Let $p|_I$ denote the projection of the vector p on the coordinate set I, obtained by selecting the coordinate positions given by I and concatenating the bits in those positions. Denote $g_j(p) = p|_{I_j}$. For the preprocessing, we store each $p \in P$ in the bucket $g_j(p)$, for $j = 1, \ldots, l$. As the total number of buckets may be large, we compress the buckets by resorting to standard hashing.
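The preprocessing step might be sketched as follows. Several details here are assumptions made for illustration and are not stated in the text above: each subset holds `k` positions sampled uniformly at random, positions are 0-based, buckets are kept in Python dictionaries rather than a compressed hash table, and points are referred to by their index in the input list.

```python
import random
from collections import defaultdict


def sample_subsets(d_prime, k, l, seed=0):
    """Choose l subsets I_1..I_l, each of k positions from range(d_prime).
    The subset size k and the uniform sampling are assumptions."""
    rng = random.Random(seed)
    return [[rng.randrange(d_prime) for _ in range(k)] for _ in range(l)]


def project(bits, I):
    """g(p) = p|I: the bits of p at the positions in I, concatenated."""
    return tuple(bits[i] for i in I)


def build_indices(vectors, subsets):
    """Store every point id in bucket g_j(p) of index j, for each j."""
    indices = [defaultdict(list) for _ in subsets]
    for pid, bits in enumerate(vectors):
        for table, I in zip(indices, subsets):
            table[project(bits, I)].append(pid)
    return indices
```

Identical bit vectors always land in the same bucket of every index, and vectors that differ in few positions share a bucket whenever none of the sampled positions falls on a differing bit.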
Thus we use two levels of hashing. The LSH function maps each point p to a bucket $g_j(p)$, while a standard hash function maps the contents of these buckets into a hash table of size M. If a bucket in a given index is full, a new point is simply not added to it, since the point will be added to some other index with very high probability. This saves the overhead of maintaining a link structure.
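The second hashing level and the drop-when-full rule could look like the sketch below. The class name `BoundedIndex`, the bucket capacity `B`, and the use of Python's built-in `hash` as the standard hash function are all assumptions made for illustration.

```python
class BoundedIndex:
    """One index: a table of M slots, each holding at most B point ids.
    No chaining is used, so a full slot silently rejects new points."""

    def __init__(self, M, B):
        self.M = M
        self.B = B
        self.table = [[] for _ in range(M)]

    def _slot(self, key):
        # Second-level (standard) hash: map the LSH key g_j(p) to a slot.
        return hash(key) % self.M

    def insert(self, key, pid):
        """Add pid under LSH key `key`; return False if the slot is full."""
        bucket = self.table[self._slot(key)]
        if len(bucket) >= self.B:
            return False  # dropped here; other indices may still hold it
        bucket.append(pid)
        return True

    def lookup(self, key):
        """Return the point ids stored in the slot for `key`."""
        return list(self.table[self._slot(key)])
```

Because different LSH keys can share a slot, a lookup may return points whose key differs from the query's; such candidates are filtered out when true distances are computed at query time.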
To process a query q, we search the indices $g_1(q), \ldots, g_l(q)$ until we either encounter at least $c \cdot l$ points, for a constant c, or have used all l indices. The number of disk accesses is always bounded above by l, the number of indices. Let $p_1, \ldots, p_t$ be the points encountered in the process. As output we return the K nearest of these points, or fewer if the search did not find that many.
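The query procedure can be sketched end to end on bit vectors. This is an illustration under stated assumptions rather than the paper's implementation: subsets contain `k` uniformly sampled 0-based positions, buckets are plain dictionaries, the stopping rule counts distinct points, and candidates are ranked by Hamming distance with ties broken by point id.

```python
import random
from collections import defaultdict


def hamming(u, v):
    """Number of positions in which two bit vectors differ."""
    return sum(a != b for a, b in zip(u, v))


def project(bits, I):
    """g(p) = p|I: the bits of p at the positions listed in I."""
    return tuple(bits[i] for i in I)


def build(vectors, d_prime, k, l, seed=0):
    """Sample l subsets of k positions and fill one index per subset."""
    rng = random.Random(seed)
    subsets = [[rng.randrange(d_prime) for _ in range(k)] for _ in range(l)]
    indices = [defaultdict(list) for _ in range(l)]
    for pid, bits in enumerate(vectors):
        for table, I in zip(indices, subsets):
            table[project(bits, I)].append(pid)
    return subsets, indices


def query(q_bits, vectors, subsets, indices, K, c):
    """Probe g_1(q)..g_l(q) until c * l points are seen or indices run out,
    then return the ids of the (at most) K nearest encountered points."""
    l = len(subsets)
    seen = set()
    for table, I in zip(indices, subsets):
        seen.update(table.get(project(q_bits, I), []))
        if len(seen) >= c * l:
            break  # enough candidates collected
    ranked = sorted(seen, key=lambda pid: (hamming(vectors[pid], q_bits), pid))
    return ranked[:K]
```

At most l buckets are probed per query, which matches the bound on disk accesses when each bucket fits in one block.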
The principle behind the method is that the probability of collision of two points p and q is closely related to the distance between them. Specifically, the larger the distance, the smaller the collision probability.
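This relationship can be made concrete for bit-sampling hash functions on the Hamming cube of dimension d'. The derivation below is a sketch under one assumption not stated above: each subset I consists of k positions drawn independently and uniformly at random from {1, …, d'}.

```latex
% One uniformly sampled position agrees on p and q unless it falls on
% one of the d_H(p,q) differing bits:
\Pr_{i}\left[\, p_i = q_i \,\right] \;=\; 1 - \frac{d_H(p,q)}{d'}

% With k independently sampled positions, all k must agree:
\Pr\left[\, g(p) = g(q) \,\right] \;=\; \left( 1 - \frac{d_H(p,q)}{d'} \right)^{k}
```

The right-hand side decreases monotonically as the Hamming distance grows, which is exactly the locality-sensitive property the method relies on.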