1
Data-dependent Hashing for Similarity Search
Alex Andoni (Columbia University)
2
Nearest Neighbor Search (NNS)
Preprocess: a set P of points. Query: given a query point q, report a point p* ∈ P with the smallest distance to q.
3
Motivation
Generic setup: points model objects (e.g. images); distance models a (dis)similarity measure.
Application areas: machine learning (k-NN rule), speech/image/video/music recognition, vector quantization, bioinformatics, etc.
Distances: Hamming, Euclidean, edit distance, earthmover distance, etc.
A core primitive for: closest pair, clustering, etc.
[Figure: example binary (Hamming) vectors, with a query q and its nearest neighbor p*]
4
Curse of Dimensionality
d = 2 is easy (Voronoi diagram); hard when d ≫ log n.

Algorithm | Query time | Space
Full indexing | d · log n | n^{O(d)} (Voronoi diagram size)
No indexing (linear scan) | O(n · d) | O(n · d)

Query time n^{0.99} · d^{O(1)} would refute the Strong ETH [Wil'04].
5
Approximate NNS
c-approximate r-near neighbor: given a query point q, report a point p' ∈ P s.t. ||p' - q|| ≤ cr, as long as there is some point within distance r of q.
Practice: use for exact NNS. Filtering: gives a (hopefully small) set of candidates.
[Figure: query q, its near neighbor p* within distance r, and a reported point p' within distance cr]
6
Plan
Space decompositions
Structure in data: PCA-tree
No structure in data: data-dependent hashing
Towards practice
8
It's all about space decompositions
Approximate space decompositions (for low-dimensional): [Arya-Mount'93], [Clarkson'94], [Arya-Mount-Netanyahu-Silverman-Wu'98], [Kleinberg'97], [Har-Peled'02], [Arya-Fonseca-Mount'11], ...
Random decompositions == hashing (for high-dimensional): [Kushilevitz-Ostrovsky-Rabani'98], [Indyk-Motwani'98], [Indyk'01], [Gionis-Indyk-Motwani'99], [Charikar'02], [Datar-Immorlica-Indyk-Mirrokni'04], [Chakrabarti-Regev'04], [Panigrahy'06], [Ailon-Chazelle'06], [A.-Indyk'06], [Terasawa-Tanaka'07], [Kapralov'15], [Pagh'16], [Becker-Ducas-Gama-Laarhoven'16], [Christiani'17], ...
9
Low-dimensional: kd-trees, approximate Voronoi, ...; approximation c = 1 + ε.
Usual factor: (1/ε)^{O(d)}
10
High-dimensional Locality-Sensitive Hashing (LSH)
Space partitions: fully random.
Johnson-Lindenstrauss Lemma: project to a random subspace.
LSH runtime: n^{1/c} for c-approximation.
11
Locality-Sensitive Hashing
[Indyk-Motwani'98]
Random hash function h on R^d satisfying:
  for a close pair (||p - q|| ≤ r): Pr[h(p) = h(q)] = P1 is "high" ("not-so-small")
  for a far pair (||p - q'|| > cr): Pr[h(p) = h(q')] = P2 is "small"
Use several hash tables: n^ρ of them, where ρ = log(1/P1) / log(1/P2).
[Figure: collision probability Pr[g(p) = g(q)] as a function of ||p - q||, with value P1 at distance r and P2 at distance cr]
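The definition above can be made concrete with a minimal bit-sampling LSH sketch for Hamming space (a sketch in the spirit of [IM'98], not code from the talk; the class name and parameters are illustrative): each table hashes a point by k random coordinates, and a query collects candidates from the buckets it falls into.

```python
import random
from collections import defaultdict

class HammingLSH:
    """Bit-sampling LSH for Hamming space: each of the L tables hashes a point
    by k randomly chosen coordinates, so close points collide noticeably more
    often than far points."""

    def __init__(self, dim, k, L, seed=0):
        rng = random.Random(seed)
        # For each table, sample the k coordinates to project onto.
        self.projections = [[rng.randrange(dim) for _ in range(k)] for _ in range(L)]
        self.tables = [defaultdict(list) for _ in range(L)]

    def _key(self, point, proj):
        return tuple(point[i] for i in proj)

    def insert(self, point_id, point):
        for table, proj in zip(self.tables, self.projections):
            table[self._key(point, proj)].append(point_id)

    def query_candidates(self, query):
        # Union of the buckets the query lands in; the caller then filters
        # the candidates by exact distance.
        cands = set()
        for table, proj in zip(self.tables, self.projections):
            cands.update(table.get(self._key(query, proj), []))
        return cands
```

Choosing k ≈ log_{1/P2} n and L ≈ n^ρ tables gives the stated n^ρ query time, up to the cost of checking candidates.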
12
LSH Algorithms
Data structures with space n^{1+ρ} and query time n^ρ (the "c = 2" column gives the exponent at approximation c = 2):

Hamming space:
  ρ = 1/c (c = 2: ρ = 1/2) [IM'98]
  lower bound: ρ ≥ 1/c [MNP'06, OWZ'11]
Euclidean space:
  ρ = 1/c (c = 2: ρ = 1/2) [IM'98, DIIM'04]
  other ways to use LSH: ρ ≈ 1/c^2 (c = 2: ρ = 1/4) [AI'06]
  lower bound: ρ ≥ 1/c^2 [MNP'06, OWZ'11]
13
My dataset is not worst-case
Practice: data-dependent partitions optimize the partition to your dataset:
  PCA-tree [S'91, M'01, VKD'09]
  randomized kd-trees [SAH08, ML09]
  spectral/PCA/semantic/WTA hashing [WTF08, CKW09, SH09, YSRL11]
  ...
14
Why do data-dependent partitions outperform random ones?
Practice vs Theory:
Data-dependent partitions often outperform (vanilla) random partitions, but with no guarantees (of correctness or performance); it is hard to measure progress or to understand the trade-offs.
JL is generally optimal [Alon'03, JW'13], even for some NNS regimes! [AIP'06]
15
Plan
Space decompositions
Structure in data: PCA-tree
No structure in data: data-dependent hashing
Towards practice
16
How to study the phenomenon:
Why do data-dependent partitions outperform random ones? Because of further special structure in the dataset: "intrinsic dimension is small" [Clarkson'99, Karger-Ruhl'02, Krauthgamer-Lee'04, Beygelzimer-Kakade-Langford'06, Indyk-Naor'07, Dasgupta-Sinha'13, ...]. But the dataset is really high-dimensional...
17
Towards higher-dimensional datasets
[Abdullah-A.-Krauthgamer-Kannan'14]
"Low-dimensional signal + high-dimensional noise"
Signal: P ⊂ U, where U ⊂ R^d is a subspace of dimension k ≪ d.
Data: each point is perturbed by full-dimensional Gaussian noise N_d(0, σ^2).
18
Model properties
Data: each data point is p + G; query: q + g_q, s.t.:
  ||q - p*|| ≤ 1 for the "nearest neighbor"
  ||q - p|| ≥ 1 + ε for everybody else
Noise: N(0, σ^2) per coordinate, with σ ≈ 1/d^{1/4} (up to poly factors in ε and log n), roughly the largest noise for which the nearest neighbor is still the same.
The noise is large:
  displacement ≈ σ√d ≈ d^{1/4} ≫ 1
  the top k dimensions capture only a sub-constant fraction of the mass
  JL would not work: after the noise, the gap is very close to 1
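To make the model concrete, here is a small numpy sketch (my own, with illustrative parameters) that plants a nearest neighbor at distance 1 inside a random k-dimensional subspace and then adds full-dimensional Gaussian noise with σ = d^{-1/4}:

```python
import numpy as np

def _unit(rng, shape):
    """Random unit vector(s)."""
    v = rng.standard_normal(shape)
    return v / np.linalg.norm(v, axis=-1, keepdims=True)

def make_noisy_instance(n=1000, d=4096, k=20, eps=0.2, seed=0):
    """Synthetic 'low-dimensional signal + high-dimensional noise' instance
    (parameters are illustrative): the signal lives in a random k-dimensional
    subspace U; every point is then perturbed by Gaussian noise N(0, sigma^2)
    per coordinate with sigma = d^(-1/4)."""
    rng = np.random.default_rng(seed)
    U, _ = np.linalg.qr(rng.standard_normal((d, k)))       # basis of the signal subspace

    q_low = rng.standard_normal(k)                          # query, in subspace coordinates
    near_low = q_low + _unit(rng, k)                        # planted neighbor at distance 1
    far_low = q_low + _unit(rng, (n - 1, k)) * (1.0 + eps + rng.random(n - 1))[:, None]
    P_low = np.vstack([near_low, far_low])                  # all data points, in the subspace

    sigma = d ** (-0.25)
    P = P_low @ U.T + rng.standard_normal((n, d)) * sigma   # lift to R^d and add noise
    q = q_low @ U.T
    return P, q, sigma
```

The noise vector has norm about σ√d = d^{1/4} ≫ 1, matching the "noise is large" bullet, even though each individual noise coordinate is tiny.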
19
NNS in this model
Goal: NNS performance as if we were in k dimensions?
The best we can hope for: the dataset can contain a "worst-case" k-dimensional instance, so at best we get a reduction from dimension d to dimension k.
Spoiler: yes, we can do it!
20
Tool: PCA (Principal Component Analysis)
Like the Singular Value Decomposition. Top PCA direction: the direction maximizing the variance of the data.
21
Naïve: just use PCA? Use PCA to find the "signal subspace" U?
Find the top-k PCA subspace, project everything onto it, and solve NNS there.
Does NOT work: sparse signal directions are overpowered by noise directions, and PCA is good only on "average", not in the "worst case".
22
Our algorithms: intuition
Extract "well-captured points": points with signal mostly inside the top PCA space; this should work for a large fraction of the points. Iterate on the rest.
Algorithm 1: iterative PCA
Algorithm 2: (modified) PCA-tree
Analysis: via [Davis-Kahan'70, Wedin'72]
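The following is a schematic sketch of the iterative-PCA idea just described (my own rendering; the capture threshold and round count are placeholders, not the parameters or guarantees of [Abdullah-A.-Krauthgamer-Kannan'14]):

```python
import numpy as np

def iterative_pca_buckets(P, k, capture_threshold=0.9, max_rounds=20):
    """Repeatedly compute the top-k PCA subspace of the remaining points, peel
    off the points that are well captured by it, and iterate on the rest.
    Returns a list of (subspace basis, projected points, original indices)."""
    remaining = np.arange(len(P))
    buckets = []
    for _ in range(max_rounds):
        if len(remaining) == 0:
            break
        X = P[remaining]
        Xc = X - X.mean(axis=0)
        _, _, Vt = np.linalg.svd(Xc, full_matrices=False)   # rows of Vt = PCA directions
        V = Vt[:k].T                                         # d x k basis of the top-k subspace
        proj = Xc @ V                                        # coordinates inside the subspace
        captured = np.linalg.norm(proj, axis=1)
        total = np.linalg.norm(Xc, axis=1) + 1e-12
        well_captured = captured >= capture_threshold * total
        if not well_captured.any():
            break
        # A low-dimensional NNS structure (e.g. a k-d tree on `proj`) would be
        # built for each bucket; here we just record the bucket.
        buckets.append((V, proj[well_captured], remaining[well_captured]))
        remaining = remaining[~well_captured]
    return buckets
```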
23
What if the data has no structure? (worst-case)
Plan
Space decompositions
Structure in data: PCA-tree
No structure in data: data-dependent hashing
Towards practice
24
Random hashing is not optimal!
Back to worst-case NNS. Data structures with space n^{1+ρ} and query time n^ρ (the "c = 2" column gives the exponent at approximation c = 2):

Hamming space:
  LSH: ρ = 1/c (c = 2: ρ = 1/2) [IM'98]
  LSH lower bound: ρ ≥ 1/c [MNP'06, OWZ'11]
  data-dependent: complicated exponent (c = 2: ρ = 1/2 - ε) [AINR'14]
  data-dependent: ρ ≈ 1/(2c - 1) (c = 2: ρ = 1/3) [AR'15]
Euclidean space:
  LSH: ρ ≈ 1/c^2 (c = 2: ρ = 1/4) [AI'06]
  LSH lower bound: ρ ≥ 1/c^2 [MNP'06, OWZ'11]
  data-dependent: complicated exponent (c = 2: ρ = 1/4 - ε) [AINR'14]
  data-dependent: ρ ≈ 1/(2c^2 - 1) (c = 2: ρ = 1/7) [AR'15]
25
Beyond Locality Sensitive Hashing
Data-dependent hashing (even when there is no structure): a random hash function, chosen after seeing the given dataset, and efficiently computable.
Example: the Voronoi diagram, but then it is expensive to compute the hash function.
26
Construction of hash function
[A.-Indyk-Nguyen-Razenshteyn'14]
Two components:
  a nice geometric structure, which has a better LSH
  a reduction to such structure, which is data-dependent: like a (weak) "regularity lemma" for a set of points
27
Nice geometric structure
Pseudo-random configuration: points on a sphere of radius cr/√2, so that a far pair is (near) orthogonal (at distance ≈ cr), as when the dataset is random on the sphere, while a close pair is at distance r.
Lemma: ρ = 1/(2c^2 - 1), via Cap Carving.
28
Reduction to nice configuration
[A.-Razenshteyn'15]
Idea: iteratively make the point set more isotropic by narrowing a ball around it.
Algorithm:
  find dense clusters (smaller radius, large fraction of points) and recurse on each dense cluster
  apply Cap Hashing on the rest and recurse on each "cap" (e.g., dense clusters might reappear)
Why is this OK? Once there are no dense clusters, the remainder is near-orthogonal (radius = 100cr), which is even better!
[Figure: ball of radius 100cr with dense clusters of radius 99cr carved out; picture not to scale or dimension]
29
Hash function: described by a tree (like a hash table).
[Figure: recursive partition of a ball of radius 100cr; picture not to scale or dimension]
30
Dense clusters
Current dataset: radius R.
A dense cluster:
  contains n^{1-δ} points (within a ball of radius (√2 - ε)R)
  has smaller radius: (1 - Ω(ε^2)) R
After we remove all clusters: for any point on the surface, there are at most n^{1-δ} points within distance (√2 - ε)R; the other points are essentially orthogonal!
When applying Cap Carving with parameters (P1, P2):
  empirical number of far points colliding with the query: ≈ n·P2 + n^{1-δ}
  as long as n·P2 ≫ n^{1-δ}, the "impurity" doesn't matter!
(ε and δ are trade-off parameters.)
31
Tree recap
During a query:
  recurse into all clusters
  descend into just one bucket at a CapCarving node
So a query will look in more than one leaf! How much branching?
Claim: at most (n^δ + 1)^{O(1/ε^2)}:
  each time we branch into at most n^δ clusters (+1)
  a cluster reduces the radius by a factor (1 - Ω(ε^2)), so the cluster-depth is at most 100/Ω(ε^2)
Progress happens in two ways:
  clusters reduce the radius
  CapCarving nodes reduce the number of far points (the empirical P2)
A tree succeeds with probability ≥ n^{-1/(2c^2 - 1) - o(1)}.
(δ is a trade-off parameter.)
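Putting slides 28-31 together, here is a schematic and heavily simplified sketch of such a tree (my own; real cap carving, the (√2 - ε)R and n^{1-δ} thresholds, and the success-probability analysis are replaced by placeholder constants, so it only illustrates the control flow, not the guarantees):

```python
import numpy as np

class Node:
    def __init__(self):
        self.cluster_children = []   # (center, radius, subtree) per dense cluster
        self.cap_children = {}       # cap index -> subtree, from one cap-carving step
        self.cap_dirs = None         # random directions standing in for the caps
        self.points = None           # leaf payload: array of point ids

def build(P, ids, radius, min_leaf=8, dense_frac=0.1, depth=0, max_depth=12, rng=None):
    """Peel off 'dense' balls of smaller radius, recurse on them, then hash the
    remainder with a spherical-cap style partition (placeholder thresholds)."""
    rng = np.random.default_rng(0) if rng is None else rng
    node = Node()
    if len(ids) <= min_leaf or depth >= max_depth:
        node.points = ids
        return node
    remaining = ids
    # 1. Dense clusters: balls of noticeably smaller radius holding a constant
    #    fraction of the current points.
    for _ in range(3):
        if len(remaining) == 0:
            break
        center = P[remaining[rng.integers(len(remaining))]]
        inside = np.linalg.norm(P[remaining] - center, axis=1) <= 0.8 * radius
        if inside.sum() >= dense_frac * len(ids):
            child = build(P, remaining[inside], 0.8 * radius, min_leaf,
                          dense_frac, depth + 1, max_depth, rng)
            node.cluster_children.append((center, 0.8 * radius, child))
            remaining = remaining[~inside]
    # 2. Cap-carving stand-in: send each remaining point to the random
    #    direction with which it has the largest inner product.
    node.cap_dirs = rng.standard_normal((16, P.shape[1]))
    if len(remaining) > 0:
        caps = np.argmax(P[remaining] @ node.cap_dirs.T, axis=1)
        for c in np.unique(caps):
            node.cap_children[int(c)] = build(P, remaining[caps == c], radius,
                                              min_leaf, dense_frac, depth + 1,
                                              max_depth, rng)
    return node

def query(node, P, q, r):
    """Collect candidates: descend into *every* cluster child (the query may be
    close to any of them) but into only *one* cap child, which is exactly the
    branching the slide bounds."""
    if node.points is not None:
        return list(node.points)
    out = []
    for center, crad, child in node.cluster_children:
        if np.linalg.norm(q - center) <= crad + r:
            out += query(child, P, q, r)
    cap = int(np.argmax(node.cap_dirs @ q))
    if cap in node.cap_children:
        out += query(node.cap_children[cap], P, q, r)
    return out
```

Called as build(P, np.arange(len(P)), radius=R) for a dataset P of radius R; query(root, P, q, r) returns the candidate set, which is then filtered by exact distances.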
32
Plan
Space decompositions
Structure in data: PCA-tree
No structure in data: data-dependent hashing
Towards practice
33
Towards (provable) practicality 1
Two components:
  nice geometric structure: the spherical case; a practical variant [A.-Indyk-Laarhoven-Razenshteyn-Schmidt'15]
  reduction to such structure
34
Classic LSH: Hyperplane [Charikar'02]
Charikar's construction: sample r uniformly (a random unit vector); h(p) = sign⟨r, p⟩.
Pr[h(p) = h(q)] = 1 - α(p,q)/π, where α(p,q) is the angle between p and q.
Performance: P1 = 3/4, P2 = 1/2.
Query time: ≈ n^{0.42}
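A minimal numpy sketch of this construction (illustrative, not the talk's code):

```python
import numpy as np

def hyperplane_lsh(k, d, seed=0):
    """Hyperplane LSH [Charikar'02]: k random hyperplanes give a k-bit hash;
    two unit vectors agree on each bit with probability 1 - angle/pi."""
    rng = np.random.default_rng(seed)
    R = rng.standard_normal((k, d))          # rows = random hyperplane normals
    def h(p):
        return tuple((R @ p > 0).astype(int))
    return h

# With k = 6 hyperplanes, far (orthogonal) pairs collide with probability
# (1/2)^6 = 1/64 and close pairs (angle pi/4) with probability (3/4)^6.
```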
35
Our algorithm: based on the cross-polytope
[A.-Indyk-Laarhoven-Razenshteyn-Schmidt'15]; the cross-polytope hash was introduced in [Terasawa-Tanaka'07].
Construction of h(p):
  pick a random rotation S
  h(p) = the index of the maximal coordinate of Sp, together with its sign
Theorem: nearly the same exponent as the (optimal) CapCarving LSH with T = 2d caps!
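A matching sketch of the cross-polytope hash (again illustrative; a dense random rotation is used here, while the later slides replace it with a fast pseudo-random one):

```python
import numpy as np

def crosspolytope_lsh(d, seed=0):
    """Cross-polytope LSH sketch: apply a random rotation, then hash to the
    closest of the 2d signed standard basis vectors, i.e. the index of the
    largest |coordinate| together with its sign."""
    rng = np.random.default_rng(seed)
    Q, _ = np.linalg.qr(rng.standard_normal((d, d)))   # random rotation (orthogonal matrix)
    def h(p):
        x = Q @ p
        i = int(np.argmax(np.abs(x)))
        return (i, int(np.sign(x[i])))
    return h
```

As with any LSH family, a full scheme concatenates several such hashes per table and uses multiple tables.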
36
Comparison: Hyperplane vs CrossPoly
Fix the probability of collision for far points at 2^{-6} = 1/64; equivalently, partition space into 2^6 = 64 pieces.
Hyperplane LSH: k = 6 hyperplanes give T = 2^k = 64 parts; exponent ρ = 0.42.
CrossPolytope LSH: T = 2d = 64 parts; exponent ρ = 0.18.
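A quick sanity check of the hyperplane number (my arithmetic, using the per-hyperplane probabilities P1 = 3/4 and P2 = 1/2 from the previous slide):

```python
import math

# With k = 6 hyperplanes: p1 = (3/4)^6 for close pairs, p2 = (1/2)^6 = 1/64 for
# far pairs, and the LSH exponent is rho = log(1/p1) / log(1/p2).
k = 6
p1, p2 = (3 / 4) ** k, (1 / 2) ** k
rho = math.log(1 / p1) / math.log(1 / p2)
print(rho)   # ~0.415, i.e. the 0.42 quoted above
```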
37
Complexity of our scheme
Time: rotations are expensive! O(d^2) time.
Idea: pseudo-random rotations. Compute Rp where R is a "fast rotation" based on the Hadamard transform: O(d · log d) time! Used in [Ailon-Chazelle'06, Dasgupta-Kumar-Sarlos'11, Le-Sarlos-Smola'13, ...].
Less space: can adapt Multi-probe LSH [Lv-Josephson-Wang-Charikar-Li'07].
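A sketch of such a fast pseudo-random rotation (my own; the exact HD-style construction and the number of rounds used in practice may differ):

```python
import numpy as np

def fwht(x):
    """Fast Walsh-Hadamard transform in O(d log d); len(x) must be a power of 2."""
    x = x.copy()
    h = 1
    while h < len(x):
        for i in range(0, len(x), 2 * h):
            a = x[i:i + h].copy()
            b = x[i + h:i + 2 * h].copy()
            x[i:i + h] = a + b
            x[i + h:i + 2 * h] = a - b
        h *= 2
    return x / np.sqrt(len(x))

def pseudo_random_rotation(d, rounds=3, seed=0):
    """A few rounds of (random sign flips D, then Hadamard transform H), i.e.
    x -> H D ... H D x: an orthogonal map computable in O(d log d) time instead
    of the O(d^2) of a dense random rotation."""
    rng = np.random.default_rng(seed)
    signs = rng.choice([-1.0, 1.0], size=(rounds, d))
    def rotate(p):
        x = np.asarray(p, dtype=float)
        for D in signs:
            x = fwht(D * x)
        return x
    return rotate
```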
38
Code & Experiments
FALCONN package: http://falconn-lib.org
ANN_SIFT1M data (n = 1M, d = 128):
  Linear scan: 38ms
  Vanilla: Hyperplane 3.7ms, Cross-polytope 3.1ms
  After some clustering and re-centering (data-dependent): Hyperplane 2.75ms, Cross-polytope 1.7ms (1.6x speedup)
Pubmed data (n = 8.2M, d = 140,000), using feature hashing:
  Linear scan: 3.6s
  Hyperplane: 857ms
  Cross-polytope: 213ms (4x speedup)
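For orientation, a minimal usage sketch of the FALCONN Python bindings (illustrative parameters, not the experimental setup above; the call names follow the bindings as I recall them, so check the package documentation for the current API):

```python
import numpy as np
import falconn   # Python bindings of the FALCONN package linked above

data = np.random.randn(10000, 128).astype(np.float32)
data /= np.linalg.norm(data, axis=1, keepdims=True)       # work on the unit sphere

params = falconn.get_default_parameters(data.shape[0], data.shape[1])
index = falconn.LSHIndex(params)
index.setup(data)                                          # build the hash tables

query_object = index.construct_query_object()
q = data[0] + 0.05 * np.random.randn(128).astype(np.float32)
q = (q / np.linalg.norm(q)).astype(np.float32)
print(query_object.find_nearest_neighbor(q))               # index of the nearest point
```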
39
Towards (provable) practicality 2
Two components:
  nice geometric structure: the spherical case
  reduction to such structure: a partial step [A.-Razenshteyn-Shekel-Nosatzki'17]
40
LSH Forest++
[A.-Razenshteyn-Shekel-Nosatzki'17], Hamming space.
LSH Forest [Bawa-Condie-Ganesan'05]: n^ρ random trees of height d; in each node sample a random coordinate; within a tree, split until the buckets have O(1) size.
++: additionally store the t points closest to the mean of P; each query is also compared against these t points.
Theorem: obtains ρ ≈ 0.72/c for t, d large enough, i.e., the performance of [Indyk-Motwani'98] (space n^{1+ρ}, exponent ρ = 1/c) on random data.
[Figure: a tree splitting on sampled coordinates, e.g. coordinate 3, then 2, then 5, then 7]
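A toy sketch of one such tree over 0/1 vectors, plus the "++" modification (my own rendering; the n^ρ trees, the height-d analysis, and the choice of t are omitted):

```python
import random
import numpy as np

def build_forest_tree(P, ids, bucket_size=4, rng=None):
    """One LSH Forest tree over Hamming vectors: at every node pick a random
    coordinate and split on it, until buckets have O(1) size."""
    rng = rng or random.Random(0)
    if len(ids) <= bucket_size:
        return ("leaf", ids)
    coord = rng.randrange(P.shape[1])
    left = [i for i in ids if P[i, coord] == 0]
    right = [i for i in ids if P[i, coord] == 1]
    if not left or not right:            # degenerate split: stop here
        return ("leaf", ids)
    return ("node", coord,
            build_forest_tree(P, left, bucket_size, rng),
            build_forest_tree(P, right, bucket_size, rng))

def tree_lookup(tree, q):
    """Follow the query down the tree and return its bucket of candidate ids."""
    while tree[0] == "node":
        _, coord, left, right = tree
        tree = left if q[coord] == 0 else right
    return tree[1]

def plus_plus_candidates(P, t):
    """The '++' tweak: also keep the t points closest to the mean of P and
    compare every query against them."""
    mean = P.mean(axis=0)
    return np.argsort(np.linalg.norm(P - mean, axis=1))[:t]
```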
41
Follow-ups
Dynamic data structure? Dynamization techniques [Overmars-van Leeuwen'81].
Better bounds?
  For dimension d = O(log n), one can get a better ρ! [Becker-Ducas-Gama-Laarhoven'16]
  For dimension d > log^{1+δ} n: our ρ is optimal even for data-dependent hashing! [A.-Razenshteyn'16], in the right formalization (to rule out the Voronoi diagram).
Time-space trade-offs: n^{ρ_q} query time and n^{1+ρ_s} space [A.-Laarhoven-Razenshteyn-Waingarten'17]; simpler algorithm/analysis and an optimal* bound [ALRW'17, Christiani'17]:
  c^2 √ρ_q + (c^2 - 1) √ρ_s ≤ √(2c^2 - 1)
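To read the trade-off, here is a small worked computation (my arithmetic based on the stated bound): fixing c and a space exponent ρ_s, the smallest permissible query exponent ρ_q follows by solving the inequality for √ρ_q.

```python
import math

def best_query_exponent(c, rho_s):
    """Smallest rho_q allowed by c^2*sqrt(rho_q) + (c^2 - 1)*sqrt(rho_s) <= sqrt(2c^2 - 1)."""
    slack = math.sqrt(2 * c * c - 1) - (c * c - 1) * math.sqrt(rho_s)
    return max(slack, 0.0) ** 2 / c ** 4

# c = 2: the balanced point is rho_q = rho_s = 1/7 (matching [AR'15]); pushing
# the space down to near-linear (rho_s -> 0) raises the query exponent to
# (2c^2 - 1)/c^4 = 7/16.
print(best_query_exponent(2, 1 / 7), best_query_exponent(2, 0.0))
```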
42
Finale: Data-dependent hashing
A reduction from "worst case" to "random case".
Instance-optimality (beyond worst case)?
Other applications? E.g., closest pair can be solved much faster in the random case [Valiant'12, Karppa-Kaski-Kohonen'16].
A tool to understand NNS for harder metrics? The classic hard distance: ℓ_∞
  no non-trivial sketch/random hash [BJKS'04, AN'12]
  but [Indyk'98] gets efficient NNS with approximation O(log log d), tight in the relevant models [ACP'08, PTW'10, KP'12], and it can be seen as data-dependent hashing!
Algorithms with worst-case guarantees, but great performance if the dataset is "nice"?
NNS for any norm (e.g., matrix norms, EMD) or metric (edit distance)?