Data-Dependent Hashing for Nearest Neighbor Search
Alex Andoni (Columbia University) Ilya Razenshteyn (MIT)
Nearest Neighbor Search (NNS)

Preprocess: a set $P$ of points. Query: given a query point $q$, report a point $p^* \in P$ with the smallest distance to $q$.
Motivation

Generic setup:
- Points model objects (e.g. images)
- Distance models a (dis)similarity measure

Applications:
- machine learning: the k-NN rule
- speeding up deep learning

Distances: Hamming, Euclidean, $\ell_\infty$, edit distance, earthmover distance, etc.

Core primitive: closest pair, clustering, etc.
Curse of Dimensionality

All exact algorithms degrade rapidly with the dimension $d$:

Algorithm                 | Query time              | Space
Full indexing             | $d^{O(1)} \cdot \log n$ | $n^{O(d)}$ (Voronoi diagram size)
No indexing (linear scan) | $O(n \cdot d)$          | $O(n \cdot d)$
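As a reference point, the linear scan is trivial to implement; a minimal sketch, assuming the dataset is a NumPy array of points:

```python
import numpy as np

def linear_scan_nn(P, q):
    """Exact NNS by brute force: O(n * d) work per query."""
    dists = np.linalg.norm(P - q, axis=1)  # distance from q to every point
    return int(np.argmin(dists))           # index of the nearest neighbor

# Usage: n = 1000 points in d = 128 dimensions.
P = np.random.randn(1000, 128)
q = np.random.randn(128)
print(linear_scan_nn(P, q))
```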
Approximate NNS

$c$-approximate $r$-near neighbor: given a query point $q$, report a point $p' \in P$ s.t. $\|p' - q\| \le cr$, as long as there is some point within distance $r$ of $q$.

Practice: use for exact NNS. Filtering: gives a set of candidates (hopefully small).
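A brute-force reference implementation of this guarantee (a sketch that spells out the contract, not the actual data structure):

```python
import numpy as np

def cr_near_neighbor(P, q, r, c):
    """(c, r)-near neighbor: if some point lies within r of q, we must
    return a point within c*r of q; otherwise returning None is allowed."""
    dists = np.linalg.norm(P - q, axis=1)
    if dists.min() <= r:
        # Any point within c*r is a valid answer; here we take the closest.
        return int(np.argmin(dists))
    return None
```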
NNS algorithms

Exponential dependence on dimension: [Arya-Mount'93], [Clarkson'94], [Arya-Mount-Netanyahu-Silverman-Wu'98], [Kleinberg'97], [Har-Peled'02], [Arya-Fonseca-Mount'11], …

Linear/poly dependence on dimension: [Kushilevitz-Ostrovsky-Rabani'98], [Indyk-Motwani'98], [Indyk'98, '01], [Gionis-Indyk-Motwani'99], [Charikar'02], [Datar-Immorlica-Indyk-Mirrokni'04], [Chakrabarti-Regev'04], [Panigrahy'06], [Ailon-Chazelle'06], [A.-Indyk'06], [A.-Indyk-Nguyen-Razenshteyn'14], [A.-Razenshteyn'15], [Pagh'16], [Becker-Ducas-Gama-Laarhoven'16], …
Locality-Sensitive Hashing [Indyk-Motwani'98]

Random hash function $h$ on $\mathbb{R}^d$ satisfying:
- for a close pair (when $\|p - q\| \le r$): $\Pr[h(p) = h(q)] = P_1$ is "not-so-small"
- for a far pair (when $\|p' - q\| > cr$): $\Pr[h(p') = h(q)] = P_2$ is "small"

Use several hash tables: query time $\approx n^\rho$, where $\rho = \frac{\log 1/P_1}{\log 1/P_2}$.
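For intuition, a minimal sketch of LSH for Hamming space via bit sampling (the [IM'98] family); the parameters k (bits per table) and L (number of tables) are illustrative knobs:

```python
import random
from collections import defaultdict

class BitSamplingLSH:
    """LSH for Hamming space: each table hashes a point to k randomly
    sampled coordinates; close points collide in some table w.h.p."""
    def __init__(self, dim, k, L):
        self.projections = [random.sample(range(dim), k) for _ in range(L)]
        self.tables = [defaultdict(list) for _ in range(L)]

    def _key(self, point, j):
        return tuple(point[i] for i in self.projections[j])

    def insert(self, idx, point):
        for j in range(len(self.tables)):
            self.tables[j][self._key(point, j)].append(idx)

    def query(self, point):
        candidates = set()
        for j in range(len(self.tables)):
            candidates.update(self.tables[j].get(self._key(point, j), []))
        return candidates  # verify candidates by exact distance afterwards
```

A close pair collides in one table with probability $P_1^k$, so taking $L \approx n^\rho$ tables (with $k$ chosen so that far pairs rarely collide) makes the overall success probability constant.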
LSH Algorithms

Space $n^{1+\rho}$ and query time $n^\rho$:

Metric          | Exponent $\rho$      | $\rho$ at $c=2$ | Reference
Hamming space   | $\rho = 1/c$         | $\rho = 1/2$    | [IM'98] (sample random bits!)
                | $\rho \ge 1/c$       |                 | [MNP'06, OWZ'11]
Euclidean space | $\rho = 1/c$         | $\rho = 1/2$    | [IM'98, DIIM'04]
                | $\rho \approx 1/c^2$ | $\rho = 1/4$    | [AI'06] (other ways to use LSH)
                | $\rho \ge 1/c^2$     |                 | [MNP'06, OWZ'11]
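To see where $\rho = 1/c$ comes from for bit sampling, one can evaluate $\rho = \frac{\log 1/P_1}{\log 1/P_2}$ with $P_1 = 1 - r/d$ and $P_2 = 1 - cr/d$; a sketch with illustrative numbers:

```python
import math

d, r, c = 1000, 10, 2.0   # illustrative dimension, radius, and approximation factor
P1 = 1 - r / d            # collision prob. of one sampled bit for a close pair
P2 = 1 - c * r / d        # collision prob. of one sampled bit for a far pair
rho = math.log(1 / P1) / math.log(1 / P2)
print(rho, 1 / c)         # rho ~= 0.4975, i.e. ~ 1/c when r << d
```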
LSH is tight… what's next?

- Lower bounds (cell probe): [A.-Indyk-Patrascu'06, Panigrahy-Talwar-Wieder'08,'10, Kapralov-Panigrahy'12]
- Datasets with additional structure: [Clarkson'99, Karger-Ruhl'02, Krauthgamer-Lee'04, Beygelzimer-Kakade-Langford'06, Indyk-Naor'07, Dasgupta-Sinha'13, Abdullah-A.-Krauthgamer-Kannan'14, …]
- Space-time trade-offs: [Panigrahy'06, A.-Indyk'06, Kapralov'15]

But are we really done with basic NNS algorithms?
Beyond Locality Sensitive Hashing

Space $n^{1+\rho}$ and query time $n^\rho$:

Hamming space:
- LSH: $\rho = 1/c$ ($\rho = 1/2$ at $c=2$) [IM'98]; lower bound $\rho \ge 1/c$ [MNP'06, OWZ'11]
- Beyond LSH: complicated, $\rho = 1/2 - \epsilon$ [AINR'14]; $\rho \approx \frac{1}{2c-1}$ ($\rho = 1/3$) [AR'15]

Euclidean space:
- LSH: $\rho \approx 1/c^2$ ($\rho = 1/4$) [AI'06]; lower bound $\rho \ge 1/c^2$ [MNP'06, OWZ'11]
- Beyond LSH: complicated, $\rho = 1/4 - \epsilon$ [AINR'14]; $\rho \approx \frac{1}{2c^2-1}$ ($\rho = 1/7$) [AR'15]
Data-dependent hashing

New approach: a random hash function, chosen after seeing the given dataset, that is still efficiently computable.

Example: the Voronoi diagram is such a partition, but its hash function is expensive to compute.
Construction of hash function

Two components:
- Pseudo-random structure: admits a better LSH
- Reduction to such structure: the data-dependent part

Like a (weak) "regularity lemma" for a set of points.
Pseudo-random structure

Think: a random dataset on a sphere of radius $cr/\sqrt{2}$, i.e., vectors nearly perpendicular to each other, so that random points are at distance $\approx cr$.

Lemma: $\rho = \frac{1}{2c^2-1}$ via "cap carving". Curvature helps.
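A minimal sketch of cap carving for points on a sphere (assuming unit-norm data; the threshold tau is an illustrative parameter, not the exact setting from [AR'15]):

```python
import numpy as np

def cap_carve(points, tau, rng):
    """Partition points on the unit sphere into 'caps': repeatedly draw a
    random Gaussian direction g and carve off every remaining point p with
    <g, p> >= tau. Close points tend to fall into the same cap, while far
    (near-orthogonal) points tend to be separated: curvature helps."""
    parts = []
    remaining = set(range(len(points)))
    while remaining:
        g = rng.standard_normal(points.shape[1])
        cap = {i for i in remaining if points[i] @ g >= tau}
        if cap:
            parts.append(sorted(cap))
            remaining -= cap
    return parts

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 64))
X /= np.linalg.norm(X, axis=1, keepdims=True)  # put the points on the unit sphere
print(len(cap_carve(X, tau=1.0, rng=rng)))     # number of caps in the partition
```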
Reduction to nice structure

Idea: iteratively decrease the radius of the minimum enclosing ball.

Algorithm (see the sketch below):
- find dense clusters: smaller radius, containing a large fraction of points
- recurse on dense clusters (why ok? the radius drops, e.g. from $100cr$ to $99cr$)
- apply cap carving on the rest (why ok? no dense clusters left, so it is like a "random dataset" with radius $= 100cr$: even better!)
- recurse on each "cap" (e.g., dense clusters might reappear)

*picture not to scale & dimension
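A high-level sketch of the reduction, assuming hypothetical helpers find_dense_cluster (a ball of slightly smaller radius containing a large fraction of points, or None) and cap_carve_parts (the cap-carving partition from above); the constants are illustrative, not the ones from [AR'15]:

```python
def build_hash_tree(points, radius):
    """Recursive partition: peel off dense clusters (slightly smaller
    enclosing ball), then cap-carve the pseudo-random remainder, and
    recurse on every part with the appropriate radius."""
    if len(points) <= 1:
        return ("leaf", points)
    clusters, caps = [], []
    # 1. Dense clusters: a large fraction of points in a ball of radius 0.99 * radius.
    cluster = find_dense_cluster(points, 0.99 * radius)   # hypothetical helper
    while cluster is not None:
        points = [p for p in points if p not in cluster]
        clusters.append(build_hash_tree(cluster, 0.99 * radius))  # radius shrinks: progress
        cluster = find_dense_cluster(points, 0.99 * radius)
    # 2. No dense clusters left: the remainder looks "random", so cap carving applies.
    for cap in cap_carve_parts(points):                   # hypothetical helper
        caps.append(build_hash_tree(cap, radius))         # dense clusters may reappear here
    return ("node", clusters, caps)
```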
Hash function

Described by a tree (like a hash table), starting at radius $= 100cr$.

*picture not to scale & dimension
Practice

"My dataset is not worst-case." Adapt the partition to your given dataset:
- kd-trees, PCA-tree, spectral hashing, etc. [S91, McN01, VKD09, WTF08, …]: no guarantees (performance or correctness)

Theory: assume special structure on the dataset:
- low intrinsic dimension [KR'02, KL'04, BKL'06, IN'07, DS'13, …]
- structure + noise [AAKK'14]

Data-dependent hashing helps even when there is no a priori structure!
Follow-ups

Dynamicity? Dynamization techniques [Overmars-van Leeuwen'81].

Better bounds?
- For dimension $d = O(\log n)$, one can get a better $\rho$! [Becker-Ducas-Gama-Laarhoven'16], like it is the case for closest pair [Alman-Williams'15].
- For dimension $d > \log^{1+\delta} n$, our $\rho$ is optimal even for data-dependent hashing! [A.-Razenshteyn'16], in the right formalization (to rule out the Voronoi diagram): description complexity of the hash function is $n^{1-\Omega(1)}$.

Practical variant: FALCONN [A.-Indyk-Laarhoven-Razenshteyn-Schmidt'15].

Time-space trade-offs: $n^{\rho_q}$ query time with $n^{1+\rho_s}$ space, along the curve $c^2\sqrt{\rho_q} + (c^2-1)\sqrt{\rho_s} \ge \sqrt{2c^2-1}$ [Laarhoven'15]. Optimal* bound! [A.-Laarhoven-Razenshteyn-Waingarten'16, Christiani'16]
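For concreteness, the endpoints of this trade-off curve can be computed directly; a sketch (e.g., at $c = 2$, near-linear space forces $\rho_q = (2c^2-1)/c^4 = 7/16$):

```python
def tradeoff_endpoints(c):
    """Endpoints of the curve c^2*sqrt(rho_q) + (c^2-1)*sqrt(rho_s) = sqrt(2c^2-1)."""
    rho_q_linear_space = (2 * c**2 - 1) / c**4          # set rho_s = 0 (near-linear space)
    rho_s_fast_query = (2 * c**2 - 1) / (c**2 - 1)**2   # set rho_q = 0 (very fast queries)
    return rho_q_linear_space, rho_s_fast_query

print(tradeoff_endpoints(2.0))  # (0.4375, 0.777...): for c = 2, either n^{7/16} query
                                # time with near-linear space, or very fast queries
                                # with n^{1 + 7/9} space
```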
Data-dependent hashing wrap-up

Main algorithmic tool: reduction from worst case to "random case".

- Apply the reduction in other settings? E.g., closest pair can be solved much faster in the random case [Valiant'12, Karppa-Kaski-Kohonen'16].
- Understand data-dependency for "beyond worst case"? Instance-optimal partitions?
- Use data-dependency to understand space partitions for harder metrics? The classic hard distance is $\ell_\infty$: it has no non-trivial sketch, but [Indyk'98] gets efficient NNS with approximation $O(\log\log d)$, which is tight in the relevant models [ACP'08, PTW'10, KP'12] and can be thought of as data-dependent hashing!
- NNS for any norm (e.g., matrix norms, EMD) or metric (edit distance)?