Download presentation
Presentation is loading. Please wait.
1
Lecture 11: Nearest Neighbor Search
Left to the title, a presenter can insert his/her own image pertinent to the presentation.
2
Plan Distinguished Lecture Nearest Neighbor Search Scriber?
Quantum Computing Oct 19, 11:30am, Davis Aud in CEPSR Nearest Neighbor Search Scriber?
3
Sketching π: β π β short bit-strings
given π(π₯) and π(π¦), can distinguish between: Close: ||π₯βπ¦||β€π Far: ||π₯βπ¦||>ππ With high success probability: only πΏ=1/ π 3 failure prob. β 2 , β 1 norm: π( π β2 β
log π) bits π¦ π₯ π π 010110 010101 Is ||π π₯ βπ π¦ ||β€π‘ ? Yes: close, ||π₯βπ¦||β€π No: far, ||π₯βπ¦||> 1+π π
4
Near-linear space and sub-linear query time?
NNS: approaches Sketch π: uses π=π( π β2 β
log π ) bits 1: Linear scan++ Precompute π(π) for πβπ· Given π, compute π π For each πβπ·, estimate distance using π π , π(π) 2: Exhaustive storage++ For each possible πβ 0,1 π compute π΄ π = point πβπ· s.t. ||π π βπ| | 1 <π‘ On query π, output π΄[π π ] Space: 2 π = π π 1/ π 2 Near-linear space and sub-linear query time?
5
Locality Sensitive Hashing
[Indyk-Motwaniβ98] Random hash function β on π
π satisfying: for close pair (when πβπ β€π) Prβ‘[β(π)=β(π)] is βhighβ for far pair (when πβπβ² >ππ) Prβ‘[β(π)=β(πβ²)] is βsmallβ Use several hash tables π πβ² π π π1= βnot-so-smallβ π2= Prβ‘[β(π)=β(π)] π= log 1/ π 1 log 1/ π 2 ππ, where 1 π1 π2 πβπ π ππ
6
LSH for Hamming space Hash function π is usually a concatenation of βprimitiveβ functions: π π = β 1 (π), β 2 (π),β¦, β π (π) Fact 1: π π = π β Example: Hamming space 0,1 π β π = π π , i.e., choose π π‘β bit for a random π π(π) chooses π bits at random Pr β π =β π = 1 β π»ππ π,π π π 1 =1β π π β π βπ/π π 2 =1β ππ π β π βππ/π π= log 1/ π 1 log 1/ π 2 = π/π ππ/π = 1 π Prβ‘[β(π)=β(π)] 1 π1 π2 πβπ π ππ
7
Full Algorithm Data structure is just πΏ= π π hash tables: Query:
Each hash table uses a fresh random function π π π = β π,1 (π),β¦, β π,π (π) Hash all dataset points into the table Query: Check for collisions in each of the hash tables until we encounter a point within distance ππ Guarantees: Space: π ππΏ =π( π 1+π ), plus space to store points Query time: π πΏβ
(π+π) =π( π π β
π) (in expectation) 50% probability of success.
8
Choice of parameters π,πΏ ?
πΏ hash tables with π π = β 1 (π),β¦, β π (π) Pr[collision of far pair] = π 2 π Pr[collision of close pair] = π 1 π Success probability for a hash table: π 1 π πΏ=π 1/ π 1 π tables should suffice Runtime as a function of π 1 , π 2 ? π 1 π 1 π π‘ππππππ»ππ β+π π 2 π Hence πΏ=π( π π ) set π s.t. =1/π = π 2 π π = 1/π π π 1 π 1 2 π 2 π=1 π 2 2 π=2 π ππ
9
Analysis: correctness
Let π β be an π-near neighbor If does not exists, algorithm can output anything Algorithm fails when: near neighbor π β is not in the searched buckets π 1 π , π 2 π , β¦, π πΏ π Probability of failure: Probability π, π β do not collide in a hash table: β€1β π 1 π Probability they do not collide in πΏ hash tables at most 1β π 1 π πΏ = 1β 1 π π π π β€1/π
10
Analysis: Runtime Runtime dominated by: Distance computations:
Hash function evaluation: π(πΏβ
π) time Distance computations to points in buckets Distance computations: Care only about far points, at distance >ππ In one hash table, we have Probability a far point collides is at most π 2 π =1/π Expected number of far points in a bucket: πβ
1 π =1 Over πΏ hash tables, expected number of far points is πΏ Total: π πΏπ +π πΏπ =π( π π log π +π ) in expectation
11
LSH in practice If want exact NNS, what is π?
π safety not guaranteed fewer false positives If want exact NNS, what is π? Can choose any parameters πΏ,π Correct as long as 1β π 1 π πΏ β€0.1 Performance: trade-off between # tables and false positives will depend on dataset βqualityβ fewer tables πΏ
12
LSH Algorithms Hamming space Euclidean space Space Time Exponent π=π
Reference Hamming space π 1+π π π π=1/π π=1/2 [IMβ98] πβ₯1/π [MNPβ06, OWZβ11] Euclidean space π 1+π π π π=1/π π=1/2 [IMβ98, DIIMβ04] Other ways to use LSH πβ1/ π 2 π=1/4 [AIβ06] πβ₯1/ π 2 [MNPβ06, OWZβ11] Table does not include: π(ππ) additive space π(πβ
log π) factor in query time
13
LSH Zoo ( β 1 ) To be or not to be To sketch or not to sketch sketch sketch Hamming distance [IMβ98] β: pick a random coordinate(s) β 1 (Manhattan) distance [AIβ06] β: cell in a randomly shifted grid Jaccard distance between sets: π½ π΄,π΅ = π΄β©π΅ π΄βͺπ΅ β: pick a random permutation π on the universe β π΄ = min πβπ΄ π(π) min-wise hashing [Broβ97] Claim: Pr[collision]=J(A,B) be not not not or to be not or to β¦11101β¦ 1 β¦01111β¦ 1 β¦21102β¦ β¦01122β¦ {be,not,or,to} {not,or,to, sketch} be to π=be,to,sketch,or,not
14
LSH for Euclidean distance
[Datar-Immorlica-Indyk-Mirrokniβ04] LSH function β(π): pick a random line β, and quantize project point into β β π = πβ
β π€ +π β is a random Gaussian vector π random in [0,1] π€ is a parameter (e.g., 4) Claim: π=1/π β π
Similar presentations
© 2025 SlidePlayer.com. Inc.
All rights reserved.