Lecture 11: Nearest Neighbor Search


Plan
- Distinguished Lecture (Quantum Computing): Oct 19, 11:30am, Davis Aud in CEPSR
- Nearest Neighbor Search
- Scriber?

Sketching
- $W: \Re^d \to$ short bit-strings
- Given $W(x)$ and $W(y)$, we can distinguish between:
  - Close: $\|x-y\| \le r$
  - Far: $\|x-y\| > cr$
- With high success probability: only $\delta = 1/n^3$ failure probability
- For the $\ell_2$ and $\ell_1$ norms: $O(\epsilon^{-2} \cdot \log n)$ bits
- Decision rule: is $\|W(x) - W(y)\| \le t$?
  - Yes: close, $\|x-y\| \le r$
  - No: far, $\|x-y\| > (1+\epsilon)r$
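As an illustration, here is a minimal bit-sampling sketch for Hamming space in Python; the function names, the threshold $t$, and the choice of bit sampling are assumptions for the example, not the specific sketch from the lecture:

```python
import random

def make_sketch(d, k, seed=0):
    """W: {0,1}^d -> {0,1}^k, obtained by sampling k random coordinates."""
    rng = random.Random(seed)
    coords = [rng.randrange(d) for _ in range(k)]
    return lambda x: tuple(x[j] for j in coords)

def looks_close(wx, wy, t):
    """Declare 'close' iff the sketches differ in at most t coordinates."""
    return sum(a != b for a, b in zip(wx, wy)) <= t

# Each sampled coordinate of x and y differs with probability Ham(x,y)/d,
# so E[Ham(W(x), W(y))] = k * Ham(x,y)/d, and a threshold t between
# k*r/d and k*c*r/d separates close from far pairs with high probability.
```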

NNS: approaches
The sketch $W$ uses $k = O(\epsilon^{-2} \cdot \log n)$ bits.
- Approach 1: linear scan++
  - Precompute $W(p)$ for every $p \in D$
  - Given $q$, compute $W(q)$
  - For each $p \in D$, estimate the distance using $W(q)$ and $W(p)$
- Approach 2: exhaustive storage++ (a toy implementation follows this list)
  - For each possible $\sigma \in \{0,1\}^k$, compute $A[\sigma] =$ a point $p \in D$ such that $\|W(p) - \sigma\|_1 < t$
  - On query $q$, output $A[W(q)]$
  - Space: $2^k = n^{O(1/\epsilon^2)}$
Can we get near-linear space and sub-linear query time?
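A sketch of the "exhaustive storage++" idea, assuming the bit-sampling $W$ from the previous example; enumerating all $2^k$ sketches is only feasible for tiny $k$, which is exactly the $2^k = n^{O(1/\epsilon^2)}$ space cost noted above:

```python
from itertools import product

def build_table(points, W, k, t):
    """A[sigma] = some p in D with ||W(p) - sigma||_1 < t, for every sigma."""
    sketched = [(p, W(p)) for p in points]
    A = {}
    for sigma in product((0, 1), repeat=k):
        A[sigma] = next((p for p, wp in sketched
                         if sum(a != b for a, b in zip(wp, sigma)) < t), None)
    return A

def query(A, W, q):
    return A[W(q)]   # O(1) lookup after the 2^k-time preprocessing

# usage, with W = make_sketch(d, k) from the sketching example:
#   A = build_table(points, W, k, t); query(A, W, q)
```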

Locality Sensitive Hashing [Indyk-Motwani'98]
A random hash function $h$ on $\Re^d$ satisfying:
- for a close pair (when $\|q-p\| \le r$): $\Pr[h(q)=h(p)]$ is "not-so-small", call it $P_1$
- for a far pair (when $\|q-p'\| > cr$): $\Pr[h(q)=h(p')]$ is "small", call it $P_2$
Use several hash tables: about $n^\rho$ of them, where $\rho = \frac{\log 1/P_1}{\log 1/P_2}$.
(Figure: $\Pr[h(q)=h(p)]$ as a function of $\|q-p\|$, dropping from $P_1$ at distance $r$ to $P_2$ at distance $cr$.)
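As a quick sanity check on the exponent: if $P_1 = 1/2$ and $P_2 = 1/4$, then $\rho = \log 2 / \log 4 = 1/2$, so the scheme needs roughly $\sqrt{n}$ hash tables.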

LSH for Hamming space
- The hash function $g$ is usually a concatenation of "primitive" functions: $g(p) = (h_1(p), h_2(p), \dots, h_k(p))$
- Fact 1: $\rho_g = \rho_h$
- Example: Hamming space $\{0,1\}^d$
  - $h(p) = p_j$, i.e., choose the $j$-th bit for a random $j$; $g(p)$ chooses $k$ bits at random
  - $\Pr[h(p)=h(q)] = 1 - \mathrm{Ham}(p,q)/d$
  - $P_1 = 1 - r/d \approx e^{-r/d}$ and $P_2 = 1 - cr/d \approx e^{-cr/d}$
  - $\rho = \frac{\log 1/P_1}{\log 1/P_2} \approx \frac{r/d}{cr/d} = \frac{1}{c}$
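A minimal Python sketch of the primitive $h$ and the concatenated $g$, with an empirical check of the collision probability (the dimension and points are illustrative):

```python
import random

def ham(p, q):
    return sum(a != b for a, b in zip(p, q))

def h(p, j):
    """Primitive hash: the j-th bit of p."""
    return p[j]

def g(p, coords):
    """Concatenation g(p) = (h_{j_1}(p), ..., h_{j_k}(p))."""
    return tuple(p[j] for j in coords)

# Check Pr_j[h(p) = h(q)] = 1 - Ham(p,q)/d by averaging over all j:
d = 64
p = [random.randint(0, 1) for _ in range(d)]
q = list(p); q[0] ^= 1; q[1] ^= 1                      # Ham(p,q) = 2
agree = sum(h(p, j) == h(q, j) for j in range(d)) / d
print(agree == 1 - ham(p, q) / d)                      # True
```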

Full Algorithm
- The data structure is just $L = n^\rho$ hash tables:
  - Each hash table uses a fresh random function $g_i(p) = (h_{i,1}(p), \dots, h_{i,k}(p))$
  - Hash all dataset points into the table
- Query: check for collisions in each of the hash tables until we encounter a point within distance $cr$
- Guarantees:
  - Space: $O(nL) = O(n^{1+\rho})$, plus the space to store the points
  - Query time: $O(L \cdot (k+d)) = O(n^\rho \cdot d)$ in expectation
  - 50% probability of success
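Putting the pieces together, a compact, unoptimized version of the whole data structure; the class name, bucket layout, and parameter handling are illustrative choices for Hamming space:

```python
import random

def ham(p, q):
    return sum(a != b for a, b in zip(p, q))

class HammingLSH:
    def __init__(self, points, d, k, L, cr, seed=0):
        rng = random.Random(seed)
        self.cr = cr                       # report anything within distance c*r
        # L fresh random functions g_i, each concatenating k sampled bits
        self.gs = [[rng.randrange(d) for _ in range(k)] for _ in range(L)]
        self.tables = []
        for coords in self.gs:
            table = {}
            for p in points:               # hash all dataset points
                table.setdefault(tuple(p[j] for j in coords), []).append(p)
            self.tables.append(table)

    def query(self, q):
        for coords, table in zip(self.gs, self.tables):
            for p in table.get(tuple(q[j] for j in coords), []):
                if ham(p, q) <= self.cr:   # stop at the first cr-near point
                    return p
        return None                        # may fail (50% guarantee per above)
```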

Choice of parameters $k, L$?
- $L$ hash tables with $g(p) = (h_1(p), \dots, h_k(p))$
- Pr[collision of a far pair] $= P_2^k$; Pr[collision of a close pair] $= P_1^k$
- Success probability for one hash table is $P_1^k$, so $L = O(1/P_1^k)$ tables suffice
- Runtime as a function of $P_1, P_2$: $O\left(\frac{1}{P_1^k}\left(\text{timeToHash} + n P_2^k\right)\right)$
- Set $k$ so that $P_2^k = 1/n$; since $P_1 = P_2^\rho$, this gives $P_1^k = (P_2^k)^\rho = (1/n)^\rho$, hence $L = O(n^\rho)$
(Figure: collision probability versus distance for $k=1$ and $k=2$; raising $k$ drives $P_2^k$ down faster than $P_1^k$, sharpening the gap between $r$ and $cr$.)
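The same parameter setting worked numerically in Python; the instance ($n$, $d$, $r$, $c$) is an illustrative assumption:

```python
import math

n, d, r, c = 10**6, 128, 8, 2
P1, P2 = 1 - r / d, 1 - c * r / d                 # bit-sampling LSH
k = math.ceil(math.log(n) / math.log(1 / P2))     # so that P2^k <= 1/n
rho = math.log(1 / P1) / math.log(1 / P2)
L = math.ceil(1 / P1 ** k)                        # roughly n^rho tables
print(k, round(rho, 3), L)                        # 104, 0.483, 823
```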

Analysis: correctness
- Let $p^*$ be an $r$-near neighbor of $q$
  - If none exists, the algorithm may output anything
- The algorithm fails when the near neighbor $p^*$ is in none of the searched buckets $g_1(q), g_2(q), \dots, g_L(q)$
- Probability of failure:
  - Probability that $q$ and $p^*$ do not collide in one hash table: at most $1 - P_1^k$
  - Probability that they do not collide in any of the $L$ hash tables: at most $(1 - P_1^k)^L = (1 - 1/n^\rho)^{n^\rho} \le 1/e$

Analysis: runtime
- The runtime is dominated by:
  - Hash function evaluations: $O(L \cdot k)$ time
  - Distance computations to points in the probed buckets
- Distance computations:
  - We only need to worry about far points, at distance $> cr$
  - In one hash table, the probability that a fixed far point collides with $q$ is at most $P_2^k = 1/n$
  - Expected number of far points in a bucket: $n \cdot \frac{1}{n} = 1$
  - Over $L$ hash tables, the expected number of far points encountered is $L$
- Total: $O(Lk) + O(Ld) = O(n^\rho(\log n + d))$ in expectation

LSH in practice
- If we want exact NNS, what is $c$?
- We can choose any parameters $L, k$: the algorithm is correct as long as $(1 - P_1^k)^L \le 0.1$
- Performance is a trade-off between the number of tables and the number of false positives: increasing $k$ gives fewer false positives per table, but the per-table success probability is no longer guaranteed, so more tables $L$ are needed; fewer tables mean less space but more false positives
- The right trade-off depends on the dataset's "quality"

LSH Algorithms
Both spaces below use space $n^{1+\rho}$ and query time $n^\rho$.
- Hamming space:
  - $\rho = 1/c$ ($\rho = 1/2$ at $c=2$) [IM'98]
  - Lower bound: $\rho \ge 1/c$ [MNP'06, OWZ'11]
- Euclidean space:
  - $\rho = 1/c$ ($\rho = 1/2$ at $c=2$) [IM'98, DIIM'04]
  - Other ways to use LSH: $\rho \approx 1/c^2$ ($\rho = 1/4$ at $c=2$) [AI'06]
  - Lower bound: $\rho \ge 1/c^2$ [MNP'06, OWZ'11]
The table does not include the $O(nd)$ additive space or the $O(d \cdot \log n)$ factor in the query time.

LSH Zoo ($\ell_1$)
- Hamming distance [IM'98]: $h$ picks a random coordinate (or several)
- $\ell_1$ (Manhattan) distance [AI'06]: $h$ maps a point to its cell in a randomly shifted grid
- Jaccard distance between sets, $J(A,B) = \frac{|A \cap B|}{|A \cup B|}$: min-wise hashing [Bro'97]
  - $h$: pick a random permutation $\pi$ on the universe and set $h(A) = \min_{a \in A} \pi(a)$
  - Claim: $\Pr[\text{collision}] = J(A,B)$
  - Example: "to be or not to be" vs. "to sketch or not to sketch" give $A = \{\text{be, not, or, to}\}$ and $B = \{\text{not, or, to, sketch}\}$; under $\pi = (\text{be, to, sketch, or, not})$, $h(A) = \text{be}$ and $h(B) = \text{to}$ (no collision), and here $J(A,B) = 3/5$
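A minimal MinHash sketch in Python using a random permutation of the universe, with the slide's two example sets (the variable names are illustrative):

```python
import random

def minhash(A, pi):
    """h(A) = min_{a in A} pi(a) for a permutation pi of the universe."""
    return min(pi[a] for a in A)

universe = ["be", "not", "or", "to", "sketch"]
ranks = random.sample(range(len(universe)), len(universe))
pi = dict(zip(universe, ranks))          # a uniformly random permutation

A = {"be", "not", "or", "to"}            # "to be or not to be"
B = {"not", "or", "to", "sketch"}        # "to sketch or not to sketch"

# Over the random choice of pi, Pr[minhash(A) == minhash(B)] = J(A,B) = 3/5
print(minhash(A, pi) == minhash(B, pi))
```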

LSH for Euclidean distance [Datar-Immorlica-Indyk-Mirrokni'04]
- LSH function $h(p)$: pick a random line $\ell$, project the point onto $\ell$, and quantize: $h(p) = \lfloor p \cdot \ell / w + b \rfloor$
- $\ell$ is a random Gaussian vector
- $b$ is random in $[0,1]$
- $w$ is a parameter (e.g., 4)
- Claim: $\rho = 1/c$
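A minimal Python sketch of one such hash function ($w = 4$ as the slide suggests; NumPy and the seed handling are implementation choices):

```python
import numpy as np

def make_euclidean_h(d, w=4.0, seed=0):
    """h(p) = floor(p . ell / w + b): project onto a random Gaussian
    direction ell, shift by b uniform in [0,1), quantize into width-w cells."""
    rng = np.random.default_rng(seed)
    ell = rng.standard_normal(d)       # random Gaussian vector
    b = rng.uniform(0.0, 1.0)          # random offset
    return lambda p: int(np.floor(np.dot(p, ell) / w + b))

h = make_euclidean_h(d=64)
# Nearby points land in the same cell with higher probability; concatenating
# k such h's and using L tables gives rho = 1/c, as claimed on the slide.
```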