
Given by: Erez Eyal and Uri Klein

Lecture Outline: Exact Nearest Neighbor search — definition; low dimensions; KD-trees. Approximate Nearest Neighbor search (LSH based) — Locality Sensitive Hashing families; algorithm for the Hamming cube; algorithm for Euclidean space. Summary.

Nearest Neighbor Search in Springfield?

Nearest “Neighbor” Search for Homer Simpson — example features: home planet distance, height, weight, color.

Nearest Neighbor (NN) Search. Given: a set P of n points in R^d (d = dimension). Goal: a data structure which, given a query point q, finds the nearest neighbor p of q in P (in terms of some distance function D).
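For concreteness, here is a brute-force baseline in Python (not part of the original slides): it answers a query in O(dn) time by scanning all of P, which is what the data structures below aim to beat. A minimal sketch, assuming Euclidean distance for D:

```python
import math

def nearest_neighbor(P, q):
    """Brute-force NN: O(d*n) per query. P is a list of d-dimensional tuples."""
    def dist(p, q):  # Euclidean distance D
        return math.sqrt(sum((pi - qi) ** 2 for pi, qi in zip(p, q)))
    return min(P, key=lambda p: dist(p, q))

# Example: nearest_neighbor([(0, 0), (3, 4), (1, 1)], (2, 2)) -> (1, 1)
```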

Nearest Neighbor Search. We are interested in designing a data structure with the following objectives — Space: O(dn); Query time: O(d log(n)). Data structure construction time is not important.

Lecture Outline — next section: Low dimensions.

Simple cases: 1-D (d = 1). A binary search over the sorted points gives the solution. Space: O(n); Time: O(log(n)).
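A minimal sketch of the 1-D case using Python's standard bisect module; after sorting, each query is a binary search plus a check of the two bracketing points:

```python
import bisect

def nn_1d(sorted_pts, q):
    """1-D nearest neighbor via binary search on a sorted list."""
    i = bisect.bisect_left(sorted_pts, q)
    # The nearest neighbor is one of the two points bracketing q.
    candidates = sorted_pts[max(0, i - 1):i + 1]
    return min(candidates, key=lambda p: abs(p - q))
```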

Simple cases: 2-D (d = 2). Using Voronoi diagrams (with a point-location structure) gives the solution. Space: O(n^2); Time: O(log(n)).

Lecture Outline — next section: KD-Trees.

KD-Trees. A KD-tree is a data structure based on recursively subdividing a set of points with alternating axis-aligned hyperplanes. The classical KD-tree uses O(dn) space and answers queries in time logarithmic in n (worst case O(n)), but exponential in d.

KD-Trees Construction. [Figure: the plane is recursively split by alternating axis-aligned lines l1–l10; each split becomes an internal node of the resulting binary tree.]

KD-Trees Query. [Figure: the query point q is located in the subdivision by descending the tree built from lines l1–l10.]

KD-Trees Algorithms
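The construction and query pseudocode from this slide did not survive the transcript; the following minimal Python sketch shows both, splitting on axis = depth mod d and pruning a subtree only when the splitting hyperplane is farther away than the best point found so far:

```python
def build_kdtree(points, depth=0):
    """Recursively split points by the median along axis = depth mod d."""
    if not points:
        return None
    d = len(points[0])
    axis = depth % d
    points = sorted(points, key=lambda p: p[axis])
    mid = len(points) // 2
    return {"point": points[mid], "axis": axis,
            "left": build_kdtree(points[:mid], depth + 1),
            "right": build_kdtree(points[mid + 1:], depth + 1)}

def kdtree_nn(node, q, best=None):
    """Return the point nearest to q; worst case still visits all n nodes."""
    if node is None:
        return best
    dist2 = lambda p: sum((pi - qi) ** 2 for pi, qi in zip(p, q))
    if best is None or dist2(node["point"]) < dist2(best):
        best = node["point"]
    axis = node["axis"]
    diff = q[axis] - node["point"][axis]
    near, far = (node["left"], node["right"]) if diff < 0 else (node["right"], node["left"])
    best = kdtree_nn(near, q, best)
    if diff * diff < dist2(best):       # hyperplane closer than current best:
        best = kdtree_nn(far, q, best)  # the far side may still hold the NN
    return best
```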

Lecture Outline Exact Nearest Neighbor search Exact Nearest Neighbor search Definition Definition Low dimensions Low dimensions KD-Trees KD-Trees Approximate Nearest Neighbor search (LSH based) Approximate Nearest Neighbor search (LSH based) Locality Sensitive Hashing families Algorithm for Hamming Cube Algorithm for Hamming Cube Algorithm for Euclidean space Algorithm for Euclidean space Summary Summary

A conjecture: “The curse of dimensionality”. For an exact solution, any algorithm for high dimensions must use either n^ω(1) space or d^ω(1) query time. “However, to the best of our knowledge, lower bounds for exact NN Search in high dimensions do not seem sufficiently convincing to justify the curse of dimensionality conjecture” (Borodin et al. ’99).

Why Approximate NN? Approximation allows a significant speedup of the calculation (on the order of 10s to 100s). Fixed-precision arithmetic on computers causes approximation anyway. Heuristics used for mapping features to numerical values introduce uncertainty anyway.

Approximate Nearest Neighbor (ANN) Search. Given: a set P of n points in R^d (d = dimension) and a slackness parameter ε > 0. Goal: a data structure which, given a query point q whose nearest neighbor in P is a, finds any p s.t. D(q, p) ≤ (1+ε)·D(q, a).

Locality Sensitive Hashing. A (r1, r2, P1, P2)-Locality Sensitive Hashing (LSH) family is a family H of hash functions s.t. for a random hash function h and for any pair of points a, b we have: D(a, b) ≤ r1 ⇒ Pr[h(a)=h(b)] ≥ P1, and D(a, b) ≥ r2 ⇒ Pr[h(a)=h(b)] ≤ P2 (with r1 < r2 and P1 > P2). [Indyk-Motwani ’98] (A common method to reduce dimensionality without losing distance information.)

Hamming Cube. A d-dimensional Hamming cube Q_d is the set {0, 1}^d. For any a, b ∈ Q_d we define the Hamming distance H(a, b) = |{i : a_i ≠ b_i}|, the number of coordinates on which a and b differ.

LSH – Example in the Hamming Cube. H = {h | h(a) = a_i, i ∈ {1, …, d}}. Pr[h(q)=h(a)] = 1 − H(q, a)/d, a monotonically decreasing function of H(q, a). Multi-index hashing: G = {g | g(a) = (h_1(a), h_2(a), …, h_k(a))}. Pr[g(q)=g(a)] = (1 − H(q, a)/d)^k, a monotonically decreasing function of k.
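A small Python sketch of this bit-sampling family (function names are illustrative): h samples a single random coordinate, and g concatenates k independent samples, giving exactly the collision probabilities stated above.

```python
import random

def make_h(d):
    """Bit-sampling LSH: h(a) = a_i for a random coordinate i."""
    i = random.randrange(d)
    return lambda a: a[i]

def make_g(d, k):
    """Multi-index hash: g(a) = (h_1(a), ..., h_k(a)) for k independent h's."""
    hs = [make_h(d) for _ in range(k)]
    return lambda a: tuple(h(a) for h in hs)

# For points at Hamming distance H(q, a) = r:
#   Pr[h(q) = h(a)] = 1 - r/d   and   Pr[g(q) = g(a)] = (1 - r/d)**k
```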

Lecture Outline — next section: Algorithm for the Hamming cube.

LSH – ANN Search, Basic Scheme. Preprocess: construct several such ‘g’ functions for each l ∈ {1, …, d}, and store each a ∈ P at position g_i(a) of the corresponding hash table. Query: perform a binary search on l; in each step retrieve g_i(q) (for the current l, if it exists); return the last non-empty result.
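A simplified sketch of the scheme in Python, under two assumptions of mine: k_for(l) is some schedule choosing how many bits to concatenate at level l, and a non-empty bucket at level l is treated as evidence of a near point at roughly that radius.

```python
import random

def preprocess(P, d, k_for):
    """One multi-index table per level l; g_l samples k_for(l) random bits."""
    tables = {}
    for l in range(1, d + 1):
        idx = [random.randrange(d) for _ in range(k_for(l))]
        g = lambda a, idx=idx: tuple(a[i] for i in idx)
        table = {}
        for a in P:
            table.setdefault(g(a), []).append(a)
        tables[l] = (g, table)
    return tables

def query(tables, q, d):
    """Binary search on l: a non-empty bucket means a near point exists."""
    lo, hi, result = 1, d, None
    while lo <= hi:
        l = (lo + hi) // 2
        g, table = tables[l]
        bucket = table.get(g(q))
        if bucket:
            result, hi = bucket[0], l - 1  # found at radius ~l; try smaller
        else:
            lo = l + 1                     # bucket empty; try larger radius
    return result
```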

ANN Search in the Hamming Cube. A β-test σ: pick a subset C of {1, 2, …, d} by including each index independently, at random, w.p. β; for each i ∈ C, pick independently and uniformly r_i ∈ {0, 1}; for any a ∈ Q_d, σ(a) = (Σ_{i∈C} r_i·a_i) mod 2. (Equivalently, we may pick R ∈ {0, 1}^d s.t. R_i is 1 w.p. β/2, and the test is the inner product of R and a, mod 2. Such an R represents a β-test σ.) [Kushilevitz et al. ’98]
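A sketch of one β-test in Python; the test's formula was lost in the transcript, so it is taken here as the parity Σ_{i∈C} r_i·a_i mod 2, following the equivalent inner-product description:

```python
import random

def make_beta_test(d, beta):
    """One beta-test: sigma(a) = (sum over chosen i of r_i * a_i) mod 2."""
    C = [i for i in range(d) if random.random() < beta]   # i in C w.p. beta
    r = {i: random.randrange(2) for i in C}               # uniform r_i in {0,1}
    return lambda a: sum(r[i] * a[i] for i in C) % 2
```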

ANN Search in the Hamming Cube. Define: Δ(a, b) = Pr[σ(a) ≠ σ(b)]. For a query q, let H(a, q) ≤ l and H(b, q) > l(1+ε). Then for β = 1/(2l) there are thresholds δ1 < δ2 with Δ(a, q) ≤ δ1 and δ2 < Δ(b, q), where the gap satisfies δ2 − δ1 = Ω(1 − e^{−ε/2}).

ANN Search in the Hamming Cube. Data structure: S = {S_1, …, S_d} and positive integers M, T. For any l ∈ {1, …, d}, S_l = {τ_1, …, τ_M}. For any j ∈ {1, …, M}, τ_j consists of a set {t_1, …, t_T} of tests (each t_k is a (1/(2l))-test) and a table A_j of 2^T entries.

ANN Search in the Hamming Cube. In each S_l, construct τ_j as follows: pick {t_1, …, t_T} independently at random. For v ∈ Q_d, the trace is t(v) = (t_1(v), …, t_T(v)) ∈ {0, 1}^T. An entry z ∈ {0, 1}^T in A_j contains a point a ∈ P if H(t(a), z) ≤ (δ1 + (1/3)(δ2 − δ1))·T (else it is empty). The space complexity: [formula not preserved in the transcript].

ANN Search in the Hamming Cube. For any query q and a, b ∈ P s.t. H(q, a) ≤ l and H(q, b) > (1+ε)l, it can be proven using Chernoff bounds [Alon & Spencer ’92] that the traces separate a from b with high probability: [inequalities not preserved in the transcript]. This gives the result that the trace t functions, in its essence, as an LSH family. (When the event presented in these inequalities occurs for some τ_j in S_l, τ_j is said to ‘fail’.)

ANN Search in the Hamming Cube. Search algorithm: we perform a binary search on l. In every step: pick τ_j in S_l uniformly at random; compute t(q) from the list of tests in τ_j; check the entry labeled t(q) in A_j. If the entry contains a point from P, restrict the search to lower l’s; otherwise restrict the search to greater l’s. Return the last non-empty entry found in the search.

ANN Search in the Hamming Cube. Search algorithm, example run (flowchart): initialize l = d/2; access S_l; choose τ_j; calculate t(q); if A_j(t(q)) is non-empty, set Res ← A_j(t(q)) and continue in the lower half of l values, otherwise continue in the upper half; stop once the whole range of l has been covered.

ANN Search in the Hamming Cube. The construction of S is said to ‘fail’ if for some l more than M/log(d) of the structures τ_j in S_l ‘fail’. Choosing M and T appropriately (for some δ): [formula not preserved in the transcript], S’s construction fails w.p. at most δ; and if S does not fail, then for every query the search algorithm fails to find an ANN w.p. at most δ.

ANN Search in the Hamming Cube. Query time complexity: [formula not preserved in the transcript]. Space complexity: [formula not preserved in the transcript]. Both complexities are also proportional to ε^{-2}.

Lecture Outline — next section: Algorithm for Euclidean space.

Euclidean Space. The d-dimensional Euclidean space l_i^d is R^d endowed with the L_i distance: for any a, b ∈ R^d, L_i(a, b) = (Σ_{j=1}^d |a_j − b_j|^i)^{1/i}. The algorithm presented deals with l_2^d, and with l_1^d under minor changes.
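In code, the L_i distance is simply:

```python
def l_dist(a, b, i):
    """L_i distance on R^d: (sum_j |a_j - b_j|**i) ** (1/i)."""
    return sum(abs(x - y) ** i for x, y in zip(a, b)) ** (1.0 / i)

# l_dist(a, b, 2) is the Euclidean distance; l_dist(a, b, 1) the Manhattan one.
```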

Euclidean Space. Define: B(a, r) is the closed ball around a with radius r, and D(a, r) = P ∩ B(a, r) (a subset of P). [Kushilevitz et al. ’98]

LSH – ANN Search, Extended Scheme. Preprocess: prepare a data structure for each ‘Hamming ball’ induced by any a, b ∈ P. Query: start with some maximal ball; in each step calculate the ANN; stop according to some threshold.

ANN Search in Euclidean Space. For a ∈ P, define a Euclidean-to-Hamming mapping φ: D(a, r) → {0, 1}^{DF}, and a parameter L. Given a set of D i.i.d. unit vectors z_1, …, z_D: for each z_i, the cutting points c_1, …, c_F are equally spaced on the interval of projections [z_i·a − r, z_i·a + r]. Each pair (z_i, c_j) defines a coordinate of the DF-dimensional Hamming cube, on which the projection of any b ∈ D(a, r) is 0 iff z_i·b < c_j.
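A sketch of the mapping φ in Python; the exact spacing interval of the cutting points was lost in the transcript, so the code assumes they are spread uniformly over [z·a − r, z·a + r], the range into which every point of B(a, r) projects:

```python
import math, random

def random_unit_vector(d):
    """Uniform direction: normalize a vector of standard Gaussians."""
    v = [random.gauss(0, 1) for _ in range(d)]
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def make_phi(a, r, d, D, F):
    """Map points of B(a, r) to the DF-dimensional Hamming cube."""
    zs = [random_unit_vector(d) for _ in range(D)]
    dot = lambda u, v: sum(ui * vi for ui, vi in zip(u, v))
    # F cutting points per direction, equally spaced around z.a (assumed).
    cuts = [[dot(z, a) - r + (2 * r) * (j + 1) / (F + 1) for j in range(F)]
            for z in zs]
    def phi(b):
        return tuple(1 if dot(z, b) >= c else 0
                     for z, cs in zip(zs, cuts) for c in cs)
    return phi
```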

ANN Search in Euclidean Space. [Figure: Euclidean-to-Hamming mapping example with d = 3, D = 2 — points a = (a1, a2, a3) and b = (b1, b2, b3) in R^3 are projected onto the unit vectors z1, z2 and thresholded at the F cutting points.]

ANN Search in Euclidean Space. It can be proven that, in expectation, the mapping φ preserves the relative distances between points in P. The mapping gets more accurate as r grows smaller: [bound not preserved in the transcript].

ANN Search in Euclidean Space. Data structure: S = {S_a | a ∈ P} and positive integers D, F, L. For any a ∈ P, S_a consists of: a list of all other elements of P sorted by increasing distance from a, and a structure S_{a,b} for every b ≠ a (b ∈ P).

ANN Search in Euclidean Space. Let r = L_2(a, b); then S_{a,b} consists of: a list of D i.i.d. unit vectors {z_1, …, z_D}; for each unit vector z_i, a list of F cutting points; a Hamming cube data structure of dimension DF containing φ(D(a, r)); and the size of D(a, r).

ANN Search in Euclidean Space. Search algorithm (using a positive integer T): pick a random a_0 ∈ P, let b_0 be the farthest point from a_0, and start from S_{a_0,b_0} (r_0 = L_2(a_0, b_0)). For any S_{a_j,b_j}: query for the ANN of φ(q) in the Hamming cube data structure and get a result φ(a’). If L_2(q, a’) > r_{j−1}/10, return a’. Otherwise, pick T points of D(a_j, r_j) at random, and let a” be the closest to q among them. Let a_{j+1} be the closest to q of {a_j, a’, a”}.

ANN Search in Euclidean Space. Then let b’ ∈ P be the farthest point from a_{j+1} s.t. 2·L_2(a_{j+1}, q) ≥ L_2(a_{j+1}, b’), found using a binary search on the sorted list of S_{a_{j+1}}. If no such b’ exists, return a_{j+1}; otherwise, let b_{j+1} = b’.
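A high-level sketch of the whole search loop in Python; structures, hamming_ann and ball_points are assumed stand-ins for the S_{a,b} machinery above, and the stopping threshold uses the current radius rather than r_{j−1}:

```python
import random

def euclidean_ann(P, q, structures, T):
    """Ball-shrinking ANN search; structures[(a, b)] stands in for S_{a,b}
    and must provide hamming_ann(q) and ball_points."""
    l2 = lambda u, v: sum((x - y) ** 2 for x, y in zip(u, v)) ** 0.5
    a = random.choice(P)
    b = max(P, key=lambda p: l2(a, p))       # farthest point: the first ball
    while True:
        r = l2(a, b)
        S = structures[(a, b)]               # Hamming structure for B(a, r)
        a1 = S.hamming_ann(q)                # ANN of phi(q) in the DF-cube
        if l2(q, a1) > r / 10:               # q is far from the ball: done
            return a1
        sample = random.sample(S.ball_points, min(T, len(S.ball_points)))
        a2 = min(sample, key=lambda p: l2(q, p))
        a = min([a, a1, a2], key=lambda p: l2(q, p))
        # Farthest b' with L2(a, b') <= 2 * L2(a, q); the real structure
        # uses binary search on S_a's sorted list, a scan stands in here.
        cands = [p for p in P if p != a and l2(a, p) <= 2 * l2(a, q)]
        if not cands:
            return a
        b = max(cands, key=lambda p: l2(a, p))
```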

ANN Search in Euclidean Space. Invariant: each ball in the search contains q’s (exact) nearest neighbor. [Figure: q inside the ball defined by a_i and b_i.]

ANN Search in Euclidean Space. Each new ball contains only points from the previous ball, and w.p. at least 1 − 2^{−T} it contains at most a constant fraction of them. [Figure: the shrinking balls around a_{i−1} and a_i.]

ANN Search in Euclidean Space. [Figure: the next ball, defined by a_i and b_i, lies inside the previous ball around a_{i−1} and still contains q.]

ANN Search in Euclidean Space. Conclusion: in expectation, this gives an O(log(n)) number of iterations.

ANN Search in Euclidean Space. [Figure: example run — the search starts from the ball S_{a_0,b_0} and shrinks to S_{a_1,b_1} around the query q.]

ANN Search in Euclidean Space. The construction of S is said to ‘fail’ if for some S_{a,b} the mapping φ does not preserve the relative distances. Choosing D, F, L appropriately (for some δ): [formula not preserved in the transcript], S’s construction fails w.p. at most δ. If S does not fail, then for every query the search algorithm finds an ANN.

ANN Search in Euclidean Space. Query time complexity: [formula not preserved in the transcript]. Space complexity: [formula not preserved in the transcript]. Both complexities are also proportional to ε^{-2}.

Remark – Additional Work. Related works: Jon M. Kleinberg, “Two Algorithms for Nearest-Neighbor Search in High Dimensions”, 1997. P. Indyk and R. Motwani, “Approximate Nearest Neighbors: Towards Removing the Curse of Dimensionality”, 1998. A. Gionis, P. Indyk and R. Motwani, “Similarity Search in High Dimensions via Hashing”, 1999.

Remark – Additional Work. Related works: [two slides of results from P. Indyk and R. Motwani ’99; their contents were not preserved in the transcript].

Summary. The goal: linear space and logarithmic search time. Approximate rather than exact nearest neighbor. Locality Sensitive Hash functions. Amplify the collision-probability gap by concatenating hash functions. Discretize values by projecting points onto unit vectors.

Good Bye, (Approximate) Neighbor! For questions, feel free to consult your neighbors.