k-Nearest Neighbors Search in High Dimensions


k-Nearest Neighbors Search in High Dimensions Tomer Peled Dan Kushnir Tell me who your neighbors are, and I'll know who you are

Outline Problem definition and flavors Algorithms overview - low dimensions Curse of dimensionality (d>10..20) Enchanting the curse Locality Sensitive Hashing (high dimension approximate solutions) l2 extension Applications (Dan)

Nearest Neighbor Search, problem definition: given a set P of n points in R^d and some distance metric, find the nearest neighbor p of a query q in P. The solution strategy: select features (defining the feature space R^d), select a distance metric (for example l1 or l2), and build a data structure for fast near-neighbor queries. Scalability with n and with d is important.

Applications: classification, clustering, segmentation, indexing, dimension reduction (e.g. LLE). (Figure: a query point q in a weight/color feature space.)

Naïve solution: no preprocessing. Given a query point q, go over all n points and do the comparison in R^d; query time = O(nd). Keep this bound in mind.
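For reference, a minimal sketch of this naive scan in Python with NumPy; the function name and the toy data below are illustrative, not from the slides.

```python
import numpy as np

def nearest_neighbor_brute_force(P: np.ndarray, q: np.ndarray) -> int:
    """Return the index of the point in P (n x d) closest to q under l2."""
    # Computing all n distances in R^d is the O(nd) cost per query.
    dists = np.linalg.norm(P - q, axis=1)
    return int(np.argmin(dists))

# Usage: 1000 points in R^128, one query.
rng = np.random.default_rng(0)
P = rng.standard_normal((1000, 128))
q = rng.standard_normal(128)
print(nearest_neighbor_brute_force(P, q))
```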

Common solution: use a data structure for acceleration. Scalability with n and with d is important.

When to use nearest neighbors: high-level algorithms split into parametric methods (probability distribution estimation) and non-parametric methods (density estimation, nearest neighbors). Nearest neighbors suit complex models, sparse data and high dimensions, assuming no prior knowledge about the underlying probability structure.

Nearest Neighbor: return the closest point, arg min over pi ∈ P of dist(q, pi).

(r, ε)-Nearest Neighbor: if there is a point p1 with dist(q, p1) ≤ r, return some point within distance (1 + ε)·r of q; r2 = (1 + ε)·r1.

Outline Problem definition and flavors Algorithms overview - low dimensions Curse of dimensionality (d>10..20) Enchanting the curse Locality Sensitive Hashing (high dimension approximate solutions) l2 extension Applications (Dan)

The simplest solution: the "lion in the desert" approach.

Quadtree: split the first dimension into 2, repeat iteratively, and stop when each cell has no more than 1 data point. Extension to the k-dimensional case.
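A compact sketch of the construction just described, for the 2D case. The class and function names are illustrative, and the midpoint split is one plausible choice; the slides do not fix the exact split rule.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class QuadNode:
    points: List[tuple]                      # points stored at a leaf (<= 1 after splitting)
    x_split: Optional[float] = None          # split coordinates of an internal node
    y_split: Optional[float] = None
    children: List["QuadNode"] = field(default_factory=list)

def build_quadtree(points: List[tuple]) -> QuadNode:
    """Split each cell at the midpoint of both dimensions until it holds <= 1 point."""
    node = QuadNode(points=points)
    if len(points) <= 1:
        return node                          # stop when each cell has no more than 1 point
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    node.x_split = (min(xs) + max(xs)) / 2.0
    node.y_split = (min(ys) + max(ys)) / 2.0
    quadrants = [[], [], [], []]
    for p in points:
        idx = (p[0] >= node.x_split) * 2 + (p[1] >= node.y_split)
        quadrants[idx].append(p)
    # Guard against all points collapsing into one quadrant (e.g. duplicate points).
    if max(len(qd) for qd in quadrants) == len(points):
        return node
    node.children = [build_quadtree(qd) for qd in quadrants]
    node.points = []
    return node

root = build_quadtree([(1, 1), (2, 8), (7, 3), (6, 6)])
```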

Quadtree structure (figure): the root stores the split point (X1, Y1); the four children correspond to the quadrants P<X1 / P≥X1 crossed with P<Y1 / P≥Y1. Extension to the k-dimensional case.

Quadtree query (figure): descend to the cell containing q; in many cases this works. Extension to the k-dimensional case.

Quadtree pitfall 1 (figure): in some cases it doesn't work, because the true nearest neighbor lies in a neighboring cell and adjacent cells must also be searched. Extension to the k-dimensional case.

Quadtree pitfall 1 (figure): in some cases nothing works. Extension to the k-dimensional case.

Quadtree pitfall 2: the query may need to examine O(2^d) cells, i.e. query time exponential in the number of dimensions. Extension to the k-dimensional case.

Space-partition-based algorithms could be improved; see Multidimensional Access Methods / Volker Gaede, O. Gunther.

Outline Problem definition and flavors Algorithms overview - low dimensions Curse of dimensionality (d>10..20) Enchanting the curse Locality Sensitive Hashing (high dimension approximate solutions) l2 extension Applications (Dan)

Curse of dimensionality: exact solutions need query time or space that is exponential in d, roughly O(min(nd, n^d)) against the naive O(nd) scan. For d > 10..20 the data structures become worse than a sequential scan for most geometric distributions, so techniques specific to high dimensions are needed; data structures scale poorly with the dimension (proved in theory and in practice by Barkol & Rabani 2000 and by Beame & Vee 2002). See: A quantitative analysis and performance study for similarity search methods in high dimensional spaces / R. Weber, H. Schek, and S. Blott. Enough points n are required for learning a distribution in R^d (for n ~ e^d the curse of dimensionality is meaningless).

Curse of dimensionality, some intuition: the number of cells needed to bound a region grows as 2, 2^2, 2^3, …, 2^d.

Outline Problem definition and flavors Algorithms overview - low dimensions Curse of dimensionality (d>10..20) Enchanting the curse Locality Sensitive Hashing (high dimension approximate solutions) l2 extension Applications (Dan) The compromise: approximate nearest neighbors, returned with some probability.

Preview General Solution – Locality sensitive hashing Implementation for Hamming space Generalization to l1 & l2

Hash function: a key (number) giving access to a storage location.

Hash function: data item → hash function → key → bin/bucket. "A function that maps a data item to a numeric value by use of a transformation. A hash function can convert a number that has meaning to a user, such as a key or other identifier, into a value for the location of that data in a structure such as a table." (Dyson, Dictionary of Networking)

Hash function example: X is a number in the range 0..n; X modulo 3 maps it to a storage address in 0..2. Usually we would like related data items to be stored in the same bin.

Recall (r, ε)-Nearest Neighbor: if dist(q, p1) ≤ r, return some p2 with dist(q, p2) ≤ (1 + ε)·r; r2 = (1 + ε)·r1.

Locality sensitive hashing: a hash family is (r, ε, P1, P2)-sensitive if Pr[I(p) = I(q)] is "high" (at least P1) when p is "close" to q (dist(q, p) ≤ r) and "low" (at most P2) when p is "far" from q (dist(q, p) ≥ (1 + ε)·r), with r2 = (1 + ε)·r1.

Preview General solution – locality sensitive hashing; implementation for Hamming space; generalization to l1 & l2. Why Hamming space? It's easier, it's the same as l1, and it's almost the same as l2.

Hamming space: the set of 2^N binary strings of length N. Hamming distance: the number of digits in which two strings differ (a.k.a. signal distance); equivalently, the Manhattan distance between two vertices of an N-dimensional hypercube, where N is the length of the words. Richard W. Hamming, Error-detecting and error-correcting codes, Bell System Technical Journal 29(2):147-160, 1950. http://en.wikipedia.org/wiki/Hamming_distance

Hamming distance example: dist(010100001111, 010010000011) = SUM(X1 XOR X2), the number of differing bits, which here is 4.

L1 to Hamming space embedding: the axes are positive integers; let C = max(axis value). Represent each coordinate as a padded unary value and concatenate all values into a d' = C·d bit string. Example: with C = 11, the point p = (2, 8) becomes 11000000000 11111111000.
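A small sketch of this embedding and of the Hamming distance it induces; with C = 11 it reproduces the (2, 8) example above, and the Hamming distance between two embedded points equals their l1 distance. The helper names are made up for illustration.

```python
def unary_embed(point, C):
    """Embed a vector of non-negative integers into {0,1}^(C*d) by padded unary codes."""
    return "".join("1" * x + "0" * (C - x) for x in point)

def hamming_distance(s1, s2):
    """SUM(X1 XOR X2): count positions where the binary strings differ."""
    return sum(b1 != b2 for b1, b2 in zip(s1, s2))

p, q = (2, 8), (4, 5)
ep, eq = unary_embed(p, C=11), unary_embed(q, C=11)
print(ep)                        # 1100000000011111111000
print(hamming_distance(ep, eq))  # equals the l1 distance |2-4| + |8-5| = 5
```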

Hash function: for p ∈ H^d', define L hash functions G_j(p) = p|I_j, j = 1..L, each sampling k bits of p (here k = 3 digits). Store p in the bucket indexed by G_j(p); there are 2^k buckets (in the figure the sampled bits read 101).

Construction (figure): each point p is inserted into all L hash tables, table j using G_j.

Query (figure): q is hashed into each of the L tables and the points found in its buckets are retrieved as candidates.
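A minimal sketch of the resulting index, L tables each keyed by k sampled bits, using a Python dictionary per table; the class and method names are illustrative and this is not the original implementation.

```python
import random
from collections import defaultdict

class BitSamplingLSH:
    def __init__(self, d_prime, k, L, seed=0):
        rng = random.Random(seed)
        # I_j: the k bit positions sampled by table j (G_j(p) = p restricted to I_j).
        self.samplings = [rng.sample(range(d_prime), k) for _ in range(L)]
        self.tables = [defaultdict(list) for _ in range(L)]

    def _key(self, p, j):
        return "".join(p[i] for i in self.samplings[j])

    def insert(self, point_id, p):
        for j, table in enumerate(self.tables):
            table[self._key(p, j)].append(point_id)

    def query_candidates(self, q):
        # Union of the buckets q falls into; verify candidates with the real distance afterwards.
        cands = set()
        for j, table in enumerate(self.tables):
            cands.update(table.get(self._key(q, j), []))
        return cands
```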

Alternative intuition, random projections: passing a threshold (TH) on one axis partitions the space into 2. (Figure: the example point p = (2, 8) with its d' = C·d bit representation 11000000000 11111111000.)

Alternative intuition, random projections: the threshold is the analogue of sampling one digit from the Hamming representation, so #thresholds = #sampled digits; sampling k digits is the analogue of randomly partitioning the space using k thresholds.


Alternative intuition, random projections (figure): k = 3 thresholds partition the space into 2^3 buckets labeled 000…111; the point p falls into bucket 101.

k samplings

Repeating

Repeating L times


Secondary hashing: the 2^k buckets can be too many to allocate directly, so a simple (standard) hash maps them into M buckets of size B, with M·B = α·n and α the memory-utilization parameter (e.g. α = 2). This supports tuning dataset size vs. storage volume.

The above hashing is locality-sensitive: Pr[p and q fall into the same bucket] = (1 - Ham(p, q)/d')^k, so the collision probability decays with distance, and a larger k makes the decay sharper (figure: probability vs. distance(q, pi) for k = 1 and k = 2). Adapted from Piotr Indyk's slides.
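A small numeric check of this curve, assuming the standard bit-sampling analysis in which a single sampled bit of p and q agrees with probability 1 - Ham(p, q)/d'; the function name is illustrative.

```python
def collision_probability(hamming_dist, d_prime, k):
    # k independently sampled bits all agree with probability (1 - Ham/d')^k.
    return (1.0 - hamming_dist / d_prime) ** k

for k in (1, 2):
    print(k, [round(collision_probability(h, 100, k), 3) for h in (0, 10, 30, 50)])
# Larger k makes the probability fall off faster with distance, as in the figure.
```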

Preview General Solution – Locality sensitive hashing Implementation for Hamming space Generalization to l2

Direct l2 solution: a new hashing function, still based on sampling, using a mathematical trick: a p-stable distribution for the lp distance, the Gaussian distribution for the l2 distance.

Central limit theorem: a sum of weighted Gaussians, v1·X1 + v2·X2 + … + vn·Xn, is again a (weighted) Gaussian. Generalized central limit theorem: the only possible non-trivial limit of normalized sums of independent identically distributed terms is a stable distribution. See section 1.6, Sums of stable random variables, in Stable Distributions - Models for Heavy Tailed Data / John P. Nolan; http://academic2.american.edu/~jpnolan/stable/stable.html

Central limit theorem: v1..vn are real numbers and X1..Xn are independent identically distributed (i.i.d.) random variables; consider the weighted sum v1·X1 + v2·X2 + … + vn·Xn.

Central limit theorem: when the Xi are i.i.d. standard Gaussians, the dot product v·X is distributed like ||v|| times a single Gaussian, so the dot product encodes the norm.

Norm → distance: for two feature vectors v1 and v2, a·v1 - a·v2 = a·(v1 - v2) is distributed like ||v1 - v2|| times a Gaussian. The dot product is calculated once per vector (n products) instead of calculating all n^2 distances; the distance sketch then reduces to a subtraction of scalars.

Norm → distance (figure): subtracting the two dot products yields the distance sketch.

The full hashing (figure): the feature vector is projected onto d random numbers (in the example, the vectors [22 77 42] and [34 82 21]; their dot product is 22·34 + 77·82 + 42·21 = 7944). A random phase b ∈ [0, w] is then added, and the result is discretized with step w.

The full hashing (figure): 7944 + 34 = 7978; with discretization step w = 100 the value falls in the bin [7900, 8000).

The full hashing (figure): b = 34 is the random phase drawn from [0, w]; w = 100 is the discretization step; the projection value is 7944.

The full hashing: h_{a,b}(v) = ⌊(a·v + b) / w⌋, where v is the feature vector, a is a d-dimensional vector whose entries are drawn i.i.d. from a p-stable distribution, b is a random phase drawn uniformly from [0, w], and w is the discretization step.
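A minimal sketch of this hash for l2, with the entries of a drawn from a Gaussian (the 2-stable case); the class name and the parameter values are illustrative.

```python
import numpy as np

class L2Hash:
    def __init__(self, d, w, seed=0):
        rng = np.random.default_rng(seed)
        self.a = rng.standard_normal(d)   # i.i.d. N(0,1) entries: the 2-stable distribution
        self.b = rng.uniform(0.0, w)      # random phase in [0, w]
        self.w = w                        # discretization step

    def __call__(self, v):
        return int(np.floor((self.a @ v + self.b) / self.w))

h = L2Hash(d=3, w=100.0)
print(h(np.array([34.0, 82.0, 21.0])))
```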

Generalization, p-stable distributions: the central limit theorem gives the Gaussian (normal) distribution, which is 2-stable and works for l2; the generalized central limit theorem gives p-stable distributions for lp, p ∈ (0, 2]; the Cauchy distribution is 1-stable and works for l1.

P-stable summary: solves the (r, ε)-nearest neighbor problem, generalizes to 0 < p ≤ 2, and improves the query time. Latest results (reported by e-mail by Alexander Andoni): query time improves from O(d·n^(1/(1+ε))·log n) to O(d·n^(1/(1+ε)^2)·log n).

Parameters selection: aim for 90% probability of finding the nearest neighbor with the best query-time performance, for Euclidean space.

Parameters selection: a single projection hits an ε-nearest neighbor with Pr = p1; k projections (one k-bit hash) hit it with Pr = p1^k; all L hashings fail to collide with Pr = (1 - p1^k)^L. To ensure a collision with probability at least 1 - δ (e.g. 90%), require 1 - (1 - p1^k)^L ≥ 1 - δ. For Euclidean space: accept neighbors, reject non-neighbors.
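A small helper that inverts this constraint: given k, p1 and δ it returns the smallest L satisfying 1 - (1 - p1^k)^L ≥ 1 - δ. The function name and the example numbers are illustrative.

```python
import math

def min_tables(p1, k, delta):
    p_bucket = p1 ** k                      # one k-bit projection hits with Pr = p1^k
    return math.ceil(math.log(delta) / math.log(1.0 - p_bucket))

print(min_tables(p1=0.8, k=10, delta=0.1))  # L needed for >= 90% collision probability
```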

Parameters selection (figure: query time vs. k): the query time is the sum of the candidate-extraction time and the candidate-verification time, where Tc is the time to evaluate the distance function on all candidates in the hash buckets; as k grows the verification time drops while the extraction time rises, so there is an optimal k.

Pros: better query time than spatial data structures; scales well to higher dimensions and larger data sizes (sub-linear dependence); predictable running time. Cons: extra storage overhead; inefficient for data with distances concentrated around the average; works best for Hamming distance (although it can be generalized to Euclidean space); in secondary storage a linear scan is pretty much all we can do (for high dimensions); requires the radius r to be fixed in advance. From Piotr Indyk's slides.

Conclusion: …but in the end everything depends on your data set. Try it at home: visit http://web.mit.edu/andoni/www/LSH/index.html, e-mail Alex Andoni (Andoni@mit.edu), and test it over your own data (C code under Red Hat Linux).

LSH applications: searching video clips in databases ("Hierarchical, Non-Uniform Locality Sensitive Hashing and Its Application to Video Identification", Yang, Ooi, Sun); searching image databases (see the following); image segmentation (see the following); image classification ("Discriminant Adaptive Nearest Neighbor Classification", T. Hastie, R. Tibshirani); texture classification (see the following); clustering (see the following); embedding and manifold learning (LLE, and many others); compression - vector quantization; search engines ("LSH Forest: Self-Tuning Indexes for Similarity Search", M. Bawa, T. Condie, P. Ganesan); genomics ("Efficient Large-Scale Sequence Comparison by Locality-Sensitive Hashing", J. Buhler). In short: whenever k-nearest neighbors (KNN) are needed.

Motivation: a variety of procedures in learning require KNN computation; KNN search is a computational bottleneck; LSH provides a fast approximate solution to the problem; LSH requires hash-function construction and parameter tuning. "Sensitive" in the sense that the hash functions find the right neighbors and reject wrong ones with the prescribed probabilities; for some application spaces this requires good training, as we shall see.

Outline Fast Pose Estimation with Parameter Sensitive Hashing G. Shakhnarovich, P. Viola, and T. Darrell. Finding sensitive hash functions. Mean Shift Based Clustering in High Dimensions: A Texture Classification Example B. Georgescu, I. Shimshoni, and P. Meer Tuning LSH parameters. LSH data structure is used for algorithm speedups.

Fast Pose Estimation with Parameter Sensitive Hashing - G. Shakhnarovich, P. Viola, and T. Darrell. The problem: given an image x, what are the parameters θ in this image, i.e. the angles of the joints, the orientation of the body, etc.?

Ingredients: an input query image with unknown angles (parameters); a database of human poses with known angles; an image feature extractor (edge detector); a distance metric in feature space, dx; a distance metric in angle space.

Example-based learning: construct a database of example images with their known angles; given a query image, run your favorite feature extractor; compute the KNN from the database; use these KNN to compute the average angles of the query. Input: query → find KNN in the database of examples → output: average angles of the KNN.

The algorithm flow (diagram): input query → feature extraction → processed query → PSH (LSH) against the database of examples → LWR (regression) → output match.

The image features: multi-scale edge histograms. The feature space is the concatenation of the histograms of all the different windows; windows are of size 8, 16 and 32.

PSH, the basic assumption: there are two metric spaces here, the feature space and the parameter space. We want similarity to be measured in the angle space, whereas LSH works on the feature space. Assumption: the feature space is closely related to the parameter space.

Insight, manifolds: a manifold is a space in which every point has a neighborhood resembling Euclidean space, but the global structure may be complicated (curved). For example, lines are 1D manifolds, planes are 2D manifolds, etc.

Is this magic? (Figure: feature space vs. parameter space (angles), with a query q.) The parameter space is lower-dimensional.

Parameter Sensitive Hashing (PSH): the trick is to estimate the performance of different hash functions on examples and select those that are sensitive to similarity in the parameter (angle) space. The hash functions are applied in feature space, but the KNN they return are valid in angle space.

PSH as a classification problem: label pairs of examples as similar / non-similar according to their angles; define hash functions h on feature space; predict the labeling of similar / non-similar examples using h and compare with the true labeling; if the labeling by h is good accept h, else change h.

Labels: +1 / -1 (r = 0.25). Clearly the choice of positive and negative example pairs must be handled with care, since there are many more negative cases than positive ones.

A binary hash function (figure): note that the y-axis is θ, so that if two points are similar in θ-space they fall into the same bucket, and if they are not similar in θ-space they fall into different bins (clearly this rests on the big assumption that θ and the features behave similarly).

Remember that for a k-bit sensitive hash function the positive (true-neighbor) collision probability is p1^k and the false-positive probability is p2^k; with L tables the positive collision probability becomes 1 - (1 - p1^k)^L.

Local Weighted Regression (LWR): given a query image, PSH returns its KNN, and LWR uses them to compute a weighted average of the estimated angles of the query. The regression g weights the neighbors' parameters: we look for a β that minimizes the distance to the neighbors' parameters in parameter space while also accounting for the distance in feature space; for neighbors close in feature space the weight is high, so g must produce a value as close to their θ_i as possible.
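A simplified sketch of this estimation step: a kernel-weighted average of the neighbors' known angles, with weights derived from feature-space distances. This is only a zeroth-order stand-in for the locally weighted regression used in the paper; the names and the Gaussian kernel are assumptions.

```python
import numpy as np

def weighted_angle_estimate(query_features, neighbor_features, neighbor_angles, bandwidth=1.0):
    """neighbor_features: (k, d_x); neighbor_angles: (k, d_theta)."""
    d = np.linalg.norm(neighbor_features - query_features, axis=1)
    w = np.exp(-(d / bandwidth) ** 2)   # neighbors closer in feature space weigh more
    w /= w.sum()
    return w @ neighbor_angles          # weighted average of the angle vectors
```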

Results: synthetic data were generated with 13 angles (1 for the rotation of the torso, 12 for the joints) and 150,000 images; nuisance parameters were added: clothing, illumination, facial expression. (How can you measure the rotation of the torso around the vertical axis?)

Selected 137 out of 5,123 meaningful features (how?) from 1,775,000 example pairs; 18-bit hash functions (k), 150 hash tables (L). Recall: p1 is the probability of a positive hash, p2 the probability of a bad hash, B the maximum number of points in a bucket. Without feature selection, 40 bits and 1,000 hash tables would have been needed. These are slightly better results than those in the actual paper. There is a large number of examples, so the space is nicely sampled, and taking only 137 of 5,123 features is a considerable dimension reduction. For this problem r = 0.25, R = 0.5 and ε = 1. The authors don't say how they selected the 137 features; a guess is that the optimal t was non-zero on 137 features only. The 18-bit k and 150 tables L follow Indyk's recommendation, and with those values the query cost for ε = 1 is about d·N^(1/2). Tested on 1,000 synthetic examples: PSH searched only 3.4% of the data per query (3.4% refers to the whole hash-table structure).

Results – real data 800 images. Processed by a segmentation algorithm. 1.3% of the data were searched.

Results – real data Note the improvement by LWR

Interesting mismatches: note the difficult background; note the error correction by LWR, but also cases where it is worse than the top match (column 4). Column 3 might stem from a mismatch between the two distance metrics (feature vs. parameter space), or because the manifold is not sampled densely enough.

Fast pose estimation - summary: a fast way to compute the angles of a human body figure; moving from one representation space to another; training a sensitive hash function; KNN smart averaging.

Food for thought: the basic assumption may be problematic (distance metric, representations); the training set should be dense; texture and clutter are issues. In general, some features are more important than others and should be weighted. Lastly, some dependency is assumed between the data points.

Food for thought, Point Location in Different Spheres (PLDS): given n spheres in R^d, centered at P = {p1,…,pn} with radii {r1,…,rn}, preprocess them so that for a query q we can find a point pi whose sphere covers q. (Figure: q, pi, ri.) Courtesy of Mohamad Hegaze.

Mean-Shift Based Clustering in High Dimensions: A Texture Classification Example - B. Georgescu, I. Shimshoni, and P. Meer. Motivation: clustering high-dimensional data using local density measurements (e.g. in feature space). Statistical curse of dimensionality: sparseness of the data. Computational curse of dimensionality: expensive range queries. LSH parameters should be adjusted for optimal performance. (In image segmentation we work in a feature space that can be high-dimensional.)

Outline Mean-shift in a nutshell + examples. Our scope: Mean-shift in high dimensions – using LSH. Speedups: Finding optimal LSH parameters. Data-driven partitions into buckets. Additional speedup by using LSH data structure.

Mean-shift in a nutshell (figure): a window of the given bandwidth placed around a point is repeatedly shifted to the mean of the points inside it, converging to a local density mode.

KNN in mean-shift: the bandwidth should be inversely proportional to the density in the region (high density: small bandwidth; low density: large bandwidth). It is based on the k-th nearest neighbor of the point: the bandwidth of each point is its distance to its k-th nearest neighbor. Adaptive vs. non-adaptive mean-shift.
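A minimal sketch of this bandwidth rule in its brute-force form (fine for small n); the function name and the Euclidean norm are assumptions, and in the paper the k-th neighbor is found approximately through LSH.

```python
import numpy as np

def adaptive_bandwidths(X, k):
    """X: (n, d) data matrix; returns one bandwidth per point (distance to its k-th NN)."""
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)  # all pairwise distances
    dists.sort(axis=1)
    return dists[:, k]   # column 0 is the distance of each point to itself (0)

X = np.random.default_rng(0).standard_normal((200, 5))
h = adaptive_bandwidths(X, k=10)
```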

Mean-shift (figure).

Image segmentation algorithm: input is data in 5D (3 color + 2 spatial x, y) or 3D (1 gray + 2 spatial x, y); the resolution is controlled by the bandwidths hs (spatial) and hr (color); apply filtering. Mean Shift: A Robust Approach Toward Feature Space Analysis, D. Comaniciu et al., TPAMI 2002.

Image segmentation algorithm, filtering: each pixel takes the value of the nearest mode. (Figure: mean-shift trajectories; original, filtered, segmented.) We apply mean-shift and then assign each pixel the range (color or gray) value of the mode it converges to; segmentation then groups all pixels whose modes are close enough into the same cluster.

Filtering examples (figures): original and filtered squirrel, original and filtered baboon. Mean Shift: A Robust Approach Toward Feature Space Analysis, D. Comaniciu et al., TPAMI 2002.

Segmentation examples (figures). Mean Shift: A Robust Approach Toward Feature Space Analysis, D. Comaniciu et al., TPAMI 2002.

Mean-shift in high dimensions: the computational curse of dimensionality (expensive range queries) is handled with LSH, and the statistical curse of dimensionality (sparseness of the data) is handled with a variable bandwidth.

LSH-based data structure: choose L random partitions; each partition includes K pairs (d_k, v_k), a coordinate index and a cut value. For each point we check whether its d_k-th coordinate is below v_k; the K boolean results partition the data into cells.

Choosing the optimal K and L: for a query q we want to compute the smallest possible number of distances to the points in its buckets. Each cell created by a partition defines a region (several partitions may cut the same coordinate axis); C_intersect lies close to the center of C_union, so each query q recovers some of its nearest neighbors, and once L is above some threshold the neighborhood of q is covered while further partitions only add nuisance points. The expected number of points per cell follows because the K cuts split each of the d axes into about K/d + 1 regions; raising that to the d-th power gives the number of cells, and dividing the n points by it gives the expected bucket population.


Choosing optimal K and L: determine the exact KNN distance (bandwidth) for m randomly selected data points, choose an error threshold ε, and require that the optimal K and L keep the approximate k-NN distance within that error of the exact one.

Choosing optimal K and L: for each K, estimate the error for all L in one run and find the minimal L(K) satisfying the constraint; then minimize the running time t(K, L(K)). In practice we compute the error for all L at each K, apply the error threshold to pick L(K), and then measure the running time, which reflects how many extra distances were actually computed, to find the K and L giving the minimum. (Figures: approximation error for K, L; running time t[K, L(K)]; L(K) for ε = 0.05.)

Data-driven partitions: in the original LSH the cut values are drawn uniformly at random from the range of the data. Suggestion: randomly select a point from the data and use one of its coordinates as the cut value. The data-driven cuts follow the data density and give a better behaved points-per-bucket distribution than uniform cuts. (Figure: points/bucket distribution, uniform vs. data driven.)
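A minimal sketch of one such partition with data-driven cuts, combining the (d_k, v_k) structure above with the data-driven suggestion; all names are illustrative and this is not the paper's code.

```python
import numpy as np

def make_partition(X, K, rng):
    dims = rng.integers(0, X.shape[1], size=K)           # d_k: random coordinate indices
    cuts = X[rng.integers(0, X.shape[0], size=K), dims]  # v_k: coordinate of a random data point
    return dims, cuts

def cell_of(x, partition):
    dims, cuts = partition
    bits = x[dims] < cuts             # K boolean tests against the cut values
    return tuple(bits.tolist())       # the cell (bucket) label of point x

rng = np.random.default_rng(0)
X = rng.standard_normal((1000, 8))
partition = make_partition(X, K=12, rng=rng)
print(cell_of(X[0], partition))
```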

Additional speedup (figure).

Speedup results: 65,536 points, 1,638 points sampled, k = 100. Note that the speedup does not even depend on the dimension.

Food for thought Low dimension High dimension

A thought for food… Choose K, L by sample learning, or take the traditional values; can one estimate K, L without sampling? Does it help to know the data dimensionality or the data manifold? Intuitively, the dimensionality suggests the number of hash functions needed; the catch is that efficient dimensionality learning itself requires KNN. To illustrate the intrinsic-dimensionality point, think of data spread on a line manifold living in 3D: one projection in the right direction is enough to capture it. The traditional choice is based on Indyk's estimate, which assumes a roughly Gaussian distribution of the data.

Summary: LSH trades some accuracy for a large gain in complexity. Applications involving massive data in high dimensions need LSH's fast performance. The LSH idea extends to other spaces (PSH). The LSH parameters and hash functions can be learned for different applications.

Conclusion: …but in the end everything depends on your data set. Try it at home: visit http://web.mit.edu/andoni/www/LSH/index.html, e-mail Alex Andoni (Andoni@mit.edu), and test it over your own data (C code under Red Hat Linux).

Thanks Ilan Shimshoni (Haifa). Mohamad Hegaze (Weizmann). Alex Andoni (MIT). Mica and Denis.