Spectral Approaches to Nearest Neighbor Search [FOCS 2014]
Robert Krauthgamer (Weizmann Institute)
Joint with: Amirali Abdullah, Alexandr Andoni, Ravi Kannan
Simons Institute, Nov. 2016
Nearest Neighbor Search (NNS)
- Preprocess: a set P of n points in ℝ^d
- Query: given a query point q, report a point p* ∈ P with the smallest distance to q
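For concreteness, here is a minimal Python/NumPy sketch of the baseline (linear-scan) solution to the problem as stated; the function name and the choice of Euclidean distance are mine, not from the slides.

```python
import numpy as np

def nearest_neighbor(P, q):
    """Linear scan: index of the point of P closest to q (Euclidean distance), in O(n*d) time."""
    dists = np.linalg.norm(P - q, axis=1)   # distance from q to every data point
    return int(np.argmin(dists))

# Tiny usage example on random data.
rng = np.random.default_rng(0)
P = rng.standard_normal((1000, 32))         # n = 1000 points in R^32
q = rng.standard_normal(32)
print("nearest neighbor index:", nearest_neighbor(P, q))
```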
Motivation
- Generic setup:
  - points model objects (e.g. images)
  - distance models a (dis)similarity measure
- Application areas: machine learning (k-NN rule), signal processing, vector quantization, bioinformatics, etc.
- Distance can be: Hamming, Euclidean, edit distance, earth-mover distance, ...
Curse of Dimensionality
All exact algorithms degrade rapidly with the dimension d:
- Full indexing: query time O(d · log n), space n^{O(d)} (Voronoi diagram size)
- No indexing (linear scan): query time O(n · d)
Approximate NNS
- Given a query point q, report p′ ∈ P s.t. ||p′ − q|| ≤ c · min_{p* ∈ P} ||p* − q||
  - c ≥ 1: approximation factor
  - randomized: return such a p′ with probability ≥ 90%
- Heuristic perspective: gives a set of candidates (hopefully small)
NNS algorithms
It's all about space partitions!
- Low-dimensional: [Arya-Mount'93], [Clarkson'94], [Arya-Mount-Netanyahu-Silverman-Wu'98], [Kleinberg'97], [Har-Peled'02], [Arya-Fonseca-Mount'11], ...
- High-dimensional: [Indyk-Motwani'98], [Kushilevitz-Ostrovsky-Rabani'98], [Indyk'98, '01], [Gionis-Indyk-Motwani'99], [Charikar'02], [Datar-Immorlica-Indyk-Mirrokni'04], [Chakrabarti-Regev'04], [Panigrahy'06], [Ailon-Chazelle'06], [Andoni-Indyk'06], [Andoni-Indyk-Nguyen-Razenshteyn'14], [Andoni-Razenshteyn'15]
Low-dimensional
- kd-trees, ...
- c = 1 + ε
- runtime: ε^{−O(d)} · log n
High-dimensional
- Locality-Sensitive Hashing
- Crucial use of random projections
  - Johnson-Lindenstrauss Lemma: project to a random subspace of dimension O(ε^{−2} · log n) for 1 + ε approximation
- Runtime: n^{1/c} for c-approximation
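A quick illustrative sketch (mine, not from the slides) of the Johnson-Lindenstrauss step: project onto a random subspace of dimension m ≈ ε^{−2} log n and check that pairwise distances are preserved up to roughly 1 ± ε; the Gaussian projection matrix and the constant in m are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, eps = 500, 2000, 0.2
m = int(np.ceil(8 * np.log(n) / eps**2))        # target dimension O(eps^-2 * log n); the constant 8 is illustrative

P = rng.standard_normal((n, d))
A = rng.standard_normal((d, m)) / np.sqrt(m)    # random Gaussian projection, scaled to preserve norms in expectation
PA = P @ A

# Compare a few pairwise distances before and after the projection.
pairs = rng.choice(n, size=(5, 2), replace=False)
i, j = pairs[:, 0], pairs[:, 1]
orig = np.linalg.norm(P[i] - P[j], axis=1)
proj = np.linalg.norm(PA[i] - PA[j], axis=1)
print("distance ratios (projected / original):", np.round(proj / orig, 3))
```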
Practice
Data-aware partitions: optimize the partition to your dataset
- PCA-tree [Sproull'91, McNames'01, Verma-Kpotufe-Dasgupta'09]
- randomized kd-trees [Silpa-Anan-Hartley'08, Muja-Lowe'09]
- spectral/PCA/semantic/WTA hashing [Weiss-Torralba-Fergus'08, Wang-Kumar-Chang'09, Salakhutdinov-Hinton'09, Yagnik-Strelow-Ross-Lin'11]
Practice vs Theory
- Data-aware projections often outperform (vanilla) random-projection methods
- But no guarantees (correctness or performance)
- JL is generally optimal [Alon'03, Jayram-Woodruff'11]
  - even for some NNS setups! [Andoni-Indyk-Patrascu'06]
- Why do data-aware projections outperform random projections?
- Algorithmic framework to study this phenomenon?
Plan for the rest
- Model
- Two spectral algorithms
- Conclusion
Our model
"Low-dimensional signal + large noise" inside a high-dimensional space:
- Signal: P ⊂ U for a subspace U ⊂ ℝ^d of dimension k ≪ d
- Data: each point in P is perturbed by full-dimensional Gaussian noise N_d(0, σ²I_d)
Model properties
- Data P̃ = P + G, where each point in P has at least unit norm
- Query q̃ = q + g_q s.t.:
  - ||q − p*|| ≤ 1 for the "nearest neighbor" p*
  - ||q − p|| ≥ 1 + ε for everybody else
- Noise entries N(0, σ²) with σ ≈ 1/d^{1/4}, up to a factor poly(ε^{−1} k log n)
- Claim: the exact nearest neighbor is still the same
- Noise is large:
  - it has magnitude σ√d ≈ d^{1/4} ≫ 1
  - the top k dimensions of P̃ capture only a sub-constant fraction of the mass
  - JL would not work: after the noise, the gap is very close to 1
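To make the scales concrete, here is a small generator for this planted model (all parameter values are mine): roughly unit-norm signal points in a random k-dimensional subspace U, plus full-dimensional Gaussian noise with σ ≈ d^{−1/4}, so each noise vector has norm about σ√d ≈ d^{1/4} ≫ 1.

```python
import numpy as np

def make_instance(n=2000, d=4096, k=10, seed=0):
    """Sample the 'low-dimensional signal + large noise' model."""
    rng = np.random.default_rng(seed)
    U, _ = np.linalg.qr(rng.standard_normal((d, k)))         # orthonormal basis of a random k-dim subspace (d x k)
    coeffs = rng.standard_normal((n, k))
    coeffs /= np.linalg.norm(coeffs, axis=1, keepdims=True)  # signal points have unit norm
    P = coeffs @ U.T                                         # signal: n points lying in U
    sigma = d ** -0.25                                       # noise scale ~ d^{-1/4}
    G = sigma * rng.standard_normal((n, d))                  # full-dimensional Gaussian noise
    return P, P + G, U, sigma

P, P_noisy, U, sigma = make_instance()
print("typical signal norm:", round(np.linalg.norm(P[0]), 2))               # ~ 1
print("typical noise norm :", round(np.linalg.norm(P_noisy[0] - P[0]), 2))  # ~ sigma*sqrt(d) = d^{1/4} >> 1
```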
Algorithms via PCA
- Find the "signal subspace" U? Then we could project everything onto U and solve NNS there
- Use Principal Component Analysis (PCA)?
  - ≈ extract the top direction(s) from the SVD
  - e.g., the k-dimensional subspace S that minimizes Σ_{p∈P̃} d²(p, S)
- If PCA removes the noise "perfectly", we are done: S = U
  - can reduce to k-dimensional NNS
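A minimal sketch of that PCA step via the SVD (naming and toy data are mine): the top-k right singular vectors span the k-dimensional subspace S minimizing Σ_p d²(p, S), and projecting onto S reduces the problem to k-dimensional NNS.

```python
import numpy as np

def top_k_pca_subspace(X, k):
    """Orthonormal basis (k x d) of the k-dim subspace S minimizing sum_p d^2(p, S) (uncentered PCA)."""
    _, _, Vt = np.linalg.svd(X, full_matrices=False)   # rows of Vt are the right singular vectors
    return Vt[:k]

# Toy check on a noiseless low-rank instance: S should capture the signal subspace exactly.
rng = np.random.default_rng(0)
n, d, k = 500, 200, 5
U, _ = np.linalg.qr(rng.standard_normal((d, k)))       # true signal subspace (d x k)
P = rng.standard_normal((n, k)) @ U.T                  # points lying exactly in U
S = top_k_pca_subspace(P, k)
low_dim = P @ S.T                                      # n x k coordinates: run any k-dimensional NNS on these
print("fraction of U captured by S:", round(np.linalg.norm(S @ U) ** 2 / k, 3))   # ~ 1.0
```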
NNS performance as if we were in k dimensions, for the full model?
- Best we can hope for: the dataset contains a "worst-case" k-dimensional instance
- Effectively reduce the dimension from d to k
- Spoiler: Yes
PCA under noise fails
Does PCA find the "signal subspace" U under noise?
- PCA minimizes Σ_{p∈P̃} d²(p, S)
  - good only on "average", not in the "worst case"
- Weak signal directions are overpowered by noise directions
  - a typical noise direction contributes Σ_{i=1}^n g_i² ≈ n·E[g_i²] = Θ(nσ²)
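A quick numerical illustration of this failure (parameters are mine): a weak signal direction whose total squared mass is below the ≈ nσ² contributed by noise directions gets pushed out of the top PCA subspace.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 2000, 500
sigma = d ** -0.25                       # noise scale from the model; n*sigma^2 ~ 89 here

# Signal in span{e1, e2}: a strong direction and a weak one.
P = np.zeros((n, d))
P[:, 0] = 1.0                            # strong direction: total squared mass n = 2000
P[:, 1] = 0.1                            # weak direction: total squared mass 0.01*n = 20 < n*sigma^2
X = P + sigma * rng.standard_normal((n, d))

_, _, Vt = np.linalg.svd(X, full_matrices=False)
S2 = Vt[:2]                              # top-2 (uncentered) PCA subspace of the noisy data

for name, e in [("strong (e1)", 0), ("weak  (e2)", 1)]:
    captured = np.linalg.norm(S2[:, e]) ** 2   # squared projection of the axis onto the top-2 subspace
    print(f"{name} direction captured by top-2 PCA: {captured:.3f}")
# Typically the strong direction is captured (~1) while the weak one is mostly lost:
# its mass (~20) is below what the strongest noise directions contribute.
```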
1st Algorithm: intuition
- Extract "well-captured points"
  - points whose signal lies mostly inside the top PCA space
  - should work for a large fraction of the points
- Iterate on the rest
Iterative PCA
To make this work:
- Nearly no noise in S: ensure S is close to U
  - S is determined by heavy-enough spectral directions (its dimension may be less than k)
- Capture only points whose signal is fully inside S
  - well-captured: distance to S explained by noise only

Algorithm:
- Find the top PCA subspace S
- C = points well-captured by S
- Build an NNS data structure on {C projected onto S}
- Iterate on the remaining points, P̃ ∖ C
- Query: query each NNS data structure separately
Simpler model
Assume small noise: p̃_i = p_i + α_i, where ||α_i|| ≤ α ≪ ε (can even be adversarial)

Algorithm (well-captured = d(p̃, S) ≤ 2α):
- Find the top-k PCA subspace S
- C = points well-captured by S
- Build NNS on {C projected onto S}
- Iterate on the remaining points, P̃ ∖ C
- Query: query each NNS separately

Claim 1: if p* is captured into C, we will find it in the corresponding NNS
- for any captured p̃: ||p̃_S − q_S|| = ||p̃ − q|| ± 4α = ||p − q|| ± 5α

Claim 2: the number of iterations is O(log n)
- Σ_{p̃∈P̃} d²(p̃, S) ≤ Σ_{p̃∈P̃} d²(p̃, U) ≤ n·α²
- so for at most a 1/4-fraction of the points, d²(p̃, S) ≥ 4α²
- hence a constant fraction of the points is captured in each iteration
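Below is a compact sketch of this iterative scheme for the simpler (small-noise) setting; each "NNS data structure" is just the stored projected points, queried here by linear scan, and the capture threshold 2α follows the slide. All names and parameter choices are mine.

```python
import numpy as np

def top_k_subspace(X, k):
    _, _, Vt = np.linalg.svd(X, full_matrices=False)
    return Vt[:k]                                            # orthonormal basis of the top-k PCA subspace

def build_iterative_pca(P_noisy, k, alpha):
    """Repeat: take the top-k PCA subspace S, capture points within 2*alpha of S, store them projected, recurse."""
    structures, remaining = [], np.arange(len(P_noisy))
    while len(remaining) > 0:
        X = P_noisy[remaining]
        S = top_k_subspace(X, k)
        dist_to_S = np.linalg.norm(X - (X @ S.T) @ S, axis=1)
        captured = dist_to_S <= 2 * alpha                    # "well-captured" points
        if not captured.any():                               # safeguard, for this sketch only
            captured[:] = True
        structures.append((S, remaining[captured], X[captured] @ S.T))   # "NNS d.s." = projected points
        remaining = remaining[~captured]
    return structures

def query_iterative_pca(structures, q):
    """Query every per-iteration structure (linear scan in k dims here) and keep the best candidate."""
    best_idx, best_dist = -1, np.inf
    for S, ids, proj_pts in structures:
        dists = np.linalg.norm(proj_pts - S @ q, axis=1)     # a real k-dim NNS structure would replace this scan
        j = int(np.argmin(dists))
        if dists[j] < best_dist:
            best_idx, best_dist = int(ids[j]), dists[j]
    return best_idx

# Tiny usage: rank-k signal plus a small perturbation; query very close to point 0.
rng = np.random.default_rng(0)
n, d, k, alpha = 1000, 100, 4, 1e-3
U, _ = np.linalg.qr(rng.standard_normal((d, k)))
P = rng.standard_normal((n, k)) @ U.T
P_noisy = P + alpha * rng.standard_normal((n, d)) / np.sqrt(d)    # per-point noise of norm about alpha
ds = build_iterative_pca(P_noisy, k, alpha)
print("answer:", query_iterative_pca(ds, P[0] + 0.01 * U[:, 0]))  # expect 0 (or an equally close point)
```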
Analysis of the general model
- Noise is larger; must use the fact that it is random
- "Signal" should be stronger than "noise" (on average)
- Use random matrix theory:
  - P̃ = P + G
  - G is a random n×d matrix with entries N(0, σ²)
    - all its singular values satisfy λ² ≤ σ²n ≈ n/√d
  - P has rank ≤ k and squared Frobenius norm ≥ n
    - important directions have λ² ≥ Ω(n/k)
    - can ignore directions with λ² ≪ εn/k
- Important signal directions are stronger than the noise!
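A quick sanity check (mine, with an illustrative n ≫ d) that the noise matrix behaves as random matrix theory predicts: its largest singular value is about σ(√n + √d), so λ² ≈ σ²n ≈ n/√d in this regime, up to the hidden (1 + √(d/n))² factor.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 50000, 100
sigma = d ** -0.25

G = sigma * rng.standard_normal((n, d))
lam_max = np.sqrt(np.linalg.eigvalsh(G.T @ G)[-1])      # largest singular value of G

print("largest singular value of G:", round(lam_max, 2))
print("sigma*(sqrt(n)+sqrt(d))    :", round(sigma * (np.sqrt(n) + np.sqrt(d)), 2))
print("lambda_max^2 vs n/sqrt(d)  :", round(lam_max**2), "vs", round(n / np.sqrt(d)))
```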
Closeness of subspaces?
- Trickier than singular values
  - the top singular vector is not stable under perturbation!
  - it is stable only if the second singular value is much smaller
- How to even define "closeness" of subspaces?
- To the rescue: Wedin's sin-theta theorem
  - sin θ(S, U) = max_{x∈S, ||x||=1} min_{y∈U} ||x − y||
Wedin's sin-theta theorem
Developed by [Davis-Kahan'70], [Wedin'72]

Theorem: Consider P̃ = P + G, where
- S is the top-l subspace of P̃
- U is the k-dimensional space containing P
Then: sin θ(S, U) ≤ ||G|| / λ_l(P)

Another way to see why we need to take directions with sufficiently heavy singular values.
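A small numerical illustration (setup is mine) of the quantities in the theorem: S is the top-l right-singular subspace of P̃ = P + G, U is the k-dimensional signal subspace, and sin θ(S, U), computed from the definition above, is compared to the bound ||G|| / λ_l(P) from the slide.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, k, l, sigma = 2000, 200, 3, 3, 0.05

# Rank-k signal P whose rows lie in a k-dimensional subspace U, plus Gaussian noise G.
Ub, _ = np.linalg.qr(rng.standard_normal((d, k)))        # orthonormal basis of U (d x k)
P = rng.standard_normal((n, k)) @ Ub.T
G = sigma * rng.standard_normal((n, d))

# S = top-l right-singular subspace of the noisy matrix P + G.
_, _, Vt = np.linalg.svd(P + G, full_matrices=False)
Sb = Vt[:l].T                                            # orthonormal basis of S (d x l)

# sin(theta(S, U)) = max over unit x in S of its distance to U.
residual = (np.eye(d) - Ub @ Ub.T) @ Sb                  # component of S sticking out of U
sin_theta = np.linalg.svd(residual, compute_uv=False)[0]

norm_G = np.linalg.svd(G, compute_uv=False)[0]           # spectral norm of the noise
lambda_l_P = np.linalg.svd(P, compute_uv=False)[l - 1]   # l-th singular value of the signal

print("sin theta(S, U)    :", round(sin_theta, 4))
print("||G|| / lambda_l(P):", round(norm_G / lambda_l_P, 4))
```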
Additional issue: conditioning
- After an iteration, the noise is not random anymore!
  - non-captured points might be "biased" by the capturing criterion
- Fix: estimate the top PCA subspace from a small sample of the data
  - might be needed purely for the analysis
  - but it does not sound like a bad idea in practice either
Performance of Iterative PCA
- Can prove there are O(√(d log n)) iterations
- In each iteration, we do NNS in a space of dimension ≤ k
- Overall query time: O((1/ε)^{O(k)} · √d · log^{3/2} n)
- Reduced to O(√(d log n)) instances of k-dimensional NNS!
2nd Algorithm: PCA-tree
Closer to algorithms used in practice
- Find the top PCA direction v
- Partition into slabs ⊥ v, of width ≈ ε/√k
- Snap points to the ⊥ hyperplanes
- Recurse on each slab
- Query: follow all tree paths that may contain p*
Two algorithmic modifications
- Centering:
  - need to use centered PCA (subtract the average)
  - otherwise errors from the perturbations accumulate
- Sparsification:
  - need to sparsify the set of points in each node of the tree
  - otherwise we can get a "dense" cluster: not enough variance in the signal, lots of noise
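A sketch of such a PCA-tree (the slab width, leaf size, depth cap, and query radius are illustrative choices of mine; sparsification is omitted): each node centers its points, takes the top PCA direction, buckets the points into slabs of width ≈ ε/√k along it, snaps them to their slab's hyperplane, and recurses; a query descends into every slab whose interval could still contain a point at the (unit) nearest-neighbor distance.

```python
import numpy as np
from collections import defaultdict

def build_pca_tree(points, ids, width, leaf_size=16, depth=0, max_depth=40):
    """Node = ('node', mean, direction, width, {slab index -> child}) or ('leaf', ids)."""
    if len(ids) <= leaf_size or depth >= max_depth:
        return ("leaf", ids)
    mean = points.mean(axis=0)
    X = points - mean                                   # centered PCA (modification 1 above)
    _, _, Vt = np.linalg.svd(X, full_matrices=False)
    v = Vt[0]                                           # top PCA direction
    proj = X @ v
    buckets = defaultdict(list)
    for i, t in enumerate(proj):
        buckets[int(np.floor(t / width))].append(i)     # partition into slabs orthogonal to v
    children = {}
    for b, idx in buckets.items():
        idx = np.array(idx)
        snapped = X[idx] - np.outer(proj[idx], v)       # snap points to the slab hyperplane
        children[b] = build_pca_tree(snapped, ids[idx], width, leaf_size, depth + 1, max_depth)
    return ("node", mean, v, width, children)

def query_pca_tree(node, q, radius=1.0):
    """Candidate ids: follow every slab whose interval is within `radius` of q's projection."""
    if node[0] == "leaf":
        return list(node[1])
    _, mean, v, width, children = node
    t = (q - mean) @ v
    q_snapped = (q - mean) - t * v                      # snap the query the same way as the points
    out = []
    for b, child in children.items():
        if b * width - radius <= t <= (b + 1) * width + radius:   # slab b covers [b*width, (b+1)*width)
            out += query_pca_tree(child, q_snapped, radius)
    return out

# Tiny usage: candidates for a slightly perturbed data point (to be re-ranked by exact distances).
rng = np.random.default_rng(0)
pts = rng.standard_normal((500, 30))
k, eps = 5, 0.5
tree = build_pca_tree(pts, np.arange(len(pts)), width=eps / np.sqrt(k))
cands = query_pca_tree(tree, pts[7] + 0.01 * rng.standard_normal(30))
print(len(cands), "candidates; contains true neighbor:", 7 in cands)
```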
Analysis
- An "extreme" version of the Iterative PCA algorithm: just use the top PCA direction, which is guaranteed to contain signal!
- Main lemma: the tree depth is ≤ 2k
  - because each discovered direction is close to U
  - snapping is like orthogonalizing with respect to each such direction
  - cannot have too many such directions
- Query runtime: O((√k/ε)^{2k})
- Overall performs like O(k · log k)-dimensional NNS!
Wrap-up
Why do data-aware projections outperform random projections? Algorithmic framework to study this phenomenon?
- Recent development: a data-aware worst-case algorithm [Andoni-Razenshteyn'15]
- Here:
  - model: "low-dimensional signal + large noise"
  - performance like NNS in a low-dimensional space, via the "right" adaptation of PCA
- Immediate questions:
  - other, less-structured signal/noise models?
  - algorithms with runtime dependent on the spectrum?
- Broader question: analysis that explains empirical success?