Overcoming the L1 Non-Embeddability Barrier
Robert Krauthgamer (Weizmann Institute)
Joint work with Alexandr Andoni and Piotr Indyk (MIT)



Slide 2: Algorithms on Metric Spaces
Fix a metric M and a computational problem, and solve the problem under M. Example metrics: the Hamming distance; the Earth-mover distance; edit distance, ED(x,y) = minimum number of edit operations that transform x into y, where an edit operation inserts, deletes, or substitutes a character; and the Ulam metric. Example problems: computing the distance between x and y; Nearest Neighbor Search (NNS): preprocess n strings so that, given a query string, we can find the string closest to it.
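Edit distance as defined on this slide can be computed exactly by the classical quadratic-time dynamic program; a minimal sketch (this is the textbook algorithm, not the approximation algorithms discussed later in the talk):

```python
def edit_distance(x: str, y: str) -> int:
    """Minimum number of insertions/deletions/substitutions turning x into y."""
    m, n = len(x), len(y)
    prev = list(range(n + 1))  # prev[j] = ED(x[:i-1], y[:j]) while filling row i
    for i in range(1, m + 1):
        cur = [i] + [0] * n
        for j in range(1, n + 1):
            cur[j] = min(prev[j] + 1,                           # delete x[i-1]
                         cur[j - 1] + 1,                        # insert y[j-1]
                         prev[j - 1] + (x[i - 1] != y[j - 1]))  # substitute
        prev = cur
    return prev[n]
```

The NNS algorithms below avoid ever running this quadratic computation on the query against every stored string.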

Slide 3: Motivation for Nearest Neighbor
Many applications: image search (Euclidean distance, Earth-mover distance); processing of genetic information and text processing (edit distance); a generic search engine; and many others.

Slide 4: A General Tool: Embeddings
An embedding of M into a host metric (H, d_H) is a map f : M → H that preserves distances approximately. It has distortion A ≥ 1 if for all x, y: d_M(x,y) ≤ d_H(f(x), f(y)) ≤ A · d_M(x,y). Why embed? If H is "easy" (we can efficiently solve computational problems like NNS on it), then we get good algorithms for the original space M.
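The distortion in this definition can be measured directly on a finite point set; a small sketch (function names are mine, and the map is rescaled so the lower bound is tight, since distortion is scale-invariant):

```python
def distortion(points, d_M, d_H, f):
    """Smallest A >= 1 such that, after rescaling f,
    d_M(x,y) <= d_H(f(x),f(y)) <= A * d_M(x,y) for all pairs."""
    ratios = []
    for i, x in enumerate(points):
        for y in points[i + 1:]:
            ratios.append(d_H(f(x), f(y)) / d_M(x, y))
    # distortion = (max expansion) / (min expansion)
    return max(ratios) / min(ratios)
```

For example, the squaring map x ↦ x² distorts the line metric on {1, 2, 3} by a factor 5/3, while any linear map has distortion 1.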

Slide 5: Host space?
The popular target metric is ℓ1, i.e., real space with d_1(x,y) = Σ_i |x_i − y_i|. It has efficient algorithms: distance estimation in O(d) time for d-dimensional space (often less), and NNS with c-approximation in O(n^{1/c}) query time and O(n^{1+1/c}) space [IM98]. It is powerful enough for some things:

Metric | References | Upper bound | Lower bound
Edit distance over {0,1}^d | [OR05]; [KN05, KR06, AK07] | 2^{Õ(√log d)} | Ω(log d)
Ulam (= edit distance over permutations) | [CK06]; [AK07] | O(log d) | Ω̃(log d)
Block edit distance over {0,1}^d | [MS00, CM07]; [Cor03] | Õ(log d) | 4/3
Earth-mover distance in ℝ² (sets of size s) | [Cha02, IT03]; [NS07] | O(log s) | Ω(√log s)
Earth-mover distance in {0,1}^d (sets of size s) | [AIK08]; [KN05] | O(log s · log d) | Ω(log s)

Slide 6: Below logarithmic?
Below-logarithmic distortion cannot be achieved with ℓ1. Other possibilities? (ℓ2)^p, real space with dist(x,y) = ||x − y||_2^p, is bigger and algorithmically tractable, but not rich enough (often the same lower bounds apply). ℓ∞, real space with dist_∞(x,y) = max_i |x_i − y_i|, is rich (it includes all metrics), but usually not computationally efficient (high dimension). And that's roughly it, at least for efficient NNS.

Slide 7: Meet our new host
The iterated product space Ρ_{22,∞,1} has three levels, with dimensions (γ, β, α): the innermost blocks (α coordinates each) carry the ℓ1 metric d_1; β such blocks are combined by taking the maximum (ℓ∞) into d_{∞,1}; and γ copies of that are combined by squared ℓ2 into d_{22,∞,1}.
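A direct reading of this three-level construction (assuming, as the "22" notation suggests, that the outer level sums the squares of the ℓ∞-of-ℓ1 distances) can be sketched on nested lists of shape γ × β × α:

```python
def d_1(u, v):
    # l_1 distance on an innermost block of alpha reals
    return sum(abs(a - b) for a, b in zip(u, v))

def d_inf_1(u, v):
    # l_inf over the beta blocks, each compared in l_1
    return max(d_1(a, b) for a, b in zip(u, v))

def d_22_inf_1(x, y):
    # squared-l_2 combination over the gamma outer blocks
    return sum(d_inf_1(a, b) ** 2 for a, b in zip(x, y))
```

Note that with the squared outer level this is a quasi-metric rather than a metric (the triangle inequality holds only up to a constant), which is fine for the approximate-NNS purposes below.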

Slide 8: Why Ρ_{22,∞,1}? Because we can…
Theorem 1. Ulam embeds into Ρ_{22,∞,1} with O(1) distortion, with dimensions (γ, β, α) = (d, log d, d). So the host is rich. Theorem 2. Ρ_{22,∞,1} admits NNS on n points with O(log log n) approximation, O(n^ε) query time, and O(n^{1+ε}) space. So the host is algorithmically tractable. In fact, there is more for Ulam…

Slide 9: Our Algorithms for Ulam
Ulam = edit distance on strings in which each symbol appears at most once. It is a classical distance between rankings, and it exhibits the hardness of misalignments (as in general edit distance): all known lower bounds are the same as for general edit distance (up to Θ̃(·) factors), and the distortion of embedding into ℓ1 (and (ℓ2)^p, etc.) is Θ̃(log d). So if we ever hope for approximation ≪ log d for NNS under general edit distance, first we have to get it under Ulam! Our approach implies new algorithms for Ulam: (1) NNS with O(log log n) approximation and O(n^ε) query time, improvable to O(log log d) approximation; (2) sketching with O(1) approximation in log^{O(1)} d space; (3) distance estimation with O(1) approximation (for comparison, [BEKMRRS03] achieve d^ε approximation in O(d^{1−2ε}) time when ED ≈ d).
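The deletion-count characterization used on the next slides ("how many characters of y must be deleted to leave a sequence consistent with x") gives a simple exact computation of this distance via longest increasing subsequence; a sketch for permutations given as lists (function name mine):

```python
from bisect import bisect_left

def ulam(x, y):
    """Number of deletions needed so the rest of y appears in x's order
    (the deletion-count characterization of the Ulam distance)."""
    pos = {c: i for i, c in enumerate(x)}    # relabel y by position in x
    seq = [pos[c] for c in y]
    tails = []                               # patience sorting for LIS
    for v in seq:
        i = bisect_left(tails, v)
        if i == len(tails):
            tails.append(v)
        else:
            tails[i] = v
    return len(seq) - len(tails)             # d - LIS(y relative to x)
```

This runs in O(d log d) time, but it is a sequential computation; the point of the embedding below is to expose the same quantity geometrically, so that sketching and NNS become possible.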

Slide 10: Theorem 1
Theorem 1. Ulam embeds into Ρ_{22,∞,1} with O(1) distortion, with dimensions (γ, β, α) = (d, log d, d). Proof idea: a "geometrization" of known characterizations of the Ulam distance, previously studied in the context of testing monotonicity (sortedness), in sublinear algorithms [EKKRV98, ACCL04] and data-stream algorithms [GJKK07, GG07, EH08].

Slide 11: Thm 1 – Characterizing Ulam
Consider permutations x, y over [d]; assume for now that x is the identity permutation. Idea: count the number of characters of y that must be deleted to obtain an increasing sequence (this is ≈ Ulam(x,y)); call them faulty characters. Issues: the choice of characters to delete is ambiguous, and how do we count them?

Slide 12: Thm 1 – Characterization via inversions
Definition: characters a < b form an inversion if b precedes a in y. How do we identify a faulty character? "Has an inversion" doesn't work: all characters might have an inversion. "Has many inversions" can still miss faulty characters. "Has many inversions locally" has the same problem. Check whether either is true!

Slide 13: Thm 1 – Characterization via faulty characters
Definition 1: a is faulty if there exists K > 0 such that a is inverted with respect to a majority of the K symbols preceding a in y (it is enough to consider K = 2^k). Lemma [ACCL04, GJKK07]: the number of faulty characters is Θ(Ulam(x,y)). (In the slide's picture, all the characters preceding 1 form inversions with 1.)
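Definition 1 can be checked by brute force; a sketch (function name mine) for the x = identity case, where a character's value encodes its intended rank, testing the dyadic scales K = 2^k as the slide suggests:

```python
def count_faulty(y):
    """Count characters of the permutation y (vs. the identity) that are
    inverted w.r.t. a majority of the K preceding symbols, for some K = 2^k."""
    faulty = 0
    for i, a in enumerate(y):
        k = 1
        while k <= i:
            window = y[i - k:i]                    # the K symbols preceding a
            inversions = sum(1 for b in window if b > a)
            if 2 * inversions > k:                 # strict majority
                faulty += 1
                break
            k *= 2
    return faulty
```

On the reversed permutation every character except the first is faulty already at scale K = 1, matching the lemma: 4 faulty characters versus Ulam distance 4 for d = 5.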

Slide 14: Thm 1 – From characterization to embedding
To get an embedding, we need: (1) symmetrization (neither string is the identity); (2) a way to express "exists" and "majority". To resolve (1), use instead X[a;K], the set of the K symbols preceding a in x. Definition 2: a is faulty if there exists K = 2^k such that |X[a;2^k] Δ Y[a;2^k]| > 2^k (symmetric difference).
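The symmetrized Definition 2 in code, as a hedged sketch (function names mine; scales at which either string lacks K predecessors are skipped, a boundary convention the slide does not spell out):

```python
def preceding(perm, a, k):
    """The set of the k symbols preceding a in perm, or None if a has fewer."""
    i = perm.index(a)
    return set(perm[i - k:i]) if i >= k else None

def count_faulty_sym(x, y):
    """Characters a with |X[a;2^k] symmetric-difference Y[a;2^k]| > 2^k
    for some scale k (Definition 2)."""
    faulty = 0
    for a in x:
        k = 1
        while True:
            xa, ya = preceding(x, a, k), preceding(y, a, k)
            if xa is None or ya is None:
                break
            if len(xa ^ ya) > k:                   # ^ is set symmetric difference
                faulty += 1
                break
            k *= 2
    return faulty
```

The count no longer privileges either permutation: swapping x and y gives the same answer.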

Slide 15: Thm 1 – Embedding: the final step
We have: a is faulty iff there exists K = 2^k such that |X[a;2^k] Δ Y[a;2^k]| > 2^k. Replace the threshold test ("equal 1 iff true") by a weight: for character a and scale k, use |X[a;2^k] Δ Y[a;2^k]| / 2^k. In the final embedding, the symmetric difference is realized as an ℓ1 distance between scaled indicator vectors of X[a;2^k], the scales k are combined with ℓ∞ (capturing "exists"), and the characters a with squared ℓ2, so that d_{22,∞,1}(f(x), f(y)) ≈ Σ_a (max_k |X[a;2^k] Δ Y[a;2^k]| / 2^k)² ≈ # faulty characters ≈ Ulam(x,y).

Slide 16: Theorem 2
Theorem 2. Ρ_{22,∞,1} admits NNS on n points with O(log log n) approximation, O(n^ε) query time, and O(n^{1+ε}) space, for any small ε (ignoring (αβγ)^{O(1)} factors). The approach is rather general: an "LSH" on ℓ1-products of general metric spaces. Of course, one cannot literally do LSH there, but one can reduce to ℓ∞-products.

Slide 17: Thm 2 – Proof
Let's start from the basics: for ℓ1^α, [IM98] gives c-approximation with O(n^{1/c}) query time and O(n^{1+1/c}) space (ignoring α^{O(1)} factors). What about the ℓ∞-product of a metric M? [I02]: given NNS for M with c_M approximation, Q_M query time, and S_M space, one obtains NNS for the ℓ∞-product with O(c_M · log log n) approximation, Õ(Q_M) query time, and O(S_M · n^{1+ε}) space.

Slide 18: Thm 2 – What about the (ℓ2)^2-product?
It is enough to consider the (ℓ2)^2-product over an ℓ∞-product of M (for us, M is the ℓ1-product). Off-the-shelf results ([I04]) give either ~n space or worse than log n approximation. Instead, we reduce to multiple NNS queries under the ℓ∞-product. It is instructive to first look at NNS for standard ℓ1…

Slide 19: Thm 2 – Review of NNS for ℓ1
An LSH family is a collection H of hash functions such that, for a random h ∈ H (with scale parameter w > 0): Pr[h(q) = h(p)] ≈ 1 − ||q − p||_1 / w. A query just uses the primitive "return all points p such that h(q) = h(p)". One can obtain H by imposing a randomly shifted grid of side length w; then, for h defined by thresholds r_i ∈ [0, w] chosen at random, the primitive becomes "return all p such that |q_i − p_i| < r_i for all i ∈ [d]".
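The randomly shifted grid from this slide can be sketched directly (parameter names are mine; the collision-probability statement holds approximately for points at distance well below w):

```python
import random

def grid_hash(w, dim, seed=0):
    """LSH for l_1: a grid of side length w, shifted by a uniformly random
    offset in [0, w) per coordinate. For nearby points,
    Pr[h(q) = h(p)] is roughly 1 - ||q - p||_1 / w."""
    rng = random.Random(seed)
    shift = [rng.uniform(0, w) for _ in range(dim)]
    def h(p):
        # cell index of p in the shifted grid, one integer per coordinate
        return tuple(int((p[i] + shift[i]) // w) for i in range(dim))
    return h

h = grid_hash(w=10.0, dim=2, seed=1)
```

Identical points always collide, points differing by at least w in some coordinate never do, and for points at ℓ1 distance 1 with w = 10 the collision probability is 0.9 over the random shift.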

Slide 20: Thm 2 – LSH for the ℓ1-product
Intuition: abstract the LSH! Recall that for ℓ1 we had: for r_i random from [0, w], a point p is returned if |q_i − p_i| < r_i for all i. Equivalently, for all i: |q_i − p_i| / r_i < 1, i.e., max_i |q_i − p_i| / r_i < 1 — an ℓ∞-product of ℝ! For the ℓ1-product of M, the primitive becomes "return all points p such that max_i d_M(q_i, p_i) / r_i < 1".
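The abstracted primitive is just a thresholded maximum; a sketch of the acceptance predicate (function name mine; d_M is passed in, since M is an arbitrary metric):

```python
def lsh_match(q, p, r, d_M):
    """The abstract l_1-product LSH primitive: accept p iff
    max_i d_M(q[i], p[i]) / r[i] < 1, i.e., d_M(q[i], p[i]) < r[i] for all i."""
    return all(d_M(qi, pi) < ri for qi, pi, ri in zip(q, p, r))
```

Rescaling the i-th coordinate by 1/r_i turns this predicate into a unit-radius near-neighbor query in the ℓ∞-product of M, which is exactly the reduction claimed on the next slide.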

Slide 21: Thm 2 – Final step
Thus, it suffices to solve the primitive "return all points p such that max_i d_M(q_i, p_i) / r_i < 1" (in fact, for k independent choices of (r_1, …, r_d)). We have reduced NNS over the ℓ1-product of M to several instances of NNS over the ℓ∞-product of M (with appropriately scaled coordinates). The approximation is O(1) · O(log log n). Done!

Slide 22: Take-home message
Combinatorial metrics can be embedded into iterated product spaces; this works for Ulam (= edit distance on non-repetitive strings). The approach bypasses non-embeddability results for the usual-suspect spaces like ℓ1, (ℓ2)^2, … Open questions: embeddings for edit distance over {0,1}^d, EMD, and other metrics? A better understanding of product spaces? ([Jayram–Woodruff]: sketching of product spaces.)