Overcoming the L_1 Non-Embeddability Barrier
Robert Krauthgamer (Weizmann Institute)
Joint work with Alexandr Andoni and Piotr Indyk (MIT)
Algorithms on Metric Spaces
Fix a metric M: Hamming distance, Ulam metric, Earthmover distance, …
Fix a computational problem:
  Nearest Neighbor Search: preprocess n strings so that, given a query string, the closest string to it can be found
  Compute the distance between x and y
  …
Solve the problem under M
Edit distance: ED(x,y) = minimum number of edit operations that transform x into y
  edit operation = insert/delete/substitute a character
  Example (the two strings are not recovered): ED( , ) = 2
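As a concrete reading of the definition of ED, here is a minimal sketch of the textbook dynamic program for edit distance with insertions, deletions, and substitutions. The function name `edit_distance` is chosen for illustration and is not taken from the talk.

```python
def edit_distance(x: str, y: str) -> int:
    """Minimum number of insert/delete/substitute operations turning x into y."""
    # prev[j] = edit distance between the current prefix of x and y[:j]
    prev = list(range(len(y) + 1))
    for i, cx in enumerate(x, start=1):
        cur = [i]  # turning x[:i] into the empty string takes i deletions
        for j, cy in enumerate(y, start=1):
            cur.append(min(
                prev[j] + 1,               # delete cx
                cur[j - 1] + 1,            # insert cy
                prev[j - 1] + (cx != cy),  # substitute (free if equal)
            ))
        prev = cur
    return prev[-1]


# Two non-repetitive strings (the Ulam setting): move "1" to the end.
print(edit_distance("12345", "23451"))
```

This runs in O(|x|·|y|) time, which is the baseline that embedding-based approaches try to avoid for search over many strings.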
Motivation for Nearest Neighbor
Many applications:
  Image search (Euclidean distance, Earth-mover distance)
  Processing of genetic information, text processing (edit distance)
  Many others…
Figure: generic search engine
A General Tool: Embeddings
An embedding of M into a host metric (H, d_H) is a map f : M → H that preserves distances approximately
f has distortion A ≥ 1 if for all x,y:
  d_M(x,y) ≤ d_H(f(x),f(y)) ≤ A · d_M(x,y)
Why? If H is "easy" (= computational problems like NNS can be solved efficiently in it), then we get good algorithms for the original space M!
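To make the distortion definition concrete, the sketch below measures the distortion of a map on a finite point set. It assumes the points are pairwise distinct and that f may be rescaled, so the smallest valid A is the largest expansion ratio divided by the smallest one. The helper name `distortion` and the toy map are illustrative only.

```python
from itertools import combinations


def distortion(points, d_M, d_H, f):
    """Smallest A such that, after rescaling f, d_M <= d_H(f, f) <= A * d_M."""
    ratios = [d_H(f(x), f(y)) / d_M(x, y) for x, y in combinations(points, 2)]
    return max(ratios) / min(ratios)


def line_dist(a, b):
    # Distance on the real line, used for both M and H in this toy example.
    return abs(a - b)


# Toy example: squaring stretches far-apart points more than nearby ones.
print(distortion([0, 1, 2], line_dist, line_dist, lambda v: v * v))
```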
Host space?
Popular target metric: ℓ_1
  ℓ_1 = real space with d_1(x,y) = ∑_i |x_i − y_i|
Have efficient algorithms:
  Distance estimation: O(d) for d-dimensional space (often less)
  NNS: c-approx with O(n^{1/c}) query time and O(n^{1+1/c}) space [IM98]
Powerful enough for some things…
Distortion of embedding into ℓ_1:
Metric | References | Upper bound | Lower bound
Edit distance over {0,1}^d | [OR05]; [KN05, KR06, AK07] | 2^{Õ(√log d)} | Ω(log d)
Ulam (= edit distance over permutations) | [CK06]; [AK07] | O(log d) | Ω̃(log d)
Block edit distance over {0,1}^d | [MS00, CM07]; [Cor03] | Õ(log d) | 4/3
Earthmover distance in ℝ^2 (sets of size s) | [Cha02, IT03]; [NS07] | O(log s) | Ω(log^{1/2} s)
Earthmover distance in {0,1}^d (sets of size s) | [AIK08]; [KN05] | O(log s · log d) | Ω(log s)
Below logarithmic?
Cannot work with ℓ_1
Other possibilities?
  (ℓ_2)^p is bigger and algorithmically tractable, but not rich enough (often the same lower bounds)
  ℓ_∞ is rich (includes all metrics), but usually not computationally efficient (high dimension)
And that's roughly it… (at least for efficient NNS)
(ℓ_2)^p = real space with dist_{2p}(x,y) = ||x−y||_2^p
ℓ_∞ = real space with dist_∞(x,y) = max_i |x_i − y_i|
Meet our new host
Iterated product space: Ρ_{22,∞,1} = (ℓ_2)^2-product of γ copies of an ℓ_∞-product of β copies of ℓ_1^α
Figure: α-dimensional blocks with distance d_1; β such blocks combine into d_{∞,1}; γ of those combine into d_{22,∞,1}
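The nesting in the figure can be spelled out as a distance computation. The sketch below assumes the usual reading of the three levels: d_1 is the ℓ_1 distance on an α-dimensional block, d_{∞,1} is the maximum of d_1 over β blocks, and d_{22,∞,1} is the sum of squared d_{∞,1} values over γ groups. Points are represented as nested lists of shape γ × β × α; the function names are illustrative.

```python
def d_1(u, v):
    """l_1 distance between two alpha-dimensional blocks."""
    return sum(abs(a - b) for a, b in zip(u, v))


def d_inf_1(U, V):
    """l_infinity product: maximum of d_1 over the beta blocks."""
    return max(d_1(u, v) for u, v in zip(U, V))


def d_22_inf_1(X, Y):
    """(l_2)^2 product: sum of squared d_inf_1 over the gamma groups."""
    return sum(d_inf_1(U, V) ** 2 for U, V in zip(X, Y))


# gamma = 2 groups, beta = 2 blocks per group, alpha = 2 coordinates per block
X = [[[0, 0], [1, 1]], [[2, 0], [0, 0]]]
Y = [[[1, 0], [1, 3]], [[0, 0], [0, 1]]]
print(d_22_inf_1(X, Y))  # each group contributes max(1, 2)^2 = 4
```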
Why Ρ_{22,∞,1}? Because we can…
Rich: Theorem 1. Ulam embeds into Ρ_{22,∞,1} with O(1) distortion
  Dimensions (γ,β,α) = (d, log d, d)
Algorithmically tractable: Theorem 2. Ρ_{22,∞,1} admits NNS on n points with
  O(log log n) approximation
  O(n^ε) query time and O(n^{1+ε}) space
In fact, there is more for Ulam…
Our Algorithms for Ulam
Ulam = edit distance on strings where each symbol appears at most once
  A classical distance between rankings
  Exhibits hardness of misalignments (as in general edit)
  All lower bounds same as for general edit (up to Θ̃())
  Distortion of embedding into ℓ_1 (and (ℓ_2)^p, etc.): Θ̃(log d)
Our approach implies new algorithms for Ulam:
  1. NNS with O(log log n) approx, O(n^ε) query time
     Can improve to O(log log d) approx
  2. Sketching with O(1)-approx in log^{O(1)} d space
  3. Distance estimation with O(1)-approx in time [expression not recovered]
     [BEKMRRS03]: when ED ≈ d, approx d^ε in O(d^{1−2ε}) time
If we ever hope for approximation << log d for NNS under general edit, first we have to get it under Ulam!
Theorem 1
Theorem 1. Can embed Ulam into Ρ_{22,∞,1} with O(1) distortion
  Dimensions (γ,β,α) = (d, log d, d)
Proof: "Geometrization" of Ulam characterizations
  Previously studied in the context of testing monotonicity (sortedness):
    Sublinear algorithms [EKKRV98, ACCL04]
    Data-stream algorithms [GJKK07, GG07, EH08]
Thm 1: Characterizing Ulam
Consider permutations x,y over [d]
Assume for now: x = identity permutation
Idea: count the number of characters in y that must be deleted to obtain an increasing sequence (≈ Ulam(x,y))
  Call them faulty characters
Issues:
  Ambiguity…
  How do we count them?
Figure: example permutations x and y (not recovered)
Thm 1: Characterization – inversions
Definition: chars a<b form an inversion if b precedes a in y
How to identify a faulty char?
  Has an inversion? Doesn't work: all chars might have an inversion
  Has many inversions? Can still miss "faulty" chars
  Has many inversions locally? Same problem
  Check if either is true!
Figure: example permutations x and y (not recovered)
Thm 1: Characterization – faulty chars
Definition 1: a is faulty if there exists K>0 such that a is inverted w.r.t. a majority of the K symbols preceding a in y (ok to consider only K = 2^k)
Lemma [ACCL04, GJKK07]: # faulty chars = Θ(Ulam(x,y))
Figure: the characters preceding 1 (all form inversions with 1)
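Definition 1 can be checked directly on small permutations. The sketch below assumes x is the identity, tries every window length K (not only powers of two), and reads "majority" as strictly more than K/2. For comparison it also computes the number of deletions needed to sort y, which is d minus the length of a longest increasing subsequence and is within a constant factor of Ulam(x,y). Function names are illustrative.

```python
from bisect import bisect_left


def faulty_chars(y):
    """Characters of y that satisfy Definition 1, with x the identity."""
    faulty = []
    for i, a in enumerate(y):
        inversions = 0
        for K in range(1, i + 1):
            # Extend the window to the K symbols preceding a.
            if y[i - K] > a:
                inversions += 1
            if 2 * inversions > K:  # a is inverted w.r.t. a strict majority
                faulty.append(a)
                break
    return faulty


def deletions_to_sort(y):
    """len(y) minus the length of a longest increasing subsequence of y."""
    tails = []  # tails[j] = smallest last element of an increasing run of length j+1
    for a in y:
        j = bisect_left(tails, a)
        if j == len(tails):
            tails.append(a)
        else:
            tails[j] = a
    return len(y) - len(tails)


y = [5, 1, 2, 3, 4]
print(faulty_chars(y), deletions_to_sort(y))
```

In this example the definition flags 1 rather than 5, although deleting 5 is what sorts y. This is the ambiguity mentioned earlier: the lemma only promises that the counts agree up to constants, not that the same characters are identified.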
Thm 1: Characterization → Embedding
To get an embedding, need:
  1. Symmetrization (neither string is the identity)
  2. Deal with "exists", "majority"…?
To resolve (1), use instead X[a;K] …
Definition 2: a is faulty if there exists K = 2^k such that |X[a;2^k] Δ Y[a;2^k]| > 2^k (Δ = symmetric difference)
Figure: the sets X[5;4] and Y[5;4]
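A small sketch of Definition 2 follows. It rests on two assumptions that the slide does not state: X[a;K] is taken to be the set of K symbols immediately preceding a in x, and positions before the start of a string are filled with dummy symbols shared by both strings (so identical strings yield no faulty characters). Both choices, and the function names, are hypothetical.

```python
def preceding(x, a, K):
    """Assumed meaning of X[a;K]: the K symbols right before a in x.

    Positions before the start of x are filled with shared dummy symbols
    ("pad", 1), ("pad", 2), ... (an assumption made for this sketch).
    """
    i = x.index(a)
    return {x[j] if j >= 0 else ("pad", -j) for j in range(i - K, i)}


def faulty_symmetric(x, y):
    """Characters a with |X[a;K] symmetric-difference Y[a;K]| > K for some K = 2^k."""
    faulty = []
    for a in x:
        K = 1
        while K <= len(x):
            if len(preceding(x, a, K) ^ preceding(y, a, K)) > K:
                faulty.append(a)
                break
            K *= 2
    return faulty


x = [1, 2, 3, 4, 5]
y = [2, 3, 4, 5, 1]
print(faulty_symmetric(x, y))
```

The definition is symmetric in x and y, which is what step (1) asks for: neither string has to be the identity.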
Thm 1: Embedding – final step
We have: [formula not recovered; it contains a term that equals 1 iff the condition is true]
Replace by weight?
Final embedding: [formula not recovered; it involves the sets X[5;2^2] and Y[5;2^2] and a squared term ( )^2]
Theorem 2
Theorem 2. Ρ_{22,∞,1} admits NNS on n points with
  O(log log n) approximation
  O(n^ε) query time and O(n^{1+ε}) space, for any small ε (ignoring (αβγ)^{O(1)} factors)
A rather general approach
  "LSH" on ℓ_1-products of general metric spaces
  Of course, this cannot be done directly, but it can be reduced to ℓ_∞-products
Thm 2: Proof
Let's start from basics: ℓ_1^α
  [IM98]: c-approx with O(n^{1/c}) query time and O(n^{1+1/c}) space (ignoring α^{O(1)} factors)
Ok, what about the ℓ_∞-product of M?
  Suppose: NNS for M with
    c_M-approx
    Q_M query time
    S_M space
  Then: NNS for the ℓ_∞-product of M with
    O(c_M · log log n)-approx
    Õ(Q_M) query time
    O(S_M · n^{1+ε}) space
  [I02]
Thm 2: What about (ℓ_2)^2-product?
Enough to consider [space not recovered] (for us, M is the ℓ_1-product)
Off-the-shelf?
  [I04]: gives space ~n or >log n approximation
We reduce to multiple NNS queries under [space not recovered]
Instructive to first look at NNS for standard ℓ_1…
Thm 2: Review of NNS for ℓ_1
LSH family: collection H of hash functions such that, for random h ∈ H (parameter w > 0):
  Pr[h(q)=h(p)] ≈ 1 − ||q−p||_1 / w
Query just uses the primitive: "return all points p such that h(q)=h(p)"
Can obtain H by imposing a randomly-shifted grid of side-length w
Then, for h defined by r_i ∈ [0,w] chosen at random, the primitive becomes: "return all p s.t. |q_i − p_i| < r_i for all i ∈ [d]"
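The randomly-shifted grid can be simulated to see the collision probability. The sketch below is a minimal illustration, not the data structure from [IM98]: it hashes a point to its grid cell under a random shift per coordinate and estimates how often two points share a cell. The side length is called `w` here as an assumed name; the per-coordinate collision probability is 1 − |q_i − p_i|/w, so the product is close to 1 − ||q−p||_1/w when the points are near each other relative to w.

```python
import math
import random


def make_grid_hash(dim, w, rng):
    """Hash a point to its cell in a grid of side w with random per-axis shifts."""
    shifts = [rng.uniform(0, w) for _ in range(dim)]

    def h(p):
        return tuple(math.floor((p[i] + shifts[i]) / w) for i in range(dim))

    return h


def collision_rate(p, q, w, trials=20000, seed=0):
    """Empirical Pr[h(p) == h(q)] over independently drawn grid hashes."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        h = make_grid_hash(len(p), w, rng)
        hits += h(p) == h(q)
    return hits / trials


p, q, w = (0.0, 0.0), (1.0, 0.5), 10.0
l1 = sum(abs(a - b) for a, b in zip(p, q))
# The estimate should be close to (1 - 1/10) * (1 - 0.5/10), i.e. near 1 - l1/w.
print(collision_rate(p, q, w), 1 - l1 / w)
```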
Thm 2: LSH for ℓ_1-product
Intuition: abstract LSH!
Recall we had: for r_i random from [0,w], point p is returned if for all i: |q_i − p_i| < r_i
Equivalently, for all i: |q_i − p_i| / r_i < 1
  An ℓ_∞ product of ℝ!
For ℓ_1: "return all p s.t. |q_i − p_i| < r_i for all i ∈ [d]"
For the ℓ_1-product of M: "return all points p such that max_i d_M(q_i,p_i)/r_i < 1"
Thm 2: Final
Thus, sufficient to solve the primitive: "return all points p such that max_i d_M(q_i,p_i)/r_i < 1" (in fact, for k independent choices of (r_1,…,r_d))
We reduced NNS over the ℓ_1-product of M to several instances of NNS over the ℓ_∞-product of M (with appropriately scaled coordinates)
Approximation is O(1) · O(log log n)
Done!
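The primitive is a range query in an ℓ_∞-product with coordinate i scaled by 1/r_i. The brute-force sketch below only illustrates what the primitive returns; the actual result relies on an NNS data structure for ℓ_∞-products rather than a linear scan. It assumes every r_i is strictly positive, and the names `scaled_linf_ball` and `l1` are illustrative.

```python
def scaled_linf_ball(points, q, d_M, r):
    """Return all p with max_i d_M(q_i, p_i) / r_i < 1 (each r_i assumed > 0)."""
    return [
        p for p in points
        if max(d_M(qi, pi) / ri for qi, pi, ri in zip(q, p, r)) < 1
    ]


def l1(u, v):
    # Here the base metric M is l_1 on the plane, as one possible choice.
    return sum(abs(a - b) for a, b in zip(u, v))


q = ((0.0, 0.0), (0.0, 0.0))
A = ((0.5, 0.0), (1.0, 0.5))   # coordinate distances 0.5 and 1.5
B = ((1.0, 0.5), (0.0, 0.0))   # first coordinate distance 1.5 exceeds r_1
C = ((0.0, 0.9), (2.0, 0.0))   # second coordinate distance 2.0 is not < r_2
print(scaled_linf_ball([A, B, C], q, l1, r=(1.0, 2.0)))
```

Repeating the query for k independent draws of (r_1,…,r_d) and taking the union of the results mirrors the "k independent choices" remark.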
Take-home message:
Can embed combinatorial metrics into iterated product spaces
  Works for Ulam (= edit distance on non-repetitive strings)
  Approach bypasses non-embeddability results into usual-suspect spaces like ℓ_1, (ℓ_2)^2, …
Open:
  Embeddings for edit distance over {0,1}^d, EMD, other metrics?
  Understanding product spaces? [Jayram-Woodruff]: sketching