Embedding and Sketching: Non-normed spaces
Alexandr Andoni (MSR)

Embedding / Sketching
- Definition: an embedding is a map f: M → H of a metric (M, d_M) into a host metric (H, ρ_H) such that for any x, y ∈ M:
    d_M(x,y) ≤ ρ_H(f(x), f(y)) ≤ D · d_M(x,y),
  where D is the distortion (approximation) of the embedding f.
- Embeddings come in all shapes and colors:
  - Source/host spaces M, H
  - Distortion D
  - Can be randomized: ρ_H(f(x), f(y)) ≈ d_M(x,y) with probability 1 − δ
  - Can be non-oblivious: given a set S ⊆ M, compute f(x) (depends on the entire S)
  - Time to compute f(x)
  - …
- Types of embeddings:
  - From a norm (ℓ_1) into another norm (ℓ_∞)
  - From a norm to the same norm but of lower dimension (dimension reduction)
  - From non-norms (Earth-Mover Distance, edit distance) into a norm (ℓ_1)
  - From a given finite metric (shortest path on a planar graph) into a norm (ℓ_1)
  - …
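
To make the definition concrete, here is a minimal Python sketch that measures the distortion of a map on a finite point set. The helper name `distortion` and the three example points are illustrative assumptions, not taken from the slides. The function allows f to be rescaled, so it reports the ratio between the largest and smallest stretch over all pairs.

```python
from itertools import combinations

def distortion(points, f, d_source, d_host):
    """Smallest D with d_source <= c * d_host <= D * d_source for some scale c.

    Assumes the points are distinct and f does not collapse any pair.
    """
    ratios = [d_host(f(x), f(y)) / d_source(x, y)
              for x, y in combinations(points, 2)]
    return max(ratios) / min(ratios)

def l1(p, q):
    return sum(abs(a - b) for a, b in zip(p, q))

def linf(p, q):
    return max(abs(a - b) for a, b in zip(p, q))

# Identity map from (R^2, l_1) to (R^2, l_inf) on three sample points.
pts = [(0, 0), (1, 0), (1, 1)]
print(distortion(pts, lambda p: p, l1, linf))  # 2.0
```

On these points the pair (0,0), (1,1) shrinks by a factor of 2 while the other pairs are preserved, which gives distortion 2.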

Earth-Mover Distance
- Definition:
  - Given two sets A, B of points in a metric space
  - EMD(A,B) = min-cost bipartite matching between A and B
- Which metric space?
  - Can be the plane, ℓ_2, ℓ_1, …
- Applications in computer vision
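
As a small illustration of the definition, the sketch below computes EMD between two equal-size point sets by trying every matching. The function names and the choice of ℓ_1 as the ground metric are assumptions for the example; brute force is only practical for a handful of points.

```python
from itertools import permutations

def l1(p, q):
    """l_1 distance between two points in the plane."""
    return abs(p[0] - q[0]) + abs(p[1] - q[1])

def emd(A, B, dist=l1):
    """Cost of a min-cost perfect matching between A and B (brute force)."""
    assert len(A) == len(B), "this sketch assumes equal-size sets"
    return min(sum(dist(a, b) for a, b in zip(A, perm))
               for perm in permutations(B))

A = [(0, 0), (1, 1)]
B = [(0, 1), (2, 1)]
print(emd(A, B))  # 2: match (0,0)-(0,1) and (1,1)-(2,1)
```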

Planar EMD
- Consider EMD on the grid [Δ]×[Δ], with sets of size s
- What do we want to do?
  - Compute EMD between two sets (min-cost bichromatic matching)
  - Closest pair, nearest neighbor search, etc.
- What can we do?
  - Exact computation: O(s^(2+ε)) time [AES95]
  - No non-trivial (exact) nearest neighbor search
  - In fact, at least as hard as Hamming space of dimension Ω(Δ^2)

Approximate algorithms via embedding

A couple of definitions

EMD over small grid
- Suppose Δ = 3
- How to embed A, B in [3]^2 into ℓ_1 with distortion O(1)?
- f(A) has nine coordinates, counting the number of points at each grid point
  - f(A) = (2,1,1,0,0,0,1,0,0)
  - f(B) = (1,1,0,0,2,0,0,0,1)
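
The count-vector embedding for Δ = 3 can be sketched in a few lines of Python. The point sets below and the row-major ordering of the nine coordinates are assumptions chosen so that the vectors match f(A) and f(B) from the slide; the slide itself does not say where the points lie.

```python
def count_vector(points, delta=3):
    """Map a multiset of grid points (row, col) to its vector of counts."""
    vec = [0] * (delta * delta)
    for row, col in points:
        vec[row * delta + col] += 1   # row-major coordinate (assumed order)
    return vec

# Hypothetical point sets that reproduce the vectors on the slide.
A = [(0, 0), (0, 0), (0, 1), (0, 2), (2, 0)]
B = [(0, 0), (0, 1), (1, 1), (1, 1), (2, 2)]

fA, fB = count_vector(A), count_vector(B)
l1_dist = sum(abs(u - v) for u, v in zip(fA, fB))
print(fA, fB, l1_dist)  # [2,1,1,0,0,0,1,0,0] [1,1,0,0,2,0,0,0,1] 6
```

Half of the ℓ_1 distance (here 3) is the number of points that cannot be matched to a coinciding point. Each of them moves at least 1 and at most the diameter of the 3×3 grid, which is why the distortion is a constant.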

Embedding EMD([Δ]^2) into ℓ_1
- Sets of size s in the [1…Δ]×[1…Δ] box
- Embedding of a set A:
  - Impose a randomly-shifted grid
  - Each grid cell gives a coordinate: f(A)_c = number of points in the cell c
  - Subpartition the grid recursively, and assign new coordinates for each new cell (on all levels)
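
A rough Python sketch of this recursive grid embedding follows. Several details are assumptions rather than something stated on the slide: cell sides double from level to level, the count in a cell of side `side` is weighted by `side` (the usual Indyk-Thaper-style weighting), and one integer shift is shared by all levels. The same shift must be used for every set being compared.

```python
import random
from collections import Counter

def grid_embedding(points, delta, shift):
    """Sparse l_1 vector for a multiset of integer points in [0, delta)^2.

    One coordinate per (side, cell); a point adds `side` to the coordinate
    of the cell containing it, for cell sides 1, 2, 4, ... up to delta.
    """
    sx, sy = shift
    vec = Counter()
    side = 1
    while side <= delta:
        for x, y in points:
            cell = ((x + sx) // side, (y + sy) // side)
            vec[(side, cell)] += side
        side *= 2
    return vec

def l1_distance(u, v):
    """l_1 distance between two sparse vectors (missing keys count as 0)."""
    return sum(abs(u[key] - v[key]) for key in u.keys() | v.keys())

delta = 4
shift = (random.randrange(delta), random.randrange(delta))
fA = grid_embedding([(0, 0), (3, 1)], delta, shift)
fB = grid_embedding([(1, 0), (3, 3)], delta, shift)
print(l1_distance(fA, fB))
```

Each point touches one cell per level, so a set of size s produces at most s times the number of levels non-zero coordinates.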

Main Approach
[Figure: pictures joined by "+" and "≈"]

Decomposition Lemma [I07]
- For a randomly-shifted cut-grid G of side length k, we have:
  1. EEMD_Δ(A,B) ≤ EEMD_k(A_1, B_1) + EEMD_k(A_2, B_2) + … + k·EEMD_{Δ/k}(A_G, B_G)
  2. 3·EEMD_Δ(A,B) ≥ E[ EEMD_k(A_1, B_1) + EEMD_k(A_2, B_2) + … ]
  3. EEMD_Δ(A,B) ≥ E[ k·EEMD_{Δ/k}(A_G, B_G) ]
- The main embedding will follow by applying the lemma recursively to (A_G, B_G)
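
The objects in the lemma can be sketched in Python as follows. This is one reading of the notation and not the author's code: A_i is taken to be the points of A in cell i, in coordinates local to that cell, and A_G is the multiset of cell indices, one per point, living on the coarse grid of side about Δ/k. The function name `decompose` is hypothetical.

```python
from collections import defaultdict

def decompose(points, k, shift):
    """Split a point multiset by a cut-grid of side k shifted by `shift`.

    Returns (cells, coarse):
      cells  maps each cell index to its points in local coordinates [0, k)^2
      coarse lists the cell index of every point (a multiset on the coarse grid)
    """
    sx, sy = shift
    cells = defaultdict(list)
    coarse = []
    for x, y in points:
        cx, cy = (x + sx) // k, (y + sy) // k
        cells[(cx, cy)].append(((x + sx) % k, (y + sy) % k))
        coarse.append((cx, cy))
    return dict(cells), coarse

cells, coarse = decompose([(0, 0), (1, 2), (3, 3)], k=3, shift=(0, 0))
print(cells)   # {(0, 0): [(0, 0), (1, 2)], (1, 1): [(0, 0)]}
print(coarse)  # [(0, 0), (0, 0), (1, 1)]
```

Applying `decompose` again to `coarse` gives the next level of the recursion.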

Proof of Decomposition Lemma: Part 1
- For a randomly-shifted cut-grid G of side length k, we have:
    EEMD_Δ(A,B) ≤ EEMD_k(A_1, B_1) + EEMD_k(A_2, B_2) + … + k·EEMD_{Δ/k}(A_G, B_G)
- Extract a matching π from the matchings on the right-hand side
- For each a ∈ A, with a ∈ A_i, it is either:
  - matched in EEMD(A_i, B_i) to some b ∈ B_i
  - or a ∈ A_i \ B_i, and it is matched in EEMD(A_G, B_G) to some b ∈ B_j
- Match cost of a (2nd case):
  - Move a to the center of its cell: paid by EEMD(A_i, B_i)
  - Move from cell i to cell j: paid by EEMD(A_G, B_G)
- The ||A| − |B|| extra points each pay k·(Δ/k) = Δ

Proof of Decomposition Lemma: Part 2 & 3
- For a randomly-shifted cut-grid G of side length k, we have:
  - 3·EEMD_Δ(A,B) ≥ E[ EEMD_k(A_1, B_1) + EEMD_k(A_2, B_2) + … ]
  - EEMD_Δ(A,B) ≥ E[ k·EEMD_{Δ/k}(A_G, B_G) ]
- Fix a matching π minimizing EEMD_Δ(A,B)
  - Will construct matchings for each EEMD on the RHS
- Uncut pairs (a,b) are matched in their respective (A_i, B_i)
- Cut pairs (a,b) are matched in (A_G, B_G) and remain unmatched in their mini-grids

Part 2: 3·EEMD_Δ(A,B) ≥ E[ ∑_i EEMD_k(A_i, B_i) ]
- Uncut pairs (a,b) are matched in their respective (A_i, B_i)
  - Contribute a total ≤ EEMD_Δ(A,B)
- Consider a cut pair (a,b) at distance a − b = (d_x, d_y)
  - Contribute ≤ 2k to ∑_i EEMD_k(A_i, B_i)
  - Pr[(a,b) cut] = 1 − (1 − d_x/k)(1 − d_y/k) ≤ (d_x + d_y)/k
  - Expected contribution ≤ Pr[(a,b) cut]·2k ≤ 2(d_x + d_y) = 2‖a − b‖_1
  - In total, cut pairs contribute ≤ 2·EEMD_Δ(A,B)
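
The cut probability used in Part 2 can be checked exactly on a small case. The sketch below assumes integer points and a shift drawn uniformly from {0, …, k−1}^2, which is an illustrative discretization of the random shift; the function name is hypothetical.

```python
from fractions import Fraction

def cut_probability(a, b, k):
    """Exact Pr[a and b fall in different cells] over all k*k integer shifts."""
    cut = 0
    for sx in range(k):
        for sy in range(k):
            cell_a = ((a[0] + sx) // k, (a[1] + sy) // k)
            cell_b = ((b[0] + sx) // k, (b[1] + sy) // k)
            cut += cell_a != cell_b
    return Fraction(cut, k * k)

# d_x = 1, d_y = 2, k = 3: 1 - (1 - 1/3)(1 - 2/3) = 7/9 <= (1 + 2)/3
p = cut_probability((0, 0), (1, 2), 3)
print(p)  # 7/9
```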

Part 3: EEMD_Δ(A,B) ≥ E[ k·EEMD_{Δ/k}(A_G, B_G) ]
- All uncut pairs contribute zero to k·EEMD_{Δ/k}(A_G, B_G)
- For a cut pair at distance a − b = (d_x, d_y):
  - if d_x = x·k + r_x and d_y = y·k + r_y, then
  - expected cost ≤ (x + r_x/k)·k + (y + r_y/k)·k = d_x + d_y = ‖a − b‖_1
- Total expected cost ≤ EEMD_Δ(A,B)
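
The expectation in Part 3 can also be verified by enumeration. As a hedged sketch, the code below again assumes integer points and a uniformly random integer shift in {0, …, k−1}^2, and computes the expected value of k times the ℓ_1 distance between the cells of a and b.

```python
from fractions import Fraction

def expected_coarse_cost(a, b, k):
    """Exact E[k * (l_1 distance between the cells of a and b)] over integer shifts."""
    total = 0
    for sx in range(k):
        for sy in range(k):
            dx = abs((a[0] + sx) // k - (b[0] + sx) // k)
            dy = abs((a[1] + sy) // k - (b[1] + sy) // k)
            total += k * (dx + dy)
    return Fraction(total, k * k)

# d_x = 4 = 1*3 + 1 and d_y = 2 = 0*3 + 2 with k = 3:
# (1 + 1/3)*3 + (0 + 2/3)*3 = 4 + 2 = 6 = ||a - b||_1
print(expected_coarse_cost((0, 0), (4, 2), 3))  # 6
```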

Embedding into ℓ_1 using the Decomposition Lemma
- For a randomly-shifted cut-grid G of side length k, we have:
  - EEMD_Δ(A,B) ≤ ∑_i EEMD_k(A_i, B_i) + k·EEMD_{Δ/k}(A_G, B_G)
  - 3·EEMD_Δ(A,B) ≥ E[ ∑_i EEMD_k(A_i, B_i) ]
  - EEMD_Δ(A,B) ≥ E[ k·EEMD_{Δ/k}(A_G, B_G) ]
- To embed into ℓ_1, we apply it recursively for k = 3:
  - Choose a randomly-shifted cut-grid G_1 on [Δ]^2
    - Obtain many grids [3]^2, and a big grid [Δ/3]^2
  - Then choose a randomly-shifted cut-grid G_2 on [Δ/3]^2
    - Obtain more grids [3]^2, and another big grid [Δ/3^2]^2
  - Then choose a randomly-shifted cut-grid G_3 on [Δ/9]^2
  - …
- Then embed each of the small grids [3]^2 into ℓ_1, using the O(1)-distortion embedding, and concatenate the embeddings

Proving recursion works
- Embedding does not contract distances:
    EEMD_Δ(A,B)
      ≤ ∑_i EEMD_k(A_i, B_i) + k·EEMD_{Δ/k}(A_G1, B_G1)
      ≤ ∑_i EEMD_k(A_i, B_i) + k·∑_i EEMD_k(A_{G1,i}, B_{G1,i}) + k^2·EEMD_{Δ/k^2}(A_G2, B_G2)
      ≤ …
- Embedding distorts distances by O(log Δ), in expectation:
    (3 log_k Δ)·EEMD_Δ(A,B)
      ≥ 3·EEMD_Δ(A,B) + (3 log_k(Δ/k))·EEMD_Δ(A,B)
      ≥ E[ ∑_i EEMD_k(A_i, B_i) + (3 log_k(Δ/k))·k·EEMD_{Δ/k}(A_G1, B_G1) ]
      ≥ …
- By Markov's inequality, the distortion is O(log Δ) with 90% probability
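
Unrolled over all levels, the argument can be written out as below. This is a worked sketch under stated assumptions: $\Delta = k^t$, $|A| = |B|$, $(A^{(0)}, B^{(0)}) = (A,B)$, $(A^{(j)}, B^{(j)})$ denotes the coarse sets after $j$ applications of the lemma, and $S_j = \sum_i \mathrm{EEMD}_k(A^{(j)}_i, B^{(j)}_i)$ is the small-grid sum at level $j$. This indexing is introduced here for illustration and is not the slide's notation.

```latex
% Non-contraction: apply part 1 of the lemma t times.
% The leftover term k^t EEMD_1(A^{(t)}, B^{(t)}) is zero when |A| = |B|.
\[
  \mathrm{EEMD}_\Delta(A,B) \;\le\; \sum_{j=0}^{t-1} k^{j} S_j .
\]
% Expected expansion: part 2 at level j, then part 3 applied j times.
\[
  \mathbb{E}\bigl[k^{j} S_j\bigr]
  \;\le\; 3\,\mathbb{E}\bigl[k^{j}\,\mathrm{EEMD}_{\Delta/k^{j}}(A^{(j)},B^{(j)})\bigr]
  \;\le\; 3\,\mathrm{EEMD}_\Delta(A,B),
\]
\[
  \mathbb{E}\Bigl[\sum_{j=0}^{t-1} k^{j} S_j\Bigr]
  \;\le\; 3t\,\mathrm{EEMD}_\Delta(A,B)
  \;=\; (3\log_k \Delta)\,\mathrm{EEMD}_\Delta(A,B).
\]
% Markov's inequality with threshold 10 times the expectation bound.
\[
  \Pr\Bigl[\sum_{j=0}^{t-1} k^{j} S_j > 30\,t\,\mathrm{EEMD}_\Delta(A,B)\Bigr] \;\le\; \tfrac{1}{10}.
\]
```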

Final theorem
- Theorem: can embed EMD over [Δ]^2 into ℓ_1 with O(log Δ) distortion.
  - Dimension required: O(Δ^2), but a set A of size s maps to a vector that has only O(s·log Δ) non-zero coordinates.
  - Time: can compute in O(s·log Δ)
  - Randomized: does not contract, but large distortion happens with probability < 10%
- Applications:
  - Can approximate EMD(A,B) within O(log Δ) in time O(s·log Δ)
  - NNS: O(c·log Δ) approximation, with O(n^(1+1/c)·s) space and O(n^(1/c)·s·log Δ) query time.

Embeddings of various metrics
Embeddings into ℓ_1:

Metric | Upper bound | Lower bound
Earth-mover distance (s-sized sets in 2D plane) | O(log s) [Cha02, IT03] | …
Earth-mover distance (s-sized sets in {0,1}^d) | O(log s · log d) [AIK08] | Ω(log s) [KN05]
Edit distance over {0,1}^d (= #indels to transform x -> y) | … | Ω(log d) [KN05, KR06]
Ulam (edit distance between non-repetitive strings) | O(log d) [CK06] | Ω̃(log d) [AK07]
Block edit distance | Õ(log d) [MS00, CM07] | 4/3 [Cor03]

Curse of non-embeddability into ℓ_1?
- ℓ_1 is a natural target for many metrics, and we have algorithms for it
- Will see two examples of "going beyond ℓ_1":
  - Sketching for EMD
  - Embedding of the Ulam metric into product spaces
- These enable (weaker) results for NNS

Sketching EMD

How to obtain a sketch for EMD

Ulam metric
[Figure: a pair of strings with ED = 2]

Some Open Questions on non-normed metrics

Metric | Upper bound | Lower bound
Earth-mover distance (s-sized sets in 2D plane) | O(log s) [Cha02, IT03] | …
Earth-mover distance (s-sized sets in {0,1}^d) | O(log s · log d) [AIK08] | Ω(log s) [KN05]
Edit distance over {0,1}^d (= #indels to transform x -> y) | … | Ω(log d) [KN05, KR06]
Ulam (edit distance between non-repetitive strings) | O(log d) [CK06] | Ω̃(log d) [AK07]
Block edit distance | Õ(log d) [MS00, CM07] | 4/3 [Cor03]

What I didn’t talk about:

Bibliography 1
[AES95] P. K. Agarwal, A. Efrat, M. Sharir. Vertical decomposition of shallow levels in 3-dimensional arrangements and its applications. SoCG95. SICOMP 00.
[Cha02] M. Charikar. Similarity estimation techniques from rounding. STOC02.
[IT03] P. Indyk, N. Thaper. Fast color image retrieval via embeddings. Workshop on Statistical and Computational Theories in Vision (ICCV).
[I07] P. Indyk. A near linear time constant factor approximation for Euclidean bichromatic matching (cost). SODA07.
[ADIW09] A. Andoni, K. Do Ba, P. Indyk, D. Woodruff. Efficient sketches for Earth-Mover Distance, with applications. FOCS09.
[VZ] E. Verbin, Q. Zhang. Rademacher-Sketch: A dimensionality-reducing embedding for sum-product norms, with an application to Earth-Mover Distance. Manuscript 2011.

Bibliography 2
[AIK08] A. Andoni, P. Indyk, R. Krauthgamer. Earth-mover distance over high-dimensional spaces. SODA08.
[OR05] R. Ostrovsky, Y. Rabani. Low distortion embedding for edit distance. STOC05. JACM.
[CK06] M. Charikar, R. Krauthgamer. Embedding the Ulam metric into ell_1. ToC.
[MS00] S. Muthukrishnan, S. C. Sahinalp. Approximate nearest neighbors and sequence comparison with block operations. STOC00.
[CM07] G. Cormode, S. Muthukrishnan. The string edit distance matching problem with moves. TALG. SODA02.
[NS07] A. Naor, G. Schechtman. Planar earthmover is not in L_1. FOCS06. SICOMP.
[KN05] S. Khot, A. Naor. Nonembeddability theorems via Fourier analysis. Math. Ann. FOCS05.
[KR06] R. Krauthgamer, Y. Rabani. Improved lower bounds for embeddings into L_1. SODA06.
[AK07] A. Andoni, R. Krauthgamer. The computational hardness of estimating edit distance. FOCS07. SICOMP10.
[Cor03] G. Cormode. Sequence Distance Embeddings. PhD Thesis.
[AIK09] A. Andoni, P. Indyk, R. Krauthgamer. Overcoming the ell_1 non-embeddability barrier: algorithms for product metrics. SODA09.

Bibliography 3
[LLR] N. Linial, E. London, Y. Rabinovich. The geometry of graphs and some of its algorithmic applications. FOCS94.
[Bou] J. Bourgain. On Lipschitz embedding of finite metric spaces into Hilbert space. Israel J Math.
[Rao] S. Rao. Small distortion and volume preserving embeddings for planar and Euclidean metrics. SoCG.
[ARV] S. Arora, S. Rao, U. Vazirani. Expander flows, geometric embeddings and graph partitioning. STOC04. JACM.
[ALN] S. Arora, J. Lee, A. Naor. Euclidean distortion and sparsest cut. STOC05.