1 Approximating Edit Distance in Near-Linear Time Alexandr Andoni (MIT) Joint work with Krzysztof Onak (MIT)

Presentation transcript:

1 Approximating Edit Distance in Near-Linear Time Alexandr Andoni (MIT) Joint work with Krzysztof Onak (MIT)

2 Edit Distance
For two strings x, y ∈ ∑^n:
- ed(x,y) = minimum number of edit operations to transform x into y
- Edit operations = insertion/deletion/substitution
Important in: computational biology, text processing, etc.
Example: ED( , ) = 2

3 Computing Edit Distance
Problem: compute ed(x,y) for given x, y ∈ {0,1}^n
Exactly:
- O(n^2) [Levenshtein’65]
- O(n^2 / log^2 n) for |∑| = O(1) [Masek-Paterson’80]
Approximately in n^{1+o(1)} time:
- n^{1/3+o(1)} approximation [Batu-Ergun-Sahinalp’06], improving over [Myers’86, BarYossef-Jayram-Krauthgamer-Kumar’04]
Sublinear time:
- ≤ n^{1-ε} vs ≥ n/100 in n^{1-2ε} time [Batu-Ergun-Kilian-Magen-Raskhodnikova-Rubinfeld-Sami’03]
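The quadratic-time exact computation listed on this slide can be illustrated with the textbook dynamic program. The following is a minimal sketch, not code from the talk; the function name `edit_distance` and the two-row layout are choices made here for illustration.

```python
def edit_distance(x: str, y: str) -> int:
    """Textbook O(|x|*|y|) dynamic program, keeping only two rows of the table."""
    # prev[j] = edit distance between the empty prefix of x and y[:j]
    prev = list(range(len(y) + 1))
    for i, cx in enumerate(x, start=1):
        curr = [i]  # turning x[:i] into the empty string takes i deletions
        for j, cy in enumerate(y, start=1):
            curr.append(min(
                prev[j] + 1,                # delete cx
                curr[j - 1] + 1,            # insert cy
                prev[j - 1] + (cx != cy),   # substitute (free when characters match)
            ))
        prev = curr
    return prev[-1]


if __name__ == "__main__":
    print(edit_distance("kitten", "sitting"))
```

The two nested loops are what give the O(n^2) running time that the approximation algorithm in this talk is designed to avoid.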

4 Computing via embedding into ℓ_1
Embedding: f: {0,1}^n → ℓ_1
- such that ed(x,y) ≈ ||f(x) - f(y)||_1
- up to some distortion (= approximation)
- Can compute ed(x,y) in the time needed to compute f(x)
Best embedding by [Ostrovsky-Rabani’05]:
- distortion = 2^{Õ(√log n)}
- Computation time: ~n^2 randomized (and similar dimension)
- Helps for nearest neighbor search, sketching, but not computation…

5 Our result
Theorem: Can compute ed(x,y) in
- n * 2^{Õ(√log n)} time with
- 2^{Õ(√log n)} approximation
While the algorithm uses some ideas of the [OR’05] embedding, it is not an algorithm for computing the [OR’05] embedding
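To see why this running time counts as near-linear, the factor 2^{Õ(√log n)} can be unpacked as follows. This is a short derivation added for illustration, assuming that Õ(f) hides factors polylogarithmic in f and that logarithms are base 2.

```latex
\[
2^{\tilde{O}(\sqrt{\log n})}
  = 2^{\sqrt{\log n}\,\cdot\,(\log\log n)^{O(1)}}
  = n^{(\log\log n)^{O(1)} / \sqrt{\log n}}
  = n^{o(1)},
\qquad\text{so}\qquad
n \cdot 2^{\tilde{O}(\sqrt{\log n})} = n^{1+o(1)} .
\]
```

The same bound shows that the approximation factor is n^{o(1)}, which is smaller than any fixed polynomial in n.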

6 Sketcher’s hat
2 examples of “sketches” from embeddings…
[Johnson-Lindenstrauss]: pick a random k-subspace of R^n, then for any q_1, …, q_n ∈ R^n, if q̃_i is the projection of q_i, then, w.h.p.
- ||q_i - q_j||_2 ≈ ||q̃_i - q̃_j||_2 up to O(1) distortion
- for k = O(log n)
[Bourgain]: given n vectors q_i, can construct n vectors q̃_i of k = O(log^2 n) dimension such that
- ||q_i - q_j||_1 ≈ ||q̃_i - q̃_j||_1 up to O(log n) distortion
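The Johnson-Lindenstrauss example can be tried numerically. The sketch below is an illustration under stated assumptions: it replaces the random k-subspace projection on the slide with a scaled Gaussian matrix (a common stand-in, not the slide's exact construction), and the names `random_projection` and `l2` are made up here.

```python
import math
import random


def l2(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((u - v) ** 2 for u, v in zip(a, b)))


def random_projection(points, k, seed=0):
    """Map each point to k dimensions with a Gaussian matrix scaled by 1/sqrt(k).

    Assumption: a Gaussian matrix is used instead of an exact projection onto
    a random k-dimensional subspace; both preserve pairwise l2 distances
    approximately with high probability.
    """
    rng = random.Random(seed)
    d = len(points[0])
    scale = 1.0 / math.sqrt(k)
    matrix = [[rng.gauss(0.0, 1.0) * scale for _ in range(d)] for _ in range(k)]
    return [[sum(m * p for m, p in zip(row, pt)) for row in matrix] for pt in points]


if __name__ == "__main__":
    pts = [[float((i * 7 + j * 3) % 11) for j in range(20)] for i in range(3)]
    proj = random_projection(pts, k=400)
    print(l2(pts[0], pts[1]), l2(proj[0], proj[1]))
```

The target dimension k controls how tightly the projected distances concentrate around the original ones; it does not depend on the original dimension.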

7 Our Algorithm
For each length m in some fixed set L ⊆ [n], compute vectors v_i^m ∈ ℓ_1 such that
- ||v_i^m – v_j^m||_1 ≈ ed(z[i:i+m], z[j:j+m])
- Dimension of v_i^m is only O(log^2 n)
Vectors {v_i^m} are computed recursively from {v_i^k} corresponding to shorter substrings (smaller k ∈ L)
Output: ed(x,y) ≈ ||v_1^{n/2} – v_{n/2+1}^{n/2}||_1 (i.e., for m = n/2 = |x| = |y|)
Here z = xy, and z[i:i+m] is the length-m substring of z starting at position i.

8 Idea: intuition
How to compute {v_i^m} from {v_i^k} for k << m?
- [OR] show how to compute some {w_i^m} with the same property, but which have very high dimension (~m)
Can apply [Bourgain] to the vectors {w_i^m}:
- Obtain vectors {v_i^m} of polylogarithmic dimension
- Incurs “only” O(log n) distortion at this step of the recursion (which turns out to be ok)
Challenge: how to do this in Õ(n) time?!
Recall: ||v_i^m – v_j^m||_1 ≈ ed(z[i:i+m], z[j:j+m])

9 Key step
Main Lemma: fix n vectors v_i ∈ ℓ_1^k, of dimension k = O(log^2 n) (the embeddings of shorter substrings).
- Let s < n. Define A_i = {v_i, v_{i+1}, …, v_{i+s-1}}.
- Then we can compute vectors q_i ∈ ℓ_1^k (the embeddings of longer substrings*) such that ||q_i – q_j||_1 ≈ EMD(A_i, A_j) up to distortion log^{O(1)} n
- Computing the q_i’s takes Õ(n) time.
EMD(A,B) = min-cost bipartite matching*
* cheating…
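The quantity EMD(A,B) that the Main Lemma approximates can be made concrete for tiny inputs. The following brute-force sketch takes the slide's simplified definition (min-cost bipartite matching between equal-size sets under the ℓ_1 distance) at face value; it is exponential-time and only meant to pin down the definition. The names `l1` and `emd_bruteforce` are hypothetical.

```python
from itertools import permutations


def l1(a, b):
    """l_1 distance between two equal-length vectors."""
    return sum(abs(u - v) for u, v in zip(a, b))


def emd_bruteforce(A, B):
    """Min-cost perfect matching between equal-size point lists A and B.

    Tries every matching, so it is only usable for very small sets.
    """
    if len(A) != len(B):
        raise ValueError("this sketch assumes |A| == |B|")
    return min(
        sum(l1(a, B[j]) for a, j in zip(A, perm))
        for perm in permutations(range(len(B)))
    )


if __name__ == "__main__":
    A = [(0, 0), (1, 0)]
    B = [(1, 0), (0, 1)]
    print(emd_bruteforce(A, B))
```

The lemma's point is that these distances never need to be computed exactly: vectors q_i whose ℓ_1 distances approximate them suffice.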

10 Proof of Main Lemma
“low” = log^{O(1)} n
Graph-metric: shortest path on a weighted graph
Sparse: Õ(n) edges
⊕_min^k M is a semi-metric on M^k with “distance” d_{min,M}(x,y) = min_{i=1..k} d_M(x_i, y_i)
Chain of embeddings (distortion of each step in parentheses):
EMD over n sets A_i
→ ⊕_min^low ℓ_1^high (O(log^2 n))
→ ⊕_min^low ℓ_1^low (O(1))
→ ⊕_min^low tree-metric (O(log n))
→ sparse graph-metric (O(log^3 n))
→ ℓ_1^low (O(log n), [Bourgain], efficient)
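The min-product distance defined on this slide is easy to state in code. The sketch below is illustrative only (the name `min_product_distance` is made up here) and also shows a small case where the triangle inequality fails, which is why the min-product is only treated as a semi-metric.

```python
def l1(a, b):
    """l_1 distance between two equal-length vectors."""
    return sum(abs(u - v) for u, v in zip(a, b))


def min_product_distance(x, y, d=l1):
    """d_min(x, y) = min over coordinates i of d(x_i, y_i).

    x and y are k-tuples whose entries are points of the base space M.
    """
    return min(d(xi, yi) for xi, yi in zip(x, y))


if __name__ == "__main__":
    # k = 2 copies of the real line (points written as 1-dimensional vectors).
    x = ((0,), (0,))
    y = ((0,), (10,))
    z = ((10,), (10,))
    print(min_product_distance(x, y), min_product_distance(y, z), min_product_distance(x, z))
```

In this example x and y agree in the first coordinate and y and z agree in the second, so both of those distances are 0 while the distance from x to z is 10.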

11 Step 1
EMD over n sets A_i → ⊕_min^low ℓ_1^high, with distortion O(log^2 n)
q.e.d.

12 Step 2
⊕_min^low ℓ_1^high → ⊕_min^low ℓ_1^low, with distortion O(1)
Lemma 2: can embed an n point set from ℓ_1^H into ⊕_min^{O(log n)} ℓ_1^k, for k = log^3 n, with O(1) distortion.
Use weak dimensionality reduction in ℓ_1
Thm [Indyk’06]: Let A be a random* matrix of size H by k = log^3 n. Then for any x, y, letting x̃ = Ax, ỹ = Ay:
- no contraction: ||x̃ - ỹ||_1 ≥ ||x - y||_1 (w.h.p.)
- 5-expansion: ||x̃ - ỹ||_1 ≤ 5 * ||x - y||_1 (with 0.01 probability)
Just use O(log n) of such embeddings
- Their min is an O(1) approximation to ||x - y||_1, w.h.p.

13 Efficiency of Step 1+2
From step 1+2, we get some embedding f() of sets A_i = {v_i, v_{i+1}, …, v_{i+s-1}} into ⊕_min^low ℓ_1^low
Naively it would take Ω(n*s) = Ω(n^2) time to compute all f(A_i)
Save using linearity of sketches:
- f() is linear: f(A) = ∑_{a ∈ A} f(a)
- Then f(A_i) = f(A_{i-1}) - f(v_{i-1}) + f(v_{i+s-1})
- Compute f(A_i) in order, for a total of Õ(n) time
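The sliding-window update f(A_i) = f(A_{i-1}) - f(v_{i-1}) + f(v_{i+s-1}) can be sketched as follows. This is an illustration, not the talk's implementation: `window_sketches` is a hypothetical name, and `f` stands for any map applied per vector whose images are summed over a window, which is the linearity the slide relies on.

```python
def window_sketches(vectors, s, f):
    """Return f(A_i) for every window A_i = {v_i, ..., v_{i+s-1}}.

    Assumes f(A) is the coordinate-wise sum of f(a) over a in A, so each
    new window costs one subtraction and one addition instead of s additions.
    """
    images = [f(v) for v in vectors]          # f applied once per vector
    dim = len(images[0])
    # First window is summed directly.
    current = [sum(img[c] for img in images[:s]) for c in range(dim)]
    out = [list(current)]
    for i in range(1, len(vectors) - s + 1):
        leaving, entering = images[i - 1], images[i + s - 1]
        current = [cur - old + new for cur, old, new in zip(current, leaving, entering)]
        out.append(list(current))
    return out


if __name__ == "__main__":
    vs = [(1, 2), (3, 4), (5, 6), (7, 8)]
    print(window_sketches(vs, 2, lambda v: (v[0] + v[1], v[0] - v[1])))
```

Each update touches only the sketch dimension, which is polylogarithmic in the talk's setting, so all windows together cost Õ(n) rather than Ω(n*s).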

14 Step 3
⊕_min^low ℓ_1^low → ⊕_min^low tree-metric, with distortion O(log n)
Lemma 3: can embed ℓ_1 over {0..M}^p into ⊕_min^low tree-m, with O(log n) distortion.
- For each Δ = a power of 2, take O(log n) random grids. Each grid gives a ⊕_min-coordinate
[Figure labels: ∞, Δ]

15 Step 4
⊕_min^low tree-metric → sparse graph-metric, with distortion O(log^3 n)
Lemma 4: suppose we have n points in ⊕_min^low tree-m, which approximates a metric up to distortion D. Can embed into a graph-metric of size Õ(n) with distortion D.

16 Step 5
sparse graph-metric → ℓ_1^low, with distortion O(log n)
Lemma 5: Given a graph with m edges, can embed the graph-metric into ℓ_1^low with O(log n) distortion in Õ(m) time.
Just implement [Bourgain]’s embedding:
- Choose O(log^2 n) sets B_i
- Need to compute the distance from each node to each B_i
- For each B_i can compute its distance to each node using Dijkstra’s algorithm in Õ(m) time
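The per-set computation in Step 5 is a multi-source Dijkstra run: all nodes of B_i start at distance 0. The sketch below is a simplified illustration, assuming a connected undirected graph given as an adjacency dict with comparable node labels. The way the sets are sampled (sizes 1, 2, 4, … with a fixed number of sets per size) and the lack of coordinate scaling are assumptions made here, not details taken from the talk.

```python
import heapq
import random


def distances_to_set(adj, sources):
    """Multi-source Dijkstra: dist[u] = min over b in sources of d(u, b)."""
    dist = {u: float("inf") for u in adj}
    heap = []
    for b in sources:
        dist[b] = 0.0
        heap.append((0.0, b))
    heapq.heapify(heap)
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist[u]:
            continue  # stale heap entry
        for v, w in adj[u]:
            nd = d + w
            if nd < dist[v]:
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return dist


def bourgain_style_embedding(adj, sets_per_scale, seed=0):
    """One coordinate per random set B: the graph distance from the node to B."""
    rng = random.Random(seed)
    nodes = sorted(adj)
    coords = {u: [] for u in nodes}
    size = 1
    while size <= len(nodes):
        for _ in range(sets_per_scale):
            dist = distances_to_set(adj, rng.sample(nodes, size))
            for u in nodes:
                coords[u].append(dist[u])
        size *= 2
    return coords


if __name__ == "__main__":
    path = {0: [(1, 1.0)], 1: [(0, 1.0), (2, 1.0)], 2: [(1, 1.0), (3, 1.0)], 3: [(2, 1.0)]}
    print(distances_to_set(path, [0, 3]))
```

Each coordinate is 1-Lipschitz with respect to the graph metric, since |d(u,B) - d(v,B)| ≤ d(u,v), and each Dijkstra run takes Õ(m) time on a graph with m edges.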

17 Summary of Main Lemma
Min-product helps to get low dimension (~small-size sketch)
- bypasses impossibility of dim-reduction in ℓ_1
It is OK that it is not a metric, as long as it is close to a metric
Chain of embeddings (distortion of each step in parentheses):
EMD over n sets A_i
→ ⊕_min^low ℓ_1^high (O(log^2 n))
→ ⊕_min^low ℓ_1^low (O(1))
→ ⊕_min^low tree-metric (O(log n))
→ sparse graph-metric (O(log^3 n))
→ ℓ_1^low (O(log n))
Labels on the chain: oblivious, non-oblivious

18 Conclusion
Theorem: can compute ed(x,y) in n * 2^{Õ(√log n)} time with 2^{Õ(√log n)} approximation