Approximating Edit Distance in Near-Linear Time


Approximating Edit Distance in Near-Linear Time Alexandr Andoni (MIT) Joint work with Krzysztof Onak (MIT)

Edit Distance
For two strings x, y ∈ Σ^n, ed(x,y) = minimum number of edit operations to transform x into y.
Edit operations = insertion / deletion / substitution.
Important in: computational biology, text processing, etc.
Example: ed(0101010, 1010101) = 2.

Computing Edit Distance
Problem: compute ed(x,y) for given x, y ∈ {0,1}^n.
Exactly: O(n^2) [Levenshtein’65]; O(n^2/log^2 n) for |Σ| = O(1) [Masek-Paterson’80].
Approximately in n^{1+o(1)} time: n^{1/3+o(1)} approximation [Batu-Ergun-Sahinalp’06], improving over [Sahinalp-Vishkin’96, Cole-Hariharan’02, Bar-Yossef-Jayram-Krauthgamer-Kumar’04].
Sublinear time: distinguish ed(x,y) ≤ n^{1−ε} vs ed(x,y) ≥ n/100 in n^{1−2ε} time [Batu-Ergun-Kilian-Magen-Raskhodnikova-Rubinfeld-Sami’03].
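For reference, the classic quadratic-time dynamic program (the [Levenshtein’65] baseline referred to above) can be sketched as follows; the example ed(0101010, 1010101) = 2 from the previous slide serves as a sanity check.

```python
def edit_distance(x: str, y: str) -> int:
    """Classic O(n^2)-time dynamic program for edit distance.
    dp[j] holds ed(x[:i], y[:j]) for the current row i (rolling row)."""
    n, m = len(x), len(y)
    dp = list(range(m + 1))      # row i = 0: ed("", y[:j]) = j
    for i in range(1, n + 1):
        prev_diag = dp[0]        # value of dp[i-1][j-1] for the inner loop
        dp[0] = i                # ed(x[:i], "") = i
        for j in range(1, m + 1):
            cur = dp[j]
            cost = 0 if x[i - 1] == y[j - 1] else 1
            dp[j] = min(dp[j] + 1,          # delete x[i-1]
                        dp[j - 1] + 1,      # insert y[j-1]
                        prev_diag + cost)   # substitute (or match)
            prev_diag = cur
    return dp[m]

print(edit_distance("0101010", "1010101"))  # → 2
```

Note that ed(0101010, 1010101) = 2 and not 7: deleting the leading 0 and appending a 1 suffices, whereas substitutions alone would need all 7 positions.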

Computing via embedding into ℓ1
Embedding: f: {0,1}^n → ℓ1 such that ed(x,y) ≈ ||f(x) − f(y)||_1 up to some distortion (= approximation).
Then ed(x,y) can be computed in the time it takes to compute f(x), f(y).
Best embedding known, by [Ostrovsky-Rabani’05]: distortion = 2^{Õ(√log n)}.
Computation time: ~n^2, randomized (and similar dimension).
Helps for nearest neighbor search and sketching, but not for computation…

Our result
Theorem: Can compute ed(x,y) in n·2^{Õ(√log n)} time with 2^{Õ(√log n)} approximation.
While the algorithm uses some ideas of the [OR’05] embedding, it is not an algorithm for computing the [OR’05] embedding.

Review of Ostrovsky-Rabani embedding
φ_m = embedding of strings of length m; δ(m) = distortion of φ_m.
The embedding is recursive:
Partition x into b blocks of length m/b (b is later chosen to be exp(√log m)).
Use the embeddings φ_k for k ≤ m/b.
Embed each block separately, as follows…
[figure: the string x split into b blocks of length m/b each]

Ostrovsky-Rabani embedding (II)
[figure: for block i and each substring length s ∈ S, E_i^s = the set of recursive embeddings of the length-s substrings of the block]
Want to approximate ed(x,y) by ∑_{i=1..b} ∑_{s∈S} TEMD_s(E_i^s(x), E_i^s(y)),
where EMD(A,B) = min-cost bipartite matching and TEMD = thresholded EMD.
Finish by embedding TEMD into ℓ1 with small distortion.

Distortion of the [OR] embedding
Suppose we can embed TEMD into ℓ1 with distortion (log m)^{O(1)}.
Then [Ostrovsky-Rabani’05] show that the distortion of φ_m satisfies
δ(m) ≤ (log m)^{O(1)} · [δ(m/b) + b].
For b = exp[√log m]:
δ(m) ≤ exp[Õ(√log m)].
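Unrolling the recursion shows where the exp[Õ(√log m)] bound comes from (a sketch, constants suppressed): each level replaces m by m/b, i.e. reduces log m by √log m, so the recursion depth is O(√log m).

```latex
\delta(m) \;\le\; (\log m)^{O(1)}\,\bigl[\delta(m/b) + b\bigr]
  \;\le\; \underbrace{(\log m)^{O(\sqrt{\log m})}}_{\text{polylog factor per level}}
          \cdot \underbrace{e^{\sqrt{\log m}}}_{b}
          \cdot O\!\left(\sqrt{\log m}\right)
  \;=\; e^{\tilde O(\sqrt{\log m})},
% using (\log m)^{O(\sqrt{\log m})} = e^{O(\sqrt{\log m}\,\log\log m)}
%     = e^{\tilde O(\sqrt{\log m})}.
```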

Why it is expensive to compute the [OR] embedding
Already in the first step, one needs to compute the recursive embeddings E_i^s for ~n/b substrings of length ~n/b each.
The dimension blows up.

Our Algorithm
[figure: z = the concatenation of x and y; z[i:i+m] denotes its length-m substring starting at position i]
For each length m in some fixed set L ⊆ [n], compute vectors v_i^m ∈ ℓ1 such that
||v_i^m − v_j^m||_1 ≈ ed(z[i:i+m], z[j:j+m]) up to distortion δ(m).
The dimension of v_i^m is only O(log^2 n).
The vectors v_i^m are computed inductively from the v_i^k for k ≤ m/b (k ∈ L).
Output: ed(x,y) ≈ ||v_1^{n/2} − v_{n/2+1}^{n/2}||_1 (i.e., for m = n/2 = |x| = |y|).

Idea: intuition
Goal: ||v_i^m − v_j^m||_1 ≈ ed(z[i:i+m], z[j:j+m]).
For each m ∈ L, compute φ_m(z[i:i+m]) as in the O-R recursive step, except that we use the vectors v_i^k, k ≤ m/b and k ∈ L, in place of the recursive embeddings of shorter substrings (the sets E_i^s).
The resulting φ_m(z[i:i+m]) have high dimension, > m/b…
Use Bourgain’s lemma: given n vectors q_i, one can construct n vectors q̃_i of dimension O(log^2 n) such that ||q_i − q_j||_1 ≈ ||q̃_i − q̃_j||_1 up to O(log n) distortion.
Apply it to the vectors φ_m(z[i:i+m]), i = 1..n−m, to obtain vectors v_i^m of polylogarithmic dimension.
This incurs O(log n) distortion at each step of the recursion, but that is OK: there are only ~√log n steps, giving an additional distortion of only exp[Õ(√log n)].

Idea: implementation
The essential step is:
Main Lemma: fix n vectors v_i ∈ ℓ1 of dimension p = O(log^2 n), and let s < n. Define A_i = {v_i, v_{i+1}, …, v_{i+s−1}}. Then we can compute vectors q_i ∈ ℓ1^k, for k = O(log^2 n), such that ||q_i − q_j||_1 ≈ TEMD(A_i, A_j) up to distortion log^{O(1)} n.
Computing the q_i’s takes Õ(n) time.

Proof of Main Lemma
Notation: min_k M is the semi-metric on M^k with “distance” d_{min,M}(x,y) = min_{i=1..k} d_M(x_i, y_i); “low” = log^{O(1)} n; graph-metric = shortest path on a weighted graph; sparse = Õ(n) edges.
The proof is a chain of embeddings (distortions on the arrows):
TEMD over n sets A_i
→ min_low ℓ1^high (O(log^2 n))
→ min_low ℓ1^low (O(1))
→ min_low tree-metric (O(log n))
→ sparse graph-metric (O(log^3 n))
→ ℓ1^low (O(log n), [Bourgain], efficient)

Step 1: TEMD over n sets A_i → min_low ℓ1^high, distortion O(log^2 n)
Lemma 1: can embed TEMD over n sets in ({0..M}^p, ℓ1) into min_{O(log n)} ℓ1^{M^p} with O(log^2 n) distortion, w.h.p.
Use [A-Indyk-Krauthgamer’08] (similar to the Ostrovsky-Rabani embedding).
Embedding: for each Δ = power of 2, impose a randomly-shifted grid; one coordinate per cell, equal to the # of points in the cell.
Theorem [AIK]: no contraction w.h.p.; expected expansion = O(log^2 n).
Just repeat O(log n) times.
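The randomly-shifted grid idea can be sketched as follows. This is a simplified illustration, not the exact [AIK’08] construction: one grid per scale Δ, one coordinate per nonempty cell counting the points in it; the sparse-dict representation and function names are my own choices.

```python
import random
from collections import Counter

def grid_coordinates(points, M, seed=0):
    """For each scale Delta = 1, 2, 4, ..., M: impose a randomly-shifted
    grid of cell side Delta, and emit one coordinate per nonempty cell,
    equal to the number of points falling in that cell (sparse dict)."""
    rng = random.Random(seed)
    dim = len(points[0])
    coords = {}
    delta = 1
    while delta <= M:
        shift = tuple(rng.randrange(delta) for _ in range(dim))
        cells = Counter(
            tuple((p[d] + shift[d]) // delta for d in range(dim))
            for p in points
        )
        for cell, count in cells.items():
            coords[(delta, cell)] = count
        delta *= 2
    return coords

def l1_diff(a, b):
    """l1 distance between two sparse coordinate dicts."""
    return sum(abs(a.get(k, 0) - b.get(k, 0)) for k in set(a) | set(b))
```

With the same seed (i.e. the same shifted grids), identical multisets embed identically, and the ℓ1 distance between embeddings reflects how differently the two point sets populate cells across scales.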

Step 2: min_low ℓ1^high → min_low ℓ1^low, distortion O(1)
Lemma 2: can embed an n-point set from ℓ1^M into min_{O(log n)} ℓ1^k, for k = O(log^3 n), with O(1) distortion.
Use (weak) dimensionality reduction in ℓ1.
Thm [Indyk’06]: let A be a k × M matrix, k = O(log^3 n), with each entry chosen from the Cauchy distribution. Then for x̃ = Ax, ỹ = Ay:
no contraction: ||x̃ − ỹ||_1 ≥ ||x − y||_1 (w.h.p.);
5-expansion: ||x̃ − ỹ||_1 ≤ 5·||x − y||_1 (with probability 0.01).
Just use O(log n) such embeddings: w.h.p. at least one of them expands by at most 5, and the min-product takes care of the rest.
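A minimal numeric sketch of the Cauchy projection, using only the standard library (Cauchy variates sampled as tan(π(U − 1/2))). The median estimator used below to read the ℓ1 distance back out of the sketch is a standard companion to this projection, not something stated on the slide.

```python
import math
import random

def cauchy_sketch(vec, k, seed=0):
    """Project a point of l1^M down to k coordinates by a random k x M
    matrix of i.i.d. standard Cauchy entries, in the spirit of the
    [Indyk'06]-style weak l1 dimensionality reduction.
    The same seed reproduces the same projection matrix."""
    rng = random.Random(seed)
    out = []
    for _ in range(k):
        row = (math.tan(math.pi * (rng.random() - 0.5)) for _ in vec)
        out.append(sum(a * v for a, v in zip(row, vec)))
    return out

# Each sketch coordinate of x - y is Cauchy with scale ||x - y||_1,
# so the median of the absolute coordinate differences estimates it.
data_rng = random.Random(1)
x = [data_rng.random() for _ in range(300)]
y = [data_rng.random() for _ in range(300)]
k = 2001
sx = cauchy_sketch(x, k, seed=2)
sy = cauchy_sketch(y, k, seed=2)   # same seed => same matrix A
diffs = sorted(abs(a - b) for a, b in zip(sx, sy))
est = diffs[k // 2]                # median estimate of ||x - y||_1
true = sum(abs(a - b) for a, b in zip(x, y))
```

The heavy Cauchy tails make individual coordinates wild, but the median is robust; with k ≈ 2000 coordinates the estimate lands well within a constant factor of the true ℓ1 distance.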

Efficiency of Steps 1+2
From Steps 1+2, we get some embedding f(·) of the sets A_i = {v_i, v_{i+1}, …, v_{i+s−1}} into min_low ℓ1^low.
Naively, it would take Ω(n·s) = Ω(n^2) time to compute all f(A_i).
More efficiently: note that f(·) is linear: f(A) = ∑_{a∈A} f(a).
Then f(A_i) = f(A_{i−1}) − f(v_{i−1}) + f(v_{i+s−1}).
Compute the f(A_i) in order, for a total of Õ(n) time.
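The sliding-window trick in code (a generic sketch: f here stands for any linear set-embedding, represented as a function on single vectors returning a coordinate list):

```python
def sliding_window_sketches(vectors, s, f):
    """Compute f(A_i) for every window A_i = {v_i, ..., v_{i+s-1}},
    exploiting linearity f(A) = sum_{a in A} f(a): each window's sketch
    is obtained from the previous one by one subtraction and one
    addition, so the total work is O(n) evaluations of f."""
    singles = [f(v) for v in vectors]          # f on single vectors
    cur = [sum(c) for c in zip(*singles[:s])]  # f(A_0) by direct summation
    out = [list(cur)]
    for i in range(1, len(vectors) - s + 1):
        # f(A_i) = f(A_{i-1}) - f(v_{i-1}) + f(v_{i+s-1})
        cur = [c - a + b
               for c, a, b in zip(cur, singles[i - 1], singles[i + s - 1])]
        out.append(list(cur))
    return out

# Illustration with the identity embedding f(v) = v:
print(sliding_window_sketches([[1, 0], [2, 1], [3, 0], [4, 2]], 2,
                              lambda v: v))  # → [[3, 1], [5, 1], [7, 2]]
```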

Step 3 minlow ℓ1low O(log n) minlow tree-metric Lemma 3: can embed ℓ1 over {0..M}p into minO(log^2 n) tree-m, with O(log n) distortion. For each Δ = a power of 2, take O(log n) random grids. Each grid gives a min-coordinate ∞ Δ 

Step 4: min_low tree-metric → sparse graph-metric
Lemma 4: suppose we have n points in min_low tree-metrics, approximating a metric up to distortion D (= O(log^3 n) accumulated so far in the chain). Then we can embed them into a graph-metric of size Õ(n) with the same distortion D.

Step 5: sparse graph-metric → ℓ1^low, distortion O(log n)
Lemma 5: given a graph with m edges, can embed its graph-metric into ℓ1^low with O(log n) distortion in Õ(m) time.
Just implement Bourgain’s embedding:
Choose O(log^2 n) sets B_i.
Need to compute the distance from each node to each B_i.
For each B_i, compute its distance to every node using Dijkstra’s algorithm, in Õ(m) time.
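Bourgain’s embedding via multi-source Dijkstra can be sketched like this (a minimal version with my own adjacency-list representation; normalization constants and the precise set-size schedule of the full proof are omitted):

```python
import heapq
import math
import random

def distances_from_set(adj, sources):
    """Multi-source Dijkstra: d(u, B) = distance from u to the set B,
    for every node u, in one O(m log n) pass."""
    dist = {u: math.inf for u in adj}
    heap = []
    for s in sources:
        dist[s] = 0
        heapq.heappush(heap, (0, s))
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist[u]:
            continue          # stale heap entry
        for v, w in adj[u]:
            if d + w < dist[v]:
                dist[v] = d + w
                heapq.heappush(heap, (d + w, v))
    return dist

def bourgain_embedding(adj, seed=0):
    """Bourgain-style embedding of a graph metric into l1: coordinates
    are distances d(u, B) to random node sets B of sizes 1, 2, 4, ...,
    O(log n) sets per size. Each coordinate u -> d(u, B) is 1-Lipschitz,
    so no single coordinate ever expands a distance."""
    rng = random.Random(seed)
    nodes = list(adj)
    reps = max(1, int(math.log2(len(nodes))))
    emb = {u: [] for u in nodes}
    for t in range(reps + 1):          # set sizes 2^t
        for _ in range(reps):          # O(log n) sets per size
            B = rng.sample(nodes, min(2 ** t, len(nodes)))
            dist = distances_from_set(adj, B)
            for u in nodes:
                emb[u].append(dist[u])
    return emb
```

One Dijkstra run per set gives all n coordinates for that set at once, which is exactly why the whole embedding costs only Õ(m).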

Summary of Main Lemma
TEMD over n sets A_i
→ min_low ℓ1^high (O(log^2 n), oblivious)
→ min_low ℓ1^low (O(1), oblivious)
→ min_low tree-metric (O(log n))
→ sparse graph-metric (O(log^3 n))
→ ℓ1^low (O(log n), non-oblivious)
The min-product helps to get low dimension (~ a small-size sketch): it bypasses the impossibility of dimensionality reduction in ℓ1.
It is OK that the min-product is not a metric, as long as it is close to a metric.

Conclusion + a question
Theorem: can compute ed(x,y) in n·2^{Õ(√log n)} time with 2^{Õ(√log n)} approximation.
Question: can we do the following “oblivious” dimensionality reduction in ℓ1: given n, construct a randomized embedding φ: ℓ1^M → ℓ1^{polylog n} such that for any v_1, …, v_n ∈ ℓ1^M, with high probability, φ has distortion log^{O(1)} n on these vectors?
If such a φ exists, it cannot be linear [Charikar-Sahai’02].