Approximating Edit Distance in Near-Linear Time
Alexandr Andoni (MIT)
Joint work with Krzysztof Onak (MIT)
Edit Distance
- For two strings x, y ∈ ∑^n
- ed(x,y) = minimum number of edit operations to transform x into y
- Edit operations = insertion/deletion/substitution
- Important in: computational biology, text processing, etc.
- Example: ed(0101010, 1010101) = 2
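As a concrete reference point for the definition above, here is a minimal sketch of the textbook quadratic-time dynamic program for ed(x,y). The function name `edit_distance` is chosen here for illustration only; this is the exact baseline, not the near-linear algorithm of the talk.

```python
def edit_distance(x: str, y: str) -> int:
    """Textbook O(|x|*|y|) dynamic program, keeping one row at a time."""
    # prev[j] = edit distance between the empty prefix of x and y[:j]
    prev = list(range(len(y) + 1))
    for i, cx in enumerate(x, start=1):
        curr = [i]  # transforming x[:i] into the empty string costs i deletions
        for j, cy in enumerate(y, start=1):
            curr.append(min(
                prev[j] + 1,               # delete cx
                curr[j - 1] + 1,           # insert cy
                prev[j - 1] + (cx != cy),  # substitute, or free match
            ))
        prev = curr
    return prev[-1]


print(edit_distance("0101010", "1010101"))  # 2
```

The slide's example checks out: delete the leading 0 and append a 1.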
Computing Edit Distance
- Problem: compute ed(x,y) for given x, y ∈ {0,1}^n
- Exactly:
  - O(n^2) time [Levenshtein’65]
  - O(n^2/log^2 n) time for |∑| = O(1) [Masek-Paterson’80]
- Approximately, in n^(1+o(1)) time:
  - n^(1/3+o(1)) approximation [Batu-Ergun-Sahinalp’06], improving over [Sahinalp-Vishkin’96, Cole-Hariharan’02, BarYossef-Jayram-Krauthgamer-Kumar’04]
- Sublinear time:
  - distinguish ed ≤ n^(1-ε) from ed ≥ n/100 in n^(1-2ε) time [Batu-Ergun-Kilian-Magen-Raskhodnikova-Rubinfeld-Sami’03]
Computing via embedding into ℓ1
- Embedding: f: {0,1}^n → ℓ1 such that ed(x,y) ≈ ||f(x) - f(y)||_1, up to some distortion (= approximation factor)
- Can then approximate ed(x,y) in the time it takes to compute f(x)
- Best embedding, by [Ostrovsky-Rabani’05]: distortion = 2^Õ(√log n)
- Computation time: ~n^2, randomized (and similar dimension)
- Helps for nearest neighbor search and sketching, but not for computation…
Our result
- Theorem: can compute ed(x,y) in n·2^Õ(√log n) time with 2^Õ(√log n) approximation
- While the algorithm uses some ideas of the [OR’05] embedding, it is not an algorithm for computing the [OR’05] embedding
Review of Ostrovsky-Rabani embedding
- φ_m = embedding of strings of length m
- δ(m) = distortion of φ_m
- Embedding is recursive:
  - Partition into b blocks (b later chosen to be exp(√log m))
  - Use embeddings φ_k for k ≤ m/b
  - Embed each block separately as follows…
[Figure: string x split into blocks of length m/b]
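Ostrovsky-Rabani embedding (II)
[Figure: string x split into b blocks, with the sets E_1^s, E_2^s, E_3^s, …, E_b^s, one per block]
- E_i^s = recursive embeddings of the length-s substrings of block i
- Want to approximate ed(x,y) by ∑_{i=1..b} ∑_{s∈S} TEMD_s(E_i^s(x), E_i^s(y))
- EMD(A,B) = cost of a min-cost bipartite matching between A and B; the T in TEMD stands for "thresholded"
- Finish by embedding TEMD into ℓ1 with small distortion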
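To make the EMD definition above concrete, here is a minimal brute-force sketch of EMD as a min-cost perfect matching under ℓ1, with an optional cap on each pairwise distance. The cap is only an assumed reading of "thresholded"; the slide does not spell out the exact TEMD definition or any normalization, and the names `emd`, `l1` and `threshold` are illustrative.

```python
from itertools import permutations


def l1(a, b):
    """l1 distance between two equal-length tuples."""
    return sum(abs(p - q) for p, q in zip(a, b))


def emd(A, B, threshold=None):
    """Cost of a min-cost perfect matching between equal-size point lists.

    Brute force over all |A|! matchings, so only usable for tiny sets.
    If `threshold` is given, every pairwise distance is capped at it
    (an assumption about what "thresholded" means here).
    """
    assert len(A) == len(B), "sets must have the same size"

    def cost(a, b):
        d = l1(a, b)
        return d if threshold is None else min(d, threshold)

    return min(
        sum(cost(a, B[j]) for a, j in zip(A, perm))
        for perm in permutations(range(len(B)))
    )


A = [(0, 0), (10, 0)]
B = [(0, 1), (10, 3)]
print(emd(A, B))               # 4
print(emd(A, B, threshold=2))  # 3
```

In the first call the best matching pairs each point with its near neighbor (cost 1 + 3); with the cap at 2 the second pair contributes only 2.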
Distortion of [OR] embedding
- Suppose we can embed TEMD into ℓ1 with distortion (log m)^O(1)
- Then [Ostrovsky-Rabani’05] show that the distortion of φ_m is δ(m) ≤ (log m)^O(1) · [δ(m/b) + b]
- For b = exp[√log m]: δ(m) ≤ exp[Õ(√log m)]
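The last bound follows by unrolling the recurrence. The following is a sketch of that calculation under simplifying assumptions that are not on the slide: the same b is used at every level, the polylogarithmic factor is written as (log m)^c for a constant c, and the distortion at the bottom of the recursion is at most b.

```latex
% Assumptions: fixed b = \exp(\sqrt{\log m}) at every level, factor (\log m)^c,
% and \delta(m/b^t) \le b at the bottom of the recursion.
\begin{align*}
\delta(m) &\le (\log m)^{c}\,\bigl[\delta(m/b) + b\bigr] \\
          &\le (\log m)^{ct}\,\delta(m/b^{t}) + b \sum_{j=1}^{t} (\log m)^{cj}
             && \text{(unrolling $t$ levels)} \\
          &\le (t+1)\, b\, (\log m)^{ct}
             && \text{with } t = \frac{\log m}{\log b} = \sqrt{\log m} \\
          &= \exp\bigl[O(\sqrt{\log m}\,\log\log m)\bigr]
           = \exp\bigl[\tilde O(\sqrt{\log m})\bigr].
\end{align*}
```

The choice of b balances the two sources of loss: a larger b shortens the recursion but adds more per level, a smaller b does the opposite.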
Why it is expensive to compute [OR] embedding
- E_i^s = recursive embeddings of the length-s substrings of block i
- In the first step alone, we need to compute the recursive embedding for ~n/b strings of length ~n/b
- The dimension blows up
Our Algorithm
[Figure: string z (x followed by y), with the substring z[i:i+m] marked at position i]
- For each length m in some fixed set L ⊆ [n], compute vectors v_i^m ∈ ℓ1 such that ||v_i^m - v_j^m||_1 ≈ ed(z[i:i+m], z[j:j+m]) up to distortion δ(m)
- Dimension of v_i^m is only O(log^2 n)
- Vectors v_i^m are computed inductively from v_i^k for k ≤ m/b (k ∈ L)
- Output: ed(x,y) ≈ ||v_1^{n/2} - v_{n/2+1}^{n/2}||_1 (i.e., for m = n/2 = |x| = |y|)
Idea: intuition
- Goal: ||v_i^m - v_j^m||_1 ≈ ed(z[i:i+m], z[j:j+m])
- For each m ∈ L, compute φ_m(z[i:i+m]) as in the O-R recursive step, except that we use the vectors v_i^k, k < m/b and k ∈ L, in place of the recursive embeddings of shorter substrings (the sets E_i^s)
- The resulting φ_m(z[i:i+m]) have high dimension, > m/b…
- Apply Bourgain’s Lemma to the vectors φ_m(z[i:i+m]), i = 1..n-m
  - [Bourgain]: given n vectors q_i, construct n vectors q̃_i of O(log^2 n) dimension such that ||q_i - q_j||_1 ≈ ||q̃_i - q̃_j||_1 up to O(log n) distortion
  - Applied to the vectors φ_m(z[i:i+m]), this gives vectors v_i^m of polylogarithmic dimension
  - This incurs O(log n) distortion at each step of the recursion, but that is OK as there are only ~√log n steps, giving an additional distortion of only exp[Õ(√log n)]
Idea: implementation
- The essential step is the Main Lemma
- Main Lemma: fix n vectors v_i ∈ ℓ1 of dimension p = O(log^2 n). Let s < n. Define A_i = {v_i, v_{i+1}, …, v_{i+s-1}}. Then we can compute vectors q_i ∈ ℓ1^k, for k = O(log^2 n), such that ||q_i - q_j||_1 ≈ TEMD(A_i, A_j) up to distortion log^O(1) n
- Computing the q_i’s takes Õ(n) time
Proof of Main Lemma
- Chain of embeddings (distortion noted on each arrow):
  TEMD over n sets A_i
    → [O(log^2 n)] min_low ℓ1^high
    → [O(1)] min_low ℓ1^low
    → [O(log n)] min_low tree-metric
    → [O(log^3 n)] sparse graph-metric
    → [O(log n), Bourgain (efficient)] ℓ1^low
- “low” = log^O(1) n
- min_k M is a semi-metric on M^k with “distance” d_{min,M}(x,y) = min_{i=1..k} d_M(x_i, y_i)
- Graph-metric: shortest path on a weighted graph
- Sparse: Õ(n) edges
Step 1: TEMD over n sets A_i → min_low ℓ1^high, distortion O(log^2 n)
- Lemma 1: can embed TEMD over n sets in ({0..M}^p, ℓ1) into min_{O(log n)} ℓ1^{M^p} with O(log^2 n) distortion, w.h.p.
- Use [A-Indyk-Krauthgamer’08] (similar to the Ostrovsky-Rabani embedding)
- Embedding: for each Δ = power of 2
  - impose a randomly-shifted grid
  - one coordinate per cell, equal to # of points in the cell
- Theorem [AIK]: no contraction w.h.p.; expected expansion = O(log^2 n)
- Just repeat O(log n) times
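The grid construction above can be sketched as follows. This is an illustration under assumptions, not the [AIK] construction itself: the cell side is taken to be Δ, each cell count is weighted by Δ (the slide only says the coordinate equals the number of points in the cell), and the names `make_grids`, `embed` and `l1_diff` are made up for this example. It produces a single grid embedding; the repetition and min over O(log n) copies is omitted.

```python
import random
from collections import Counter


def make_grids(dim, max_coord, rng):
    """One randomly shifted grid per cell side delta = 1, 2, 4, ..., <= max_coord."""
    grids = []
    delta = 1
    while delta <= max_coord:
        shift = [rng.uniform(0, delta) for _ in range(dim)]
        grids.append((delta, shift))
        delta *= 2
    return grids


def embed(points, grids):
    """Sparse vector with one coordinate per (grid level, cell).

    Each point adds delta to the coordinate of the cell it falls in
    (weighting by delta is an assumption, see the lead-in).
    """
    f = Counter()
    for level, (delta, shift) in enumerate(grids):
        for p in points:
            cell = tuple(int((c + s) // delta) for c, s in zip(p, shift))
            f[(level, cell)] += delta
    return f


def l1_diff(f, g):
    """l1 distance between two sparse vectors stored as Counters."""
    return sum(abs(f[k] - g[k]) for k in set(f) | set(g))


rng = random.Random(0)
grids = make_grids(dim=2, max_coord=16, rng=rng)
A = [(1, 2), (5, 5)]
B = [(2, 2), (9, 9)]
print(l1_diff(embed(A, grids), embed(B, grids)))
```

Because every point contributes independently, the map is linear in the set: the embedding of a union is the sum of the embeddings. That property is what makes an incremental computation over sliding windows possible.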
Step 2: min_low ℓ1^high → min_low ℓ1^low, distortion O(1)
- Lemma 2: can embed an n-point set from ℓ1^M into min_{O(log n)} ℓ1^k, for k = O(log^3 n), with O(1) distortion
- Use (weak) dimensionality reduction in ℓ1
- Thm [Indyk’06]: Let A be a k × M matrix, k = O(log^3 n), with each entry chosen from the Cauchy distribution. Then for any x, y and x̃ = Ax, ỹ = Ay:
  - no contraction: ||x̃-ỹ||_1 ≥ ||x-y||_1 (w.h.p.)
  - 5-expansion: ||x̃-ỹ||_1 ≤ 5·||x-y||_1 (with 0.01 probability)
- Just use O(log n) such embeddings
Efficiency of Step 1+2
- From steps 1+2, we get some embedding f() of the sets A_i = {v_i, v_{i+1}, …, v_{i+s-1}} into min_low ℓ1^low
- Naively, it would take Ω(n·s) time to compute all f(A_i), which is Ω(n^2) for s = Θ(n)
- More efficiently:
  - Note that f() is linear: f(A) = ∑_{a∈A} f(a)
  - Then f(A_i) = f(A_{i-1}) - f(v_{i-1}) + f(v_{i+s-1})
  - Compute the f(A_i) in order, for a total of Õ(n) time
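The incremental update can be sketched as a standard sliding-window sum. This assumes the per-point embeddings f(v_i) are already available as equal-length lists of numbers; `sliding_window_sums` is a name chosen for this illustration, and indices are 0-based rather than 1-based as on the slide.

```python
def sliding_window_sums(point_embeddings, s):
    """Return f(A_i) = sum of s consecutive vectors, for every full window.

    point_embeddings[i] plays the role of f(v_i). Each step subtracts the
    vector leaving the window and adds the one entering it, so the total
    work is O(n * dim) instead of O(n * s * dim).
    """
    n = len(point_embeddings)
    if s <= 0 or s > n:
        return []
    dim = len(point_embeddings[0])
    # First window is computed directly.
    window = [sum(vec[d] for vec in point_embeddings[:s]) for d in range(dim)]
    out = [window]
    for i in range(1, n - s + 1):
        leaving = point_embeddings[i - 1]
        entering = point_embeddings[i + s - 1]
        window = [w - a + b for w, a, b in zip(window, leaving, entering)]
        out.append(window)
    return out


f_v = [[1, 0], [2, 1], [3, 5], [4, 2]]
print(sliding_window_sums(f_v, 2))  # [[3, 1], [5, 6], [7, 7]]
```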
Step 3: min_low ℓ1^low → min_low tree-metric, distortion O(log n)
- Lemma 3: can embed ℓ1 over {0..M}^p into min_{O(log^2 n)} tree-metric, with O(log n) distortion
- For each Δ = a power of 2, take O(log n) random grids. Each grid gives a min-coordinate
[Figure: labels ∞ and Δ]
Step 4: min_low tree-metric → sparse graph-metric, distortion O(log^3 n)
- Lemma 4: suppose we have n points in min_low tree-metric, which approximates a metric up to distortion D. Then we can embed them into a graph-metric of size Õ(n) with distortion D
Step 5: sparse graph-metric → ℓ1^low, distortion O(log n)
- Lemma 5: given a graph with m edges, can embed the graph-metric into ℓ1^low with O(log n) distortion in Õ(m) time
- Just implement Bourgain’s embedding:
  - Choose O(log^2 n) sets B_i
  - Need to compute the distance from each node to each B_i
  - For each B_i, can compute its distance to every node using Dijkstra’s algorithm in Õ(m) time
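The steps above can be sketched with a multi-source Dijkstra run per random set. This is a minimal illustration under assumptions: the graph is connected and undirected, given as an adjacency dict `adj[u] = [(v, weight), ...]`; sets B are sampled by keeping each node with probability 2^-j; an empty sample is replaced by a single random node; and coordinates are left unnormalized. The function names are illustrative.

```python
import heapq
import math
import random


def distances_to_set(adj, sources):
    """Multi-source Dijkstra: distance from every node to its nearest source."""
    dist = {u: math.inf for u in adj}
    heap = []
    for s in sources:
        dist[s] = 0.0
        heap.append((0.0, s))
    heapq.heapify(heap)
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist[u]:
            continue  # stale heap entry
        for v, w in adj[u]:
            nd = d + w
            if nd < dist[v]:
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return dist


def bourgain_embedding(adj, rng):
    """Map each node to the vector of its distances to ~log^2 n random sets."""
    nodes = list(adj)
    levels = max(1, math.ceil(math.log2(len(nodes))))
    coords = {u: [] for u in nodes}
    for j in range(1, levels + 1):
        for _ in range(levels):  # repetitions per sampling rate
            B = [u for u in nodes if rng.random() < 2.0 ** -j]
            if not B:  # assumption: avoid an empty set (infinite distances)
                B = [rng.choice(nodes)]
            dist = distances_to_set(adj, B)  # one Dijkstra run per set
            for u in nodes:
                coords[u].append(dist[u])
    return coords


# Path graph 0 - 1 - 2 - 3 with unit weights.
adj = {0: [(1, 1.0)], 1: [(0, 1.0), (2, 1.0)], 2: [(1, 1.0), (3, 1.0)], 3: [(2, 1.0)]}
emb = bourgain_embedding(adj, random.Random(0))
print(len(emb[0]))  # 4
```

Each coordinate is 1-Lipschitz in the graph distance, so dividing the ℓ1 distance by the number of coordinates never overestimates the true distance; the O(log n) loss is on the contraction side.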
Summary of Main Lemma
- Chain of embeddings (distortion noted on each arrow):
  TEMD over n sets A_i
    → [O(log^2 n)] min_low ℓ1^high
    → [O(1)] min_low ℓ1^low
    → [O(log n)] min_low tree-metric        (up to here: oblivious)
    → [O(log^3 n)] sparse graph-metric
    → [O(log n)] ℓ1^low                     (last two steps: non-oblivious)
- Min-product helps to get low dimension (~ small-size sketch)
  - bypasses the impossibility of dimensionality reduction in ℓ1
- OK that it is not a metric, as long as it is close to a metric
Conclusion + a question
- Theorem: can compute ed(x,y) in n·2^Õ(√log n) time with 2^Õ(√log n) approximation
- Question: can we do the following “oblivious” dimensionality reduction in ℓ1?
  - Given n, construct a randomized embedding φ: ℓ1^M → ℓ1^{polylog n} such that for any v_1…v_n ∈ ℓ1^M, with high probability, φ has distortion log^O(1) n on these vectors
  - If φ exists, it cannot be linear [Charikar-Sahai’02]