Slide 1: On Embedding Edit Distance into L_1
Robert Krauthgamer (Weizmann Institute and IBM Almaden)
Based on joint work (i) with Moses Charikar, (ii) with Yuval Rabani, (iii) with Parikshit Gopalan and T.S. Jayram, and (iv) with Alex Andoni.
Slide 2: Edit Distance
x ∈ Σ^n, y ∈ Σ^m. ED(x,y) = minimum number of character insertions, deletions, and substitutions that transform x into y (a.k.a. Levenshtein distance).
Examples: ED(00000, 1111) = 5; ED(01010, 10101) = 2.
Applications: genomics, text processing, web search.
For simplicity, assume m = n.
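The edit-distance values above can be checked with the textbook dynamic program; a minimal sketch (the standard Wagner-Fischer recurrence, not code from the talk):

```python
def edit_distance(x, y):
    """Levenshtein distance via the standard dynamic program,
    using a single rolling row of the DP table."""
    n, m = len(x), len(y)
    dp = list(range(m + 1))                  # dp[j] = ED(x[:0], y[:j])
    for i in range(1, n + 1):
        prev, dp[0] = dp[0], i               # prev = ED(x[:i-1], y[:j-1])
        for j in range(1, m + 1):
            cur = dp[j]
            dp[j] = min(dp[j] + 1,                         # delete x[i-1]
                        dp[j - 1] + 1,                     # insert y[j-1]
                        prev + (x[i - 1] != y[j - 1]))     # substitute
            prev = cur
    return dp[m]

assert edit_distance("00000", "1111") == 5
assert edit_distance("01010", "10101") == 2
```

Both slide examples agree with this computation: 00000 and 1111 share no symbol, so ED equals max(5,4) = 5, while 01010 becomes 10101 by one deletion and one insertion.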
Slide 3: Embedding into L_1
An embedding of a metric space (X,d) into l_1 is a map f : X → l_1. It has distortion K ≥ 1 if
  d(x,y) ≤ ||f(x) − f(y)||_1 ≤ K·d(x,y)  for all x,y ∈ X.
A very powerful concept (when the distortion is small).
Goal: Embed edit distance into l_1 with small distortion.
Motivation:
- Reduce algorithmic problems to l_1, e.g. nearest-neighbor search.
- Study a simple metric space without a norm, e.g. the Hamming cube with cyclic shifts.
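The definition can be made concrete with a small numeric check. A sketch (my helper, not from the talk) using the scale-invariant form of distortion — maximum expansion ratio divided by minimum — with the slide's Hamming-cube example:

```python
from itertools import product

def distortion(points, d, f):
    """Smallest K (after optimal rescaling of f) such that
    d(x,y) <= ||f(x)-f(y)||_1 <= K * d(x,y) for all x != y."""
    ratios = [sum(abs(a - b) for a, b in zip(f(x), f(y))) / d(x, y)
              for x, y in product(points, repeat=2) if x != y]
    return max(ratios) / min(ratios)

# The Hamming cube {0,1}^3 under the identity map embeds isometrically:
cube = list(product([0, 1], repeat=3))
hamming = lambda u, v: sum(a != b for a, b in zip(u, v))
assert distortion(cube, hamming, lambda x: x) == 1.0
```

Rescaling f by a constant leaves this quantity unchanged, which is why distortion is reported as a single ratio K.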
Slide 4: Known Results for Edit Distance
Embed ({0,1}^n, ED) into L_1:
- Upper bound: 2^{O(√log n)} [Ostrovsky-Rabani'05]; previously O(n^{2/3}) [Bar-Yossef-Jayram-K.-Kumar'04].
- Lower bound: Ω(log n) [K.-Rabani'06]; previously (log n)^{1/2−o(1)} [Khot-Naor'05] and 3/2 [Andoni-Deza-Gupta-Indyk-Raskhodnikova'03].
A large gap, despite significant effort!
Slide 5: Submetrics (Restricted Strings)
Why focus on submetrics of edit distance?
- They may admit smaller distortion.
- Partial progress towards the general case.
- A framework for analyzing non-worst-case instances; for example (à la computational biology), handle only "typical" strings.
Class 1: A string is k-non-repetitive if all its k-substrings are distinct. A random 0-1 string is WHP (2 log n)-non-repetitive; this yields a submetric containing a 1 − o(1) fraction of the strings.
Class 2: The Ulam metric = edit distance on all permutations (here Σ = {1,…,n}). Every permutation is 1-non-repetitive.
Note: k-non-repetitive strings embed into the Ulam metric with distortion k.
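Class 1 membership is easy to test directly; a minimal sketch (the function name is mine, not from the talk):

```python
def is_k_nonrepetitive(s, k):
    """True iff all length-k substrings of s are distinct."""
    subs = [s[i:i + k] for i in range(len(s) - k + 1)]
    return len(subs) == len(set(subs))

assert is_k_nonrepetitive("31425", 1)        # permutations are 1-non-repetitive
assert not is_k_nonrepetitive("0101", 2)     # the 2-substring "01" occurs twice
assert is_k_nonrepetitive("0101", 3)         # "010" and "101" are distinct
```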
Slide 6: Known Results for the Ulam Metric
Embed the Ulam metric into L_1 (near-tight!):
- Upper bound: O(log n) [Charikar-K.'06] (new proof by [Gopalan-Jayram-K.]).
- Lower bound: Ω(log n / log log n) [Andoni-K.'07] (actually qualitatively stronger).
Compare with embedding ({0,1}^n, ED) into L_1 (large gap): upper bound 2^{O(√log n)} [Ostrovsky-Rabani'05], lower bound Ω(log n) [K.-Rabani'06].
Slide 7: Embedding of Permutations
Theorem [Charikar-K.'06]: The Ulam metric of dimension n embeds into l_1 with distortion O(log n).
Proof. Define f with one coordinate per pair of symbols a < b, where f_{a,b}(P) = 1 / (P^{-1}(b) − P^{-1}(a)).
Claim 1: ||f(P) − f(Q)||_1 ≤ O(log n) · ED(P,Q).
- Suppose Q is obtained from P by moving one symbol, say s; the general case then follows by applying the triangle inequality to P, P', P'', …, Q.
- The total contribution of coordinates with s ∈ {a,b} is 2 Σ_k (1/k) ≤ O(log n); that of the other coordinates is Σ_k k·(1/k − 1/(k+1)) ≤ O(log n).
Intuition: sign(f_{a,b}(P)) indicates whether a appears before b in P; thus |f_{a,b}(P) − f_{a,b}(Q)| "measures" whether {a,b} is an inversion in P vs. Q.
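Claim 1 can be sanity-checked numerically. The slide's defining formula for f did not survive transcription, so the pairwise-reciprocal form below — f_{a,b}(P) = 1/(P^{-1}(b) − P^{-1}(a)) — is a reconstruction from the stated intuition (sign encodes order, magnitude decays with the position gap); treat it as an assumption rather than a verbatim quote of [Charikar-K.'06]:

```python
from itertools import combinations

def f(P):
    """One coordinate per symbol pair {a,b}: the reciprocal of the signed
    position gap (ASSUMED form of the Charikar-K. embedding)."""
    pos = {s: i for i, s in enumerate(P)}
    return {(a, b): 1.0 / (pos[b] - pos[a])
            for a, b in combinations(sorted(P), 2)}

def l1(u, v):
    return sum(abs(u[c] - v[c]) for c in u)

P = list(range(8))
Q = [1, 2, 3, 4, 5, 6, 7, 0]        # one symbol moved, so ED(P,Q) <= 2
delta = l1(f(P), f(Q))
# Only pairs involving symbol 0 change; delta == 2*(1 + 1/2 + ... + 1/7),
# i.e. Theta(log n), consistent with Claim 1.
assert abs(delta - 2 * sum(1 / k for k in range(1, 8))) < 1e-9
```

Moving one of n = 8 symbols to the far end thus changes the image by about 2·H_7 ≈ 5.19 in l_1, a logarithmic cost per edit operation.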
Slide 8: Embedding of Permutations (cont.)
Claim 2: ||f(P) − f(Q)||_1 ≥ ½ · ED(P,Q).
- Assume w.l.o.g. that P = identity.
- Edit Q into an increasing sequence (and thus into P) using quicksort: choose a random pivot, delete all characters inverted with respect to the pivot, and repeat recursively on the left and right portions.
- The surviving subsequence is increasing, so ED(P,Q) ≤ 2 · #deletions.
- Now argue ||f(P) − f(Q)||_1 ≥ E[#quicksort deletions] ≥ ½ · ED(P,Q): for every inversion (a,b) in Q, Pr[a deleted "by" pivot b] ≤ 1/|Q^{-1}[a] − Q^{-1}[b] + 1| ≤ 2·|f_{a,b}(P) − f_{a,b}(Q)|.
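The quicksort editing scheme of Claim 2 is easy to simulate; a sketch (identity target P, deletions counted recursively; the naming is mine, not the talk's):

```python
import random

def quicksort_deletions(seq):
    """Claim 2's editing scheme: pick a random pivot, delete every element
    inverted w.r.t. it, recurse on both sides. The survivors form an
    increasing subsequence, so ED(identity, seq) <= 2 * (deletions)."""
    if len(seq) <= 1:
        return 0
    i = random.randrange(len(seq))
    p = seq[i]
    left = [x for x in seq[:i] if x < p]       # non-inverted prefix elements
    right = [x for x in seq[i + 1:] if x > p]  # non-inverted suffix elements
    deleted = len(seq) - 1 - len(left) - len(right)
    return deleted + quicksort_deletions(left) + quicksort_deletions(right)

random.seed(0)
assert quicksort_deletions(list(range(10))) == 0         # already increasing
assert quicksort_deletions(list(range(9, -1, -1))) == 9  # fully reversed
```

A sorted sequence needs no deletions, while a fully reversed one keeps only the pivot at each level, for n − 1 deletions in total.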
Slide 9: Lower Bound for 0-1 Strings
Theorem [K.-Rabani'06]: Embedding ({0,1}^n, ED) into L_1 requires distortion Ω(log n).
Proof sketch: Suppose ({0,1}^n, ED) embeds with distortion D ≥ 1, and let V = {0,1}^n. By the cut-cone characterization of L_1, the embedding f can be written as a nonnegative combination of cut metrics; hence, for every pair of symmetric probability distributions μ₀ and μ₁ over V × V, averaging yields a comparison inequality between the μ₀- and μ₁-measures of some cut, denoted (*) below.
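The cut-cone identity and the inequality (*) were displayed as images and are missing from this transcript. Writing μ₀ for the uniform distribution and μ₁ for the "close-pair" distribution chosen next, a standard reconstruction consistent with the bounds quoted on the following slide (RHS = O(D/n), LHS = Ω(log n)/n per cut) is the following; the exact form is my assumption:

```latex
% Cut-cone: every finite L_1 metric is a nonnegative combination of cut metrics.
\|f(x)-f(y)\|_1 \;=\; \sum_{A \subseteq V} \lambda_A \,
  \bigl|\mathbf{1}_A(x)-\mathbf{1}_A(y)\bigr|, \qquad \lambda_A \ge 0 .

% Averaging over (x,y) ~ mu_1 and (x,y) ~ mu_0, and using
% ED(x,y) <= ||f(x)-f(y)||_1 <= D * ED(x,y), some cut A attains:
\exists\, A \subseteq V:\qquad
\mathbb{E}_{(x,y)\sim\mu_1}\bigl|\mathbf{1}_A(x)-\mathbf{1}_A(y)\bigr|
\;\le\; D \cdot
\frac{\mathbb{E}_{\mu_1}[\mathrm{ED}(x,y)]}{\mathbb{E}_{\mu_0}[\mathrm{ED}(x,y)]}
\cdot
\mathbb{E}_{(x,y)\sim\mu_0}\bigl|\mathbf{1}_A(x)-\mathbf{1}_A(y)\bigr|.
\tag{$*$}
```

With μ₀ uniform (so E_{μ₀}[ED] = Θ(n) and the last factor is at most 1) and E_{μ₁}[ED] = O(1), the right side is O(D/n); the Main Lemma's Ω(log n)/n lower bound on the left side for every cut A then forces D = Ω(log n).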
Slide 10: Lower Bound for 0-1 Strings (cont.)
We choose:
- μ₀ = uniform over V × V;
- μ₁ = ½(μ_H + μ_S), where μ_H = a random point plus a random bit flip (uniform over E_H = {(x,y) : ||x − y||_1 = 1}), and μ_S = a random point plus a cyclic shift (uniform over E_S = {(x, S(x))}).
The RHS of (*) evaluates to O(D/n) by a counting argument.
Main Lemma: For all A ⊆ V, the LHS of (*) is Ω(log n)/n. (Proved via analysis of Boolean functions on the hypercube.)
Slide 11: Lower Bound for 0-1 Strings (cont.)
Recall μ₁ = ½(μ_H + μ_S), where μ_H = random point + random bit flip and μ_S = random point + cyclic shift.
Lemma: For all A ⊆ V, the LHS of (*) is Ω(log n)/n.
Proof sketch: Assume the contrary, and define f = 1_A.
Slide 12: Lower Bound for 0-1 Strings (cont.)
Claim: I_j ≥ 1/n^{1/8} implies I_{j+1} ≥ 1/(2n^{1/8}).
Proof: Flipping bit j and then applying the cyclic shift S is the same as applying S and then flipping bit j+1; that is, the square x → x + e_j → S(x + e_j) and x → S(x) → S(x) + e_{j+1} commutes: S(x + e_j) = S(x) + e_{j+1}.
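The quantity I_j is used but never defined on the slides; as a reading aid, the standard notion of the influence of coordinate j on f = 1_A (standard definition, not from the source) is:

```latex
I_j(f) \;=\; \Pr_{x \in \{0,1\}^n}\bigl[\, f(x) \neq f(x \oplus e_j) \,\bigr],
\qquad f = \mathbf{1}_A .
```

The commuting square above maps each pair (x, x + e_j) witnessing I_j to a pair (S(x), S(x) + e_{j+1}) witnessing I_{j+1}, which is what transfers influence from coordinate j to coordinate j+1.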
Slide 13: Communication Complexity Approach
Distance Estimation Problem: Alice holds x ∈ Σ^n and Bob holds y ∈ Σ^n; decide whether d(x,y) ≥ R or d(x,y) ≤ R/A.
Communication complexity model: a two-party protocol with shared randomness for this promise (gap) version; A = approximation factor; CC_A = minimum number of bits exchanged to decide w.h.p.
Previous communication lower bounds: l_∞ [Saks-Sun'02, Bar-Yossef-Jayram-Kumar-Shivakumar'04], l_1 [Woodruff'04], Earthmover [Andoni-Indyk-K.'07].
Slide 14: Communication Bounds for Edit Distance
Theorem [Andoni-K.'07]: There is a tradeoff between approximation and communication for edit distance.
Contrast with Hamming distance, where CC_{1+ε} = Θ(1/ε²) [Kushilevitz-Ostrovsky-Rabani'98], [Woodruff'04].
This is the first computational model in which edit distance is provably harder than Hamming distance!
Corollary 1: Approximation A = O(1) requires CC_A ≥ Ω(log log n).
Corollary 2: Communication CC_A = O(1) requires A ≥ Ω*(log n).
Implications for embeddings: Embedding ED into L_1 (or squared-L_2) requires distortion Ω*(log n); furthermore, this holds both for 0-1 strings and for permutations (Ulam).
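For contrast, the Hamming-distance upper bound CC_{1+ε} = O(1/ε²) can be illustrated with a toy shared-randomness sketch. This is in the spirit of [Kushilevitz-Ostrovsky-Rabani'98] but is my simplification (names and parameters are illustrative), shown for the gap R vs. R/2: each party sends k one-bit parities over shared random coordinate subsets, and the referee thresholds the disagreement rate.

```python
import random

def hamming_protocol(x, y, R, k=2500, seed=1):
    """Toy sketch protocol for the Hamming gap problem: decide whether
    ||x - y||_H >= R ("far") or <= R/2 ("close") from k one-bit parities
    computed over shared random coordinate subsets."""
    rng = random.Random(seed)                 # the shared randomness
    p = 1.0 / (2 * R)                         # subset sampling rate
    masks = [[rng.random() < p for _ in x] for _ in range(k)]
    parity = lambda z, m: sum(b for b, keep in zip(z, m) if keep) % 2
    diff = sum(parity(x, m) != parity(y, m) for m in masks) / k
    # Pr[one-bit sketches differ] = (1 - (1-2p)^D)/2, increasing in the
    # distance D; threshold halfway between the "far" and "close" values.
    far_p = (1 - (1 - 2 * p) ** R) / 2
    close_p = (1 - (1 - 2 * p) ** (R // 2)) / 2
    return diff > (far_p + close_p) / 2       # True means "far"

n, R = 400, 40
x = [0] * n
far = [1] * R + [0] * (n - R)                 # Hamming distance exactly R
close = [1] * (R // 2) + [0] * (n - R // 2)   # Hamming distance exactly R/2
assert hamming_protocol(x, far, R)
assert not hamming_protocol(x, close, R)
```

Distinguishing a (1+ε) gap this way needs the disagreement estimate to resolve a Θ(ε) difference in probabilities, hence k = Θ(1/ε²) repetitions; the theorem above says no such distance-independent sketch exists for edit distance.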
Slide 15: Proof Outline
Step 1 [Yao's minimax theorem]: Reduce to distributional complexity. If CC_A ≤ k, then for every two distributions μ_far, μ_close there is a k-bit deterministic protocol with success probability ≥ 2/3.
Step 2 [Andoni-Indyk-K.'07]: Reduce to 1-bit protocols. Further, there are Boolean functions s_A, s_B : Σ^n → {0,1} with advantage Pr_{(x,y)~μ_far}[s_A(x) = s_B(y)] − Pr_{(x,y)~μ_close}[s_A(x) = s_B(y)] ≥ Ω(2^{−k}).
Step 3 [Fourier expansion]: Reduce to one Fourier level ℓ. Furthermore, s_A, s_B depend only on fixed positions j_1, …, j_ℓ.
Step 4 [Choose distributions]: Analyze (x,y) projected on these positions.
- Let μ_close, μ_far include ε-noise, to handle a high level ℓ.
- Let μ_close, μ_far include (few/more) block rotations, to handle a low level ℓ.
- Key property: the distribution of (x_{j_1}, …, x_{j_ℓ}, y_{j_1}, …, y_{j_ℓ}) is "statistically close" under μ_far vs. under μ_close.
Step 5: Reduce Ulam to {0,1}^n: a random mapping Σ → {0,1} works.
Slide 16: Summary of Known Results
Embed the Ulam metric into L_1:
- Upper bound: O(log n) [Charikar-K.'06] (new proof by [Gopalan-Jayram-K.]).
- Lower bound: Ω(log n / log log n) [Andoni-K.'07] (qualitatively much stronger).
Embed ({0,1}^n, ED) into L_1:
- Upper bound: 2^{O(√log n)} [Ostrovsky-Rabani'05].
- Lower bound: Ω(log n) [K.-Rabani'06].
Slide 17: Concluding Remarks
The computational lens: study distance-estimation problems rather than embeddings.
Open problems:
- Still a large gap for 0-1 strings.
- Variants of edit distance (e.g., edit distance with block moves).
- Rule out other algorithms (e.g., a "CC model" capturing Indyk's NNS for l_∞).
Recent progress:
- Bypass L_1 embedding by devising new techniques, e.g., using max (l_∞) products for NNS under the Ulam metric [Andoni-Indyk-K.].
- Analyze/design "good" heuristics, e.g., smoothed analysis [Andoni-K.].