Polylogarithmic Approximation for Edit Distance (and the Asymmetric Query Complexity)
Robert Krauthgamer [Weizmann Institute]
Joint with: Alexandr Andoni [Microsoft SVC] and Krzysztof Onak [CMU]
Edit Distance (Levenshtein distance)
Given two strings x, y ∈ Σⁿ: ed(x,y) = minimum number of character operations (insertion/deletion/substitution) that transform x into y. Example: ed(banana, ananas) = 2.
Applications: computational biology, text processing, web search.
Basic task
Compute ed(x,y) for input x, y ∈ Σⁿ. O(n²) time [WF'74] via dynamic programming over the table D(i,j) = ed(x[1:i], y[1:j]) (the slide fills this table for banana vs. ananas), with recurrence
D(i,j) = min of:
  D(i-1, j-1)        if x[i] = y[j] (match)
  D(i-1, j-1) + 1    (substitution)
  D(i, j-1) + 1      (insertion)
  D(i-1, j) + 1      (deletion)
Faster algorithms?
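As a concrete illustration (not part of the original deck), here is a minimal Python implementation of the recurrence above (Wagner-Fischer); the function name and the two-row memory optimization are my own choices, and the slide's banana/ananas example is used as a sanity check.

```python
def edit_distance(x: str, y: str) -> int:
    """Wagner-Fischer DP: D[i][j] = ed(x[:i], y[:j])."""
    n, m = len(x), len(y)
    # Only the previous row of the DP table is needed at any time.
    prev = list(range(m + 1))          # D[0][j] = j (insert j characters)
    for i in range(1, n + 1):
        curr = [i] + [0] * m           # D[i][0] = i (delete i characters)
        for j in range(1, m + 1):
            curr[j] = min(
                prev[j - 1] + (x[i - 1] != y[j - 1]),  # match / substitution
                curr[j - 1] + 1,                        # insertion
                prev[j] + 1,                            # deletion
            )
        prev = curr
    return prev[m]

assert edit_distance("banana", "ananas") == 2
```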
Faster Algorithms?
Compute ed(x,y) for given x, y ∈ Σⁿ:
O(n²) time [WF'74]
O(n²/log² n) time [MP'80]
Linear time (or near-linear)? Known only for specific cases (average, smoothed, restricted inputs) and variants (block edit distance etc.) [U'83, LV'85, M'86, GG'88, GP'89, UW'90, CL'90, CH'98, LMS'98, U'85, CL'92, N'99, CPSV'00, MS'00, CM'02, AK'08, BF'08, …]
2^{Õ(√log n)} approximation [OR'05, AO'09], improving earlier n^c-approximations [BEKMRRS'03, BJKK'04, BES'06]
The same 2^{Õ(√log n)} "barrier" appears for related tasks: nearest neighbor search (text indexing), embedding into normed spaces, sketching [OR'05]
Results I
Theorem 1: Can approximate ed(x,y) within a (log n)^{O(1/ε)} factor in time n^{1+ε} (for any ε>0).
This is an exponential improvement over the previous factor 2^{Õ(√log n)}.
Fallout from the study of the asymmetric query model…
Approach: asymmetric query model
"Compress" one string, x, down to n^ε information, then use dynamic programming to compute ed(x,y) in n^{1+ε} time.
How to compress? Carefully subsample x…
Focus on the sample size (number of queried positions) in x, for a fixed y, and obtain near-tight bounds.
Results II: Asymmetric Query Complexity
Problem: Decide ed(x,y) ≥ n/10 vs. ed(x,y) ≤ n/A.
Complexity = number of queries into x (unlimited access to y).
Bounds (the slide plots the number of queries against the approximation factor A):
Approximation (log n)^{O(1/ε)}: # queries O(n^ε), with a lower bound of Ω(n^{ε/loglog n}).
Approximation A ∈ [n^{1/(t+1)}, n^{1/t-ε}]: # queries O(log^t n) and Ω(log^t n).
Upper bound
Theorem 2: Can distinguish ed(x,y) ≥ n/10 vs. ed(x,y) ≤ n/A for approximation A = (log n)^{O(1/ε)} with n^ε queries into x (for any ε>0).
Proof structure:
1. Characterize edit distance by a "tree distance" T_xy with degree parameter b ≥ 2; T_xy ≈ ed(x,y) up to a 6b·log_b n factor.
2. Prune the tree to subsample positions of x.
Step 1: Tree distance
Partition x into b blocks, recursively, for h = log_b n levels (the slide shows the resulting tree of substrings of x, each matched against a window of y).
T_i(s,u) = T-distance between x[s:s+ℓ_i] and y[u:u+ℓ_i], where ℓ_i is the block length at level i.
Tree distance: recursive definition
Recall T_i(s,u) = distance between x[s:s+ℓ_i] and y[u:u+ℓ_i].
Base case: T_h(s,u) = Hamming(x[s], y[u]).
Recursive case (roughly): each of the b child blocks of x[s:s+ℓ_i] is matched against a shifted window of y, paying the shift plus the child's T-distance; see the sketch below.
Output: T_xy = T_0(s=1, u=1).
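The exact recursive rule is not spelled out in the transcribed text; the following Python sketch shows one natural reading of it (each child block of x is matched against the correspondingly positioned block of y shifted by some r, at cost |r| plus the child's T-distance, minimized over r). The function name, shift range, and out-of-range handling are my own simplifications, not the paper's definition.

```python
def tree_distance(x: str, y: str, b: int = 2) -> int:
    """Hedged sketch of the hierarchical T-distance.

    T(s, u, L) compares x[s:s+L] with y[u:u+L] at block length L.
    Base case: Hamming distance on single characters.
    Recursive case (one plausible reading): split x's window into b
    child blocks; match each against a shifted window of y, paying
    |shift| plus the child's T-distance, minimized over shifts.
    """
    n = len(x)

    def T(s: int, u: int, L: int) -> int:
        if L == 1:
            # Out-of-range positions of y are treated as mismatches.
            return 0 if 0 <= u < len(y) and x[s] == y[u] else 1
        child = L // b
        total = 0
        for j in range(b):
            cs = s + j * child                      # child block start in x
            total += min(
                abs(r) + T(cs, u + j * child + r, child)
                for r in range(-child, child + 1)   # candidate shifts
            )
        return total

    return T(0, 0, n)

# Toy usage (n should be a power of b for this simplified sketch):
print(tree_distance("10110100", "01101001", b=2))
```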
T-distance approximates edit distance
Lemma: T_xy ≈ ed(x,y) up to a 6b·log_b n factor.
The hierarchical decomposition is inspired by earlier approaches [BEKMRRS'03, OR'05]. All of them had an approximation recurrence of the type A(n) = c·A(n/b) + b for c ≥ 2, which solves to a factor A(n) ≥ 2^{√log n} for every choice of b (see the calculation below).
Our characterization has no multiplicative loss (c = 1): A(n) = A(n/b) + b.
Analysis inspired by algorithms for smoothed edit distance [AK'08].
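To see where the 2^{√log n} barrier comes from, here is a standard calculation (not spelled out on the slide), unrolling both recurrences over the h = log_b n levels and assuming A(1) ≥ 1:

```latex
% With multiplicative loss c >= 2, the per-level loss compounds, while the
% additive term alone already forces A(n) >= b:
\[
  A(n) = c\,A(n/b) + b
  \;\Longrightarrow\;
  A(n) \ge \max\!\Bigl( c^{\log_b n},\, b \Bigr)
         = \max\!\Bigl( 2^{\frac{\log c \,\log n}{\log b}},\, 2^{\log b} \Bigr)
  \ge 2^{\Theta(\sqrt{\log n})},
\]
% with the two terms balanced at \log b = \sqrt{\log c \,\log n},
% for every choice of b.

% With c = 1 the losses only add up:
\[
  A(n) = A(n/b) + b = b \log_b n,
  \qquad\text{so } b = (\log n)^{1/\varepsilon}
  \text{ gives }
  A(n) = (\log n)^{1/\varepsilon} \cdot \frac{\varepsilon \log n}{\log\log n}
       = (\log n)^{O(1/\varepsilon)} .
\]
```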
Step 2: Compute the tree distance
For b = 2, the T-distance already gives an O(log n) approximation! BUT we only know how to compute the T-distance in Õ(n²) time.
Instead, for b = (log n)^{1/ε}, we can prune the tree to n^{O(ε)} nodes and still get a 1+ε approximation (of the T-distance).
Pruning: subsample (log n)^{O(1)} children of each node. This works only when ed(x,y) ≥ Ω(n); in general, we must subsample the tree non-uniformly, using the Precision Sampling Lemma.
Key tool: non-uniform sampling
Goal: For unknown a_1, a_2, …, a_n ∈ [0,1], estimate their sum up to an additive constant error, using only "weak" estimates ã_1, ã_2, …, ã_n.
Game between a Sum Estimator and an Adversary:
0. The estimator fixes a distribution U.
1. The adversary fixes a_1, a_2, …, a_n (unknown to the estimator).
2. The estimator picks "precisions" u_i (our algorithm: u_i ∼ U i.i.d.).
3. The adversary provides ã_1, ã_2, …, ã_n such that |a_i − ã_i| < 1/u_i.
4. The estimator reports S̃ = S̃(ã_1, …, ã_n, u_1, …, u_n) with |S̃ − ∑ a_i| < 1.
Precision Sampling
Goal: estimate ∑ a_i from {ã_i} such that |a_i − ã_i| < 1/u_i.
Precision Sampling Lemma: Can achieve, with high probability, additive error 1 and multiplicative error 1.5, with expected precision E_{u_i∼U}[u_i] = O(log n).
Inspired by a technique from [IW'05] for streaming (F_k moments). In fact, the PSL gives simple and improved algorithms for F_k moments, cascaded (mixed) norms, and ℓ_p-sampling problems [AKO'10]. Also a distant relative of Priority Sampling [DLT'07].
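The slides do not give the estimator itself; the toy simulation below is my own illustration of the flavor, not the lemma's actual construction: draw precisions u_i with 1/u_i uniform, and count indices whose weak estimate clears a margin of a few multiples of 1/u_i. This only yields a constant-factor estimate of ∑ a_i (the real PSL is sharper: multiplicative 1.5, additive 1), but the expected precision E[u_i] = O(log n) is already visible.

```python
import random

def precision_sampling_demo(a, trials=50, margin=4.0):
    """Toy illustration (not the actual PSL estimator from [AKO'10]).

    Precisions: 1/u_i is uniform in (1/n, 1], so E[u_i] = O(log n).
    Weak estimates: each a~_i is perturbed by at most 1/u_i (here the
    noise is random; the lemma allows it to be adversarial).
    Estimator: count indices with a~_i > margin/u_i; a small calculation
    shows E[count] is roughly between sum(a)/(margin+1) and
    sum(a)/(margin-1), so scaling by margin gives a constant-factor
    estimate of sum(a), up to an additive constant.
    """
    n = len(a)
    estimates, precisions = [], []
    for _ in range(trials):
        inv_u = [random.uniform(1.0 / n, 1.0) for _ in range(n)]
        u = [1.0 / v for v in inv_u]
        precisions.extend(u)
        a_tilde = [a[i] + random.uniform(-1, 1) / u[i] for i in range(n)]
        count = sum(1 for i in range(n) if a_tilde[i] > margin / u[i])
        estimates.append(margin * count)
    avg_est = sum(estimates) / trials
    avg_precision = sum(precisions) / len(precisions)
    return avg_est, avg_precision

random.seed(0)
a = [random.random() for _ in range(1000)]
est, avg_u = precision_sampling_demo(a)
print(f"true sum = {sum(a):.1f}, estimate ~ {est:.1f}, E[u_i] ~ {avg_u:.1f}")
```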
Precision Sampling for Edit Distance
Apply Precision Sampling to the tree from the characterization, recursively at each node. If a node is assigned a very weak precision, the entire subtree below it can be trimmed.
Lower Bound
Theorem 3: Achieving approximation A = O(log⁷ n) for edit distance requires asymmetric query complexity n^{Ω(1/loglog n)}, i.e., for distinguishing ed(x,y) > n/10 vs. ed(x,y) < n/(10A).
Implications:
First lower bound to expose the hardness stemming from repetitiveness in edit distance.
Contrast with edit distance on non-repetitive strings (Ulam's distance), which is empirically easier (better algorithms are known for it). Yet all previous lower bounds are essentially the same for the two variants [BEKMRRS'03, AN'10, KN'05, KR'06, AK'07, AJP'10].
The asymmetric query complexity separates them:
Ulam: 2-approximation with O(log n) queries [ACCL'04, SS'10]
Edit: requires n^{Ω(1/loglog n)} queries
Lower Bound Techniques
Core gadget: σ(·) = cyclic shift operation.
Observation: ed(x, σ^j(x)) ≤ 2j (a quick check appears in the snippet below).
Lower bound outline: exhibit a lower bound via shifts, then amplify it by "composing" the hard instance recursively.
Here we will see:
Theorem 4: The asymmetric query complexity of approximating edit distance within n^{1/2} is Ω(log² n).
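A quick check of the observation, reusing the edit_distance function sketched earlier (the 2j bound comes from deleting the j wrapped-around characters at one end and re-inserting them at the other); the example string is arbitrary:

```python
def cyclic_shift(x: str, j: int) -> str:
    """sigma^j(x): rotate x left by j positions."""
    return x[j:] + x[:j]

x = "10110100111010010110"   # arbitrary example string
for j in range(5):
    d = edit_distance(x, cyclic_shift(x, j))
    assert d <= 2 * j, (j, d)
    print(f"j={j}: ed(x, sigma^j(x)) = {d} <= {2*j}")
```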
The Shift Gadget
Lemma: Ω(log n) query lower bound for approximation A = n^{0.5}.
Hard distribution (x,y): Fix specific random-looking z_1, z_2 ∈ {0,1}ⁿ. Set y = z_1 and x = σ^j(z_1 OR z_2) for a random shift j ∈ [n^{0.5}]. If x is a shift of z_1, then ed(x,y) ≤ 2n^{0.5} [close]; if x is a shift of z_2, then ed(x,y) ≥ n/10 [far].
An algorithm is a set of queried positions Q ⊆ [n], |Q| << log n; it "reads" (z_1 OR z_2) at the positions Q+j.
Claim: Both z_1|_{Q+j} and z_2|_{Q+j} are close to the uniform distribution on {0,1}^{|Q|}, up to ~2^{|Q|}/n^{0.5} statistical distance.
Hence |Q| ≥ Ω(log n), even for approximation A = n^{0.99}.
Amplification via Substitution Product
Ω(log² n) lower bound by amplification: "compose" two shift instances.
Hard distribution (x,y): Fix z_1, z_2 ∈ {0,1}^{√n} and w_0, w_1 ∈ {0,1}^{√n}, and set y = z_1(w_0, w_1) (the substitution product: each bit z_1[i] is replaced by the block w_{z_1[i]}). Choose either z = z_1 (close) or z = z_2 (far), and let x = z(w_0, w_1), but with random shifts j ∈ [n^{1/3}] inside each block and between blocks (a construction sketch follows below).
Intuition: the algorithm must distinguish z = z_1 from z = z_2, so it must "learn" Ω(log n) positions i of z, and each such position requires reading Ω(log n) further positions in the corresponding block w_{z[i]}.
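To make the construction concrete, here is a hedged sketch (my own illustration, reusing cyclic_shift and edit_distance from the earlier snippets; block lengths, shift ranges, and the exact randomization are simplified relative to the paper) of building y = z(w_0, w_1) and a shifted copy of it:

```python
import random

def substitution_product(z: str, w0: str, w1: str) -> str:
    """y = z(w0, w1): replace each bit z[i] by the block w_{z[i]}."""
    return "".join(w1 if bit == "1" else w0 for bit in z)

def shifted_product(z: str, w0: str, w1: str, max_shift: int) -> str:
    """Sketch of x: the same product, but each block gets its own random
    cyclic shift, and the sequence of blocks is cyclically shifted too."""
    blocks = [w1 if bit == "1" else w0 for bit in z]
    blocks = [cyclic_shift(b, random.randrange(max_shift + 1)) for b in blocks]
    outer = random.randrange(max_shift + 1)
    return "".join(blocks[outer:] + blocks[:outer])

random.seed(1)
m = 16                                  # block length (sqrt(n) on the slide)
z1, w0, w1 = ("".join(random.choice("01") for _ in range(m)) for _ in range(3))
y = substitution_product(z1, w0, w1)
x = shifted_product(z1, w0, w1, max_shift=2)
print("ed(x, y) =", edit_distance(x, y), "out of n =", len(y))
```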
Towards the Full Theorem
For the full theorem: recursive composition.
Proof overview:
1. Define α-similarity of k distributions (α ≈ information per query).
2. α-similarity ⇒ query lower bound 1/α (even for adaptive algorithms).
3. The initial "shift metric" has high α-similarity (induction basis).
4. α-similarity is amplified under the substitution product (inductive step).
5. Prove that edit distance concentrates well (requires a large alphabet).
6. Reduce the large alphabet to binary (lossy, but done only once).
Conclusion
We compute ed(x,y) up to a (log n)^{O(1/ε)} approximation in n^{1+ε} time, via asymmetric query complexity (a new model).
Open questions:
Go faster / find limitations: e.g., O(log² n) approximation in n^{1+o(1)} time?
Use these insights for related problems: nearest neighbor search? sublinear-time algorithms (symmetric queries)? embeddings? communication complexity?
Further thoughts: practical ramifications? the asymmetric query model? a paradigm for "fast dynamic programming"?