Polylogarithmic Approximation for Edit Distance (and the Asymmetric Query Complexity) Robert Krauthgamer [Weizmann Institute] Joint with: Alexandr Andoni.

Presentation transcript:

Polylogarithmic Approximation for Edit Distance (and the Asymmetric Query Complexity)
Robert Krauthgamer [Weizmann Institute]
Joint with: Alexandr Andoni [Microsoft SVC], Krzysztof Onak [CMU]

Polylog. Approx. for ED and the Asymmetric Query Complexity 3
Edit Distance (Levenshtein distance)
Given two strings $x, y \in \Sigma^n$: $\mathrm{ed}(x,y)$ = minimum number of character operations (insertion/deletion/substitution) that transform x into y.
Example: ed(banana, ananas) = 2
Applications:
- Computational biology
- Text processing
- Web search

Basic task
Compute $\mathrm{ed}(x,y)$ for input $x, y \in \Sigma^n$
- $O(n^2)$ time [WF'74]
[Figure: dynamic-programming table for banana vs. ananas]
Let $D(i,j) = \mathrm{ed}(x[1:i],\, y[1:j])$. Then $D(i,j)$ is the minimum of:
- $D(i-1,j-1)$, if $x[i] = y[j]$
- $D(i-1,j-1) + 1$, if $x[i] \ne y[j]$ (substitution)
- $D(i,j-1) + 1$
- $D(i-1,j) + 1$
Faster algorithms?
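The quadratic-time recurrence for D(i,j) on this slide can be sketched in Python as follows. This is a minimal illustration rather than code from the talk: the function name `edit_distance` is an assumption, and the sketch includes the standard substitution case, since ed is defined with substitutions allowed.

```python
def edit_distance(x: str, y: str) -> int:
    """Quadratic-time dynamic program for ed(x, y) (illustrative sketch)."""
    n, m = len(x), len(y)
    # D[i][j] = ed(x[:i], y[:j]); row 0 and column 0 are the empty-prefix cases.
    D = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        D[i][0] = i  # delete all i characters of x[:i]
    for j in range(m + 1):
        D[0][j] = j  # insert all j characters of y[:j]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # Diagonal move: free on a match, cost 1 for a substitution.
            diag = D[i - 1][j - 1] + (x[i - 1] != y[j - 1])
            D[i][j] = min(diag, D[i][j - 1] + 1, D[i - 1][j] + 1)
    return D[n][m]


print(edit_distance("banana", "ananas"))  # 2
```

The table has (n+1)(m+1) entries and each is filled in constant time, which is where the O(n^2) bound comes from.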

Faster Algorithms?
Compute $\mathrm{ed}(x,y)$ for given $x, y \in \Sigma^n$
- $O(n^2)$ time [WF'74]
- $O(n^2/\log^2 n)$ time [MP'80]
Linear time (or near-linear)?
- Specific cases (average, smoothed, restricted input) and variants (block edit distance etc.) [U'83, LV'85, M'86, GG'88, GP'89, UW'90, CL'90, CH'98, LMS'98, U'85, CL'92, N'99, CPSV'00, MS'00, CM'02, AK'08, BF'08…]
- $2^{\tilde O(\sqrt{\log n})}$ approximation [OR'05, AO'09], improving the earlier $n^c$-approximation [BEKMRRS'03, BJKK'04, BES'06]
The same "barrier" of $2^{\tilde O(\sqrt{\log n})}$-approximation also appears for related tasks:
- Nearest neighbor search (text indexing), embedding into normed spaces, sketching [OR'05]

Results I
Theorem 1: Can approximate $\mathrm{ed}(x,y)$ within a $(\log n)^{O(1/\varepsilon)}$ factor in time $n^{1+\varepsilon}$ (for any $\varepsilon > 0$).
- Exponential improvement over the previous factor $2^{\tilde O(\sqrt{\log n})}$
- Fallout from the study of the asymmetric query model …

Approach: asymmetric query model
"Compress" one string, x, to $n^{\varepsilon}$ information
- Use dynamic programming to compute $\mathrm{ed}(x,y)$ in $n^{1+\varepsilon}$ time
How to compress?
- Carefully subsample x …
Focus on the sample size (number of queried positions) in x, for fixed y
- Obtain near-tight bounds

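Results II: Asymmetric Query Complexity
Problem: Decide $\mathrm{ed}(x,y) \ge n/10$ vs $\mathrm{ed}(x,y) \le n/A$
Complexity = # queries into x (unlimited access to y)
[Figure: axis of approximation A with breakpoints $n^{1-\varepsilon}$, $n^{1/2}$, $n^{1/2-\varepsilon}$, $n^{1/3}$, $n^{1/4}$, …, $n^{1/t-\varepsilon}$, $n^{1/(t+1)}$, labelled with # queries $\Theta(\log n)$, $\Theta(\log^2 n)$, $\Theta(\log^3 n)$, …, $\Theta(\log^t n)$]
Approximation | # Queries
$(\log n)^{O(1/\varepsilon)}$ | $O(n^{\varepsilon})$, $\Omega(n^{\varepsilon/\log\log n})$
$[n^{1/(t+1)},\, n^{1/t-\varepsilon}]$ | $O(\log^t n)$, $\Omega(\log^t n)$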

Upper bound
Theorem 2: Can distinguish $\mathrm{ed}(x,y) \ge n/10$ vs $\mathrm{ed}(x,y) \le n/A$ for approximation $A = (\log n)^{O(1/\varepsilon)}$ with $n^{\varepsilon}$ queries into x (for any $\varepsilon > 0$).
Proof structure:
1. Characterize edit distance by a "tree-distance" $T_{xy}$
- Parameter $b \ge 2$ (degree)
- $T_{xy} \approx \mathrm{ed}(x,y)$ up to a $6b \cdot \log n$ factor
2. Prune the tree to subsample x
[Figure: tree of degree b over the positions $x_1, x_2, \ldots, x_n$, with the sampled positions in x marked]

Step 1: Tree distance
Partition x into b blocks, recursively, for $h = \log_b n$ levels
[Figure: tree over x (b = 3 shown): root x[1:n]; children x[1:⅓n], x[⅓n:⅔n], x[⅔n:n]; …; leaves x[1], x[2], x[3], …; a block of x matched against y[u:u+⅓n] inside y[1:n]]
$T_i(s,u)$ = T-distance between $x[s:s+\ell_i]$ and $y[u:u+\ell_i]$, where $\ell_i$ is the block length at level i

Tree distance: recursive definition
Recall $T_i(s,u)$ = distance between $x[s:s+\ell_i]$ and $y[u:u+\ell_i]$
- Base case: $T_h(s,u) = \mathrm{Hamming}(x[s], y[u])$
- Output: $T_{xy} = T_0(s=1, u=1)$
[Figure: block $x[s:s+\ell_i]$ of x aligned with block $y[u:u+\ell_i]$ of y, with a shift $r_0$]
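The recursive step of the definition did not survive in this transcript; only the base case and the output are stated. The sketch below is therefore built on an assumption: that each of the b child blocks may be matched against y at its own shift r, paying |r| for the shift, i.e. $T_i(s,u) = \sum_{j=0}^{b-1} \min_r \big[ T_{i+1}(s + j\ell_{i+1},\, u + j\ell_{i+1} + r) + |r| \big]$. The function name `tree_distance`, the 0-based indexing, the cost of 1 for positions of y that fall out of range, and the cap $|r| \le \ell_{i+1}$ are likewise choices made for this illustration. It is a brute-force evaluation meant for tiny inputs, not the fast algorithm of the talk.

```python
from functools import lru_cache


def tree_distance(x: str, y: str, b: int = 2) -> int:
    """Brute-force T-distance under the ASSUMED recursion described above."""
    n = len(x)
    if len(y) != n:
        raise ValueError("x and y must have the same length")
    # h = log_b n levels; require n to be an exact power of b for simplicity.
    h, size = 0, 1
    while size < n:
        size *= b
        h += 1
    if size != n:
        raise ValueError("length must be a power of b")
    block_len = [n // b**i for i in range(h + 1)]  # block_len[i] = l_i

    @lru_cache(maxsize=None)
    def T(i: int, s: int, u: int) -> int:
        if i == h:
            # Base case from the slide: Hamming distance of single characters.
            # Assumption: an out-of-range position in y counts as a mismatch.
            return 0 if 0 <= u < n and x[s] == y[u] else 1
        child = block_len[i + 1]
        total = 0
        for j in range(b):
            # Assumed recursive step: each child picks its own shift r, at cost |r|.
            # A child's distance is at most its length, so |r| > child never helps.
            total += min(
                T(i + 1, s + j * child, u + j * child + r) + abs(r)
                for r in range(-child, child + 1)
            )
        return total

    return T(0, 0, 0)


print(tree_distance("abab", "baba", b=2))  # 2
```

On this small example the value coincides with ed(abab, baba) = 2; in general the lemma only promises agreement up to the stated approximation factor.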

T-distance approximates edit distance
Lemma: $T_{xy} \approx \mathrm{ed}(x,y)$ up to a $6b \cdot \log_b n$ factor.
Hierarchical decomposition inspired by earlier approaches [BEKMRRS'03, OR'05]
- All had an approximation recurrence of the type $A(n) = c \cdot A(n/b) + b$ for $c \ge 2$
- This solves to a factor $A(n) \ge 2^{\sqrt{\log n}}$ for every choice of b
Our characterization has no multiplicative loss (c=1): $A(n) = A(n/b) + b$
- Analysis inspired by algorithms for smoothed edit distance [AK'08]
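A short worked derivation may help to see why the two recurrences behave so differently. It is a sketch under two assumptions not stated on the slide: logarithms are base 2, and the base case satisfies $A(1) \ge 1$.

```latex
% c = 1: unroll over the h = \log_b n levels; the loss is only additive.
A(n) = A(n/b) + b = A(n/b^2) + 2b = \dots = A(1) + b \log_b n = O(b \log_b n).

% c >= 2: the multiplicative loss compounds once per level.
A(n) = c\,A(n/b) + b \;\ge\; \max\{\, b,\; c^{\log_b n} A(1) \,\}
     \;\ge\; \max\{\, b,\; 2^{\log n / \log b} \,\}.

% Whatever b is, one of the two terms is at least 2^{\sqrt{\log n}}:
\log b \le \sqrt{\log n} \;\Longrightarrow\; 2^{\log n / \log b} \ge 2^{\sqrt{\log n}},
\qquad
\log b > \sqrt{\log n} \;\Longrightarrow\; b > 2^{\sqrt{\log n}}.
```

With c = 1 the degree b can be taken small, which is what removes the $2^{\sqrt{\log n}}$ barrier.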

Step 2: Compute the tree distance
For b=2, the T-distance gives an $O(\log n)$ approximation!
- BUT we only know how to compute the T-distance in $\tilde O(n^2)$ time
Instead, for $b = (\log n)^{1/\varepsilon}$, can prune the tree to $n^{O(\varepsilon)}$ nodes, and get a $1+\varepsilon$ approximation
Pruning: subsample $(\log n)^{O(1)}$ children out of each node
- Works only when $\mathrm{ed}(x,y) \ge \Omega(n)$
- Generally, must subsample the tree non-uniformly, using the Precision Sampling Lemma
[Figure: tree of degree b with the sampled positions in x marked]
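The parameter choice on this slide can be checked with a little arithmetic. The following is a sketch, with k denoting the number of sampled children per node (a symbol introduced here for illustration) and logarithms taken base 2.

```latex
% Height of the tree for b = (\log n)^{1/\varepsilon}:
h = \log_b n = \frac{\log n}{\log b} = \frac{\varepsilon \log n}{\log\log n}.

% Size of the pruned tree with k = (\log n)^{O(1)} = 2^{O(\log\log n)} sampled children per node:
\#\text{nodes} \;\le\; \sum_{i=0}^{h} k^i \;\le\; 2k^h
  = 2^{\,O(\log\log n)\cdot \varepsilon \log n / \log\log n}
  = 2^{O(\varepsilon \log n)} = n^{O(\varepsilon)}.

% Approximation factor of the T-distance for this b (using \log_b n \le \log n):
6b \log_b n \;\le\; 6\,(\log n)^{1/\varepsilon} \cdot \log n = (\log n)^{O(1/\varepsilon)}.
```

So a larger degree b makes the tree shallow enough for the pruned tree to stay small, at the price of a polylogarithmic approximation factor.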

Key tool: non-uniform sampling
Goal:
- For unknown $a_1, a_2, \ldots, a_n \in [0,1]$
- Estimate their sum, up to an additive constant error
- Using only "weak" estimates $\tilde a_1, \tilde a_2, \ldots, \tilde a_n$
Game between the Sum Estimator and the Adversary:
0. Sum Estimator: fix a distribution U
1. Adversary: fix $a_1, a_2, \ldots, a_n$ (unknown to the estimator)
2. Sum Estimator: pick "precisions" $u_i$ (our algorithm: $u_i \sim U$ i.i.d.)
3. Adversary: provide $\tilde a_1, \tilde a_2, \ldots, \tilde a_n$ s.t. $|a_i - \tilde a_i| < 1/u_i$
4. Sum Estimator: report $\tilde S = \tilde S(\tilde a_1, \ldots, u_1, \ldots)$ with $|\tilde S - \sum a_i| < 1$.

Precision Sampling
Goal: estimate $\sum a_i$ from $\{\tilde a_i\}$ s.t. $|a_i - \tilde a_i| < 1/u_i$.
Precision Sampling Lemma: Can achieve WHP
- additive error 1 and multiplicative error 1.5
- with expected precision $E_{u_i \sim U}[u_i] = O(\log n)$.
Inspired by a technique from [IW'05] for streaming ($F_k$ moments)
- In fact, PSL gives simple & improved algorithms for $F_k$ moments, cascaded (mixed) norms, $\ell_p$-sampling problems [AKO'10]
Also a distant relative of Priority Sampling [DLT'07]

Precision Sampling for Edit Distance
Apply Precision Sampling to the tree from the characterization, recursively at each node
If a node has very weak precision, can trim the entire sub-tree

Lower Bound Theorem
Theorem 3: Achieving approximation $A = O(\log^7 n)$ for edit distance requires asymmetric query complexity $n^{\Omega(1/\log\log n)}$.
- I.e., distinguishing $\mathrm{ed}(x,y) > n/10$ vs $\mathrm{ed}(x,y) < n/(10A)$
Implications:
First lower bound to expose hardness from repetitiveness in edit distance
Contrast with edit distance on non-repetitive strings (Ulam's distance)
- Empirically easier (better algorithms are known for it)
- Yet, all previous lower bounds are essentially equivalent for the two variants [BEKMRRS'03, AN'10, KN'05, KR'06, AK'07, AJP'10]
But asymmetric query complexity:
- Ulam: 2-approx. with $O(\log n)$ queries [ACCL'04, SS'10]
- Edit: requires $n^{\Omega(1/\log\log n)}$ queries

Lower Bound Techniques
Core gadget: $\sigma(\cdot)$ = cyclic shift operation
- Observation: $\mathrm{ed}(x, \sigma^j(x)) \le 2j$
Lower bound outline:
- Exhibit a lower bound via shifts
- Amplification by "composing" the hard instance recursively
We will see here:
Theorem 4: The asymmetric query complexity of approximation $n^{1/2}$ to edit distance is $\Omega(\log^2 n)$
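The observation about cyclic shifts is easy to check numerically: deleting the first j characters and appending them at the end uses 2j operations. The sketch below is illustrative only; the helper names `cyclic_shift` and `edit_distance`, and the convention that the shift rotates to the left, are assumptions made here.

```python
def edit_distance(x: str, y: str) -> int:
    """Standard quadratic DP for edit distance, keeping two rows at a time."""
    prev = list(range(len(y) + 1))
    for i, cx in enumerate(x, start=1):
        cur = [i]
        for j, cy in enumerate(y, start=1):
            cur.append(min(prev[j - 1] + (cx != cy), cur[j - 1] + 1, prev[j] + 1))
        prev = cur
    return prev[-1]


def cyclic_shift(x: str, j: int) -> str:
    """Rotate x to the left by j positions (assumed direction of the shift)."""
    if not x:
        return x
    j %= len(x)
    return x[j:] + x[:j]


x = "abracadabra"
for j in range(len(x)):
    # Delete the first j characters, then append them: at most 2j operations.
    assert edit_distance(x, cyclic_shift(x, j)) <= 2 * j
```

This is the property that makes a shifted copy "close" in the hard distribution: a shift by j costs at most 2j edits.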

The Shift Gadget
Lemma: $\Omega(\log n)$ query lower bound for approximation $A = n^{0.5}$.
Hard distribution (x,y):
- Fix specific $z_1, z_2 \in \{0,1\}^n$ (random-looking)
- Set $y = z_1$ and $x = \sigma^j(z)$, where z is either $z_1$ or $z_2$ and $j \in [n^{0.5}]$ is random
- If $x = \sigma^j(z_1)$ then $\mathrm{ed}(x,y) \le 2n^{0.5}$ [close]; if $x = \sigma^j(z_2)$ then $\mathrm{ed}(x,y) \ge n/10$ [far]
An algorithm is a set of queried positions: $Q \subset [n]$, $|Q| \ll \log n$
- It "reads" ($z_1$ or $z_2$) at positions $Q + j$
Claim: Both $z_1|_{Q+j}$ and $z_2|_{Q+j}$ are close to the uniform distribution on $\{0,1\}^{|Q|}$
- up to ~$2^{|Q|}/n^{0.5}$ statistical distance
Hence $|Q| \ge \Omega(\log n)$, even for approximation $A = n^{0.5}$

Amplification via Substitution Product
$\Omega(\log^2 n)$ lower bound by amplification: "compose" two shift instances
Hard distribution (x,y):
- Fix $z_1, z_2 \in \{0,1\}^{\sqrt n}$, $w_0, w_1 \in \{0,1\}^{\sqrt n}$, and $y = z_1 \circledast (w_0, w_1)$ (substitution)
- Choose either $z = z_1$ (close) or $z = z_2$ (far)
- $x = z \circledast (w_0, w_1)$ but with random shifts $j \in [n^{1/3}]$ inside each block and between blocks
Intuition: must distinguish $z = z_1$ from $z = z_2$
- Must "learn" $\Omega(\log n)$ positions i of z, and each requires reading $\Omega(\log n)$ further positions in the corresponding block $w_{z[i]}$
[Figure: the strings $z_1$, $w_0$, $w_1$ and the resulting x]
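The substitution step can be illustrated with a few lines of Python: each bit z[i] is replaced by the block $w_{z[i]}$, so a string z of length √n and blocks of length √n give a string of length n. This is only a sketch of the substitution itself; the function name `substitution_product` is hypothetical, and the random shifts inside and between blocks that the hard distribution applies to x are deliberately left out.

```python
def substitution_product(z: str, w0: str, w1: str) -> str:
    """Replace every bit of z by the block w0 (for '0') or w1 (for '1').

    Illustrative only: the random shifts used in the hard distribution
    are not modelled here.
    """
    blocks = {"0": w0, "1": w1}
    return "".join(blocks[bit] for bit in z)


# Two outer strings over {0,1} and two inner blocks (toy sizes, not sqrt(n)).
z1, z2 = "0110", "1001"
w0, w1 = "ab", "cd"
y = substitution_product(z1, w0, w1)       # "abcdcdab"
x_far = substitution_product(z2, w0, w1)   # "cdababcd"
print(y, x_far)
```

To tell z = z_1 from z = z_2 through x, an algorithm has to work out which block sits at a position i where z_1 and z_2 differ, which is where the second factor of log n in the intuition comes from.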

Towards the Full Theorem
For the full theorem: recursive composition
Proof overview:
1. Define $\alpha$-similarity of k distributions ($\alpha$ ≈ information per query)
2. $\alpha$-similarity $\Rightarrow$ query lower bound $1/\alpha$ (for adaptive algorithms)
3. Initial "Shift metric" has high $\alpha$-similarity (induction basis)
4. $\alpha$-similarity is amplified under the substitution product (inductive step)
5. Prove edit distance concentrates well (requires a large alphabet)
6. Can reduce the large alphabet to binary (lossy, but done once)

Conclusion
We compute $\mathrm{ed}(x,y)$ up to a $(\log n)^{O(1/\varepsilon)}$ approximation in $n^{1+\varepsilon}$ time
- Via Asymmetric Query Complexity (new model)
Open questions:
Faster algorithms, or limitations:
- E.g. $O(\log^2 n)$ approximation in $n^{1+o(1)}$ time?
Use these insights for related problems:
- Nearest Neighbor Search?
- Sublinear-time algorithms (symmetric queries)?
- Embeddings? Communication complexity?
Further thoughts:
- Practical ramifications?
- Asymmetric queries model?
- Paradigm for "fast dynamic programming"?