1 Edit Distance and Large Data Sets. Ziv Bar-Yossef, Robert Krauthgamer, Ravi Kumar, T.S. Jayram. IBM Almaden, Technion.

2 Motivating Example: Near-Duplicate Elimination
Pipeline: Web → Crawler → Duplicate elimination → Page Repository
Syntactic clustering [Broder, Glassman, Manasse, Zweig 97]:
– Group pages into clusters of “similar” pages
– Keep one “representative” from each cluster

3 Syntactic Clustering via Sketching [Broder, Glassman, Manasse, Zweig 97]
Challenges:
– Corpus is huge (billions of pages, 10K/page)
– Streaming access
– Limited main memory
– Linear running time
Sketching: each page p is mapped to a short sketch h(p).
Locality Sensitive Hashes [Indyk, Motwani 98]: $\Pr_h[h(p) = h(q)] = \mathrm{sim}(p,q)$
Cluster: collection of pages that have a common sketch
– Can compute sketches in one pass
– Sketches can be stored and processed on a single machine

4 Shingling and Resemblance [Broder, Glassman, Manasse, Zweig 97], [Broder, Charikar, Frieze, Mitzenmacher 98]
w-shingling: $S_w(p)$ = all substrings of p of length w
$\mathrm{resemblance}_w(p,q) = \Pr_\pi[\min(\pi(S_w(p))) = \min(\pi(S_w(q)))] = \frac{|S_w(p) \cap S_w(q)|}{|S_w(p) \cup S_w(q)|}$
Here $\pi$ is a random permutation of the universe of shingles.
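The shingling and min-wise estimate above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, and a seeded SHA-256 hash stands in for a truly random permutation $\pi$, which is an assumption of this sketch.

```python
import hashlib


def shingles(text: str, w: int) -> set:
    """Return S_w(text): the set of all length-w substrings of text."""
    return {text[i:i + w] for i in range(len(text) - w + 1)}


def resemblance(p: str, q: str, w: int) -> float:
    """Exact resemblance: |S_w(p) & S_w(q)| / |S_w(p) | S_w(q)|."""
    sp, sq = shingles(p, w), shingles(q, w)
    return len(sp & sq) / len(sp | sq)


def min_hash(shingle_set: set, seed: int) -> str:
    """Minimum shingle under a seeded hash (assumed stand-in for a random permutation)."""
    def rank(s: str) -> bytes:
        return hashlib.sha256(f"{seed}:{s}".encode()).digest()
    return min(shingle_set, key=rank)


def estimate_resemblance(p: str, q: str, w: int, trials: int = 200) -> float:
    """Fraction of independent 'permutations' under which the two minima agree."""
    sp, sq = shingles(p, w), shingles(q, w)
    hits = sum(min_hash(sp, s) == min_hash(sq, s) for s in range(trials))
    return hits / trials


if __name__ == "__main__":
    a = "the quick brown fox jumps"
    b = "the quick brown dog jumps"
    print(resemblance(a, b, 3), estimate_resemblance(a, b, 3))
```

Both inputs must have length at least w, otherwise the shingle set is empty and `min` raises an error. Each minimum is a one-element sketch of the page; more trials tighten the estimate.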

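5 The Sketching Model
Alice holds x and Bob holds y. Using shared randomness, Alice sends a sketch $\phi(x)$ and Bob sends a sketch $\phi(y)$ to a referee, who decides between $d(x,y) \le k$ and $d(x,y) \ge r$.
k vs. r Gap Problem (approximation):
– Promise: $d(x,y) \le k$ or $d(x,y) \ge r$
– Goal: decide which of the two holds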

6 Applications of Sketching
Large data sets:
– Clustering
– Nearest Neighbor schemes
– Data streams
Management of Files over the Network:
– Differential backup
– Synchronization
Theory:
– Low distortion embeddings
– Simultaneous messages communication complexity

7 Known Sketching Schemes
– Resemblance [Broder, Glassman, Manasse, Zweig 97], [Broder, Charikar, Frieze, Mitzenmacher 98]
– Hamming distance [Kushilevitz, Ostrovsky, Rabani 98], [Indyk, Motwani 98], [Feigenbaum, Ishai, Malkin, Nissim, Strauss, Wright 01]
– Cosine similarity [Charikar 02]
– Earth mover distance [Charikar 02]
In this talk: Edit Distance

8 Edit Distance
$x \in \Sigma^n$, $y \in \Sigma^m$
ED(x,y): minimum number of character insertions, deletions and substitutions that transform x to y.
Examples:
– ED(00000, 1111) = 5
– ED(01010, 10101) = 2
Applications: Genomics, Text processing, Web search
For simplicity: m = n, $\Sigma = \{0,1\}$.
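For reference, the definition above corresponds to the textbook quadratic dynamic program. The sketch below is a standard formulation rather than anything specific to this talk, and the function name is illustrative. It reproduces the two examples on the slide.

```python
def edit_distance(x: str, y: str) -> int:
    """Minimum number of insertions, deletions and substitutions turning x into y."""
    # prev[j] holds the distance between the processed prefix of x and y[:j].
    prev = list(range(len(y) + 1))
    for i, cx in enumerate(x, start=1):
        curr = [i]  # turning x[:i] into the empty string takes i deletions
        for j, cy in enumerate(y, start=1):
            curr.append(min(
                prev[j] + 1,               # delete cx
                curr[j - 1] + 1,           # insert cy
                prev[j - 1] + (cx != cy),  # substitute, or match at no cost
            ))
        prev = curr
    return prev[-1]


if __name__ == "__main__":
    print(edit_distance("00000", "1111"))   # 5
    print(edit_distance("01010", "10101"))  # 2
```

Keeping only two rows brings memory down to O(m), but the running time stays O(nm).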

9 Computing Edit Distance
Exact computation:
– Dynamic programming (1970): $O(n^2)$
– Masek and Paterson (1980): $O(n^2/\log n)$
Impractical for comparing two very long strings. Natural question 1: can we do it in linear time?
Impractical for handling massive document repositories. Natural question 2: are there constant size sketches of edit distance?
Focus of this talk: can we solve the above problems if we settle for approximation?

10 Sketching Schemes for Edit Distance
Algorithm: gap; sketch size
– Batu et al.: $O(n^\alpha)$ vs. $\Omega(n)$; $O(n^{\max(\alpha/2,\ 2\alpha - 1)})$
– This paper: k vs. $O((kn)^{2/3})$; O(1)
– This paper (non-repetitive strings): k vs. $O(k^2)$; O(1)
Negative indications:
– No known embeddings of edit distance into a normed space
– Every embedding of edit distance into $L_1$ incurs distortion $\ge 3/2$ [Andoni, Deza, Gupta, Indyk, Raskhodnikova 03]
– Weak nearest neighbor schemes [Indyk 04]

11 Hamming Distance Sketches [Kushilevitz, Ostrovsky, Rabani 98]
Ham(x,y) = # of positions in which x, y differ
Gap: k vs. 2k
Sketch size: O(1)
Shared randomness: $r_1,\dots,r_n \in \{0,1\}$ are independent and $\Pr[r_i = 1] = 1/(2k)$
Sketch: $h(x) = (\sum_i x_i r_i) \bmod 2$, $h(y) = (\sum_i y_i r_i) \bmod 2$
Analysis: $\Pr[h(x) \ne h(y)] = \Pr[h(x) + h(y) \equiv 1 \pmod 2] = \Pr[\sum_{i:\, x_i \ne y_i} r_i \equiv 1 \pmod 2] = \tfrac{1}{2}\left(1 - (1 - 1/k)^{\mathrm{Ham}(x,y)}\right)$
Full sketch: $\phi(x) = (h_1(x),\dots,h_t(x))$, $\phi(y) = (h_1(y),\dots,h_t(y))$, $t = O(1)$
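The parity sketch on this slide can be simulated as follows. This is a minimal sketch under stated assumptions: the bit probability 1/(2k) is inferred from the $(1 - 1/k)$ term in the analysis, the function names are hypothetical, and t is taken large only to make the empirical frequency visible (the scheme itself uses t = O(1)).

```python
import random


def sample_masks(n: int, k: int, t: int, seed: int) -> list:
    """Shared randomness: t masks of n bits, each bit set with probability 1/(2k) (assumed)."""
    rng = random.Random(seed)
    p = 1 / (2 * k)
    return [[1 if rng.random() < p else 0 for _ in range(n)] for _ in range(t)]


def sketch(x: list, masks: list) -> list:
    """t-bit sketch: for each mask r, the parity of sum_i x_i * r_i."""
    return [sum(xi & ri for xi, ri in zip(x, r)) % 2 for r in masks]


def differ_probability(k: int, d: int) -> float:
    """Closed form for Pr[h(x) != h(y)] when Ham(x, y) = d."""
    return 0.5 * (1 - (1 - 1 / k) ** d)


def sketch_distance(sx: list, sy: list) -> int:
    """Number of sketch coordinates on which the two sketches disagree."""
    return sum(a != b for a, b in zip(sx, sy))


if __name__ == "__main__":
    n, k, t = 64, 4, 5000
    masks = sample_masks(n, k, t, seed=1)
    x = [0] * n
    near = [1] * k + [0] * (n - k)           # Ham(x, near) = k
    far = [1] * (2 * k) + [0] * (n - 2 * k)  # Ham(x, far) = 2k
    for y, d in ((near, k), (far, 2 * k)):
        observed = sketch_distance(sketch(x, masks), sketch(y, masks)) / t
        print(d, round(observed, 3), round(differ_probability(k, d), 3))
```

The referee only needs to tell apart the two disagreement rates, at distance k and at distance 2k. These differ by a constant, which is why a constant number of parity bits is enough.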

12 Edit Distance Sketches: Basic Framework
Underlying principle: ED(x,y) is small iff x and y share many common substrings at nearby positions.
$S_x$ = set of pairs of the form $(\gamma, h(i))$
– $\gamma$: a substring of x
– h(i): a “locality sensitive” encoding of the substring’s position
$S_y$ is built from y in the same way.
ED(x,y) small iff the intersection $S_x \cap S_y$ is large (common substrings at nearby positions).

13 Basic Framework (cont.)
ED(x,y) small iff the symmetric difference $S_x \,\triangle\, S_y$ is small
– Need to estimate size of symmetric difference
– Hamming distance computation of characteristic vectors
– Use constant size sketches [KOR]
Reduced Edit Distance to Hamming Distance

14 11 22 33 11 22 33 General Case: Encoding Scheme Gap: k vs. O((kn) 2/3 ) x y B = n 2/3 /k 1/3, W = n/B 1 S x = { S y = { 234567891011121314 1234567891011121314 (  1,1), (  1,1), (  2,1), (  2,1), (  3,2), (  3,2), … … B windows of size W each.,(  i, win(i)),…,(  i, win(i)),…

15 Analysis jj ii x y 1234567891011121314 1234567891011121314 Case 1: ED(x,y) · k If  i is “unmarked”, it has a matching “companion”  j (  i,win(i)) 2 S x n S y, only if: either  i is “marked” or  i is unmarked, but win(i)  win(j) At most kB marked substrings At most k * n/W = kB companions with mismatched windows Therefore, Ham(S x,S y ) · 4kB

16 Analysis (cont.)
Case 2: $\mathrm{Ham}(S_x, S_y) \le 8kB$
– If $\sigma_i$ has a “companion” $\tau_j$ and win(i) = win(j), can align $\sigma_i$ with $\tau_j$ using at most W operations
– Otherwise, substitute first character of $\sigma_i$
– At most 8kB substrings of x have no companion
Therefore, $ED(x,y) \le 8kB + W \cdot n/B = O((kn)^{2/3})$
[Figure: x and y drawn with positions 1 to 14, showing substrings of x (indices including B+1 and 2B+1) aligned with companions in y.]
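The final bound follows from the parameter choice $B = n^{2/3}/k^{1/3}$, $W = n/B$ stated for the general case. The arithmetic, written out as a worked check:

```latex
\begin{align*}
W &= \frac{n}{B} = n \cdot \frac{k^{1/3}}{n^{2/3}} = (kn)^{1/3}, \\
kB &= k \cdot \frac{n^{2/3}}{k^{1/3}} = k^{2/3} n^{2/3} = (kn)^{2/3}, \\
W \cdot \frac{n}{B} &= W^2 = (kn)^{2/3}, \\
8kB + W \cdot \frac{n}{B} &= 9\,(kn)^{2/3} = O\big((kn)^{2/3}\big).
\end{align*}
```

Both terms are of the same order, so this choice of B balances the cost of substrings without a companion against the cost of aligning the rest.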

17 Non-repetitive Case: Encoding Scheme
Gap: k vs. O(kW)
$t \ge 1$: “non-repetitiveness” parameter, $W = O(k \cdot t)$
No substring of length t repeats within a window of size W.
Alice and Bob choose a sequence of “anchors” in a coordinated way:
– $\pi_1$: a random permutation on $\{0,1\}^t$
– $\alpha_1$: minimal length-t substring of $x_1$ (under $\pi_1$)
– $\beta_1$: minimal length-t substring of $y_1$ (under $\pi_1$)
[Figure: x and y drawn with windows $x_1, x_2$ and $y_1, y_2$ of size W and anchors 1 to 7 marked in each string.]
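The first anchor choice can be illustrated as below. This covers only the step stated on the slide (the minimal length-t substring of a window under a shared random permutation); how later windows are positioned relative to earlier anchors is not spelled out here and is left out. The function names and the seeded shuffle used as shared randomness are assumptions of this sketch.

```python
import itertools
import random


def shared_permutation(t: int, seed: int) -> dict:
    """Random permutation on {0,1}^t, returned as a map from string to rank."""
    strings = ["".join(bits) for bits in itertools.product("01", repeat=t)]
    random.Random(seed).shuffle(strings)  # both parties use the same seed
    return {s: rank for rank, s in enumerate(strings)}


def pick_anchor(window: str, t: int, rank: dict) -> tuple:
    """Position and content of the minimal length-t substring of window under rank."""
    starts = range(len(window) - t + 1)
    pos = min(starts, key=lambda i: rank[window[i:i + t]])
    return pos, window[pos:pos + t]


if __name__ == "__main__":
    rank = shared_permutation(3, seed=7)
    x1 = "0001011100"
    y1 = "1000101110"  # x1 shifted right by one position
    print(pick_anchor(x1, 3, rank), pick_anchor(y1, 3, rank))
```

Because no length-t substring repeats inside a window, the minimum is attained at a single position. Two windows containing the same set of length-t substrings therefore pick the same anchor string, even when one is a shifted copy of the other.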

18 Encoding scheme (cont.)
$S_x = \{(\sigma_1, 1), \dots, (\sigma_8, 8)\}$
$S_y = \{(\tau_1, 1), \dots, (\tau_8, 8)\}$
[Figure: anchors 1 to 7 in x and in y delimit substrings 1 to 8 in each string.]

19 11 22 33 44 55 66 77 11 22 33 44 55 66 77 11 22 33 44 55 66 77 88 Analysis Case 1: ED(x,y) · k. All anchors are “unmarked” with probability 1 - kt/W =  (1) If  i,  i are unmarked, they are aligned # of mismatching substrings · 2k Ham(S x,S y ) · 2k x y 11 22 33 44 55 66 77 88

20 11 22 33 44 55 66 77 11 22 33 44 55 66 77 88 11 22 33 44 55 66 77 11 22 33 44 55 66 77 88 Analysis (cont.) Case 2: Ham(S x,S y ) · 4k # of mismatching substrings · 4k ED(x,y) · 2 ¢ W ¢ 4k = O(k W). x y

21 Approximation in Linear Time
Algorithm: gap; time; approx. factor in O(n) time
Arbitrary strings:
– Dynamic Programming: k vs. k+1; O(kn); none
– Batu et al.: $O(n^\alpha)$ vs. $\Omega(n)$; $O(n^{\max(\alpha/2,\ 2\alpha - 1)})$; none
– Cole, Hariharan: k vs. 2k; $O(n + k^4)$; $O(n^{3/4})$
– This paper: k vs. $k^{7/4}$; O(n); $O(n^{3/7})$
Non-repetitive strings:
– Cole, Hariharan: k vs. 2k; $O(n + k^3)$; $O(n^{2/3})$
– This paper: k vs. $k^{3/2}$; O(n); $O(n^{1/3})$
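One plausible way to read the last column is sketched below. It is a reconstruction of the arithmetic, not the authors' derivation. It assumes a gap test can be run at a threshold $k^*$ of our choosing, and that $ED(x,y) \le n$ for equal-length strings (substitute every position), so answering "n" is always a valid upper bound.

```latex
\begin{align*}
\text{Gap } k \text{ vs. } k^{c} \text{ in } O(n) \text{ time:}\quad
  & \text{factor} \le \max\!\big((k^*)^{c-1},\ n/k^*\big),
    \ \text{balanced at } k^* = n^{1/c}
    \ \Rightarrow\ n^{(c-1)/c}. \\
  & c = 7/4 \ \Rightarrow\ n^{3/7}, \qquad c = 3/2 \ \Rightarrow\ n^{1/3}. \\[4pt]
\text{Time } O(n + k^{p}) \text{ for gap } k \text{ vs. } 2k\text{:}\quad
  & \text{linear only for } k \le k^* = n^{1/p}
    \ \Rightarrow\ \text{factor } n/k^* = n^{1 - 1/p}. \\
  & p = 4 \ \Rightarrow\ n^{3/4}, \qquad p = 3 \ \Rightarrow\ n^{2/3}.
\end{align*}
```

Under this reading, the entries $n^{3/7}$, $n^{1/3}$, $n^{3/4}$ and $n^{2/3}$ in the table all come from the same balancing argument.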

22 Summary and Open Problems
Designed efficient approximation schemes for edit distance.
– Best sketching and linear-time approximations to date
Subsequent work:
– $O(n^{2/3})$ distortion embedding of edit distance into $L_1$ [Indyk 04] [Rabani 04]
– Better embeddings of edit distance into $L_1$ [Ostrovsky, Rabani, 05]
– Embeddings of the Ulam metric into $L_1$ [Charikar, Krauthgamer, 05]
Open Problems:
– Sketch size lower bounds
– Constant factor approximations in linear time
– Better embeddings of edit distance
– Sketching schemes for other distance measures

23 Thank You
