1 Algorithms for Large Data Sets Ziv Bar-Yossef Lecture 11 June 1, 2005


1 Algorithms for Large Data Sets Ziv Bar-Yossef Lecture 11 June 1, 2005

2 Sketching

3 Outline
- Syntactic clustering of the web
- Locality sensitive hash functions
- Resemblance and shingling
- Min-wise independent permutations
- The sketching model
- Hamming distance
- Edit distance

4 Motivation: Near-Duplicate Elimination
Many web pages are duplicates or near-duplicates of other pages:
- Mirror sites
- FAQs, manuals, legal documents
- Different versions of the same document
- Plagiarism
Duplicates are bad for search engines:
- They increase index size
- They harm the quality of search results
Question: How can we efficiently process the repository of crawled pages and eliminate (near-)duplicates?

5 Syntactic Clustering of the Web [Broder, Glassman, Manasse, Zweig 97]
U: space of all possible documents
S ⊆ U: collection of documents
sim: U × U → [0,1]: a similarity measure on documents
- If p,q are very similar, sim(p,q) is close to 1
- If p,q are very dissimilar, sim(p,q) is close to 0
- Usually sim(p,q) = 1 − d(p,q), where d(p,q) is a normalized distance between p and q
G: a graph on S:
- p,q are connected by an edge iff sim(p,q) ≥ t (t = threshold)
Goal: find the connected components of G

6 Challenges
S is huge:
- The web has about 10 billion pages
- Documents are not compressed: storing S needs many disks
- Each sim computation is costly
Documents in S should be processed in a stream
Main memory is tiny relative to |S|
Cannot afford more than O(|S|) time
How to create the graph G?
- Naively, this requires |S| passes and |S|² similarity computations

7 Sketching Schemes
T = a small set (|S| < |T| << |U|)
A sketching scheme for sim consists of:
- A compression function: a randomized mapping ρ: U → T
- A reconstruction function τ: T × T → [0,1]
- For every pair p,q, with high probability τ(ρ(p), ρ(q)) ≈ sim(p,q)

8 Syntactic Clustering by Sketching
1. P ← empty table of size |S|
2. G ← empty graph on |S| nodes
3. for i = 1,…,|S|
4.   read document p_i from the stream
5.   P[i] ← ρ(p_i)
6. for i = 1,…,|S|
7.   for j = 1,…,|S|
8.     if τ(P[i], P[j]) ≥ t
9.       add edge (i,j) to G
10. output connected components of G

9 Analysis
- Sketches can be computed in one pass
- Table P can be stored in a single file on a single machine
- Creating G requires |S|² applications of τ:
  - Easier than full-fledged computations of sim
  - Quadratic time is still a problem
- The connected-components computation is heavy but feasible

10 Locality Sensitive Hashing (LSH) [Indyk, Motwani 98]
A special kind of sketching scheme.
H = { h | h: U → T }: a family of hash functions
H is locality sensitive w.r.t. sim if for all p,q ∈ U, Pr[h(p) = h(q)] = sim(p,q)
- The probability is over the random choice of h from H
- Probability of collision = similarity between p and q

11 Syntactic Clustering by LSH
1. P ← empty table of size |S|
2. G ← empty graph on |S| nodes
3. for i = 1,…,|S|
4.   read document p_i from the stream
5.   P[i] ← h(p_i)
6. sort P and group by value
7. output groups
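The clustering-by-LSH loop can be sketched in a few lines of Python. The first-token hash `h` below is a toy stand-in chosen only for illustration; a real instance would be the min-wise hash of the later slides.

```python
from itertools import groupby

# Syntactic clustering by LSH: hash each document once, then sort and
# group by hash value. The hash function h is supplied by the caller.
def lsh_cluster(docs, h):
    table = sorted((h(d), d) for d in docs)   # P[i] <- h(p_i), then sort P
    return [[d for _, d in grp]               # one group per hash value
            for _, grp in groupby(table, key=lambda t: t[0])]

docs = ["apple pie recipe", "apple pie recipes", "banana split"]
# toy hash: first token of the document (illustration only)
groups = lsh_cluster(docs, h=lambda d: d.split()[0])
print(groups)  # [['apple pie recipe', 'apple pie recipes'], ['banana split']]
```

Note the contrast with slide 8: one hash evaluation per document plus a sort, instead of |S|² pairwise sketch comparisons.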

12 Analysis
- Hash values can be computed in one pass
- Table P can be stored in a single file on a single machine
- Sorting and grouping takes O(|S| log |S|) simple comparisons
- Each group A consists of pages with the same hash value:
  - By the LSH property, they are likely to be similar to each other

13 Shingling and Resemblance [Broder et al 97]
tokens: words, numbers, HTML tags, etc.
tokenization(p): sequence of tokens produced from document p
w: a small integer
S_w(p) = w-shingling of p = the set of all distinct contiguous subsequences of tokenization(p) of length w
- Ex: p = “a rose is a rose is a rose”, w = 4
- S_w(p) = { (a rose is a), (rose is a rose), (is a rose is) }
resemblance_w(p,q) = |S_w(p) ∩ S_w(q)| / |S_w(p) ∪ S_w(q)|
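The shingling and resemblance definitions can be made concrete with a short Python sketch (the function names are mine, not from the paper; tokenization is plain whitespace splitting):

```python
# w-shingling: the set of all distinct contiguous length-w token
# subsequences of a document's tokenization.
def shingles(doc, w=4):
    tokens = doc.split()
    return {tuple(tokens[i:i + w]) for i in range(len(tokens) - w + 1)}

# resemblance = |S_w(p) intersect S_w(q)| / |S_w(p) union S_w(q)|
def resemblance(p, q, w=4):
    sp, sq = shingles(p, w), shingles(q, w)
    return len(sp & sq) / len(sp | sq)

p = "a rose is a rose is a rose"
print(len(shingles(p)))    # 3 distinct shingles, as in the slide's example
print(resemblance(p, p))   # 1.0
```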

14 LSH for Resemblance
resemblance_w(p,q) = |S_w(p) ∩ S_w(q)| / |S_w(p) ∪ S_w(q)|
π = a random permutation on Σ^w
- π induces a random order on all length-w sequences of tokens
- π also induces a random order on any subset X ⊆ Σ^w
- For each such subset and for each x ∈ X, Pr[min(π(X)) = x] = 1/|X|
LSH for resemblance: h(p) = min(π(S_w(p)))
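A runnable sketch of this min-wise LSH, under one simplification: instead of storing an explicit random permutation of Σ^w, each shingle is ranked by a seeded deterministic digest, which behaves like a random order (the helper names are my own):

```python
import hashlib

def rank(seed, shingle):
    # deterministic pseudo-random key; stands in for pi(shingle)
    return hashlib.md5(f"{seed}:{shingle}".encode()).hexdigest()

def minhash(shingle_set, seed):
    # h(p) = minimum of the shingle set under the induced random order
    return min(shingle_set, key=lambda s: rank(seed, s))

def estimated_resemblance(sp, sq, trials=500):
    # fraction of "permutations" on which the two minima collide;
    # by the LSH property this estimates the resemblance
    return sum(minhash(sp, i) == minhash(sq, i) for i in range(trials)) / trials

sp = {"a b", "b c", "c d", "d e"}
sq = {"a b", "b c", "c d", "x y"}
# true resemblance = 3/5 = 0.6; the estimate should be close
print(estimated_resemblance(sp, sq))
```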

15 LSH for Resemblance (cont.)
Lemma: Pr[min(π(S_w(p))) = min(π(S_w(q)))] = resemblance_w(p,q)
Proof: The minimum of π over S_w(p) ∪ S_w(q) is equally likely to be any element of the union. The two minima coincide iff that element lies in S_w(p) ∩ S_w(q), which happens with probability |S_w(p) ∩ S_w(q)| / |S_w(p) ∪ S_w(q)|.

16 Min-Wise Independent Permutations [Broder, Charikar, Frieze, Mitzenmacher 98]
Usual problem: storing π takes too much space
- O(|Σ|^w log |Σ|^w) bits to represent π
Solution: use small families of permutations
A family Π = { π | π is a permutation on Σ^w } is min-wise independent if:
- For all subsets X ⊆ Σ^w and for all x ∈ X, Pr[min(π(X)) = x] = 1/|X|
There are explicit constructions of small families of “approximately” min-wise independent permutations [Indyk 98]

17 The Sketching Model
Alice holds x and Bob holds y; they share randomness. Alice sends the sketch ρ(x) and Bob sends ρ(y) to a referee.
The k vs. r Gap Problem:
- Promise: d(x,y) ≤ k or d(x,y) ≥ r
- Goal: the referee decides which of the two holds

18 Applications
Large data sets:
- Clustering
- Nearest neighbor schemes
- Data streams
Management of files over the network:
- Differential backup
- Synchronization
Theory:
- Low-distortion embeddings
- Simultaneous-messages communication complexity

19 Known Sketching Schemes
- Resemblance [Broder, Glassman, Manasse, Zweig 97], [Broder, Charikar, Frieze, Mitzenmacher 98]
- Hamming distance [Kushilevitz, Ostrovsky, Rabani 98], [Indyk, Motwani 98], [Feigenbaum, Ishai, Malkin, Nissim, Strauss, Wright 01]
- Cosine similarity [Charikar 02]
- Earth mover distance [Charikar 02]
- Edit distance [Bar-Yossef, Jayram, Krauthgamer, Kumar 04]

20 Sketching Algorithm for Hamming Distance [Kushilevitz, Ostrovsky, Rabani 98]
x,y: binary strings of length n
HD(x,y) = # of positions in which x,y differ = | { i | x_i ≠ y_i } |
- Ex: x = 10101, y = 01010, HD(x,y) = 5
Goal:
- If HD(x,y) ≤ k, output “accept” w.p. ≥ 1 − δ
- If HD(x,y) ≥ 2k, output “reject” w.p. ≥ 1 − δ
The KOR algorithm gives an O(log(1/δ))-size sketch.

21 The KOR Algorithm
Shared randomness: n i.i.d. random bits r_1,…,r_n, where Pr[r_i = 1] = 1/(2k)
Basic sketch: h(x) = (Σ_i x_i r_i) mod 2
Full sketch: ρ(x) = (h_1(x),…,h_t(x))
- t = O(log(1/δ))
- h_1,…,h_t are generated independently like h
Reconstruction:
1. for j = 1,…,t do
2.   if h_j(x) = h_j(y) then
3.     z_j ← 1
4.   else
5.     z_j ← 0
6. if avg(z_1,…,z_t) > 11/18 output “accept”, else output “reject”
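A small simulation of the KOR scheme (a sketch with toy parameters of my choosing; fixing the generator seed plays the role of the shared randomness between the two sides):

```python
import random

def kor_sketch(x, k, t, seed=0):
    # t independent basic sketches; each uses n i.i.d. bits r_i with
    # Pr[r_i = 1] = 1/(2k). The fixed seed is the shared randomness.
    rng = random.Random(seed)
    out = []
    for _ in range(t):
        r = [1 if rng.random() < 1 / (2 * k) else 0 for _ in range(len(x))]
        out.append(sum(xi * ri for xi, ri in zip(x, r)) % 2)
    return out

def kor_decide(sx, sy):
    # reconstruction: accept iff the sketches agree on > 11/18 of positions
    agreement = sum(a == b for a, b in zip(sx, sy)) / len(sx)
    return "accept" if agreement > 11 / 18 else "reject"

n, k, t = 1000, 10, 1000
x = [0] * n
y_close = [1] * 5 + [0] * (n - 5)     # HD(x, y_close) = 5  <= k
y_far   = [1] * 40 + [0] * (n - 40)   # HD(x, y_far)  = 40 >= 2k
print(kor_decide(kor_sketch(x, k, t), kor_sketch(y_close, k, t)))
print(kor_decide(kor_sketch(x, k, t), kor_sketch(y_far, k, t)))
```

With these parameters the expected agreement is about 0.79 in the close case and about 0.51 in the far case, so with t = 1000 repetitions the decision lands on the correct side of 11/18 with overwhelming probability.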

22 KOR: Analysis
h(x) = h(y) iff Σ_{i: x_i ≠ y_i} r_i ≡ 0 (mod 2)
Note: the number of terms in the sum is HD(x,y)
Given HD(x,y) independent random bits, each equal to 1 with probability 1/(2k), what is the probability that their parity is 0?

23 KOR: Analysis (cont.)
r_1,…,r_m: m independent random bits; for each j, Pr[r_j = 1] = ε
What is Pr[Σ_j r_j ≡ 0 (mod 2)]?
The distribution of each bit can be viewed as a mixture of two distributions:
- Dist A (with probability 1 − 2ε): the bit is 0 w.p. 1
- Dist B (with probability 2ε): a uniformly chosen bit
Note:
- If all bits “choose” Dist A, the parity is 0 w.p. 1
- If at least one of the m bits “chooses” Dist B, the parity is 0 w.p. ½
Hence, Pr[parity = 0] = (1 − 2ε)^m + (1 − (1 − 2ε)^m)·½ = ½ + ½(1 − 2ε)^m
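The mixture identity can be verified exactly by brute-force enumeration for small m (a quick sanity check, not part of the lecture):

```python
from itertools import product

# Pr[parity of m bits = 0] when each bit is 1 with probability eps,
# computed by exact enumeration over all 2^m outcomes.
def parity_zero_prob(m, eps):
    total = 0.0
    for bits in product([0, 1], repeat=m):
        p = 1.0
        for b in bits:
            p *= eps if b else (1 - eps)
        if sum(bits) % 2 == 0:
            total += p
    return total

m, eps = 6, 0.05
closed_form = 0.5 + 0.5 * (1 - 2 * eps) ** m   # the slide's formula
print(abs(parity_zero_prob(m, eps) - closed_form) < 1e-12)  # True
```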

24 KOR: Analysis (cont.)
Setting ε = 1/(2k) and m = HD(x,y): Pr[h(x) = h(y)] = ½ + ½(1 − 1/k)^HD(x,y)
Therefore:
- If HD(x,y) ≤ k, then Pr[h(x) = h(y)] ≥ ½ + ½(1 − 1/k)^k ≈ ½ + 1/(2e) ≈ 12/18
- If HD(x,y) ≥ 2k, then Pr[h(x) = h(y)] ≤ ½ + ½(1 − 1/k)^2k ≈ ½ + 1/(2e²) ≈ 10/18
Define Z = avg(z_1,…,z_t):
- If HD(x,y) ≤ k, then E[Z] ≥ 12/18
- If HD(x,y) ≥ 2k, then E[Z] ≤ 10/18
By a Chernoff bound, t = O(log(1/δ)) is enough to guarantee:
- If HD(x,y) ≤ k, then Z > 11/18 w.p. ≥ 1 − δ
- If HD(x,y) ≥ 2k, then Z ≤ 11/18 w.p. ≥ 1 − δ

25 Edit Distance
x ∈ Σ^n, y ∈ Σ^m
ED(x,y): the minimum number of character insertions, deletions, and substitutions that transform x into y
Examples:
- ED(00000, 1111) = 5
- ED(01010, 10101) = 2
Applications:
- Genomics
- Text processing
- Web search
For simplicity: m = n, Σ = {0,1}
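The examples on this slide can be checked with the textbook dynamic-programming algorithm for edit distance:

```python
# Standard O(nm) dynamic program for edit distance with unit-cost
# insertions, deletions, and substitutions.
def edit_distance(x, y):
    n, m = len(x), len(y)
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        d[i][0] = i                  # delete all of x[:i]
    for j in range(m + 1):
        d[0][j] = j                  # insert all of y[:j]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if x[i - 1] == y[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution / match
    return d[n][m]

print(edit_distance("00000", "1111"))   # 5
print(edit_distance("01010", "10101"))  # 2
```

The second example shows why edit distance is harder than Hamming distance: the strings differ in all five positions, yet one deletion and one insertion align them.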

26 Sketching Algorithm for Edit Distance [Bar-Yossef, Jayram, Krauthgamer, Kumar 04]
x,y: binary strings of length n
Goal:
- If ED(x,y) ≤ k, output “accept” w.p. ≥ 1 − δ
- If ED(x,y) ≥ Ω((kn)^(2/3)), output “reject” w.p. ≥ 1 − δ
The BJKK algorithm gives an O(log(1/δ))-size sketch.

27 Basic Framework
Underlying principle: ED(x,y) is small iff x and y share many common substrings at nearby positions.
S_x = set of pairs of the form (α, h(i)):
- α: a substring of x
- h(i): a “locality sensitive” encoding of the substring’s position
ED(x,y) is small iff the intersection S_x ∩ S_y is large (common substrings at nearby positions)

28 Basic Framework (cont.)
Equivalently: ED(x,y) is small iff the symmetric difference S_x Δ S_y is small
Need to estimate the size of the symmetric difference:
- This is a Hamming distance computation on the characteristic vectors of S_x and S_y
- Use O(log(1/δ))-size sketches [KOR]
This reduces Edit Distance to Hamming Distance.

29 11 22 33 11 22 33 Encoding Scheme Gap: k vs. O((kn) 2/3 ) x y B = n 2/3 /k 1/3, W = n/B 1 S x = { S y = { (  1,1), (  1,1), (  2,1), (  2,1), (  3,2), (  3,2), … … B windows of size W each.,(  i, win(i)),…,(  i, win(i)),…

30 Analysis jj ii x y Case 1: ED(x,y) · k If  i is “unmarked”, it has a matching “companion”  j (  i,win(i)) 2 S x n S y, only if: either  i is “marked” or  i is unmarked, but win(i)  win(j) At most kB marked substrings At most k * n/W = kB companions with mismatched windows Therefore, Ham(S x,S y ) · 4kB

31 Analysis (cont.)
Case 2: Ham(S_x, S_y) ≤ 8kB
- If α_i has a “companion” β_j and win(i) = win(j), we can align α_i with β_j using at most W operations
- Otherwise, substitute the first character of α_i
- At most 8kB substrings of x have no companion
- Therefore, ED(x,y) ≤ 8kB + W · n/B = O((kn)^(2/3))

32 End of Lecture 11