
1 Lecture 18: Syntactic Web Clustering (CS 728, 2007)

2 Outline
Previously:
- Studied web clustering based on web link structure
- Some discussion of term-document vector spaces
Today: syntactic clustering of the web
- Identifying syntactic duplicates
- Locality-sensitive hash functions
- Resemblance and shingling
- Min-wise independent permutations
- The sketching model
- Hamming distance and edit distance

3 Motivation: Near-Duplicate Elimination
Many web pages are duplicates or near-duplicates of other pages:
- Mirror sites
- FAQs, manuals, legal documents
- Different versions of the same document
- Plagiarism
Duplicates are bad for search engines:
- They increase index size
- They harm the quality of search results
Question: How can we efficiently process the repository of crawled pages and eliminate (near-)duplicates?

4 Syntactic Clustering of the Web [Broder, Glassman, Manasse, Zweig 97]
U: space of all possible documents
S ⊆ U: collection of documents
Given sim: U × U → [0,1], a similarity measure among documents:
- If p,q are very similar, sim(p,q) is close to 1
- If p,q are very dissimilar, sim(p,q) is close to 0
- Usually sim(p,q) = 1 - d(p,q), where d(p,q) is a normalized distance between p and q
G: a threshold graph on S:
- p,q are connected by an edge iff sim(p,q) ≥ t (t = threshold)
Goal: find the connected components of G

5 Main Challenges
- S is huge: the web has 10 billion pages
- Documents are not compressed: storing S needs many disks, and each sim computation is costly
- Documents in S should be processed in a stream
- Main memory is small relative to S
- Cannot afford more than O(|S|) time
How do we create the graph G?
- Naively, this requires |S| passes and |S|^2 similarity computations

6 Sketching Schemes
T: a small set (|S| < |T| << |U|)
A sketching scheme for sim consists of:
- A compression function: a randomized mapping σ: U → T
- A reconstruction function: ρ: T × T → [0,1]
- For every pair p,q, with high probability, ρ(σ(p), σ(q)) ≈ sim(p,q)

7 Syntactic Clustering by Sketching
P ← empty table of size |S|
G ← empty graph on |S| nodes
for i = 1,…,|S|:
    read document p_i from the stream
    P[i] ← σ(p_i)
for i = 1,…,|S|:
    for j = 1,…,|S|:
        if ρ(P[i], P[j]) ≥ t: add edge (i,j) to G
output connected components of G
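The pseudocode above can be written down almost directly. Below is a minimal Python sketch, assuming some sketching scheme supplies the compression function sigma and the reconstruction function rho (both are placeholders here, not part of the slide); a small union-find stands in for the explicit graph G.

def cluster_by_sketching(stream, sigma, rho, t):
    # One pass over the stream: store only the sketches, never the documents.
    P = [sigma(p) for p in stream]
    n = len(P)

    # Union-find over document indices replaces the explicit graph G.
    parent = list(range(n))
    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # |S|^2 sketch comparisons: cheap per pair, but still quadratic overall.
    for i in range(n):
        for j in range(i + 1, n):
            if rho(P[i], P[j]) >= t:
                parent[find(i)] = find(j)

    # Collect the connected components.
    components = {}
    for i in range(n):
        components.setdefault(find(i), []).append(i)
    return list(components.values())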

8 Analysis
- Sketches can be computed in one pass
- Table P can be stored in a single file on a single machine
- Creating G requires |S|^2 applications of ρ
  - Easier than full-fledged computations of sim
  - Quadratic time is still a problem
- The connected components algorithm is heavy but feasible
We need a linear-time algorithm, even if it is only approximate. Idea: use hashing.

9 Sketching vs. Fingerprinting vs. Hashing
Hashing h: Σ* → {0,1}^k — set membership testing for a set S of size n
- Desire uniform distribution over the bin addresses {0,1}^k
- Minimize collisions per bin (reduce lookup time)
- Minimize hash table size: n ≈ N = 2^k
Fingerprinting f: Σ* → {0,1}^k — object equality testing over a set S of size n
- Distribution over {0,1}^k is irrelevant
- Avoid collisions altogether
- Tolerate larger k: typically N > n^2
Sketching φ: Σ* → {0,1}^k — similarity testing for a set S of size n
- Distribution over {0,1}^k is irrelevant
- Minimize collisions of dissimilar sets
- Minimize table size: n ≈ N = 2^k

10 Sketching via Locality-Sensitive Hashing (LSH) [Indyk, Motwani 98]
H = { h | h: U → T }: a family of hash functions
H is locality sensitive w.r.t. sim if for all p,q ∈ U, Pr[h(p) = h(q)] = sim(p,q)
- The probability is over the random choice of h from H
- Probability of collision = similarity between p and q

11 Syntactic Clustering by LSH
P ← empty table of size |S|
choose a random h from H
for i = 1,…,|S|:
    read document p_i from the stream
    P[i] ← h(p_i)
sort P and group by value
output groups
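A minimal Python sketch of this version, assuming h has already been drawn at random from a locality-sensitive family; grouping with a dictionary is equivalent to the sort-and-group step on the slide.

from collections import defaultdict

def cluster_by_lsh(stream, h):
    groups = defaultdict(list)
    # One pass: a single hash value per document, grouped as we go.
    for i, p in enumerate(stream):
        groups[h(p)].append(i)
    return list(groups.values())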

12 Analysis
- Hash values can be computed in one pass
- Table P can be stored in a single file on a single machine
- Sorting and grouping takes O(|S| log |S|) simple comparisons
- Each group consists of pages with the same hash value; by the LSH property, they are likely to be similar to each other
Let's apply this to the web and see if it makes sense. We need a sim measure. Idea: shingling.

13 Shingling and Resemblance [Broder et al 97]
tokens: words, numbers, HTML tags, etc.
tokenization(p): the sequence of tokens produced from document p
w: a small integer
S_w(p) = the w-shingling of p = the set of all distinct contiguous subsequences of tokenization(p) of length w
- Example: p = "a rose is a rose is a rose", w = 4
- S_w(p) = { (a rose is a), (rose is a rose), (is a rose is) }
- It is possible to use multisets as well
resemblance_w(p,q) = |S_w(p) ∩ S_w(q)| / |S_w(p) ∪ S_w(q)|
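A small illustration of w-shingling and resemblance in Python (set version). The whitespace tokenizer is a stand-in for the slide's notion of tokens, and shingles are represented as joined token strings.

def shingles(text, w):
    tokens = text.split()
    return {" ".join(tokens[i:i + w]) for i in range(len(tokens) - w + 1)}

def resemblance(p, q, w):
    sp, sq = shingles(p, w), shingles(q, w)
    return len(sp & sq) / len(sp | sq)

# The slide's example: the 4-shingles of "a rose is a rose is a rose"
print(shingles("a rose is a rose is a rose", 4))
# {'a rose is a', 'rose is a rose', 'is a rose is'}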

14 Shingling Example
A = "a rose is a rose is a rose"
B = "a rose is a flower which is a rose"
Preserving multiplicity:
- w=1: sim(S_A, S_B) = 0.7
  S_A = {a, a, a, is, is, rose, rose, rose}
  S_B = {a, a, a, is, is, rose, rose, flower, which}
- w=2: sim(S_A, S_B) = 0.5
- w=3: sim(S_A, S_B) = 0.3
Disregarding multiplicity:
- w=1: sim(S_A, S_B) = 0.6
- w=2: sim(S_A, S_B) = 0.5
- w=3: sim(S_A, S_B) ≈ 0.4285
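The multiset ("preserving multiplicity") numbers above can be reproduced with collections.Counter, which supports element-wise minimum (&) and maximum (|) of counts; a quick sketch:

from collections import Counter

def multiset_shingles(text, w):
    tokens = text.split()
    return Counter(" ".join(tokens[i:i + w]) for i in range(len(tokens) - w + 1))

def multiset_resemblance(p, q, w):
    sp, sq = multiset_shingles(p, w), multiset_shingles(q, w)
    inter = sum((sp & sq).values())   # element-wise min of the counts
    union = sum((sp | sq).values())   # element-wise max of the counts
    return inter / union

A = "a rose is a rose is a rose"
B = "a rose is a flower which is a rose"
print([multiset_resemblance(A, B, w) for w in (1, 2, 3)])   # [0.7, 0.5, 0.3]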

15 LSH for Resemblance
resemblance_w(p,q) = |S_w(p) ∩ S_w(q)| / |S_w(p) ∪ S_w(q)|
π = a random permutation on Σ^w
- π induces a random order on all length-w sequences of tokens
- π also induces a random order on any subset X ⊆ Σ^w
- For each such subset X and each x ∈ X, Pr(min{π(X)} = π(x)) = 1/|X|
LSH for resemblance: h(p) = min{π(S_w(p))}
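For a toy universe the permutation π can be drawn explicitly. The sketch below is illustrative only (the shingles of the second set are made up): it ranks every shingle by a random permutation and hashes a document to its lowest-ranked shingle.

import random

def make_minhash(universe, seed=0):
    rng = random.Random(seed)
    shuffled = rng.sample(sorted(universe), len(universe))
    rank = {s: r for r, s in enumerate(shuffled)}   # rank[s] plays the role of pi(s)
    return lambda shingle_set: min(shingle_set, key=lambda s: rank[s])

SA = {"a rose is a", "rose is a rose", "is a rose is"}
SB = {"a rose is a", "rose is a flower", "is a flower which"}
h = make_minhash(SA | SB, seed=42)
print(h(SA), h(SB))   # the two values collide with probability resemblance(A, B)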

16 LSH for Resemblance (cont.)
Lemma: Pr[h(p) = h(q)] = |S_w(p) ∩ S_w(q)| / |S_w(p) ∪ S_w(q)| = resemblance_w(p,q)
Proof: Let X = S_w(p) ∪ S_w(q). The element of X that is minimal under π is uniformly distributed over X. The two hash values agree exactly when this minimal element lies in both S_w(p) and S_w(q), i.e., in S_w(p) ∩ S_w(q), which happens with probability |S_w(p) ∩ S_w(q)| / |S_w(p) ∪ S_w(q)|.

17 Problems
How do we pick π?
- We need a random choice
- We need to efficiently find the minimal element
How many possible permutations are there?
- (|Σ|^w)!, so representing a truly random π requires on the order of |Σ|^w log |Σ|^w bits
- And we still need to compute the minimal element

18 Some Theory: Pairwise Independence
Universal hash functions (pairwise independent):
- H: a finite collection (family) of hash functions mapping U → {0,…,m-1}
- H is universal if, for h ∈ H picked uniformly at random and for all x_1, x_2 ∈ U with x_1 ≠ x_2, Pr(h(x_1) = h(x_2)) ≤ 1/m
The class of hash functions h_ab(x) = ((ax + b) mod p) mod m is universal (p ≥ m is a prime, a ∈ {1,…,p-1}, b ∈ {0,…,p-1})
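A minimal Python sketch of the h_ab family. In the standard Carter-Wegman construction the prime p should also exceed the largest key to be hashed; 2^61 - 1 is one convenient choice for the integer keys used here.

import random

P = 2**61 - 1   # a Mersenne prime, assumed larger than any key we hash

def random_universal_hash(m, rng=random):
    a = rng.randrange(1, P)    # a in {1, ..., p-1}
    b = rng.randrange(0, P)    # b in {0, ..., p-1}
    return lambda x: ((a * x + b) % P) % m

h = random_universal_hash(m=1024)
print(h(12345), h(67890))      # two bucket indices in {0, ..., 1023}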

19 Some Theory: Min-wise Independence
Min-wise independent permutations:
- S_n: a finite collection (family) of permutations mapping {1,…,n} to {1,…,n}
- S_n is min-wise independent if, for π ∈ S_n picked uniformly at random, for every X ⊆ {1,…,n} and every x ∈ X, Pr(min{π(X)} = π(x)) = 1/|X|
It is actually hard to find a "compact" collection of permutations that is min-wise independent, but we can use an approximation. In practice, universal hash functions work well!
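In that spirit, a single random (ax + b) mod p hash can stand in for a permutation π: order the shingles by their hash values and keep the minimum. A minimal sketch, with string shingles mapped to integers first; it is only approximately min-wise independent, as the slide notes.

import random

def approx_minwise_hash(seed):
    prime = 2**61 - 1
    rng = random.Random(seed)
    a, b = rng.randrange(1, prime), rng.randrange(prime)
    def h(shingle_set):
        # The hash-value ordering approximates the order induced by pi.
        xs = (int.from_bytes(s.encode(), "big") % prime for s in shingle_set)
        return min((a * x + b) % prime for x in xs)
    return h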

20 Back to Similarity and Resemblance
If π ∈ S_n and S_n is min-wise independent, then Pr[min{π(S_A)} = min{π(S_B)}] = |S_A ∩ S_B| / |S_A ∪ S_B| = sim(S_A, S_B)
This suggests we could keep just one minimum value as our "sketch", but our confidence would be low (high variance).
What we want for a sketch of size k is either:
- use k permutations π, or
- keep the k minimum values for one π
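The second option (keep the k smallest values under a single hash) is often called a bottom-k sketch. A rough sketch of that variant, assuming a hypothetical hash_fn that maps a single shingle to an integer and that documents contain at least k shingles:

import heapq

def bottom_k_sketch(shingle_set, hash_fn, k):
    # The k smallest hash values occurring in the document.
    return set(heapq.nsmallest(k, (hash_fn(s) for s in shingle_set)))

def estimate_resemblance_bottom_k(sk_a, sk_b, k):
    # Bottom-k of the union is computable from the two sketches alone;
    # the fraction of it present in both sketches estimates the resemblance.
    union_bottom = set(heapq.nsmallest(k, sk_a | sk_b))
    return len(union_bottom & sk_a & sk_b) / k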

21 Multiple Permutations
Better variance reduction:
- Instead of larger k, stick with k=1
- Use multiple, independent permutations
Sketch construction:
- Pick p random permutations of U: π_1, π_2, …, π_p
- sk(A) = the minimal elements under π_1(S_A), …, π_p(S_A)
Claim: E[sim(sk(A), sk(B))] = sim(S_A, S_B)
- The earlier lemma gives the case p=1
- Linearity of expectation
- Variance reduction comes from the independence of π_1, …, π_p
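A minimal end-to-end sketch of this construction, with p independent (a, b) hash functions standing in for the p random permutations (an approximation, as discussed on the previous slides); the fraction of agreeing coordinates estimates the resemblance.

import random

PRIME = 2**61 - 1

def make_hash_family(p, seed=0):
    rng = random.Random(seed)
    return [(rng.randrange(1, PRIME), rng.randrange(PRIME)) for _ in range(p)]

def sketch(shingle_set, family):
    keys = [int.from_bytes(s.encode(), "big") % PRIME for s in shingle_set]
    return [min((a * x + b) % PRIME for x in keys) for a, b in family]

def estimate_similarity(sk_a, sk_b):
    return sum(a == b for a, b in zip(sk_a, sk_b)) / len(sk_a)

SA = {"a rose", "rose is", "is a"}
SB = {"a rose", "rose is", "is a", "a flower", "flower which", "which is"}
family = make_hash_family(p=500)
print(estimate_similarity(sketch(SA, family), sketch(SB, family)))  # roughly 3/6 = 0.5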

22 Other Known Sketching Schemes
- Resemblance [Broder, Glassman, Manasse, Zweig 97], [Broder, Charikar, Frieze, Mitzenmacher 98]
- Hamming distance [Kushilevitz, Ostrovsky, Rabani 98], [Indyk, Motwani 98], [Feigenbaum, Ishai, Malkin, Nissim, Strauss, Wright 01]
- Cosine similarity [Charikar 02]
- Earth mover distance [Charikar 02]
- Edit distance [Bar-Yossef, Jayram, Krauthgamer, Kumar 04]

23 The General Sketching Model
Alice holds x and sends the referee a sketch σ(x); Bob holds y and sends σ(y). Alice and Bob share randomness, and the referee sees only the two sketches.
The k vs. r gap problem:
- Promise: d(x,y) ≤ k or d(x,y) ≥ r
- Goal: decide which of the two holds

24 Applications
Large data sets:
- Clustering
- Nearest neighbor schemes
- Data streams
Management of files over the network:
- Differential backup
- Synchronization
Theory:
- Low-distortion embeddings
- Simultaneous messages communication complexity

