Lecture 18: Syntactic Web Clustering (CS 728, 2007)


2 Outline

Previously:
- Web clustering based on web link structure
- Some discussion of term-document vector spaces

Today: syntactic clustering of the web
- Identifying syntactic duplicates
- Locality sensitive hash functions
- Resemblance and shingling
- Min-wise independent permutations
- The sketching model
- Hamming distance and edit distance

3 Motivation: Near-Duplicate Elimination

Many web pages are duplicates or near-duplicates of other pages:
- Mirror sites
- FAQs, manuals, legal documents
- Different versions of the same document
- Plagiarism

Duplicates are bad for search engines:
- They increase index size
- They harm the quality of search results

Question: how can we efficiently process the repository of crawled pages and eliminate near-duplicates?

4 Syntactic Clustering of the Web [Broder, Glassman, Manasse, Zweig 97]

U: the space of all possible documents
S ⊆ U: a collection of documents
sim: U × U → [0,1]: a similarity measure on documents
- If p, q are very similar, sim(p,q) is close to 1
- If p, q are very dissimilar, sim(p,q) is close to 0
- Usually sim(p,q) = 1 − d(p,q), where d(p,q) is a normalized distance between p and q

G: a threshold graph on S:
- p, q are connected by an edge iff sim(p,q) ≥ t (t = threshold)

Goal: find the connected components of G

5 Main Challenges

- S is huge: the web has 10 billion pages
- Documents are not compressed
  - Many disks are needed to store S
  - Each sim computation is costly
- Documents in S should be processed in a stream
- Main memory is small relative to S
- We cannot afford more than O(|S|) time

How to create the graph G? Naively, it requires |S| passes and |S|^2 similarity computations.

6 Sketching Schemes

T: a small set (|S| < |T| << |U|)

A sketching scheme for sim consists of:
- A compression function: a randomized mapping σ: U → T
- A reconstruction function: ρ: T × T → [0,1]
- For every pair p, q, with high probability, ρ(σ(p), σ(q)) ≈ sim(p,q)

7 Syntactic Clustering by Sketching

P ← empty table of size |S|
G ← empty graph on |S| nodes
for i = 1,…,|S|:
    read document p_i from the stream
    P[i] ← σ(p_i)
for i = 1,…,|S|:
    for j = 1,…,|S|:
        if ρ(P[i], P[j]) ≥ t: add edge (i,j) to G
output the connected components of G
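The pseudocode above can be sketched as runnable Python. The union-find for connected components, the toy σ (a document's word set), the toy ρ (Jaccard similarity), and the threshold are illustrative assumptions for this sketch, not the paper's actual scheme:

```python
# Sketch-based syntactic clustering, following the pseudocode above.
# sigma and rho are parameters; here sigma is a toy word-set "sketch"
# and rho is Jaccard similarity, purely for illustration.

def jaccard(a, b):
    """Jaccard similarity of two sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def cluster(docs, sigma, rho, t):
    """Connected components of the threshold graph on the sketches."""
    n = len(docs)
    sketches = [sigma(d) for d in docs]   # one pass over the stream
    parent = list(range(n))               # union-find forest in place of G

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(n):                    # the |S|^2 comparison loop
        for j in range(i + 1, n):
            if rho(sketches[i], sketches[j]) >= t:
                parent[find(i)] = find(j) # "add edge (i, j) to G"

    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())

docs = ["a rose is a rose", "a rose is a flower", "an entirely different page"]
clusters = cluster(docs, sigma=lambda d: set(d.split()), rho=jaccard, t=0.5)
print(clusters)
```

With this toy σ and ρ the first two documents share 3 of 4 distinct words (Jaccard 0.75 ≥ 0.5) and land in one component.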

8 Analysis

- Sketches can be computed in one pass
- The table P can be stored in a single file on a single machine
- Creating G requires |S|^2 applications of ρ
  - Each is easier than a full-fledged computation of sim
  - But quadratic time is still a problem
- The connected-components algorithm is heavy but feasible
- We need a linear-time algorithm, even an approximate one
  - Idea: use hashing

9 Sketching vs. Fingerprinting vs. Hashing

Hashing h: U → {0,1}^k
- Purpose: membership testing for a set S of size n
- Desire a uniform distribution over the bin addresses {0,1}^k
- Minimize collisions per bin, to reduce lookup time
- Minimize hash table size: n ≤ N = 2^k

Fingerprinting f: U → {0,1}^k
- Purpose: equality testing over a set S of size n
- The distribution over {0,1}^k is irrelevant
- Avoid collisions altogether
- Tolerate larger k: typically N > n^2

Sketching φ: U → {0,1}^k
- Purpose: similarity testing for a set S of size n
- The distribution over {0,1}^k is irrelevant
- Minimize collisions of dissimilar sets
- Minimize table size: n ≤ N = 2^k

10 Sketching via Locality Sensitive Hashing (LSH) [Indyk, Motwani 98]

H = { h | h: U → T }: a family of hash functions

H is locality sensitive w.r.t. sim if for all p, q ∈ U,
    Pr[h(p) = h(q)] = sim(p,q)
- The probability is over the random choice of h from H
- Probability of collision = similarity between p and q

11 Syntactic Clustering by LSH

P ← empty table of size |S|
G ← empty graph on |S| nodes
choose a random h ∈ H
for i = 1,…,|S|:
    read document p_i from the stream
    P[i] ← h(p_i)
sort P and group by value
output the groups
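A toy run of the grouping loop above. For h this sketch uses a single min-hash over word 1-shingles (the concrete LSH for resemblance developed on the later slides); the fixed coefficients and the md5-based word values are arbitrary choices made here for determinism, not part of the algorithm:

```python
# Group documents by one LSH value, as in the pseudocode above.
import hashlib
from collections import defaultdict

P = 2**61 - 1                 # a large Mersenne prime
A, B = 1234577, 98765431      # arbitrary fixed coefficients for this demo

def word_val(w):
    """Deterministic integer value for a word (stand-in for randomness)."""
    return int(hashlib.md5(w.encode()).hexdigest(), 16)

def h(doc):
    """Min-hash of the document's word set under one fixed hash."""
    return min((A * word_val(w) + B) % P for w in doc.split())

docs = ["a rose is a rose", "is a rose a rose", "totally unrelated text here"]
groups = defaultdict(list)
for i, d in enumerate(docs):
    groups[h(d)].append(i)    # "sort P and group by value"
clusters = sorted(groups.values())
print(clusters)
```

The first two documents have identical word sets, so they get the same hash value and fall into the same group.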

12 Analysis

- Hash values can be computed in one pass
- The table P can be stored in a single file on a single machine
- Sorting and grouping takes O(|S| log |S|) simple comparisons
- Each group consists of pages with the same hash value
  - By the LSH property, they are likely to be similar to each other

Let's apply this to the web and see if it makes sense. We need a sim measure. Idea: shingling.

13 Shingling and Resemblance [Broder et al 97]

Tokens: words, numbers, HTML tags, etc.
tokenization(p): the sequence of tokens produced from document p
w: a small integer
S_w(p) = the w-shingling of p = the set of all distinct contiguous subsequences of tokenization(p) of length w
- Example: p = “a rose is a rose is a rose”, w = 4
- S_w(p) = { (a rose is a), (rose is a rose), (is a rose is) }
- It is possible to use multisets as well

resemblance_w(p,q) = |S_w(p) ∩ S_w(q)| / |S_w(p) ∪ S_w(q)|
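The definitions above in code. The whitespace tokenizer is a simplification of the paper's tokenizer (which also handles numbers, HTML tags, etc.):

```python
# w-shingling and resemblance as defined above.

def shingles(text, w):
    """Set of all distinct contiguous length-w token subsequences."""
    toks = text.split()           # toy tokenizer: whitespace only
    return {tuple(toks[i:i + w]) for i in range(len(toks) - w + 1)}

def resemblance(p, q, w):
    """Jaccard similarity of the two documents' w-shingle sets."""
    sp, sq = shingles(p, w), shingles(q, w)
    return len(sp & sq) / len(sp | sq)

p = "a rose is a rose is a rose"
print(sorted(shingles(p, 4)))     # 5 windows, only 3 distinct shingles
```

On the slide's example, the five length-4 windows collapse to the three distinct shingles listed above.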

14 Shingling Example

A = “a rose is a rose is a rose”
B = “a rose is a flower which is a rose”

Preserving multiplicity:
- w=1: sim(S_A, S_B) = 0.7
  - S_A = {a, a, a, is, is, rose, rose, rose}
  - S_B = {a, a, a, is, is, rose, rose, flower, which}
- w=2: sim(S_A, S_B) = 0.5
- w=3: sim(S_A, S_B) = 0.3

Disregarding multiplicity:
- w=1: sim(S_A, S_B) = 0.6
- w=2: sim(S_A, S_B) = 0.5
- w=3: sim(S_A, S_B) = 3/7 ≈ 0.43

15 LSH for Resemblance

resemblance_w(p,q) = |S_w(p) ∩ S_w(q)| / |S_w(p) ∪ S_w(q)|

π: a random permutation on Σ^w
- π induces a random order on all length-w sequences of tokens
- π also induces a random order on any subset X ⊆ Σ^w
- For each such subset and each x ∈ X, Pr(min(π(X)) = x) = 1/|X|

LSH for resemblance: h(p) = min(π(S_w(p)))
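The LSH property can be checked empirically on the shingling example above. As a stand-in for a random permutation of all of Σ^w, this sketch shuffles just the union of the two shingle sets, which induces the same distribution on the two minima:

```python
# Empirical check for h(p) = min(pi(S_w(p))): over many random orderings,
# the fraction of min-hash collisions approaches resemblance_w(p, q).
import random

def shingles(text, w):
    toks = text.split()
    return {tuple(toks[i:i + w]) for i in range(len(toks) - w + 1)}

sp = shingles("a rose is a rose is a rose", 3)
sq = shingles("a rose is a flower which is a rose", 3)
res = len(sp & sq) / len(sp | sq)         # exact resemblance: 3/7

rng = random.Random(0)
universe = sorted(sp | sq)
trials, hits = 2000, 0
for _ in range(trials):
    order = universe[:]
    rng.shuffle(order)                    # one random permutation pi
    rank = {s: i for i, s in enumerate(order)}
    hits += min(sp, key=rank.get) == min(sq, key=rank.get)
print(res, hits / trials)                 # the two numbers are close
```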

16 LSH for Resemblance (cont.)

Lemma: Pr[min(π(S_w(p))) = min(π(S_w(q)))] = resemblance_w(p,q)

Proof: Let X = S_w(p) ∪ S_w(q). Since π induces a random order on X, the minimum of π(X) is equally likely to be attained by each x ∈ X. The two minima coincide iff the element attaining min(π(X)) lies in S_w(p) ∩ S_w(q). Hence the collision probability is |S_w(p) ∩ S_w(q)| / |S_w(p) ∪ S_w(q)| = resemblance_w(p,q).
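The lemma can be verified exactly on a tiny universe by enumerating all permutations (the two sets here are arbitrary small examples, not the slide's documents):

```python
# Exact check of the lemma: over all 120 permutations of the union,
# the fraction with matching minima equals |intersection| / |union|.
from itertools import permutations
from fractions import Fraction

Sp = {1, 2, 3, 4}
Sq = {3, 4, 5}
X = sorted(Sp | Sq)               # union of 5 elements, 5! = 120 orderings

hits = 0
for perm in permutations(range(len(X))):
    rank = dict(zip(X, perm))     # one permutation pi of X
    hits += min(Sp, key=rank.get) == min(Sq, key=rank.get)

prob = Fraction(hits, 120)
print(prob)                       # 2/5 = |{3,4}| / |{1,2,3,4,5}|
```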

Problems

How do we pick π?
- We need a truly random choice
- We need to efficiently find the min element

How many possible permutations are there?
- (|Σ|^w)!, so representing π requires O(|Σ|^w log |Σ|^w) bits at minimum
- And we still need to compute the min element

Some Theory: Pairwise Independence

Universal hash functions (pairwise independent):
- H: a finite collection (family) of hash functions mapping U → {0, …, m−1}
- H is universal if, for h ∈ H picked uniformly at random and for all x_1, x_2 ∈ U with x_1 ≠ x_2,
    Pr(h(x_1) = h(x_2)) ≤ 1/m

The class of hash functions
    h_ab(x) = ((a·x + b) mod p) mod m
is universal (p ≥ m a prime, a ∈ {1, …, p−1}, b ∈ {0, …, p−1})
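The family h_ab above, with an empirical check that two fixed distinct keys collide with probability about 1/m over the random draw of (a, b). The particular prime, keys, and seed are arbitrary choices for this sketch:

```python
# The universal family h_ab(x) = ((a*x + b) mod p) mod m from the slide.
import random

p, m = 2**31 - 1, 100           # p a prime with p >= m
rng = random.Random(1)

def draw_hash():
    """Draw one random member of the family."""
    a = rng.randrange(1, p)     # a in {1, ..., p-1}
    b = rng.randrange(0, p)     # b in {0, ..., p-1}
    return lambda x: ((a * x + b) % p) % m

x1, x2 = 12345, 67890
trials = 20000
collisions = 0
for _ in range(trials):
    h = draw_hash()
    collisions += h(x1) == h(x2)
print(collisions / trials)      # close to 1/m = 0.01
```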

Some Theory: Min-wise Independence

Min-wise independent permutations:
- S_n: the set of permutations mapping {1,…,n} to {1,…,n}
- A family H ⊆ S_n is min-wise independent if, for π ∈ H picked uniformly at random, for every X ⊆ {1,…,n} and every x ∈ X,
    Pr(min{π(X)} = π(x)) = 1/|X|

It is actually hard to find a “compact” family of permutations that is min-wise independent, but we can use an approximation. In practice, universal hash functions work well!

Back to Similarity and Resemblance

If π ∈ H and H is min-wise independent, then:
    Pr[min(π(S_A)) = min(π(S_B))] = |S_A ∩ S_B| / |S_A ∪ S_B| = sim(S_A, S_B)

This suggests we could keep just one minimum value as our “sketch”, but our confidence would be low (high variance). What we want for a sketch of size k is either:
- use k independent permutations π_1, …, π_k, or
- keep the k minimum values under one π

Multiple Permutations

Better variance reduction:
- Instead of a larger k, stick with k = 1
- Use multiple, independent permutations

Sketch construction:
- Pick p random permutations of U: π_1, π_2, …, π_p
- sk(A) = the minimal elements under π_1(S_A), …, π_p(S_A)

Claim: E[ sim(sk(A), sk(B)) ] = sim(S_A, S_B)
- The earlier lemma gives the case p = 1
- Linearity of expectation
- Variance reduction follows from the independence of π_1, …, π_p
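The multiple-permutation sketch, with the permutations again simulated by shuffling the (small) shingle universe. The documents and w = 1 follow the shingling example; the number of permutations and the seed are arbitrary:

```python
# Multi-permutation min-hash sketch: the fraction of coordinates where
# sk(A) and sk(B) agree estimates sim(S_A, S_B).
import random

def shingles(text, w=1):
    toks = text.split()
    return {tuple(toks[i:i + w]) for i in range(len(toks) - w + 1)}

SA = shingles("a rose is a rose is a rose")
SB = shingles("a rose is a flower which is a rose")
true_sim = len(SA & SB) / len(SA | SB)   # 0.6 for w = 1, as on slide 14

rng = random.Random(0)
universe = sorted(SA | SB)
p_perms = 400                            # p independent permutations
agree = 0
for _ in range(p_perms):
    order = universe[:]
    rng.shuffle(order)                   # one independent pi_i
    rank = {s: i for i, s in enumerate(order)}
    agree += min(SA, key=rank.get) == min(SB, key=rank.get)
print(true_sim, agree / p_perms)         # estimate concentrates near 0.6
```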

22 Other Known Sketching Schemes

- Resemblance: [Broder, Glassman, Manasse, Zweig 97], [Broder, Charikar, Frieze, Mitzenmacher 98]
- Hamming distance: [Kushilevitz, Ostrovsky, Rabani 98], [Indyk, Motwani 98], [Feigenbaum, Ishai, Malkin, Nissim, Strauss, Wright 01]
- Cosine similarity: [Charikar 02]
- Earth mover distance: [Charikar 02]
- Edit distance: [Bar-Yossef, Jayram, Krauthgamer, Kumar 04]

23 The General Sketching Model

Alice holds x and Bob holds y. Using shared randomness, Alice sends the sketch σ(x) and Bob sends the sketch σ(y) to a referee, who sees only the two sketches.

The k vs. r gap problem:
- Promise: d(x,y) ≤ k or d(x,y) ≥ r
- Goal: decide which of the two holds

The gap between k and r captures the approximation.

24 Applications

Large data sets:
- Clustering
- Nearest neighbor schemes
- Data streams

Management of files over the network:
- Differential backup
- Synchronization

Theory:
- Low-distortion embeddings
- Simultaneous-messages communication complexity