Near-Duplicates Detection


Near-Duplicates Detection
Naama Kraus
Slides are based on the book Introduction to Information Retrieval by Manning, Raghavan, and Schütze. Some slides are courtesy of Kira Radinsky.

Why duplicate detection?

The web is full of duplicated content: about 30-40% of the pages on the Web are (near) duplicates of other pages, e.g., mirror sites. Search engines try to avoid indexing duplicate pages, in order to:
Save storage
Save processing time
Avoid returning duplicate pages in search results, improving the user's search experience

Strict duplicates (exact matches) are not that common, but there are many, many cases of near duplicates, e.g., two copies of a page whose only difference is the last-modified date. There is little point in indexing (nearly) the same content over and over again, and there is certainly no reason to return the same content multiple times in search result pages.

The goal: detect duplicate pages. How can near-duplicate pages be identified in a scalable and reliable manner?

Exact-duplicates detection

A naïve approach: detect exact duplicates.
Map each page to some fingerprint, e.g., 64-bit
If two web pages have equal fingerprints, check whether their content is equal
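A minimal sketch of the fingerprinting idea. The choice of hash is an assumption (the first 8 bytes of SHA-1 here); the slides do not specify a particular 64-bit fingerprint function:

```python
import hashlib

def fingerprint64(page: str) -> int:
    """Map a page's content to a 64-bit fingerprint (first 8 bytes of SHA-1)."""
    digest = hashlib.sha1(page.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")

a = fingerprint64("some page content")
b = fingerprint64("some page content")   # identical content, identical fingerprint
c = fingerprint64("a different page")    # (almost certainly) a different fingerprint
```

Equal fingerprints only signal a *candidate* duplicate; the content comparison in the second bullet guards against hash collisions.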

Near-duplicates

What about near-duplicates? Pages that are almost identical are common on the Web, e.g., pages where only a date differs. Eliminating near duplicates is desired!

The challenge: how to efficiently detect near duplicates? Exhaustively comparing all pairs of web pages wouldn't scale.

Shingling

The k-shingles of a document d are defined to be the set of all consecutive sequences of k terms in d, where k is a positive integer. E.g., the 4-shingles of "My name is Inigo Montoya. You killed my father. Prepare to die" are:

{ "my name is inigo", "name is inigo montoya", "is inigo montoya you", "inigo montoya you killed", "montoya you killed my", "you killed my father", "killed my father prepare", "my father prepare to", "father prepare to die" }
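A minimal shingling sketch. The tokenization (lowercasing, stripping periods, splitting on whitespace) is an illustrative assumption; real systems normalize text more carefully:

```python
def shingles(text: str, k: int = 4) -> set:
    """Return the set of k-shingles (k-grams of consecutive terms) of a document."""
    terms = text.lower().replace(".", "").split()
    return {" ".join(terms[i:i + k]) for i in range(len(terms) - k + 1)}

s = shingles("My name is Inigo Montoya. You killed my father. Prepare to die", k=4)
# s contains "my name is inigo", ..., "father prepare to die" (9 shingles in all)
```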

Computing Similarity

Intuition: two documents are near-duplicates if their shingle sets are 'nearly the same'.

Measure similarity using the Jaccard coefficient, the degree of overlap between two sets. Denote by S(d) the set of shingles of document d:

J(S(d1), S(d2)) = |S(d1) ∩ S(d2)| / |S(d1) ∪ S(d2)|

If J exceeds a preset threshold (e.g., 0.9), declare d1, d2 near duplicates.

The associated Jaccard distance, 1 − J, is a metric: it is symmetric, reflexive, and satisfies the triangle inequality.

Issue: the computation is costly and done pairwise. How can we compute Jaccard efficiently?
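The coefficient translates directly to Python sets (treating J(∅, ∅) = 1 is a convention assumed here):

```python
def jaccard(s1: set, s2: set) -> float:
    """J(S1, S2) = |S1 ∩ S2| / |S1 ∪ S2|."""
    if not s1 and not s2:
        return 1.0  # convention: two empty sets are identical
    return len(s1 & s2) / len(s1 | s2)

jaccard({1, 2, 3, 4}, {2, 3, 4, 5})  # 3 common / 5 total = 0.6
```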

Hashing shingles

Map each shingle to an integer hash value over a large space, say 64 bits. H(di) denotes the set of hash values derived from S(di). We need to detect pairs of documents whose sets H(·) have a large overlap. How to do this efficiently? In the next slides...

Permuting

Let p be a random permutation over the hash-value space
Let P(di) denote the set of permuted hash values in H(di)
Let xi be the smallest integer in P(di)

Illustration

For Document 1: start with the 64-bit hash values H(shingles) on the number line [0, 2^64); permute them with p; pick the min value.

Key Theorem

Theorem: J(S(di), S(dj)) = P(xi = xj), where xi and xj come from the same permutation.

Intuition: if the shingle sets of two documents are 'nearly the same' and we randomly permute, then there is a high probability that the minimal values are equal.
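The theorem can be checked empirically: draw random permutations, and the fraction of permutations in which the two minima coincide should approach the Jaccard coefficient. A small Monte Carlo sketch (the sets, universe, and trial count are illustrative):

```python
import random

def minhash_collision_rate(s1, s2, universe, trials=10_000, seed=0):
    """Estimate P(x1 = x2) over random permutations of the universe."""
    rng = random.Random(seed)
    items = list(universe)
    hits = 0
    for _ in range(trials):
        shuffled = items[:]
        rng.shuffle(shuffled)                     # one random permutation
        rank = {x: r for r, x in enumerate(shuffled)}
        if min(rank[x] for x in s1) == min(rank[x] for x in s2):
            hits += 1
    return hits / trials

# {1,2,3,4} vs {2,3,4,5}: J = 3/5, so the rate should be close to 0.6
rate = minhash_collision_rate({1, 2, 3, 4}, {2, 3, 4, 5}, universe=range(1, 11))
```

Note that elements outside both sets never affect which of the two minima wins, which is why the estimate depends only on the sets themselves.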

Proof (1)

View sets S1, S2 as columns of a matrix A, with one row for each element in the universe; aij = 1 indicates the presence of item i in set j.

Example:

S1 S2
0  1
1  0
1  1
0  0
1  1
0  1

Jaccard(S1, S2) = 2/5 = 0.4

Proof (2)

Let p be a random permutation of the rows of A. Denote by P(Sj) the column that results from applying p to the j-th column. Let xi be the index of the first row in which the column P(Si) has a 1.

P(S1) P(S2)
0     1
1     0
1     1
0     0
...

Proof (3)

For columns S1, S2 there are four types of rows:

Type  S1  S2
A     1   1
B     1   0
C     0   1
D     0   0

Let A also denote the number of rows of type A (and similarly for B, C, D). Clearly, J(S1, S2) = A/(A+B+C).

Proof (4)

From the previous slide: J(S1, S2) = A/(A+B+C).
Claim: P(xi = xj) = A/(A+B+C). Why?

Look down the columns P(S1), P(S2) until the first non-type-D row, i.e., the row containing xi or xj (or both, if they are equal). Then xi = xj exactly when that row is of type A. Since we picked a random permutation, the probability that the first non-type-D row is of type A is A/(A+B+C), and therefore P(xi = xj) = J(S1, S2).

Sketches

Thus our Jaccard coefficient test is probabilistic: we need to estimate P(xi = xj).

Method: pick k (~200) random row permutations. The sketch of document di is the list of its xi values, one per permutation; the list has length k.

Jaccard estimation: the fraction of permutations where the sketch values agree,

|Sketch(di) ∩ Sketch(dj)| / k
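A sketch of the sketching method. Instead of k explicit permutations, k random linear hash functions h(x) = (ax + b) mod p stand in for them (anticipating the implementation trick on a later slide); k, the seed, and the prime are illustrative choices:

```python
import random

def make_sketch(item_hashes, k=200, seed=42):
    """Sketch = the min value of each of k hash functions over the item set.
    h(x) = (a*x + b) mod p approximates a random permutation of the hash space."""
    rng = random.Random(seed)
    p = (1 << 61) - 1  # a large Mersenne prime (illustrative choice)
    coeffs = [(rng.randrange(1, p), rng.randrange(p)) for _ in range(k)]
    return [min((a * x + b) % p for x in item_hashes) for a, b in coeffs]

def estimate_jaccard(sketch1, sketch2):
    """Fraction of positions (permutations) where the two sketches agree."""
    return sum(v1 == v2 for v1, v2 in zip(sketch1, sketch2)) / len(sketch1)
```

Two documents sketched with the same seed share the same k hash functions, so position-wise agreement between their sketches estimates J.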

Example

With k = 3 permutations, the sketch of each of the sets S1, S2, S3 is a list of three min-values; e.g., under Perm 1 = (12345) the min-values are 1, 2, 1 respectively. Estimated similarities: sim(1,2) = 0/3, sim(1,3) = 2/3, sim(2,3) = 0/3.

Algorithm for Clustering Near-Duplicate Documents

1. Compute the sketch of each document
2. From each sketch, produce a list of <shingle, docID> pairs
3. Group all pairs by shingle value
4. For any shingle that is shared by more than one document, output a triplet <smaller-docID, larger-docID, 1> for each pair of docIDs sharing that shingle
5. Sort and aggregate the list of triplets, producing final triplets of the form <smaller-docID, larger-docID, # common shingles>
6. Join any pair of documents whose number of common shingles exceeds a chosen threshold, using a "union-find" algorithm
7. Each resulting connected component of the union-find structure is a cluster of near-duplicate documents

The implementation nicely fits the "map-reduce" programming paradigm.
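The steps above can be sketched as follows. For simplicity this version counts shared sketch values per document pair in memory rather than emitting and sorting explicit triplets (an approximation of steps 2-5), and uses a small union-find for steps 6-7:

```python
from collections import defaultdict

class UnionFind:
    """Minimal union-find with path halving."""
    def __init__(self):
        self.parent = {}
    def find(self, x):
        self.parent.setdefault(x, x)
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]
            x = self.parent[x]
        return x
    def union(self, x, y):
        self.parent[self.find(x)] = self.find(y)

def cluster(doc_sketches, threshold):
    """Cluster documents whose sketches share at least `threshold` values."""
    # Steps 2-3: group docIDs by shared sketch value
    by_value = defaultdict(set)
    for doc, sketch in doc_sketches.items():
        for v in sketch:
            by_value[v].add(doc)
    # Steps 4-5: count common values per document pair
    common = defaultdict(int)
    for docs in by_value.values():
        docs = sorted(docs)
        for i in range(len(docs)):
            for j in range(i + 1, len(docs)):
                common[(docs[i], docs[j])] += 1
    # Step 6: union pairs that meet the threshold
    uf = UnionFind()
    for doc in doc_sketches:
        uf.find(doc)
    for (d1, d2), n in common.items():
        if n >= threshold:
            uf.union(d1, d2)
    # Step 7: each connected component is a cluster
    clusters = defaultdict(set)
    for doc in doc_sketches:
        clusters[uf.find(doc)].add(doc)
    return list(clusters.values())
```

In a map-reduce setting, the grouping and counting stages map naturally onto shuffle-and-reduce phases, with only the final union-find done on the aggregated pair counts.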

Implementation Trick

Permuting the universe even once is prohibitive. Row hashing: pick k hash functions hk; ordering the rows under hk gives a random permutation of the rows.

One-pass implementation:
For each column Ci and hash function hk, keep a slot for the min-hash value
Initialize all slot(Ci, hk) to infinity
Scan the rows in arbitrary order, looking for 1's; suppose row Rj has a 1 in column Ci
For each hk: if hk(j) < slot(Ci, hk), then slot(Ci, hk) ← hk(j)

Example

h(x) = x mod 5, g(x) = 2x + 1 mod 5

     C1  C2
R1    1   0
R2    0   1
R3    1   1
R4    1   0
R5    0   1

Scanning rows R1..R5, the slots evolve as follows:

Row               C1 slots (h, g)   C2 slots (h, g)
R1: h(1)=1, g(1)=3    (1, 3)            (-, -)
R2: h(2)=2, g(2)=0    (1, 3)            (2, 0)
R3: h(3)=3, g(3)=2    (1, 2)            (2, 0)
R4: h(4)=4, g(4)=4    (1, 2)            (2, 0)
R5: h(5)=0, g(5)=1    (1, 2)            (0, 0)

Final min-hash signatures: C1 = (1, 2), C2 = (0, 0).
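The one-pass scan can be written directly; this sketch reproduces the example's trace (matrix and hash functions taken from the slide):

```python
INF = float("inf")

def one_pass_minhash(matrix, hash_funcs):
    """One-pass min-hash: scan rows once, keep a slot per (hash fn, column),
    and lower slot(C, h) to h(row index) whenever the row has a 1 in column C."""
    n_cols = len(matrix[0])
    slots = [[INF] * n_cols for _ in hash_funcs]
    for j, row in enumerate(matrix, start=1):  # rows numbered R1..R5
        for c, bit in enumerate(row):
            if bit:
                for k, h in enumerate(hash_funcs):
                    slots[k][c] = min(slots[k][c], h(j))
    return slots

matrix = [[1, 0], [0, 1], [1, 1], [1, 0], [0, 1]]  # rows R1..R5
h = lambda x: x % 5
g = lambda x: (2 * x + 1) % 5
slots = one_pass_minhash(matrix, [h, g])
# slots == [[1, 0], [2, 0]], i.e. C1 signature (h=1, g=2), C2 signature (h=0, g=0)
```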