Near Duplicate Detection


Near Duplicate Detection
Slides adapted from: Information Retrieval and Web Search, Stanford University, Christopher Manning and Prabhakar Raghavan; and CS345A Data Mining, Winter 2009, Stanford University, Anand Rajaraman and Jeffrey D. Ullman.

Web search overall picture (Sec. 19.4.1)
[Figure: overall web-search architecture. A web spider crawls the Web and feeds pages to the indexer, which builds the indexes (including ad indexes); users issue queries, which the search component answers using the indexes and link structure.]

Duplicate documents (Sec. 19.6)
The web is full of duplicated content; about 30% of pages are duplicates. Duplicates need to be removed for crawling, indexing, and statistical studies.
Strict duplicate detection (exact match) is not that common a case. But there are many, many cases of near duplicates, e.g., two copies of a page whose only difference is the last-modified date.

Other applications
Many Web-mining problems can be expressed as finding "similar" sets:
Topic classification: pages with similar words, mirror web sites, similar news articles.
Recommendation systems: Netflix users with similar tastes in movies; movies with similar sets of fans; images of related things.
Plagiarism detection.

Algorithms for finding similarities
Edit distance: the distance between A and B is defined as the minimal number of edit operations needed to turn A into B. Mathematically elegant, with many applications (such as auto-correction of spelling), but not efficient at web scale.
Shingling: the approach developed in the rest of this lecture.
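As a concrete reference point for the edit-distance approach, here is a minimal Python sketch (not from the slides) of the standard dynamic-programming computation; its O(|A| x |B|) cost per pair is exactly what makes it impractical for comparing all pairs of web pages.

```python
def edit_distance(a: str, b: str) -> int:
    """Minimal number of insertions, deletions, and substitutions
    needed to turn string a into string b (Levenshtein distance)."""
    # prev[j] holds the distance between the processed prefix of a and b[:j]
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # substitute (free if chars match)
            ))
        prev = curr
    return prev[-1]
```

For example, edit_distance("rose", "roses") is 1 (one insertion).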

Techniques for Similar Documents
Shingling: convert documents, emails, etc., to sets (the set of terms of length k that appear in the document).
Minhashing: convert large sets to short signatures (short integer vectors) that represent the sets and reflect their similarity.
Candidate pairs: those pairs of signatures that we need to test for similarity.
Pipeline: Document → [shingling] → set of k-shingles → [minhashing] → signature → candidate pairs.
From Anand Rajaraman (anand @ kosmix dt com), Jeffrey D. Ullman

Shingles
A k-shingle (or k-gram) for a document is a sequence of k consecutive terms that appears in the document.
Example (k = 4): a rose is a rose is a rose
Sliding a 4-term window gives the shingles: a rose is a, rose is a rose, is a rose is, a rose is a, rose is a rose.
The set of shingles is {a rose is a, rose is a rose, is a rose is}.
Note that "a rose is a" occurs twice in the document but appears only once in the set.
Option: regard shingles as a bag (multiset), and count "a rose is a" twice.
Represent a doc by its set of k-shingles. Documents that have lots of shingles in common have similar text, even if the text appears in a different order.
Careful: you must pick k large enough. If k = 1, most documents overlap a lot.
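The sliding-window construction above can be sketched in a few lines of Python (an illustration, not code from the slides):

```python
def shingles(text: str, k: int = 4) -> set[str]:
    """The set of k-term shingles (word k-grams) of a document."""
    terms = text.split()
    # every window of k consecutive terms, collected into a set
    return {" ".join(terms[i:i + k]) for i in range(len(terms) - k + 1)}
```

On the example document, shingles("a rose is a rose is a rose") yields the three distinct shingles listed above.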

Jaccard similarity
Doc 1: a rose is a rose is a rose → {a rose is a, rose is a rose, is a rose is, a rose is a}
Doc 2: A rose is a rose that is it → {a rose is a, rose is a rose, is a rose that, a rose that is, rose that is it}
2 shingles in the intersection, 7 in the union.
Jaccard similarity = 2/7.
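The Jaccard similarity of two shingle sets is simply intersection size over union size; a minimal Python sketch (an illustration, not from the slides, shown here on small toy sets):

```python
def jaccard(a: set, b: set) -> float:
    """Jaccard similarity |A ∩ B| / |A ∪ B| of two sets."""
    if not a and not b:
        return 1.0  # convention: two empty sets are identical
    return len(a & b) / len(a | b)
```

For instance, jaccard({1, 2, 3}, {2, 3, 4}) is 2/4 = 0.5.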

The size is the problem
The shingle sets can be very large, and there are many documents (many shingle sets) to compare.
Problems:
Memory: the shingle sets are so large, or so many, that they cannot all fit in main memory.
Time: there are so many sets that comparing all pairs of sets takes too much time.
Or both.

Shingles + Set Intersection
Computing the exact set intersection of shingles between all pairs of documents is expensive/intractable.
Instead, approximate using a cleverly chosen subset of shingles from each document (a sketch), and estimate size_of_intersection / size_of_union from the short sketches:
Doc A → shingle set A → sketch A
Doc B → shingle set B → sketch B
Compare the sketches to estimate the Jaccard similarity.

Set Similarity of sets Ci, Cj (Sec. 19.6)
View sets as columns of a matrix A, with one row for each element in the universe; aij = 1 indicates the presence of element i in set j.
Example:
  C1 C2
   0  1
   1  0
   1  1
   0  0
   1  1
   0  1
Jaccard(C1, C2) = 2/5 = 0.4

Key Observation (Sec. 19.6)
For columns Ci, Cj, there are four types of rows:
       Ci  Cj
  A     1   1
  B     1   0
  C     0   1
  D     0   0
Overloading notation, let A also denote the number of rows of type A (and likewise B, C, D).
Claim: Jaccard(Ci, Cj) = A / (A + B + C).

Estimating Jaccard similarity (Sec. 19.6)
Randomly permute the rows, and define the hash h(Ci) = index of the first row (in the permuted order) with a 1 in column Ci.
Surprising property: P[h(Ci) = h(Cj)] = Jaccard(Ci, Cj).
Why? Look down columns Ci, Cj until the first non-type-D row. h(Ci) = h(Cj) exactly when that row is of type A, so both probabilities equal A/(A+B+C).
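The property can be checked empirically: averaging the indicator h(Ci) = h(Cj) over many random row permutations converges to the Jaccard similarity. A small Python simulation (an illustration, not from the slides), using the 6-row example columns from the earlier slide, whose true Jaccard similarity is 0.4:

```python
import random

def minhash(column: list[int], perm: list[int]) -> int:
    """h(C): smallest permuted index among rows where the column has a 1."""
    return min(perm[i] for i, bit in enumerate(column) if bit)

def estimate_jaccard(c1: list[int], c2: list[int],
                     n_perms: int = 10000, seed: int = 0) -> float:
    """Fraction of random row permutations on which the minhashes agree."""
    rng = random.Random(seed)
    matches = 0
    for _ in range(n_perms):
        perm = list(range(len(c1)))
        rng.shuffle(perm)  # a fresh random permutation of the rows
        matches += minhash(c1, perm) == minhash(c2, perm)
    return matches / n_perms
```

With enough permutations the estimate lands close to 2/5 = 0.4 for those columns.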

Representing documents and shingles
To compress long shingles, we can hash them to (say) 4 bytes, and represent a doc by the set of hash values of its k-shingles.
Represent the documents as a matrix: each column is a document, each row is a shingle.
[Matrix figure: a 7 x 4 0/1 shingle-document matrix for doc1..doc4 (4 documents, 7 shingles in total); the individual entries are not recoverable from the transcript.]
In real applications the matrix is sparse; there are many empty cells.

Random permutation
[Figure: the rows of the input matrix are randomly permuted (implemented by hashing the row indices and sorting); for each of the 4 documents, the signature entry is the index of the first row containing a 1. Resulting signature agreements: docs 1~3: 1, docs 2~4: 1, docs 1~4: 0.]

Repeat the previous process
[Figure: applying further random permutations to the input matrix appends additional rows to the signature matrix M, one row per permutation.]

More hashings produce better results
[Figure: the input matrix and a 3-row signature matrix M.]
Comparing true column similarities with signature similarities for pairs of documents:
            1-3    2-4    1-2    3-4
  Col/Col  0.75   0.75     0      0
  Sig/Sig  0.67   1.00     0      0
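The slide's Sig/Sig row is exactly what the following Python sketch computes (an illustration, not from the slides): build a signature matrix from many random permutations, then compare two documents by the fraction of signature rows on which they agree, which approximates their true Jaccard similarity.

```python
import random

def signatures(columns: list[list[int]],
               n_hashes: int = 100, seed: int = 1) -> list[list[int]]:
    """Minhash signature matrix: one row per random permutation,
    one column per document; each entry is the smallest permuted
    index among rows where that document's column has a 1."""
    rng = random.Random(seed)
    n_rows = len(columns[0])
    sig = []
    for _ in range(n_hashes):
        perm = list(range(n_rows))
        rng.shuffle(perm)
        sig.append([min(perm[i] for i in range(n_rows) if col[i])
                    for col in columns])
    return sig

def sig_similarity(sig: list[list[int]], a: int, b: int) -> float:
    """Fraction of signature rows on which documents a and b agree."""
    return sum(row[a] == row[b] for row in sig) / len(sig)
```

For two columns with true Jaccard similarity 0.75, the signature similarity concentrates around 0.75 as the number of hashes grows.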

Sketch of a document (Sec. 19.6)
Create a "sketch vector" (of size ~200) for each document. Documents that share ≥ t (say 80%) of their corresponding vector elements are near duplicates.
For doc D, sketchD[i] is computed as follows:
Let f map all shingles in the universe to 0..2^m (e.g., f = fingerprinting).
Let pi be a random permutation on 0..2^m.
sketchD[i] = MIN { pi(f(s)) } over all shingles s in D.

Computing sketchD[i] for Doc 1 (Sec. 19.6)
[Figure: the shingles of Document 1 are mapped onto the number line 0..2^64 by a 64-bit fingerprint f, the line is permuted by pi, and the minimum value is picked as sketchD[i].]

Test if Doc1.Sketch[i] = Doc2.Sketch[i] (Sec. 19.6)
[Figure: the permuted fingerprints of Document 1 and Document 2 on the number line 0..2^64, with minima A and B; test whether A = B.]
Repeat the test for 200 random permutations: p1, p2, ..., p200.
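Putting the last two slides together, here is a minimal Python sketch of the whole scheme (an illustration, not from the slides). One common simplification, assumed here, is to stand in for the true random permutations pi with random affine maps pi(x) = (a*x + b) mod P for a prime P; these are only approximately min-wise independent, but are standard practice.

```python
import random

P = (1 << 61) - 1  # a Mersenne prime, acting as the permutation range

def make_perms(n: int = 200, seed: int = 42) -> list[tuple[int, int]]:
    """n random affine 'permutations' pi(x) = (a*x + b) mod P,
    a standard stand-in for truly random permutations."""
    rng = random.Random(seed)
    return [(rng.randrange(1, P), rng.randrange(P)) for _ in range(n)]

def sketch(fingerprints: list[int],
           perms: list[tuple[int, int]]) -> list[int]:
    """sketch[i] = min over the doc's shingle fingerprints f of pi_i(f)."""
    return [min((a * f + b) % P for f in fingerprints) for a, b in perms]

def near_duplicate(sk1: list[int], sk2: list[int], t: float = 0.8) -> bool:
    """Docs are near duplicates if >= t of the sketch entries agree."""
    agree = sum(x == y for x, y in zip(sk1, sk2))
    return agree / len(sk1) >= t
```

Two documents sharing almost all their shingle fingerprints agree on almost all sketch positions, while documents with disjoint fingerprint sets agree on essentially none.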