Detecting Phrase-Level Duplication on the World Wide Web. Fetterly, Manasse, Najork. Paper presentation by Vinay Goel.
Introduction. Problem: identify instances of “slice and dice” generation. Example: a German spammer served 1 million URLs originating from a single IP address (but using many host names); the pages changed completely on every download and consisted of grammatically well-formed sentences stitched together at random.
Goal: find instances of sentence-level synthesis of web pages; more generally, find pages with an unusually large number of popular phrases.
The Data. Datasets: DS1 and DS2. DS1: BFS crawl starting at www.yahoo.com, 151 million HTML pages. DS2: large crawl conducted by MSN Search, 96 million HTML pages chosen at random.
Finding Phrase Replication. Sampling: reduce each document to a feature vector by employing a variant of the shingling algorithm of Broder et al.; this significantly reduces the data volume.
Sampling method: replace all HTML markup with white-space. The k-phrases of a document are all sequences of k consecutive words. Treat the document as a circle (the last word is followed by the first word), so an n-word document has exactly n phrases.
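The circular k-phrase construction can be sketched as follows. This is a minimal illustration, not the authors' code: the function names `words_of` and `k_phrases` are hypothetical, and the regular expression is a crude stand-in for real HTML markup removal.

```python
import re

def words_of(html):
    """Replace anything that looks like an HTML tag with white-space, then split.

    Assumption: a simple regex is good enough for illustration; real markup
    handling (comments, scripts, entities) is more involved.
    """
    return re.sub(r"<[^>]*>", " ", html).split()

def k_phrases(words, k):
    """Return the k-phrases of a document treated as a circle of words."""
    n = len(words)
    if n == 0:
        return []
    # Phrase i starts at word i and wraps past the last word to the first,
    # so an n-word document yields exactly n phrases.
    return [tuple(words[(i + j) % n] for j in range(k)) for i in range(n)]

doc = words_of("<p>the quick <b>brown</b> fox</p>")
phrases = k_phrases(doc, 3)
```

Because of the wrap-around, the phrase starting at "brown" is ("brown", "fox", "the"), and the number of phrases equals the number of words regardless of k.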
Sampling method: exploit properties of Rabin fingerprints. Rabin fingerprints support efficient extension and prefix deletion. Fingerprints of distinct bit patterns are distinct with high probability.
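To illustrate what "extension and prefix deletion" buy, here is a small sketch. Note the swap: it uses a polynomial rolling hash modulo a prime (as in Rabin-Karp), not true Rabin fingerprints over GF(2). The constants `BASE` and `MOD` and all function names are assumptions made for this example.

```python
BASE = 257              # hypothetical multiplier
MOD = (1 << 61) - 1     # a large prime modulus, chosen for illustration

def extend(h, token):
    """Fingerprint of a sequence after appending one token at the end."""
    return (h * BASE + token) % MOD

def delete_prefix(h, first_token, length):
    """Fingerprint after removing the first token of a `length`-token sequence."""
    return (h - first_token * pow(BASE, length - 1, MOD)) % MOD

def fingerprint(tokens):
    """Fingerprint a whole sequence from scratch."""
    h = 0
    for t in tokens:
        h = extend(h, t)
    return h

tokens = [3, 1, 4, 1, 5]
h = fingerprint(tokens[:3])                            # window [3, 1, 4]
h = extend(delete_prefix(h, tokens[0], 3), tokens[3])  # slide to [1, 4, 1]
```

Sliding the window this way costs constant work per step, so all n phrase fingerprints of a document can be computed in one pass instead of rehashing k tokens for every phrase.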
Computing feature vectors: fingerprint each word in the document, giving n tokens; compute the fingerprint of each k-token phrase, giving n phrase fingerprints; apply m different fingerprint functions; retain the smallest of the n resulting values for each function. The resulting vector of m fingerprints is representative of the document (its elements are referred to as shingles).
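The steps on this slide can be sketched roughly as below. This is an assumption-laden illustration: salted BLAKE2b from Python's `hashlib` stands in for the Rabin fingerprint functions, the names `fp` and `feature_vector` are hypothetical, and the defaults `k=5` and `m=8` are placeholder values, not parameters taken from the slides.

```python
import hashlib

def fp(data, salt=0):
    """64-bit fingerprint of `data`; different salts act as different functions."""
    digest = hashlib.blake2b(data, digest_size=8,
                             salt=salt.to_bytes(8, "big")).digest()
    return int.from_bytes(digest, "big")

def feature_vector(words, k=5, m=8):
    """Reduce a document (list of words) to a vector of m shingles."""
    n = len(words)
    if n == 0:
        return []
    # Step 1: fingerprint each word, giving n tokens.
    tokens = [fp(w.encode("utf-8")).to_bytes(8, "big") for w in words]
    # Step 2: one byte string per circular k-token phrase, giving n phrases.
    phrases = [b"".join(tokens[(i + j) % n] for j in range(k)) for i in range(n)]
    # Steps 3 and 4: for each of the m functions, keep the smallest value.
    return [min(fp(p, salt=i + 1) for p in phrases) for i in range(m)]

v1 = feature_vector("a b c d e f g".split(), k=3, m=4)
v2 = feature_vector("c d e f g a b".split(), k=3, m=4)
```

Since the document is treated as a circle, rotating the word order leaves the set of phrases unchanged, so `v1` and `v2` are identical in this sketch.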
Duplicate Suppression: replication is rampant on the web. All pages in the data set were clustered into equivalence classes; each class contains all pages that are exact or near duplicates of one another.
Popular phrases: phrases that occur in more documents than would be expected by chance. Assumptions: “normal” web pages are characterized by a generative model; the sought web pages are characterized by a copying model (need to consider number of phrases, length of typical documents…).
Popular Phrases: limit attention to the shingles chosen by the sampling functions. A phrase is popular if it is selected as a shingle in sufficiently many documents. To determine popular phrases, consider triplets (i,s,d).
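One way to count popularity from such triplets is sketched below. The slide does not spell out the triplet fields, so this assumes i is the index of the sampling function, s the shingle value, and d a document identifier; the function name `popular_shingles` and the threshold parameter `min_docs` are hypothetical.

```python
from collections import defaultdict

def popular_shingles(triplets, min_docs):
    """Map each (i, s) pair to its document count, keeping only popular ones.

    Assumption: a shingle is "popular" when at least `min_docs` distinct
    documents selected it under the same sampling function i.
    """
    docs = defaultdict(set)
    for i, s, d in triplets:
        docs[(i, s)].add(d)  # a set ignores repeated (i, s, d) entries
    return {key: len(ds) for key, ds in docs.items() if len(ds) >= min_docs}

triplets = [
    (0, 111, "doc1"), (0, 111, "doc2"), (0, 111, "doc3"),
    (0, 222, "doc4"),
    (1, 111, "doc1"), (1, 333, "doc2"), (1, 333, "doc3"),
]
popular = popular_shingles(triplets, min_docs=2)
```

At the scale of the crawls described earlier, the same grouping would more plausibly be done by sorting the triplets on (i, s) and scanning runs, rather than holding everything in memory.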
Popular Phrases: the first 24 most popular phrases are not very interesting. Starting from the 36th phrase, we discover phrases caused by machine-generated content. Templatic form: common text, “fill in the blank” slots and optional; 60th phrase: an instance of an idiomatic phrase.
Zipfian Distribution
Histogram of popular shingles per doc
Covering set: compute covering sets for the shingles of each page, approximating a minimum covering set with a greedy heuristic.
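The greedy heuristic can be sketched as the standard greedy set-cover approximation. The framing here is an assumption: a page's covering set is taken to be a set of other pages that together contain all of its shingles. The names `greedy_cover`, `target`, and `candidates` are hypothetical.

```python
def greedy_cover(target, candidates):
    """Greedily pick candidate pages until the target's shingles are covered.

    target: set of shingles of the page being examined.
    candidates: dict mapping page id -> set of shingles on that page.
    Returns (chosen page ids in pick order, shingles left uncovered).
    """
    uncovered = set(target)
    chosen = []
    while uncovered:
        # Pick the page that covers the most still-uncovered shingles.
        best = max(candidates,
                   key=lambda d: len(uncovered & candidates[d]),
                   default=None)
        if best is None or not (uncovered & candidates[best]):
            break  # nothing left can cover the remaining shingles
        chosen.append(best)
        uncovered -= candidates[best]
    return chosen, uncovered

pages = {"a": {1, 2}, "b": {3}, "c": {2, 3, 4}}
chosen, left = greedy_cover({1, 2, 3, 4}, pages)
```

Here page "c" is picked first because it covers three shingles, then "a" covers the remaining one, giving a covering set of size 2. The greedy choice does not guarantee a minimum cover, only an approximation of one.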
Distribution of covering set sizes
German spammer
Looking for likely sources
Conclusion. Power law distribution. Popular phrases: often limited by design choices; legal disclaimers; navigational phrases; “fill in the blanks”. More replicated than original content.