Detecting Phrase-Level Duplication on the World Wide Web


Detecting Phrase-Level Duplication on the World Wide Web
Fetterly, Manasse, Najork
Paper presentation by: Vinay Goel

Introduction
- Problem: identify instances of “slice and dice” page generation
- Example: a German spammer
  - 1 million URLs originating from a single IP address (but using many host names)
  - Pages changed completely on every download
  - Pages consisted of grammatically well-formed sentences stitched together at random

Goal
- Find instances of sentence-level synthesis of web pages
- More generally, find pages with an unusually large number of popular phrases

The Data
- Datasets: DS1 and DS2
- DS1: BFS crawl starting at www.yahoo.com; 151 million HTML pages
- DS2: large crawl conducted by MSN Search; 96 million HTML pages chosen at random

Finding Phrase Replication
- Sampling: reduce each document to a feature vector
- Employ a variant of the shingling algorithm of Broder et al.
- Sampling significantly reduces the data volume

Sampling method
- Replace all HTML markup with white-space
- k-phrases of a document: all sequences of k consecutive words
- Treat the document as a circle: the last word is followed by the first word
- An n-word document has exactly n phrases
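
The circular k-phrase idea can be sketched in a few lines of Python. This is only an illustration: the function name `k_phrases` and the crude regular expression used to strip markup are assumptions, not part of the paper.

```python
import re

def k_phrases(html: str, k: int) -> list[tuple[str, ...]]:
    """Return the k-phrases of a document, treating it as a circle."""
    # Replace HTML markup with white-space (a crude regex; an assumption).
    text = re.sub(r"<[^>]*>", " ", html)
    words = text.split()
    n = len(words)
    # Phrase i starts at word i and wraps around to the first word,
    # so an n-word document yields exactly n phrases.
    return [tuple(words[(i + j) % n] for j in range(k)) for i in range(n)]

phrases = k_phrases("<p>the quick brown <b>fox</b> jumps</p>", k=3)
```

Here the five-word document yields five 3-phrases, the last of which wraps around to the start of the document.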

Sampling method
- Exploit properties of Rabin fingerprints
- Rabin fingerprints support efficient extension and prefix deletion
- Fingerprints of distinct bit patterns are distinct with high probability
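
To see why extension and prefix deletion matter, here is a minimal sketch of a sliding-window hash. Note that it uses a simple polynomial rolling hash modulo a prime as a stand-in, not a true Rabin fingerprint (which works with polynomials over GF(2)); `MOD`, `BASE` and the function names are arbitrary choices for this illustration.

```python
MOD = (1 << 61) - 1   # a Mersenne prime; arbitrary choice for this sketch
BASE = 1_000_003      # arbitrary base

def direct_hash(tokens):
    """Hash a token sequence from scratch (Horner's rule)."""
    h = 0
    for t in tokens:
        h = (h * BASE + t) % MOD
    return h

def sliding_hashes(tokens, k):
    """Hash every window of k consecutive tokens in O(1) per step."""
    top = pow(BASE, k - 1, MOD)  # weight of the oldest token in a window
    h = direct_hash(tokens[:k])
    out = [h]
    for i in range(k, len(tokens)):
        h = (h - tokens[i - k] * top) % MOD   # prefix deletion
        h = (h * BASE + tokens[i]) % MOD      # extension
        out.append(h)
    return out

tokens = [17, 4, 99, 23, 8, 42]
windows = sliding_hashes(tokens, 3)
```

Each window hash is derived from the previous one, so hashing all k-token phrases costs time proportional to the document length rather than to k times the length.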

Computing feature vectors
- Fingerprint each word in the document: gives n tokens
- Compute the fingerprint of each k-token phrase: gives n phrase fingerprints
- Apply m different fingerprint functions to each phrase fingerprint
- Retain the smallest of the n resulting values for each function
- The resulting vector of m fingerprints is representative of the document (its elements are referred to as shingles)
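
The steps above can be sketched as follows. This is an illustration under stated assumptions: salted BLAKE2b from the standard library stands in for the m Rabin fingerprint functions, and the values of `k` and `m` are arbitrary, not the paper's settings.

```python
import hashlib

def fingerprint(data: bytes, i: int) -> int:
    # Stand-in for the i-th fingerprint function: BLAKE2b salted with i.
    digest = hashlib.blake2b(data, digest_size=8,
                             salt=i.to_bytes(8, "big")).digest()
    return int.from_bytes(digest, "big")

def feature_vector(words: list[str], k: int = 5, m: int = 8) -> list[int]:
    """Return m shingles for a non-empty list of words."""
    n = len(words)
    # Fingerprint each word: n tokens.
    tokens = [fingerprint(w.encode("utf-8"), 0).to_bytes(8, "big")
              for w in words]
    # Build each circular k-token phrase: n phrases.
    phrases = [b"".join(tokens[(i + j) % n] for j in range(k))
               for i in range(n)]
    # For each of the m functions, keep the smallest of the n values.
    return [min(fingerprint(p, i) for p in phrases)
            for i in range(1, m + 1)]

words = "a rose is a rose is a rose indeed".split()
vec = feature_vector(words)
```

Because the document is treated as a circle, rotating the word list leaves the set of phrases, and therefore the feature vector, unchanged.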

Duplicate Suppression
- Replication is rampant on the web
- Clustered all pages in the data set into equivalence classes
- Each class contains all pages that are exact or near duplicates of one another
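
One way to form such equivalence classes is union-find over feature vectors. The slides do not say how near-duplicates were decided, so the rule used here (two pages agree in at least `min_shared` vector positions) is a hypothetical criterion, and the quadratic pairwise comparison is only suitable for a toy example.

```python
from itertools import combinations

def cluster(vectors: dict[str, list[int]], min_shared: int) -> list[set[str]]:
    """Group pages into classes via union-find (transitive closure)."""
    parent = {url: url for url in vectors}

    def find(x: str) -> str:
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in combinations(vectors, 2):
        # Hypothetical near-duplicate test: count matching positions.
        shared = sum(x == y for x, y in zip(vectors[a], vectors[b]))
        if shared >= min_shared:
            parent[find(a)] = find(b)

    classes: dict[str, set[str]] = {}
    for url in vectors:
        classes.setdefault(find(url), set()).add(url)
    return list(classes.values())

pages = {"a": [1, 2, 3, 4], "b": [1, 2, 3, 9],
         "c": [1, 2, 8, 9], "d": [5, 6, 7, 0]}
classes = cluster(pages, min_shared=3)
```

In this toy input, "a" and "c" share only two positions, but both are near duplicates of "b", so all three end up in the same class while "d" stays alone.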

Popular phrases
- Occur in more documents than would be expected by chance
- Assumptions:
  - “Normal” web pages are characterized by a generative model
  - The sought web pages follow a copying model (need to consider the number of phrases, the length of typical documents…)

Popular Phrases
- Limit attention to the shingles chosen by the sampling functions
- A phrase is popular if it is selected as a shingle in sufficiently many documents
- To determine popular phrases, consider triplets (i,s,d)
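
A minimal sketch of counting popular shingles from such triplets. It assumes that a triplet (i,s,d) means "s is the shingle chosen by the i-th sampling function for document d"; that reading, the function name and the `min_docs` threshold are assumptions for this illustration.

```python
from collections import defaultdict

def popular_shingles(triplets, min_docs):
    """Map (i, s) to its document count, keeping only popular shingles."""
    docs = defaultdict(set)
    for i, s, d in triplets:
        docs[(i, s)].add(d)   # distinct documents per (function, shingle)
    return {key: len(ds) for key, ds in docs.items() if len(ds) >= min_docs}

triplets = [
    (0, 111, "d1"), (0, 111, "d2"), (0, 111, "d3"),
    (1, 222, "d1"), (1, 333, "d2"), (1, 222, "d3"),
]
popular = popular_shingles(triplets, min_docs=3)
```

Only shingle 111 under function 0 appears in three documents, so it is the only one reported as popular here.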

Popular Phrases
- The first 24 most popular phrases are not very interesting
- Starting from the 36th phrase, we discover phrases caused by machine-generated content
- Templatic form: common text, “fill in the blank” slots and optional
- 60th phrase: an instance of an idiomatic phrase

Zipfian Distribution

Histogram of popular shingles per doc

Covering set
- Compute covering sets for the shingles of each page
- Approximate a minimum covering set using a greedy heuristic
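
The greedy heuristic can be sketched as the standard greedy set-cover approximation. The interpretation used here (a covering set is a set of other documents that together contain the page's shingles) and all names are assumptions for this illustration.

```python
def greedy_cover(target: set[int], sources: dict[str, set[int]]) -> list[str]:
    """Pick documents greedily until the target shingles are covered."""
    uncovered = set(target)
    chosen = []
    while uncovered:
        # Choose the document covering the most still-uncovered shingles.
        best = max(sources, key=lambda d: len(sources[d] & uncovered),
                   default=None)
        if best is None or not (sources[best] & uncovered):
            break  # remaining shingles appear in no other document
        chosen.append(best)
        uncovered -= sources[best]
    return chosen

page = {1, 2, 3, 4, 5, 6}
others = {"x": {1, 2, 3, 4}, "y": {3, 4, 5}, "z": {5, 6}, "w": {1, 6}}
cover = greedy_cover(page, others)
```

The heuristic first picks "x" (four shingles), then "z" (the remaining two), giving a cover of size two. A greedy choice is not guaranteed to be minimum in general, which is why the slide speaks of approximating the minimum covering set.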

Distribution of covering set sizes

German spammer

Looking for likely sources

Conclusion
- Power law distribution
- Popular phrases
  - Often limited by design choices
  - Legal disclaimers
  - Navigational phrases
  - “fill in the blanks”
- More replicated than original content