Detecting Near-Duplicates for Web Crawling
Gurmeet Singh Manku, Arvind Jain, Anish Das Sarma
Presented by Yen-Yi Hung

Overview
- Introduction
- Previous Related Work
- Algorithm
- Evaluation
- Future Work
- Pros & Cons
- Comment
- References

Introduction – The drawbacks of duplicate pages
- Waste network bandwidth
- Affect refresh times
- Impact politeness constraints
- Increase storage cost
- Degrade the quality of search indexes
- Increase the load on the remote hosts serving such pages
- Reduce customer satisfaction

Introduction – Challenge and contributions of this paper
Challenge:
- Dealing with scale: determining near-duplicates efficiently over billions of pages
Contributions:
- Showing that simhash can handle a huge volume of queries
- Developing a way to solve the Hamming distance problem quickly (for both online single queries and batch multi-queries)

Previous Related Work
Related techniques differ in the corpora they target, their end goals, their feature sets, and their signature schemes.
Corpora:
- Web documents
- Files in a file system
- Emails
- Domain-specific corpora

Previous Related Work (II)
End goals:
- Web mirrors
- Clustering of related documents
- Data extraction
- Plagiarism detection
- Spam detection
- Duplicates in domain-specific corpora
Feature sets:
- Shingles from page content
- Document vectors from page content
- Connectivity information
- Anchor text and anchor windows
- Phrases

Previous Related Work (III)
Signature schemes:
- Mod-p shingles
- Min-hash for Jaccard similarity of sets
- Signatures/fingerprints over IR-based document vectors
- Checksums
This paper focuses on web documents. Its goal is to improve web crawling using the simhash technique.

Algorithm – Simhash fingerprinting
What does simhash do? It maps high-dimensional vectors to small fingerprints.
The unusual simhash property: similar documents have similar hash values (unlike conventional hash functions, where similar inputs hash to very different values).
How is it applied? Each web page is converted to a set of weighted features, computed using standard IR techniques, as sketched below.
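As a concrete illustration, here is a minimal simhash sketch in Python. This is my own illustration, not the paper's code; the MD5 feature hash and the dict of token weights are assumptions for the example.

```python
import hashlib

def simhash(weighted_features, f=64):
    """Minimal simhash: every feature casts a weighted vote on each of
    the f fingerprint bits; the sign of each running sum decides the bit."""
    v = [0.0] * f
    for feature, weight in weighted_features.items():
        # Hash the feature to at least f bits (MD5 is an arbitrary choice).
        h = int(hashlib.md5(feature.encode("utf-8")).hexdigest(), 16)
        for i in range(f):
            v[i] += weight if (h >> i) & 1 else -weight
    return sum(1 << i for i in range(f) if v[i] > 0)

# Pages sharing most weighted features agree in most fingerprint bits.
fp1 = simhash({"web": 2.0, "crawl": 1.5, "dedup": 1.0})
fp2 = simhash({"web": 2.0, "crawl": 1.5, "index": 1.0})
print(bin(fp1 ^ fp2).count("1"))  # small Hamming distance
```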

Algorithm – Hamming Distance Problem
The Hamming distance problem: given a collection of f-bit fingerprints and a query fingerprint F, identify whether any existing fingerprint differs from F in at most k bits. Simply probing the whole fingerprint collection is impractical, so instead:
1. Build t tables T_1, T_2, …, T_t. Each table T_i is associated with an integer p_i and a permutation π_i.
2. Apply permutation π_i to each existing fingerprint in table T_i, and sort each T_i (see the sketch below).
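A sketch of the table-building step, under two assumptions of mine: 64-bit fingerprints, and a uniformly random bit permutation for simplicity. The paper instead chooses permutations that move specific blocks of bits to the top, so that every pattern of k differing bits is covered by some table.

```python
import random

F = 64  # fingerprint width in bits

def make_table(fingerprints, seed=0):
    """One table T_i: apply a bit permutation to every fingerprint and
    sort the results so the leading bits can be binary-searched."""
    rng = random.Random(seed)
    perm = list(range(F))
    rng.shuffle(perm)  # simplification: the paper permutes blocks of bits

    def apply_perm(fp):
        out = 0
        for dst, src in enumerate(perm):
            out |= ((fp >> src) & 1) << dst
        return out

    return apply_perm, sorted(apply_perm(fp) for fp in fingerprints)
```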

Algorithm – Hamming Distance Problem (II)
3. Given a fingerprint F and the integer k bounding the Hamming distance, solve the problem in two steps:
Step 1: In each table T_i, find all permuted fingerprints whose top p_i bit positions match the top p_i bit positions of π_i(F).
Step 2: For each fingerprint found in Step 1, check whether it differs from π_i(F) in at most k bit positions.
Time complexity: Step 1 can be done in O(p_i) steps using binary search; in practice it shrinks to O(log p_i) steps using interpolation search.
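A self-contained sketch of probing one table (my illustration, not the paper's code; it assumes the caller has already permuted the query and the stored fingerprints with the same π_i):

```python
from bisect import bisect_left

def probe(sorted_fps, q, p, k=3, f=64):
    """Query one table. sorted_fps holds the permuted, sorted
    fingerprints of T_i; q is the permuted query pi_i(F).
    Step 1: binary-search the run whose top p bits match q's top p bits.
    Step 2: keep candidates within Hamming distance k of q."""
    prefix = q >> (f - p)
    start = bisect_left(sorted_fps, prefix << (f - p))
    matches = []
    for fp in sorted_fps[start:]:
        if fp >> (f - p) != prefix:      # past the matching run
            break
        if bin(fp ^ q).count("1") <= k:  # Step 2: popcount of the XOR
            matches.append(fp)
    return matches
```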

Algorithm – Compression of Fingerprints
Step 1: The first fingerprint in the block is stored in its entirety.
Step 2: Compute the XOR of each pair of successive fingerprints and take the position of its most significant 1-bit, denoted h.
Step 3: Append the Huffman code of h to the block.
Step 4: Append the bits to the right of the most significant 1-bit to the block.
Step 5: Repeat Steps 2-4 until the block (1024 bytes) is full.
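A simplified sketch of this delta encoding (my illustration): the Huffman coding of h is omitted and a plain (h, low_bits) pair stands in for it, and the fingerprints are assumed sorted and distinct.

```python
def delta_encode_block(sorted_fps):
    """Store the first fingerprint whole; for each successor keep only
    h (the most significant 1-bit of the XOR with its predecessor) and
    the h bits of the XOR below that position."""
    first, deltas = sorted_fps[0], []
    for prev, cur in zip(sorted_fps, sorted_fps[1:]):
        x = prev ^ cur                  # assumes distinct fingerprints
        h = x.bit_length() - 1          # position of most significant 1-bit
        deltas.append((h, x & ((1 << h) - 1)))
    return first, deltas

def delta_decode_block(first, deltas):
    """Rebuild each fingerprint from its predecessor and its delta."""
    fps = [first]
    for h, low in deltas:
        fps.append(fps[-1] ^ ((1 << h) | low))
    return fps
```

This works because sorted fingerprints share their high-order bits, so the XOR of neighbors is small and its low bits cost far less space than a full f-bit fingerprint.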

Algorithm – Batch query implementation
Both file F (the existing fingerprints) and file Q (the batch of query fingerprints) are stored in GFS, a shared-nothing distributed file system. The batch queries are processed in two phases:
Phase 1: Solve the Hamming distance problem, each task taking some chunk of F and the entire file Q as input; the outputs are near-duplicate fingerprints.
Phase 2: MapReduce removes duplicates from the Phase 1 results and produces a single sorted file.
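A sketch of the two-phase structure (my illustration, not the paper's code): a naive scan stands in for the table-based probing a real mapper would perform, and the MapReduce plumbing is reduced to plain functions.

```python
def phase1_mapper(f_chunk, queries, k=3):
    """Phase 1: each mapper gets one chunk of file F plus all of file Q
    and emits near-duplicate fingerprint pairs."""
    for q in queries:
        for fp in f_chunk:
            if bin(fp ^ q).count("1") <= k:
                yield (q, fp)

def phase2(mapper_outputs):
    """Phase 2: merge all mapper outputs, drop duplicates, and sort,
    yielding the single result file (done via MapReduce in the paper)."""
    merged = set()
    for out in mapper_outputs:
        merged.update(out)
    return sorted(merged)
```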

Evaluation
Is simhash a reasonable technique for de-duplication? Choosing k = 3 yields precision and recall ≈ 0.75.
* For comparison, the algorithms evaluated in "Finding near-duplicate web pages: a large-scale evaluation of algorithms" by M. R. Henzinger (2006) achieve precision and recall around 0.8.

Evaluation (II)
Does the characteristic distribution of simhash values affect the results, and if so, significantly?
Fig. 2(a): The right half of the distribution follows the expected shape but the left half does not, because some pages with similar content differ only moderately in their simhash values.
Fig. 2(b): The distribution has spikes caused by empty pages, "file not found" pages, and the near-identical login pages generated by common bulletin-board software.

Evaluation (III)
With 32 GB of batch query fingerprints and 200 mappers, the combined scan rate can exceed 1 GBps. For a fixed number of mappers, the time taken is roughly proportional to the size of file Q. (Compression plays an important role here.)

Future Work
Based on this paper:
- Document size
- Category information in de-duplication
- Near-duplicate detection vs. clustering
Other research directions:
- A more cost-effective approach that uses only URL information for de-duplication

Pros
- Efficient and practical
- Uses compression and a purpose-built storage design (GFS) to make fingerprint-based de-duplication workable at scale
- Gives a compact but thorough survey of de-duplication-related work

Cons
- Limited accuracy: based on the likelihood of similarity rather than explicit content matching of the documents
- The paper does not provide evaluation results compared against other algorithms
- Despite the compression techniques, the space cost remains an open question
- Content-based de-duplication can only run after the web pages have been downloaded, so it does not reduce the bandwidth wasted in crawling

Comment
This technique is good: it provides an efficient way of using simhash to solve de-duplication for a very large volume of data. Though not the first paper to target a large body of web pages, it does report actual real-world query sizes.

References
Paolo Ferragina, Roberto Grossi, Ankur Gupta, Rahul Shah, Jeffrey Scott Vitter. On searching compressed string collections cache-obliviously. In Proceedings of the 27th ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, June 9-12, 2008, Vancouver, Canada.
Hsin-Tsang Lee, Derek Leonard, Xiaoming Wang, Dmitri Loguinov. IRLbot: Scaling to 6 billion pages and beyond. ACM Transactions on the Web (TWEB), v.3 n.3, p.1-34, June 2009.
Rui Cai, Jiang-Ming Yang, Wei Lai, Yida Wang, Lei Zhang. iRobot: An intelligent crawler for web forums. In Proceedings of the 17th International Conference on World Wide Web, April 21-25, 2008, Beijing, China.
Anirban Dasgupta, Ravi Kumar, Amit Sasturkar. De-duping URLs via rewrite rules. In Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, August 24-27, 2008, Las Vegas, Nevada, USA.
Lian'en Huang, Lei Wang, Xiaoming Li. Achieving both high precision and high recall in near-duplicate detection. In Proceedings of the 17th ACM Conference on Information and Knowledge Management, October 26-30, 2008, Napa Valley, California, USA.
Edith Cohen, Haim Kaplan. Leveraging discarded samples for tighter estimation of multiple-set aggregates. In Proceedings of the 11th International Joint Conference on Measurement and Modeling of Computer Systems, June 15-19, 2009, Seattle, WA, USA.
Amit Agarwal, Hema Swetha Koppula, Krishna P. Leela, Krishna Prasad Chitrapura, Sachin Garg, Pavan Kumar GM, Chittaranjan Haty, Anirban Roy, Amit Sasturkar. URL normalization for de-duplication of web pages. In Proceedings of the 18th ACM Conference on Information and Knowledge Management, November 2-6, 2009, Hong Kong, China.
Hema Swetha Koppula, Krishna P. Leela, Amit Agarwal, Krishna Prasad Chitrapura, Sachin Garg, Amit Sasturkar. Learning URL patterns for webpage de-duplication. In Proceedings of the 3rd ACM International Conference on Web Search and Data Mining, February 4-6, 2010, New York, New York, USA.
M. R. Henzinger. Finding near-duplicate web pages: a large-scale evaluation of algorithms. In SIGIR 2006.