Finding Replicated Web Collections


Finding Replicated Web Collections
Authors: Junghoo Cho, Narayanan Shivakumar, Hector Garcia-Molina
Paper presentation by Radhika Malladi and Vijay Reddy Mara

Outline: Introduction; Similarity Measures (similarity in web pages, similarity in link structure); Similar Clusters (computing, implementing); Quality of Clusters; Exploiting Clusters (improving crawling, improving search engine results).

Introduction: Why is content replicated across the web? Replicated collections can span hundreds or even thousands of pages. What counts as mirrored web pages?

Mirrored pages.

What will we learn from this paper? Similarity measures: how to compute similarity measures for collections of web pages. Improved crawling: how to use replication information to crawl more efficiently. Less clutter in search results: how to use replication information to suppress duplicate results.

By automatically identifying mirrored collections, we can improve: Crawling, which fetches web pages; Ranking, where pages can be ranked by their replication factor; Archiving, which stores a subset of the web; Caching, which holds frequently accessed web pages.

Difficulties in detecting replicated collections: Update frequency: mirror copies may not be updated as regularly as the primary. Partial coverage: a mirror may hold only part of the primary collection. Different formats: the same content may appear in different formats. Partial crawls: the crawler's snapshots of the two copies may be taken at different times and therefore differ.

Similarity Measures. Definition (Web Graph): a graph G = (V, E) with a node v_i for each page p_i and a directed edge from v_i to v_j if there is a hyperlink from p_i to p_j. Definition (Collection): a group of web pages, i.e., an induced subgraph of the web graph. Collection size: the number of pages in the collection.
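The web-graph definition above can be sketched as a small adjacency structure. This is a minimal illustration; the class and method names are assumptions, not from the paper.

```python
class WebGraph:
    """Directed graph G = (V, E): one node v_i per page p_i, and an
    edge v_i -> v_j whenever page p_i hyperlinks to page p_j."""

    def __init__(self):
        self.edges = {}  # page -> set of pages it links to

    def add_link(self, src, dst):
        # Record the directed edge and make sure both nodes exist.
        self.edges.setdefault(src, set()).add(dst)
        self.edges.setdefault(dst, set())

    def collection_size(self, collection):
        # Collection size = number of pages in the collection.
        return len(collection)

g = WebGraph()
g.add_link("a.html", "b.html")
g.add_link("b.html", "c.html")
```

A collection here is just a subset of the graph's pages together with the links among them.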

Definition (Identical Collections): Equisized collections C1 and C2 are identical if there is a one-to-one mapping M that maps C1 pages to C2 pages such that: Identical pages: for each page p ∈ C1, p is identical to M(p). Identical link structure: for each link in C1 from page a to page b, there is a link from M(a) to M(b) in C2.
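The two conditions of the definition can be checked directly given a candidate mapping. In this hedged sketch, `links1`/`links2` are link maps for C1 and C2, `content` maps pages to their text, and `M` is the candidate one-to-one mapping; all names are illustrative assumptions.

```python
def identical_collections(links1, links2, content, M):
    # Equisized, one-to-one: M must cover all of C1 and all of C2.
    if set(M) != set(links1) or set(M.values()) != set(links2):
        return False
    # Identical pages: each page p in C1 equals its image M(p).
    if any(content[p] != content[M[p]] for p in M):
        return False
    # Identical link structure: a link a -> b in C1 implies a link
    # M(a) -> M(b) in C2.
    return all(M[b] in links2[M[a]]
               for a, targets in links1.items() for b in targets)

links1 = {"a": {"b"}, "b": set()}
links2 = {"x": {"y"}, "y": set()}
content = {"a": "home", "b": "news", "x": "home", "y": "news"}
ok = identical_collections(links1, links2, content, {"a": "x", "b": "y"})  # True
```

In practice one would search for such a mapping rather than be handed one, which is what makes the problem hard at web scale.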

Similarity of link structure (illustrated with figures on the slides): collection sizes, the one-to-one mapping between pages, break points in that mapping, and the resulting link similarity.

Definition (Similar Collections): Equisized collections C1 and C2 are similar if there is a one-to-one mapping M that maps all C1 pages to all C2 pages such that: Similar pages: for each page p ∈ C1, p is similar to M(p). Similar links: two corresponding pages should have at least one pair of corresponding parents in their collections that are also similar pages.
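The relaxed definition can be sketched the same way as the identical-collections test, with exact equality replaced by a page-similarity predicate (e.g., fingerprint overlap). The function and parameter names are assumptions for illustration.

```python
def parents(links, page):
    # Pages in the same collection that link to `page`.
    return {a for a, targets in links.items() if page in targets}

def similar_collections(links1, links2, M, similar_page):
    # Equisized, one-to-one: M must cover all of C1 and all of C2.
    if set(M) != set(links1) or set(M.values()) != set(links2):
        return False
    # Similar pages: each p in C1 must be similar to its image M(p).
    if not all(similar_page(p, M[p]) for p in M):
        return False
    # Similar links: for each p with parents in C1, at least one parent a
    # must map to a parent M(a) of M(p) in C2.
    for p in M:
        ps = parents(links1, p)
        if ps and not any(M[a] in parents(links2, M[p]) for a in ps):
            return False
    return True

links1 = {"a": {"b"}, "b": set()}
links2 = {"x": {"y"}, "y": set()}
same = lambda p, q: True  # stand-in for a real page-similarity test
```

Relaxing "every link is preserved" to "at least one corresponding parent" is what lets mirrors with slightly different navigation still count as similar.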

Similar collections

Similar Clusters: Computing. Example 1 of similar clusters: cluster size (cardinality) 2, collection size 5.

Computing. Example 2 of similar clusters: cluster size (cardinality) 3, collection size 3.

Cluster Algorithm Example: Step 1: Find all trivial clusters

Step 2: Merge trivial clusters when the merge yields a similar cluster.

Step 3: Outcome
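The two steps above can be sketched as follows. This is a hedged illustration of the idea, not the paper's exact algorithm: Step 1 forms trivial clusters (one group per set of mutually similar pages, each page a single-page collection); Step 2 merges two trivial clusters when every replica exhibits the same link between them. Pages at the same index are assumed to belong to the same replica.

```python
def trivial_clusters(pages, similar):
    """Step 1: greedily partition pages into groups of mutually similar pages."""
    clusters = []
    for p in pages:
        for c in clusters:
            if all(similar(p, q) for q in c):
                c.append(p)
                break
        else:
            clusters.append([p])
    return clusters

def can_merge(c1, c2, links):
    """Step 2 test: merging is allowed only if, in every replica,
    the c1 page links to the corresponding c2 page."""
    return (len(c1) == len(c2) and
            all(b in links.get(a, set()) for a, b in zip(c1, c2)))

pages = ["m1/index", "m2/index", "m1/faq", "m2/faq"]
similar = lambda p, q: p.split("/")[1] == q.split("/")[1]  # same filename
links = {"m1/index": {"m1/faq"}, "m2/index": {"m2/faq"}}
cs = trivial_clusters(pages, similar)
```

Here the two trivial clusters (the index pages and the FAQ pages) merge into one cluster of two two-page collections, matching the slide's Step 3 outcome.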

Another example of cluster algorithm with two possible clusters.

Cont…

Quality of Clusters

Concept of fingerprints: an entire-document fingerprint, a four-line fingerprint, and a two-line fingerprint.
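The three granularities can be sketched with one chunk-hashing function: hashing the whole document gives one fingerprint, while hashing every four-line or two-line window gives many, so partially edited mirrors still share most fingerprints. The hash choice and function names here are illustrative assumptions.

```python
import hashlib

def fingerprints(text, chunk_lines):
    """Hash every sliding window of `chunk_lines` lines; a chunk size
    at least as large as the document yields one whole-document print."""
    lines = text.splitlines()
    out = set()
    for i in range(max(len(lines) - chunk_lines + 1, 1)):
        chunk = "\n".join(lines[i:i + chunk_lines])
        out.add(hashlib.md5(chunk.encode()).hexdigest()[:8])
    return out

def overlap(a, b):
    # Jaccard overlap of two fingerprint sets.
    return len(a & b) / max(len(a | b), 1)

doc1 = "alpha\nbeta\ngamma\ndelta\nepsilon"
doc2 = "alpha\nbeta\ngamma\ndelta\nzeta"  # one line changed
score = overlap(fingerprints(doc1, 2), fingerprints(doc2, 2))
```

With the whole-document fingerprint the two documents would look entirely different; with two-line fingerprints they still share most of their prints.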

Exploiting Clusters. Improving crawling: once a cluster of replicated collections is identified, the crawler only needs to fetch one copy of each collection.
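One way to use that information can be sketched as frontier pruning. The `replica_of` map here is an assumed precomputed output of clustering (replica URL prefix to primary prefix); the URLs and function names are hypothetical.

```python
replica_of = {
    "http://mirror.example.org/docs/": "http://primary.example.com/docs/",
}

def canonical(url):
    # Rewrite a replica URL to its primary-collection equivalent.
    for prefix, primary in replica_of.items():
        if url.startswith(prefix):
            return primary + url[len(prefix):]
    return url

def prune_frontier(frontier):
    # Keep only the first URL seen for each canonical page.
    seen, keep = set(), []
    for url in frontier:
        c = canonical(url)
        if c not in seen:
            seen.add(c)
            keep.append(url)
    return keep

frontier = [
    "http://primary.example.com/docs/a.html",
    "http://mirror.example.org/docs/a.html",   # replica of the above
    "http://mirror.example.org/docs/b.html",   # no primary copy queued yet
]
pruned = prune_frontier(frontier)
```

The replica of an already-queued page is dropped, while a mirror page whose primary copy is not queued is still fetched (from the mirror).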

Improving search engine results: reduces clutter, demonstrated with a prototype. For each result, the prototype shows links to the result's 'Collections' and 'Replicas'.
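The clutter reduction can be sketched as grouping hits by their replicated cluster and showing one representative per cluster with its replicas attached, roughly like the prototype's 'Replicas' link. `cluster_of` is an assumed precomputed lookup; all names are illustrative.

```python
def dedup_results(results, cluster_of):
    # Group result URLs by the cluster they belong to.
    grouped = {}
    for url in results:
        grouped.setdefault(cluster_of(url), []).append(url)
    # Show the first hit of each cluster; attach the rest as replicas.
    return [{"url": urls[0], "replicas": urls[1:]}
            for urls in grouped.values()]

cluster = {"p/doc": 1, "m/doc": 1, "q/other": 2}
hits = ["p/doc", "q/other", "m/doc"]
deduped = dedup_results(hits, cluster.get)
```

Instead of three entries the user sees two, with the mirrored copy reachable from its primary's entry.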

Conclusion

Discussion question: Is it more important for the cluster size (cardinality) to be large, or the collection size?

Discussion question: Suppose p is similar to p′, and p″ is similar to p′, but p and p″ are not similar. Do you think all three pages should be considered similar? (In other words, is page similarity transitive?)