Parallel Crawlers Efficient URL Caching for World Wide Web Crawling Presenter Sawood Alam AND.

Slides:



Advertisements
Similar presentations
Chapter 11 – Virtual Memory Management
Advertisements

A Survey of Web Cache Replacement Strategies Stefan Podlipnig, Laszlo Boszormenyl University Klagenfurt ACM Computing Surveys, December 2003 Presenter:
1 Cache and Caching David Sands CS 147 Spring 08 Dr. Sin-Min Lee.
Center for E-Business Technology Seoul National University Seoul, Korea Socially Filtered Web Search: An approach using social bookmarking tags to personalize.
Caching and Virtual Memory. Main Points Cache concept – Hardware vs. software caches When caches work and when they don’t – Spatial/temporal locality.
What should you Cache? A Global Analysis on YouTube Related Video Caching Dilip Kumar Krishnappa, Michael Zink and Carsten Griwodz NOSSDAV 2013.
CSCI 572’s Class Project Measuring the performance of parallel crawlers in different modes Huy Pham PhD – Computer Science Spring 2011.
CpSc 881: Information Retrieval. 2 How hard can crawling be?  Web search engines must crawl their documents.  Getting the content of the documents is.
Web Crawlers Nutch. Agenda What are web crawlers Main policies in crawling Nutch Nutch architecture.
CS246: Page Selection. Junghoo "John" Cho (UCLA Computer Science) 2 Page Selection Infinite # of pages on the Web – E.g., infinite pages from a calendar.
Beneficial Caching in Mobile Ad Hoc Networks Bin Tang, Samir Das, Himanshu Gupta Computer Science Department Stony Brook University.
Synchronizing a Database To Improve Freshness Junghoo Cho Hector Garcia-Molina Stanford University.
1 How to Crawl the Web Looksmart.com12/13/2002 Junghoo “John” Cho UCLA.
Introduction to Information Retrieval Introduction to Information Retrieval Hinrich Schütze and Christina Lioma Lecture 20: Crawling 1.
1 Crawling the Web Discovery and Maintenance of Large-Scale Web Data Junghoo Cho Stanford University.
Web Caching Robert Grimm New York University. Before We Get Started  Interoperability testing  Type theory 101.
Chapter 11 – Virtual Memory Management Outline 11.1 Introduction 11.2Locality 11.3Demand Paging 11.4Anticipatory Paging 11.5Page Replacement 11.6Page Replacement.
Parallel Crawlers Junghoo Cho, Hector Garcia-Molina Stanford University Presented By: Raffi Margaliot Ori Elkin.
Web Caching Robert Grimm New York University. Before We Get Started  Illustrating Results  Type Theory 101.
Caching And Prefetching For Web Content Distribution Presented By:- Harpreet Singh Sidong Zeng ECE Fall 2007.
Web Caching Schemes For The Internet – cont. By Jia Wang.
1 Crawling the Web Discovery and Maintenance of Large-Scale Web Data Junghoo Cho Stanford University.
How to Crawl the Web Junghoo Cho Hector Garcia-Molina Stanford University.
The Evolution of the Web and Implications for an Incremental Crawler Junghoo Cho Stanford University.
distributed web crawlers1 Implementation All following experiments were conducted with 40M web pages downloaded with Stanford’s webBase crawler in Dec.
Web Search – Summer Term 2006 V. Web Search - Page Repository (c) Wolfgang Hürst, Albert-Ludwigs-University.
By Huang et al., SOSP 2013 An Analysis of Facebook Photo Caching Presented by Phuong Nguyen Some animations and figures are borrowed from the original.
Search engine structure Web Crawler Page archive Page Analizer Control Query resolver ? Ranker text Structure auxiliary Indexer.
Caching and Virtual Memory. Main Points Cache concept – Hardware vs. software caches When caches work and when they don’t – Spatial/temporal locality.
PrasadL16Crawling1 Crawling and Web Indexes Adapted from Lectures by Prabhakar Raghavan (Yahoo, Stanford) and Christopher Manning (Stanford)
Crawlers and Spiders The Web Web crawler Indexer Search User Indexes Query Engine 1.
CSE 451: Operating Systems Section 10 Project 3 wrap-up, final exam review.
Crawling Slides adapted from
Segment-Based Proxy Caching of Multimedia Streams Authors: Kun-Lung Wu, Philip S. Yu, and Joel L. Wolf IBM T.J. Watson Research Center Proceedings of The.
« Performance of Compressed Inverted List Caching in Search Engines » Proceedings of the International World Wide Web Conference Commitee, Beijing 2008)
SCrawler Group: Priyanshu Gupta WHAT WILL I DO?? I will develop a multi-threaded parallel crawler.I will run them in both cross-over and Exchange mode.
DFTL: A flash translation layer employing demand-based selective caching of page-level address mappings A. gupta, Y. Kim, B. Urgaonkar, Penn State ASPLOS.
An Effective Disk Caching Algorithm in Data Grid Why Disk Caching in Data Grids?  It takes a long latency (up to several minutes) to load data from a.
The Anatomy of a Large-Scale Hyper textual Web Search Engine S. Brin, L. Page Presenter :- Abhishek Taneja.
Parallel Crawlers Junghoo Cho (UCLA) Hector Garcia-Molina (Stanford) May 2002 Ke Gong 1.
An IP Address Based Caching Scheme for Peer-to-Peer Networks Ronaldo Alves Ferreira Joint work with Ananth Grama and Suresh Jagannathan Department of Computer.
CS 347Notes101 CS 347 Parallel and Distributed Data Processing Distributed Information Retrieval Hector Garcia-Molina Zoltan Gyongyi.
PROP: A Scalable and Reliable P2P Assisted Proxy Streaming System Computer Science Department College of William and Mary Lei Guo, Songqing Chen, and Xiaodong.
Practical LFU implementation for Web Caching George KarakostasTelcordia Dimitrios N. Serpanos University of Patras.
1 MSRBot Web Crawler Dennis Fetterly Microsoft Research Silicon Valley Lab © Microsoft Corporation.
WebBase: Building a Web Warehouse Hector Garcia-Molina Stanford University Work with: Sergey Brin, Junghoo Cho, Taher Haveliwala, Jun Hirai, Glen Jeh,
Crawling Paolo Ferragina Dipartimento di Informatica Università di Pisa Reading 20.1, 20.2 and 20.3.
How to Crawl the Web Hector Garcia-Molina Stanford University Joint work with Junghoo Cho.
Introduction to Information Retrieval Introduction to Information Retrieval CS276 Information Retrieval and Web Search Pandu Nayak and Prabhakar Raghavan.
1 Crawling Slides adapted from – Information Retrieval and Web Search, Stanford University, Christopher Manning and Prabhakar Raghavan.
CSCI 572’s Class Project Measuring the performance of parallel crawlers in different modes Huy Pham PhD – Computer Science Spring 2011.
1 CS 430: Information Discovery Lecture 17 Web Crawlers.
Running Commodity Operating Systems on Scalable Multiprocessors Edouard Bugnion, Scott Devine and Mendel Rosenblum Presentation by Mark Smith.
Allan Heydon and Mark Najork --Sumeet Takalkar. Inspiration of Mercator What is a Mercator Crawling Algorithm and its Functional Components Architecture.
Search Engine and Optimization 1. Introduction to Web Search Engines 2.
Adaptive Configuration of a Web Caching Hierarchy Pranav A. Desai Jaspal Subhlok Presented by: Pranav A. Desai.
Jan 27, Digital Preservation Seminar1 Effective Page Refresh Policies for Web Crawlers Written By: Junghoo Cho & Hector Garcia-Molina Presenter:
Design and Implementation of a High- Performance Distributed Web Crawler Vladislav Shkapenyuk, Torsten Suel 실시간 연구실 문인철
1 Efficient Crawling Through URL Ordering Junghoo Cho Hector Garcia-Molina Lawrence Page Stanford InfoLab.
Prof. Paolo Ferragina, Algoritmi per "Information Retrieval"
UbiCrawler: a scalable fully distributed Web crawler
How to Crawl the Web Peking University 12/24/2003 Junghoo “John” Cho
Prof. Paolo Ferragina, Algoritmi per "Information Retrieval"
Cache Memory Presentation I
An Analysis of Facebook photo Caching
The Anatomy of a Large-Scale Hypertextual Web Search Engine
Kalyan Boggavarapu Lehigh University
Junghoo “John” Cho UCLA
Prof. Paolo Ferragina, Algoritmi per "Information Retrieval"
Anwar Alhenshiri.
Presentation transcript:

Parallel Crawlers Efficient URL Caching for World Wide Web Crawling Presenter Sawood Alam AND

Parallel Crawlers Hector Garcia-Molina Stanford University Junghoo Cho University of California

ABSTRACT Design an effective and scalable parallel crawler Propose multiple architectures for a parallel crawler Identify fundamental issues related to parallel crawling Metrics to evaluate a parallel crawler Compare the proposed architectures using 40 million pages

Challenges for parallel crawlers Overlap Quality Communication bandwidth

Advantages Scalability Network-load dispersion Network-load reduction Compression Difference Summarization

Related work General architecture Page selection Page update

Geographical categorization Intra-site parallel crawler Distributed crawler

Communication Independent Dynamic assignment Static assignment

Crawling modes (Static) Firewall mode Cross-over mode Exchange mode

URL exchange minimization Batch communication Replication

Partitioning function URL-hash based Site-hash based Hierarchical

Evaluation models Overlap Coverage Quality Communication overhead

Firewall mode and coverage

Cross-over mode and overlap

Exchange mode and communication

Quality and batch communication

Conclusion Firewall mode is good if processes <= 4 URL exchange poses network overhead < 1% Quality is maintained even in the batch communication Replicating 10,000 to 100,000 popular URLs can reduce 40% communication overhead

Efficient URL Caching for World Wide Web Crawling Andrei Z. Broder IBM TJ Watson Research Center Janet L. Wiener Hewlett Packard Labs Marc Najork Microsoft Research

Introduction Fetch a page Parse it to extract all linked URLs For all the URLs not seen before, repeat the process

Challenges The web is very large (coverage) doubling every 9-12 months Web pages are changing rapidly (freshness) all changes (40% weekly) changes by a third or more (7% weekly)

Crawlers IA crawler Original Google crawler Mercator web crawler Cho and Garcia-Molina’s crawler WebFountain UbiCrawler Shkapenyuk and Suel’s crawler

Caching Analogous to OS cache Non-uniformity of requests Temporal correlation or locality of reference

Caching algorithms Infinite cache (INFINITE) Clairvoyant caching (MIN) Least recently used (LRU) CLOCK Random replacement (RANDOM) Static caching (STATIC)

Experimental setup

URL Streams full trace cross sub-trace

Result plots

Results LRU & CLOCK performed equally well but slightly worse than MIN except for critical region (for both traces) RANDOM is slightly inferior to CLOCK and LRU, while STATIC is generally much worse Concludes considerable locality of reference in the traces For very large cache STATIC is better than MIN (excluding initial k misses) STATIC is relatively better for cross trace Lack of deep links, often pointing to home pages. Intersection between the most popular URLs and the cross trace tends to be larger

Critical region Miss rate for all efficient algorithms is constant (~70%) in k = 2^14 - 2^18 Above k = 2^18 miss rate drops abruptly to ~20%

Cache Implementation

Conclusions and future directions 1,800 simulations over billion URLs resulted cache of 50,000 entries gives 80% hit rate Cache size of 100 ~ 500 entries per thread is recommended CLOCK or RANDOM implementation using scatter table with circular chain is recommended To what order graph traversal method affects caching? Global cache or per thread cache is better?

THANKS