Introduction to Information Retrieval
Hinrich Schütze and Christina Lioma
Lecture 20: Crawling

Overview
❶ A simple crawler
❷ A real crawler

Outline
❶ A simple crawler
❷ A real crawler

Basic crawler operation
• Initialize the queue with the URLs of known seed pages
• Repeat:
  • Take a URL from the queue
  • Fetch and parse the page
  • Extract URLs from the page
  • Add those URLs to the queue
• Fundamental assumption: the web is well linked.
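A minimal single-threaded sketch of this loop in Python; the regular-expression link extraction and the page limit are simplifications for illustration, not how a production crawler works:

    import re
    from collections import deque
    from urllib.parse import urljoin
    from urllib.request import urlopen

    def basic_crawl(seed_urls, max_pages=100):
        frontier = deque(seed_urls)            # initialize queue with known seed pages
        seen = set(seed_urls)
        fetched = 0
        while frontier and fetched < max_pages:
            url = frontier.popleft()           # take a URL from the queue
            try:
                html = urlopen(url).read().decode("utf-8", errors="ignore")
            except (OSError, ValueError):
                continue                       # unfetchable page: skip it
            fetched += 1
            # Extract URLs from the page and add the new ones to the queue.
            for link in re.findall(r'href="([^"]+)"', html):
                absolute = urljoin(url, link)
                if absolute not in seen:
                    seen.add(absolute)
                    frontier.append(absolute)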

Magnitude of the crawling problem
• To fetch 20,000,000,000 pages in one month, we need to fetch almost 8,000 pages per second!
• In practice many more, since many of the pages we attempt to crawl will be duplicates, unfetchable, spam, etc.
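The rate follows directly from the numbers on the slide: a 30-day month has 30 × 24 × 3,600 = 2,592,000 seconds, so 20,000,000,000 pages / 2,592,000 s ≈ 7,716 pages per second, i.e. almost 8,000.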

What a crawler must do
• Be robust: be immune to spider traps, duplicates, very large pages, very large websites, dynamic pages, etc.
• Be polite:
  • Don't hit a site too often
  • Only crawl pages you are allowed to crawl: robots.txt

Robots.txt
• Protocol for giving crawlers ("robots") limited access to a website
• Examples:
    User-agent: *
    Disallow: /yoursite/temp/

    User-agent: searchengine
    Disallow: /
• Important: cache the robots.txt file of each site we are crawling

Example of a robots.txt (nih.gov)
    User-agent: PicoSearch/1.0
    Disallow: /news/information/knight/
    Disallow: /nidcd/...
    Disallow: /news/research_matters/secure/
    Disallow: /od/ocpl/wag/

    User-agent: *
    Disallow: /news/information/knight/
    Disallow: /nidcd/...
    Disallow: /news/research_matters/secure/
    Disallow: /od/ocpl/wag/
    Disallow: /ddir/
    Disallow: /sdminutes/
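A crawler can apply such rules with, for example, Python's standard urllib.robotparser; the crawler name below is illustrative, and the expected results assume the rules shown on this slide:

    from urllib import robotparser

    rp = robotparser.RobotFileParser("https://www.nih.gov/robots.txt")
    rp.read()   # fetch and parse robots.txt once; cache rp for later checks

    # Given the rules above, /ddir/ is disallowed for every user agent
    # except PicoSearch/1.0.
    print(rp.can_fetch("MyCrawler", "https://www.nih.gov/ddir/index.html"))  # expected: False
    print(rp.can_fetch("MyCrawler", "https://www.nih.gov/about/"))           # expected: True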

What any crawler should do
• Be capable of distributed operation
• Be scalable: be able to increase the crawl rate by adding more machines and extra bandwidth
• Performance and efficiency: make efficient use of various system resources
• Fetch pages of higher quality first
• Continuous operation: get fresh versions of already crawled pages
• Extensible: crawlers should be designed to be extensible

Outline
❶ A simple crawler
❷ A real crawler

Basic crawl architecture [diagram]

URL frontier [diagram]

URL frontier
• The URL frontier is the data structure that holds and manages the URLs we have seen but not yet crawled.
• It can include multiple pages from the same host.
• It must avoid trying to fetch them all at the same time.
• It must keep all crawling threads busy.

URL normalization
• Some URLs extracted from a document are relative URLs.
• E.g., a page may contain the relative link aboutsite.html, which stands for the absolute URL obtained by resolving it against the URL of the page it was found on.
• During parsing, we must normalize (expand) all relative URLs into absolute ones.
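Normalization can be done with, for example, urllib.parse in Python; the base and relative URLs below are illustrative:

    from urllib.parse import urljoin, urldefrag

    base = "http://www.example.com/people/index.html"   # URL of the page being parsed
    relative = "aboutsite.html"                          # relative link found in that page

    absolute, _fragment = urldefrag(urljoin(base, relative))   # expand and drop any #fragment
    print(absolute)   # http://www.example.com/people/aboutsite.html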

Content seen
• For each fetched page: check whether the content is already in the index
• Check this using document fingerprints
• Skip documents whose content has already been indexed
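A minimal exact-duplicate check based on fingerprints might look as follows; the in-memory set is an illustrative stand-in for the index's persistent fingerprint store:

    import hashlib

    seen_fingerprints = set()   # stand-in for the index's fingerprint store

    def already_indexed(content: bytes) -> bool:
        # Exact-duplicate fingerprint; real crawlers also use near-duplicate
        # detection (e.g., shingling or simhash) to catch almost-identical pages.
        fingerprint = hashlib.sha1(content).hexdigest()
        if fingerprint in seen_fingerprints:
            return True
        seen_fingerprints.add(fingerprint)
        return False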

Distributing the crawler
• Run multiple crawl threads, potentially at different nodes
• The nodes are usually geographically distributed
• Partition the hosts being crawled among the nodes
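Partitioning by host can be as simple as hashing the host name; the number of nodes below is illustrative:

    import hashlib
    from urllib.parse import urlsplit

    NUM_NODES = 4   # illustrative; depends on the deployment

    def node_for_url(url: str) -> int:
        # All URLs of one host map to the same crawler node, so politeness
        # (and the host-to-back-queue table) can be handled locally at that node.
        host = urlsplit(url).netloc.lower()
        return int(hashlib.md5(host.encode()).hexdigest(), 16) % NUM_NODES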

Distributed crawler [diagram]

URL frontier: two main considerations
• Politeness: don't hit a web server too frequently
  • E.g., insert a time gap between successive requests to the same server
• Freshness: crawl some pages (e.g., news sites) more often than others
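A per-host time gap is easy to sketch; the 10-second gap below is a placeholder heuristic, not a recommended value:

    import time

    last_access = {}   # host -> time of the last request to that host
    MIN_GAP = 10.0     # seconds between successive requests to one host (placeholder)

    def wait_for_politeness(host: str) -> None:
        # Block until at least MIN_GAP seconds have passed since the last request.
        earliest = last_access.get(host, 0.0) + MIN_GAP
        now = time.time()
        if now < earliest:
            time.sleep(earliest - now)
        last_access[host] = time.time()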

Mercator URL frontier
• URLs flow in from the top into the frontier.
• Front queues manage prioritization.
• Back queues enforce politeness.
• Each queue is FIFO.

Mercator URL frontier: front queues
• The prioritizer assigns each URL an integer priority between 1 and F.
• It then appends the URL to the corresponding front queue.
• Heuristics for assigning priority: refresh rate, PageRank, etc.

Mercator URL frontier: front queues
• Selection from the front queues is initiated by the back queues.
• Pick a front queue from which to select the next URL: round robin, randomly, or a more sophisticated variant,
• ... but with a bias in favor of high-priority front queues (see the sketch below).
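A toy version of the front-queue side, covering prioritization and biased selection; the number of queues and the weighting are illustrative, and a real prioritizer would use refresh rate, PageRank and similar signals rather than random priorities:

    import random
    from collections import deque

    F = 3                                          # number of front queues (illustrative)
    front_queues = [deque() for _ in range(F)]     # queue 1 = highest priority

    def priority(url: str) -> int:
        # Placeholder heuristic; returns an integer priority between 1 and F.
        return random.randint(1, F)

    def add_to_front_queues(url: str) -> None:
        front_queues[priority(url) - 1].append(url)

    def pull_from_front_queues() -> str:
        # Biased selection: higher-priority (lower-numbered) queues are picked
        # more often, but every non-empty queue is eventually served.
        non_empty = [i for i, q in enumerate(front_queues) if q]
        weights = [F - i for i in non_empty]
        i = random.choices(non_empty, weights=weights, k=1)[0]
        return front_queues[i].popleft()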

Mercator URL frontier: back queues
• Invariant 1: each back queue is kept non-empty while the crawl is in progress.
• Invariant 2: each back queue only contains URLs from a single host.
• Maintain a table mapping hosts to back queues.

Mercator URL frontier: back queues
• In the heap:
  • One entry for each back queue
  • The entry is the earliest time t_e at which the host corresponding to the back queue can be hit again.
• The earliest time t_e is determined by (i) the last access to that host and (ii) the time gap heuristic.

Mercator URL frontier: back queues
• How the fetcher interacts with the back queues:
• Repeat: (i) extract the current root q of the heap (q is a back queue), and (ii) fetch the URL u at the head of q ...

Mercator URL frontier: back queues
• When we have emptied a back queue q:
• Repeat: (i) pull a URL u from the front queues and (ii) add u to its corresponding back queue ...
• ... until we get a u whose host does not yet have a back queue.
• Then put u in q and create a heap entry for it.
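The Python sketch below compresses the back-queue machinery of the last few slides into one place. It is a simplification under several assumptions (a single fetcher thread, a fixed politeness gap, and a pull_from_front_queues callback standing in for the biased front-queue selection sketched earlier), not Mercator's actual implementation:

    import heapq
    import time
    from collections import deque
    from urllib.parse import urlsplit

    TIME_GAP = 10.0        # politeness gap heuristic (seconds), illustrative

    back_queues = {}       # queue id -> deque of URLs (one host per queue)
    queue_host = {}        # queue id -> host currently assigned to that queue
    host_to_queue = {}     # host -> queue id
    heap = []              # entries: (earliest_hit_time, queue id)

    def host_of(url):
        return urlsplit(url).netloc.lower()

    def refill(queue_id, pull_from_front_queues):
        # The back queue ran empty: release its old host, then pull URLs from
        # the front queues until we find one whose host has no back queue yet.
        old_host = queue_host.pop(queue_id, None)
        if old_host is not None:
            host_to_queue.pop(old_host, None)
        while True:
            url = pull_from_front_queues()
            host = host_of(url)
            if host in host_to_queue:               # host already has a back queue
                back_queues[host_to_queue[host]].append(url)
            else:                                   # assign this host to queue_id
                host_to_queue[host] = queue_id
                queue_host[queue_id] = host
                back_queues[queue_id] = deque([url])
                heapq.heappush(heap, (time.time(), queue_id))
                return

    def next_url(pull_from_front_queues):
        # Fetcher side: take the heap root, wait until its host may be hit
        # again, and return the URL at the head of that back queue.
        earliest, queue_id = heapq.heappop(heap)
        time.sleep(max(0.0, earliest - time.time()))
        url = back_queues[queue_id].popleft()
        if back_queues[queue_id]:                   # host still has URLs waiting
            heapq.heappush(heap, (time.time() + TIME_GAP, queue_id))
        else:                                       # queue empty: refill it
            refill(queue_id, pull_from_front_queues)
        return url

At start-up, refill would be called once per back queue so that both invariants hold before the first fetch.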

Mercator URL frontier (summary)
• URLs flow in from the top into the frontier.
• Front queues manage prioritization.
• Back queues enforce politeness.