1 Wen-Hsiang Lu ( 盧文祥 ) Department of Computer Science and Information Engineering, National Cheng Kung University 2007/10/02 Lecture 3: Crawling the Web (Chap 2, Chakrabarti)

Crawling the Web Web pages — a few thousand characters long — served over the Internet using the HyperText Transfer Protocol (HTTP) — viewed at the client end using browsers Crawler — fetches the pages to a computer — at that computer, automatic programs can then analyze the hypertext documents

3 HTML HyperText Markup Language lets the author — specify layout and typeface — embed diagrams — create hyperlinks, expressed as an anchor tag with an HREF attribute. HREF names another page using a Uniform Resource Locator (URL) — URL = protocol field (“http”) + server hostname + file path (rooted at /, the ‘root’ of the server's published file system).

4 HTTP (HyperText Transfer Protocol) Built on top of the Transmission Control Protocol (TCP) Steps (from the client end) — resolve the server host name to an Internet address (IP) using the Domain Name Service (DNS), a distributed database of name-to-IP mappings maintained at a set of known servers — contact the server using TCP: connect to the default HTTP port (80) on the server — send the HTTP request header (e.g., GET) — fetch the response header, which follows MIME (Multipurpose Internet Mail Extensions), a meta-data standard for e-mail and Web content transfer — fetch the HTML page
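
The fetch steps above can be made concrete with a short sketch. The following is a minimal illustration in Python, assuming only the standard library; example.com and the 4 kB read size are placeholders, not values from the slides.

```python
# Minimal sketch of the fetch steps above (assumption: Python standard
# library; example.com is a placeholder host, not from the slides).
import socket

host, path = "example.com", "/"

# 1. Resolve the server host name to an IP address via DNS.
ip = socket.gethostbyname(host)

# 2. Contact the server over TCP on the default HTTP port (80).
sock = socket.create_connection((ip, 80), timeout=10)

# 3. Send the HTTP request header (here a GET request).
request = f"GET {path} HTTP/1.0\r\nHost: {host}\r\n\r\n"
sock.sendall(request.encode("ascii"))

# 4. Fetch the response header and body (the HTML page).
chunks = []
while True:
    data = sock.recv(4096)
    if not data:
        break
    chunks.append(data)
sock.close()

response = b"".join(chunks)
header, _, body = response.partition(b"\r\n\r\n")
print(header.decode("iso-8859-1").splitlines()[0])  # status line, e.g. HTTP/1.0 200 OK
```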

5 Crawl “all” Web pages? Problem: no catalog of all accessible URLs on the Web. Solution: — start from a given set of URLs — progressively fetch and scan them for new out-linking URLs — fetch these pages in turn — submit the text in each page to a text indexing system — and so on…
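
A minimal sketch of this fetch-scan-enqueue loop, assuming Python's standard library; the seed list, the max_pages limit and the index_text stub are illustrative placeholders rather than part of the original design.

```python
# Start from seed URLs, fetch each page, hand its text to an indexer (stub),
# scan it for out-links, and enqueue URLs that have not been seen before.
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen

class LinkParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.links = []
    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def index_text(url, page):
    print(f"indexed {url} ({len(page)} characters)")   # stand-in for a text indexer

def crawl(seeds, max_pages=10):
    frontier, seen, fetched = deque(seeds), set(seeds), 0
    while frontier and fetched < max_pages:
        url = frontier.popleft()
        try:
            page = urlopen(url, timeout=10).read().decode("utf-8", "replace")
        except OSError:
            continue                                    # unreachable page: skip it
        fetched += 1
        index_text(url, page)
        parser = LinkParser()
        parser.feed(page)
        for link in parser.links:                       # scan for new out-linking URLs
            absolute = urljoin(url, link)
            if absolute.startswith("http") and absolute not in seen:
                seen.add(absolute)
                frontier.append(absolute)

# crawl(["https://example.com/"])
```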

6 Crawling procedure Conceptually simple — but a great deal of engineering goes into industry-strength crawlers — industry crawlers crawl a substantial fraction of the Web — e.g., AltaVista, Northern Light, Inktomi No guarantee that all accessible Web pages will be located in this fashion The crawler may never halt — pages will be added continually even as it is running.

7 Crawling overheads Delays involved in — Resolving the host name in the URL to an IP address using DNS — Connecting a socket to the server and sending the request — Receiving the requested page in response Solution: Overlap the above delays by — fetching many pages at the same time

8 Anatomy of a crawler Page-fetching threads — start with DNS resolution — finish when the entire page has been fetched Each page — stored in compressed form to disk/tape — scanned for outlinks Work pool of outlinks — maintains network utilization without overloading it — dealt with by a load manager Continues until the crawler has collected a sufficient number of pages.

9 Typical anatomy of a large-scale crawler

10 Large-scale crawlers: performance and reliability considerations Need to fetch many pages at the same time — utilize the network bandwidth — a single page fetch may involve several seconds of network latency Highly concurrent and parallelized DNS lookups Use of asynchronous sockets — explicit encoding of the state of a fetch context in a data structure — polling sockets to check for completion of network transfers — multi-processing or multi-threading: impractical at this scale Care in URL extraction — eliminating duplicates to reduce redundant fetches — avoiding “spider traps” (e.g., fake URLs)

11 DNS caching, pre-fetching and resolution A customized DNS component with: 1. Custom client for address resolution 2. Caching server 3. Prefetching client

12 Custom client for address resolution Tailored for concurrent handling of multiple outstanding requests Allows issuing of many resolution requests together — polling at a later time for completion of individual requests Facilitates load distribution among many DNS servers.

13 Caching server With a large cache, persistent across DNS restarts Residing largely in memory if possible DNS resolution – UNIX gethostbyname: cannot handle concurrent requests – Mercator: reduced the share of time spent on DNS from 87% to 25% – ADNS: an asynchronous DNS client library

14 Prefetching client Steps 1. Parse a page that has just been fetched 2. Extract host names from HREF targets 3. Make DNS resolution requests to the caching server Usually implemented using UDP — User Datagram Protocol — connectionless, packet-based communication protocol — does not guarantee packet delivery Does not wait for resolution to be completed.
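
A simplified sketch of the caching and prefetching idea. A real crawler issues asynchronous UDP queries (e.g., via ADNS); here a small thread pool and a dictionary cache stand in for the custom client and caching server, and prefetch_dns / lookup are invented names used only for illustration.

```python
import re
import socket
from concurrent.futures import ThreadPoolExecutor
from urllib.parse import urlparse

_dns_cache = {}                               # host name -> IP address (the "caching server")
_resolver = ThreadPoolExecutor(max_workers=8) # stand-in for the asynchronous DNS client

def _resolve(host):
    try:
        _dns_cache[host] = socket.gethostbyname(host)
    except OSError:
        pass                                  # leave the placeholder; lookup() falls back

def prefetch_dns(page_html):
    """Parse a just-fetched page, extract host names from HREF targets, and
    issue resolution requests without waiting for them to complete."""
    for href in re.findall(r'href=["\']([^"\']+)["\']', page_html, re.IGNORECASE):
        host = urlparse(href).hostname
        if host and host not in _dns_cache:
            _dns_cache[host] = None           # placeholder so each host is requested once
            _resolver.submit(_resolve, host)

def lookup(host):
    """Used by the fetcher later; usually a cache hit, else a blocking lookup."""
    return _dns_cache.get(host) or socket.gethostbyname(host)
```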

15 Multiple concurrent fetches Managing multiple concurrent connections — a single download may take several seconds — open many socket connections to different HTTP servers simultaneously Multi-CPU machines not useful — crawling performance limited by network and disk Two approaches 1. using multi-threading 2. using non-blocking sockets with event handlers

16 Multi-threading Logical threads — physical threads of control provided by the operating system (e.g., pthreads), OR — concurrent processes A fixed number of threads is allocated in advance Programming paradigm — create a client socket — connect the socket to the HTTP service on a server — send the HTTP request header — read the socket (recv) until no more characters are available — close the socket Uses blocking system calls
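
A sketch of this paradigm: a fixed number of threads, each using blocking calls to fetch URLs from a shared work pool. It assumes Python threads and http.client in place of pthreads and raw sockets; NUM_THREADS, work_pool and store_page are illustrative names.

```python
import http.client
import queue
import threading
from urllib.parse import urlparse

NUM_THREADS = 8
work_pool = queue.Queue()                    # shared pool of URLs to fetch

def store_page(url, body):
    print(f"{url}: {len(body)} bytes")

def fetch_worker():
    while True:
        url = work_pool.get()
        if url is None:                      # sentinel: no more work
            break
        parts = urlparse(url)
        try:
            conn = http.client.HTTPConnection(parts.hostname, parts.port or 80, timeout=10)
            conn.request("GET", parts.path or "/")   # send the HTTP request header
            body = conn.getresponse().read()         # blocking read until the page is complete
            conn.close()                             # close the socket
            store_page(url, body)
        except OSError:
            pass
        finally:
            work_pool.task_done()

threads = [threading.Thread(target=fetch_worker, daemon=True) for _ in range(NUM_THREADS)]
for t in threads:
    t.start()

# work_pool.put("http://example.com/"); work_pool.join()
```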

17 Multi-threading: Problems Performance penalty — mutual exclusion for concurrent access to shared data structures Slow disk seeks — a great deal of interleaved, random input-output on disk — due to concurrent modification of the document repository by multiple threads

18 Non-blocking sockets and event handlers Non-blocking sockets — a connect, send or recv call returns immediately without waiting for the network operation to complete — the status of the network operation is polled separately The “select” system call — lets the application suspend until more data can be read from or written to a socket — times out after a pre-specified deadline — monitors several sockets at the same time More efficient memory management — code that completes the processing of one page is not interrupted by other completions — no need for locks and semaphores on the pool — only complete pages are appended to the log
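
A sketch of the event-driven approach: the state of each fetch lives in a plain dictionary and select() polls all sockets at once. It assumes HTTP/1.0 over port 80 with minimal error handling, and fetch_all is an invented helper, not part of the book's design.

```python
import select
import socket
from urllib.parse import urlparse

def fetch_all(urls, timeout=30):
    pending = {}                                            # socket -> explicit fetch context
    for url in urls:
        parts = urlparse(url)
        s = socket.socket()
        s.setblocking(False)
        s.connect_ex((parts.hostname, parts.port or 80))    # returns immediately
        request = f"GET {parts.path or '/'} HTTP/1.0\r\nHost: {parts.hostname}\r\n\r\n"
        pending[s] = {"url": url, "out": request.encode("ascii"), "data": b""}

    results = {}
    while pending:
        writable = [s for s, c in pending.items() if c["out"]]
        readable = [s for s, c in pending.items() if not c["out"]]
        r, w, _ = select.select(readable, writable, [], timeout)
        if not r and not w:
            break                                           # timed out after the deadline
        for s in w:                                         # connected: send (part of) the request
            ctx = pending[s]
            try:
                sent = s.send(ctx["out"])
                ctx["out"] = ctx["out"][sent:]
            except OSError:                                 # e.g. connection refused
                s.close()
                del pending[s]
        for s in r:                                         # data available: read without blocking
            ctx = pending[s]
            try:
                chunk = s.recv(4096)
            except OSError:
                chunk = b""
            if chunk:
                ctx["data"] += chunk
            else:                                           # server closed: the page is complete
                results[ctx["url"]] = ctx["data"]
                s.close()
                del pending[s]
    return results

# pages = fetch_all(["http://example.com/"])
```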

19 Link extraction and normalization Goal: obtain a canonical form of each URL URL processing and filtering — avoid multiple fetches of pages known by different URLs The mapping between host names and IP addresses is many-to-many — one host name, many IP addresses: for load balancing on large sites, or mirrored contents / contents on the same file system — many host names, one IP address (“proxy pass”): an organization needs to publish many logical sites Relative URLs need to be interpreted w.r.t. a base URL.

20 Canonical URL Formed by — using a standard string for the protocol — canonicalizing the host name — adding an explicit port number — normalizing and cleaning up the path, e.g., /books/../papers/index.html => /papers/index.html
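
A possible canonicalization routine along these lines, assuming that lower-casing the scheme and host, making the default port explicit, and normpath-style path cleanup cover the cases listed above.

```python
import posixpath
from urllib.parse import urlsplit, urlunsplit

DEFAULT_PORTS = {"http": 80, "https": 443}

def canonical_url(url):
    parts = urlsplit(url)
    scheme = parts.scheme.lower()                          # standard string for the protocol
    host = (parts.hostname or "").lower()                  # canonicalize the host name
    port = parts.port or DEFAULT_PORTS.get(scheme, 80)     # explicit port number
    path = posixpath.normpath(parts.path or "/")           # clean up '.' and '..' segments
    if not path.startswith("/"):
        path = "/" + path
    return urlunsplit((scheme, f"{host}:{port}", path, parts.query, ""))  # drop any fragment

print(canonical_url("HTTP://WWW.Example.COM/books/../papers/index.html"))
# -> http://www.example.com:80/papers/index.html
```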

21 Robot exclusion Check whether the server prohibits crawling a normalized URL — the robots.txt file in the HTTP root directory of the server specifies a list of path prefixes which crawlers should not attempt to fetch — meant for crawlers only — rules can be given per User-agent specification
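
A sketch of the robot-exclusion check using Python's standard urllib.robotparser; the per-host cache and the "MyCrawler" user-agent string are assumptions for illustration.

```python
from urllib.parse import urlsplit, urlunsplit
from urllib.robotparser import RobotFileParser

_robots = {}                                 # host -> parsed robots.txt

def allowed(url, user_agent="MyCrawler"):
    parts = urlsplit(url)
    if parts.netloc not in _robots:
        robots_url = urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))
        rp = RobotFileParser(robots_url)
        try:
            rp.read()                        # fetch robots.txt from the server's root directory
        except OSError:
            rp.allow_all = True              # policy choice here: allow if robots.txt is unreachable
        _robots[parts.netloc] = rp
    return _robots[parts.netloc].can_fetch(user_agent, url)

# if allowed("http://example.com/private/page.html"): fetch it, else skip it
```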

22 Eliminating already-visited URLs Checking if a URL has already been fetched — before adding a new URL to the work pool — needs to be very quick — achieved by computing an MD5 (Message-Digest algorithm 5) hash of the URL, e.g., MD5("vote1234") = 8339e38c61175dbd07846ad70dc226b2 Exploiting spatio-temporal locality of access: a two-level hash function — most significant bits (say, 24) derived by hashing the host name plus port — lower-order bits (say, 40) derived by hashing the path The concatenated bits are used as a key in a B-tree — qualifying URLs are added to the frontier of the crawl — their hash values are added to the B-tree.
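
A sketch of this two-level key, using the 24-bit and 40-bit widths from the slide as illustrative values; the in-memory set stands in for the B-tree. Because the host bits are most significant, URLs from the same server share a key prefix and cluster together, which is what gives the B-tree its locality of access.

```python
import hashlib
from urllib.parse import urlsplit

def md5_bits(text, bits):
    digest = hashlib.md5(text.encode("utf-8")).hexdigest()
    return int(digest, 16) & ((1 << bits) - 1)        # keep the low-order `bits` bits

def url_key(url):
    parts = urlsplit(url)
    host_part = md5_bits(f"{parts.hostname}:{parts.port or 80}", 24)   # host name plus port
    path_part = md5_bits(parts.path or "/", 40)                        # path
    return (host_part << 40) | path_part              # host bits are most significant

seen = set()                                          # stand-in for the B-tree of visited URLs

def already_visited(url):
    key = url_key(url)
    if key in seen:
        return True
    seen.add(key)                                     # qualifying URL: record it and crawl it
    return False
```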

23 Spider traps Protecting from crashing on — ill-formed HTML, e.g., a page with 68 kB of null characters — misleading sites: an indefinite number of pages dynamically generated by CGI scripts, or paths of arbitrary depth created using soft directory links and path-remapping features in the HTTP server

24 Spider traps: Solutions No automatic technique can be foolproof Check for URL length Guards — prepare regular crawl statistics — add dominating sites to a guard module — disable crawling of active content such as CGI form queries — eliminate URLs with non-textual data types
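
One possible guard of this kind is sketched below; the length limit, depth limit and extension list are illustrative assumptions, not values from the book.

```python
from urllib.parse import urlsplit

MAX_URL_LENGTH = 256
MAX_PATH_DEPTH = 12
NON_TEXT_EXTENSIONS = (".jpg", ".png", ".gif", ".zip", ".exe", ".pdf")

def looks_safe(url):
    if len(url) > MAX_URL_LENGTH:                          # check for URL length
        return False
    parts = urlsplit(url)
    if parts.path.count("/") > MAX_PATH_DEPTH:             # arbitrarily deep soft-link paths
        return False
    if parts.query or "cgi-bin" in parts.path:             # active content / CGI form queries
        return False
    if parts.path.lower().endswith(NON_TEXT_EXTENSIONS):   # non-textual data types
        return False
    return True
```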

25 Avoiding repeated expansion of links on duplicate pages Reduce redundancy in crawls Duplicate detection — mirrored Web pages and sites Detecting exact duplicates — checking against MD5 digests of stored pages — representing a relative link v (relative to aliases u1 and u2) as tuples (h(u1), v) and (h(u2), v) Detecting near-duplicates — even a single altered character will completely change the digest! E.g., date of update, name and e-mail of the site administrator — Solution: shingling
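
A minimal shingling sketch: pages are compared by the overlap of their sets of word w-grams, so a changed date or administrator name perturbs only a few shingles instead of the whole digest. The choice w = 4 and the plain Jaccard ratio are assumptions; production systems sample or sketch the shingle sets.

```python
import hashlib

def shingles(text, w=4):
    """Return the set of hashed, overlapping w-word shingles of a page."""
    words = text.split()
    return {
        hashlib.md5(" ".join(words[i:i + w]).encode("utf-8")).hexdigest()
        for i in range(max(1, len(words) - w + 1))
    }

def resemblance(text_a, text_b, w=4):
    a, b = shingles(text_a, w), shingles(text_b, w)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)          # Jaccard similarity of the shingle sets

# resemblance(page1_text, page2_text) close to 1.0 => near-duplicate pages
```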

26 Load monitor Keeps track of various system statistics — Recent performance of the wide area network (WAN) connection E.g.: latency and bandwidth estimates. — Operator-provided/estimated upper bound on open sockets for a crawler — Current number of active sockets.

27 Thread manager Responsible for  Choosing units of work from frontier  Scheduling issue of network resources  Distribution of these requests over multiple ISPs if appropriate. Uses statistics from load monitor

28 Per-server work queues HTTP servers protect against denial of service (DoS) attacks by limiting the speed or frequency of responses to any fixed client IP address To avoid being treated as a DoS attacker  limit the number of active requests to a given server IP address at any time  maintain a queue of requests for each server  use the HTTP/1.1 persistent socket capability  distribute attention relatively evenly between a large number of sites Access locality vs. politeness dilemma
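
A sketch of per-server queues with a politeness delay; MIN_DELAY and the round-robin pick_next helper are illustrative choices, not the book's exact policy.

```python
import time
from collections import defaultdict, deque

MIN_DELAY = 2.0                              # seconds between hits on the same server (assumed)

queues = defaultdict(deque)                  # host -> queue of pending URLs for that server
next_allowed = defaultdict(float)            # host -> earliest time we may contact it again

def enqueue(host, url):
    queues[host].append(url)

def pick_next():
    """Choose a URL whose server is currently polite to contact, spreading
    attention over many sites rather than draining one queue at a time."""
    now = time.time()
    for host, q in queues.items():
        if q and now >= next_allowed[host]:
            next_allowed[host] = now + MIN_DELAY
            return q.popleft()
    return None                              # every eligible server is in its cool-down period
```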

29 Text repository The crawler’s last task  dumping fetched pages into a repository  decoupling the crawler from other functions is preferred, for efficiency and reliability Page-related information stored in two parts  meta-data  page contents

30 Storage of page-related information Meta-data  relational in nature  usually managed by custom software to avoid relational database system overheads  the text index involves bulk updates  includes fields like content-type, last-modified date, content-length, HTTP status code, etc.

31 Page contents storage A typical HTML Web page compresses to 2-4 kB (using zlib) File systems have a 4-8 kB file block size  too large to spend one block per page Page storage managed by a custom storage manager  simple access methods for  the crawler to add pages  subsequent programs (the indexer, etc.) to retrieve documents
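
A toy storage manager along these lines: compressed pages are appended to a single log file as length-prefixed records, so no page occupies a file block of its own, and the indexer can scan the log sequentially. The record format and function names are assumptions for illustration.

```python
import struct
import zlib

def append_page(log_path, url, html):
    """Crawler side: zlib-compress a page and append it to the log file."""
    payload = zlib.compress(html.encode("utf-8"))
    url_bytes = url.encode("utf-8")
    with open(log_path, "ab") as log:
        # record = [url length][payload length][url][compressed page]
        log.write(struct.pack("!II", len(url_bytes), len(payload)))
        log.write(url_bytes)
        log.write(payload)

def scan_pages(log_path):
    """Indexer side: yield (url, html) records in sequential order."""
    with open(log_path, "rb") as log:
        while True:
            header = log.read(8)
            if len(header) < 8:
                break
            url_len, payload_len = struct.unpack("!II", header)
            url = log.read(url_len).decode("utf-8")
            html = zlib.decompress(log.read(payload_len)).decode("utf-8")
            yield url, html

# append_page("repository.log", "http://example.com/", "<html>...</html>")
```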

32 Page Storage Small-scale systems  Repository fitting within the disks of a single machine  Use of a storage manager (e.g., Berkeley DB)  manages disk-based databases within a single file  configured as a hash table or B-tree if pages are accessed with the URL as key  configured as a sequential log of page records for ordered, sequential access, since the indexer can handle pages in any order

33 Page Storage Large-scale systems  Repository distributed over a number of storage servers  Storage servers  connected to the crawler through a fast local network (e.g., gigabit Ethernet)  pages hashed by URL onto the storage servers  ‘T3’-grade leased lines needed on the wide-area side to handle 10 million pages (40 GB) per hour

34 Large-scale crawlers often use multiple ISPs and a bank of local storage servers to store the pages crawled.

35 Refreshing crawled pages The search engine's index should be fresh A Web-scale crawler never ‘completes’ its job High variance in the rate of page changes The “If-Modified-Since” request header of the HTTP protocol  impractical for a crawler to use on every page Solution  at the commencement of a new crawling round, estimate which pages have changed

36 Determining page changes The “Expires” HTTP response header  for pages that come with an expiry date Otherwise the crawler needs to guess whether revisiting the page will yield a modified version  maintain a score reflecting the probability that the page has been modified  the crawler fetches URLs in decreasing order of score  assumption: the recent past predicts the future
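
For illustration, this is what a conditional revisit looks like at the HTTP level, using If-Modified-Since and reading any Expires header; as the previous slide notes, doing this per page is impractical at Web scale, so the sketch only shows the mechanics. The revisit helper and its arguments are assumptions, not part of the book's design.

```python
import http.client
from urllib.parse import urlsplit

def revisit(url, last_crawled_http_date):
    """Ask the server whether the page changed since the last crawl; also
    return any Expires header as a hint for scheduling the next visit."""
    parts = urlsplit(url)
    conn = http.client.HTTPConnection(parts.hostname, parts.port or 80, timeout=10)
    conn.request("GET", parts.path or "/",
                 headers={"If-Modified-Since": last_crawled_http_date})
    resp = conn.getresponse()
    expires = resp.getheader("Expires")          # present only for pages with an expiry date
    if resp.status == 304:                       # Not Modified: keep the stored copy
        conn.close()
        return None, expires
    body = resp.read()                           # page changed: re-fetch and re-index it
    conn.close()
    return body, expires

# body, expires = revisit("http://example.com/", "Tue, 02 Oct 2007 00:00:00 GMT")
```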

37 Estimating page change rates Brewington and Cybenko, and Cho  algorithms for maintaining a crawl in which most pages are fresher than a specified epoch Prerequisite  the average interval at which the crawler checks for changes is smaller than the inter-modification time of a page Small-scale intermediate crawler runs  to monitor fast-changing sites  e.g., current news, weather, etc.  intermediate indices patched into the master index

38 Putting together a crawler  Reference implementation of the HTTP client protocol  from the World Wide Web Consortium (W3C)  the w3c-libwww package

39 Design of the core components: the Crawler class Purpose: to copy bytes from network sockets to storage media Three methods express the Crawler's contract with its user  pushing a URL to be fetched to the Crawler (fetchPush)  a termination callback handler (fetchDone), called with the same URL  a method (start) which starts the Crawler's event loop Implementation of the Crawler class  needs two helper classes called DNS and Fetch
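
The book's reference implementation is an event-driven C++ class built on w3c-libwww; the following single-threaded Python stand-in only illustrates the fetchPush / fetchDone / start contract, not the real architecture.

```python
from collections import deque
from urllib.request import urlopen

class Crawler:
    def __init__(self):
        self._frontier = deque()

    def fetchPush(self, url):
        """Push a URL to be fetched to the Crawler."""
        self._frontier.append(url)

    def fetchDone(self, url, page, error=None):
        """Termination callback, called with the same URL; the user overrides this."""
        raise NotImplementedError

    def start(self):
        """Start the Crawler's event loop: copy bytes from the network toward storage."""
        while self._frontier:
            url = self._frontier.popleft()
            try:
                page = urlopen(url, timeout=10).read()
                self.fetchDone(url, page)
            except OSError as exc:
                self.fetchDone(url, b"", error=exc)

class PrintingCrawler(Crawler):
    def fetchDone(self, url, page, error=None):
        print(url, "failed:" if error else f"{len(page)} bytes", error or "")

# c = PrintingCrawler(); c.fetchPush("http://example.com/"); c.start()
```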

44 Crawler at WKD Lab. of Academia Sinica

45 Parameters

46 Initial URLs

47 Crawling

48 Storing pages

49 Criteria for the Project Crawler Performance — Speed — Scalability Interesting/useful design strategies — Prefetching — Threading — Compressing — Data management — Others Algorithms: MD5, depth-first search/breadth-first search Configuration: flexible parameter setting