Web Crawling Fall 2011 Dr. Lillian N. Cassel

Overview of the class
Purpose: Course Description – How do they do that? Many web applications, from Google to travel sites to resource collections, present results found by crawling the Web to find specific materials of interest to the application theme. Crawling the Web involves technical issues, politeness conventions, characterization of materials, decisions about the breadth and depth of a search, and choices about what to present and how to display results. This course will explore all of these issues. In addition, we will address what happens after you crawl the web and acquire a collection of pages. You will decide on the questions, but some possibilities might include these: What summer jobs are advertised on web sites in your favorite area? What courses are offered in most (or few) computer science departments? What theatres are showing what movies? Students will develop a web site built by crawling at least some part of the web to find appropriate materials, categorize them, and display them effectively.
Prerequisites: some programming experience: CSC 1051 or the equivalent.

Tonight’s class
– Review the syllabus and general plan
– Get some information about student experience
– Look at some background about the web and information handling
Note: We will touch on several topics that we will explore in greater depth later in the semester.

Overview – 2
Note:
– Students will develop a web site built by crawling at least some part of the web to find appropriate materials, categorize them, and display them effectively. You will need to choose a project, or at least a general area of interest, soon.
– Prerequisites: some programming experience: CSC 1051 or the equivalent. I need to know what programming experience you have.

Overview – 3
Web presence
– Syllabus.html
– Blackboard site: we will use it for the things it is good for
– Keep in touch with both the web site and the Blackboard site
– Also, be sure to watch your email. Any announcements or general information will go there – to the address on the class list.

The Web
Beginnings – Information Webs
– Vannevar Bush, As We May Think
– Hypertext: Ted Nelson and others
The World Wide Web
– Origins and Structure: a directed graph, with the possibility of cycles
– Characteristics

The Web Document Collection
– No design/coordination
– Distributed content creation, linking, democratization of publishing
– Content includes truth, lies, obsolete information, contradictions …
– Unstructured (text, html, …), semi-structured (XML, annotated photos), structured (databases) …
– Scale much larger than previous text collections … but corporate records are catching up
– Growth – slowed from the initial “volume doubling every few months” but still expanding
– Content can be dynamically generated

Web Operation
Basic client-server model
– The HTTP protocol (HyperText Transfer Protocol): a few simple commands that allow communication between the server and an application requesting something from the server – usually a browser, but not always.
– Server: the site where the content resides. Most of the web is served up by Apache and its derivatives.
– Client: the program requesting something from the server. Browsers most often, but also web crawlers and other applications.

HTTP: GET and POST
GET <path> HTTP/1.1
– Requests that the server send the page at <path> back to the requestor.
– The version number allows compatible communication.
– Server sends a header and the requested file (page).
– Additional requests can follow.
POST
– Similar to a GET, but allows additional information to be sent to the server.
– Useful for purchases or page edits.
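A minimal sketch of this exchange using Python's standard library; the host example.com and the path /index.html are only illustrative, not part of the course material.

```python
# Issue a GET request and read the response, using http.client.
import http.client

conn = http.client.HTTPConnection("example.com", 80)
conn.request("GET", "/index.html")          # sends: GET /index.html HTTP/1.1
response = conn.getresponse()               # server sends header, then body

print(response.status, response.reason)     # e.g., 200 OK
print(response.getheader("Content-Type"))   # one of the header fields
body = response.read()                      # the requested file (page)
conn.close()
```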

HEAD
HEAD <path> HTTP/1.1
Useful for checking whether a previously fetched web page has changed. The request returns header information, but not the page itself.
Response:
– Confirms HTTP version compatibility
– Date:
– Server:
– Last-Modified:
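A sketch of the change-checking idea, again with an illustrative host and path: issue a HEAD request and inspect the headers named above.

```python
# HEAD returns the header fields only, so it is a cheap way to see
# whether a previously fetched page has changed.
import http.client

conn = http.client.HTTPConnection("example.com", 80)
conn.request("HEAD", "/index.html")     # sends: HEAD /index.html HTTP/1.1
response = conn.getresponse()
response.read()                          # HEAD carries no body

print("Date:         ", response.getheader("Date"))
print("Server:       ", response.getheader("Server"))
print("Last-Modified:", response.getheader("Last-Modified"))
conn.close()
```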

Full set of HTTP commands (HTTP/1.1 methods)
– GET
– HEAD
– POST
– PUT
– DELETE
– OPTIONS
– TRACE
– CONNECT

Search
Search engines, whether general-purpose engines like Google or Yahoo or more specialized ones, do not crawl the web looking for results after receiving a query.
– That would take much too long and provide unacceptable performance.
Search engines actually search a carefully constructed database, with indices created for efficiently locating content.

Architecture of a Search Engine (figure)
Ref: Manning, Introduction to Information Retrieval

Crawling in Context So, we see that crawling is just one step in a complex process of acquiring information from the Web to use in any application. Usually, we will want to sort through the information we found to get the most relevant part for our use. So, the example of a search engine is relevant.

Our goals We will crawl the Web. We will also store the results of the crawl so that we can retrieve what we want when we want it. We will learn about fundamental concepts of information retrieval, including index creation.

Crawling
A web crawler (aka a spider) is a program that:
– Starts with one or more URLs – the seed. Other URLs will be found in the pages pointed to by the seed URLs; they become the starting point for further crawling.
– Uses the standard protocols for requesting a resource from a server, respecting server policies and politeness conventions.
– Parses the resource obtained, extracting additional URLs from the fetched page.
– Implements policies about duplicate content.
– Recognizes and eliminates duplicate or unwanted URLs.
– Adds found URLs to the queue and continues from the request-to-server step. (A minimal sketch of this loop appears below.)
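A single-threaded sketch of that loop, assuming only Python's standard library. It is illustrative only: it omits the politeness delays, robots.txt checks, and error handling a real crawler needs, and uses a crude regular expression where a real crawler would use an HTML parser.

```python
from collections import deque
from urllib.parse import urljoin
from urllib.request import urlopen
import re

def crawl(seeds, max_pages=100):
    frontier = deque(seeds)      # URLs still to be fetched
    seen = set(seeds)            # duplicate URL elimination
    fetched = 0
    while frontier and fetched < max_pages:
        url = frontier.popleft()
        try:
            page = urlopen(url, timeout=10).read().decode("utf-8", "replace")
        except OSError:
            continue             # skip unreachable resources
        fetched += 1
        # crude link extraction; see the parsing slides later
        for href in re.findall(r'href="([^"]+)"', page):
            link = urljoin(url, href)        # normalize relative links
            if link.startswith("http") and link not in seen:
                seen.add(link)
                frontier.append(link)
    return seen
```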

The Depth of the Web A URL gives access to a web page. That page may have links to other pages. Some pages are generated only when information is provided through a form. These pages cannot be discovered just by crawling. The surface web is huge. The deeper web is unfathomable.

Anatomy of a URL
http://www.csc.villanova.edu/~cassel is a web page. Three parts:
– http – the protocol to use for retrieving the page
– www.csc.villanova.edu – the name of the domain; csc is a subdomain of the villanova domain
– ~cassel – abbreviation for index.html in the home directory of user cassel on the machine associated with that domain name

The major domain categories
Generic categories:
– .net – originally restricted to major participants in maintaining the Internet; now open
– .org – generally non-profit organizations, including professional organizations such as acm.org
– .com – commercial organizations such as amazon.com, etc.
– .edu – restricted to higher education (post-secondary) institutions; high schools and elementary schools are not allowed to use it
– .gov – government organizations, such as nsf.gov
– .mil – military sites
– .int – international organizations
Country codes:
– .us
– .uk – uses second-level domains such as ac.uk or co.uk
– And other country designations. Who is .tv?
Newer ones: .biz, .name, etc.
All regulated by the Internet Assigned Numbers Authority (IANA).

If not HTTP, then what?
Other protocols can be specified in the request to a server:
– file:// a local file on the current host
– ftp:// use the FTP protocol to fetch the file
– etc.

Domain categories
The domain categories serve to partition the universe of domain names. Domain Name Servers (DNS) do lookups to translate a domain name to an IP address. An IP address locates a particular machine and makes a communication path known.
– Most common still: 32-bit IPv4 addresses
– Newer: 128-bit IPv6 addresses
A server will typically have many programs running, several listening for network connections.
– A port number (16 bits) identifies the specific process for the desired connection.
– Default port for web connections: 80
– If other than 80, it must be specified in the URL
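A small sketch of both steps – name lookup, then connecting to a port – using Python's socket module; example.com is illustrative.

```python
# Resolve a domain name to an IP address, as DNS does for the crawler.
import socket

ip = socket.gethostbyname("example.com")   # IPv4 lookup; a blocking call
print(ip)

# A port number selects the listening process on that machine. 80 is
# the default for web connections; any other port must appear in the
# URL, e.g. http://example.com:8080/
sock = socket.create_connection((ip, 80), timeout=10)
sock.close()
```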

Crawler features
A crawler must be:
– Robust: survive spider traps – websites that fool a spider into fetching large or limitless numbers of pages within the domain. Some are deliberate; some are errors in site design.
– Polite: crawlers can interfere with the normal operation of a web site. Servers have policies, both implicit and explicit, about the allowed frequency of visits by crawlers. Responsible crawlers obey these; others become recognized and rejected outright.
Ref: Manning, Introduction to Information Retrieval

Crawler features
A crawler should be:
– Distributed: able to execute on multiple systems
– Scalable: the architecture should allow additional machines to be added as needed
– Efficient: performance is a significant issue when crawling a large web
– Useful: quality standards should determine which pages to fetch
– Fresh: keep the results up to date by crawling pages repeatedly on some organized schedule
– Extensible: a modular, well-crafted architecture allows the crawler to expand to handle new formats, protocols, etc.
Ref: Manning, Introduction to Information Retrieval

Scale
A one-month crawl of a billion pages requires fetching several hundred pages per second. It is easy to lose sight of the numbers when dealing with data sources on the scale of the Web.
– 30 days * 24 hours/day * 60 minutes/hour * 60 seconds/minute = 2,592,000 seconds
– 1,000,000,000 pages / 2,592,000 seconds ≈ 386 pages/second
Note that those numbers assume that the crawling is continuous.
Ref: Manning, Introduction to Information Retrieval

Google Search
See Marissa Mayer of Google on how a search happens at Google.

Basic Crawl Architecture (diagram): WWW → Fetch → Parse → Content seen? → URL filter → Dup URL elim, with supporting components for DNS resolution, document fingerprints (Doc FP’s), robots filters, the URL set, and the URL Frontier feeding back into Fetch.
Ref: Manning, Introduction to Information Retrieval

Crawler Architecture
Modules:
– The URL frontier (the queue of URLs still to be fetched, or to be fetched again)
– A DNS resolution module (the translation from a URL to a web server to talk to)
– A fetch module (uses HTTP to retrieve the page)
– A parsing module to extract text and links from the page
– A duplicate elimination module to recognize links already seen
Ref: Manning, Introduction to Information Retrieval

Crawling threads
With so much space to explore and so many pages to process, a crawler will often consist of many threads, each of which cycles through the same set of steps we just saw. There may be multiple threads on one processor, or threads may be distributed over many nodes in a distributed system.
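A sketch of the single-machine case, assuming Python's standard threading and queue modules; the worker count and the seed URL are illustrative, and parsing is omitted.

```python
# Several worker threads cycle through the same fetch step, sharing
# one frontier queue. queue.Queue handles the locking for us.
import threading, queue
from urllib.request import urlopen

frontier = queue.Queue()

def worker():
    while True:
        url = frontier.get()
        try:
            urlopen(url, timeout=10).read()   # fetch; parsing omitted
        except OSError:
            pass
        finally:
            frontier.task_done()

for _ in range(8):                            # eight crawling threads
    threading.Thread(target=worker, daemon=True).start()

frontier.put("http://example.com/")           # illustrative seed
frontier.join()                               # wait for the queue to drain
```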

Politeness
Not optional.
Explicit
– Specified by the web site owner
– What portions of the site may be crawled and what portions may not: the robots.txt file
Implicit
– If no restrictions are specified, still limit how often you hit a single site.
– You may have many URLs from the same site. Too much traffic can interfere with the site’s operation: crawler hits are much faster than ordinary traffic and could overtax the server (constituting a denial-of-service attack). Good web crawlers do not fetch multiple pages from the same server at one time.

Robots.txt
Protocol nearly as old as the web.
– File: URL/robots.txt contains the access restrictions
– Example:
User-agent: *             (all robots/spiders/crawlers)
Disallow: /yoursite/temp/
User-agent: searchengine  (the robot named searchengine only)
Disallow:                 (nothing disallowed)

Processing robots.txt
First line:
– User-agent – identifies to whom the instruction applies; * = everyone, otherwise a specific crawler name
– Disallow: or Allow: provides a path to exclude from or include in robot access
Once the robots.txt file is fetched from a site, it does not have to be fetched every time you return to the site.
– Refetching just takes time and uses up hits on the server
– Cache the robots.txt file for repeated reference (a sketch follows)
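A sketch of both ideas – honoring robots.txt and caching it per site – using Python's standard urllib.robotparser; the cache dictionary and function names are mine, not from the slides.

```python
# Fetch and parse robots.txt once per host, then reuse the parsed copy.
from urllib import robotparser
from urllib.parse import urlsplit

_robots_cache = {}                       # host -> parsed robots.txt

def allowed(url, agent="*"):
    host = urlsplit(url).netloc
    if host not in _robots_cache:
        rp = robotparser.RobotFileParser()
        rp.set_url(f"http://{host}/robots.txt")
        rp.read()                        # one fetch; cached thereafter
        _robots_cache[host] = rp
    return _robots_cache[host].can_fetch(agent, url)
```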

Robots tag
robots.txt provides information about access to a directory. A given file may have an HTML meta tag that directs robot behavior. A responsible crawler will check for that tag and obey its direction.
Ex: <meta name="robots" content="noindex,nofollow">
OPTIONS: INDEX, NOINDEX, FOLLOW, NOFOLLOW
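A sketch of checking for that tag with Python's standard HTML parser; the class name is mine and the tag shown is the example above.

```python
# Collect the directives from a page's robots meta tag, if any.
from html.parser import HTMLParser

class RobotsMetaParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.directives = []
    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "meta" and a.get("name", "").lower() == "robots":
            self.directives = a.get("content", "").upper().split(",")

parser = RobotsMetaParser()
parser.feed('<meta name="robots" content="noindex,nofollow">')
print(parser.directives)    # ['NOINDEX', 'NOFOLLOW']
```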

Crawling
– Pick a URL from the frontier (which one to pick is a policy decision)
– Fetch the document at the URL
– Parse the document: extract links from it to other docs (URLs)
– Check if the document's content has already been seen; if not, add it to the indices
– For each extracted URL:
– Ensure it passes certain URL filter tests (e.g., only crawl .edu, obey robots.txt, etc.)
– Check if it is already in the frontier (duplicate URL elimination)
Ref: Manning, Introduction to Information Retrieval

Basic Crawl Architecture (diagram repeated from earlier).
Ref: Manning, Introduction to Information Retrieval

DNS – Domain Name Server
– Internet service to resolve URLs into IP addresses
– Distributed servers; some significant latency possible
– OS implementations: DNS lookup is blocking – only one outstanding request at a time
Solutions
– DNS caching
– Batch DNS resolver – collects requests and sends them out together
Ref: Manning, Introduction to Information Retrieval
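A minimal sketch of the caching solution, wrapping the blocking OS resolver used earlier; the cache dictionary is an illustrative in-process cache, not a full resolver.

```python
# Cache lookups so repeated URLs on the same host resolve only once.
import socket

_dns_cache = {}                 # hostname -> IP address

def resolve(host):
    if host not in _dns_cache:
        _dns_cache[host] = socket.gethostbyname(host)   # blocking call
    return _dns_cache[host]
```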

Parsing
A fetched page contains
– Embedded links to more pages
– Actual content for use in the application
Extract the links
– Relative link? Expand (normalize)
– Seen before? Discard
– New and meets criteria? Append to the URL frontier. Does not meet criteria? Discard.
Examine the content. (A link-extraction sketch follows.)
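A sketch of link extraction with a real HTML parser rather than the regular expression used in the earlier crawler sketch; relative links are expanded against the page's own URL. The class name and example URL are illustrative.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkExtractor(HTMLParser):
    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []
    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    # expand (normalize) relative links
                    self.links.append(urljoin(self.base_url, value))

extractor = LinkExtractor("http://example.com/a/page.html")
extractor.feed('<a href="../other.html">link</a>')
print(extractor.links)   # ['http://example.com/other.html']
```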

Content
Seen before?
– How to tell? Fingerprints, shingles
– Documents may be identical, or merely similar
– If already in the index, do not process it again
Ref: Manning, Introduction to Information Retrieval
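A sketch of both tests named above, under the assumption of simple whitespace tokenization: an exact fingerprint (hash of the whole document) catches identical pages, while the overlap between word shingles scores near-duplicates.

```python
import hashlib

def fingerprint(text):
    # identical documents share a fingerprint
    return hashlib.md5(text.encode("utf-8")).hexdigest()

def shingles(text, k=4):
    # every run of k consecutive words in the document
    words = text.split()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}

def jaccard(a, b):
    # fraction of shingles the two documents share
    return len(a & b) / len(a | b) if a | b else 0.0

s1 = shingles("the quick brown fox jumps")
s2 = shingles("a quick brown fox jumps")
print(jaccard(s1, s2))   # 1 shared shingle of 3 total -> 0.333...
```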

Distributed crawler
For big crawls:
– Many processes, each doing part of the job; possibly on different nodes, geographically distributed
– How to distribute: give each node a set of hosts to crawl; use a hashing function to partition the set of hosts
– How do these nodes communicate? They need to share a common index.
Ref: Manning, Introduction to Information Retrieval
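A sketch of the hash-partitioning idea: each host maps to exactly one node, so every node crawls a disjoint set of hosts and extracted URLs can be routed to the responsible node. The cluster size is illustrative; a stable hash (here MD5) is used rather than Python's built-in hash() so the assignment agrees across processes.

```python
import hashlib
from urllib.parse import urlsplit

NUM_NODES = 4   # illustrative cluster size

def node_for(url):
    host = urlsplit(url).netloc
    digest = hashlib.md5(host.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % NUM_NODES

print(node_for("http://example.com/page.html"))  # always the same node
```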

Communication between nodes (diagram): the single-node architecture (WWW, Fetch, DNS, Parse, Content seen?, URL filter, Dup URL elim, Doc FP’s, URL set, URL Frontier, robots filters) plus a host splitter with links to and from the other hosts. The output of the URL filter at each node is sent to the Duplicate URL Eliminator at all nodes.
Ref: Manning, Introduction to Information Retrieval

URL Frontier
Two requirements
– Politeness: do not go too often to the same site
– Freshness: keep pages up to date (news sites, for example, change frequently)
Conflict
– The two requirements may be directly in conflict with each other.
Complication
– Fetching the URLs embedded in a page will yield many URLs located on the same server. Delay fetching those. (A politeness-aware frontier sketch follows.)
Ref: Manning, Introduction to Information Retrieval
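A minimal sketch of the politeness side of the frontier, assuming one queue per host and a fixed per-host delay; the class and its two-second default are illustrative, and freshness scheduling is omitted.

```python
import time
from collections import defaultdict, deque
from urllib.parse import urlsplit

class Frontier:
    def __init__(self, delay=2.0):
        self.delay = delay                    # seconds between hits on a host
        self.queues = defaultdict(deque)      # host -> pending URLs
        self.next_ok = {}                     # host -> earliest next fetch time

    def add(self, url):
        self.queues[urlsplit(url).netloc].append(url)

    def pop(self):
        # return a URL from some host that is not cooling down
        now = time.monotonic()
        for host, q in self.queues.items():
            if q and self.next_ok.get(host, 0.0) <= now:
                self.next_ok[host] = now + self.delay
                return q.popleft()
        return None                           # every host is cooling down
```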

More …
We will examine these things more completely. What will you actually do?
Goals
– Write a simple crawler: not distributed, not multi-threaded. Use a seed URL, connect with the server, fetch the document, extract links, extract content.
– Explore existing crawlers: evaluate their characteristics; learn to use one to do serious crawling.
– Process the documents fetched to serve some purpose. Create a web site for that purpose.
Ref: Manning, Introduction to Information Retrieval

Processing the documents Create an index and store the documents and the index so that appropriate content can be found when needed. Learn the fundamentals of information retrieval as they apply to web services
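A tiny sketch of the index idea, as a preview of the information-retrieval fundamentals mentioned above: an inverted index maps each term to the set of documents containing it, and this is the structure a search engine queries instead of the live web. The sample documents are illustrative.

```python
from collections import defaultdict

def build_index(docs):                # docs: {doc_id: text}
    index = defaultdict(set)          # term -> set of doc ids
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index

index = build_index({1: "web crawling basics", 2: "web indexing"})
print(sorted(index["web"]))           # [1, 2] -- documents containing "web"
```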

Next week
– A simple web crawler
– Web crawl visualization
– Criteria for judging web crawlers
– Some web crawlers to compare