1 Crawling The Web
2 Motivation
By crawling the Web, data is retrieved from the Web and stored in local repositories
Most common examples: search engines, web archives, spammer applications
The idea: use links between pages to traverse the Web
Since the Web is dynamic, updates should be done continuously (or frequently)
3 Stored Data
Pages
-Search engines, archives
Specific files
-Pictures, docs, …
E-mail addresses
-Spammers
4 Crawling: Basic Algorithm
[Flowchart. Steps: Init; Get next URL; Get page; Extract data; Extract links. Stores: initial seeds, to-visit URLs, visited URLs, database. Source: WWW.]
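The loop on this slide can be sketched roughly as follows. This is a minimal illustration, not a production crawler: the in-memory `TOY_WEB` dictionary stands in for real HTTP fetching and link extraction, and the names `crawl`, `get_links`, and `TOY_WEB` are hypothetical.

```python
from collections import deque

# Hypothetical in-memory "web": each URL maps to the URLs it links to.
# A real crawler would fetch the page over HTTP and parse its links.
TOY_WEB = {
    "a": ["b", "c"],
    "b": ["c", "d"],
    "c": ["a"],
    "d": [],
}


def crawl(seeds, get_links):
    """Run the basic loop: get next URL, get page, extract data and links."""
    to_visit = deque(seeds)   # to-visit URLs, initialised with the seeds
    visited = set()           # visited URLs
    database = []             # stands in for the local repository

    while to_visit:
        url = to_visit.popleft()          # get next URL
        if url in visited:
            continue
        visited.add(url)
        links = get_links(url)            # get page and extract its links
        database.append(url)              # extract data (here: just the URL)
        for link in links:
            if link not in visited:
                to_visit.append(link)
    return database


pages = crawl(["a"], lambda url: TOY_WEB.get(url, []))
```

Swapping `get_links` for a function that downloads and parses a page would leave the loop itself unchanged.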
5 The Web as a Graph
The Web is modeled as a directed graph
-The nodes are the Web pages
-The edges are pairs (P1, P2) such that there is a link from P1 to P2
Crawling the Web is a graph traversal (search algorithm)
Can we traverse all of the Web this way?
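To make the closing question concrete, here is a small sketch that stores the graph as a list of (P1, P2) edge pairs and computes which pages a traversal can reach from a seed. The edge list and the names `EDGES` and `reachable` are made up for illustration.

```python
# Hypothetical edge list: (P1, P2) means page P1 links to page P2.
EDGES = [("p1", "p2"), ("p2", "p3"), ("p3", "p1"), ("p4", "p1")]


def reachable(edges, seeds):
    """Return the set of pages reachable from the seeds by following links."""
    adjacency = {}
    for src, dst in edges:
        adjacency.setdefault(src, []).append(dst)

    found = set(seeds)
    stack = list(seeds)
    while stack:
        page = stack.pop()
        for nxt in adjacency.get(page, []):
            if nxt not in found:
                found.add(nxt)
                stack.append(nxt)
    return found


all_pages = {page for edge in EDGES for page in edge}
missed = all_pages - reachable(EDGES, ["p1"])
```

In this toy graph `p4` links out but nothing links to it, so a traversal starting at `p1` never finds it.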
6 The Hidden Web
The hidden Web consists of
-Pages that no other page links to (how can we get to these pages?)
-Dynamic pages that are created as a result of filling in a form
7 Traversal Orders
Different traversal orders can be used:
-Breadth-First Crawlers: to-visit pages are stored in a queue
-Depth-First Crawlers: to-visit pages are stored in a stack
-Best-First Crawlers: to-visit pages are stored in a priority queue, according to some metric
-How should the traversal order be chosen?
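The three orders differ only in the container that holds the to-visit pages. The sketch below shows this on a tiny made-up graph; `GRAPH`, `SCORES`, and `traverse` are hypothetical names, and the scores are an arbitrary stand-in for whatever metric a best-first crawler would use.

```python
import heapq
from collections import deque

# Hypothetical link graph and per-page scores (higher score = visit sooner).
GRAPH = {"seed": ["x", "y"], "x": ["x1"], "y": ["y1"], "x1": [], "y1": []}
SCORES = {"seed": 0, "x": 1, "y": 5, "x1": 9, "y1": 0}


def traverse(graph, seed, order, score=None):
    """Visit pages in 'breadth', 'depth', or 'best' order and return the order."""
    seen = {seed}
    visited = []
    # heapq is a min-heap, so scores are negated to pop the highest first.
    frontier = [(-score(seed), seed)] if order == "best" else deque([seed])

    while frontier:
        if order == "breadth":
            url = frontier.popleft()           # queue: first in, first out
        elif order == "depth":
            url = frontier.pop()               # stack: last in, first out
        else:
            _, url = heapq.heappop(frontier)   # priority queue: best score first
        visited.append(url)
        for link in graph.get(url, []):
            if link in seen:
                continue
            seen.add(link)
            if order == "best":
                heapq.heappush(frontier, (-score(link), link))
            else:
                frontier.append(link)
    return visited


bfs = traverse(GRAPH, "seed", "breadth")
dfs = traverse(GRAPH, "seed", "depth")
best = traverse(GRAPH, "seed", "best", score=SCORES.get)
```

On this graph the three calls visit the same five pages in three different orders, which is the whole effect of the choice of container.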
8 Additional Characteristics
Internal depth
-“Depth” under the initial URL seeds
-Is it an absolute value?
External depth
Maximum number of pages
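One possible way to enforce such limits is to carry a depth counter alongside each URL and stop once a page budget is spent. In the sketch below, "depth" is taken to mean the number of link hops from a seed, which is an assumption about what the slide intends; `LINKS`, `bounded_crawl`, `max_depth`, and `max_pages` are hypothetical names, and the internal/external distinction is not modeled.

```python
from collections import deque

# Hypothetical chain of links: s -> a -> b -> c.
LINKS = {"s": ["a"], "a": ["b"], "b": ["c"], "c": []}


def bounded_crawl(links, seeds, max_depth, max_pages):
    """Breadth-first crawl limited by link depth and by total pages fetched."""
    queue = deque((seed, 0) for seed in seeds)
    seen = set(seeds)
    fetched = []

    while queue and len(fetched) < max_pages:
        url, depth = queue.popleft()
        fetched.append((url, depth))
        if depth == max_depth:
            continue                      # do not follow links any deeper
        for nxt in links.get(url, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, depth + 1))
    return fetched


by_depth = bounded_crawl(LINKS, ["s"], max_depth=2, max_pages=10)
by_count = bounded_crawl(LINKS, ["s"], max_depth=5, max_pages=2)
```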
9 Avoiding Cycles
To avoid visiting the same page more than once, a crawler has to keep a list of the URLs it has visited
The target of every encountered link is checked before inserting it into the to-visit list
Which data structure should be used for the visited links?
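One common answer is a hash set keyed on a normalized form of the URL, so that membership checks take roughly constant time and trivially different spellings of the same address collapse to one entry. The sketch below assumes a simple normalization (lowercase scheme and host, drop the fragment, default to the path "/"); `normalize` and `VisitedSet` are hypothetical names, and real crawlers often normalize more aggressively.

```python
from urllib.parse import urlsplit, urlunsplit


def normalize(url):
    """Reduce a URL to a canonical form before checking the visited set."""
    parts = urlsplit(url)
    path = parts.path or "/"
    # The fragment is dropped because it does not change which page is fetched.
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(), path, parts.query, ""))


class VisitedSet:
    """Hash-set of normalized URLs with a single check-and-insert operation."""

    def __init__(self):
        self._urls = set()

    def add_if_new(self, url):
        """Return True if the URL was not seen before (and record it)."""
        key = normalize(url)
        if key in self._urls:
            return False
        self._urls.add(key)
        return True


visited = VisitedSet()
first = visited.add_if_new("HTTP://Example.COM/a.html#top")
second = visited.add_if_new("http://example.com/a.html")
```

For very large crawls a set of full URL strings may not fit in memory; storing fixed-size hashes or using a Bloom filter are typical alternatives, at the cost of occasional false positives.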
10 Directing Crawlers
Sometimes people want to direct automatic crawling over their resources
-“Do not visit my files!”
-“Do not index my files!”
-“Only my crawler may visit my files!”
-“Please, follow my useful links…”
-“Please update your data after X time…”
Solution: publish instructions in some known format
Crawlers are expected to follow these instructions
11 Robots Exclusion Protocol
A method that allows Web servers to indicate which of their resources should not be visited by crawlers
Will be used in ex1
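In practice the server publishes these rules in a robots.txt file, and a polite crawler checks each URL against them before fetching. The sketch below uses Python's standard `urllib.robotparser` on an in-memory rules string, so no network access is needed; the rules, the crawler names `FriendlyBot` and `OtherBot`, and the example.com URLs are all invented for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: FriendlyBot may visit everything,
# every other crawler must stay out of /private/.
ROBOTS_TXT = """\
User-agent: FriendlyBot
Disallow:

User-agent: *
Disallow: /private/
"""

rules = RobotFileParser()
rules.parse(ROBOTS_TXT.splitlines())   # a live crawler would use set_url() and read()

private_url = "http://example.com/private/report.html"
public_url = "http://example.com/index.html"

friendly_private = rules.can_fetch("FriendlyBot", private_url)
other_private = rules.can_fetch("OtherBot", private_url)
other_public = rules.can_fetch("OtherBot", public_url)
```

The crawler would call `can_fetch` with its own user-agent name right before the "get page" step and skip any URL for which it returns False.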
12 Robots Meta Tag
A Web-page author can also publish directions for crawlers
These are expressed by the meta tag with name robots, inside the HTML file
Format:
Options:
-index (noindex): index (do not index) this file
-follow (nofollow): follow (do not follow) the links of this file
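A crawler could read these options with an ordinary HTML parser. The sketch below uses the standard `html.parser` module and assumes the usual form of the tag, a `meta` element with `name="robots"` and a comma-separated `content` attribute; the class and function names and the sample page are hypothetical, and a missing tag is treated as "index, follow".

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collect the directives of every <meta name="robots"> tag on a page."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        if (attributes.get("name") or "").lower() != "robots":
            return
        content = attributes.get("content") or ""
        for token in content.split(","):
            token = token.strip().lower()
            if token:
                self.directives.add(token)


def robots_policy(html):
    """Return (may_index, may_follow) for the given HTML text."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return ("noindex" not in parser.directives,
            "nofollow" not in parser.directives)


SAMPLE_PAGE = '<html><head><meta name="robots" content="noindex, follow"></head></html>'
may_index, may_follow = robots_policy(SAMPLE_PAGE)
```

With this policy in hand, the crawler would skip the "extract data" step when `may_index` is False and skip the "extract links" step when `may_follow` is False.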
13 Robots Meta Tag... … An Example: How should a crawler act when it visits this page?
14 Revisit Meta Tag
Web page authors may want Web applications to have an up-to-date copy of their page
Using the revisit meta tag, page authors can give crawlers some idea of how often the page is being updated
For example:
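As a rough sketch of how a crawler might use such a hint, the function below turns a value like "7 days" into a revisit interval. Both the value format and the supported units are assumptions made for illustration (the tag is not standardized, and many crawlers ignore it); `parse_revisit_after` and `UNIT_DAYS` are hypothetical names.

```python
import re
from datetime import timedelta

# Assumed units and their approximate length in days.
UNIT_DAYS = {"day": 1, "week": 7, "month": 30}


def parse_revisit_after(content):
    """Parse a value such as '7 days' into a timedelta, or None if unrecognized."""
    match = re.fullmatch(r"\s*(\d+)\s*(day|week|month)s?\s*", content.lower())
    if match is None:
        return None
    count, unit = match.groups()
    return timedelta(days=int(count) * UNIT_DAYS[unit])


weekly = parse_revisit_after("7 days")
fortnight = parse_revisit_after("2 Weeks")
unknown = parse_revisit_after("soon")
```

A crawler could add the returned interval to the time of the last fetch to decide when the page is due for another visit, and fall back to its own default schedule when the result is None.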
15 Stronger Restrictions
A (non-polite) crawler can ignore the restrictions imposed by robots.txt and by robots meta directions
Therefore, anyone who wants to ensure that automatic robots do not visit their resources has to use other mechanisms
-For example, password protection