Crawling the WEB Representation and Management of Data on the Internet.


Motivation
By crawling the Web, data is retrieved from the Web and stored in local repositories.
Most common examples: search engines, Web archives.
The idea: use links between pages to traverse the Web.
Since the Web is dynamic, updates should be done continuously (or frequently).

Crawling: Basic Algorithm
[Flowchart. Steps: Init, Get next URL, Get page, Extract data, Extract links. Stores and sources: initial seeds, to-visit URLs, visited URLs, database, WWW]
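
The loop in the flowchart can be sketched in Python roughly as follows. This is a minimal sketch, not a production crawler: the in-memory `FAKE_WEB` dictionary, the `fetch` helper, and the example.org URLs are assumptions that stand in for real HTTP requests, and the "database" is a plain dictionary.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin

# Hypothetical stand-in for the WWW: URL -> HTML body.
FAKE_WEB = {
    "http://example.org/": '<a href="/a">A</a> <a href="/b">B</a>',
    "http://example.org/a": '<a href="/b">B</a>',
    "http://example.org/b": '<a href="/">home</a>',
}


class LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def fetch(url):
    """Hypothetical 'get page' step; a real crawler would issue an HTTP GET."""
    return FAKE_WEB.get(url)


def crawl(seeds):
    to_visit = deque(seeds)   # to-visit URLs, initialised with the seeds
    visited = set()           # visited URLs
    database = {}             # local repository: URL -> page content
    while to_visit:
        url = to_visit.popleft()          # get next URL
        if url in visited:
            continue
        visited.add(url)
        page = fetch(url)                 # get page
        if page is None:
            continue
        database[url] = page              # extract data (here: store the raw page)
        extractor = LinkExtractor()       # extract links
        extractor.feed(page)
        for href in extractor.links:
            link = urljoin(url, href)     # resolve relative links
            if link not in visited:
                to_visit.append(link)
    return database


pages = crawl(["http://example.org/"])
print(sorted(pages))
```

Starting from the single seed, the sketch ends up storing all three pages of the toy site, because each of them is reachable by following links.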

The Web as a Graph
The Web is modeled as a directed graph:
–The nodes are the Web pages
–The edges are pairs (P1, P2) such that there is a link from P1 to P2
Crawling the Web is a graph traversal (search algorithm).
Can we traverse all of the Web this way?

The Hidden Web
The hidden Web consists of:
–Pages that no other page links to (how can we get to these pages?)
–Dynamic pages that are created as a result of filling in a form, e.g., http://www.google.com/search?q=dbi
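
The first kind of hidden page can be illustrated on a toy graph. The sketch below builds a directed graph from (P1, P2) link pairs and computes which pages are reachable from a seed; the page names and the `reachable` helper are hypothetical and chosen only for illustration.

```python
from collections import defaultdict

# Hypothetical link pairs (P1, P2): there is a link from P1 to P2.
edges = [
    ("home", "news"),
    ("home", "about"),
    ("news", "about"),
    ("orphan", "home"),   # "orphan" links out, but nothing links to it
]


def reachable(seed, edges):
    """Return the set of pages reachable from seed by following links."""
    graph = defaultdict(list)
    for source, target in edges:
        graph[source].append(target)
    seen = {seed}
    stack = [seed]
    while stack:
        page = stack.pop()
        for target in graph[page]:
            if target not in seen:
                seen.add(target)
                stack.append(target)
    return seen


all_pages = {page for edge in edges for page in edge}
found = reachable("home", edges)
hidden = all_pages - found   # pages a crawler seeded at "home" never reaches
print(hidden)
```

In this toy graph a crawler seeded at "home" never finds "orphan", since edges are directed and no page points to it.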

Traversal Orders
Different traversal orders can be used:
–Breadth-first crawlers: to-visit pages are stored in a queue
–Depth-first crawlers: to-visit pages are stored in a stack
–Best-first crawlers: to-visit pages are stored in a priority queue, ordered by some metric
How should we choose the traversal order?
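
The three orders differ only in the data structure that holds the to-visit pages. The sketch below makes that concrete on a small hypothetical graph; `GRAPH`, `SCORE`, and the `traverse` function are assumptions for illustration, and the metric (lower score is visited sooner) is arbitrary.

```python
import heapq
from collections import deque

# Hypothetical link graph: page -> pages it links to.
GRAPH = {
    "A": ["B", "C"],
    "B": ["D"],
    "C": ["E"],
    "D": [],
    "E": [],
}

# Hypothetical metric for best-first crawling (lower score = visit sooner).
SCORE = {"A": 0, "B": 5, "C": 1, "D": 2, "E": 9}


def traverse(seed, order):
    """Visit pages from seed; order is 'breadth', 'depth', or 'best'."""
    if order == "breadth":
        frontier = deque()
        push, pop = frontier.append, frontier.popleft   # FIFO queue
    elif order == "depth":
        frontier = []
        push, pop = frontier.append, frontier.pop       # LIFO stack
    elif order == "best":
        frontier = []                                   # binary heap

        def push(page):
            heapq.heappush(frontier, (SCORE[page], page))

        def pop():
            return heapq.heappop(frontier)[1]
    else:
        raise ValueError(f"unknown order: {order}")

    push(seed)
    seen = {seed}
    visited = []
    while frontier:
        page = pop()
        visited.append(page)
        for link in GRAPH[page]:
            if link not in seen:
                seen.add(link)
                push(link)
    return visited


for order in ("breadth", "depth", "best"):
    print(order, traverse("A", order))
```

The same loop produces three different visit sequences, which is why the choice of frontier structure is effectively the choice of crawling policy.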

Avoiding Cycles
To avoid visiting the same page more than once, a crawler has to keep a list of the URLs it has visited.
The target of every encountered link is checked against this list before it is inserted into the to-visit list.
Which data structure for visited links can provide high efficiency?
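
One common answer is a hash set, which gives constant-time membership tests on average. The sketch below pairs a set with a simple URL normalization step so that trivially different spellings of the same URL count as one page. The `normalize` policy and the `VisitedSet` class are assumptions for illustration; real crawlers typically apply more elaborate canonicalization.

```python
from urllib.parse import urldefrag, urlsplit, urlunsplit


def normalize(url):
    """Reduce trivially different spellings of a URL to one form (simplified policy)."""
    url, _fragment = urldefrag(url)        # drop "#section" fragments
    parts = urlsplit(url)
    path = parts.path or "/"               # treat an empty path as "/"
    return urlunsplit(
        (parts.scheme.lower(), parts.netloc.lower(), path, parts.query, "")
    )


class VisitedSet:
    """Hash set of normalized URLs with average O(1) membership tests."""

    def __init__(self):
        self._seen = set()

    def add_if_new(self, url):
        """Record url; return True only if it was not seen before."""
        key = normalize(url)
        if key in self._seen:
            return False
        self._seen.add(key)
        return True


visited = VisitedSet()
first = visited.add_if_new("http://Example.org")
again = visited.add_if_new("http://example.org/#top")      # same page as above
other = visited.add_if_new("http://example.org/page?q=1")
print(first, again, other)
```

A crawler would call `add_if_new` on every extracted link and push the link onto the to-visit structure only when it returns True.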

Directing Crawlers
Sometimes people want to direct automatic crawling over their resources.
Direction examples:
–“Do not visit my files!”
–“Do not index my files!”
–“Only my crawler may visit my files!”
–“Please, follow my useful links…”
Solution: publish instructions to crawlers in a known format.
Crawlers are expected to follow these instructions.

Robots Exclusion Protocol
A method that allows Web servers to indicate which of their resources should not be visited by crawlers.
Put the file robots.txt in the root directory of the server:
–http://www.cnn.com/robots.txt
–http://www.w3.org/robots.txt
–http://www.mit.edu/robots.txt

robots.txt Format
A robots.txt file consists of several records.
Each record consists of a set of crawler IDs and a set of URLs that these crawlers are not allowed to visit:
–“User-agent” lines: which crawlers are directed?
–“Disallow” lines: which URLs are not to be visited by these crawlers (agents)?

robots.txt Format
The following example is taken from http://www.w3.org/robots.txt:
    User-agent: W3Crobot/1
    Disallow: /Out-Of-Date

    User-agent: *
    Disallow: /Team
    Disallow: /Project
    Disallow: /Systems
    Disallow: /Web
    Disallow: /History
    Disallow: /Out-Of-Date
W3Crobot/1 is not allowed to visit files under the directory Out-Of-Date.
And those that are not W3Crobot/1…
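
To make the record structure concrete, here is a small hand-written parser applied to the example above. It is a simplified sketch under stated assumptions: `parse_robots` and `allowed` are hypothetical helpers, agent names are matched exactly (case-insensitively), a record naming the crawler takes precedence over the `*` record, and only `User-agent` and `Disallow` lines are understood. For real crawlers, Python's standard library provides `urllib.robotparser`, whose matching rules may differ in detail from this sketch.

```python
ROBOTS_TXT = """\
User-agent: W3Crobot/1
Disallow: /Out-Of-Date

User-agent: *
Disallow: /Team
Disallow: /Project
Disallow: /Systems
Disallow: /Web
Disallow: /History
Disallow: /Out-Of-Date
"""


def parse_robots(text):
    """Parse robots.txt text into a list of (agents, disallowed_prefixes) records."""
    records = []
    agents, rules = [], []
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()       # drop comments
        if not line or ":" not in line:
            continue
        field, value = (part.strip() for part in line.split(":", 1))
        field = field.lower()
        if field == "user-agent":
            if rules:                             # rules seen: a new record starts
                records.append((agents, rules))
                agents, rules = [], []
            agents.append(value.lower())
        elif field == "disallow" and agents:
            rules.append(value)
    if agents:
        records.append((agents, rules))
    return records


def allowed(records, agent, path):
    """Return True if agent may visit path (simplified, assumed semantics)."""
    agent = agent.lower()
    chosen = None
    for agents, rules in records:
        if agent in agents:                       # a record naming the agent wins
            chosen = rules
            break
        if "*" in agents and chosen is None:      # otherwise fall back to "*"
            chosen = rules
    if chosen is None:
        return True
    # An empty Disallow value restricts nothing, so empty prefixes are skipped.
    return not any(prefix and path.startswith(prefix) for prefix in chosen)


records = parse_robots(ROBOTS_TXT)
print(allowed(records, "W3Crobot/1", "/Team/"))
print(allowed(records, "W3Crobot/1", "/Out-Of-Date/x.html"))
print(allowed(records, "SomeOtherBot", "/Team/"))
print(allowed(records, "SomeOtherBot", "/"))
```

Under these assumptions, W3Crobot/1 is only kept out of /Out-Of-Date, while every other crawler falls under the `*` record and is kept out of all six listed directories.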

Robots Meta Tag
A Web-page author may also publish directions for crawlers.
These are expressed by the META tag with name robots, inside the HTML file.
Format: <meta name="robots" content="[options]">
Options:
–index (noindex): index (do not index) this file
–follow (nofollow): follow (do not follow) the links of this file

Robots Meta Tag (continued)
An example: [HTML example not preserved in the transcript]
How should a crawler act when it visits this page?
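
A crawler could read these directions with a small HTML parser. The sketch below is illustrative only: the `RobotsMetaParser` class and the `PAGE` document are hypothetical (the page is not the example from the original slide), and treating a missing directive as "index, follow" is an assumed default.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Reads <meta name="robots" content="..."> and records the directives."""

    def __init__(self):
        super().__init__()
        self.index = True    # assumed default when no directive is given
        self.follow = True   # assumed default when no directive is given

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        if (attributes.get("name") or "").lower() != "robots":
            return
        content = attributes.get("content") or ""
        for option in content.lower().split(","):
            option = option.strip()
            if option == "noindex":
                self.index = False
            elif option == "index":
                self.index = True
            elif option == "nofollow":
                self.follow = False
            elif option == "follow":
                self.follow = True


# Hypothetical page, not the example from the original slide.
PAGE = """<html><head>
<meta name="robots" content="noindex, follow">
</head><body><a href="/next">next</a></body></html>"""

meta = RobotsMetaParser()
meta.feed(PAGE)
print(meta.index, meta.follow)
```

For this hypothetical page, a polite crawler would leave the page out of its index but still add the link to /next to its to-visit list.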

Revisit Meta Tag
Web-page authors may want Web applications to have an up-to-date copy of their page.
Using the revisit meta tag, page authors may give crawlers some idea of how often the page is updated.
For example: [HTML example not preserved in the transcript]

Stronger Restrictions
It is possible for an impolite crawler to ignore the restrictions imposed by robots.txt and robots meta tags.
Therefore, anyone who wants to ensure that automatic robots do not visit their resources has to use other mechanisms:
–For example, password protection

Resources
Read this nice tutorial about Web crawling: http://informatics.indiana.edu/fil/Papers/crawling.pdf
To find out more about crawler direction, visit www.robotstxt.org
A dictionary of HTML META tags can be found at http://vancouver-webpages.com/META/
