Web Crawlers IST 497 Vladimir Belyavskiy 11/21/02
Overview
Introduction to Crawlers
Focused Crawling
Issues to consider
Parallel Crawlers
Ambitions for the future
Conclusion
Introduction
What is a crawler? Why are crawlers important?
Crawlers are used by many; their main use is to create indexes for search engines. A tool was needed to keep track of web content: in March 2002 there were 38,118,962 web sites, and the Web had doubled in size in less than two years.
Web crawlers start by parsing a specified web page, noting any hypertext links on that page that point to other web pages. They then parse those pages for new links, and so on, recursively. Web-crawler software doesn't actually move around to different computers on the Internet, as viruses or intelligent agents do: a crawler resides on a single machine. It simply sends HTTP requests for documents to other machines on the Internet, just as a web browser does when the user clicks on links. All the crawler really does is automate the process of following links. Following links isn't greatly useful in itself, of course; the list of linked pages almost always serves some subsequent purpose. The most common use is to build an index for a web search engine, but crawlers are also used for other purposes.
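A minimal sketch of this fetch-and-follow loop, using only the Python standard library. The seed URL, page limit, and politeness delay are illustrative choices, not values from the slides.

```python
# Sketch of the link-following loop described above: fetch a page, extract
# its hyperlinks, and enqueue them for later fetching.
import time
import urllib.request
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkExtractor(HTMLParser):
    """Collects href attributes from <a> tags."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(seed_url, max_pages=20, delay=1.0):
    frontier = deque([seed_url])   # URLs waiting to be fetched
    seen = {seed_url}              # avoid visiting the same URL twice
    fetched = 0
    while frontier and fetched < max_pages:
        url = frontier.popleft()
        try:
            with urllib.request.urlopen(url, timeout=10) as resp:
                html = resp.read().decode("utf-8", errors="replace")
        except Exception:
            continue                # skip pages that fail to download
        fetched += 1
        parser = LinkExtractor()
        parser.feed(html)
        for link in parser.links:
            absolute = urljoin(url, link)
            if absolute.startswith("http") and absolute not in seen:
                seen.add(absolute)
                frontier.append(absolute)
        time.sleep(delay)           # simple politeness delay
    return seen
```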
Focused Crawling
A focused crawler selectively seeks out pages that are relevant to a pre-defined set of topics.
Topics are specified using exemplary documents, not keywords.
The crawler follows the most relevant links and ignores irrelevant parts of the Web.
This leads to significant savings in hardware and network resources.
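A rough sketch of the relevance test a focused crawler might apply before following a link: compare a candidate page's text to the exemplary documents and crawl it only if the similarity clears a threshold. The bag-of-words cosine similarity and the 0.2 threshold are illustrative assumptions, not the specific method described in the slides.

```python
import math
from collections import Counter

def cosine_similarity(text_a, text_b):
    # Bag-of-words cosine similarity between two documents.
    a, b = Counter(text_a.lower().split()), Counter(text_b.lower().split())
    dot = sum(a[w] * b[w] for w in set(a) & set(b))
    norm = math.sqrt(sum(v * v for v in a.values())) * \
           math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def is_relevant(page_text, exemplar_texts, threshold=0.2):
    # Relevant if the page is close enough to any exemplary document.
    return any(cosine_similarity(page_text, ex) >= threshold
               for ex in exemplar_texts)
```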
Issues to consider: where to start crawling?
Keyword search: the user specifies keywords and the crawler searches for the given criteria. Popular sites are found using weighted degree measures. This approach was used for 966 Yahoo category searches (e.g., Business/Electronics).
User input: the user gives example documents, and the crawler compares candidate documents against them to find matches.
Issues to consider: which link do you crawl next?
URLs that are found are stored in a queue, stack, or deque.
Ordering metrics include breadth-first: URLs are placed in the queue in the order they are discovered, so the first link found is the first to be crawled.
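Breadth-first ordering as described above means the frontier is a FIFO queue: the first link discovered is the first link crawled. A deque makes the choice of ordering explicit; the URLs are placeholders.

```python
from collections import deque

frontier = deque()
frontier.append("http://example.com/a")   # discovered first
frontier.append("http://example.com/b")   # discovered second

next_url = frontier.popleft()   # breadth-first: crawl /a before /b
# next_url = frontier.pop()     # depth-first: most recently found link first
```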
Issues to consider: other ordering metrics
Backlink count: counts the number of links pointing to a page; the site with the greatest number of backlinks is given priority.
PageRank: backlinks are also counted, but backlinks from popular pages (e.g., Yahoo) are given extra weight. PageRank works best.
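A small sketch of the PageRank idea mentioned above: a backlink from a popular page is worth more than a backlink from an obscure one. This is the standard power-iteration formulation with a 0.85 damping factor; the toy link graph is made up for illustration.

```python
def pagerank(graph, damping=0.85, iterations=50):
    # graph maps each page to the list of pages it links to
    pages = list(graph)
    rank = {p: 1.0 / len(pages) for p in pages}
    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / len(pages) for p in pages}
        for page, outlinks in graph.items():
            if not outlinks:
                continue
            share = damping * rank[page] / len(outlinks)
            for target in outlinks:
                if target in new_rank:
                    new_rank[target] += share
        rank = new_rank
    return rank

toy_graph = {
    "yahoo": ["siteA", "siteB"],
    "siteA": ["siteB"],
    "siteB": ["yahoo"],
}
print(pagerank(toy_graph))  # pages with popular backlinks score higher
```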
Issues to consider: what pages should the crawler download?
There is not enough space or time to download everything. In most cases, the crawler cannot download all pages on the Web; even the most comprehensive search engine currently indexes only a small fraction of it [LG99, BB99]. Given this fact, it is important for the crawler to carefully select pages and to visit "important" pages first, so that the fraction of the Web that is visited (and kept up-to-date) is more meaningful.
How is content kept fresh? Once the crawler has downloaded a significant number of pages, it has to start revisiting them in order to detect changes and refresh the downloaded collection. Revisit strategies include: fixed order (an explicit list of URLs to visit), random order (start from a seed and follow links), and purely random (refresh pages on demand). Because Web pages change at very different rates [CGM00a, WM99], the crawler needs to decide carefully which pages to revisit and which to skip in order to achieve high "freshness" of pages. For example, if a certain page rarely changes, the crawler may want to revisit it less often, in order to visit more frequently changing ones. What counts as a change is defined by the user, e.g., a 30% change in a page, or changes in three different columns.
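A sketch of the "what counts as a change?" decision above: the slides give a user-defined threshold such as a 30% change in the page. difflib's similarity ratio is one illustrative way to measure that; it is not necessarily the measure the slides assume.

```python
import difflib

def has_changed(old_text, new_text, threshold=0.30):
    # SequenceMatcher.ratio() is 1.0 for identical texts, 0.0 for disjoint ones.
    similarity = difflib.SequenceMatcher(None, old_text, new_text).ratio()
    return (1.0 - similarity) >= threshold   # changed if 30% or more differs
```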
Issues to consider: estimating the frequency of changes
Visit pages once a week for five weeks, estimate each page's change frequency from those observations, and then adjust the revisit frequency based on the estimate. This is the most effective method.
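A sketch of that revisit-tuning idea: observe a page once a week for five weeks, estimate how often it changes, and adjust the revisit interval accordingly. The interval formula and the clamping bounds below are illustrative heuristics, not the estimator the slides refer to.

```python
def estimate_revisit_days(weekly_changed, min_days=1, max_days=28):
    # weekly_changed: five booleans, True if the page changed since the last visit
    observed_rate = sum(weekly_changed) / len(weekly_changed)  # changes per week
    if observed_rate == 0:
        return max_days              # page appears static: revisit rarely
    interval = 7.0 / observed_rate   # more frequent changes -> shorter interval
    return max(min_days, min(max_days, interval))

print(estimate_revisit_days([True, False, True, True, False]))  # ~11.7 days
```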
Issues to consider: how to minimize the load on visited sites?
The crawler should obey a site's constraints: crawler-related HTML tags (e.g., robots meta tags) and the robots.txt file, for example:
User-Agent: *
Disallow: /
It should also avoid spider traps. A sketch of these politeness checks appears below.
When the crawler collects pages from the Web, it consumes resources belonging to other organizations [Kos95]. For example, when the crawler downloads page p on site S, the site needs to retrieve p from its file system, consuming disk and CPU resources. The page then needs to be transferred through the network, which is another resource shared by multiple organizations. Therefore, the crawler should minimize its impact on these resources [Rob]; otherwise, the administrators of a Web site or a particular network may complain and sometimes completely block access by the crawler.
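The sketch below honors robots.txt before fetching and waits between requests to the same host, so the crawler does not consume too many of a server's resources. The user-agent string and one-second delay are illustrative choices.

```python
import time
from urllib import robotparser
from urllib.parse import urljoin, urlparse

USER_AGENT = "ExampleCrawler"   # hypothetical crawler name
last_visit = {}                 # host -> time of our last request to it

def allowed_by_robots(url):
    parts = urlparse(url)
    root = "{0}://{1}".format(parts.scheme, parts.netloc)
    rp = robotparser.RobotFileParser()
    rp.set_url(urljoin(root, "/robots.txt"))
    try:
        rp.read()
    except Exception:
        return True             # if robots.txt is unreachable, assume allowed
    return rp.can_fetch(USER_AGENT, url)

def polite_wait(url, delay=1.0):
    # Ensure at least `delay` seconds between requests to the same host.
    host = urlparse(url).netloc
    elapsed = time.time() - last_visit.get(host, 0)
    if elapsed < delay:
        time.sleep(delay - elapsed)
    last_visit[host] = time.time()
```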
Parallel Crawlers
The Web is too big to be crawled by a single crawler, so the work should be divided.
Independent assignment: each crawler starts with its own set of URLs and follows links without consulting other crawlers. This reduces communication overhead, but some overlap between crawlers is unavoidable.
Parallel Crawlers: dynamic and static assignment
Dynamic assignment: a central coordinator divides the Web into partitions, and each crawler crawls its assigned partition; links to URLs outside the partition are handed back to the central coordinator. In dynamic assignment, the central coordinator may become a major bottleneck, because it has to maintain a large number of URLs reported from all C-proc's and has to constantly coordinate them.
Static assignment: the Web is partitioned up front and divided among the crawlers, and each crawler crawls only its own part of the Web. With static assignment the user must know in advance what they want to crawl, and they may not know all of the desired domains.
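A sketch of static assignment as just described: partition the Web up front by hashing each URL's host name, so every crawler process (C-proc) owns a fixed slice, and links that fall into another slice are simply handed off. The number of processes and the hand-off callback are illustrative.

```python
import hashlib
from urllib.parse import urlparse

NUM_CRAWLERS = 4   # illustrative number of C-proc's

def assigned_crawler(url):
    # Hash the host name so all URLs on one site go to the same C-proc.
    host = urlparse(url).netloc.lower()
    digest = hashlib.md5(host.encode("utf-8")).hexdigest()
    return int(digest, 16) % NUM_CRAWLERS

def handle_discovered_link(url, my_id, my_frontier, handoff):
    owner = assigned_crawler(url)
    if owner == my_id:
        my_frontier.append(url)      # this partition belongs to us
    else:
        handoff(owner, url)          # pass the URL to the owning C-proc
```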
Evaluation
Content quality is better for a single-process crawler: in most multi-process configurations the crawlers overlap, or they do not cover all of the content. The content quality of a parallel crawler may be worse than that of a single-process crawler because many importance metrics depend on the global structure of the Web (e.g., backlink count). Each C-proc in a parallel crawler may know only the pages it has downloaded itself and may make poor crawling decisions based solely on its own pages; in contrast, a single-process crawler knows all the pages it has downloaded and can make more informed decisions. In addition, certain parts of a domain can only be reached from other domains; if a crawler isn't allowed to access the other domain, it won't be able to crawl those documents. Overall, crawlers are useful tools.
Future
Handle query-interface pages. As more and more pages are dynamically generated, some pages are "hidden" behind a query interface and are reachable only when the user issues keyword queries to it. In order to crawl them, the crawler has to figure out what keywords to issue; it can use the context of surrounding pages to guess the keywords and retrieve the data.
Detect web page changes better by separating dynamic from static content, as sketched below. Some web pages change only in certain sections. For example, on eBay prices change frequently, but the product description doesn't. Crawlers should ignore changes in the dynamic portion, since they are irrelevant to the description of the page; this saves resources by not downloading the same pages all the time.
Share data better between servers and crawlers. A mechanism needs to be developed that allows a crawler to subscribe to the changes it is interested in. Both servers and crawlers would benefit if the changes made on the server were published: the crawler could make better crawling decisions, the amount of information the crawler needs to save would be limited, and traffic on the server would be reduced.
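A sketch of the "ignore the dynamic portion" idea (the eBay example above: prices change constantly, the product description does not). The crawler fingerprints only the parts of the page it considers static and treats the page as changed only when that fingerprint changes. Which patterns count as "dynamic" is an assumption the crawler operator would have to supply; the price regex here is purely illustrative.

```python
import hashlib
import re

def static_fingerprint(html, dynamic_patterns=(r"\$\d[\d,.]*",)):
    # Strip volatile fragments (here: dollar prices) before hashing.
    text = html
    for pattern in dynamic_patterns:
        text = re.sub(pattern, "", text)
    return hashlib.sha1(text.encode("utf-8")).hexdigest()

def meaningful_change(old_html, new_html):
    # True only if the static portion of the page actually changed.
    return static_fingerprint(old_html) != static_fingerprint(new_html)
```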
Bibliography
Cheng, Rickie & Kwong, April.
Cho, Junghoo.
Dom, Brian. March 1999.
Polytechnic University, CIS Department.
The End. Any questions?