Web Crawlers Nutch. Agenda What are web crawlers Main policies in crawling Nutch Nutch architecture.


1 Web Crawlers Nutch

2 Agenda
What are web crawlers
Main policies in crawling
Nutch
Nutch architecture

3 Web crawlers
Crawl (visit) web pages and download them
Starting from one page
–determine which page(s) to go to next
This choice determines how good and how efficient a crawler is
It depends mainly on the crawling policies used
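The loop described on this slide can be sketched in a few lines. This is a minimal illustration, not Nutch's implementation: it assumes a breadth-first visiting order, and the names crawl, get_links and TOY_WEB are hypothetical. The link graph is an in-memory dictionary standing in for real downloads.

```python
from collections import deque


def crawl(seed, get_links, max_pages=100):
    """Visit pages breadth-first starting from seed; return the visit order."""
    frontier = deque([seed])  # pages discovered but not yet visited
    seen = {seed}             # pages already queued, to avoid revisiting
    visited = []
    while frontier and len(visited) < max_pages:
        url = frontier.popleft()
        visited.append(url)   # a real crawler would download the page here
        for link in get_links(url):
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return visited


# Toy link graph standing in for the web (hypothetical page names).
TOY_WEB = {
    "a": ["b", "c"],
    "b": ["d"],
    "c": ["d", "a"],
    "d": [],
}

if __name__ == "__main__":
    print(crawl("a", lambda url: TOY_WEB.get(url, [])))
```

Swapping the deque for a priority queue ordered by some page score would be one way to turn this into a policy-driven crawler; the choice of what to pop next is where the selection policy lives.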

4 Crawl policies
Selection policy
Re-visit policy
Politeness policy
Parallelization policy
Selection policy
–PageRank
–Path ascending
–Focused crawling
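As an illustration of the path-ascending selection policy: for each URL found, the crawler also tries every parent path up to the site root. The sketch below only generates those candidate URLs; the function name ascending_paths and the example URL are assumptions made for illustration.

```python
from urllib.parse import urlsplit, urlunsplit


def ascending_paths(url):
    """Return the parent directories of url, deepest first, ending at the root."""
    parts = urlsplit(url)
    segments = [s for s in parts.path.split("/") if s]
    result = []
    # Drop one trailing path segment at a time until only the root is left.
    for i in range(len(segments) - 1, -1, -1):
        path = "/" + "/".join(segments[:i])
        if i > 0:
            path += "/"
        result.append(urlunsplit((parts.scheme, parts.netloc, path, "", "")))
    return result


if __name__ == "__main__":
    for candidate in ascending_paths("http://example.com/a/b/page.html"):
        print(candidate)
```

A crawler using this policy would add each returned URL to its frontier alongside the links extracted from the page itself.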

5 Re-visit policy
–Freshness
–Age
Politeness
–So that crawlers don’t overload web servers
–Set a delay between GET requests
Parallelization
–Distributed web crawling
–To maximize download rate
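The politeness bullet ("set a delay between GET requests") could be enforced per host roughly as follows. This is a sketch under assumptions: the class name PolitenessGate, the default delay, and the injectable clock and sleep parameters are all hypothetical choices, not part of Nutch.

```python
import time
from urllib.parse import urlsplit


class PolitenessGate:
    """Enforce a minimum delay between requests to the same host."""

    def __init__(self, delay_seconds=1.0, clock=time.monotonic, sleep=time.sleep):
        self.delay = delay_seconds
        self.clock = clock        # injectable so the gate can be tested
        self.sleep = sleep
        self.last_request = {}    # host -> time of the most recent request

    def wait_time(self, url):
        """Seconds that must still pass before url's host may be contacted."""
        last = self.last_request.get(urlsplit(url).netloc)
        if last is None:
            return 0.0
        return max(0.0, self.delay - (self.clock() - last))

    def acquire(self, url):
        """Block until the host may be contacted, then record the request."""
        pause = self.wait_time(url)
        if pause > 0:
            self.sleep(pause)
        self.last_request[urlsplit(url).netloc] = self.clock()


if __name__ == "__main__":
    gate = PolitenessGate(delay_seconds=0.5)
    for url in ["http://example.com/a", "http://example.com/b"]:
        gate.acquire(url)         # the second call pauses for about 0.5 s
        print("fetching", url)
```

Because the delay is tracked per host, requests to different hosts are never held up by each other, which leaves room for the parallelization policy to keep the download rate high.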

6 Nutch
Is an open source web crawler
Nutch Web Search Application
–Maintains a DB of pages and links
–Pages have scores, assigned by analysis
–Fetches high-scoring, out-of-date pages
–Distributed search front end
–Based on Lucene
http://lucene.apache.org/nutch/
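The "fetches high-scoring, out-of-date pages" bullet can be pictured as a filter-then-rank step over the page DB. The sketch below is not Nutch's API: PageRecord, select_for_fetch, max_age and top_n are hypothetical names, and treating "out-of-date" as "last fetched more than max_age seconds ago" is an assumption.

```python
from dataclasses import dataclass


@dataclass
class PageRecord:
    url: str
    score: float          # assigned by link analysis in a real system
    last_fetched: float   # timestamp of the last download, in seconds


def select_for_fetch(pages, now, max_age, top_n):
    """Pick the top_n highest-scoring pages whose copy is older than max_age."""
    stale = [p for p in pages if now - p.last_fetched > max_age]
    stale.sort(key=lambda p: p.score, reverse=True)
    return stale[:top_n]


if __name__ == "__main__":
    db = [
        PageRecord("http://example.com/", 0.9, last_fetched=0.0),
        PageRecord("http://example.com/news", 0.7, last_fetched=900.0),
        PageRecord("http://example.com/old", 0.2, last_fetched=0.0),
        PageRecord("http://example.com/about", 0.5, last_fetched=100.0),
    ]
    for page in select_for_fetch(db, now=1000.0, max_age=500.0, top_n=2):
        print(page.url, page.score)
```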


