1
Web Crawlers Nutch
2
Agenda
What are web crawlers
Main policies in crawling
Nutch
Nutch architecture
3
Web crawlers
Crawl (visit) web pages and download them
Starting from one page
–Determine which page(s) to go to next
This decision is what makes a crawler good or bad, efficient or inefficient
Mainly depends on the crawling policies used
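The crawl loop described on this slide can be sketched as follows. This is a minimal illustration, not any particular crawler's implementation: it assumes a breadth-first order, and `crawl`, `get_links`, and the `TOY_WEB` link graph are hypothetical names standing in for real page downloads.

```python
from collections import deque

def crawl(seed, get_links, max_pages=100):
    """Breadth-first crawl from seed; returns pages in the order visited."""
    frontier = deque([seed])   # pages waiting to be visited
    seen = {seed}              # pages already queued, so none is queued twice
    visited = []
    while frontier and len(visited) < max_pages:
        url = frontier.popleft()
        visited.append(url)            # stands in for "download the page"
        for link in get_links(url):    # outgoing links decide where to go next
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return visited

# Toy link graph standing in for the web (hypothetical page names).
TOY_WEB = {
    "a": ["b", "c"],
    "b": ["d"],
    "c": ["d", "a"],
    "d": [],
}

if __name__ == "__main__":
    print(crawl("a", lambda u: TOY_WEB.get(u, [])))
```

Swapping the `deque` for a priority queue is where a selection policy would plug in: the order in which the frontier is drained determines which pages get downloaded first.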
4
Crawl policies
Selection policy
Re-visit policy
Politeness policy
Parallelization policy
Selection policy
–PageRank
–Path ascending
–Focused crawling
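As one example of a selection policy, path-ascending crawling also visits every ancestor path of a URL it discovers. A small sketch, assuming plain HTTP URLs and using only the standard library; `ascending_paths` is a hypothetical helper name.

```python
from urllib.parse import urlsplit, urlunsplit

def ascending_paths(url):
    """Return the ancestor paths of url, from the nearest directory up to the root."""
    parts = urlsplit(url)
    segments = [s for s in parts.path.split("/") if s]
    paths = []
    # Drop one trailing segment at a time until only the root is left.
    for i in range(len(segments) - 1, -1, -1):
        path = "/" + "/".join(segments[:i])
        if i > 0:
            path += "/"
        paths.append(urlunsplit((parts.scheme, parts.netloc, path, "", "")))
    return paths

if __name__ == "__main__":
    for p in ascending_paths("http://example.com/a/b/page.html"):
        print(p)
```

A crawler using this policy would add each returned URL to its frontier alongside the links found on the page itself.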
5
Re-visit policy
–Freshness
–Age
Politeness
–So that crawlers don’t overload web servers
–Set a delay between GET requests
Parallelization
–Distributed web crawling
–To maximize download rate
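The re-visit measures and the politeness delay can be sketched as below. This assumes the usual definitions (freshness is 1 while the local copy matches the live page and 0 otherwise; age is how long the copy has been out of date) and a fixed per-host delay. `PolitenessGate` and its 1-second default are illustrative assumptions, and times are passed in explicitly so the example does not sleep.

```python
from urllib.parse import urlsplit

def freshness(fetched_at, modified_at):
    """1 if the local copy is still up to date, else 0."""
    return 1 if modified_at <= fetched_at else 0

def age(fetched_at, modified_at, now):
    """How long the local copy has been out of date; 0 while it is fresh."""
    return 0 if modified_at <= fetched_at else now - modified_at

class PolitenessGate:
    """Tracks, per host, the earliest time the next GET request may be sent."""

    def __init__(self, delay_seconds=1.0):
        self.delay = delay_seconds
        self.next_allowed = {}   # host -> earliest allowed request time

    def wait_time(self, url, now):
        """Seconds to wait before fetching url; also reserves that request slot."""
        host = urlsplit(url).netloc
        start = max(now, self.next_allowed.get(host, now))
        self.next_allowed[host] = start + self.delay
        return start - now

if __name__ == "__main__":
    gate = PolitenessGate(delay_seconds=2.0)
    print(gate.wait_time("http://a.example/x", now=10.0))
    print(gate.wait_time("http://a.example/y", now=10.5))
    print(gate.wait_time("http://b.example/", now=10.5))
```

Keying the delay on the host rather than the URL is what lets a parallel crawler keep its download rate up: requests to different servers never wait on each other.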
6
Nutch
Is an open source web crawler
Nutch Web Search Application
–Maintains a DB of pages and links
–Pages have scores, assigned by analysis
–Fetches high-scoring, out-of-date pages
–Distributed search front end
–Based on Lucene
http://lucene.apache.org/nutch/
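The idea of fetching high-scoring, out-of-date pages from a page database can be illustrated with a short sketch. This is not Nutch's code or API (Nutch itself is written in Java); `PageRecord`, `select_fetch_list`, and the `max_age` threshold are hypothetical names chosen for the example.

```python
from dataclasses import dataclass

@dataclass
class PageRecord:
    url: str
    score: float          # assigned by some link analysis step
    last_fetched: float   # time of the last fetch; 0.0 if never fetched

def select_fetch_list(db, now, max_age, top_n):
    """Pick the top_n highest-scoring pages whose copy is older than max_age."""
    stale = [p for p in db if now - p.last_fetched > max_age]
    stale.sort(key=lambda p: p.score, reverse=True)
    return [p.url for p in stale[:top_n]]

if __name__ == "__main__":
    db = [
        PageRecord("u1", 0.90, 100.0),
        PageRecord("u2", 0.50, 0.0),
        PageRecord("u3", 0.70, 10.0),
        PageRecord("u4", 0.99, 95.0),
    ]
    print(select_fetch_list(db, now=100.0, max_age=50.0, top_n=2))
```

Filtering on staleness before sorting by score means a recently fetched page is skipped no matter how high its score, which matches the "high-scoring, out-of-date" wording on the slide.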