Nutch in a Nutshell (Part I)
Presented by Liew Guo Min and Zhao Jin
Outline
- Overview
- Nutch as a web crawler
- Nutch as a complete web search engine
- Special features
- Installation/Usage (with demo)
- Exercises
Overview
- Complete web search engine: Nutch = Crawler + Indexer/Searcher (Lucene) + GUI + Plugins + MapReduce & Distributed FS (Hadoop)
- Java-based, open source
- Features:
  - Customizable
  - Extensible (next meeting)
  - Distributed (next meeting)
Nutch as a crawler
[Architecture diagram: the Injector writes the initial URLs into the CrawlDB; the Generator reads the CrawlDB and generates a segment of URLs to fetch; the Fetcher downloads the webpages/files from the web into the segment; the Parser extracts text and links; the CrawlDBTool updates the CrawlDB with the results.]
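These components map directly onto Nutch's command-line tools. A minimal sketch of one crawl cycle, assuming Nutch 0.8-style commands and illustrative directory names (urls/ for the seed list, crawl/ for the output):

```sh
bin/nutch inject crawl/crawldb urls               # Injector: seed the CrawlDB
bin/nutch generate crawl/crawldb crawl/segments   # Generator: pick URLs into a new segment
SEGMENT=$(ls -d crawl/segments/* | tail -1)       # the segment just generated
bin/nutch fetch $SEGMENT                          # Fetcher: download (and by default parse) the pages
bin/nutch updatedb crawl/crawldb $SEGMENT         # CrawlDBTool: fold the results back into the CrawlDB
```

Repeating the generate/fetch/updatedb cycle deepens the crawl by one link level each round.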
Nutch as a complete web search engine
[Architecture diagram: the Indexer (Lucene) builds an index from the segments, the CrawlDB, and the LinkDB; the Searcher (Lucene) answers queries against the index through the web GUI (Tomcat).]
Special Features
- Customizable
  - Configuration files (XML)
  - Required user parameters: http.agent.name, http.agent.description, http.agent.url, http.agent.email
  - Adjustable parameters for every component, e.g. for the fetcher: threads per host, threads per IP
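The four required agent parameters go into conf/nutch-site.xml, which overrides the defaults from nutch-default.xml. A minimal sketch; all the values are illustrative placeholders:

```xml
<?xml version="1.0"?>
<configuration>
  <property>
    <name>http.agent.name</name>
    <value>MyNutchCrawler</value>                  <!-- hypothetical crawler name -->
  </property>
  <property>
    <name>http.agent.description</name>
    <value>Test crawler for a class demo</value>
  </property>
  <property>
    <name>http.agent.url</name>
    <value>http://example.com/crawler.html</value> <!-- placeholder page describing the crawler -->
  </property>
  <property>
    <name>http.agent.email</name>
    <value>crawler-admin@example.com</value>       <!-- placeholder contact address -->
  </property>
</configuration>
```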
Special Features
- URL filters (text file)
  - Regular expressions to filter URLs during crawling, e.g.:
    - To ignore files with certain suffixes: -\.(gif|exe|zip|ico)$
    - To accept hosts in a certain domain: +^http://([a-z0-9]*\.)*apache.org/
- Plugin information (XML)
  - The metadata of the plugins (more details next week)
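Putting the two example rules together, a sketch of conf/crawl-urlfilter.txt. Rules are tried top to bottom and the first matching pattern decides: '+' accepts the URL, '-' rejects it:

```
# skip files with binary suffixes
-\.(gif|exe|zip|ico)$
# accept anything in the apache.org domain
+^http://([a-z0-9]*\.)*apache.org/
# reject everything else
-.
```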
Installation & Usage
Installation: software needed
- Nutch release
- Java
- Apache Tomcat (for the GUI)
- Cygwin (for Windows)
Installation & Usage
Usage
- Crawling
  - Initial URLs (text file or DMOZ file)
  - Required parameters (conf/nutch-site.xml)
  - URL filters (conf/crawl-urlfilter.txt)
- Indexing
  - Automatic
- Searching
  - Location of files (WAR file, index)
  - The Tomcat server
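A sketch of the whole workflow using the one-shot crawl command, assuming a Nutch 0.8/0.9 release; the seed URL, depth, and WAR file name are illustrative:

```sh
mkdir urls
echo "http://lucene.apache.org/nutch/" > urls/seeds.txt   # seed list
bin/nutch crawl urls -dir crawl -depth 3 -topN 50         # crawl, parse, and index in one step

# Searching: deploy the bundled web app to Tomcat; point the webapp's
# searcher.dir property at the crawl/ directory so it can find the index.
cp nutch-0.9.war $CATALINA_HOME/webapps/ROOT.war
```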
Demo time!
Exercises
Questions:
1. What are the things that need to be done before starting a crawl job with Nutch?
2. What are the ways to tell Nutch what to crawl and what not to?
3. What can you do if you are the owner of a website?
4. Starting from v0.8, Nutch won't run unless some minimum user parameters, such as http.robots.agents, are set. What do you think is the reason behind this?
5. What do you think are good crawling behaviors?
6. Do you think an open-source search engine like Nutch would make it easier for spammers to manipulate the search index ranking?
7. What are the advantages of using Nutch instead of commercial search engines?
Answers
What are the things that need to be done before starting a crawl job with Nutch?
- Set the CLASSPATH to include the Lucene core
- Set the JAVA_HOME path
- Create a folder containing the URLs to be crawled
- Amend the crawl-urlfilter file
- Amend the nutch-site.xml file to include the required user parameters
What are the ways to tell Nutch what to crawl and what not to?
- URL filters
- Crawl depth
- Scoring function for URLs
What can you do if you are the owner of a website?
- As a web server administrator: use the Robots Exclusion Protocol by adding rules to /robots.txt (a sketch follows below)
- As an HTML author: add the robots META tag
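A sketch of a /robots.txt using the Robots Exclusion Protocol; "MyNutchCrawler" is a hypothetical user-agent name (it would match the crawler's configured http.agent.name):

```
# keep one specific crawler out of a private area
User-agent: MyNutchCrawler
Disallow: /private/

# keep all other compliant robots out of the whole site
User-agent: *
Disallow: /
```

For individual pages, the HTML author's equivalent is a robots META tag in the page's <head>, e.g. <meta name="robots" content="noindex, nofollow">.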
Starting from v0.8, Nutch won't run unless some minimum user parameters, such as http.robots.agents, are set. What do you think is the reason behind this?
- To ensure accountability (although tracing is still possible without them)
What do you think are good crawling behaviors?
- Be accountable
- Test locally
- Don't hog resources
- Stay with it
- Share results
Do you think an open-source search engine like Nutch would make it easier for spammers to manipulate the search index ranking?
- True, but one can always make changes in Nutch to minimize the effect.
What are the advantages of using Nutch instead of commercial search engines?
- Open source
- Transparent
- Able to define what is returned in searches and how the index ranking works
Exercises
Hands-on exercises:
1. Install Nutch, crawl a few webpages using the crawl command, and perform a search on them using the GUI.
2. Repeat the crawling process without using the crawl command.
3. Modify your configuration to perform each of the following crawl jobs and think about when they would be useful (a configuration sketch for the first one follows below):
   - Crawl only webpages and PDFs, but nothing else
   - Crawl the files on your hard disk
   - Crawl but do not parse
4. (Challenging) Modify Nutch so that you can unpack the crawled files in the segments back into their original state.
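As a starting point for the first job in exercise 3: letting PDFs through requires both removing pdf from any '-' suffix rule in the URL filter and enabling a PDF parser plugin. A sketch for nutch-site.xml, assuming 0.8/0.9-era plugin names (check the plugins/ directory of your release for the exact list):

```xml
<!-- An assumed variant of the default plugin.includes list, with parse-pdf added -->
<property>
  <name>plugin.includes</name>
  <value>protocol-http|urlfilter-regex|parse-(text|html|pdf)|index-basic|query-(basic|site|url)</value>
</property>
```

For crawling files on the local disk, the protocol-file plugin plays the same role as protocol-http.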
Q&A?
Next Meeting
- Special features
  - Extensible
  - Distributed
- Feedback and discussion
References
- http://lucene.apache.org/nutch/ -- Official website
- http://wiki.apache.org/nutch/ -- Nutch wiki (seriously outdated; take it with a grain of salt)
- http://lucene.apache.org/nutch/release/ -- Nutch source code
- www.nutchinstall.blogspot.com -- Installation guide
- http://www.robotstxt.org/wc/robots.html -- The Web Robots Pages
Thank you!