Nutch in a Nutshell
Presented by Liew Guo Min and Zhao Jin
Outline Recap Special features Running Nutch in a distributed environment (with demo) Q&A Discussion
Recap
- Complete web search engine: Nutch = Crawler + Indexer/Searcher (Lucene) + GUI + Plugins + MapReduce & Distributed FS (Hadoop)
- Java-based, open source
- Features: customizable, extensible, distributed
Nutch as a crawler
[Diagram: the crawl cycle. The Injector injects the initial URLs into the CrawlDB; the Generator reads the CrawlDB and generates a fetch list (Segment); the Fetcher fetches the webpages/files from the web; the Parser parses them; the CrawlDBTool updates the CrawlDB with the results.]
Special Features: Extensible (Plugin system)
- Most of the essential functionality of Nutch is implemented as plugins
- Three layers:
  - Extension points: what can be extended (Protocol, Parser, ScoringFilter, etc.)
  - Extensions: the interfaces to be implemented for the extension points
  - Plugins: the actual implementations
Special Features: Extensible (Plugin system)
- Anyone can write a plugin:
  - Write the code
  - Prepare the metadata files:
    - plugin.xml: what has been extended, and by what
    - build.xml: how ant can build your source code
  - Ask Nutch to include your plugin in conf/nutch-site.xml
  - Tell ant to build your plugin in src/plugin/build.xml
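As a rough illustration of the plugin.xml metadata described above, a minimal descriptor might look like the following. The plugin id, jar name, and class names here are invented placeholders; only the overall structure (runtime library, extension point, implementation) follows the Nutch plugin format:

```xml
<!-- Sketch of a plugin.xml for a hypothetical parser plugin. -->
<plugin id="parse-demo" name="Demo Parser Plugin"
        version="1.0.0" provider-name="example.org">
  <runtime>
    <!-- The jar that ant builds from your plugin's source code. -->
    <library name="parse-demo.jar">
      <export name="*"/>
    </library>
  </runtime>
  <!-- Declares what has been extended (the Parser extension point)
       and by what (our implementation class). -->
  <extension id="org.example.parse.DemoParser" name="Demo Parser"
             point="org.apache.nutch.parse.Parser">
    <implementation id="DemoParser" class="org.example.parse.DemoParser"/>
  </extension>
</plugin>
```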
Special Features: Extensible (Plugin system)
- To use a plugin:
  - Make sure you have modified conf/nutch-site.xml to include the plugin
  - Then, either:
    - Nutch will automatically call it when needed, or
    - You can write something to call it by its class name and then use it
Special Features: Distributed (Hadoop)
- MapReduce: a framework for distributed programming
  - Map: process the splits of the input data to produce intermediate results, with keys indicating what should be grouped together later
  - Reduce: process all intermediate results sharing the same key and output the final result
Special Features: Distributed (Hadoop)
- MapReduce in Nutch
  - Example 1: Parsing
    - Input: the fetched files from a segment
    - Map(url, content): calls the parser plugins
    - Reduce: identity
  - Example 2: Dumping a segment
    - Input: the files (fetch output, parse output, etc.) from a segment
    - Map: identity
    - Reduce(url, value*): simply concatenates the text representation of the values
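The shape of Example 2 can be sketched in plain Java, without the Hadoop API. The method signatures and the use of strings as values are assumptions for illustration, not Nutch's actual classes:

```java
import java.util.List;

public class SegmentDump {
    // Reduce(url, value*): concatenate the text representation of all
    // values collected for one url, as in the segment-dump example
    // (the map phase is identity, so values arrive unchanged).
    static String reduce(String url, List<String> values) {
        StringBuilder sb = new StringBuilder(url).append(':');
        for (String v : values) {
            sb.append(' ').append(v);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(reduce("http://example.org",
                List.of("title=Example", "status=fetched")));
    }
}
```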
Special Features: Distributed (Hadoop)
- Distributed file system
  - Write-once-read-many coherence model → high throughput
  - Master/slave architecture: simple, but the master is a single point of failure
  - Transparent access via the Java API
Running Nutch in a distributed environment: MapReduce
- In hadoop-site.xml:
  - Specify the job tracker host & port: mapred.job.tracker
  - Specify the task numbers: mapred.map.tasks, mapred.reduce.tasks
  - Specify the location for temporary files: mapred.local.dir
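In conf/hadoop-site.xml these properties might look as follows; the host name, port, task counts, and path below are placeholders to be replaced with your own values:

```xml
<configuration>
  <!-- Job tracker host & port -->
  <property>
    <name>mapred.job.tracker</name>
    <value>jobtracker.example.org:9001</value>
  </property>
  <!-- Task numbers -->
  <property>
    <name>mapred.map.tasks</name>
    <value>4</value>
  </property>
  <property>
    <name>mapred.reduce.tasks</name>
    <value>2</value>
  </property>
  <!-- Location for temporary files -->
  <property>
    <name>mapred.local.dir</name>
    <value>/tmp/hadoop/mapred/local</value>
  </property>
</configuration>
```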
Running Nutch in a distributed environment: DFS
- In hadoop-site.xml:
  - Specify the namenode host, port & directory: fs.default.name, dfs.name.dir
  - Specify the location for files on each datanode: dfs.data.dir
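The corresponding DFS entries in conf/hadoop-site.xml could look like this; again, the host name, port, and directories are placeholders:

```xml
<configuration>
  <!-- Namenode host & port -->
  <property>
    <name>fs.default.name</name>
    <value>namenode.example.org:9000</value>
  </property>
  <!-- Directory where the namenode stores its metadata -->
  <property>
    <name>dfs.name.dir</name>
    <value>/data/hadoop/dfs/name</value>
  </property>
  <!-- Location for file blocks on each datanode -->
  <property>
    <name>dfs.data.dir</name>
    <value>/data/hadoop/dfs/data</value>
  </property>
</configuration>
```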
Demo time!
Q&A
Discussion
Exercises
Hands-on exercises:
- Install Nutch, crawl a few webpages using the crawl command, and perform a search on them using the GUI
- Repeat the crawling process without using the crawl command
- Modify your configuration to perform each of the following crawl jobs, and think about when each would be useful:
  - Crawl only webpages and PDFs, but nothing else
  - Crawl the files on your hard disk
  - Crawl but do not parse
- (Challenging) Modify Nutch so that you can unpack the crawled files in the segments back into their original state
Reference
- Information on Nutch plugins
- Hadoop homepage
- Hadoop Wiki
- "MapReduce in Nutch" (data/attachments/Presentations/attachments/mapred.pdf)
- "Scalable Computing with MapReduce" (data/attachments/Presentations/attachments/oscon05.pdf)
- Updated tutorial on setting up Nutch, Hadoop and Lucene together
Excursion: MapReduce
- Problem: find the number of occurrences of "cat" in a file
  - What if the file is 20 GB large? Why not do it with more computers?
- Solution:
  [Diagram: the file is divided into Split 1, Split 2, ... and the splits are handed out to many machines, PC1 through PC1500.]
Excursion: MapReduce
- Problem: find the number of occurrences of both "cat" and "dog" in a very large file
- Solution:
  [Diagram: Input files → Map → Intermediate files → Sort/Group → Reduce → Output files.
   Split 1 → PC1 → cat: 200, dog: 250; Split 2 → PC2 → cat: 300, dog: 250.
   After sort/group: cat: 200, 300 and dog: 250, 250.
   Reduce: PC1 outputs cat: 500; PC2 outputs dog: 500.]
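The cat/dog example above can be sketched in plain Java, simulating the three phases (map, sort/group, reduce) on one machine. Representing the splits as strings and emitting (word, 1) pairs are assumptions made for illustration:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class CatDogCount {
    // Map phase: emit (word, 1) for every occurrence of "cat" or "dog" in a split.
    static List<Map.Entry<String, Integer>> map(String split) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String w : split.split("\\s+")) {
            if (w.equals("cat") || w.equals("dog")) {
                out.add(Map.entry(w, 1));
            }
        }
        return out;
    }

    // Sort/group phase: gather all intermediate values under their key.
    static Map<String, List<Integer>> group(List<List<Map.Entry<String, Integer>>> intermediate) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (List<Map.Entry<String, Integer>> perSplit : intermediate) {
            for (Map.Entry<String, Integer> e : perSplit) {
                grouped.computeIfAbsent(e.getKey(), k -> new ArrayList<>()).add(e.getValue());
            }
        }
        return grouped;
    }

    // Reduce phase: sum all the values collected for one key.
    static int reduce(List<Integer> values) {
        int sum = 0;
        for (int v : values) sum += v;
        return sum;
    }

    public static void main(String[] args) {
        List<String> splits = List.of("cat dog cat", "dog cat dog");
        List<List<Map.Entry<String, Integer>>> intermediate = new ArrayList<>();
        for (String s : splits) intermediate.add(map(s));  // one map call per split
        group(intermediate).forEach((k, v) -> System.out.println(k + ": " + reduce(v)));
        // prints cat: 3 then dog: 3
    }
}
```

In a real cluster the map calls would run on different machines and the sort/group step would be performed by the framework between the two phases.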
Excursion: MapReduce
- Generalized framework
  [Diagram: a Master coordinates Workers. Input files are divided into Split 1..4; Map workers emit intermediate pairs k1:v1, k3:v2, k1:v3, k2:v4, k2:v5, k4:v6; after sort/group, pairs with the same key are brought together (k1: v1, v3; k2: v4, v5; k3: v2; k4: v6); Reduce workers write Output 1..3.]