1
Nutch in a Nutshell
Presented by Liew Guo Min and Zhao Jin
2
Outline
- Recap
- Special features
- Running Nutch in a distributed environment (with demo)
- Q&A
- Discussion
3
Recap
- A complete web search engine: Nutch = Crawler + Indexer/Searcher (Lucene) + GUI + Plugins + MapReduce & Distributed FS (Hadoop)
- Java-based, open source
- Features: customizable, extensible, distributed
4
Nutch as a crawler
[Diagram: the crawl cycle — the Injector seeds the CrawlDB with the initial URLs; the Generator reads the CrawlDB and generates a segment; the Fetcher gets the webpages/files; the Parser parses them; the CrawlDBTool updates the CrawlDB with the results.]
5
Special Features: Extensible (Plugin system)
Most of the essential functionality of Nutch is implemented as plugins.
Three layers:
- Extension points: what can be extended (Protocol, Parser, ScoringFilter, etc.)
- Extensions: the interfaces to be implemented for the extension points
- Plugins: the actual implementations
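Not from the slides: a minimal sketch of what an extension implementation might look like, assuming the URLFilter extension point and a hypothetical plugin class; the exact interface (package names, Configurable methods, declared exceptions) varies between Nutch versions, so treat this as an illustration only.

```java
package org.example.nutch.urlfilter;

import org.apache.hadoop.conf.Configuration;
import org.apache.nutch.net.URLFilter;

// Hypothetical plugin: a URL filter that rejects any URL containing a query string.
public class NoQueryUrlFilter implements URLFilter {

  private Configuration conf;

  // Return the URL (possibly rewritten) to accept it, or null to reject it.
  public String filter(String urlString) {
    return urlString.contains("?") ? null : urlString;
  }

  // URLFilter extends Configurable in recent Nutch versions.
  public void setConf(Configuration conf) {
    this.conf = conf;
  }

  public Configuration getConf() {
    return conf;
  }
}
```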
6
Special Features: Extensible (Plugin system)
Anyone can write a plugin:
- Write the code
- Prepare the metadata files (a sketch of plugin.xml follows below):
  - plugin.xml: what has been extended by what
  - build.xml: how ant can build your source code
- Ask Nutch to include your plugin in conf/nutch-site.xml
- Tell ant to build your plugin in src/plugin/build.xml
More details at http://wiki.apache.org/nutch/PluginCentral
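For illustration, a rough sketch of the plugin.xml metadata for the hypothetical URL filter above; the ids, names, and jar name are made up, and the attribute details should be checked against the Nutch version you use.

```xml
<plugin id="urlfilter-noquery" name="No-Query URL Filter"
        version="1.0.0" provider-name="example.org">
  <runtime>
    <!-- The jar that ant builds from this plugin's sources -->
    <library name="urlfilter-noquery.jar">
      <export name="*"/>
    </library>
  </runtime>
  <!-- Which extension point is extended, and by which class -->
  <extension id="org.example.nutch.urlfilter"
             name="No-Query URL Filter"
             point="org.apache.nutch.net.URLFilter">
    <implementation id="NoQueryUrlFilter"
                    class="org.example.nutch.urlfilter.NoQueryUrlFilter"/>
  </extension>
</plugin>
```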
7
Special Features: Extensible (Plugin system)
To use a plugin:
- Make sure you have modified conf/nutch-site.xml to include the plugin (see the sketch below)
- Then, either Nutch will automatically call it when needed, or you can instantiate it yourself by its class name and use it directly
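A hedged sketch of the relevant nutch-site.xml override: plugin.includes is a regular expression over plugin ids, and the default list shown here is only indicative, since it differs between Nutch releases.

```xml
<configuration>
  <property>
    <name>plugin.includes</name>
    <!-- Append the hypothetical urlfilter-noquery plugin to the usual defaults -->
    <value>protocol-http|urlfilter-(regex|noquery)|parse-(text|html)|index-basic|query-(basic|site|url)</value>
  </property>
</configuration>
```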
8
Special Features: Distributed (Hadoop)
MapReduce: a framework for distributed programming (see the diagram in the excursion slides)
- Map: process the splits of the data to produce intermediate results, with keys indicating what should be grouped together later
- Reduce: process the intermediate results that share the same key and output the final result
9
Special Features: Distributed (Hadoop)
MapReduce in Nutch
Example 1: Parsing
- Input: the (url, content) files produced by the fetch
- Map(url, content): parse the content by calling the parser plugins
- Reduce: identity
Example 2: Dumping a segment
- Input: the files of a segment
- Map: identity
- Reduce(url, value*): simply concatenate the text representation of the values (a sketch of this step follows below)
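Not the actual Nutch implementation, but a minimal self-contained sketch of the idea behind the segment-dump reduce step: for each URL, concatenate the text form of all values grouped under it. The sample values are hypothetical stand-ins for the records a segment holds.

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

public class DumpReduceSketch {

  // Reduce(url, value*): concatenate the text representation of every value
  // grouped under the same URL.
  static String reduce(String url, Iterator<?> values) {
    StringBuilder out = new StringBuilder(url).append(":\n");
    while (values.hasNext()) {
      out.append(values.next().toString()).append('\n');
    }
    return out.toString();
  }

  public static void main(String[] args) {
    // Hypothetical values; in Nutch these would be the records stored in a segment.
    List<String> values = Arrays.asList("CrawlDatum{status=fetched}", "ParseText{Hello world}");
    System.out.print(reduce("http://example.org/", values.iterator()));
  }
}
```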
10
Special Features: Distributed (Hadoop)
Distributed file system (HDFS)
- Write-once-read-many coherence model
- High throughput
- Master/slave design: a simple architecture, but the master is a single point of failure
- Transparent access via the Java API (see the sketch below)
More info at http://lucene.apache.org/hadoop/hdfs_design.html
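A hedged sketch of reading a file through the Hadoop FileSystem Java API; the class and method names follow the org.apache.hadoop.fs API and may differ slightly in the early Hadoop release this talk used.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Reads a file from the configured file system (HDFS when fs.default.name
// points at a namenode) with the same code that would read from local disk.
public class HdfsReadSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();   // picks up hadoop-site.xml
    FileSystem fs = FileSystem.get(conf);       // DFS or local FS, transparently
    FSDataInputStream in = fs.open(new Path(args[0]));
    BufferedReader reader = new BufferedReader(new InputStreamReader(in));
    String line;
    while ((line = reader.readLine()) != null) {
      System.out.println(line);
    }
    reader.close();
  }
}
```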
11
Running Nutch in a distributed environment
MapReduce: in hadoop-site.xml
- Specify the job tracker host & port: mapred.job.tracker
- Specify the task numbers: mapred.map.tasks, mapred.reduce.tasks
- Specify the location for temporary files: mapred.local.dir
(A sample fragment follows below.)
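A hedged sample of what the MapReduce part of hadoop-site.xml might look like; the host name, port, paths, and task counts are placeholders that depend on your cluster.

```xml
<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>jobtracker.example.org:9001</value>   <!-- placeholder host:port -->
  </property>
  <property>
    <name>mapred.map.tasks</name>
    <value>4</value>                             <!-- placeholder task count -->
  </property>
  <property>
    <name>mapred.reduce.tasks</name>
    <value>2</value>                             <!-- placeholder task count -->
  </property>
  <property>
    <name>mapred.local.dir</name>
    <value>/tmp/hadoop/mapred/local</value>      <!-- placeholder path -->
  </property>
</configuration>
```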
12
Running Nutch in a distributed environment
DFS: in hadoop-site.xml
- Specify the namenode host, port & directory: fs.default.name, dfs.name.dir
- Specify the location for files on each datanode: dfs.data.dir
(A sample fragment follows below.)
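And a hedged sample of the DFS part of the same file, again with placeholder host and paths.

```xml
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>namenode.example.org:9000</value>   <!-- placeholder namenode host:port -->
  </property>
  <property>
    <name>dfs.name.dir</name>
    <value>/data/hadoop/dfs/name</value>       <!-- placeholder: namenode metadata -->
  </property>
  <property>
    <name>dfs.data.dir</name>
    <value>/data/hadoop/dfs/data</value>       <!-- placeholder: block storage on each datanode -->
  </property>
</configuration>
```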
13
Demo time!
14
Q&A
15
Discussion
16
Exercises
Hands-on exercises:
- Install Nutch, crawl a few webpages using the crawl command, and perform a search on them using the GUI (a sample command line follows below)
- Repeat the crawling process without using the crawl command
- Modify your configuration to perform each of the following crawl jobs and think about when they would be useful:
  - Crawl only webpages and PDFs, but nothing else
  - Crawl the files on your hard disk
  - Crawl but do not parse
- (Challenging) Modify Nutch so that you can unpack the crawled files in the segments back into their original state
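For the first exercise, a hedged example of the one-shot crawl command; the seed directory and output directory names are made up, and the flag names and defaults vary between Nutch versions.

```sh
# Put your seed URLs, one per line, in a file under the urls/ directory (hypothetical layout),
# then run the whole inject/generate/fetch/parse/update/index cycle in one go:
bin/nutch crawl urls -dir crawl.demo -depth 3 -topN 50
```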
17
References
- http://wiki.apache.org/nutch/PluginCentral -- Information on Nutch plugins
- http://lucene.apache.org/hadoop/ -- Hadoop homepage
- http://wiki.apache.org/lucene-hadoop/ -- Hadoop wiki
- http://wiki.apache.org/nutch-data/attachments/Presentations/attachments/mapred.pdf -- "MapReduce in Nutch"
- http://wiki.apache.org/nutch-data/attachments/Presentations/attachments/oscon05.pdf -- "Scalable Computing with MapReduce"
- http://www.mail-archive.com/nutch-commits@lucene.apache.org/msg01951.html -- Updated tutorial on setting up Nutch, Hadoop and Lucene together
18
Excursion: MapReduce
Problem: find the number of occurrences of "cat" in a file. What if the file is 20 GB large? Why not do it with more computers?
Solution: [Diagram: the file is split into Split 1 and Split 2; PC1 counts 200 occurrences in its split, PC2 counts 300, and the partial counts are combined on PC1 to give 500.]
19
Excursion: MapReduce
Problem: find the number of occurrences of both "cat" and "dog" in a very large file.
Solution: [Diagram: Input files -> Map -> Intermediate files -> Sort/Group -> Reduce -> Output files. The file is split into Split 1 and Split 2; Map on PC1 emits cat: 200, dog: 250 and Map on PC2 emits cat: 300, dog: 250; Sort/Group collects cat: 200, 300 and dog: 250, 250; Reduce on PC1 outputs cat: 500 and Reduce on PC2 outputs dog: 500.] (A code sketch of these steps follows below.)
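Not from the slides: a minimal single-process sketch of the same map / sort-group / reduce steps for the cat-and-dog count, with two short strings standing in for the input splits.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative sketch: map emits (word, count) pairs per split, the pairs are
// grouped by key, and reduce sums each group.
public class CatDogMapReduceSketch {

  // Map: count "cat" and "dog" in one split and emit the partial counts.
  static Map<String, Integer> map(String split) {
    Map<String, Integer> partial = new HashMap<String, Integer>();
    for (String token : split.split("\\s+")) {
      if (token.equals("cat") || token.equals("dog")) {
        Integer n = partial.get(token);
        partial.put(token, n == null ? 1 : n + 1);
      }
    }
    return partial;
  }

  // Reduce: sum all partial counts grouped under the same word.
  static int reduce(String word, List<Integer> counts) {
    int total = 0;
    for (int c : counts) total += c;
    return total;
  }

  public static void main(String[] args) {
    String[] splits = { "cat dog cat fish", "dog cat dog bird" };

    // Sort/Group phase: gather every (word, count) pair under its key.
    Map<String, List<Integer>> grouped = new HashMap<String, List<Integer>>();
    for (String split : splits) {
      for (Map.Entry<String, Integer> pair : map(split).entrySet()) {
        List<Integer> values = grouped.get(pair.getKey());
        if (values == null) {
          values = new ArrayList<Integer>();
          grouped.put(pair.getKey(), values);
        }
        values.add(pair.getValue());
      }
    }

    for (Map.Entry<String, List<Integer>> group : grouped.entrySet()) {
      System.out.println(group.getKey() + ": " + reduce(group.getKey(), group.getValue()));
    }
  }
}
```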
20
Excursion: MapReduce
Generalized framework: [Diagram: Input files (Split 1-4) -> Map workers -> Intermediate files of key:value pairs (k1:v1, k3:v2, k1:v3, k2:v4, k2:v5, k4:v6) -> Sort/Group gathers the pairs with the same key (k1, k2, k3, k4) -> Reduce workers -> Output files (Output 1-3), with a Master coordinating the workers.]