1
Extensible Information Retrieval with Apache Nutch Aaron Elkiss 16-Feb-2006
2
Why use Nutch? Front-end to large collections of documents Demonstrate research without writing lots of extra code
3
Outline Nutch - information retrieval –Pros & Cons –Crawling the Local Filesystem –How Nutch Works –Indexing a Database –Query Filters: Searching with Nutch
4
Nutch Open source search engine Written in Java Built on top of Apache Lucene
5
Advantages of Nutch Scalable –Index local host or entire Internet Portable –Runs anywhere with Java Flexible –Plugin system + API Code pretty easy to read & work with Better than implementing it yourself!
6
Disadvantages of Nutch Documentation still somewhat lacking Not yet fully mature No GUI Odd Tomcat setup Several “gotchas”
7
Crawling the Local Filesystem
Step 1: Create list of files to index
file_list:
/data0/projects/clairlib/CLAIR/aleClairlib.pl
/data0/projects/clairlib/CLAIR/buildALE.pl
/data0/projects/clairlib/CLAIR/get_cosine_example.pl
/data0/projects/clairlib/CLAIR/lookUpTFIDF.pl
/data0/projects/clairlib/CLAIR/makeCorpus.pl
/data0/projects/clairlib/CLAIR/normalize_cosines.pl
/data0/projects/clairlib/CLAIR/queryALE.pl
/data0/projects/clairlib/CLAIR/testCluster.pl
/data0/projects/clairlib/CLAIR/testCorpusDownload.pl
/data0/projects/clairlib/CLAIR/testDocument.pl
/data0/projects/clairlib/CLAIR/testDocumentPair.pl
/data0/projects/clairlib/CLAIR/testIP.pl
/data0/projects/clairlib/CLAIR/testUtil.pl
/data0/projects/clairlib/CLAIR/testWebSearch.pl
/data0/projects/clairlib/CLAIR/NSIR/bin/testNSIR.pl
/data0/projects/clairlib/CLAIR/NSIR/bin/nsir_web.pl
/data0/projects/clairlib/CLAIR/NSIR/lib/NSIR/utilities/Parser.pl
/data0/projects/clairlib/CLAIR/NSIR/lib/NSIR/utilities/Tnt2PreCass.pl
/data0/projects/clairlib/CLAIR/NSIR/lib/NSIR/utilities/cleanEmptySentences.pl
/data0/projects/clairlib/CLAIR/NSIR/lib/NSIR/utilities/cleanPunctuation_tnt.pl
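A list like this does not have to be typed by hand. The following is a minimal sketch, assuming the list should hold every `.pl` file under some root directory, one absolute path per line. `build_file_list` and `write_file_list` are hypothetical helpers written for this example; they are not part of Nutch.

```python
import os


def build_file_list(root, suffix=".pl"):
    """Return the sorted absolute paths of all files under root ending in suffix."""
    paths = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if name.endswith(suffix):
                paths.append(os.path.abspath(os.path.join(dirpath, name)))
    return sorted(paths)


def write_file_list(paths, out_path):
    """Write one path per line to out_path."""
    with open(out_path, "w", encoding="utf-8") as out:
        for path in paths:
            out.write(path + "\n")
```

For the slide's data, the call would look like `write_file_list(build_file_list("/data0/projects/clairlib/CLAIR"), "file_list")`.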
8
Crawling the Local Filesystem Step 2: Edit Configuration –crawl-urlfilter.txt Very restrictive by default Must allow file: URLs
9
crawl-urlfilter.txt default
# Each non-comment, non-blank line contains a regular expression
# prefixed by '+' or '-'. The first matching pattern in the file
# determines whether a URL is included or ignored. If no pattern
# matches, the URL is ignored.
# skip file:, ftp:, & mailto: urls
-^(file|ftp|mailto):
# skip image and other suffixes we can't yet parse
-\.(gif|GIF|jpg|JPG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|png|PNG)$
# skip URLs containing certain characters as probable queries, etc.
-[?*!@=]
# accept hosts in MY.DOMAIN.NAME
+^http://([a-z0-9]*\.)*MY.DOMAIN.NAME/
# skip everything else
-.
10
crawl-urlfilter.txt (edited to allow file: URLs)
# Each non-comment, non-blank line contains a regular expression
# prefixed by '+' or '-'. The first matching pattern in the file
# determines whether a URL is included or ignored. If no pattern
# matches, the URL is ignored.
# skip image and other suffixes we can't yet parse
-\.(gif|GIF|jpg|JPG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|png|PNG)$
# allow everything else
+.
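The first-match rule described in the file's comments can be simulated in a few lines. This is a sketch of the rule semantics only, not Nutch's filter code. It assumes each pattern may match anywhere in the URL (search rather than full match), and `parse_rules` and `accepts` are hypothetical names.

```python
import re

SUFFIXES = (r"\.(gif|GIF|jpg|JPG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz"
            r"|rpm|tgz|mov|MOV|exe|png|PNG)$")

# Rule sets transcribed from the two slides: the shipped default and the
# edited version that lets file: URLs through.
DEFAULT_RULES = "\n".join([
    "-^(file|ftp|mailto):",
    "-" + SUFFIXES,
    "-[?*!@=]",
    r"+^http://([a-z0-9]*\.)*MY.DOMAIN.NAME/",
    "-.",
])
LOCAL_RULES = "\n".join(["-" + SUFFIXES, "+."])


def parse_rules(text):
    """Turn rule text into a list of (include, compiled_pattern) pairs."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        if line[0] not in "+-":
            raise ValueError("rule must start with '+' or '-': " + line)
        rules.append((line[0] == "+", re.compile(line[1:])))
    return rules


def accepts(rules, url):
    """The first matching pattern decides; a URL matching nothing is ignored."""
    for include, pattern in rules:
        if pattern.search(url):  # assumption: match anywhere in the URL
            return include
    return False
```

With these rules, `file:/data0/projects/clairlib/CLAIR/testUtil.pl` is rejected by the default set (the first rule matches) and accepted by the edited set.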
11
Crawling the Local Filesystem
Step 3: Edit Configuration
–nutch-site.xml (overrides nutch-default.xml)
Enable protocol-file plugin and parse plugins
<property>
  <name>plugin.includes</name>
  <value>nutch-extensionpoints|protocol-file|urlfilter-regex|parse-(text|html|pdf|msword)|index-basic|query-(basic|site|url)</value>
  <description>Regular expression naming plugin directory names to include. Any plugin not matching this expression is excluded. In any case you need at least include the nutch-extensionpoints plugin. By default Nutch includes crawling just HTML and plain text via HTTP, and basic indexing and search plugins.</description>
</property>
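To see what this expression selects, it can be applied to a list of plugin directory names. A minimal sketch, assuming the expression must match the whole directory name; `enabled_plugins` is a hypothetical helper and the directory names in the usage note are examples only.

```python
import re

# Value of plugin.includes from the slide.
PLUGIN_INCLUDES = (
    r"nutch-extensionpoints|protocol-file|urlfilter-regex|"
    r"parse-(text|html|pdf|msword)|index-basic|query-(basic|site|url)"
)


def enabled_plugins(plugin_dirs, includes=PLUGIN_INCLUDES):
    """Return the directory names that the expression matches in full."""
    pattern = re.compile(includes)
    return [name for name in plugin_dirs if pattern.fullmatch(name)]
```

For example, `enabled_plugins(["protocol-file", "protocol-http", "parse-pdf"])` keeps `protocol-file` and `parse-pdf` and drops `protocol-http`, which is the point of this step: the file protocol replaces HTTP for a local crawl.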
12
Crawling the Local Filesystem
Step 4: Run the crawl
–bin/nutch crawl myurls
Step 5: Start Tomcat
–GOTCHA: must start in the crawl directory!
–Or edit WEB-INF/classes/nutch-site.xml
<property>
  <name>searcher.dir</name>
  <value>/oriole0/nutch-0.7.1/crawl-20051208231019</value>
</property>
13
Modifying the Results Page Just customize search.jsp! For example, display external ‘citations’ link instead of ‘anchors’
14
How Nutch Works
Protocol plugin
URL → Protocol.getProtocolOutput → Content
Content: byte[] content, String contentType, URL url, Properties metadata
15
How Nutch Works
Parsing plugins
URL → Protocol.getProtocolOutput → Content → Parser.getParse → Parse
Content: byte[] content, String contentType, URL url, Properties metadata
Parse: String text, ParseData data
ParseData: Properties metadata, Outlink[] outlinks, String title, ParseStatus status
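The two-stage flow can be mirrored in a small model. This is a Python sketch of the data shapes named on the slide, not the Nutch Java API. The toy functions `get_protocol_output` and `get_parse` are assumptions that handle only plain-text `file:` URLs.

```python
from dataclasses import dataclass, field


@dataclass
class Content:
    """What the protocol step hands to the parser."""
    url: str
    content: bytes
    content_type: str
    metadata: dict = field(default_factory=dict)


@dataclass
class ParseData:
    status: str
    title: str
    outlinks: list = field(default_factory=list)
    metadata: dict = field(default_factory=dict)


@dataclass
class Parse:
    """What the parse step hands to the indexer."""
    text: str
    data: ParseData


def get_protocol_output(url):
    """Toy protocol step: read the bytes behind a file: URL."""
    path = url[len("file:"):]
    with open(path, "rb") as handle:
        raw = handle.read()
    return Content(url=url, content=raw, content_type="text/plain")


def get_parse(content):
    """Toy parse step: decode the bytes and use the first line as the title."""
    text = content.content.decode("utf-8", errors="replace")
    lines = text.splitlines()
    title = lines[0] if lines else ""
    return Parse(text=text, data=ParseData(status="success", title=title))
```

The design point is the split itself: the protocol step knows how to fetch bytes, the parse step knows how to turn bytes into text and metadata, and either can be swapped without touching the other.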
16
Indexing a Database Need to write a new plugin Luckily interface is pretty simple Much less tightly coupled than full-text search inside database
17
Indexing a Database Approach –Get the text out –Generate a 1:1 mapping from URLs to documents in the database
18
Indexing a Database Protocol plugin –Replaces default ‘http’ plugin –Converts http request to database request
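The 1:1 mapping and the request translation can be sketched with an in-memory SQLite database. Everything here is hypothetical: the `papers` table, the `http://db.example/papers/` prefix, and the function names are assumptions for illustration, and the returned dictionary only imitates the fields of a Content object.

```python
import sqlite3

# Hypothetical URL scheme: one URL per database row.
URL_PREFIX = "http://db.example/papers/"


def url_for(doc_id):
    """Map a row id to its URL."""
    return URL_PREFIX + str(doc_id)


def doc_id_for(url):
    """Map a URL back to its row id (the inverse of url_for)."""
    if not url.startswith(URL_PREFIX):
        raise ValueError("not a database URL: " + url)
    return int(url[len(URL_PREFIX):])


def fetch(conn, url):
    """Answer a 'request' for url with a database lookup instead of HTTP."""
    row = conn.execute(
        "SELECT title, body FROM papers WHERE id = ?", (doc_id_for(url),)
    ).fetchone()
    if row is None:
        return None
    title, body = row
    return {
        "url": url,
        "content": body.encode("utf-8"),
        "contentType": "text/plain",
        "metadata": {"title": title},
    }


def make_demo_db():
    """Build a one-row demo database in memory."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE papers (id INTEGER PRIMARY KEY, title TEXT, body TEXT)")
    conn.execute("INSERT INTO papers VALUES (1, 'Nutch', 'Open source search engine')")
    return conn
```

Because the lookup already returns text and metadata, the parse step that follows has little left to do.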
19
Indexing a Database Parse plugin –Replaces text or HTML parser –Protocol plugin gets the text and metadata, so don’t need to do much here
20
Indexing a Database Configuration - plugin.xml
21
Indexing a Database Configuration - nutch-site.xml –Add correct plugin Make sure Nutch can find plugin –$NUTCH_HOME/plugins
22
Improving the Plugin Configuration via XML Determine which database to use for what URLs Automatically ‘crawl’ database Pass unknown URLs to default plugin
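One way to picture the routing improvement is a prefix table with a fallback. A minimal sketch under assumed names: `make_router`, the prefixes, and the handlers are all hypothetical, and a real plugin would read the table from its XML configuration.

```python
def make_router(routes, default_handler):
    """Build a function that dispatches a URL to the handler for its prefix.

    routes is a list of (url_prefix, handler) pairs. The longest matching
    prefix wins; URLs matching no prefix go to default_handler.
    """
    ordered = sorted(routes, key=lambda route: len(route[0]), reverse=True)

    def route(url):
        for prefix, handler in ordered:
            if url.startswith(prefix):
                return handler(url)
        return default_handler(url)

    return route


# Usage: two hypothetical databases plus the ordinary fetcher as fallback.
router = make_router(
    [
        ("http://db.example/papers/", lambda url: ("papers_db", url)),
        ("http://db.example/authors/", lambda url: ("authors_db", url)),
    ],
    default_handler=lambda url: ("default", url),
)
```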
23
Searching with Nutch Parse query - NutchAnalysis Filter query - QueryFilters Pass to Lucene - IndexSearcher –Optimization/caching - LuceneQueryOptimizer –Translate hits from Lucene back to Nutch
24
Query Filter Nutch Query QueryFilter. filter() Lucene Query
25
Date Query Filter Date query filter restricts by date
26
Basic Query Filter Boosts weight of particular fields Manipulates phrases
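The filter chain from the last three slides can be sketched as functions that each add clauses to a growing Lucene-style query string. This illustrates the idea only and is not Nutch's QueryFilter interface. The boost values, the `date:YYYYMMDD-YYYYMMDD` syntax, and all function names are assumptions.

```python
# Illustrative field boosts (assumed values, not read from a Nutch config).
FIELD_BOOSTS = {"url": 4.0, "anchor": 2.0, "title": 1.5, "content": 1.0}


def basic_filter(terms, clauses):
    """Expand each plain term across several fields, each with its own boost."""
    for term in terms:
        if ":" in term:
            continue  # field-specific terms are left to other filters
        alternatives = " OR ".join(
            f"{name}:{term}^{boost}" for name, boost in FIELD_BOOSTS.items()
        )
        clauses.append(f"+({alternatives})")
    return clauses


def date_filter(terms, clauses):
    """Turn date:START-END into a required range clause."""
    for term in terms:
        if term.startswith("date:"):
            start, _, end = term[len("date:"):].partition("-")
            clauses.append(f"+date:[{start} TO {end}]")
    return clauses


def to_lucene(query, filters=(basic_filter, date_filter)):
    """Run every filter over the parsed terms and join the clauses."""
    terms = query.split()
    clauses = []
    for query_filter in filters:
        clauses = query_filter(terms, clauses)
    return " ".join(clauses)
```

Each filter sees the same parsed terms and contributes only the clauses it understands, which is what lets new operators be added as separate plugins.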
27
Additional Query Filters Could implement relevance feedback in this framework Manual relevance feedback –could add morelike:somedocument operator Automatic relevance feedback - extend BasicQueryFilter
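Automatic relevance feedback could look roughly like the following: run the query once, take the top documents, and append their most frequent new terms to the query. This is a simple term-frequency sketch, not a method from the slides or from Nutch; `expand_query` and its parameters are hypothetical.

```python
from collections import Counter


def expand_query(query_terms, top_docs, extra_terms=2, stopwords=frozenset()):
    """Append the most frequent unseen terms from top_docs to the query.

    top_docs is a list of document texts assumed to be the best hits for
    the original query.
    """
    seen = set(query_terms)
    counts = Counter()
    for doc in top_docs:
        for word in doc.lower().split():
            if word not in seen and word not in stopwords:
                counts[word] += 1
    # Highest count first; ties broken alphabetically so the result is stable.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return list(query_terms) + [word for word, _count in ranked[:extra_terms]]
```

The expanded term list would then go through the same filter chain as any other query.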
28
Additional Capabilities Distributed searching –Nutch Distributed File System MapReduce a la Google More
29
Nutch Distributed Filesystem Write-once Stream-oriented (append-only, sequential read) Distributed, transparent, replicated, fault-tolerant Distribute index and content
30
MapReduce Distributed processing technique Idea from functional programming
31
Map
Apply same operation to several data items
Example (Python):
    def getDocument(docid):
        """ fetch document with given docid from database """
        # do some stuff...
        return document

    docids = [1, 2, 3, 4, 5]
    documents = map(getDocument, docids)
Mapping for individual items is independent - distributable!
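Because each call is independent, the same map can be handed to a pool of workers. A minimal sketch using the standard library's `ThreadPoolExecutor`; `get_document` stands in for the slide's `getDocument` and returns a placeholder string instead of querying a database.

```python
from concurrent.futures import ThreadPoolExecutor


def get_document(docid):
    """Stand-in for a database fetch: return a placeholder document."""
    return f"document {docid}"


docids = [1, 2, 3, 4, 5]

# pool.map runs the calls on worker threads but yields results in input
# order, so the outcome matches the sequential map(get_document, docids).
with ThreadPoolExecutor(max_workers=4) as pool:
    documents = list(pool.map(get_document, docids))
```

Threads on one machine are only a small-scale analogy; the distributable property is the same one that lets the work be spread across hosts.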
32
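Reduce
Combine results of map operation
Simple example - sum of squares
    from functools import reduce  # reduce is not a builtin in Python 3

    measurements = [4, 2, 6, 9]

    def add(x, y):
        return x + y

    def square(x):
        return x ** 2  # ** is exponentiation; ^ is bitwise XOR in Python

    result = reduce(add, map(square, measurements))  # 16 + 4 + 36 + 81 = 137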
33
MapReduce in Nutch Can be used to distribute crawling, indexing, etc.
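Indexing fits the map/reduce shape naturally. The sketch below builds a tiny inverted index on a single machine. It illustrates the pattern and is not Nutch's MapReduce code; the documents and the function names are made up for the example.

```python
from functools import reduce


def map_doc(item):
    """Map step: emit one (word, doc_id) pair per distinct word in a document."""
    doc_id, text = item
    return [(word, doc_id) for word in sorted(set(text.lower().split()))]


def reduce_postings(index, pairs):
    """Reduce step: merge one document's pairs into the shared index."""
    for word, doc_id in pairs:
        index.setdefault(word, []).append(doc_id)
    return index


docs = [
    (1, "nutch is a search engine"),
    (2, "lucene is a search library"),
]

# Each map_doc call is independent of the others, so the map step could run
# on different machines; only the reduce step has to bring results together.
index = reduce(reduce_postings, map(map_doc, docs), {})
```

After this runs, `index["search"]` is `[1, 2]` and `index["nutch"]` is `[1]`.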
34
Conclusions Nutch is –featureful –flexible –extensible –scalable Get started with Nutch: http://lucene.apache.org/nutch Sample plugins and code samples: http://umich.edu/~aelkiss/nutch