Extensible Information Retrieval with Apache Nutch Aaron Elkiss 16-Feb-2006.

Why use Nutch? Front-end to large collections of documents Demonstrate research without writing lots of extra code

Outline Nutch - information retrieval –Pros & Cons –Crawling the Local Filesystem –How Nutch Works –Indexing a Database –Query Filters: Searching with Nutch

Nutch Open source search engine Written in Java Built on top of Apache Lucene

Advantages of Nutch Scalable –Index local host or entire Internet Portable –Runs anywhere with Java Flexible –Plugin system + API Code pretty easy to read & work with Better than implementing it yourself!

Disadvantages of Nutch Documentation still somewhat lacking Not yet fully mature No GUI Odd Tomcat setup Several “gotchas”

Crawling the Local Filesystem
Step 1: Create list of files to index

file_list:
/data0/projects/clairlib/CLAIR/aleClairlib.pl
/data0/projects/clairlib/CLAIR/buildALE.pl
/data0/projects/clairlib/CLAIR/get_cosine_example.pl
/data0/projects/clairlib/CLAIR/lookUpTFIDF.pl
/data0/projects/clairlib/CLAIR/makeCorpus.pl
/data0/projects/clairlib/CLAIR/normalize_cosines.pl
/data0/projects/clairlib/CLAIR/queryALE.pl
/data0/projects/clairlib/CLAIR/testCluster.pl
/data0/projects/clairlib/CLAIR/testCorpusDownload.pl
/data0/projects/clairlib/CLAIR/testDocument.pl
/data0/projects/clairlib/CLAIR/testDocumentPair.pl
/data0/projects/clairlib/CLAIR/testIP.pl
/data0/projects/clairlib/CLAIR/testUtil.pl
/data0/projects/clairlib/CLAIR/testWebSearch.pl
/data0/projects/clairlib/CLAIR/NSIR/bin/testNSIR.pl
/data0/projects/clairlib/CLAIR/NSIR/bin/nsir_web.pl
/data0/projects/clairlib/CLAIR/NSIR/lib/NSIR/utilities/Parser.pl
/data0/projects/clairlib/CLAIR/NSIR/lib/NSIR/utilities/Tnt2PreCass.pl
/data0/projects/clairlib/CLAIR/NSIR/lib/NSIR/utilities/cleanEmptySentences.pl
/data0/projects/clairlib/CLAIR/NSIR/lib/NSIR/utilities/cleanPunctuation_tnt.pl
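A list like this can be generated instead of typed by hand. The following is a minimal sketch, not part of Nutch: the helper name `list_file_urls` and the `.pl` suffix are illustrative, and emitting `file:` URLs rather than bare paths is an assumption based on the filter configuration having to allow `file:` URLs.

```python
import os
from pathlib import Path


def list_file_urls(root, suffix=".pl"):
    """Return sorted file: URLs for every file under root ending in suffix."""
    urls = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if name.endswith(suffix):
                # as_uri() needs an absolute path, so resolve first.
                urls.append(Path(dirpath, name).resolve().as_uri())
    return sorted(urls)
```

Writing the result to a file, one URL per line, gives a seed list in the same shape as file_list.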

Crawling the Local Filesystem Step 2: Edit Configuration –crawl-urlfilter.txt Very restrictive by default Must allow file: URLs

crawl-urlfilter.txt default

# Each non-comment, non-blank line contains a regular expression
# prefixed by '+' or '-'. The first matching pattern in the file
# determines whether a URL is included or ignored. If no pattern
# matches, the URL is ignored.

# skip file:, ftp:, & mailto: urls
-^(file|ftp|mailto):

# skip image and other suffixes we can't yet parse
-\.(gif|GIF|jpg|JPG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|png|PNG)$

# skip URLs containing certain characters as probable queries, etc.
-[?*!@=]

# accept hosts in MY.DOMAIN.NAME
+^http://([a-z0-9]*\.)*MY.DOMAIN.NAME/

# skip everything else
-.

crawl-urlfilter.txt

# Each non-comment, non-blank line contains a regular expression
# prefixed by '+' or '-'. The first matching pattern in the file
# determines whether a URL is included or ignored. If no pattern
# matches, the URL is ignored.

# skip image and other suffixes we can't yet parse
-\.(gif|GIF|jpg|JPG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|png|PNG)$

# allow everything else
+.
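The first-match-wins rule described in the file's comments can be sketched in a few lines. This is an illustration of the semantics only, not Nutch's filter code: `parse_rules` and `url_allowed` are hypothetical names, the suffix list is shortened, and using a substring search (rather than a whole-string match) is an assumption.

```python
import re


def parse_rules(text):
    """Turn filter-file text into a list of (allow, compiled_regex) pairs."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        sign, pattern = line[0], line[1:]
        if sign not in "+-":
            raise ValueError(f"rule must start with '+' or '-': {line!r}")
        rules.append((sign == "+", re.compile(pattern)))
    return rules


def url_allowed(url, rules):
    """The first matching rule decides; no match means the URL is ignored."""
    for allow, regex in rules:
        if regex.search(url):
            return allow
    return False


# Shortened version of the default file: file: URLs are rejected first.
DEFAULT_RULES = parse_rules(r"""
-^(file|ftp|mailto):
-\.(gif|GIF|jpg|JPG|png|PNG)$
-[?*!@=]
+^http://([a-z0-9]*\.)*MY.DOMAIN.NAME/
-.
""")

# Shortened version of the edited file: only unparseable suffixes are skipped.
LOCAL_RULES = parse_rules(r"""
-\.(gif|GIF|jpg|JPG|png|PNG)$
+.
""")
```

With the default rules a `file:` URL is rejected by the first line; with the edited rules the same URL falls through to `+.` and is accepted.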

Crawling the Local Filesystem
Step 3: Edit Configuration
–nutch-site.xml (overrides nutch-default.xml)
Enable protocol-file plugin and parse plugins

<property>
  <name>plugin.includes</name>
  <value>nutch-extensionpoints|protocol-file|urlfilter-regex|parse-(text|html|pdf|msword)|index-basic|query-(basic|site|url)</value>
  <description>Regular expression naming plugin directory names to include. Any plugin not matching this expression is excluded. In any case you need at least include the nutch-extensionpoints plugin. By default Nutch includes crawling just HTML and plain text via HTTP, and basic indexing and search plugins.</description>
</property>
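To check which plugin directories a given plugin.includes value would select, the expression can be tried against candidate names. A minimal sketch: `included_plugins` is a hypothetical helper, and requiring the expression to match the whole directory name is an assumption about how Nutch applies it.

```python
import re

PLUGIN_INCLUDES = (
    r"nutch-extensionpoints|protocol-file|urlfilter-regex|"
    r"parse-(text|html|pdf|msword)|index-basic|query-(basic|site|url)"
)


def included_plugins(plugin_dirs, pattern=PLUGIN_INCLUDES):
    """Return the directory names that the whole expression matches."""
    regex = re.compile(pattern)
    return [name for name in plugin_dirs if regex.fullmatch(name)]
```

Under that assumption, protocol-file is selected while protocol-http is not, which is the change this step is after.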

Crawling the Local Filesystem
Step 4: Run the crawl
–bin/nutch crawl myurls
Step 5: Start Tomcat
–GOTCHA: must start in the crawl directory!
–Or edit WEB-INF/classes/nutch-site.xml

<property>
  <name>searcher.dir</name>
  <value>/oriole0/nutch-0.7.1/crawl-20051208231019</value>
</property>

Modifying the Results Page Just customize search.jsp! For example, display external ‘citations’ link instead of ‘anchors’

How Nutch Works
Protocol plugin
URL → Protocol.getProtocolOutput → Content
Content: byte[] content, String contentType, URL url, Properties metadata

How Nutch Works
Parsing plugins
URL → Protocol.getProtocolOutput → Content → Parser.getParse → Parse
Content: byte[] content, String contentType, URL url, Properties metadata
Parse: String text, ParseData data
ParseData: Properties metadata, Outlink[] outlinks, String title, ParseStatus status
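The data flow on these two slides can be mimicked with plain Python stand-ins. This is only a sketch of the shape of the pipeline, not the Nutch Java API: the classes mirror the field names on the slides, while `fetch`, `parse_text`, and the in-memory `store` are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Content:
    """Stand-in for what a protocol plugin returns."""
    url: str
    content: bytes
    content_type: str
    metadata: dict = field(default_factory=dict)


@dataclass
class ParseData:
    """Stand-in for the structured part of a parse result."""
    title: str
    outlinks: list
    metadata: dict
    status: str


@dataclass
class Parse:
    """Stand-in for what a parse plugin returns."""
    text: str
    data: ParseData


def fetch(url, store):
    """Toy protocol step: look the URL up in an in-memory store."""
    return Content(url=url, content=store[url], content_type="text/plain")


def parse_text(content):
    """Toy parse step: decode the bytes and take the first line as the title."""
    text = content.content.decode("utf-8")
    lines = text.splitlines()
    title = lines[0] if lines else ""
    outlinks = [word for word in text.split() if word.startswith("http://")]
    data = ParseData(title=title, outlinks=outlinks,
                     metadata=dict(content.metadata), status="success")
    return Parse(text=text, data=data)
```

The point of the split is that the protocol step knows how to get bytes for a URL and the parse step knows how to turn bytes into text, title, and outlinks; either can be swapped independently.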

Indexing a Database Need to write a new plugin Luckily interface is pretty simple Much less tightly coupled than full-text search inside database

Indexing a Database Approach –Get the text out –Generate a 1:1 mapping from URLs to documents in the database

Indexing a Database Protocol plugin –Replaces default ‘http’ plugin –Converts http request to database request

Indexing a Database Parse plugin –Replaces text or HTML parser –Protocol plugin gets the text and metadata, so don’t need to do much here
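The approach on the last three slides (a 1:1 URL-to-document mapping, a protocol step that turns a URL request into a database request, and a parse step with little left to do) might look like this in miniature. Everything here is assumed for illustration: the `http://db.example/documents/` URL scheme, the `documents` table and its columns, and all function names. The actual plugin is written against Nutch's Java interfaces.

```python
import sqlite3

# Hypothetical URL scheme: one URL per row, http://db.example/documents/<id>
URL_PREFIX = "http://db.example/documents/"


def url_for(doc_id):
    """Map a document id to its URL."""
    return f"{URL_PREFIX}{doc_id}"


def doc_id_for(url):
    """Map a URL back to its document id (inverse of url_for)."""
    if not url.startswith(URL_PREFIX):
        raise ValueError(f"not a database URL: {url}")
    return int(url[len(URL_PREFIX):])


def fetch_from_db(conn, url):
    """Protocol step: answer a URL 'request' with a database lookup."""
    row = conn.execute(
        "SELECT title, body FROM documents WHERE id = ?", (doc_id_for(url),)
    ).fetchone()
    if row is None:
        return None  # no such document
    title, body = row
    return {"url": url, "text": body, "metadata": {"title": title}}


def parse(fetched):
    """Parse step: the protocol step already extracted text and metadata."""
    return fetched["text"], fetched["metadata"]
```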

Indexing a Database Configuration - plugin.xml

Indexing a Database Configuration - nutch-site.xml –Add correct plugin Make sure Nutch can find plugin –$NUTCH_HOME/plugins

Improving the Plugin Configuration via XML Determine which database to use for what URLs Automatically ‘crawl’ database Pass unknown URLs to default plugin
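Two of these improvements, routing URLs to the right database with a fallback to the default plugin and 'crawling' a database by enumerating its documents, can be sketched as follows. The names `make_router` and `seed_urls` are hypothetical, and a Python list of (prefix, handler) pairs stands in for the XML configuration the slide proposes.

```python
def make_router(routes, default_handler):
    """Build a dispatcher from (url_prefix, handler) pairs, checked in order."""
    def route(url):
        for prefix, handler in routes:
            if url.startswith(prefix):
                return handler(url)
        # Unknown URLs fall through to the default plugin.
        return default_handler(url)
    return route


def seed_urls(prefix, doc_ids):
    """'Crawl' a database by listing one URL per document id."""
    return [f"{prefix}{doc_id}" for doc_id in doc_ids]
```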

Searching with Nutch Parse query - NutchAnalysis Filter query - QueryFilters Pass to Lucene - IndexSearcher –Optimization/caching - LuceneQueryOptimizer –Translate hits from Lucene back to Nutch

Query Filter
Nutch Query → QueryFilter.filter() → Lucene Query

Date Query Filter Date query filter restricts by date

Basic Query Filter Boosts weight of particular fields Manipulates phrases
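A chain of query filters can be pictured as functions that each add clauses to the outgoing query. The sketch below builds a Lucene-style query string purely for illustration; Nutch's filters build Lucene query objects in Java. The boost values, the `date:start-end` term syntax, and all function names are assumptions.

```python
# Illustrative per-field boosts; not taken from the slides.
FIELD_BOOSTS = {"url": 4.0, "anchor": 2.0, "title": 1.5, "content": 1.0}


def basic_filter(terms, clauses):
    """Expand each plain term across several fields, weighting each field."""
    for term in terms:
        if ":" in term:
            continue  # fielded terms are left to other filters
        parts = " ".join(
            f"{name}:{term}^{boost}" for name, boost in FIELD_BOOSTS.items()
        )
        clauses.append(f"+({parts})")
    return clauses


def date_filter(terms, clauses):
    """Turn a date:start-end term into a required range clause."""
    for term in terms:
        if term.startswith("date:"):
            start, end = term[len("date:"):].split("-")
            clauses.append(f"+date:[{start} TO {end}]")
    return clauses


def run_filters(query, filters=(basic_filter, date_filter)):
    """Pass the query terms through every filter and join the clauses."""
    terms = query.split()
    clauses = []
    for query_filter in filters:
        clauses = query_filter(terms, clauses)
    return " ".join(clauses)
```

Each filter only handles the terms it understands, so adding a new operator means adding one more function to the chain.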

Additional Query Filters Could implement relevance feedback in this framework Manual relevance feedback –could add morelike:somedocument operator Automatic relevance feedback - extend BasicQueryFilter
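One simple form of relevance feedback is query expansion: take the text of a document judged relevant and append its most frequent terms to the query. The sketch below shows that idea only; it is not tied to BasicQueryFilter, and `expand_query`, the stopword list, and the whitespace tokenization are all illustrative assumptions.

```python
from collections import Counter

# Tiny illustrative stopword list.
STOPWORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "with"}


def expand_query(query, feedback_text, extra_terms=3):
    """Append the most frequent non-stopword terms of feedback_text to query."""
    query_terms = query.lower().split()
    counts = Counter(
        word for word in feedback_text.lower().split()
        if word.isalpha() and word not in STOPWORDS and word not in query_terms
    )
    # Sort by count descending, then alphabetically, so ties are deterministic.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return query_terms + [word for word, _count in ranked[:extra_terms]]
```

A morelike:somedocument operator would supply feedback_text from the named document; automatic feedback would take it from the top-ranked hits of an initial search.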

Additional Capabilities Distributed searching –Nutch Distributed File System MapReduce a la Google More

Nutch Distributed Filesystem Write-once Stream-oriented (append-only, sequential read) Distributed, transparent, replicated, fault-tolerant Distribute index and content

MapReduce Distributed processing technique Idea from functional programming

Map
Apply same operation to several data items
Example (Python):

    def getDocument(docid):
        """ fetch document with given docid from database """
        # do some stuff...
        return document

    docids = [1, 2, 3, 4, 5]
    # list() is needed in Python 3, where map returns a lazy iterator
    documents = list(map(getDocument, docids))

Mapping for individual items is independent - distributable!
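Because each call is independent, the same map can be handed to a pool of workers. A minimal sketch using Python's standard library, with an in-memory dictionary standing in for the database (the `DATABASE` contents and the `fetch_all` helper are hypothetical):

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-in for a database of documents.
DATABASE = {1: "alpha", 2: "beta", 3: "gamma", 4: "delta", 5: "epsilon"}


def get_document(docid):
    """Fetch the document with the given docid from the stand-in database."""
    return DATABASE[docid]


def fetch_all(docids, workers=4):
    """Map get_document over docids using a pool of worker threads."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Executor.map preserves input order even if calls finish out of order.
        return list(pool.map(get_document, docids))
```

The calling code is the same as with the built-in map; only the executor decides where each call runs.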

Reduce
Combine results of map operation
Simple example - sum of squares

    from functools import reduce

    measurements = [4, 2, 6, 9]

    def add(x, y):
        return x + y

    def square(x):
        return x ** 2  # ** is exponentiation; ^ is bitwise XOR in Python

    result = reduce(add, map(square, measurements))  # 137

MapReduce in Nutch Can be used to distribute crawling, indexing, etc.
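Indexing is a natural fit for this pattern. The sketch below builds a tiny inverted index in three phases (map, group by key, reduce) in a single process. It illustrates the decomposition only and is not Nutch's MapReduce API; all function names and the whitespace tokenization are assumptions.

```python
from collections import defaultdict
from functools import reduce


def map_document(doc_id, text):
    """Map step: emit one (term, doc_id) pair per distinct term."""
    return [(term, doc_id) for term in sorted(set(text.lower().split()))]


def shuffle(pairs):
    """Group emitted values by key, as a framework would between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_postings(doc_ids):
    """Reduce step: merge the doc ids for one term into a sorted posting list."""
    return sorted(reduce(lambda acc, doc_id: acc | {doc_id}, doc_ids, set()))


def build_index(docs):
    """Run map, shuffle, and reduce over a {doc_id: text} dictionary."""
    pairs = [pair for doc_id, text in docs.items()
             for pair in map_document(doc_id, text)]
    return {term: reduce_postings(ids) for term, ids in shuffle(pairs).items()}
```

The map calls are independent per document and the reduce calls are independent per term, so both phases could be spread across machines.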

Conclusions Nutch is –featureful –flexible –extensible –scalable Get started with Nutch: http://lucene.apache.org/nutch Sample plugins and code samples: http://umich.edu/~aelkiss/nutch
