Introduction to Nutch CSCI 572: Information Retrieval and Search Engines Summer 2010.


May-20-10 CS572-Summer2010 CAM-2

Outline
- What is Nutch?
  – Motivation
  – Architecture
  – What currently exists?
  – How I got involved
- Deploying Nutch on NASA’s Planetary Data System (PDS)
  – Free text “Google-like” search of the PDS catalog
  – Architecture/Implementation

What is Nutch?
- The brainchild of Doug Cutting
  – Research/programmer guru who has worked at several high profile research labs (Yahoo, Bell Labs)
- Nutch builds upon Cutting’s lower level text indexing library and API called Lucene
- Nutch provides crawling services, protocol services, parsing services, content management services on top of the indexing capability provided by Lucene

Motivation
- Observation: Web Search is a commodity
  – Why can’t it be provided freely?
- Allows tweaking of typically “hidden” ranking algorithms
- Allows developers to focus less on the infrastructure (since Brin & Page’s paper, the infrastructure is well-known), and more on providing value-added capabilities

Motivation
- Value-added capabilities
  – Improving fetching speed
  – Parsing and handling of the hundreds of different content types available on the internet
  – Handling different protocols for obtaining content
  – Better ranking algorithms (OPIC, PageRank)
- More or less, in Nutch, these capabilities all map to extension points available via Nutch’s plugin framework
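To make the "better ranking algorithms" item concrete, here is a minimal sketch of PageRank by power iteration. It is the textbook formulation run on a toy link graph, not Nutch's scoring code; the function name, the damping value of 0.85, and the `toy_web` graph are all assumptions made for illustration.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Power-iteration PageRank over a dict mapping page -> list of outlinks."""
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Rank held by pages without outlinks is spread evenly over all pages.
        dangling = sum(rank[p] for p in pages if not links.get(p))
        new_rank = {p: (1.0 - damping) / n + damping * dangling / n for p in pages}
        for page, targets in links.items():
            if targets:
                # Each page splits its damped rank equally among its outlinks.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical four-page web: "c" is linked from three pages, "d" from none.
toy_web = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
ranks = pagerank(toy_web)
```

In this toy graph the page with the most inlinks ("c") ends up with the highest rank and the page nobody links to ("d") with the lowest, which is the behavior a scoring plugin would be expected to reproduce at crawl scale.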

Nutch’s Architecture
- Nutch Core facilities
  – Parsing
  – Indexing
  – Crawling
  – Content Management
  – Querying
  – Plugin Framework
- Nutch’s extension points
  – Scoring, Parsing, Indexing, Querying, URLFiltering
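The idea of extension points can be sketched as a registry that maps an extension-point name to the plugins registered for it. This is a simplified illustration in Python, not Nutch's actual (Java) plugin API: `PluginRegistry`, `accept_url`, and the two example filters are hypothetical names, and only the extension-point name "URLFiltering" comes from the slide.

```python
from collections import defaultdict


class PluginRegistry:
    """Hypothetical registry: extension-point name -> list of plugins."""

    def __init__(self):
        self._extensions = defaultdict(list)

    def register(self, extension_point, plugin):
        self._extensions[extension_point].append(plugin)

    def extensions(self, extension_point):
        return list(self._extensions[extension_point])


def accept_url(registry, url):
    """Pass a URL through every registered URL filter; None means rejected."""
    for url_filter in registry.extensions("URLFiltering"):
        url = url_filter(url)
        if url is None:
            return None
    return url


registry = PluginRegistry()
# Two illustrative filters: keep only http(s) URLs, and drop GIF images.
registry.register("URLFiltering", lambda url: url if url.startswith("http") else None)
registry.register("URLFiltering", lambda url: None if url.endswith(".gif") else url)
```

The core only knows the extension-point name and the calling convention, so a new filter can be added by registering it, without touching the crawl loop. The same pattern would apply to the Scoring, Parsing, Indexing, and Querying extension points.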

Nutch’s Architecture
- Maps to Search engine architecture proposed by Brin & Page
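As a rough illustration of that crawl, index, and query architecture, the sketch below runs all three stages on an in-memory "web". Everything in it is an assumption for illustration: the `PAGES` dictionary stands in for real fetching, and the inverted index is a plain dictionary rather than Lucene.

```python
import re
from collections import defaultdict, deque

# Hypothetical in-memory web: URL -> (page text, outlinks).
PAGES = {
    "u1": ("nutch builds on lucene", ["u2", "u3"]),
    "u2": ("lucene is a text indexing library", ["u1"]),
    "u3": ("hadoop underlies nutch", []),
    "u4": ("unreachable page about lucene", []),
}


def crawl(seed):
    """Crawler: breadth-first traversal from the seed; returns {url: text}."""
    seen, queue, fetched = {seed}, deque([seed]), {}
    while queue:
        url = queue.popleft()
        text, outlinks = PAGES[url]
        fetched[url] = text
        for link in outlinks:
            if link in PAGES and link not in seen:
                seen.add(link)
                queue.append(link)
    return fetched


def build_index(docs):
    """Indexer: inverted index mapping each term to the URLs containing it."""
    index = defaultdict(set)
    for url, text in docs.items():
        for term in re.findall(r"\w+", text.lower()):
            index[term].add(url)
    return index


def search(index, query):
    """Searcher: URLs that contain every query term (boolean AND)."""
    terms = re.findall(r"\w+", query.lower())
    if not terms:
        return set()
    results = set(index.get(terms[0], set()))
    for term in terms[1:]:
        results &= index.get(term, set())
    return results


index = build_index(crawl("u1"))
```

Page "u4" never appears in results because nothing links to it, so the crawler never fetches it. A real engine adds fetch scheduling, politeness, content parsing, and ranking on top of this skeleton.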

What Currently Exists?
- Version 0.6.x
  – First easily deployable version
- Version 0.7.x
  – Added several new features including several new parsers (MS-WORD, PowerPoint), URLFilter extension point, first Apache release after Incubation, mime type system
- Version 0.8.x
  – Completely new underlying architecture based on Hadoop
  – Parse plugins framework, multi-valued metadata container
  – Parser Factory enhancement
- Version 0.9.x
  – Major bug fixes
  – Hadoop, and Lucene library upgrades
- Version 1.0
  – Flexible filter framework
  – Flexible scoring
  – Initial integration with Tika
  – Full Search Engine functionality and capabilities, in production at large scale (Internet Archive)
- Version 1.1
For full list, see

What Doesn’t?
- Plenty!
- Bug fixes (> 200 issues in JIRA right now with no resolution)
- Nutch 2.0 architecture
  – Refactored Nutch architecture, delegating to Solr, HBase, Tika, and ORM

How I got involved
- In this very class!
  – Okay, well, it used to be called CS599, but you get the picture
- Started out by contributing an RSS parsing plugin
  – My final project in 599
- Moved on from there to
  – NUTCH-88, redesign of the parsing framework
  – NUTCH-139, Metadata container support
  – NUTCH-210, Web Context application file
  – Various other bug fixes and contributions here and there
  – Mailing list support
  – Wiki support
- Became a committer in October 2006
- Helped spin Nutch into an Apache TLP, March 2010; Nutch PMC member

Real world application of Nutch
- I work at NASA’s Jet Propulsion Laboratory
- NASA’s Planetary Data System
  – NASA’s archive for all planetary science data collected by missions over the past 30 years
  – Collected 20 TB over the past 30 years
    Increasing to over 200 TB in the next 3 years!
  – Built up a catalog of all data collected
- Where does Nutch fit in?

Where does Nutch fit into the PDS?
- The PDS Management Council decided they want “Google-like” search of the PDS catalog
- Our plan: use Nutch to implement this capability for PDS

PDS Google-like Search Architecture
[Figure: architecture diagram. Labels: Search Engine Architecture (e.g. Nutch, Google); PDS Catalog; PDS-D; Existing PDS Query; Indexer; Index; Lucene; Crawler; PDS Extract; Parser; PDS Parser; pds.war; Tomcat Web Server; Catalog Metadata]

Approach
- Export PDS catalog datasets in RDF format (flat files)
- Use Nutch to crawl the RDF files
  – protocol-file plugin in Nutch
- Wrote our own parse-pds plugin
  – Parse the RDF files, and then extract the metadata
- Wrote our own index-pds plugin
  – Index the fields that we want from the parsed metadata
- Wrote our own query-pds plugin
  – Search the index on the fields that we want
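The parse, index, and query steps above can be sketched end to end on a small RDF/XML sample. This is an illustrative Python sketch under stated assumptions, not the parse-pds, index-pds, or query-pds plugins themselves (those are Nutch plugins): the `pds` namespace URI, the `title` and `target` field names, and the two sample datasets are all hypothetical.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
PDS_NS = "http://example.org/pds#"  # hypothetical namespace for catalog fields

# Hypothetical export of two catalog datasets as an RDF/XML flat file.
SAMPLE_RDF = f"""<rdf:RDF xmlns:rdf="{RDF_NS}" xmlns:pds="{PDS_NS}">
  <rdf:Description rdf:about="urn:example:dataset-1">
    <pds:title>Mars surface images</pds:title>
    <pds:target>Mars</pds:target>
  </rdf:Description>
  <rdf:Description rdf:about="urn:example:dataset-2">
    <pds:title>Saturn ring occultations</pds:title>
    <pds:target>Saturn</pds:target>
  </rdf:Description>
</rdf:RDF>"""

INDEXED_FIELDS = ("title", "target")  # hypothetical choice of fields to index


def parse_datasets(rdf_text):
    """Parse step: return {dataset id: {field: value}} from RDF/XML text."""
    root = ET.fromstring(rdf_text)
    datasets = {}
    for desc in root.findall(f"{{{RDF_NS}}}Description"):
        dataset_id = desc.get(f"{{{RDF_NS}}}about")
        # Strip the "{namespace}" prefix so fields are keyed by local name.
        datasets[dataset_id] = {
            child.tag.split("}", 1)[1]: (child.text or "").strip() for child in desc
        }
    return datasets


def index_datasets(datasets):
    """Index step: map (field, lowercase token) -> set of dataset ids."""
    index = {}
    for dataset_id, metadata in datasets.items():
        for field in INDEXED_FIELDS:
            for token in metadata.get(field, "").lower().split():
                index.setdefault((field, token), set()).add(dataset_id)
    return index


def query(index, field, term):
    """Query step: dataset ids whose given field contains the term."""
    return sorted(index.get((field, term.lower()), set()))


pds_index = index_datasets(parse_datasets(SAMPLE_RDF))
```

Keying the index on (field, token) pairs mirrors the idea of searching only "on the fields that we want": a term that appears in a title does not match a query against the target field.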

Search Interface

Results

Lessons Learned
- Nutch currently isn’t exactly simple to deploy or configure
  – There is much discussion on the mailing lists that refers to “magic configuration” properties that aren’t intuitive
- Nutch documentation is currently… lacking
- If you know how to use Nutch, then it is extremely easy to use, and a time-saver
- Active participation in the mailing lists and wiki is necessary to use Nutch

Good News
- Nutch is here to stay
  – Only open source implementation for commodity web search
  – If you want to start your own Google++, Nutch is a great place to start
- Participation is welcome
  – Look what happened to me (student -> committer)
  – Plenty of areas to improve (including documentation)

Your Class Project
- It’s probably a good idea to at least take a look at Nutch, whether you use it or not
- You can see how a real implementation of theory described in class operates
  – Implemented in pure Java (1.5)
- Add/extend capabilities within Nutch
  – Help finish plugging Nutch into HBase
  – Configure Nutch using Spring
  – Fully integrate Nutch and Solr
  – Fix *important* bugs
  – Add more scoring algorithm implementations

Wrapup
- Thanks for your attention!
- Nutch home page:
- Mailing lists (developer’s, user’s)