Lessons Learned in the Development of a Web-scale Search Engine: Nutch2 and beyond Chris A. Mattmann Senior Computer Scientist, NASA Jet Propulsion Laboratory Adjunct Assistant Professor, Univ. of Southern California Member, Apache Software Foundation

Roadmap What is Nutch? What are the current versions of Nutch? What can it do? What did we do right? What did we do wrong? Where is Nutch going?

And you are? Apache Member involved in –Tika (VP, PMC), Nutch (PMC), Incubator (PMC), OODT (Mentor), SIS (Mentor), Lucy (Mentor) and Gora (Champion) Architect/Developer at NASA JPL in Pasadena, CA Software Architecture/Engineering Prof at USC

Nutch is… A project originally started by Doug Cutting Nutch builds upon the lower-level text indexing library and API called Lucene Nutch provides crawling services, protocol services, parsing services, and content management services on top of the indexing capability provided by Lucene Allows you to stand up a web-scale search infrastructure.
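As a rough illustration of the lower-level layer Nutch builds on, here is a minimal sketch of indexing one document with Lucene, using the Lucene 3.x-era API that matches this talk's timeframe (class and method names have changed in later Lucene releases); the field names and text are made up for the example.

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.RAMDirectory;
import org.apache.lucene.util.Version;

public class LuceneIndexSketch {
  public static void main(String[] args) throws Exception {
    // In-memory index, just for illustration; Nutch layers crawling, fetching,
    // parsing, and content management around this core indexing capability.
    RAMDirectory dir = new RAMDirectory();
    IndexWriterConfig cfg =
        new IndexWriterConfig(Version.LUCENE_36, new StandardAnalyzer(Version.LUCENE_36));
    IndexWriter writer = new IndexWriter(dir, cfg);

    Document doc = new Document();
    doc.add(new Field("url", "http://example.org/page",
        Field.Store.YES, Field.Index.NOT_ANALYZED));
    doc.add(new Field("content", "Example page text extracted by a parser.",
        Field.Store.NO, Field.Index.ANALYZED));
    writer.addDocument(doc);
    writer.close();
  }
}
```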

Community Mailing lists –User: 972 peeps –Dev: 520 peeps Committers/PMC –8 peeps –All 8 active: SERIOUSLY Releases –11 releases so far –Working on 2.0 Credit: svnsearch.org

Why Nutch? Observation: Web Search is a commodity –Why can’t it be provided freely? Allows tweaking of typically “hidden” ranking algorithms Allows developers to focus less on the infrastructure (since Brin & Page’s paper, the infrastructure is well-known), and more on providing value-added capabilities

Why Nutch? Value-added capabilities –Improving fetching speed –Parsing and handling of the hundreds of different content types available on the internet –Handling different protocols for obtaining content –Better ranking algorithms (OPIC, PageRank) More or less, in Nutch, these capabilities all map to extension points available via Nutch’s plugin framework
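To make the plugin idea concrete, here is a hedged sketch of what a URL-filtering extension might look like, written against the org.apache.nutch.net.URLFilter extension point of Nutch 1.x; interface details vary by version, and the package and host name are invented for the example.

```java
package org.example.nutch;                      // hypothetical package

import org.apache.hadoop.conf.Configuration;
import org.apache.nutch.net.URLFilter;

/** Keeps only URLs on an allowed host; everything else is dropped from the crawl. */
public class HostAllowListFilter implements URLFilter {
  private Configuration conf;

  @Override
  public String filter(String urlString) {
    // Returning the URL keeps it; returning null rejects it.
    return urlString.contains("example.org") ? urlString : null;
  }

  @Override
  public void setConf(Configuration conf) { this.conf = conf; }

  @Override
  public Configuration getConf() { return conf; }
}
```

A plugin like this would also ship a plugin.xml descriptor registering the class against the URLFilter extension point, which is how Nutch's plugin framework discovers it.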

Nutch’s Architecture Nutch Core facilities –Parsing –Indexing –Crawling –Content Acquisition –Querying –Plugin Framework Nutch’s extension points –Scoring, Parsing, Indexing, Querying, URLFiltering

Nutch’s Architecture Maps to Search engine architecture proposed by Brin & Page

What Currently Exists? Version 0.6.x –First easily deployable version Version 0.7.x –Added several new features including several new parsers (MS-WORD, PowerPoint), URLFilter extension point, first Apache release after Incubation, mime type system Version 0.8.x –Completely new underlying architecture based on Hadoop –Parse plugins framework, multi-valued metadata container –Parser Factory enhancement Version 0.9.x –Major bug fixes –Hadoop, and Lucene library upgrades Version 1.0 –Flexible filter framework –Flexible scoring –Initial integration with Tika –Full Search Engine functionality and capabilities, in production at large scale (Internet Archive)

What are the recent versions? Version 1.1, upgrade all Nutch library deps (Hadoop, Tika, etc.) and make Fetcher faster Version 1.2, fix some big time bugs (NPE in distributed search), lots of feature upgrades –You should be using this version

Some active dev areas Plenty! Bug fixes (> 200 issues in JIRA right now with no resolution) Nutch 2.0 architecture – http://search-lucene.com/m/gbrBF1RMWk9 –Refactored Nutch architecture, delegating to Solr, HBase, Tika, and ORM

Real world application of Nutch I work at NASA’s Jet Propulsion Laboratory NASA’s Planetary Data System –NASA’s archive for all planetary science data collected by missions over the past 30 years –Collected 20 TB over the past 30 years Increasing to over 200 TB in the next 3 years! –Built up a catalog of all data collected Where does Nutch fit in?

Where does Nutch fit into the PDS? PDS Management Council decided they want “Google-like” search of the PDS catalog Our plan: use Nutch to implement this capability for the PDS

PDS Google-like Search Architecture (architecture diagram showing the existing PDS Catalog and its metadata, a Crawler, PDS Extract Parser / PDS Parser, Indexer, Lucene Index, and a query interface delivered as pds.war on a Tomcat Web Server) Credit: D. Crichton, S. Hughes, P. Ramirez, R. Joyner, S. Hardman, C. Mattmann

Approach Export PDS catalog datasets in RDF format (flat files) Use Nutch to crawl the RDF files –protocol-file plugin in Nutch Wrote our own parse-pds plugin –Parse the RDF files, and then extract the metadata Wrote our own index-pds plugin –Index the fields that we want from the parsed metadata Wrote our own query-pds plugin –Search the index on the fields that we want
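The Nutch-specific wiring (the parse-pds and index-pds extension points) is not shown here, but the heart of the parsing step amounts to pulling metadata fields out of each exported RDF file. A minimal standalone sketch, with a made-up namespace and element name standing in for the real PDS vocabulary:

```java
import java.io.File;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

public class PdsRdfExtractSketch {
  public static void main(String[] args) throws Exception {
    DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
    factory.setNamespaceAware(true);
    Document rdf = factory.newDocumentBuilder().parse(new File(args[0]));

    // "dataSetName" and the namespace URI are placeholders; the actual PDS
    // export defines its own elements, which parse-pds maps to Nutch metadata.
    NodeList names = rdf.getElementsByTagNameNS("http://example.org/pds#", "dataSetName");
    for (int i = 0; i < names.getLength(); i++) {
      System.out.println("data set: " + names.item(i).getTextContent());
    }
  }
}
```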

Search Interface

Results

Some Nutch History In the next few slides, we’ll go through some of Nutch’s history, including my involvement, the history of Nutch dev, and how we came to today

How I got involved In CS572: Seminar on Search Engines at USC –Okay well it used to be called CS599, but you get the picture Started out by contributing an RSS parsing plugin –My final project in 599 Moved on from there to –NUTCH-88, redesign of the parsing framework –NUTCH-139, Metadata container support –NUTCH-210, Web Context application file –And various other bug fixes, and contributions here and there –Mailing list support –Wiki support Became a committer in October 2006 Helped spin Nutch into Apache TLP, March 2010, Nutch PMC member

The Big Yellow Elephant Before this guy was born Lots of folks interested in Nutch Hadoop is born (January 2008) Credit: svnsearch.org

Post Hadoop Life Nutch project kind of withered –Well, more than “kind of”: it did wither –Went years in between releases (0.8 to 1.0 took a while) Dev Community went into maintenance mode –Many committers simply went inactive User Community deteriorated

Some Observations It was pretty difficult to attract new committers –Took too long to VOTE them in –They were only interested in Hadoop type stuff –Not many organizations were doing web-scale search Existing active committers dwindled I was one of them!

Some Observations There wasn’t a plan for what to do next –What features to work on? –What bugs to fix? –Many considered Nutch to be “production” worthy in its current form, and there weren’t a huge number of internet-scale users, so people just “put up” with its existing issues (e.g., difficult to configure)

Hadoop wasn’t the only spinoff A lot of us interested in content detection and analysis, another major Nutch strength, went off to work on that in some other Apache project that I can’t remember the name of

How can Nutch reorganize? Strong feeling from the Nutch community that we should take whoever is left and think about what the “next generation” Nutch would look like (Several cycles of) Mailing threads started by Andrzej Bialecki, Dennis Kubes, Otis Gospodnetić

Initial Nutch2 fizzles Ended up being a lot of talk, but there wasn’t enough interest to pick up a shovel and help dig the hole But…there were interesting things going on –Example: Nutchbase work from Dogacan and Enis

What was “Nutchbase”? Take the Apache implementation of Google’s “BigTable” (HBase) –Column-oriented storage, high scalability in columns and rows Store Nutch web page content in it
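For flavor, here is roughly what writing a page into an HBase table looks like with the client API of that era (HTable and Put.add have since been replaced in newer HBase releases); the table, column family, and qualifier names are invented for the example and are not Nutchbase's actual schema.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class WebpageRowSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HTable table = new HTable(conf, "webpage");   // table name is illustrative only

    // One row per page, keyed by URL; columns hold fetched content and parse output.
    Put put = new Put(Bytes.toBytes("http://example.org/page"));
    put.add(Bytes.toBytes("f"), Bytes.toBytes("content"), Bytes.toBytes("<html>...</html>"));
    put.add(Bytes.toBytes("p"), Bytes.toBytes("title"), Bytes.toBytes("Example page"));
    table.put(put);
    table.close();
  }
}
```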

Lots of interest in Nutchbase But, sadly, maintained as a patch for a year or more –NUTCH-650 HBase integration Brought about some interesting thoughts –If storage can be abstracted, what else could be? Messaging layer (JMS Nutch?) Parsing? Indexing (Solr, Lucene, you-name-it)

Post Nutch 1.0 Nutch 1.0 release was a true “1.”-oh! –Included production features –Those using it were happy, b/c they had bought into the model –Usable, tunable But, how do we get to Nutch 2.0?

A few things happen in parallel 1.1 Release? –I had some free time and was willing to RM a Nutch 1.1 release to get things going Dogacan, Enis, Julien and Andrzej got interested in moving Nutchbase forward –But took it to the next level…we’ll get back to this We elected a new committer, Julien Nioche Patches that had sat for years now got committed

Oh, and Nutch became a TLP Grabbed folks that were active in the Nutch community Decided to move forward with Nutch/HBase as the de facto platform –No need to maintain home-grown storage formats –And, take it to the next level, to ORM-ness Decided to make Nutch a “delegator” rather than a workhorse –In other words…

Nutch2: “Delegator” Indexing/Querying? –Solr has a lot of interest and does tons of work in this area: let’s use it instead of vanilla Lucene Parsing? –Tika: ditto Storage –Let’s use the ORM layer that some of the Nutch committers were working on
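Delegating indexing means Nutch simply ships documents to Solr over HTTP. A minimal SolrJ sketch using the client class of that era (CommonsHttpSolrServer; newer Solr releases use HttpSolrClient instead), with field names that merely assume a Nutch-like Solr schema:

```java
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class SolrDelegationSketch {
  public static void main(String[] args) throws Exception {
    CommonsHttpSolrServer solr = new CommonsHttpSolrServer("http://localhost:8983/solr");

    SolrInputDocument doc = new SolrInputDocument();
    doc.addField("id", "http://example.org/page");
    doc.addField("title", "Example page");
    doc.addField("content", "Plain text produced by the parse step.");

    solr.add(doc);     // send the document to Solr
    solr.commit();     // make it searchable
  }
}
```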

Enter Gora: “that ORM technology” Initially baked up at GitHub Decided to move to the Incubator in Sept 2010 –I was contacted and asked to champion the effort What is Gora? –Uses Apache Avro to specify objects and their schema –ORM middleware takes the Avro specs, generates Java code –Plugins for HBase, Cassandra, in-memory SQL store, etc.
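A hedged sketch of how client code sees Gora: a DataStore keyed by URL, holding instances of a persistent class generated from an Avro schema (here the Nutch 2.x WebPage class; constructor and setter shapes differ across Gora and Avro versions). Which backend the store maps to (HBase, Cassandra, SQL) is decided by Gora configuration, not by this code.

```java
import org.apache.avro.util.Utf8;
import org.apache.gora.store.DataStore;
import org.apache.gora.store.DataStoreFactory;
import org.apache.hadoop.conf.Configuration;
import org.apache.nutch.storage.WebPage;

public class GoraStoreSketch {
  public static void main(String[] args) throws Exception {
    // The concrete DataStore implementation (HBase, Cassandra, ...) comes from
    // gora.properties; this code only talks to the ORM abstraction.
    Configuration conf = new Configuration();
    DataStore<String, WebPage> store =
        DataStoreFactory.getDataStore(String.class, WebPage.class, conf);

    WebPage page = new WebPage();          // Avro/Gora-generated persistent class
    page.setTitle(new Utf8("Example page"));

    store.put("http://example.org/page", page);  // key by URL
    store.flush();
    store.close();
  }
}
```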

Nutch and Gora Throw out all code in Nutch that had to do with the Writable interface –Generated now by the “Web Page” schema in Gora –Web Page is the canonical Nutch object for storage Parse text, parse data, etc. No more web-db, crawl-db, etc.

Out with the old… Throw out Nutch webapp –Solr provides REST-ful services to get at metadata/index –We’ll add the REST (pun) for storage/etc. Throw out Lucene code Slowly trash existing Nutch parsers

In with the new Get rid of webapp –Nutch 2.x has seen contributions of REST web services for full crawl cycle, storage I/F Delegate indexing to Solr –Nutch 1.x first appearance of SolrIndexer and Nutch Solr schema Delegate parsing to Tika –Nutch 1.1 first appearance of parse-tika –Have been decommissioning existing parsers Suggested improvements to Tika during this process
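As a small illustration of what the delegation buys you, the Tika facade below detects a file's type and extracts plain text in a couple of lines; parse-tika does roughly this inside Nutch's parsing extension point, plus metadata handling.

```java
import java.io.File;
import org.apache.tika.Tika;

public class TikaParseSketch {
  public static void main(String[] args) throws Exception {
    Tika tika = new Tika();                                   // facade: detection + parsing
    String text = tika.parseToString(new File(args[0]));      // works for any supported content type
    System.out.println(text);
  }
}
```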

Nutch2 Architecture

Learning from our mistakes Maintenance –Checking in jars made the Nutch checkout huge (even of just the “source”) Now using Ivy to manage dependencies –Patches sitting? Not on my watch! Encouragement to find and commit patches that have been sitting for a while, or simply disposition them –People want to use Nutch code as “dep” Build now includes ability for RM to push to Maven Central NOTE: CHRIS’S OPINION SLIDE

Learning from our mistakes Community –Folks contributing patches? Make ’em a committer –Folks providing good testing results? Make ’em a committer –Folks making good documentation? Make ’em a committer –It’s the sign of a healthy Apache project if new committers (and members) are being elected NOTE: CHRIS’S OPINION SLIDE

Learning from our mistakes Configuration of Nutch is hard –It still is –Getting easier though –Anyone have any great ideas or patches to integrate with a DI framework? –Things like Gora, Solr, etc., are making this easier Providing flexible service interfaces beyond Java APIs –Existing work on NUTCH-932, NUTCH-931 and NUTCH-880 is just the beginning

Interesting work going on I taught a class on Search Engines this past summer Some neat projects that I’m working on with my students to contribute back to Apache –Implementation of Authority/Hub scoring –Deduplication improvements –Clustering plugin improvements –Work to improve Nutch-Solr-Drupal integration
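For the authority/hub project, the heart of the computation is the HITS-style update: a page's authority score sums the hub scores of pages linking to it, and its hub score sums the authority scores of pages it links to, iterated to convergence. A minimal, library-free sketch of one iteration (not the students' actual code):

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class HitsSketch {
  /** One hub/authority update over a link graph given as page -> outlinks. */
  static void iterate(Map<String, List<String>> outlinks,
                      Map<String, Double> hub, Map<String, Double> auth) {
    // Authority: sum of hub scores of pages linking in.
    Map<String, Double> newAuth = new HashMap<>();
    for (Map.Entry<String, List<String>> e : outlinks.entrySet()) {
      double h = hub.getOrDefault(e.getKey(), 1.0);
      for (String target : e.getValue()) {
        newAuth.merge(target, h, Double::sum);
      }
    }
    // Hub: sum of authority scores of pages linked to.
    Map<String, Double> newHub = new HashMap<>();
    for (Map.Entry<String, List<String>> e : outlinks.entrySet()) {
      double score = 0.0;
      for (String target : e.getValue()) {
        score += newAuth.getOrDefault(target, 0.0);
      }
      newHub.put(e.getKey(), score);
    }
    normalize(newAuth);
    normalize(newHub);
    auth.clear(); auth.putAll(newAuth);
    hub.clear();  hub.putAll(newHub);
  }

  private static void normalize(Map<String, Double> scores) {
    double sum = scores.values().stream().mapToDouble(Double::doubleValue).sum();
    if (sum > 0) {
      scores.replaceAll((k, v) -> v / sum);
    }
  }
}
```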

Wrapup Nutch has seen tremendous highs and lows over the years –We’re still kicking The newest version of Nutch (2.0) will have a vastly slimmed-down footprint, and will use existing successful frameworks for the heavy lifting –Solr, Tika, Gora, Hadoop If you’re interested in our dev, check us out at

Alright, I’ll shut up now Any questions? THANK YOU!

Acknowledgements Nutch team Some material inspired by Andrzej Bialecki’s talks here OODT team at JPL