Apache Solr Beyond The Box Chris Hostetter 2008-11-05

Why Are We Here? Plugins!
● What, How, Where, When, Why?
● Solr Internals In A Nutshell
● Real World Examples
● Testing
● Questions

What, How, Where, Who, When, Why?

What Is Solr (To Users)
● Information Retrieval Application
● Index/Query Via HTTP
● Comprehensive HTML Administration Interfaces
● Scalability - Efficient Replication To Other Solr Search Servers
● Highly Configurable Caching
● Flexible And Adaptable With XML Configuration
  ○ Customizable Request Handlers And Response Writers
  ○ Data Schema With Dynamic Fields And Unique Keys
  ○ Analyzers Created At Runtime From Tokenizers And TokenFilters

What Is Solr (To Developers)
● Information Retrieval Application
● Java5 WebApp (WAR) With A Web Services-ish API
● Extensible Plugin Architecture
● MVC-ish Framework Around The Java Lucene Search Library
● Allows Custom Business Logic And Text Analysis Rules To Live Close To The Data
● Abstracts Away The Tricky Stuff:
  ○ Index Consistency
  ○ Data Replication
  ○ Cache Management

How It Started

When/Why To Write A Plugin “X can be done more efficiently closer to the data.” OR “To force X for all clients.”

Solr Internals In A Nutshell

50,000' View
[Diagram components: HTTP, SolrDispatchFilter, Java, EmbeddedSolrServer, CoreContainer, SolrCore, SolrRequestHandler, SolrQuery(Request/Response), QueryResponseWriter]

MVC-ish
● SolrRequestHandler... A Controller
  ○ handleRequest(SolrQueryRequest, SolrQueryResponse)
● SolrQueryRequest... An Event (++)
  ○ Input Parameters
  ○ List of ContentStreams
  ○ Maintains SolrCore & SolrIndexSearcher References
● SolrQueryResponse... Model
  ○ Tree of "Simple" Objects and DocLists
● ResponseWriter... View
  ○ write(Writer, SolrQueryRequest, SolrQueryResponse)

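Hello World
    // Package names below are those of Solr 1.3-era releases; later versions moved some classes.
    import org.apache.solr.handler.RequestHandlerBase;
    import org.apache.solr.request.SolrQueryRequest;
    import org.apache.solr.request.SolrQueryResponse;

    public class HelloWorld extends RequestHandlerBase {
      // Controller: read the request parameters and add values to the response model.
      public void handleRequestBody(SolrQueryRequest req, SolrQueryResponse rsp) {
        String name = req.getParams().get("name");
        Integer age = req.getParams().getInt("age");
        rsp.add("greeting", "Hello " + name);
        rsp.add("yourage", age);
      }
      // Descriptive metadata about the plugin.
      public String getVersion() { return "$Revision:$"; }
      public String getSource() { return "$Id:$"; }
      public String getSourceId() { return "$URL:$"; }
      public String getDescription() { return "Says Hello"; }
    }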

Hello World Output
    { "responseHeader":{ "status":0, "QTime":1}, "greeting":"Hello Hoss", "yourage":32 }
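The controller, model, and view roles described above can be illustrated with a small standalone sketch. It uses only the JDK and does not touch Solr: the `Handler` and `View` interfaces, the `MvcSketch` class, and the `dispatch` method are hypothetical stand-ins, and the JSON rendering is deliberately minimal (no escaping, no response header).

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

// Toy stand-ins for the roles named on the slides; these are NOT Solr classes.
class MvcSketch {
    /** Controller role: reads input parameters and fills the response model. */
    interface Handler {
        void handle(Map<String, String> params, Map<String, Object> response);
    }

    /** View role: renders the response model as text. */
    interface View {
        String write(Map<String, Object> response);
    }

    // Mirrors the Hello World handler: one greeting string and one integer.
    static final Handler HELLO = (params, rsp) -> {
        rsp.put("greeting", "Hello " + params.get("name"));
        rsp.put("yourage", Integer.valueOf(params.get("age")));
    };

    // Minimal JSON rendering: strings are quoted (no escaping), other values use toString().
    static final View JSON = rsp -> {
        StringJoiner out = new StringJoiner(",", "{", "}");
        for (Map.Entry<String, Object> e : rsp.entrySet()) {
            Object v = e.getValue();
            String rendered = (v instanceof String) ? "\"" + v + "\"" : String.valueOf(v);
            out.add("\"" + e.getKey() + "\":" + rendered);
        }
        return out.toString();
    };

    /** Dispatch: run the controller, then hand the filled model to the view. */
    static String dispatch(Handler h, View w, Map<String, String> params) {
        Map<String, Object> rsp = new LinkedHashMap<>(); // keeps insertion order
        h.handle(params, rsp);
        return w.write(rsp);
    }

    public static void main(String[] args) {
        System.out.println(dispatch(HELLO, JSON, Map.of("name", "Hoss", "age", "32")));
    }
}
```

Because the handler only fills a model, swapping `JSON` for another `View` changes the output format without touching the handler, which is the point of the split.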

Types Of Plugins
● SolrRequestHandler
  ○ SearchComponent
  ○ QParserPlugin
  ○ ValueSourceParser
● SolrHighlighter
  ○ SolrFragmenter
  ○ SolrFormatter
● UpdateRequestProcessorFactory
● QueryResponseWriter
● Similarity(Factory)
● Analyzer
  ○ TokenizerFactory
  ○ TokenFilterFactory
● FieldType
● SolrCache
  ○ CacheRegenerator
● SolrEventListener
● UpdateHandler
Slide legend: Italics: Only One Per SolrCore. Color: Likelihood Of Needing To Write Your Own.

Real World Examples

Tibetan And Himalayan Digital Library Tools

Tsheg Analysis Factories
    // Factory that creates the custom tokenizer for a Reader.
    public class TshegBarTokenizerFactory extends BaseTokenizerFactory {
      public TokenStream create(Reader input) {
        return new TshegBarTokenizer(input);
      }
    }

    // Factory that wraps an existing TokenStream in the custom filter.
    public class EdgeTshegTrimmerFactory extends BaseTokenFilterFactory {
      public TokenStream create(TokenStream input) {
        return new EdgeTshegTrimmer(input);
      }
    }

DFLL

DFLL: Faceted Browsing

DFLL Category Metadata
● Category ID and Label: 3126 == "Tablet PCs"
● Category Query: tablet_form:[* TO *]
● Ordered List of Facets
  ○ Facet ID and Label: == "OS Provided"
  ○ Facet Display Info: Count vs. Alphabetical, etc...
  ○ Ordered List of Constraints
    ● Constraint ID and Label: == "Apple OS X"
    ● Constraint Query: os:("OSX10.1" "OSX10.2"...)

DfllHandler Pseudo-Code
    Document catMetaDoc = searcher.getFirstMatch(catDocId)
    Metadata m = parseAndCacheMetadata(catMetaDoc, searcher)
    m = m.clone()
    DocListAndSet results = searcher.getDocListAndSet(m.catQuery,...)
    response.add("products", results.docList)
    foreach (Facet f : m) {
      foreach (Constraint c : f) {
        c.setCount(searcher.numDocs(c.query, results.docSet))
      }
    }
    response.add("metadata", m.asSimpleObjects())

Conceptual Picture
[Diagram: Query → getDocListAndSet(Query, Query[], Sort, offset, n) → Response.
Inputs shown: tablet_form:[* TO *], os:("OSX10.1" "OSX10.2"...), memory:[1GB TO *], price asc.
Outputs: DocList (Section of ordered results) and DocSet (Unordered set of all results).
numDocs() is applied to the constraint queries proc_manu:Intel, proc_manu:AMD, price:[0 TO 500], price:[500 TO 1000], manu:Dell, manu:HP, manu:Lenovo; counts shown: 594, 382, 247, 689, 104, 92, 75.]
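The counting step, where each constraint query is counted against the unordered set of all results, amounts to a set intersection. The sketch below models that idea with `java.util.BitSet` standing in for a DocSet. It is not Solr's DocSet or `numDocs` implementation; `FacetCountSketch`, `countConstraints`, and the document ids are invented for illustration.

```java
import java.util.BitSet;
import java.util.LinkedHashMap;
import java.util.Map;

// Simplified model: a "DocSet" is a BitSet of document ids. This is not Solr's DocSet API.
class FacetCountSketch {
    /** For each constraint, counts the documents in the base result set that also match it. */
    static Map<String, Integer> countConstraints(BitSet baseResults, Map<String, BitSet> constraints) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (Map.Entry<String, BitSet> e : constraints.entrySet()) {
            BitSet both = (BitSet) baseResults.clone(); // copy so the base set is not modified
            both.and(e.getValue());                     // intersection of the two sets
            counts.put(e.getKey(), both.cardinality());
        }
        return counts;
    }

    /** Convenience helper to build a set of document ids. */
    static BitSet docs(int... ids) {
        BitSet s = new BitSet();
        for (int id : ids) s.set(id);
        return s;
    }

    public static void main(String[] args) {
        BitSet categoryResults = docs(1, 2, 3, 5, 8);       // hypothetical category matches
        Map<String, BitSet> constraints = new LinkedHashMap<>();
        constraints.put("manu:Dell", docs(1, 2, 9));        // hypothetical constraint matches
        constraints.put("manu:HP", docs(3, 4));
        System.out.println(countConstraints(categoryResults, constraints));
    }
}
```

Keeping the full result set separate from the paged, ordered list is what lets every constraint be counted against all matches rather than only the visible page.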

DFLL Response
[Response excerpt: facet label "OS provided", constraint label "Apple Mac OS X"]

DfllCacheRegenerator
SolrCore "Auto-warms" all SolrCaches when new versions of the index are opened for searching (after a commit).
    public interface CacheRegenerator {
      public boolean regenerateItem(SolrIndexSearcher newSearcher,
                                    SolrCache newCache, SolrCache oldCache,
                                    Object oldKey, Object oldVal) throws IOException;
    }
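Auto-warming can be pictured as walking the old cache and asking a regenerator to rebuild each entry for the new index. The following JDK-only sketch models that loop with plain maps. `WarmingSketch`, `Regenerator`, `warm`, and the `maxItems` limit are hypothetical names, and the convention that returning `false` stops the warming is an assumption of this sketch rather than something stated on the slide.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Toy model of cache auto-warming with plain maps; not Solr's SolrCache API.
class WarmingSketch {
    interface Regenerator<K, V> {
        /** Recompute the entry for oldKey and put it into newCache; return false to stop warming. */
        boolean regenerate(Map<K, V> newCache, K oldKey, V oldVal);
    }

    /** Builds a fresh cache by regenerating at most maxItems entries of the old one. */
    static <K, V> Map<K, V> warm(Map<K, V> oldCache, Regenerator<K, V> regen, int maxItems) {
        Map<K, V> newCache = new LinkedHashMap<>();
        int done = 0;
        for (Map.Entry<K, V> e : oldCache.entrySet()) {
            if (done++ >= maxItems) break;                                   // respect the warming limit
            if (!regen.regenerate(newCache, e.getKey(), e.getValue())) break; // regenerator asked to stop
        }
        return newCache;
    }

    public static void main(String[] args) {
        Map<String, Integer> old = new LinkedHashMap<>();
        old.put("a", 1);
        old.put("bb", 1);
        old.put("ccc", 1);
        // The old values are stale, so the regenerator recomputes each value from its key.
        Map<String, Integer> warmed = warm(old, (cache, k, v) -> { cache.put(k, k.length()); return true; }, 2);
        System.out.println(warmed);
    }
}
```

The regenerator receives the old value as well as the old key, so it can choose between reusing, recomputing, or skipping an entry.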

DataImportHandler

DataImportHandler
Builds and incrementally updates indexes based on configured SQL or XPath queries.
    <entity name="item" pk="ID"
            query="select * from ITEM"
            deltaQuery="select ID... where ITEMDATE > '${dataimporter.last_index_time}'">
      ...
      <entity name="f" pk="ITEMID"
              query="select DESC from FEATURE where ITEMID='${item.ID}'"
              deltaQuery="select ITEMID from FEATURE where UPDATEDATE > '${dataimporter.last_index_time}'"
              parentDeltaQuery="select ID from ITEM where ID=${f.ITEMID}">
        ...
      </entity>
    </entity>
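The queries in this configuration contain `${...}` placeholders that are filled in per row or per import run. As a rough illustration of that substitution step only, the sketch below resolves placeholders from a map. It is not DataImportHandler's own resolver: `TemplateSketch` and `resolve` are invented names, and throwing on an unknown variable is a choice made here for clarity.

```java
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative placeholder substitution; not DataImportHandler's actual variable resolver.
class TemplateSketch {
    private static final Pattern VAR = Pattern.compile("\\$\\{([^}]+)\\}");

    /** Replaces each ${name} in the template with the matching value from vars. */
    static String resolve(String template, Map<String, String> vars) {
        Matcher m = VAR.matcher(template);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String value = vars.get(m.group(1));
            if (value == null) {
                // Assumption of this sketch: an unknown variable is treated as an error.
                throw new IllegalArgumentException("unresolved variable: " + m.group(1));
            }
            m.appendReplacement(out, Matcher.quoteReplacement(value));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        String sql = "select ID from ITEM where ID=${f.ITEMID}";
        System.out.println(resolve(sql, Map.of("f.ITEMID", "42")));
    }
}
```

In the configuration above, `${item.ID}` ties each child row to its parent row, while `${dataimporter.last_index_time}` limits the delta queries to rows changed since the previous import.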

DataImportHandler Plugins
● DataSource
  ○ FileDataSource
  ○ HttpDataSource
  ○ JdbcDataSource
● EntityProcessor
  ○ FileListEntityProcessor
  ○ SqlEntityProcessor
    ● CachedSqlEntityProcessor
  ○ XPathEntityProcessor
● Transformer
  ○ DateFormatTransformer
  ○ NumberFormatTransformer
  ○ RegexTransformer
  ○ ScriptTransformer
  ○ TemplateTransformer

LocalSolr

LocalUpdateProcessorFactory
● Uses lat/lon fields to compute Cartesian Tier info
● Adds grid boxes of various sizes as new fields
lat lng 9 17

LocalSolr Cartesian Tiers
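To make the idea of grid boxes of various sizes concrete, here is a deliberately simplified sketch. It is not LocalSolr's tier formula: it assumes a plain latitude/longitude grid in which tier `t` splits the world into 2^t by 2^t boxes, and the `TierSketch` class, the `boxId` numbering, and the `tier_N` field names are all hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Simplified grid model (assumption): tier t divides the world into 2^t x 2^t equal boxes.
// LocalSolr's real Cartesian tier computation differs; this only illustrates the concept.
class TierSketch {
    /** Returns a box id for the point at the given tier, numbering boxes row by row. */
    static long boxId(double lat, double lon, int tier) {
        int n = 1 << tier; // boxes per side at this tier
        int col = Math.min(n - 1, (int) Math.floor((lon + 180.0) / 360.0 * n));
        int row = Math.min(n - 1, (int) Math.floor((lat + 90.0) / 180.0 * n));
        return (long) row * n + col;
    }

    /** Computes one extra field per tier, as an update processor might add to a document. */
    static Map<String, Long> tierFields(double lat, double lon, int minTier, int maxTier) {
        Map<String, Long> fields = new LinkedHashMap<>();
        for (int t = minTier; t <= maxTier; t++) {
            fields.put("tier_" + t, boxId(lat, lon, t)); // hypothetical field naming
        }
        return fields;
    }

    public static void main(String[] args) {
        System.out.println(tierFields(0.0, 0.0, 1, 2));
    }
}
```

Storing a box id per tier turns a "near this point" lookup into an exact match on one field, with the tier chosen to suit the search radius.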

LocalSolrQueryComponent
● Use in place of default QueryComponent
● Augments regular query with DistanceQuery and DistanceSortSource
● Can use a custom SolrCache for distances for commonly used points
    <searchComponent name="geoquery" class="....LocalSolrQueryComponent" />
geoquery...

GuardianComponent

GuardianComponent Goal
● When Searching Really Short Docs, Rule Out Matches That Are "Significantly" Longer Than The Query
● Increase Precision At The Expense Of Recall
q = Dance Party
  Dance Party (1995)
  Dance Party (2005) (V)
  Dance Party, USA (2006)
  Workout Party... Let's Dance! (2004) (V)
  Shrek in the Swamp Karaoke Dance Party (2001) (V)

Implementation
● SearchComponent
● Configured To Run After QueryComponent
● Post-Processes DocList
  ○ Pick MAX_LEN Based On Number Of Query Clauses
  ○ Re-analyze Stored "title" Field
  ○ Eliminate Any Results With More Than MAX_LEN Tokens In "title"
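The post-processing steps can be sketched without Solr as a filter over a list of titles. Everything specific here is an assumption made for illustration: the `GuardianSketch` class, the crude tokenizer (lowercase, split on anything that is not a letter or digit), and the rule that MAX_LEN is twice the number of query clauses. The slides do not say how MAX_LEN is actually chosen.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Standalone sketch of the post-filtering idea; not the actual GuardianComponent.
class GuardianSketch {
    /** Crude stand-in for re-analyzing a stored field: lowercase, split on non-alphanumerics. */
    static List<String> tokens(String text) {
        List<String> out = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!t.isEmpty()) out.add(t);
        }
        return out;
    }

    /** Hypothetical rule: allow titles up to twice as long as the query. */
    static int maxLen(int queryClauses) {
        return queryClauses * 2;
    }

    /** Keeps only the titles whose token count does not exceed MAX_LEN for this query. */
    static List<String> filter(String query, List<String> titles) {
        int limit = maxLen(tokens(query).size());
        List<String> kept = new ArrayList<>();
        for (String title : titles) {
            if (tokens(title).size() <= limit) kept.add(title);
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> titles = List.of(
            "Dance Party (1995)",
            "Dance Party (2005) (V)",
            "Dance Party, USA (2006)",
            "Workout Party... Let's Dance! (2004) (V)",
            "Shrek in the Swamp Karaoke Dance Party (2001) (V)");
        for (String kept : filter("Dance Party", titles)) {
            System.out.println(kept);
        }
    }
}
```

With this hypothetical rule the two-clause query allows at most four tokens, so the first three titles are kept and the two long ones are dropped.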

Alternate Approach
● Write TokenCountingTokenFilter For titleLen
● Write MaxLenQParserPlugin
  ○ Subclass Your Favorite QParser
  ○ Pick MAX_LEN Based On Number Of Query Clauses From Super
  ○ Add +titleLen:[* TO MAX_LEN] Clause To Query
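The last step of this approach, adding a required range clause on titleLen, can be shown as plain string manipulation. This is only a sketch of the resulting query shape, not a QParserPlugin: `MaxLenQuerySketch` is an invented name, clauses are counted by splitting on whitespace, and the factor of two used for MAX_LEN is a hypothetical rule.

```java
// Sketch of the query rewrite only; a real implementation would work on parsed Query objects.
class MaxLenQuerySketch {
    /** Hypothetical rule: allow titles up to twice as long as the query. */
    static int maxLen(int clauses) {
        return clauses * 2;
    }

    /** Wraps the user query as a required clause and adds a required titleLen range clause. */
    static String withLengthLimit(String userQuery) {
        String q = userQuery.trim();
        int clauses = q.split("\\s+").length; // naive clause count: whitespace-separated terms
        return "+(" + q + ") +titleLen:[* TO " + maxLen(clauses) + "]";
    }

    public static void main(String[] args) {
        System.out.println(withLengthLimit("Dance Party"));
    }
}
```

Compared with post-filtering, this pushes the length check into the index lookup itself, at the cost of indexing the extra titleLen field.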

Testing Your Plugins

AbstractSolrTestCase
    public class YourTest extends AbstractSolrTestCase {
      ...
      public void testSomeStuff() throws Exception {
        // Index two documents, then commit so they become searchable.
        assertU(adoc("id", "7", "description", "Travel Guide",
                     "title", "Paris in 10 Days"));
        assertU(adoc("id", "42", "description", "Cool Book",
                     "title", "Hitch Hiker's Guide to the Galaxy"));
        assertU(commit());
        // Query through the dismax handler, boosting matches in the title field.
        assertQ("multi qf",
                req("q", "guide", "qt", "dismax", "qf", "title^2"));
      }
    }

Questions?