Apache Solr Beyond The Box Chris Hostetter
Why Are We Here? Plugins!
● What, How, Where, When, Why?
● Solr Internals In A Nutshell
● Real World Examples
● Testing
● Questions

What, How, Where, Who, When, Why?
What Is Solr (To Users)
● Information Retrieval Application
● Index/Query Via HTTP
● Comprehensive HTML Administration Interfaces
● Scalability - Efficient Replication To Other Solr Search Servers
● Highly Configurable Caching
● Flexible And Adaptable With XML Configuration
    Customizable Request Handlers And Response Writers
    Data Schema With Dynamic Fields And Unique Keys
    Analyzers Created At Runtime From Tokenizers And TokenFilters

What Is Solr (To Developers)
● Information Retrieval Application
● Java5 WebApp (WAR) With A Web Services-ish API
● Extensible Plugin Architecture
● MVC-ish Framework Around The Java Lucene Search Library
● Allows Custom Business Logic and Text Analysis Rules To Live Close To The Data
● Abstracts Away The Tricky Stuff:
    Index Consistency
    Data Replication
    Cache Management

How It Started
When/Why To Write A Plugin
“X can be done more efficiently closer to the data.”
OR
“To force X for all clients.”

Solr Internals In A Nutshell
50,000' View
[Diagram components: HTTP, SolrDispatchFilter, Java, EmbeddedSolrServer, CoreContainer, SolrCore, SolrRequestHandler, SolrQuery(Request/Response), QueryResponseWriter]

MVC-ish
● SolrRequestHandler... A Controller
    handleRequest(SolrQueryRequest, SolrQueryResponse)
● SolrQueryRequest... An Event (++)
    Input Parameters
    List of ContentStreams
    Maintains SolrCore & SolrIndexSearcher References
● SolrQueryResponse... Model
    Tree of "Simple" Objects and DocLists
● ResponseWriter... View
    write(Writer, SolrQueryRequest, SolrQueryResponse)

Hello World
public class HelloWorld extends RequestHandlerBase {
  public void handleRequestBody(SolrQueryRequest req,
                                SolrQueryResponse rsp) {
    String name = req.getParams().get("name");
    Integer age = req.getParams().getInt("age");
    rsp.add("greeting", "Hello " + name);
    rsp.add("yourage", age);
  }
  public String getVersion() { return "$Revision:$"; }
  public String getSource() { return "$Id:$"; }
  public String getSourceId() { return "$URL:$"; }
  public String getDescription() { return "Says Hello"; }
}

Hello World Output
Hello Hoss 32
{
  "responseHeader":{
    "status":0,
    "QTime":1},
  "greeting":"Hello Hoss",
  "yourage":32
}

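The controller, model, and view roles from the MVC-ish slide can be sketched without any Solr dependency. In this minimal sketch, Handler, Writer, and dispatch are hypothetical stand-ins invented for illustration, not Solr classes, and plain maps take the place of SolrQueryRequest and SolrQueryResponse.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical stand-ins for the roles named on the MVC-ish slide; not the Solr API. */
class MvcSketch {
    /** Controller: reads request parameters and fills the response model. */
    interface Handler {
        void handleRequest(Map<String, String> params, Map<String, Object> response);
    }

    /** View: renders the response model as text. */
    interface Writer {
        String write(Map<String, Object> response);
    }

    /** Plays the dispatching role: create the model, run the controller, then the view. */
    static String dispatch(Handler handler, Writer writer, Map<String, String> params) {
        Map<String, Object> response = new LinkedHashMap<>(); // ordered name/value pairs
        handler.handleRequest(params, response);
        return writer.write(response);
    }

    /** A controller in the spirit of the HelloWorld handler. */
    static final Handler HELLO =
        (params, rsp) -> rsp.put("greeting", "Hello " + params.get("name"));

    /** A trivial view that prints one name=value pair per line. */
    static final Writer PLAIN = rsp -> {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<String, Object> e : rsp.entrySet()) {
            sb.append(e.getKey()).append('=').append(e.getValue()).append('\n');
        }
        return sb.toString();
    };

    public static void main(String[] args) {
        System.out.print(dispatch(HELLO, PLAIN, Map.of("name", "Hoss")));
    }
}
```

Because the handler only fills the model, the same handler can be paired with a different writer to get a different output format, which is the point of the split.
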
Types Of Plugins
● SolrRequestHandler
    SearchComponent
    QParserPlugin
    ValueSourceParser
● SolrHighlighter
    SolrFragmenter
    SolrFormatter
● UpdateRequestProcessorFactory
● QueryResponseWriter
● Similarity(Factory)
● Analyzer
    TokenizerFactory
    TokenFilterFactory
● FieldType
● SolrCache
    CacheRegenerator
● SolrEventListener
● UpdateHandler
Italics: Only One Per SolrCore
Color: Likelihood Of Needing To Write Your Own

Real World Examples
Tibetan And Himalayan Digital Library Tools
Tsheg Analysis Factories
public class TshegBarTokenizerFactory
  extends BaseTokenizerFactory {
  public TokenStream create(Reader input) {
    return new TshegBarTokenizer(input);
  }
}
public class EdgeTshegTrimmerFactory
  extends BaseTokenFilterFactory {
  public TokenStream create(TokenStream input) {
    return new EdgeTshegTrimmer(input);
  }
}

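The slides show only the factories, not TshegBarTokenizer or EdgeTshegTrimmer themselves. As a rough illustration of what their names suggest, the sketch below assumes that a tsheg is the Tibetan syllable delimiter U+0F0B and works on plain strings instead of Lucene TokenStreams. The class and method names are hypothetical, and the real THDL implementations may behave differently.

```java
import java.util.ArrayList;
import java.util.List;

/** Plain-string illustration only; assumes the tsheg is the character U+0F0B. */
class TshegSketch {
    static final char TSHEG = '\u0F0B';

    /** Splits text into syllable-like tokens at each run of tshegs or whitespace. */
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String piece : text.split("[\\s" + TSHEG + "]+")) {
            if (!piece.isEmpty()) {
                tokens.add(piece); // skip the empty piece produced by a leading delimiter
            }
        }
        return tokens;
    }

    /** Removes tsheg characters from both ends of a token, leaving inner ones alone. */
    static String trimEdges(String token) {
        int start = 0;
        int end = token.length();
        while (start < end && token.charAt(start) == TSHEG) start++;
        while (end > start && token.charAt(end - 1) == TSHEG) end--;
        return token.substring(start, end);
    }

    public static void main(String[] args) {
        // Latin transliteration stands in for Tibetan letters; only the delimiter is real.
        System.out.println(tokenize("bkra" + TSHEG + "shis" + TSHEG + "bde" + TSHEG + "legs"));
    }
}
```

The trimmer is meant for tokens produced some other way, such as whitespace splitting, where a tsheg can be left hanging on either edge.
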
DFLL
DFLL: Faceted Browsing
DFLL Category Metadata
● Category ID and Label: 3126 == “Tablet PCs”
● Category Query: tablet_form:[* TO *]
● Ordered List of Facets
    Facet ID and Label: == “OS Provided”
    Facet Display Info: Count vs. Alphabetical, etc...
    Ordered List of Constraints
    ● Constraint ID and Label: == “Apple OS X”
    ● Constraint Query: os:(“OSX10.1” “OSX10.2”...)

DfllHandler Pseudo-Code
Document catMetaDoc = searcher.getFirstMatch(catDocId)
Metadata m = parseAndCacheMetadata(catMetaDoc, searcher)
m = m.clone()
DocListAndSet results = searcher.getDocListAndSet(m.catQuery,...)
response.add("products", results.docList)
foreach (Facet f : m) {
  foreach (Constraint c : f) {
    c.setCount(searcher.numDocs(c.query, results.docSet))
  }
}
response.add("metadata", m.asSimpleObjects())

Conceptual Picture
[Diagram: getDocListAndSet(Query,Query[],Sort,offset,n) is called with os:(“OSX10.1” “OSX10.2”...), memory:[1GB TO *], tablet_form:[* TO *] and the sort price asc. It returns a DocList (section of ordered results) and a DocSet (unordered set of all results). numDocs() is then evaluated against the DocSet for the constraint queries proc_manu:Intel, proc_manu:AMD, price:[0 TO 500], price:[500 TO 1000], manu:Dell, manu:HP, manu:Lenovo. Counts shown: 594, 382, 247, 689, 104, 92, 75. Both feed the Query Response.]

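The counting step, numDocs(c.query, results.docSet), amounts to intersecting the documents that match a constraint with the set of all results. The sketch below illustrates that idea with java.util.BitSet standing in for a DocSet. The class, the method names, and the document ids are made up for illustration, and this is not how SolrIndexSearcher is implemented.

```java
import java.util.BitSet;
import java.util.LinkedHashMap;
import java.util.Map;

/** Illustrates facet counts as set intersections; BitSet stands in for a DocSet. */
class FacetCountSketch {
    /** Number of documents that match the constraint and are also in the results. */
    static int countIn(BitSet constraintMatches, BitSet results) {
        BitSet both = (BitSet) constraintMatches.clone(); // leave the inputs untouched
        both.and(results);
        return both.cardinality();
    }

    /** Convenience builder for a set of document ids. */
    static BitSet docs(int... ids) {
        BitSet set = new BitSet();
        for (int id : ids) {
            set.set(id);
        }
        return set;
    }

    /** Computes one count per constraint, preserving the constraint order. */
    static Map<String, Integer> facetCounts(Map<String, BitSet> constraints, BitSet results) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (Map.Entry<String, BitSet> e : constraints.entrySet()) {
            counts.put(e.getKey(), countIn(e.getValue(), results));
        }
        return counts;
    }

    /** Made-up data: five result documents and three constraints. */
    static Map<String, Integer> demo() {
        BitSet results = docs(1, 2, 3, 5, 8);
        Map<String, BitSet> constraints = new LinkedHashMap<>();
        constraints.put("manu:Dell", docs(1, 2, 9));
        constraints.put("manu:HP", docs(3, 4));
        constraints.put("manu:Lenovo", docs(6, 7));
        return facetCounts(constraints, results);
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

The result set is computed once and reused for every constraint, which is why the pseudo-code asks for a DocSet alongside the DocList.
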
DFLL Response
OS provided Apple Mac OS X
DfllCacheRegenerator
SolrCore “Auto-warms” all SolrCaches when new versions of the index are opened for searching (after a commit).
public interface CacheRegenerator {
  public boolean regenerateItem(SolrIndexSearcher newSearcher,
                                SolrCache newCache,
                                SolrCache oldCache,
                                Object oldKey,
                                Object oldVal)
    throws IOException;
}

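The auto-warming idea can be sketched as a loop that walks the old cache and asks a regenerator to rebuild each entry against the new searcher. Everything below is a simplified, hypothetical model: the Regenerator interface drops the oldCache argument, a Function stands in for SolrIndexSearcher, and the limit parameter and iteration order are assumptions of this sketch, not Solr's behavior.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;

/** Simplified model of cache auto-warming; none of these types are Solr classes. */
class AutowarmSketch {
    /** Rebuilds one entry for the new index; returning false stops the warming loop. */
    interface Regenerator<K, V> {
        boolean regenerateItem(Function<K, V> newSearcher, Map<K, V> newCache,
                               K oldKey, V oldVal);
    }

    /** Re-populates newCache from at most `limit` entries of oldCache. */
    static <K, V> void warm(Map<K, V> oldCache, Map<K, V> newCache,
                            Function<K, V> newSearcher, Regenerator<K, V> regen, int limit) {
        int warmed = 0;
        for (Map.Entry<K, V> e : oldCache.entrySet()) {
            if (warmed >= limit) {
                break;
            }
            if (!regen.regenerateItem(newSearcher, newCache, e.getKey(), e.getValue())) {
                break;
            }
            warmed++;
        }
    }

    /** Warms two of three old entries; the "searcher" just returns the key length. */
    static Map<String, Integer> demo() {
        Map<String, Integer> oldCache = new LinkedHashMap<>();
        oldCache.put("manu:Dell", 10);
        oldCache.put("manu:HP", 20);
        oldCache.put("manu:Lenovo", 30);
        Function<String, Integer> newSearcher = key -> key.length();
        // Ignore the stale value and recompute it from the key against the new searcher.
        Regenerator<String, Integer> regen = (searcher, cache, key, oldVal) -> {
            cache.put(key, searcher.apply(key));
            return true;
        };
        Map<String, Integer> newCache = new LinkedHashMap<>();
        warm(oldCache, newCache, newSearcher, regen, 2);
        return newCache;
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

A regenerator like the one in the demo treats old values as stale and uses only the keys, which fits caches whose values depend on the index contents.
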
DataImportHandler
DataImportHandler
Builds and incrementally updates indexes based on configured SQL or XPath queries.
<entity name="item" pk="ID"
        query="select * from ITEM"
        deltaQuery="select ID... where ITEMDATE > '${dataimporter.last_index_time}'">
  ...
  <entity name="f" pk="ITEMID"
          query="select DESC from FEATURE where ITEMID='${item.ID}'"
          deltaQuery="select ITEMID from FEATURE where UPDATEDATE > '${dataimporter.last_index_time}'"
          parentDeltaQuery="select ID from ITEM where ID=${f.ITEMID}">
    ...

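The ${...} placeholders in these queries are filled in before each query runs, for example with the parent row's ID or the time of the last import. A minimal sketch of that kind of substitution follows. The class name, the resolve method, and the choice to replace unknown variables with an empty string are assumptions of this sketch, not DataImportHandler's actual resolver.

```java
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Illustrative ${name} substitution; not DataImportHandler's own implementation. */
class TemplateSketch {
    private static final Pattern PLACEHOLDER = Pattern.compile("\\$\\{([^}]+)\\}");

    /** Replaces each ${name} with vars.get(name), or with "" when the name is unknown. */
    static String resolve(String template, Map<String, String> vars) {
        Matcher m = PLACEHOLDER.matcher(template);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String value = vars.getOrDefault(m.group(1), "");
            // quoteReplacement keeps '$' and '\' in the value from being treated specially
            m.appendReplacement(out, Matcher.quoteReplacement(value));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(resolve("select ID from ITEM where ID=${f.ITEMID}",
                                   Map.of("f.ITEMID", "42")));
    }
}
```

Note that plain text substitution into SQL, as sketched here, does no escaping of the substituted values.
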
DataImportHandler Plugins
● DataSource
    FileDataSource
    HttpDataSource
    JdbcDataSource
● EntityProcessor
    FileListEntityProcessor
    SqlEntityProcessor
    ● CachedSqlEntityProcessor
    XPathEntityProcessor
● Transformer
    DateFormatTransformer
    NumberFormatTransformer
    RegexTransformer
    ScriptTransformer
    TemplateTransformer

LocalSolr
LocalUpdateProcessorFactory
● Uses lat/lon fields to compute Cartesian Tier info
● Adds grid boxes of various sizes as new fields
lat lng 9 17

LocalSolr Cartesian Tiers
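The idea of tagging each document with grid boxes of several sizes can be sketched as follows. This is a deliberately simplified grid, assuming 2^tier columns and rows over the whole lat/lon plane. The box numbering, the field names tier_9 to tier_17, and the class itself are hypothetical and do not reproduce LocalSolr's actual Cartesian Tier encoding.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Simplified multi-resolution grid; not LocalSolr's actual tier scheme. */
class TierSketch {
    /** Box id at a tier, assuming 2^tier columns and 2^tier rows over the globe. */
    static long boxId(double lat, double lon, int tier) {
        long perSide = 1L << tier;
        // Clamp so that lat = 90 and lon = 180 fall into the last row and column.
        long col = Math.min(perSide - 1, (long) ((lon + 180.0) / 360.0 * perSide));
        long row = Math.min(perSide - 1, (long) ((lat + 90.0) / 180.0 * perSide));
        return row * perSide + col;
    }

    /** One hypothetical field per tier, from coarse boxes (minTier) to fine ones (maxTier). */
    static Map<String, Long> tierFields(double lat, double lon, int minTier, int maxTier) {
        Map<String, Long> fields = new LinkedHashMap<>();
        for (int tier = minTier; tier <= maxTier; tier++) {
            fields.put("tier_" + tier, boxId(lat, lon, tier));
        }
        return fields;
    }

    public static void main(String[] args) {
        System.out.println(tierFields(40.0, -74.0, 9, 17));
    }
}
```

Two nearby points share a box id at coarse tiers and diverge at fine ones, so a search can pick the tier whose box size fits the requested radius and match on box ids before computing exact distances.
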
LocalSolrQueryComponent
● Use in place of default QueryComponent
● Augments regular query with DistanceQuery and DistanceSortSource
● Can use a custom SolrCache for distances for commonly used points
<searchComponent name="geoquery" class="....LocalSolrQueryComponent" />
geoquery...

GuardianComponent
GuardianComponent Goal
● When Searching Really Short Docs, Rule Out Matches That Are “Significantly” Longer Than Query
● Increase Precision At The Expense Of Recall
q = Dance Party
    Dance Party (1995)
    Dance Party (2005) (V)
    Dance Party, USA (2006)
    Workout Party... Let's Dance! (2004) (V)
    Shrek in the Swamp Karaoke Dance Party (2001) (V)

Implementation
● SearchComponent
● Configured To Run After QueryComponent
● Post-Processes DocList
    Pick MAX_LEN Based On Number Of Query Clauses
    Re-analyze Stored “title” Field
    Eliminate Any Results With More Than MAX_LEN Tokens In “title”

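The post-processing step can be sketched on plain strings. The slides do not say how MAX_LEN is derived from the number of query clauses, so the rule used here (clauses + 2) is invented for illustration. Whitespace splitting stands in for re-analyzing the stored title, and the class is not a SearchComponent.

```java
import java.util.ArrayList;
import java.util.List;

/** Plain-string illustration of the length filter; not an actual SearchComponent. */
class GuardianSketch {
    /** Hypothetical rule: allow two more tokens than the query has clauses. */
    static int maxLen(int queryClauses) {
        return queryClauses + 2;
    }

    /** Counts whitespace-separated tokens; a stand-in for running a real analyzer. */
    static int tokenCount(String text) {
        String trimmed = text.trim();
        return trimmed.isEmpty() ? 0 : trimmed.split("\\s+").length;
    }

    /** Keeps only the titles whose token count does not exceed MAX_LEN for this query. */
    static List<String> filter(String query, List<String> titles) {
        int limit = maxLen(tokenCount(query));
        List<String> kept = new ArrayList<>();
        for (String title : titles) {
            if (tokenCount(title) <= limit) {
                kept.add(title);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        System.out.println(filter("Dance Party", List.of(
            "Dance Party (1995)",
            "Dance Party (2005) (V)",
            "Dance Party, USA (2006)",
            "Workout Party... Let's Dance! (2004) (V)",
            "Shrek in the Swamp Karaoke Dance Party (2001) (V)")));
    }
}
```

With this rule the two long titles are dropped and the three short ones survive, trading recall for precision as the goal slide describes.
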
Alternate Approach
● Write TokenCountingTokenFilter For titleLen
● Write MaxLenQParserPlugin
    Subclass Your Favorite QParser
    Pick MAX_LEN Based On Number Of Query Clauses From Super
    Add +titleLen:[* TO MAX_LEN] Clause To Query

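The query-side variant can be sketched as a string rewrite. A real QParser subclass would add the range clause to the parsed Query object, not to the query text, so treat this only as an illustration of the resulting query. The clauses + 2 rule and the class and method names are hypothetical.

```java
/** String-level illustration of adding a titleLen range clause to a user query. */
class MaxLenQuerySketch {
    /** Hypothetical rule: allow two more tokens than the query has clauses. */
    static int maxLen(int queryClauses) {
        return queryClauses + 2;
    }

    /** Requires the user query and a titleLen no greater than MAX_LEN. */
    static String restrict(String userQuery) {
        String q = userQuery.trim();
        int clauses = q.isEmpty() ? 0 : q.split("\\s+").length;
        return "+(" + q + ") +titleLen:[* TO " + maxLen(clauses) + "]";
    }

    public static void main(String[] args) {
        System.out.println(restrict("Dance Party"));
    }
}
```

Compared with post-processing the DocList, this approach filters at index level, so result counts and paging stay consistent, at the cost of indexing an extra titleLen field.
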
Testing Your Plugins
AbstractSolrTestCase
public class YourTest extends AbstractSolrTestCase {
  ...
  public void testSomeStuff() throws Exception {
    assertU(adoc("id", "7",
                 "description", "Travel Guide",
                 "title", "Paris in 10 Days"));
    assertU(adoc("id", "42",
                 "description", "Cool Book",
                 "title", "Hitch Hiker's Guide to the Galaxy"));
    assertU(commit());
    assertQ("multi qf",
            req("q", "guide",
                "qt", "dismax",
                "qf", "title^2"));
  }
}

Questions?