Solr 3.1 and Beyond. Yonik Seeley, Lucid Imagination. October 8, 2010


1 Solr 3.1 and Beyond
Yonik Seeley, Lucid Imagination
yonik@lucidimagination.com
October 8, 2010

2 Agenda
Goal: Introduce new features you can try and use now in Solr development versions 3.1 or 4.0
  - Relevancy (Extended Dismax Parser)
  - Spatial/Geo Search
  - Search Result Grouping / Field Collapsing
  - Faceting (Pivot, Range, Per-segment)
  - Scalability (Solr Cloud)
  - Odds & Ends
  - Q&A

3 Solr 3.1? What happened to 1.5?
  - Lucene/Solr merged (March 2010)
    - Single set of committers
    - Single dev mailing list (dev@lucene.apache.org)
    - Single shared subversion trunk
    - Keep separate downloads, user mailing lists
    - Other former Lucene subprojects spun off (Nutch, Tika, Mahout, etc.)
  - Development trunk is now always the next major release (currently 4.0)
  - branch_3x will be the base for all 3.x releases
    - Branch together, release together, share version numbers

4 RELEVANCE

5 Extended Dismax Parser
  - Superset of dismax
      &defType=edismax&q=foo&qf=body
  - Fixes edge cases where dismax could still throw exceptions
      OR AND NOT - "
  - Full Lucene syntax support
    - Tries Lucene syntax first
    - Smart escaping is done if there are syntax errors
  - Optionally supports treating "and"/"or" as AND/OR in Lucene syntax
  - Fielded queries (e.g. myfield:foo) even in degraded mode
    - uf parameter controls what field names may be directly specified in "q"

6 Extended Dismax Parser (continued)
  - boost parameter for multiplicative boost-by-function
  - Pure negative query clauses
      Example: solr OR (-solr)
  - Enhanced term proximity boosting
    - pf2=myfield results in term bigrams in sloppy phrase queries
        myfield:"aa bb cc" -> myfield:"aa bb" myfield:"bb cc"
  - Enhanced stopword handling
    - Stopwords omitted in main query, but added in optional proximity boosting part
        Example: q=solr is awesome & qf=myfield & pf2=myfield
        -> +myfield:(solr awesome) (myfield:"solr is" myfield:"is awesome")
    - Currently controlled by the absence of StopWordFilter in index analyzer, and presence in query analyzer
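A minimal sketch of the two ideas above: assembling an edismax request string, and the pf2 bigram expansion shown on the slide. The helper names `edismax_query` and `pf2_phrases` are hypothetical, and the bigram function only illustrates the rewrite; it is not the parser's own code.

```python
from urllib.parse import urlencode


def edismax_query(q, qf, pf2=None, boost=None):
    """Build the query-string part of an edismax request (hypothetical helper)."""
    params = [("defType", "edismax"), ("q", q), ("qf", qf)]
    if pf2:
        params.append(("pf2", pf2))
    if boost:
        params.append(("boost", boost))
    return urlencode(params)


def pf2_phrases(field, text):
    """Illustrate pf2: every pair of adjacent terms becomes a phrase clause."""
    terms = text.split()
    return [f'{field}:"{a} {b}"' for a, b in zip(terms, terms[1:])]


query_string = edismax_query("solr is awesome", "myfield", pf2="myfield")
phrases = pf2_phrases("myfield", "aa bb cc")
```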

7 SPATIAL SEARCH

8 Spatial Search
  Step 1: Index some locations!
      The Alpine Shop 44.013617,-73.168264
  Step 2: Decide where you are
      &pt=44.0153371,-73.16734
      &d=1
      &sfield=store
  Step 3: Profit!
      Spatial Filter: &fq={!geofilt}
      Bounding Box: &fq={!bbox}
      Distance Function: &sort=geodist() asc
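As a rough illustration of what the distance filter checks, the sketch below computes the great-circle (haversine) distance between the two points on the slide and assembles the matching request parameters. The Earth radius constant, the `q=*:*` main query, and the helper name `haversine_km` are assumptions of this sketch; `d` is taken to be in kilometers.

```python
import math
from urllib.parse import urlencode

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; an assumption of this sketch


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two lat/lon points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


store = (44.013617, -73.168264)   # The Alpine Shop, from the slide
here = (44.0153371, -73.16734)    # the pt parameter, from the slide
distance = haversine_km(*here, *store)

# Request parameters for a filtered, distance-sorted search.
params = urlencode({
    "q": "*:*",                   # assumed main query, not on the slide
    "fq": "{!geofilt}",
    "sfield": "store",
    "pt": "44.0153371,-73.16734",
    "d": "1",
    "sort": "geodist() asc",
})
```

The store lies well inside the d=1 radius, so it would pass the geofilt filter in this example.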

9 RESULT GROUPING / FIELD COLLAPSING

10 Field Collapsing Definition
  - Field collapsing
    - Limit the number of results per category
    - "Category" normally defined by unique values in a field
  - Uses
    - Web search: collapse by web site
    - Email threads: collapse by thread id
    - Ecommerce/retail: show the top 5 items for each store category (music, movies, etc.)

11 Field Collapsing by Site

12 Field Collapse on Product Type Result Grouping by Category

13 Group by Field
    http://...&fl=id,name&q=ipod&group=true&group.field=manu_exact
    "grouped":{
      "manu_exact":{
        "matches":3,
        "groups":[{
            "groupValue":"Belkin",
            "doclist":{"numFound":2,"start":0,"docs":[
                {
                  "id":"IW-02",
                  "name":"iPod & iPod Mini USB 2.0 Cable"}]
            }},
          {
            "groupValue":"Apple Computer Inc.",
            "doclist":{"numFound":1,"start":0,"docs":[
                {
                  "id":"MA147LL/A",
                  "name":"Apple 60 GB iPod with Video Playback Black"}]
            }}]}}}

14 Group by Query
    http://...&group=true&group.query=price:[0 TO 99.99]&group.query=price:[100 TO *]&group.limit=5
    "grouped":{
      "price:[0 TO 99.99]":{
        "matches":3,
        "doclist":{"numFound":2,"start":0,"docs":[
            {
              "id":"IW-02",
              "name":"iPod & iPod Mini USB 2.0 Cable"},
            {
              "id":"F8V7067-APL-KIT",
              "name":"Belkin Mobile Power Cord for iPod"}]
        }},
      "price:[100 TO *]":{
        "matches":3,
        "doclist":{"numFound":1,"start":0,"docs":[
            {
              "id":"MA147LL/A",
              "name":"Apple 60 GB iPod with Video Playback Black"}]
        }}}}

15 Grouping Params
  parameter          meaning                                                  default
  group.field=       Like facet.field: group by unique field values
  group.query=       Like facet.query: top docs that also match
  group.function=    Group by unique values produced by the function query
  group.limit=       How many docs per group                                  1
  group.sort=        How to sort documents within a group                     Same as "sort" param
  rows=              How many groups to return                                10
  sort=              How to sort the groups relative to each other (based on top doc)
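The semantics of group.field, group.limit, and rows can be sketched in a few lines of plain Python. This is an in-memory illustration only, not how Solr implements grouping; the function name `group_by_field` is hypothetical and the sample documents (including their `manu_exact` values) are assumed for the example.

```python
def group_by_field(docs, field, group_limit=1, rows=10):
    """Collapse an already-sorted result list into groups of unique field values."""
    groups = {}
    for doc in docs:  # docs are assumed to arrive in "sort" order
        groups.setdefault(doc.get(field), []).append(doc)
    result = []
    for value, members in list(groups.items())[:rows]:  # rows = number of groups
        result.append({
            "groupValue": value,
            "doclist": {
                "numFound": len(members),
                "start": 0,
                "docs": members[:group_limit],  # group.limit = docs per group
            },
        })
    return {"matches": len(docs), "groups": result}


# Sample data assumed for illustration.
sample_docs = [
    {"id": "IW-02", "manu_exact": "Belkin"},
    {"id": "F8V7067-APL-KIT", "manu_exact": "Belkin"},
    {"id": "MA147LL/A", "manu_exact": "Apple Computer Inc."},
]
grouped = group_by_field(sample_docs, "manu_exact")
```

With the default group_limit of 1, each group reports its full numFound but returns only its top document, which matches the shape of the "Group by Field" response.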

16 FACETING

17 Pivot Faceting
  - Other names that could have made sense: Grid Faceting, Cross-Product Faceting, Matrix Faceting
  - Syntax: facet.pivot=field1,field2,field3,…
  Example: facet.pivot=cat,inStock
                       #docs   #docs w/ inStock:true   #docs w/ inStock:false
  cat:electronics      14      10                      4
  cat:memory           3       3                       0
  cat:connector        2       0                       2
  cat:graphics card    2       0                       2
  cat:hard drive       2       2                       0

18 Pivot Faceting
    http://...&facet=true&facet.pivot=cat,popularity
    "facet_counts":{
      "facet_pivot":{
        "cat,popularity":[{
            "field":"cat",
            "value":"electronics",
            "count":14,
            "pivot":[{
                "field":"popularity",
                "value":"6",
                "count":5},
              {
                "field":"popularity",
                "value":"7",
                "count":4},
              {
                "field":"popularity",
                "value":"1",
                "count":2}]},
          {
            "field":"cat",
            "value":"memory",
            "count":3,
            "pivot":[]},
          […]
  Reading the response:
    - "count":14 on the cat entry: 14 docs w/ cat==electronics
    - "count":5 on the nested entry: 5 docs w/ cat==electronics && popularity==6
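The nested pivot structure can be reproduced with a small recursive count. This is a sketch under the assumption that every pivoted field is single-valued; `pivot_facet` and the sample documents are hypothetical, and Solr's own implementation works on index structures rather than document lists.

```python
from collections import Counter


def pivot_facet(docs, fields):
    """Count values of fields[0], then recurse into each value for the remaining fields."""
    if not fields:
        return []
    head, rest = fields[0], fields[1:]
    counts = Counter(doc[head] for doc in docs if head in doc)
    pivots = []
    for value, count in counts.most_common():  # highest count first
        subset = [doc for doc in docs if doc.get(head) == value]
        pivots.append({
            "field": head,
            "value": value,
            "count": count,
            "pivot": pivot_facet(subset, rest),
        })
    return pivots


# Sample data assumed for illustration.
pivot_docs = [
    {"cat": "electronics", "inStock": True},
    {"cat": "electronics", "inStock": True},
    {"cat": "electronics", "inStock": False},
    {"cat": "memory", "inStock": True},
]
pivot = pivot_facet(pivot_docs, ["cat", "inStock"])
```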

19 Range Faceting
  Like date faceting, but more generic
    http://...&facet=true
      &facet.range=price
      &facet.range.start=0
      &facet.range.end=500
      &facet.range.gap=50
    "facet_counts":{
      "facet_ranges":{
        "price":{
          "counts":{
            "0.0":5,
            "50.0":2,
            "100.0":0,
            "150.0":2,
            "200.0":0,
            "250.0":1,
            "300.0":2,
            "350.0":2,
            "400.0":0,
            "450.0":1},
          "gap":50.0,
          "start":0.0,
          "end":500.0}}}}
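A sketch of how start, end, and gap carve values into buckets, producing the same response shape as above. It assumes each bucket includes its lower bound and excludes its upper bound, and that start and gap are whole numbers; `range_facet` and the sample prices are illustrative only.

```python
def range_facet(values, start, end, gap):
    """Count values into [lower, lower + gap) buckets keyed by the lower bound."""
    counts = {}
    lower = start
    while lower < end:
        counts[str(float(lower))] = 0   # create every bucket, even empty ones
        lower += gap
    for value in values:
        if start <= value < end:
            bucket = start + ((value - start) // gap) * gap
            counts[str(float(bucket))] += 1
    return {"counts": counts, "gap": float(gap), "start": float(start), "end": float(end)}


# Sample prices assumed for illustration.
price_facet = range_facet([11.5, 19.95, 399.0], start=0, end=500, gap=50)
```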

20 Existing single-valued faceting algorithm
  Example request: q=Juggernaut &facet=true &facet.field=hero
  [Figure: Lucene FieldCache Entry (StringIndex) for the "hero" field]
    - lookup: the string values: (null), batman, flash, spiderman, superman, wolverine
    - order: for each doc, an index into the lookup array: 5 3 5 1 4 5 2 1
    - Documents matching the base query "Juggernaut": for each one, look up its order value and increment the accumulator
    - Priority queue: batman, 3; flash, 5
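The figure's algorithm can be sketched as follows: one accumulator slot per lookup entry, one increment per matching document, then a priority queue to pick the top terms. The `order` and `lookup` arrays are taken from the slide; the set of matching documents and the function name are assumptions made for this example.

```python
import heapq


def facet_single_valued(order, lookup, base_docs, limit=10):
    """Count facet values for base_docs using order/lookup arrays, return top terms."""
    accumulator = [0] * len(lookup)
    for doc in base_docs:
        accumulator[order[doc]] += 1           # lookup + increment
    # Slot 0 is the (null) entry for docs without a value; skip it here.
    candidates = (
        (count, term)
        for term, count in zip(lookup[1:], accumulator[1:])
        if count > 0
    )
    top = heapq.nlargest(limit, candidates)     # priority queue of the best counts
    return [(term, count) for count, term in top]


lookup = [None, "batman", "flash", "spiderman", "superman", "wolverine"]
order = [5, 3, 5, 1, 4, 5, 2, 1]                # from the slide
matching_docs = [0, 2, 3, 5, 7]                 # hypothetical base query result
top_heroes = facet_single_valued(order, lookup, matching_docs, limit=2)
```

The cost is proportional to the number of matching documents plus the number of unique terms, which is why the approach works well once the FieldCache entry exists.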

21 Per-segment single-valued algorithm
  [Figure: per-segment faceting]
    - One FieldCache entry per segment (Segment1 to Segment4)
    - The base DocSet drives a lookup and increment (inc) into one accumulator per segment (accumulator1 to accumulator4), one thread per segment (thread1 to thread4)
    - FieldCache + accumulator merger (priority queue) combines the per-segment counts
    - Priority queue: batman, 3; flash, 5

22 Per-segment faceting
  - Enable with facet.method=fcs
  - Controllable multi-threading
      facet.field={!threads=4}myfield
  - Disadvantages
    - Larger memory use (FieldCaches + accumulators)
    - Slower (extra FieldCache merge step needed)
  - Advantages
    - Rebuilds FieldCache entries only for new segments (NRT friendly)
    - Multi-threaded
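A sketch of the per-segment idea: each segment keeps its own order/lookup arrays and accumulator, segments are counted in parallel, and the partial counts are merged by term. The data layout, function names, and sample segments are assumptions for illustration; the `threads` argument plays the role of the {!threads=4} local param.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def count_segment(segment, base_docs):
    """Count facet values inside one segment using its own order/lookup arrays."""
    order, lookup = segment["order"], segment["lookup"]
    accumulator = [0] * len(lookup)
    for doc in base_docs:
        accumulator[order[doc]] += 1
    return {term: n for term, n in zip(lookup[1:], accumulator[1:]) if n}


def facet_per_segment(segments, base_docs_per_segment, threads=4, limit=10):
    """Count each segment on its own thread, then merge the per-segment counts."""
    with ThreadPoolExecutor(max_workers=threads) as pool:
        partials = list(pool.map(count_segment, segments, base_docs_per_segment))
    merged = Counter()
    for partial in partials:   # the extra merge step the slide lists as a cost
        merged.update(partial)
    return merged.most_common(limit)


# Two hypothetical segments; slot 0 of each lookup is the (null) entry.
segments = [
    {"lookup": [None, "batman", "flash"], "order": [1, 2, 2]},
    {"lookup": [None, "flash", "superman"], "order": [1, 1, 2, 1]},
]
segment_facets = facet_per_segment(segments, [[0, 1, 2], [0, 1, 2]], threads=2)
```

Because ordinals differ between segments, the merge has to go through the term strings, which is the extra work that makes this method slower on a static index.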

23 Per-segment faceting performance comparison
  Test index: 10M documents, 18 segments, single-valued field
  A: Base DocSet=100 docs, facet.field on a field with 100,000 unique terms
    Time for request*          facet.method=fc    facet.method=fcs
    static index               3 ms               244 ms
    quickly changing index     1388 ms            267 ms
  B: Base DocSet=1,000,000 docs, facet.field on a field with 100 unique terms
    Time for request*          facet.method=fc    facet.method=fcs
    static index               26 ms              34 ms
    quickly changing index     741 ms             94 ms
  *complete request time, measured externally

24 Faceting Performance Improvements
  - For facet.method=enum, speed up initial population of the filterCache (i.e. first-time facet): from 30% to 32x improvement
  - Optimized facet.method=fc for multi-valued fields and large facet.limit: up to 3x faster
  - Optimized deep facet paging: up to 10x faster with really large facet.offset values
  - Less memory consumed by field cache entries

25 SCALABILITY

26 SolrCloud
  - First steps toward simplifying cluster management
  - Integrates ZooKeeper
    - Central configuration (schema.xml, solrconfig.xml, etc.)
    - Tracks live nodes + shards of collections
  - Removes need for external load balancers
      shards=localhost:8983/solr|localhost:8900/solr,localhost:7574/solr|localhost:7500/solr
  - Can specify logical shard ids
      shards=NY_shard,NJ_shard
  - Clients don't need to know shards at all:
      http://localhost:8983/solr/collection1/select?distrib=true
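The shards syntax above separates shards with commas and equivalent replicas of one shard with pipes. A small sketch of reading that value and picking one replica per shard; both helper names are hypothetical, and the random choice is only a stand-in for whatever load-balancing policy the server applies.

```python
import random


def parse_shards(shards_param):
    """Split a shards value into a list of shards, each a list of replica addresses."""
    return [
        [replica.strip() for replica in shard.split("|")]
        for shard in shards_param.split(",")
        if shard.strip()
    ]


def pick_replicas(shards, rng=random):
    """Choose one replica per shard (illustrative load balancing only)."""
    return [rng.choice(replicas) for replicas in shards]


shards = parse_shards(
    "localhost:8983/solr|localhost:8900/solr,localhost:7574/solr|localhost:7500/solr"
)
targets = pick_replicas(shards, random.Random(0))
```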

27 SolrCloud: The Future
  - Eliminate all single points of failure
  - Remove Master/Searcher distinction
    - Enables near real-time search in a highly scalable environment
  - High availability for writes
  - Eventual consistency model (like Amazon Dynamo, Cassandra)
  - Elastic
    - Simply add/subtract servers, cluster will rebalance automatically
    - By default, Solr will handle document partitioning

28 ODDS & ENDS

29 Auto-Suggest
  - Many people currently use terms component
    - Can be slow for a large corpus
  - New auto-suggest builds off SpellCheck component
    - Compact memory-based trie for really fast completions
    - Based on a field in the main index, or on a dictionary file
    http://localhost:8983/solr/suggest?wt=json&indent=true&q=ult
    "spellcheck":{
      "suggestions":[
        "ult",{
          "numFound":1,
          "startOffset":0,
          "endOffset":3,
          "suggestion":["ultrasharp"]},
        "collation","ultrasharp"]}}
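To show why a trie makes prefix completion fast, here is a plain, uncompressed trie in Python. It is a teaching sketch, not the compact structure the suggester uses; the class name and sample words are assumptions.

```python
class Trie:
    """Minimal character trie supporting prefix completion."""

    def __init__(self):
        self.children = {}
        self.is_word = False

    def insert(self, word):
        node = self
        for ch in word:
            node = node.children.setdefault(ch, Trie())
        node.is_word = True

    def complete(self, prefix, limit=10):
        """Return up to `limit` stored words starting with prefix, in sorted order."""
        node = self
        for ch in prefix:                 # walk down to the prefix node
            node = node.children.get(ch)
            if node is None:
                return []
        results = []
        stack = [(node, prefix)]
        while stack and len(results) < limit:
            current, text = stack.pop()
            if current.is_word:
                results.append(text)
            # Push children in reverse so the smallest character is popped first.
            for ch in sorted(current.children, reverse=True):
                stack.append((current.children[ch], text + ch))
        return results


suggester = Trie()
for word in ["ultrasharp", "ultra", "solr"]:
    suggester.insert(word)
completions = suggester.complete("ult")
```

Lookup cost depends on the prefix length and the number of completions returned, not on the size of the whole dictionary.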

30 Index with JSON
    $ URL=http://localhost:8983/solr/update/json
    $ curl $URL -H 'Content-type:application/json' -d '
    {
      "add": {
        "doc": {
          "id" : "978-0641723445",
          "cat" : ["book","hardcover"],
          "title" : "The Lightning Thief",
          "author" : "Rick Riordan",
          "series_t" : "Percy Jackson and the Olympians",
          "sequence_i" : 1,
          "genre_s" : "fantasy",
          "inStock" : true,
          "price" : 12.50,
          "pages_i" : 384
        }
      }
    }'
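The same update can be prepared from Python with only the standard library. This sketch builds the request but does not send it, since sending needs a running Solr instance; the helper name `build_add_request` is hypothetical.

```python
import json
import urllib.request


def build_add_request(url, doc):
    """Wrap a document in an "add" command and prepare a JSON POST request."""
    payload = json.dumps({"add": {"doc": doc}}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=payload,  # a request with a body is sent as POST
        headers={"Content-type": "application/json"},
    )


book = {
    "id": "978-0641723445",
    "cat": ["book", "hardcover"],
    "title": "The Lightning Thief",
    "author": "Rick Riordan",
    "inStock": True,
    "price": 12.50,
}
request = build_add_request("http://localhost:8983/solr/update/json", book)
# To actually send it: urllib.request.urlopen(request)
```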

31 Query Results in CSV
    http://localhost:8983/solr/select?q=ipod&fl=name,price,cat,popularity&wt=csv
    name,price,cat,popularity
    iPod & iPod Mini USB 2.0 Cable,11.5,"electronics,connector",1
    Belkin Mobile Power Cord for iPod w/ Dock,19.95,"electronics,connector",1
    Apple 60 GB iPod with Video Playback Black,399.0,"electronics,music",10
  - Can handle multi-valued fields (see "cat" field in example)
  - Completely compatible with the CSV update handler (can round-trip)
  - Results are streamed: good for dumping entire parts of the index
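A sketch of reading that CSV output on the client side with Python's csv module. It assumes the multi-valued separator is a comma inside a quoted cell, as in the example, and does not handle escaped separators; `parse_solr_csv` is a hypothetical helper.

```python
import csv
import io


def parse_solr_csv(text, multi_valued=("cat",)):
    """Parse CSV results into dicts, splitting the named multi-valued fields."""
    rows = []
    for row in csv.DictReader(io.StringIO(text)):
        for field in multi_valued:
            if row.get(field):
                row[field] = row[field].split(",")
        rows.append(row)
    return rows


csv_text = (
    "name,price,cat,popularity\n"
    'iPod & iPod Mini USB 2.0 Cable,11.5,"electronics,connector",1\n'
    'Belkin Mobile Power Cord for iPod w/ Dock,19.95,"electronics,connector",1\n'
    'Apple 60 GB iPod with Video Playback Black,399.0,"electronics,music",10\n'
)
csv_rows = parse_solr_csv(csv_text)
```

All values come back as strings, so numeric fields such as price would still need converting before use.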

32 http://localhost:8983/solr/browse

33 Q&A

