Powerful Full-Text Search with Solr
Yonik Seeley, Web 2.0 Expo, Berlin, 8 November 2007


1 Powerful Full-Text Search with Solr
Yonik Seeley, yonik@apache.org
Web 2.0 Expo, Berlin, 8 November 2007
Download at http://www.apache.org/~yonik

2 What is Lucene
High-performance, scalable, full-text search library
Focus: indexing + searching documents
  – A "document" is just a list of name+value pairs
No crawlers or document parsing
Flexible text analysis (tokenizers + token filters)
100% Java, no dependencies, no config files

3 What is Solr
A full-text search server based on Lucene
XML/HTTP, JSON interfaces
Faceted search (category counting)
Flexible data schema to define types and fields
Hit highlighting
Configurable advanced caching
Index replication
Extensible open architecture, plugins
Web administration interface
Written in Java 5, deployable as a WAR

4 Basic App
[Architecture diagram] An indexer posts documents to http://solr/update; an HTML webapp sends queries to http://solr/select and receives a response of matching docs.
Example document: super_name: Mr. Fantastic, name: Reed Richards, category: superhero, powers: elasticity
Example query: powers:agility
Diagram labels: servlet container, Solr, admin, update, select, standard request handler, custom request handler, XML response writer, JSON response writer, XML update handler, CSV update handler, Lucene.

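5 Indexing Data
HTTP POST to http://localhost:8983/solr/update
    <add><doc>
      <field name="id">05991</field>
      <field name="name">Peter Parker</field>
      <field name="supername">Spider-Man</field>
      <field name="category">superhero</field>
      <field name="powers">agility</field>
      <field name="powers">spider-sense</field>
    </doc></add>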
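
An update message like the one on this slide could be generated rather than written by hand. The following is a minimal sketch using only the Python standard library; the field names (id, name, supername, category, powers) are assumptions taken from the other slides, and build_add_message is a hypothetical helper, not part of any Solr client.

```python
import xml.etree.ElementTree as ET


def build_add_message(doc):
    """Build an <add><doc>...</doc></add> message from a dict.

    A list value produces one <field> element per item, which is how a
    multi-valued field such as "powers" is expressed.
    """
    add = ET.Element("add")
    doc_el = ET.SubElement(add, "doc")
    for name, value in doc.items():
        values = value if isinstance(value, list) else [value]
        for item in values:
            field = ET.SubElement(doc_el, "field", name=name)
            field.text = str(item)
    return ET.tostring(add, encoding="unicode")


# Field names are assumed; the values come from the slide.
message = build_add_message({
    "id": "05991",
    "name": "Peter Parker",
    "supername": "Spider-Man",
    "category": "superhero",
    "powers": ["agility", "spider-sense"],
})
print(message)
```

The resulting string would be the body of the HTTP POST to the update URL.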

6 Indexing CSV data
    Iron Man, Tony Stark, superhero, powered armor | flight
    Sandman, William Baker|Flint Marko, supervillain, sand transform
    Wolverine,James Howlett|Logan, superhero, healing|adamantium
    Magneto, Erik Lehnsherr, supervillain, magnetism|electricity
http://localhost:8983/solr/update/csv?
    fieldnames=supername,name,category,powers
    &separator=,
    &f.name.split=true&f.name.separator=|
    &f.powers.split=true&f.powers.separator=|
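
To make the effect of the split parameters concrete, here is a small local simulation in Python of what the request asks Solr to do: name and powers are split on "|" into multiple values, while the other fields stay single-valued. This is only an illustration and not Solr's CSV loader; stripping the surrounding whitespace is a choice made by the sketch.

```python
import csv
import io

FIELDNAMES = ["supername", "name", "category", "powers"]
# Mirrors f.name.split / f.name.separator and f.powers.split / f.powers.separator.
SPLIT_FIELDS = {"name": "|", "powers": "|"}


def parse_rows(text):
    """Turn CSV rows into dicts, splitting the multi-valued fields."""
    docs = []
    for row in csv.reader(io.StringIO(text), delimiter=","):
        doc = {}
        for field, raw in zip(FIELDNAMES, row):
            if field in SPLIT_FIELDS:
                doc[field] = [v.strip() for v in raw.split(SPLIT_FIELDS[field])]
            else:
                doc[field] = raw.strip()
        docs.append(doc)
    return docs


CSV_TEXT = (
    "Iron Man, Tony Stark, superhero, powered armor | flight\n"
    "Wolverine,James Howlett|Logan, superhero, healing|adamantium\n"
)
docs = parse_rows(CSV_TEXT)
for doc in docs:
    print(doc)
```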

7 Data upload methods
    URL=http://localhost:8983/solr/update/csv
HTTP POST body (curl, HttpClient, etc.)
    curl $URL -H 'Content-type:text/plain; charset=utf-8' --data-binary @info.csv
Multi-part file upload (browsers)
Request parameter
    ?stream.body='Cyclops, Scott Summers,…'
Streaming from URL (must enable)
    ?stream.url=file://data/info.csv
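
The first method, an HTTP POST body, can also be sketched with Python's standard library as a rough equivalent of the curl command. build_csv_post is a hypothetical helper, and the sample row is reused from the CSV slide; actually sending the request assumes a Solr instance listening at the URL shown.

```python
import urllib.request

UPDATE_CSV_URL = "http://localhost:8983/solr/update/csv"


def build_csv_post(csv_text, url=UPDATE_CSV_URL):
    """Prepare (but do not send) a POST carrying CSV text as the body."""
    return urllib.request.Request(
        url,
        data=csv_text.encode("utf-8"),
        headers={"Content-type": "text/plain; charset=utf-8"},
        method="POST",
    )


request = build_csv_post(
    "Magneto, Erik Lehnsherr, supervillain, magnetism|electricity\n"
)

# Sending requires a running Solr instance:
# with urllib.request.urlopen(request) as response:
#     print(response.read().decode("utf-8"))
```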

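8 Indexing with SolrJ
    // Solr's Java Client API… remote or embedded/local!
    import org.apache.solr.client.solrj.SolrServer;
    import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
    import org.apache.solr.common.SolrInputDocument;
    SolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr");
    SolrInputDocument doc = new SolrInputDocument();
    doc.addField("supername", "Daredevil");
    doc.addField("name", "Matt Murdock");
    doc.addField("category", "superhero");
    server.add(doc);     // send the document to Solr
    server.commit();     // make it visible to searches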

9 Deleting Documents
Delete by Id, most efficient
    <delete><id>05591</id></delete>
    <delete><id>32552</id></delete>
Delete by Query
    <delete><query>category:supervillain</query></delete>

10 Commit
<commit/> makes changes visible
  – Triggers static cache warming in solrconfig.xml
  – Triggers autowarming from existing caches
<optimize/> same as commit, but also merges all index segments for faster searching
Lucene index segments:
    _0.fnm _0.fdt _0.fdx _0.frq _0.tis _0.tii _0.prx _0.nrm _0_1.del
    _1.fnm _1.fdt _1.fdx […]

11 Searching
http://localhost:8983/solr/select?q=powers:agility
    &start=0&rows=2&fl=supername,category
Response (matching docs):
    supername: Spider-Man, category: superhero
    supername: Mystique, category: supervillain
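
When a select URL like this is assembled in code, the parameter values should be URL-encoded. A minimal Python sketch, where build_select_url is a hypothetical helper and the base URL is the one used on the slides:

```python
from urllib.parse import urlencode


def build_select_url(base, **params):
    """Append URL-encoded query parameters to the /select endpoint."""
    return base + "/select?" + urlencode(params)


url = build_select_url(
    "http://localhost:8983/solr",
    q="powers:agility",
    start=0,
    rows=2,
    fl="supername,category",
)
print(url)
```

The colon and comma come out percent-encoded (%3A and %2C), which is equivalent to the readable form shown on the slide.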

12 Response Format
Add &wt=json for a JSON-formatted response
    {"response": {"numFound":427, "start":0,
      "docs": [
        {"supername":"Spider-Man", "category":"superhero"},
        {"supername":"Mystique", "category":"supervillain"}
      ]}
    }
Also Python, Ruby, PHP, SerializedPHP, XSLT
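
A client can consume the JSON response with any JSON parser. The sketch below parses a literal copy of the example in Python; it assumes the top-level key is "response", as in the faceting and highlighting responses later in this deck.

```python
import json

# Literal sample standing in for the body returned by Solr with wt=json.
raw = """
{"response": {"numFound": 427, "start": 0,
  "docs": [
    {"supername": "Spider-Man", "category": "superhero"},
    {"supername": "Mystique", "category": "supervillain"}
  ]}}
"""

result = json.loads(raw)["response"]
names = [doc["supername"] for doc in result["docs"]]
print(result["numFound"], names)
```

Note that numFound is the total number of matches, while docs only holds the page selected by start and rows.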

13 Scoring
Query results are sorted by score, descending
VSM – Vector Space Model
tf – term frequency: number of matching terms in field
lengthNorm – number of tokens in field
idf – inverse document frequency
coord – coordination factor, number of matching terms
document boost
query clause boost
http://lucene.apache.org/java/docs/scoring.html

14 Explain
http://solr/select?q=super fast&indent=on&debugQuery=on
    0.16389132 = (MATCH) product of:
      0.32778263 = (MATCH) sum of:
        0.32778263 = (MATCH) weight(text:fast in 6), product of:
          0.5012072 = queryWeight(text:fast), product of:
            2.466337 = idf(docFreq=5)
            0.20321926 = queryNorm
          0.65398633 = (MATCH) fieldWeight(text:fast in 6), product of:
            1.4142135 = tf(termFreq(text:fast)=2)
            2.466337 = idf(docFreq=5)
            0.1875 = fieldNorm(field=fast, doc=6)
      0.5 = coord(1/2)
    0.1365761 = (MATCH) product of:
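
The first explanation can be checked by multiplying its leaf values back together. The Python sketch below does that arithmetic; the only assumption is that tf is the square root of the term frequency, which is consistent with the 1.4142135 reported for termFreq=2.

```python
import math

# Leaf values copied from the explain output for text:fast in doc 6.
idf = 2.466337
query_norm = 0.20321926
term_freq = 2
field_norm = 0.1875
coord = 1 / 2  # one of the two query terms ("super", "fast") matched

tf = math.sqrt(term_freq)             # assumed formula, matches 1.4142135
query_weight = idf * query_norm       # queryWeight(text:fast)
field_weight = tf * idf * field_norm  # fieldWeight(text:fast in 6)
term_weight = query_weight * field_weight
score = term_weight * coord           # the "sum of" has a single clause

print(query_weight, field_weight, term_weight, score)
```

Because only "fast" matched, the coord factor of 1/2 halves the score, which is why the final value is half of the term weight.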

15 Lucene Query Syntax
1. justice league
   Equiv: justice OR league
   QueryParser default operator is "OR"/optional
2. +justice +league -name:aquaman
   Equiv: justice AND league NOT name:aquaman
3. "justice league" -name:aquaman
4. title:spiderman^10 description:spiderman
5. description:"spiderman movie"~100

16 Lucene Query Examples 2
1. releaseDate:[2000 TO 2007]
2. Wildcard searches: sup?r, su*r, super*
3. spider~
   Fuzzy search: Levenshtein distance
   Optional minimum similarity: spider~0.7
4. *:*
5. (Superman AND "Lex Luthor") OR (+Batman +Joker)
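
Fuzzy search is based on the Levenshtein (edit) distance between terms, that is, the minimum number of single-character insertions, deletions and substitutions needed to turn one term into the other. The following Python sketch computes that distance with the standard dynamic-programming approach; it does not reproduce how Lucene converts a distance into the similarity value used by spider~0.7.

```python
def levenshtein(a, b):
    """Return the edit distance between strings a and b."""
    # previous[j] holds the distance between the current prefix of a and b[:j].
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            substitution_cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,                      # deletion
                current[j - 1] + 1,                   # insertion
                previous[j - 1] + substitution_cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


for candidate in ["spider", "spyder", "cider"]:
    print(candidate, levenshtein("spider", candidate))
```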

17 DisMax Query Syntax
Good for handling raw user queries
  – Balanced quotes for phrase query
  – '+' for required, '-' for prohibited
  – Separates query terms from query structure
http://solr/select?qt=dismax
    &q=super man                 // the user query
    &qf=title^3 subject^2 body   // fields to query
    &pf=title^2,body             // fields to do phrase queries
    &ps=100                      // slop for those phrase q's
    &tie=.1                      // multi-field match reward
    &mm=2                        // # of terms that should match
    &bf=popularity               // boost function

18 DisMax Query Form
The expanded Lucene Query:
    +( DisjunctionMaxQuery( title:super^3 | subject:super^2 | body:super)
       DisjunctionMaxQuery( title:man^3 | subject:man^2 | body:man) )
    DisjunctionMaxQuery(title:"super man"~100^2 body:"super man"~100)
    FunctionQuery(popularity)
Tip: set up your own request handler with default parameters to avoid clients having to specify them
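
Each DisjunctionMaxQuery combines the per-field scores for one term by taking the best one and adding the tie-breaker times the sum of the others, which is where the tie parameter comes in. A minimal Python sketch of that combination follows; the per-field scores are made-up illustrative numbers, and dismax_score is a hypothetical name.

```python
def dismax_score(field_scores, tie=0.0):
    """Best field score plus tie times the sum of the remaining scores."""
    if not field_scores:
        return 0.0
    best = max(field_scores)
    return best + tie * (sum(field_scores) - best)


# Hypothetical scores for one term in title, subject and body.
scores = [0.9, 0.3, 0.1]
print(dismax_score(scores, tie=0.0))  # pure maximum
print(dismax_score(scores, tie=0.1))  # small reward for matching more fields
print(dismax_score(scores, tie=1.0))  # plain sum of all field scores
```

With tie=0 only the best-matching field counts; raising it toward 1 rewards documents that match the term in several fields.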

19 Function Query
Allows adding a function of a field value to the score
  – Boost recently added or popular documents
Current parser only supports function notation
Example: log(sum(popularity,1))
sum, product, div, log, sqrt, abs, pow
scale(x, target_min, target_max)
  – calculates min & max of x across all docs
map(x, min, max, target)
  – useful for dealing with defaults
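
The behavior of these functions can be approximated in a few lines of Python. This is a sketch of the semantics described on the slide, not Solr's implementation: scale works on a list that stands in for the values of x across all documents, map_range is a hypothetical name for map, and treating log as base 10 is an assumption.

```python
import math


def scale(values, target_min, target_max):
    """Linearly rescale values so their min and max hit the targets."""
    low, high = min(values), max(values)
    if high == low:
        return [target_min for _ in values]
    factor = (target_max - target_min) / (high - low)
    return [target_min + (v - low) * factor for v in values]


def map_range(x, low, high, target):
    """Replace x with target when low <= x <= high, else keep x."""
    return target if low <= x <= high else x


def popularity_boost(popularity):
    """log(sum(popularity,1)), assuming a base-10 logarithm."""
    return math.log10(popularity + 1)


print(scale([0, 5, 10], 1, 2))
print(map_range(0, 0, 0, 1), map_range(7, 0, 0, 1))
print(popularity_boost(9))
```

The map_range(0, 0, 0, 1) call shows the "defaults" use: a missing value indexed as 0 is replaced by 1 before it feeds into the score.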

20 Boosted Query
Score is multiplied instead of added
  – New local params syntax added
    &q= super man
Parameter dereferencing in local params
    &q= &boost=sqrt(popularity) &userq=super man

21 Analysis & Search Relevancy
[Diagram: two analysis chains that end in "A Match!"]
Document indexing analysis: "LexCorp BFG-9000" -> WhitespaceTokenizer -> WordDelimiterFilter (catenateWords=1) -> LowercaseFilter
Query analysis: "Lex corp bfg9000" -> WhitespaceTokenizer -> WordDelimiterFilter (catenateWords=0) -> LowercaseFilter
Both chains produce the tokens lex, corp, bfg and 9000 (the indexing side also adds the catenated lexcorp), so the query matches the document.
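
The two chains on this slide can be imitated in Python to see why the query matches. This is a rough approximation: the regular expression only splits on case changes, letter/digit boundaries and punctuation, and is not Solr's WordDelimiterFilter; all function names here are hypothetical.

```python
import re

# Word parts: an uppercase run, a (capitalized) lowercase run, or digits.
PART = re.compile(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|\d+")


def whitespace_tokenize(text):
    return text.split()


def word_delimiter(tokens, catenate_words):
    """Split tokens into parts; optionally add joined runs of word parts."""
    out = []
    for token in tokens:
        parts = PART.findall(token)
        out.extend(parts)
        run = []
        for part in parts + [""]:  # the empty sentinel flushes the last run
            if part.isalpha():
                run.append(part)
            else:
                if catenate_words and len(run) > 1:
                    out.append("".join(run))
                run = []
    return out


def lowercase(tokens):
    return [token.lower() for token in tokens]


def analyze(text, catenate_words):
    return lowercase(word_delimiter(whitespace_tokenize(text), catenate_words))


index_tokens = analyze("LexCorp BFG-9000", catenate_words=True)
query_tokens = analyze("Lex corp bfg9000", catenate_words=False)
print(index_tokens)
print(query_tokens)
print(set(query_tokens) <= set(index_tokens))
```

Catenating on the indexing side also lets a query for "lexcorp", typed as one word, find the same document.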

22 Configuring Relevancy
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"/>
    <filter class="solr.StopFilterFactory" words="stopwords.txt"/>
    <filter class="solr.EnglishPorterFilterFactory" protected="protwords.txt"/>

23 Field Definitions
Field attributes: name, type, indexed, stored, multiValued, omitNorms, termVectors
Dynamic fields

24 copyField
Copies one field to another at index time
Use case #1: Analyze the same field in different ways
  – copy into a field with a different analyzer
  – boost exact-case, exact-punctuation matches
  – language translations, thesaurus, soundex
Use case #2: Index multiple fields into a single searchable field


28 Facet Query
http://solr/select?q=foo&wt=json&indent=on
    &facet=true&facet.field=cat
    &facet.query=price:[0 TO 100]
    &facet.query=manu:IBM
    {"response":{"numFound":26,"start":0,"docs":[…]},
     "facet_counts":{
      "facet_queries":{
        "price:[0 TO 100]":6,
        "manu:IBM":2},
      "facet_fields":{
        "cat":[ "electronics",14, "memory",3, "card",2, "connector",2]
      }}}
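
In this response the counts for a facet field come back as one flat list that alternates value and count. A client would typically pair them up; a minimal Python sketch, with facet_counts_to_dict as a hypothetical helper name:

```python
def facet_counts_to_dict(flat):
    """Pair up [value, count, value, count, ...] into {value: count}."""
    return dict(zip(flat[0::2], flat[1::2]))


cat = ["electronics", 14, "memory", 3, "card", 2, "connector", 2]
counts = facet_counts_to_dict(cat)
print(counts)
```

Python dicts keep insertion order, so the values stay in the order Solr returned them.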

29 Filters
Filters are restrictions in addition to the query
Use in faceting to narrow the results
Filters are cached separately for speed
1. User queries for memory; query sent to Solr is
       &q=memory&fq=inStock:true&facet=true&…
2. User selects 1GB memory size
       &q=memory&fq=inStock:true&fq=size:1GB&…
3. User selects DDR2 memory type
       &q=memory&fq=inStock:true&fq=size:1GB&fq=type:DDR2&…
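
The drill-down steps amount to keeping the user's query fixed and appending one fq parameter per selected facet value. A Python sketch of building such a query string, where drill_down_query is a hypothetical helper:

```python
from urllib.parse import urlencode


def drill_down_query(q, filters):
    """Build the query string: one q, then one fq per active filter."""
    params = [("q", q)] + [("fq", f) for f in filters] + [("facet", "true")]
    return urlencode(params)


filters = ["inStock:true"]
print(drill_down_query("memory", filters))

filters.append("size:1GB")   # user selects the 1GB memory size
print(drill_down_query("memory", filters))

filters.append("type:DDR2")  # user selects the DDR2 memory type
print(drill_down_query("memory", filters))
```

Keeping each restriction in its own fq, rather than folding them into q, is what lets each filter be cached and reused separately.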

30 Highlighting
http://solr/select?q=lcd&wt=json&indent=on
    &hl=true&hl.fl=features
    {"response":{"numFound":5,"start":0,"docs":[
        {"id":"3007WFP", "price":899.95}, …]},
     "highlighting":{
      "3007WFP":{
        "features":["30\" TFT active matrix LCD, 2560 x 1600"]},
      "VA902B":{
        "features":["19\" TFT active matrix LCD, 8ms response time, 1280 x 1024 native resolution"]}}}

31 MoreLikeThis
Selects documents that are "similar" to the documents matching the main query.
    &q=id:6H500F0
    &mlt=true&mlt.fl=name,cat,features
    "moreLikeThis":{
      "6H500F0":{"numFound":5,"start":0,
        "docs": [
          {"name":"Apple 60 GB iPod with Video Playback Black",
           "price":399.0, "inStock":true, "popularity":10, […] },
          […] ]
      […]

32 High Availability
[Architecture diagram] Labels: HTTP search requests, load balancer, appservers (dynamic HTML generation), Solr searchers, index replication, Solr master, updates, DB, updater, admin queries, admin terminal.

33 Resources
WWW
  – http://lucene.apache.org/solr
  – http://lucene.apache.org/solr/tutorial.html
  – http://wiki.apache.org/solr/
Mailing Lists
  – solr-user-subscribe@lucene.apache.org
  – solr-dev-subscribe@lucene.apache.org

