Lucene/SOLR 2: Lucene search API
Starter: Searcher, Term, Sort, Filter
Main course: Query, Similarity, QueryParser
Dessert: Hits, Highlighter, SpellChecker
TU Delft Library, Digitale Productontwikkeling
Egbert Gramsbergen

org.apache.lucene.search.Searcher
(Informal UML class diagram. Legend: class; * constructor; --- argument; --> return value; ... optional; ([]) single value or array.)
Searcher
  doc (int i) --> Document
  docFreq (Term ([])) --> int ([])
  explain (Query, int doc) --> Explanation
  search (Query; optional: Filter, Sort) --> Hits
  getSimilarity, setSimilarity (Similarity)
  + lower-level methods (performance tuning)

org.apache.lucene.search.Searcher
Searcher subclasses
  IndexSearcher; constructor: Directory, String path or IndexReader
  MultiSearcher; constructor: Searchable[]
  ParallelMultiSearcher; constructor: Searchable[]
Directory: FSDirectory, RAMDirectory, DbDirectory, JEDirectory
IndexReader: FilterIndexReader, MultiReader, ParallelReader
Searchable: Searcher, RemoteSearchable

org.apache.lucene.index.Term
Term
  Constructor: String field, String text
  createTerm --> Term
  field --> String
  text --> String
  compareTo --> int
Use: among other things, the building block of Query and Filter.

org.apache.lucene.search.Sort
Sort
  Constructor and setSort take either SortField ([]) or String field ([]), boolean reverse
  getSort --> SortField[]
SortField
  Constructor: String field; then int type, Locale or SortComparatorSource; boolean reverse
  int type: AUTO, CUSTOM, DOC, SCORE, INT, LONG, FLOAT, DOUBLE, STRING
Locale
  Constructor: String language, String country, String variant
N.B. Lucene has no strongly typed fields; SOLR does.

org.apache.lucene.search.Filter
Filter subclasses: BooleanFilter, ChainedFilter, DuplicateFilter, PrefixFilter, QueryWrapperFilter, RangeFilter, SpanFilter, CachingWrapperFilter, TermsFilter, more…
Use: for example in faceted search.
Example: TermsFilter
  Constructor
  addTerm (Term)

org.apache.lucene.search.Query
Query subclasses:
  TermQuery, BooleanQuery, PhraseQuery, MultiPhraseQuery, PrefixQuery, RangeQuery
  MultiTermQuery, FuzzyQuery, WildcardQuery, RegexQuery
  SpanQuery, SpanFirstQuery, SpanNearQuery, SpanNotQuery, SpanOrQuery, SpanRegexQuery, SpanTermQuery, BoostingTermQuery
  BoostingQuery, ConstantScoreQuery, ConstantScoreRangeQuery, DisjunctionMaxQuery, FilteredQuery, MatchAllDocsQuery
  FuzzyLikeThisQuery, MoreLikeThisQuery
  ValueSourceQuery, FieldScoreQuery, CustomScoreQuery

org.apache.lucene.search.Query
Query
  setBoost (float boost)
  getBoost --> float
  rewrite (IndexReader)
TermQuery
  Constructor: Term
  getTerm --> Term
PhraseQuery
  Constructor
  add (Term; optional: int position)
  getTerms --> Term[]
  setSlop (int slop)

org.apache.lucene.search.BooleanQuery
BooleanQuery
  Constructor: optional boolean disableCoord
  add (BooleanClause), add (Query, BooleanClause.Occur)
  getClauses --> BooleanClause[]
  setMinimumNumberShouldMatch (int)
BooleanClause.Occur: MUST, MUST_NOT, SHOULD
and/or-ish query, example:
    BooleanQuery bq;
    float andNess = 0.5f; // 0.0: OR (default), 1.0: AND
    …
    BooleanClause[] clauses = bq.getClauses();
    int numOpt = 0; // number of optional (SHOULD) clauses
    for (int i = 0; i < clauses.length; i++) {
        if (clauses[i].getOccur() == BooleanClause.Occur.SHOULD) numOpt++;
    }
    bq.setMinimumNumberShouldMatch(Math.round(numOpt * andNess));
    // NOTE: if there is no MUST clause, at least 1 SHOULD clause must match

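The and-ness calculation on this slide can be tried without an index. The sketch below is plain Java with no Lucene classes: the Occur enum and the class and method names are hypothetical stand-ins, and only the arithmetic (rounding the number of SHOULD clauses times the and-ness) is taken from the slide.

```java
import java.util.Arrays;
import java.util.List;

/** Stand-alone sketch of the and-ness calculation; no Lucene classes are used. */
class AndNessSketch {

    /** Hypothetical stand-in for BooleanClause.Occur. */
    enum Occur { MUST, MUST_NOT, SHOULD }

    /**
     * Number of SHOULD clauses that have to match for a given and-ness:
     * 0.0 behaves like OR, 1.0 behaves like AND.
     */
    static int minimumShouldMatch(List<Occur> clauses, float andNess) {
        int numOptional = 0;
        for (Occur occur : clauses) {
            if (occur == Occur.SHOULD) {
                numOptional++;
            }
        }
        return Math.round(numOptional * andNess);
    }

    public static void main(String[] args) {
        // Four optional clauses and one required clause.
        List<Occur> clauses = Arrays.asList(
                Occur.SHOULD, Occur.SHOULD, Occur.SHOULD, Occur.SHOULD, Occur.MUST);
        System.out.println(minimumShouldMatch(clauses, 0.0f)); // prints 0
        System.out.println(minimumShouldMatch(clauses, 0.5f)); // prints 2
        System.out.println(minimumShouldMatch(clauses, 1.0f)); // prints 4
    }
}
```

In a real application the returned value would be passed to setMinimumNumberShouldMatch, as in the slide's example.
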
org.apache.lucene.search.function.CustomScoreQuery
CustomScoreQuery
  Constructor: Query, ValueSourceQuery ([])
  Override the scoring: (int doc, float subQueryScore, float ([]) valSrcScore(s)) --> float
  Default: subQueryScore * valSrcScores[0] * valSrcScores[1] * …
FieldScoreQuery (a ValueSourceQuery)
  Constructor: String field, FieldScoreQuery.Type
  FieldScoreQuery.Type: BYTE, SHORT, INT, FLOAT
Use cases:
* Weighting by publication type + year (library)
* Geographic proximity (search "pizza")
Publication year: score = 1 + a/(1 + τ), τ = (t - tp)/t0
(Plot: score as a function of t - tp, with axis marks a, 1 and t0.)

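The publication-year formula can be checked numerically. The sketch below is plain Java and does not use CustomScoreQuery. The slide does not define the symbols, so their reading here is an assumption: t is taken as the current year, tp as the publication year, t0 as a time scale in years, and a as the extra weight given to a brand-new publication. The class and method names are hypothetical.

```java
/** Stand-alone sketch of the slide's formula: score = 1 + a / (1 + tau), tau = (t - tp) / t0. */
class PubYearBoostSketch {

    /**
     * @param a  extra weight for a brand-new publication (assumed meaning)
     * @param t  current year (assumed meaning)
     * @param tp publication year (assumed meaning)
     * @param t0 time scale in years (assumed meaning)
     */
    static double boost(double a, double t, double tp, double t0) {
        double tau = (t - tp) / t0;
        return 1.0 + a / (1.0 + tau);
    }

    public static void main(String[] args) {
        // With a = 1 and t0 = 5: a new publication scores 1 + a, an old one tends to 1.
        System.out.println(boost(1.0, 2008, 2008, 5.0)); // prints 2.0
        System.out.println(boost(1.0, 2008, 2003, 5.0)); // prints 1.5
        System.out.println(boost(1.0, 2008, 1958, 5.0));
    }
}
```

Under these assumptions the factor falls from 1 + a for a new publication towards 1 for very old ones, and t0 is the age at which the extra weight has halved.
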
org.apache.lucene.search.Similarity
This is where the real work is done:
http://lucene.apache.org/java/2_3_0/api/org/apache/lucene/search/Similarity.html
Query, Document --> Score according to the Vector Space model

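For readers new to the Vector Space model, here is a minimal sketch of its core idea: query and document are treated as term-frequency vectors and scored by the cosine of the angle between them. This is plain Java and is not Lucene's scoring code. The Similarity class linked above has further factors (such as tf, idf, length normalization and coord) that are not modelled here, and all names in the sketch are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

/** Stand-alone sketch of vector space scoring with raw term frequencies. */
class VectorSpaceSketch {

    /** Counts how often each whitespace-separated, lower-cased token occurs. */
    static Map<String, Integer> termFrequencies(String text) {
        Map<String, Integer> tf = new HashMap<>();
        for (String token : text.toLowerCase().split("\\s+")) {
            if (!token.isEmpty()) {
                tf.merge(token, 1, Integer::sum);
            }
        }
        return tf;
    }

    /** Euclidean length of a term-frequency vector. */
    static double norm(Map<String, Integer> vector) {
        double sumOfSquares = 0.0;
        for (int count : vector.values()) {
            sumOfSquares += (double) count * count;
        }
        return Math.sqrt(sumOfSquares);
    }

    /** Cosine similarity: dot product divided by the product of the vector lengths. */
    static double cosine(Map<String, Integer> a, Map<String, Integer> b) {
        double dot = 0.0;
        for (Map.Entry<String, Integer> entry : a.entrySet()) {
            Integer other = b.get(entry.getKey());
            if (other != null) {
                dot += (double) entry.getValue() * other;
            }
        }
        double lengths = norm(a) * norm(b);
        return lengths == 0.0 ? 0.0 : dot / lengths;
    }

    public static void main(String[] args) {
        Map<String, Integer> query = termFrequencies("lucene search");
        System.out.println(cosine(query, termFrequencies("Lucene search API")));
        System.out.println(cosine(query, termFrequencies("pizza delivery"))); // prints 0.0
    }
}
```
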
org.apache.lucene.queryParser.QueryParser
String --> Query (hooray!)
Grammar notation: ::= definition; () nesting; * repetition; [] optional; | or
  Query ::= ( Clause )*
  Clause ::= ["+"|"-"] [<TERM> ":"] ( <TERM> | "(" Query ")" )
  where "+" is AND, "-" is NOT, [<TERM> ":"] is the field, <TERM> is a single term or phrase, and "(" Query ")" is a nested query.
Examples:
  aaa bbb ccc
  +aaa bbb -ccc
  "aaa bbb"
  title:aaa
  title:(+aaa bbb) AND author:"ddd e f"
  aaa* bb*b cc?c
  year:[2000 TO 2005] (inclusive)
  price:{020 TO 100} (exclusive)
  aaa^3 bbb (boost)
  "aaa bbb"^0.5
  1\+1 (\ is the escape character)
  aaa~0.8 (fuzzy / minimum similarity)
  "aaa bbb"~10 (proximity / slop)
Query terms also pass through the Analyzer.
Ranges compare as Strings: Lucene knows only Strings, hence 020 rather than 20 as the lower bound below 100. SOLR has strongly typed fields!
NOT possible: "aaa* bbb" (wildcard inside a phrase)
NOT possible: *aaa, ?aaa (leading wildcard)

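The range example on this slide writes the lower price bound as 020 because range bounds compare as Strings. The sketch below is plain Java and does not call Lucene; it shows how plain String comparison behaves with and without zero padding. The class and method names are hypothetical, and fixed-width zero padding is only one possible convention.

```java
import java.util.Locale;

/** Stand-alone sketch: why numeric values need zero padding when ranges compare Strings. */
class StringRangeSketch {

    /** True if lower < value < upper in String order (exclusive bounds, as in {lower TO upper}). */
    static boolean inExclusiveRange(String value, String lower, String upper) {
        return value.compareTo(lower) > 0 && value.compareTo(upper) < 0;
    }

    /** Left-pads a non-negative number with zeros to a fixed width. */
    static String pad(int number, int width) {
        return String.format(Locale.ROOT, "%0" + width + "d", number);
    }

    public static void main(String[] args) {
        // Unpadded: "50" sorts after "100" because '5' > '1'.
        System.out.println(inExclusiveRange("50", "20", "100")); // prints false
        // Padded to three digits, String order agrees with numeric order.
        System.out.println(inExclusiveRange(pad(50, 3), pad(20, 3), pad(100, 3))); // prints true
    }
}
```

The same padding has to be applied at index time and at query time, otherwise indexed terms and range bounds will not line up.
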
org.apache.lucene.queryParser.QueryParser
Not every Query can be built by QueryParser (too complex, or to protect performance):
* "New Yor*"
* *ork
* "New York" within 10 words of "Broadway" and at most 5 words after the start of the field
Not every Query wants to be built by QueryParser. Design the interface, for example:
* free-text input (run through QueryParser)
* separate interface elements for:
  * fields
  * ranges
  * facets, more like this, …

org.apache.lucene.queryParser.QueryParser
QueryParser
  Constructor: String defaultField, Analyzer
  parse (String query) --> Query
  setDefaultOperator (QueryParser.Operator: AND_OPERATOR, OR_OPERATOR)
  setPhraseSlop (int)
  setFuzzyMinSim (float)
  …
Analyzer: StandardAnalyzer, RussianAnalyzer, BrazilianAnalyzer, DutchAnalyzer, …
  Constructor, optional stop words: File stopwords, String[] stopwords or HashSet stopwords

org.apache.lucene.search.Hits
Searcher.search --> Hits
Hits
  doc (int n) --> Document
  score (int n) --> float
  iterator --> HitIterator
  length --> int
HitIterator
  next --> Hit
  hasNext --> boolean
  length --> int
Hit
  getDocument --> Document
  getScore --> float
Document
  get (String fieldName) --> String value
  getFields --> List fields
  …
Field
  name, getValue, …
N.B. Use HitCollector (low-level API) for large numbers of hits.

org.apache.lucene.search.highlight.Highlighter
Highlighter
  Constructor: optional Formatter, Scorer (fragmentScorer)
  setTextFragmenter (Fragmenter)
  getBestFragments (Analyzer, String fieldName, String text, int maxNumFragments) --> String[] bestFragments
  …
Scorer: QueryScorer
  Constructor: Query; optional IndexReader, String fieldName
Formatter: SimpleHTMLFormatter
  Constructor: String preTag, String postTag
Formatter: GradientFormatter, SpanGradientFormatter
  Constructor: float maxScore, String minForegroundcolor, String maxForegroundcolor, String minBackgroundcolor, String maxBackgroundcolor
Fragmenter: SimpleFragmenter
  Constructor: int fragmentSize

org.apache.lucene.search.spell.SpellChecker
Works on an N-gram index.
SpellChecker
  Constructor: Directory (spellIndex)
  indexDictionary (Dictionary)
  suggestSimilar (String word, int numSug; optional: IndexReader, String field, boolean morePopular) --> String[] words
  setAccuracy (float minScore)
  …
Dictionary
  PlainTextDictionary; constructor: File, InputStream or Reader
  LuceneDictionary; constructor: IndexReader, String field
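
To make the term "N-gram index" concrete, the sketch below splits words into character n-grams and measures how many distinct n-grams two words share. It is plain Java and does not reproduce how SpellChecker builds its index or ranks suggestions. The overlap measure (shared n-grams divided by all distinct n-grams) is a simple illustration chosen here, and all names are hypothetical.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Stand-alone sketch of character n-grams as a basis for finding similar words. */
class NGramSketch {

    /** All substrings of length n, from left to right; empty if the word is shorter than n. */
    static List<String> ngrams(String word, int n) {
        List<String> grams = new ArrayList<>();
        for (int i = 0; i + n <= word.length(); i++) {
            grams.add(word.substring(i, i + n));
        }
        return grams;
    }

    /** Shared distinct n-grams divided by all distinct n-grams of both words (0.0 to 1.0). */
    static double overlap(String a, String b, int n) {
        Set<String> gramsA = new HashSet<>(ngrams(a, n));
        Set<String> gramsB = new HashSet<>(ngrams(b, n));
        Set<String> union = new HashSet<>(gramsA);
        union.addAll(gramsB);
        if (union.isEmpty()) {
            return 0.0;
        }
        Set<String> common = new HashSet<>(gramsA);
        common.retainAll(gramsB);
        return (double) common.size() / union.size();
    }

    public static void main(String[] args) {
        System.out.println(ngrams("lucene", 3)); // prints [luc, uce, cen, ene]
        System.out.println(overlap("lucene", "lucine", 3));
        System.out.println(overlap("lucene", "pizza", 3)); // prints 0.0
    }
}
```

A misspelled word still shares some n-grams with the intended word, which is what makes an index of n-grams usable for looking up suggestion candidates.
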