Lucene/SOLR 2: Lucene search API


Menu
Starter: Searcher, Term, Sort, Filter
Main course: Query, Similarity, QueryParser
Dessert: Hits, Highlighter, SpellChecker
TU Delft Library, Digitale Productontwikkeling (Digital Product Development), Egbert Gramsbergen

org.apache.lucene.search.Searcher
Legend ("bastardized" UML class diagram): class, * constructor, methods, --- argument, --> return value, ... optional, ([]) also available as array.
Searcher methods:
  doc(int i) -> Document
  docFreq(Term([])) -> int([])
  explain(Query, int doc) -> Explanation
  search(Query, optional Filter, optional Sort) -> Hits
  getSimilarity() -> Similarity
  setSimilarity(Similarity)
  + lower-level methods (performance tuning)

org.apache.lucene.search.Searcher: implementations
  IndexSearcher * (Directory | String path | IndexReader)
  MultiSearcher * (Searchable[])
  ParallelMultiSearcher * (Searchable[])
Directory: FSDirectory, RAMDirectory, DbDirectory, JEDirectory
IndexReader: FilterIndexReader, MultiReader, ParallelReader
Searchable: RemoteSearchable

org.apache.lucene.index.Term
  Term * (String field, String text)
  createTerm(String text) -> Term
  field() -> String
  text() -> String
  compareTo(Term) -> int
Use: among other things, the building block of Query and Filter.

org.apache.lucene.search.Sort
N.B. Lucene has no strongly typed fields; SOLR does.
  Sort * (SortField([])) or (String field, boolean reverse) or (String[] fields)
  setSort(same arguments as the constructor)
  getSort() -> SortField[]
  SortField * (String field, int type, boolean reverse) or (String field, Locale) or (String field, SortComparatorSource)
  SortField type (int): AUTO, CUSTOM, DOC, SCORE, INT, LONG, FLOAT, DOUBLE, STRING
  Locale * (String language, String country, String variant)

org.apache.lucene.search.Filter
Filter subclasses: BooleanFilter, ChainedFilter, DuplicateFilter, PrefixFilter, QueryWrapperFilter, RangeFilter, SpanFilter, CachingWrapperFilter, TermsFilter, more…
Use: e.g. in faceted search.
Example: TermsFilter * (), addTerm(Term)

org.apache.lucene.search.Query
Query subclasses:
  TermQuery, BooleanQuery, PhraseQuery, MultiPhraseQuery, PrefixQuery, RangeQuery
  MultiTermQuery: FuzzyQuery, WildcardQuery, RegexQuery
  SpanQuery: SpanFirstQuery, SpanNearQuery, SpanNotQuery, SpanOrQuery, SpanRegexQuery, SpanTermQuery, BoostingTermQuery
  BoostingQuery, ConstantScoreQuery, ConstantScoreRangeQuery, DisjunctionMaxQuery, FilteredQuery
  FuzzyLikeThisQuery, MoreLikeThisQuery, MatchAllDocsQuery
  ValueSourceQuery, FieldScoreQuery, CustomScoreQuery

org.apache.lucene.search.Query
  setBoost(float boost)
  getBoost() -> float
  rewrite(IndexReader) -> Query
TermQuery * (Term)
  getTerm() -> Term
PhraseQuery * ()
  add(Term, optional int position)
  getTerms() -> Term[]
  setSlop(int slop)

org.apache.lucene.search.BooleanQuery
  BooleanQuery * (boolean disableCoord)
  add(BooleanClause) or add(Query, BooleanClause.Occur)
  getClauses() -> BooleanClause[]
  setMinimumNumberShouldMatch(int) -> and/or-ish query
  BooleanClause * (Query, BooleanClause.Occur)
  BooleanClause.Occur: MUST, MUST_NOT, SHOULD

    // example: and/or-ish query
    import org.apache.lucene.search.BooleanClause;
    import org.apache.lucene.search.BooleanQuery;

    BooleanQuery bq;
    float andNess = 0.5f; // 0.0: OR (default), 1.0: AND
    …
    BooleanClause[] clauses = bq.getClauses();
    int numOpt = 0; // number of optional (SHOULD) clauses
    for (int i = 0; i < clauses.length; i++) {
        if (clauses[i].getOccur() == BooleanClause.Occur.SHOULD) numOpt++;
    }
    bq.setMinimumNumberShouldMatch(Math.round(numOpt * andNess));
    // NOTE: if there is no MUST clause, at least 1 SHOULD clause must match
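
To see what the andNess knob does in numbers, the same arithmetic can be run without Lucene. This is a minimal sketch: the class AndNessSketch and its local Occur enum are hypothetical stand-ins for illustration, not Lucene types.

```java
import java.util.Arrays;
import java.util.List;

class AndNessSketch {
    // Hypothetical stand-in for BooleanClause.Occur, so the sketch runs without Lucene.
    enum Occur { MUST, MUST_NOT, SHOULD }

    // Counts the SHOULD clauses and scales that count by andNess (0.0 = OR, 1.0 = AND).
    static int minimumShouldMatch(List<Occur> clauses, float andNess) {
        int numOpt = 0;
        for (Occur o : clauses) {
            if (o == Occur.SHOULD) numOpt++;
        }
        return Math.round(numOpt * andNess);
    }

    public static void main(String[] args) {
        List<Occur> clauses = Arrays.asList(
                Occur.SHOULD, Occur.SHOULD, Occur.SHOULD, Occur.SHOULD);
        for (float andNess : new float[] {0.0f, 0.5f, 1.0f}) {
            System.out.println("andNess " + andNess + " -> at least "
                    + minimumShouldMatch(clauses, andNess) + " of 4 SHOULD clauses");
        }
    }
}
```

With four optional clauses, andNess 0.5 asks for two of them and andNess 1.0 for all four, which is how the query slides from OR towards AND.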

org.apache.lucene.search.function.CustomScoreQuery
  CustomScoreQuery * (Query subQuery, ValueSourceQuery([]))
  override: customScore(int doc, float subQueryScore, float([]) valSrcScore(s)) -> float
  FieldScoreQuery * (String field, FieldScoreQuery.Type)
  FieldScoreQuery.Type: BYTE, SHORT, INT, FLOAT
Use cases:
  * Weighting publication type + year (library)
  * Geographic proximity (search "pizza")
Default: subQueryScore * valSrcScores[0] * valSrcScores[1] * …
Publication year: score = 1 + a/(1 + τ), τ = (t - tp)/t0
[Plot: score against t - tp, with a and 1 marked on the vertical axis and t0 on the horizontal axis]
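
The publication-year formula and the default multiplication can be tried out in isolation. The sketch below is plain Java, not a CustomScoreQuery subclass; the class and method names are hypothetical, and it assumes that t is the current year, tp the publication year and t0 a time scale in years, which the slide does not spell out.

```java
class PubYearScoreSketch {
    // score = 1 + a / (1 + tau), tau = (t - tp) / t0
    // Assumption: t = current year, tp = publication year, t0 = time scale in years.
    static double yearFactor(double a, double t, double tp, double t0) {
        double tau = (t - tp) / t0;
        return 1.0 + a / (1.0 + tau);
    }

    // Default combination from the slide: subQueryScore * valSrcScores[0] * valSrcScores[1] * ...
    static double combine(double subQueryScore, double... valSrcScores) {
        double score = subQueryScore;
        for (double v : valSrcScores) {
            score *= v;
        }
        return score;
    }

    public static void main(String[] args) {
        double a = 1.0, t0 = 5.0, now = 2008;
        for (int year : new int[] {2008, 2003, 1988}) {
            double factor = yearFactor(a, now, year, t0);
            System.out.println(year + ": factor " + factor
                    + ", combined with text score 0.8: " + combine(0.8, factor));
        }
    }
}
```

A document published this year gets the full bonus (factor 1 + a), one that is t0 years old gets half the bonus, and very old documents approach a factor of 1, so age never pushes a score below the plain text score.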

org.apache.lucene.search.Similarity
This is where the real work is done:
http://lucene.apache.org/java/2_3_0/api/org/apache/lucene/search/Similarity.html
Query, Document -> Score according to the Vector Space model
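
As a rough illustration of the kind of factors involved, the sketch below computes a single-term score from term frequency, inverse document frequency and field length. The formulas are believed to match Lucene's DefaultSimilarity, but that is an assumption to verify against the Javadoc linked above; coord, queryNorm and boosts are deliberately left out, and the class name is hypothetical.

```java
class SimilaritySketch {
    // Term frequency factor: grows with the square root of the raw frequency.
    static float tf(float freq) {
        return (float) Math.sqrt(freq);
    }

    // Inverse document frequency: rare terms weigh more than common ones.
    static float idf(int docFreq, int numDocs) {
        return (float) (Math.log(numDocs / (double) (docFreq + 1)) + 1.0);
    }

    // Length normalization: a match in a short field counts for more.
    static float lengthNorm(int numTerms) {
        return (float) (1.0 / Math.sqrt(numTerms));
    }

    // Simplified score of one query term against one document field.
    static float termScore(float freq, int docFreq, int numDocs, int numTerms) {
        float idf = idf(docFreq, numDocs);
        return tf(freq) * idf * idf * lengthNorm(numTerms);
    }

    public static void main(String[] args) {
        // Same term frequency and field length, rare term versus common term.
        System.out.println("rare term:   " + termScore(2f, 3, 1000, 50));
        System.out.println("common term: " + termScore(2f, 800, 1000, 50));
    }
}
```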

org.apache.lucene.queryParser.QueryParser
String -> Query (hooray!)
Notation: ::= definition, () nesting, * repetition, [] optional, | or
  Query  ::= ( Clause )*
  Clause ::= ["+"|"-"] [<TERM> ":"] ( <TERM> | "(" Query ")" )
  where "+" means AND (required), "-" means NOT, <TERM> ":" selects a field, <TERM> is a single term or phrase, and "(" Query ")" is a nested query.
Examples:
  aaa bbb ccc
  +aaa bbb -ccc
  "aaa bbb"
  title:aaa
  title:(+aaa bbb) AND author:"ddd e f"
  aaa* bb*b cc?c
  year:[2000 TO 2005] (inclusive)
  price:{020 TO 100} (not inclusive)
  aaa^3 bbb (boost)
  "aaa bbb"^0.5
  1\+1 (\ is the escape char)
  aaa~0.8 (fuzzy / min. similarity)
  "aaa bbb"~10 (proximity / slop)
The terms also still pass through the Analyzer.
Beware of Strings: 20 < 100 does not hold when compared as Strings, hence 020 in the price example. Lucene: Strings only; SOLR: strongly typed fields!
NOT possible: "aaa* bbb" (wildcard inside a phrase)
NOT possible: *aaa, ?aaa (leading wildcard)
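
Why the price example is written as 020 rather than 20 shows up directly in a String comparison. A minimal sketch in plain Java (no Lucene involved; the class and helper names are hypothetical, and it assumes non-negative values that fit the chosen width):

```java
class StringRangeSketch {
    // Exclusive range check on Strings, the way a range over untyped terms compares values.
    static boolean inRangeExclusive(String value, String lower, String upper) {
        return value.compareTo(lower) > 0 && value.compareTo(upper) < 0;
    }

    // Zero-pads a non-negative number so that String order equals numeric order.
    static String pad(int value, int width) {
        return String.format("%0" + width + "d", value);
    }

    public static void main(String[] args) {
        // Unpadded: "50" sorts after "100", so it falls outside {20 TO 100}.
        System.out.println(inRangeExclusive("50", "20", "100"));
        // Padded: "050" lies between "020" and "100".
        System.out.println(inRangeExclusive(pad(50, 3), pad(20, 3), pad(100, 3)));
    }
}
```

The padding has to be applied at index time as well as at query time, otherwise the indexed terms and the range bounds do not line up.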

org.apache.lucene.queryParser.QueryParser
Not every Query can be built by QueryParser (too complicated, or to protect performance):
  "New Yor*"
  *ork
  "New York" within 10 words of "Broadway" and at most 5 words from the start of the field
Not every Query should be built by QueryParser. Design the interface, e.g.:
  * free text input (run through QueryParser)
  * separate interface elements for:
    * fields
    * ranges
    * facets, more like this, …

org.apache.lucene.queryParser.QueryParser
  QueryParser * (String defaultField, Analyzer)
  parse(String query) -> Query
  setDefaultOperator(QueryParser.Operator)
  setPhraseSlop(int)
  setFuzzyMinSim(float)
  …
  QueryParser.Operator: AND_OPERATOR, OR_OPERATOR
Analyzer: StandardAnalyzer, RussianAnalyzer, BrazilianAnalyzer, DutchAnalyzer, …
  Analyzer * (File stopwords) or (String[] stopwords) or (HashSet stopwords)

org.apache.lucene.search.Hits
Returned by Searcher.search.
  Hits: doc(int n) -> Document, score(int n) -> float, iterator() -> HitIterator, length() -> int
  HitIterator: next() -> Hit, hasNext() -> boolean, length() -> int
  Hit: getDocument() -> Document, getScore() -> float
  Document: get(String fieldName) -> String value, getFields() -> List fields, …
  Field: name, getValue, …
N.B. use HitCollector (low-level API) for large numbers of hits.

org.apache.lucene.search.highlight.Highlighter
  Highlighter * (Formatter, Scorer fragmentScorer)
  setTextFragmenter(Fragmenter)
  getBestFragments(Analyzer, String fieldName, String text, int maxNumFragments) -> String[] bestFragments
  …
Scorer: QueryScorer * (Query, IndexReader, String fieldName)
Formatter:
  SimpleHTMLFormatter * (String preTag, String postTag)
  GradientFormatter * (float maxScore, String minForegroundColor, String maxForegroundColor, String minBackgroundColor, String maxBackgroundColor)
  SpanGradientFormatter
Fragmenter: SimpleFragmenter * (int fragmentSize)

org.apache.lucene.search.spell.SpellChecker
Works on an N-gram index.
  SpellChecker * (Directory spellIndex)
  indexDictionary(Dictionary)
  suggestSimilar(String word, int numSug, optional IndexReader, String field, boolean morePopular) -> String[] words
  setAccuracy(float minScore)
  …
Dictionary:
  PlainTextDictionary * (File | InputStream | Reader)
  LuceneDictionary * (IndexReader, String field)
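
The idea behind an N-gram index is that a misspelled word still shares most of its character n-grams with the intended word. The sketch below only illustrates that idea in plain Java; it is not how SpellChecker is implemented internally, its ranking of suggestions is not reproduced here, and the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class NGramSketch {
    // Splits a word into its overlapping character n-grams.
    static List<String> ngrams(String word, int n) {
        List<String> grams = new ArrayList<String>();
        for (int i = 0; i + n <= word.length(); i++) {
            grams.add(word.substring(i, i + n));
        }
        return grams;
    }

    // Number of distinct n-grams two words have in common.
    static int sharedNgrams(String a, String b, int n) {
        Set<String> shared = new HashSet<String>(ngrams(a, n));
        shared.retainAll(new HashSet<String>(ngrams(b, n)));
        return shared.size();
    }

    public static void main(String[] args) {
        System.out.println(ngrams("lucene", 3));
        System.out.println(sharedNgrams("lucene", "lucent", 3));
        System.out.println(sharedNgrams("lucene", "solr", 3));
    }
}
```

Dictionary words that share many n-grams with the query word are cheap to find in an index, which makes them natural candidates for a more careful similarity check afterwards.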