Full-Text Search with Lucene Yonik Seeley 02 May 2007 Amsterdam, Netherlands slides:

Slides:



Advertisements
Similar presentations
Lucene in action Information Retrieval A.A – P. Ferragina, U. Scaiella – – Dipartimento di Informatica – Università di Pisa –
Advertisements

Lucene/Solr Architecture
Lucene Tutorial Based on Lucene in Action Michael McCandless, Erik Hatcher, Otis Gospodnetic.
Introduction to Information Retrieval Introduction to Information Retrieval Lucene Tutorial Chris Manning and Pandu Nayak.
Chapter 5: Introduction to Information Retrieval
© Copyright 2012 STI INNSBRUCK Apache Lucene Ioan Toma based on slides from Aaron Bannert
Advanced Indexing Techniques with Apache Lucene - Payloads Advanced Indexing Techniques with Michael Busch
Advanced Indexing Techniques with
Apache Solr Yonik Seeley 29 June 2006 Dublin, Ireland.
Lecture 11 Search, Corpora Characteristics, & Lucene Introduction.
The Lucene Search Engine Kira Radinsky Modified by Amit Gross to Lucene 4 Based on the material from: Thomas Paul and Steven J. Owens.
Lucene in action Information Retrieval A.A – P. Ferragina, U. Scaiella – – Dipartimento di Informatica – Università di Pisa –
Full Text Search using Azure Search. Shankar Subramanyam Senior Consultant | Enthusiast:
The Lucene Search Engine Kira Radinsky Based on the material from: Thomas Paul and Steven J. Owens.
Introduction to Open Source Search with Apache Lucene and Solr Grant Ingersoll.
Parametric search and zone weighting Lecture 6. Recap of lecture 4 Query expansion Index construction.
Introduction to Lucene Debapriyo Majumdar Information Retrieval – Spring 2015 Indian Statistical Institute Kolkata.
Lucene in action Information Retrieval A.A – P. Ferragina, U. Scaiella – – Dipartimento di Informatica – Università di Pisa –
1 Open-Source Search Engines and Lucene/Solr UCSB 290N Tao Yang Slides are based on Y. Seeley, S. Das, C. Hostetter.
Introduction to Information Retrieval Introduction to Information Retrieval Lucene Tutorial Chris Manning, Pandu Nayak, and Prabhakar Raghavan.
Implementing search with free software An introduction to Solr By Mick England.
Full-Text Search with Lucene Yonik Seeley 02 May 2007 Amsterdam, Netherlands.
Powerful Full-Text Search with Solr Yonik Seeley Web 2.0 Expo, Berlin 8 November 2007 download at
1 Introduction to Lucene Rong Jin. What is Lucene ?  Lucene is a high performance, scalable Information Retrieval (IR) library Free, open-source project.
Apache Lucene in LexGrid. Lucene Overview High-performance, full-featured text search engine library. Written entirely in Java. An open source project.
© NYC Apache Lucene/Solr Meetup. Lucid Imagination, Inc. Agenda Welcome "Faster. Better. Solr! What to look for in Solr 1.4“ Yonik Seeley,
Nutch Search Engine Tool. Nutch overview A full-fledged web search engine Functionalities of Nutch  Internet and Intranet crawling  Parsing different.
Battle of the Giants Apache Solr 4.0 vs ElasticSearch 0.20 Rafał Kuć – sematext.com.
Softvérové knižnice a systémy Vyhľadávanie informácií Michal Laclavík.
1 Lucene Jianguo Lu School of Computer Science University of Windsor.
Lucene Open Source Search Engine. Lucene - Overview Complete search engine in a Java library Stand-alone only, no server – But can use SOLR Handles indexing.
Lucene Boot Camp I Grant Ingersoll Lucid Imagination Nov. 3, 2008 New Orleans, LA.
Lucene Performance Grant Ingersoll November 16, 2007 Atlanta, GA.
Vyhľadávanie informácií Softvérové knižnice a systémy Vyhľadávanie informácií Michal Laclavík.
Experimenting Lucene Index on HBase in an HPC Environment Xiaoming Gao Vaibhav Nachankar Judy Qiu.
University of North Texas Libraries Building Search Systems for Digital Library Collections Mark E. Phillips Texas Conference on Digital Libraries May.
Revolutionizing enterprise web development Searching with Solr.
Lucene Part1 ‏. Lucene Use Case Store data in a 2 dimensional way How do we do this. Spreadsheet Relational Database X/Y.
Indexing UMLS concepts with Apache Lucene Julien Thibault University of Utah Department of Biomedical Informatics.
HathiTrust Research Center Architecture Data subsystem.
Document Indexing and Scoring in Solr
Lucene-Demo Brian Nisonger. Intro No details about Implementation/Theory No details about Implementation/Theory See Treehouse Wiki- Lucene for additional.
“ Lucene.Net is a source code, class-per-class, API-per-API and algorithmatic port of the Java Lucene search engine to the C# and.NET ”
Iccha Sethi Serdar Aslan Team 1 Virginia Tech Information Storage and Retrieval CS 5604 Instructor: Dr. Edward Fox 10/11/2010.
Lucene Boot Camp Grant Ingersoll Lucid Imagination Nov. 4, 2008 New Orleans, LA.
Design a full-text search engine for a website based on Lucene
807 - TEXT ANALYTICS Massimo Poesio Lab 2: (Quick intro to) SOLR Document clustering with MAHOUT.
Lucene Jianguo Lu.
1 CS 8803 AIAD (Spring 2008) Project Group#22 Ajay Choudhari, Avik Sinharoy, Min Zhang, Mohit Jain Smart Seek.
Apache Solr Dima Ionut Daniel. Contents What is Apache Solr? Architecture Features Core Solr Concepts Configuration Conclusions Bibliography.
CS520 Web Programming Full Text Search Chengyu Sun California State University, Los Angeles.
HW3 Overview There are 4 components to this homework; you will possibly not need all of them; 1. Installing Ubuntu 2. Installing Solr 3. Using Solr to.
Introduction to Information Retrieval Introduction to Information Retrieval ΜΕ003-ΠΛΕ70: Ανάκτηση Πληροφορίας Διδάσκουσα: Ευαγγελία Πιτουρά Εισαγωγή στο.
Lucene : Text Search IG5 – TILE Esther Pacitti. Basic Architecture.
INFORMATION RETRIEVAL Pabitra Mitra Computer Science and Engineering IIT Kharagpur
Apache Lucene Searching the Web and Everything Else Daniel Naber Mindquarry GmbH ID 380.
Lucene Tutorial Chris Manning and Pandu Nayak
Why indexing? For efficient searching of a document
Jianguo Lu School of Computer Science University of Windsor
IST 516 Fall 2010 Dongwon Lee, Ph.D. Wonhong Nam, Ph.D.
CS276 Lucene Section.
Searching and Indexing
Building Search Systems for Digital Library Collections
Vores tankesæt: 80% teknologi | 20% forretning
Lucene in action Information Retrieval A.A
Lucene/Solr Architecture
Introduction to Elasticsearch with basics of Lucene May 2014 Meetup
Getting Started With Solr
Rafał Kuć – Sematext sematext.com
Table of Contents 1) Understanding Lucene 2) Lucene Indexing
Presentation transcript:

Full-Text Search with Lucene Yonik Seeley 02 May 2007 Amsterdam, Netherlands slides:

What is Lucene High performance, scalable, full-text search library Focus: Indexing + Searching Documents 100% Java, no dependencies, no config files No crawlers or document parsing Users: Wikipedia, Technorati, Monster.com, Nabble, TheServerSide, Akamai, SourceForge Applications: Eclipse, JIRA, Roller, OpenGrok, Nutch, Solr, many commercial products

Inverted Index aardvark hood red little riding robin women zoo Little Red Riding Hood Robin Hood Little Women

Basic Application IndexWriterIndexSearcher Lucene Index Document super_name: Spider-Man name: Peter Parker category: superhero powers: agility, spider-sense Hits (Matching Docs) Query (powers:agility) addDocument()search() 1.Get Lucene jar file 2.Write indexing code to get data and create Document objects 3.Write code to create query objects 4.Write code to use/display results

Indexing Documents IndexWriter writer = new IndexWriter(directory, analyzer, true); Document doc = new Document(); doc.add(new Field(“super_name", “Sandman", Field.Store.YES, Field.Index.TOKENIZED)); doc.add(new Field(“name", “William Baker", Field.Store.YES, Field.Index.TOKENIZED)); doc.add(new Field(“name", “Flint Marko", Field.Store.YES, Field.Index.TOKENIZED)); // [...] writer.addDocument(doc); writer.close();

Field Options Indexed –Necessary for searching or sorting Tokenized –Text analysis done before indexing Stored –You get these back on a search “hit” Compressed Binary –Currently for stored-only fields

Searching an Index IndexSearcher searcher = new IndexSearcher(directory); QueryParser parser = new QueryParser("defaultField", analyzer); Query query = parser.parse(“powers:agility"); Hits hits = searcher.search(query); System.out.println(“matches:" + hits.length()); Document doc = hits.doc(0); // look at first match System.out.println(“name=" + doc.get(“name")); searcher.close();

Scoring VSM – Vector Space Model tf – term frequency: numer of matching terms in field lengthNorm – number of tokens in field idf – inverse document frequency coord – coordination factor, number of matching terms document boost query clause boost

Query Construction Lucene QueryParser Example: queryParser.parse(“name:Spider-Man"); good human entered queries, debugging, IPC does text analysis and constructs appropriate queries not all query types supported Programmatic query construction Example: new TermQuery(new Term(“name”,”Spider-Man”)) explicit, no escaping necessary does not do text analysis for you

Query Examples 1.justice league EQUIV: justice OR league QueryParser default is “optional” 2.+justice +league –name:aquaman EQUIV: justice AND league NOT name:aquaman 3.“justice league” –name:aquaman 4.title:spiderman^10 description:spiderman 5.description:“spiderman movie”~10

Query Examples2 1.releaseDate:[2000 TO 2007] Range search: lexicographic ordering, so beware of numbers 2.Wildcard searches: sup?r, su*r, super* 3.spider~ Fuzzy search: Levenshtein distanceLevenshtein distance Optional minimum similarity: spider~0.7 4.*:* 5.(Superman AND “Lex Luthor”) OR (+Batman +Joker)

Deleting Documents IndexReader.deleteDocument(int id) –exclusive with IndexWriter –powerful Deleting with IndexWriter –deleteDocuments(Term t) –updateDocument(Term t, Document d) Deleting does not immediately reclaim space

Index Structure _0.fnm _0.fdt _0.fdx _0.frq _0.tis _0.tii _0.prx _0.nrm _0_1.del _1.fnm _1.fdt _1.fdx […] segments_3 IndexWriter params MaxBufferedDocs MergeFactor MaxMergeDocs MaxFieldLength

Performance Indexing Performance –Index documents in batches –Raise merge factor –Raise maxBufferedDocs Searching Performance –Reuse IndexSearcher –Lower merge factor –optimize –Use cached filters (see QueryFilter) ‘+superhero +lang:english’ ‘superhero’ filtered by ‘lang:english’

Analysis & Search Relevancy LexCorp BFG-9000 LexCorp BFG-9000 BFG9000LexCorp LexCorp bfg9000lexcorp lexcorp WhitespaceTokenizer WordDelimiterFilter catenateWords=1 LowercaseFilter Lex corp bfg9000 Lexbfg9000 bfg9000 Lex corp bfg9000lexcorp WhitespaceTokenizer WordDelimiterFilter catenateWords=0 LowercaseFilter Query Analysis A Match! Document Indexing Analysis corp

Tokenizers Tokenizers break field text into tokens StandardTokenizer –source string: “full-text lucene.apache.org” –“full” “text” “lucene.apache.org” WhitespaceTokenizer –“full-text” “lucene.apache.org” LetterTokenizer –“full” “text” “lucene” “apache” “org”

TokenFilters LowerCaseFilter StopFilter ISOLatin1AccentFilter SnowballFilter –stemming: reducing words to root form –rides, ride, riding => ride –country, countries => countri contrib/analyzers for other languages SynonymFilter (from Solr) WordDelimiterFilter (from Solr)

Analyzers class MyAnalyzer extends Analyzer { private Set myStopSet = StopFilter.makeStopSet(StopAnalyzer.ENGLISH_STOP_WORDS); public TokenStream tokenStream(String fieldname, Reader reader) { TokenStream ts = new StandardTokenizer(reader); ts = new StandardFilter(ts); ts = new LowerCaseFilter(ts); ts = new StopFilter(ts, myStopSet); return ts; }

Analysis Tips Use PerFieldAnalyzerWrapper Use NumberTools for numbers Add same field more than once, analyze differently –Boost exact case matches –Boost exact tense matches –Query with or without synonyms –Soundex for sounds-like queries Use explain(Query q, int docid) for debugging

Nutch Open source web search application Crawlers Link-graph database Document parsers (HTML, word, pdf, etc) Language + charset detection Utilizes Hadoop (DFS + MapReduce) for massive scalability

Solr REST XML/HTTP, JSON APIs Faceted search Flexible Data Schema Hit Highlighting Configurable Advanced Caching Replication Web admin interface Solr Flare: Ruby on Rails user interface

Het Eind Other Lucene Presentations Advanced Lucene (stay right here!) Beyond full-text searches with Solr and Lucene (Thursday 14:00) Introduction to Hadoop (Thursday 15:00) This presentation: