Lucene in action Information Retrieval A.A

Slides:



Advertisements
Similar presentations
Lucene in action Information Retrieval A.A – P. Ferragina, U. Scaiella – – Dipartimento di Informatica – Università di Pisa –
Advertisements

Assignment 2: Full text search with Lucene Mathias Mosolf, Alexander Frenzel.
Lucene/Solr Architecture
Lucene Tutorial Based on Lucene in Action Michael McCandless, Erik Hatcher, Otis Gospodnetic.
Introduction to Information Retrieval Introduction to Information Retrieval Lucene Tutorial Chris Manning and Pandu Nayak.
Chapter 5: Introduction to Information Retrieval
© Copyright 2012 STI INNSBRUCK Apache Lucene Ioan Toma based on slides from Aaron Bannert
Advanced Indexing Techniques with Apache Lucene - Payloads Advanced Indexing Techniques with Michael Busch
Advanced Indexing Techniques with
Apache Solr Yonik Seeley 29 June 2006 Dublin, Ireland.
The Lucene Search Engine Kira Radinsky Modified by Amit Gross to Lucene 4 Based on the material from: Thomas Paul and Steven J. Owens.
Lucene in action Information Retrieval A.A – P. Ferragina, U. Scaiella – – Dipartimento di Informatica – Università di Pisa –
The Lucene Search Engine Kira Radinsky Based on the material from: Thomas Paul and Steven J. Owens.
Parametric search and zone weighting Lecture 6. Recap of lecture 4 Query expansion Index construction.
INEX 2003, Germany Searching in an XML Corpus Using Content and Structure INEX 2003, Germany Yiftah Ben-Aharon, Sara Cohen, Yael Grumbach, Yaron Kanza,
Web Search – Summer Term 2006 II. Information Retrieval (Basics) (c) Wolfgang Hürst, Albert-Ludwigs-University.
Introduction to Lucene Debapriyo Majumdar Information Retrieval – Spring 2015 Indian Statistical Institute Kolkata.
Lucene in action Information Retrieval A.A – P. Ferragina, U. Scaiella – – Dipartimento di Informatica – Università di Pisa –
Introduction to Information Retrieval Introduction to Information Retrieval Lucene Tutorial Chris Manning, Pandu Nayak, and Prabhakar Raghavan.
Full-Text Search with Lucene Yonik Seeley 02 May 2007 Amsterdam, Netherlands.
Full-Text Search with Lucene Yonik Seeley 02 May 2007 Amsterdam, Netherlands slides:
1 Introduction to Lucene Rong Jin. What is Lucene ?  Lucene is a high performance, scalable Information Retrieval (IR) library Free, open-source project.
Softvérové knižnice a systémy Vyhľadávanie informácií Michal Laclavík.
1 Lucene Jianguo Lu School of Computer Science University of Windsor.
Lucene Open Source Search Engine. Lucene - Overview Complete search engine in a Java library Stand-alone only, no server – But can use SOLR Handles indexing.
Advanced Lucene Grant Ingersoll Center for Natural Language Processing ApacheCon 2005 December 12, 2005.
Lucene Boot Camp I Grant Ingersoll Lucid Imagination Nov. 3, 2008 New Orleans, LA.
Lucene Performance Grant Ingersoll November 16, 2007 Atlanta, GA.
Vyhľadávanie informácií Softvérové knižnice a systémy Vyhľadávanie informácií Michal Laclavík.
Chapter 2 Architecture of a Search Engine. Search Engine Architecture n A software architecture consists of software components, the interfaces provided.
University of North Texas Libraries Building Search Systems for Digital Library Collections Mark E. Phillips Texas Conference on Digital Libraries May.
Thanks to Bill Arms, Marti Hearst Documents. Last time Size of information –Continues to grow IR an old field, goes back to the ‘40s IR iterative process.
Lucene Part1 ‏. Lucene Use Case Store data in a 2 dimensional way How do we do this. Spreadsheet Relational Database X/Y.
Xiaoying Gao Computer Science Victoria University of Wellington Intelligent Agents COMP 423.
Indexing UMLS concepts with Apache Lucene Julien Thibault University of Utah Department of Biomedical Informatics.
Chapter 6: Information Retrieval and Web Search
Introduction to Digital Libraries hussein suleman uct cs honours 2003.
Lucene-Demo Brian Nisonger. Intro No details about Implementation/Theory No details about Implementation/Theory See Treehouse Wiki- Lucene for additional.
“ Lucene.Net is a source code, class-per-class, API-per-API and algorithmatic port of the Java Lucene search engine to the C# and.NET ”
Iccha Sethi Serdar Aslan Team 1 Virginia Tech Information Storage and Retrieval CS 5604 Instructor: Dr. Edward Fox 10/11/2010.
Searching CiteSeer Metadata Using Nutch Larry Reeve INFO624 – Information Retrieval Dr. Lin – Winter 2005.
Lucene. Lucene A open source set of Java Classses ◦ Search Engine/Document Classifier/Indexer 
Lucene Boot Camp Grant Ingersoll Lucid Imagination Nov. 4, 2008 New Orleans, LA.
Design a full-text search engine for a website based on Lucene
Lucene Jianguo Lu.
Xiaoying Gao Computer Science Victoria University of Wellington COMP307 NLP 4 Information Retrieval.
Document Parsing Paolo Ferragina Dipartimento di Informatica Università di Pisa.
CS520 Web Programming Full Text Search Chengyu Sun California State University, Los Angeles.
Introduction to Information Retrieval Introduction to Information Retrieval ΜΕ003-ΠΛΕ70: Ανάκτηση Πληροφορίας Διδάσκουσα: Ευαγγελία Πιτουρά Εισαγωγή στο.
Lucene : Text Search IG5 – TILE Esther Pacitti. Basic Architecture.
INFORMATION RETRIEVAL Pabitra Mitra Computer Science and Engineering IIT Kharagpur
Apache Lucene Searching the Web and Everything Else Daniel Naber Mindquarry GmbH ID 380.
ΠΛΕ70: Ανάκτηση Πληροφορίας
Lucene Tutorial Chris Manning and Pandu Nayak
Why indexing? For efficient searching of a document
Jianguo Lu School of Computer Science University of Windsor
Searching AND INDEXING Big data
Prof. Paolo Ferragina, Algoritmi per "Information Retrieval"
Query processing: phrase queries and positional indexes
Indexing & querying text
Query Models Use Types What do search engines do.
CS276 Lucene Section.
Searching and Indexing
Aidan Hogan CC Procesamiento Masivo de Datos Otoño 2017 Lecture 7: Information Retrieval II Aidan Hogan
Building Search Systems for Digital Library Collections
Implementation Issues & IR Systems
Aidan Hogan CC Procesamiento Masivo de Datos Otoño 2018 Lecture 7 Information Retrieval: Ranking Aidan Hogan
Lucene/Solr Architecture
Query processing: phrase queries and positional indexes
Table of Contents 1) Understanding Lucene 2) Lucene Indexing
Presentation transcript:

Lucene in action Information Retrieval A.A. 2010-11 – P. Ferragina, U. Scaiella – – Dipartimento di Informatica – Università di Pisa –

What is Lucene Full-text search library Indexing + Searching components 100% Java, no dependencies, no config files No crawler, document parsing nor search UI see Apache Nutch, Apache Solr, Apache Tika Probably, the most widely used SE Applications: a lot of (famous) websites, many commercial products

Basic Application Parse docs Write Index Make query Display results IndexWriter IndexSearcher Index

Indexing //create the index IndexWriter writer = new IndexWriter(directory, analyzer); //create the document structure Document doc = new Document(); Field id = new NumericField(“id”,Store.YES); Field title = new Field(“title”,null,Store.YES, Index.ANALYZED); Field body = new Field(“body”,null,Store.NO,Index.ANALYZED); doc.add(id); doc.add(title); doc.add(body); //scroll all documents, fill fields and index them! for(all document){ Article a = parse(document); id.setIntValue(a.id); title.setValue(a.title); body.setValue(a.body); writer.addDocument(doc); //doc is just a container } //IMPORTANT! close the Writer to commit all operations! writer.close();

Dictionary size: stopwords How to represent text? TEXT QUERY … Official Michael Jackson website … michael jackson Lower and Upper case … Michael Jackson’s new video … Tokenizer issues … Fender Music, the guitar company … Fender guitars Stemming … Microsoft WindowsXP … windows xp Word delimiter … the cat is on the table … cat table Dictionary size: stopwords

Text processing pipeline Analyzer Text processing pipeline String TokenStream Docs Tokenizer TokenStream TokenFilter1 TokenStream TokenFilter2 TokenFilter3 Analyzer Indexing tokens Index

Analyzer Analyzer Searching tokens Query String Index Results

Analyzer Built-in Tokenizers Built-in TokenFilters Built-in Analyzers WhitespaceTokenizer, LetterTokenizer,… StandardTokenizer good for most European-language docs Built-in TokenFilters LowerCase, Stemming, Stopwords, AccentFilter, many others in contrib packages (language-specific) Built-in Analyzers Keyword, Simple, Standard… PerField wrapper

the LexCorp BFG-900 is a printer TEXT QUERY the LexCorp BFG-900 is a printer Lex corp bfg900 printers WhitespaceTokenizer the LexCorp BFG-900 is a printer Lex corp bfg900 printers WordDelimiterFilter the Lex Corp BFG 900 is a printer Lex corp bfg 900 printers LexCorp LowerCaseFilter the lex corp bfg 900 is a printer lex corp bfg 900 printers lexcorp StopwordFilter lex corp bfg 900 printer lex corp bfg 900 printers lexcorp StemmerFilter lex corp bfg 900 print- lex corp bfg 900 print- lexcorp MATCH!

Field Options Field.Stored Field.Index Field.TermVector YES, NO ANALYZED, NOT_ANALYZED, NO Field.TermVector NO, YES (POSITION and/or OFFSETS)

Analysis tips Use PerFieldAnalyzerWrapper Use NumberUtils for numbers Don’t analyze keyword fields Store only needed data Use NumberUtils for numbers Add same field more than once, analyze it differently Boost exact case/stem matches

Searching //Best practice: reusable singleton! IndexSearcher s = new IndexSearcher(directory); //Build the query from the input string QueryParser qParser = new QueryParser(“body”, analyzer); Query q = qParser.parse(“title:Jaguar”); //Do search TopDocs hits = s.search(q, maxResults); System.out.println(“Results: ”+hits.totalHits); //Scroll all retrieved docs for(ScoreDoc hit : hits.scoreDocs){ Document doc = s.doc(hit.doc); System.out.println( doc.get(“id”) + “ – ” + doc.get(“title”) + “ Relevance=” + hit.score); } s.close();

Building the Query Built-in QueryParser Programmatic query building does text analysis and builds the Query object good for human input, debugging not all query types supported specific syntax: see JavaDoc for QueryParser Programmatic query building e.g.: Query q = new TermQuery( new Term(“title”, “jaguar”)); many types: Boolean, Term, Phrase, SpanNear, … no text analysis!

Scoring Lucene = Boolean Model + Vector Space Model Similarity = Cosine Similarity Term Frequency Inverse Document Frequency Other stuff Length Normalization Coord. factor (matching terms in OR queries) Boosts (per query or per doc) To build your own: implement Similarity and call Searcher.setSimilarity For debugging: Searcher.explain(Query q, int doc)

Performance Indexing Searching Batch indexing Raise mergeFactor Raise maxBufferedDocs Searching Reuse IndexSearcher Optimize: IndexWriter.optimize() Use cached filters: QueryFilter Segment_3 Index structure

Esercizi http://acube.di.unipi.it/esercizi-lucene-corso-ir/