Introduction to Lucene
Rong Jin


Presentation transcript:


What is Lucene?
• Lucene is a high-performance, scalable Information Retrieval (IR) library.
• Free, open-source project implemented in Java.
• Originally written by Doug Cutting.
• Became a project in the Apache Software Foundation in 2001.
• It is the most popular free Java IR library.
• Lucene has been ported to Perl, Python, Ruby, C/C++, and C# (.NET).

Lucene Users
• IBM Omnifind Y! Edition
• Technorati
• Wikipedia
• Internet Archive
• LinkedIn
• Eclipse
• JIRA
• Apache Roller
• jGuru
• More than 200 others

The Lucene Family
• Lucene (also called Apache Lucene or Java Lucene): IR library
• Nutch: Hadoop-loving crawler, indexer, and searcher for web-scale search engines
• Solr: search server
• Droids: standalone framework for writing crawlers
• Lucene.Net: C# port, incubator graduate
• Lucy: C implementation of Lucene
• PyLucene: Python port
• Tika: content analysis toolkit

Indexing Documents
• Each document consists of multiple fields.
• The Analyzer extracts words from the text.
• The IndexWriter creates the inverted index and writes it to disk.
[Diagram labels: Document, Field, Analyzer (Tokenizer, TokenFilter), IndexWriter, Dictionary, Inverted Index]
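The pipeline on this slide (text is analyzed into words, and the words are recorded in an inverted index) can be illustrated with a small plain-Java sketch. It does not use Lucene: the class `InvertedIndexSketch` and its methods `analyze` and `build` are hypothetical names invented for this illustration, and the toy analyzer only handles ASCII letters.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeMap;
import java.util.TreeSet;

// Toy illustration only: not Lucene's Analyzer or IndexWriter.
class InvertedIndexSketch {

    // Toy "analyzer": lowercase the text and split on anything that is not a-z.
    static List<String> analyze(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^a-z]+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    // Toy "index writer": map each term to the sorted set of document IDs containing it.
    static Map<String, SortedSet<Integer>> build(List<String> docs) {
        Map<String, SortedSet<Integer>> index = new TreeMap<>();
        for (int docId = 0; docId < docs.size(); docId++) {
            for (String term : analyze(docs.get(docId))) {
                index.computeIfAbsent(term, k -> new TreeSet<>()).add(docId);
            }
        }
        return index;
    }

    public static void main(String[] args) {
        Map<String, SortedSet<Integer>> index = build(List.of(
                "The quick brown fox", "The lazy dog", "A quick dog"));
        System.out.println("quick -> " + index.get("quick"));
        System.out.println("dog   -> " + index.get("dog"));
    }
}
```

A lookup for a term is then a single map access, which is the reason an inverted index is built in the first place.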


Lucene Classes for Indexing
• Directory class
  ◦ An abstract class representing the location of a Lucene index.
  ◦ FSDirectory stores the index in a directory in the filesystem.
  ◦ RAMDirectory holds all its data in memory; useful for smaller indices that can be fully loaded in memory and destroyed when the application terminates.

Lucene Classes for Indexing
• IndexWriter class
  ◦ Creates a new index or opens an existing one, and adds, removes, or updates documents in the index.
• Analyzer class
  ◦ An abstract class for extracting tokens from the text to be indexed.
  ◦ StandardAnalyzer is the most common one.

Lucene Classes for Indexing
• Document class
  ◦ A document is a collection of fields.
  ◦ Metadata such as author, title, subject, and date modified are indexed and stored separately as fields of a document.

Index Segments and Merge
• Each index consists of multiple segments.
• Every segment is a standalone index in itself, holding a subset of all indexed documents.
• At search time, each segment is visited separately and the results are combined.

    # ls -lh
    total 1.1G
    -rw-r--r-- 1 root root 123M :29 _0.fdt
    -rw-r--r-- 1 root root 44M :29 _0.fdx
    -rw-r--r-- 1 root root :31 _9j.fnm
    -rw-r--r-- 1 root root 372M :36 _9j.frq
    -rw-r--r-- 1 root root 11M :36 _9j.nrm
    -rw-r--r-- 1 root root 180M :36 _9j.prx
    -rw-r--r-- 1 root root 5.5M :36 _9j.tii
    -rw-r--r-- 1 root root 308M :36 _9j.tis
    -rw-r--r-- 1 root root :36 segments_2
    -rw-r--r-- 1 root root :36 segments.gen

Index Segments and Merge
• Each segment consists of multiple files named _X.<ext>
  ◦ X is the segment name and <ext> is the extension that identifies which part of the index the file corresponds to.
  ◦ Separate files hold the different parts of the index (term vectors, stored fields, inverted index, etc.).
• The optimize() operation merges all the segments into one.
  ◦ It involves a lot of disk I/O and is time-consuming.
  ◦ It can improve search efficiency, because a query then visits one segment instead of many.
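The two ideas on these slides, searching each segment separately and merging all segments into one, can be sketched in plain Java. This is a toy model and not Lucene's on-disk format or merge code: a "segment" here is assumed to be just a map from a term to a list of document IDs, and `SegmentSketch`, `search`, and `merge` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Toy illustration only: a segment is modelled as term -> list of document IDs.
class SegmentSketch {

    // Visit every segment separately and combine the hits.
    static List<Integer> search(List<Map<String, List<Integer>>> segments, String term) {
        List<Integer> hits = new ArrayList<>();
        for (Map<String, List<Integer>> segment : segments) {
            hits.addAll(segment.getOrDefault(term, List.of()));
        }
        Collections.sort(hits);
        return hits;
    }

    // Merge all segments into a single segment by concatenating the lists per term.
    static Map<String, List<Integer>> merge(List<Map<String, List<Integer>>> segments) {
        Map<String, List<Integer>> merged = new TreeMap<>();
        for (Map<String, List<Integer>> segment : segments) {
            for (Map.Entry<String, List<Integer>> e : segment.entrySet()) {
                merged.computeIfAbsent(e.getKey(), k -> new ArrayList<>()).addAll(e.getValue());
            }
        }
        for (List<Integer> postings : merged.values()) {
            Collections.sort(postings);
        }
        return merged;
    }

    public static void main(String[] args) {
        List<Map<String, List<Integer>>> segments = List.of(
                Map.of("lucene", List.of(0, 2), "index", List.of(1)),
                Map.of("lucene", List.of(3), "search", List.of(3, 4)));
        // Same hits either way; after merging, only one segment has to be visited.
        System.out.println(search(segments, "lucene"));
        System.out.println(search(List.of(merge(segments)), "lucene"));
    }
}
```

The search results are identical before and after the merge; what changes is how many structures a query has to consult.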

Lucene Classes for Reading Index
• IndexReader class
  ◦ Reads the index from the index files.
• Terms class
  ◦ A container for all the terms in a specified field.
• TermsEnum class
  ◦ Implements the BytesRefIterator interface, providing access to each term.

Reading Document Vector
• Enable storing term vectors at the indexing step.

    FieldType fieldType = new FieldType();
    fieldType.setStoreTermVectors(true);
    fieldType.setIndexed(true);
    fieldType.setIndexOptions(IndexOptions.DOCS_AND_FREQS);
    fieldType.setStored(true);
    doc.add(new Field("contents", contentString, fieldType));

• Read the document vector.
• Obtain each term in the document vector.

    IndexReader reader = IndexReader.open(FSDirectory.open(new File(indexPath)));
    int maxDoc = reader.maxDoc();
    for (int i = 0; i < maxDoc; i++) {
        Terms terms = reader.getTermVector(i, "contents");
        if (terms == null) continue; // no term vector stored for this document
        TermsEnum termsEnum = terms.iterator(null);
        BytesRef text = null;
        while ((text = termsEnum.next()) != null) {
            String termtext = text.utf8ToString();
            // docFreq() as on the slide; a term vector covers a single document,
            // so totalTermFreq() is the call that reports the within-document count.
            int docfreq = termsEnum.docFreq();
        }
    }
    reader.close();

Updating Documents in Index
• IndexWriter.addDocument(): adds documents to the existing index
• IndexWriter.deleteDocuments(): removes documents from the existing index
• IndexWriter.updateDocument(): updates documents in the existing index

Other Features of Lucene Indexing
• Concurrency
  ◦ Multiple IndexReaders may be open at once on a single index.
  ◦ Only one IndexWriter can be open on an index at a time.
  ◦ IndexReaders may be open even while an IndexWriter is making changes to the index; each IndexReader always shows the index as of the point in time at which it was opened.
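The point-in-time behaviour described above can be shown with a toy model in plain Java. This illustrates only what a reader observes, not how Lucene implements it: the class `SnapshotSketch` and its methods are hypothetical, and copying the whole document list on open is an assumption made to keep the sketch short.

```java
import java.util.ArrayList;
import java.util.List;

// Toy illustration only: a reader sees the index as it was when the reader was opened.
class SnapshotSketch {
    private final List<String> docs = new ArrayList<>();

    // Writer side: add a document to the live index.
    void addDocument(String doc) {
        docs.add(doc);
    }

    // Reader side: an immutable copy of the index as of this moment.
    List<String> openReader() {
        return List.copyOf(docs);
    }

    public static void main(String[] args) {
        SnapshotSketch index = new SnapshotSketch();
        index.addDocument("doc A");
        List<String> earlyReader = index.openReader();
        index.addDocument("doc B");          // the writer keeps changing the index
        List<String> lateReader = index.openReader();
        System.out.println("early reader sees " + earlyReader.size() + " document(s)");
        System.out.println("late reader sees " + lateReader.size() + " document(s)");
    }
}
```

To see later changes, a reader in this model has to be opened again, which mirrors the slide's statement that each reader shows the index as of the time it was opened.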

Other Features of Lucene Indexing
• A file-based lock prevents two writers from working on the same index.
  ◦ If the file write.lock exists in the index directory, a writer currently has the index open; any attempt to create another writer on the same index fails with a LockObtainFailedException.
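The idea of a lock file can be sketched with the JDK's file API. This is not Lucene's locking code: the sketch assumes that atomically creating a file named write.lock is enough to claim the directory, and `WriteLockSketch`, `tryLock`, and `release` are hypothetical names. It returns false where Lucene would raise LockObtainFailedException.

```java
import java.io.IOException;
import java.nio.file.FileAlreadyExistsException;
import java.nio.file.Files;
import java.nio.file.Path;

// Toy illustration only: claim a directory by creating a lock file in it.
class WriteLockSketch {

    // Returns true if this caller now holds the lock, false if the lock file already exists.
    static boolean tryLock(Path indexDir) throws IOException {
        try {
            // createFile checks for existence and creates the file in one atomic step.
            Files.createFile(indexDir.resolve("write.lock"));
            return true;
        } catch (FileAlreadyExistsException e) {
            return false;
        }
    }

    // Give the lock back by deleting the lock file.
    static void release(Path indexDir) throws IOException {
        Files.deleteIfExists(indexDir.resolve("write.lock"));
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("toy-index");
        System.out.println("first writer got lock:  " + tryLock(dir));
        System.out.println("second writer got lock: " + tryLock(dir));
        release(dir);
        System.out.println("after release:          " + tryLock(dir));
        release(dir);
    }
}
```

One weakness of this simple scheme is that a crashed writer leaves the lock file behind, so a stale write.lock has to be removed by hand in this model.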

Search Documents

Lucene Classes for Searching
• IndexSearcher class
  ◦ Searches through the index.
• TopDocs class
  ◦ A container of pointers to the top N ranked results.
  ◦ Records the docID and score for each of the top N results (the docID can be used to retrieve the document).
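What a top-N container holds (a docID and a score per result, best first) can be illustrated with a bounded min-heap in plain Java. This is a sketch of the general technique, not the code behind TopDocs; `TopNSketch`, `Hit`, and `topN` are hypothetical names, and the scores are assumed to be given as an array indexed by docID.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

// Toy illustration only: keep the N best (docID, score) pairs.
class TopNSketch {

    static final class Hit {
        final int docId;
        final double score;

        Hit(int docId, double score) {
            this.docId = docId;
            this.score = score;
        }
    }

    // scores[docId] is the score of that document; returns the n best hits, best first.
    static List<Hit> topN(double[] scores, int n) {
        // Min-heap on score: the weakest of the current candidates sits on top.
        PriorityQueue<Hit> heap =
                new PriorityQueue<>(Comparator.comparingDouble((Hit h) -> h.score));
        for (int docId = 0; docId < scores.length; docId++) {
            heap.add(new Hit(docId, scores[docId]));
            if (heap.size() > n) {
                heap.poll(); // drop the lowest-scoring candidate
            }
        }
        List<Hit> result = new ArrayList<>(heap);
        result.sort(Comparator.comparingDouble((Hit h) -> h.score).reversed());
        return result;
    }

    public static void main(String[] args) {
        for (Hit hit : topN(new double[] {0.2, 0.9, 0.5, 0.7}, 2)) {
            System.out.println("doc " + hit.docId + " score " + hit.score);
        }
    }
}
```

Because the heap never grows beyond N entries, memory use depends on N rather than on the number of matching documents.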

Lucene Classes for Searching
• QueryParser
  ◦ Parses a text query into a Query object.
  ◦ Needs an analyzer to extract tokens from the text query.
• Searching for a single term: Term class
  ◦ Similar to Field, a Term is a pair of name and value.
  ◦ Used together with the TermQuery class to create a query.

Similarity Functions in Lucene
• Many similarity functions are implemented in Lucene:
  ◦ Okapi (BM25Similarity)
  ◦ Language model (LMDirichletSimilarity)
• Example:

    Similarity simfn = new BM25Similarity();
    searcher.setSimilarity(simfn); // searcher is an IndexSearcher

Similarity Functions in Lucene
• A default similarity function is provided.
• Other similarity functions can be implemented.


Lucene Scoring in DefaultSimilarity
• tf: how often a term appears in the document; sqrt(freq)
• idf: inverse of how often the term appears across the index, so rarer terms weigh more; log(numDocs/(docFreq+1))+1
• coord: fraction of the query terms that also appear in the document; overlap/maxOverlap
• lengthNorm: normalization by the total number of terms in the field; 1/sqrt(numTerms)
• queryNorm: normalization factor that makes scores from different queries comparable; 1/sqrt(sumOfSquaredWeights)
• boost(index): boost of the field at index time
• boost(query): boost of the field at query time
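The five formulas on this slide can be written out directly in Java to see what values they produce. This sketch only evaluates the individual factors as listed; it does not reproduce how Lucene combines them into a final score. `ScoringFactorsSketch` is a hypothetical class name, and `log` is assumed to be the natural logarithm.

```java
// Toy illustration only: the individual scoring factors as listed on the slide.
class ScoringFactorsSketch {

    // tf: grows with the term's frequency in the document, but sub-linearly.
    static double tf(int freq) {
        return Math.sqrt(freq);
    }

    // idf: larger for terms that occur in fewer documents (natural log assumed).
    static double idf(int docFreq, int numDocs) {
        return Math.log((double) numDocs / (docFreq + 1)) + 1.0;
    }

    // coord: fraction of the query terms that were found in the document.
    static double coord(int overlap, int maxOverlap) {
        return (double) overlap / maxOverlap;
    }

    // lengthNorm: shorter fields get a larger factor.
    static double lengthNorm(int numTerms) {
        return 1.0 / Math.sqrt(numTerms);
    }

    // queryNorm: scales scores so that different queries are comparable.
    static double queryNorm(double sumOfSquaredWeights) {
        return 1.0 / Math.sqrt(sumOfSquaredWeights);
    }

    public static void main(String[] args) {
        System.out.println("tf(4)             = " + tf(4));
        System.out.println("idf(9, 1000)      = " + idf(9, 1000));
        System.out.println("idf(499, 1000)    = " + idf(499, 1000));
        System.out.println("coord(1, 2)       = " + coord(1, 2));
        System.out.println("lengthNorm(100)   = " + lengthNorm(100));
        System.out.println("queryNorm(4.0)    = " + queryNorm(4.0));
    }
}
```

Comparing the two idf calls shows the intended effect: a term found in 9 of 1000 documents gets a noticeably larger factor than one found in 499 of them.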

Customizing Scoring
• Subclass DefaultSimilarity and override the method you want to customize. For example:
  ◦ Ignore how common a term is across the index.
  ◦ Increase the weight of terms in the "title" field.
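The subclass-and-override pattern behind both examples can be sketched without Lucene. Everything here is hypothetical: `ToySimilarity` stands in for a default similarity and is not Lucene's DefaultSimilarity, its `score` method is a simplified product invented for the illustration, and `fieldWeight` is an assumed hook rather than a real Lucene method.

```java
// Toy illustration only: customizing scoring by overriding individual factors.
class CustomSimilaritySketch {

    // Stand-in for a default similarity with overridable factors.
    static class ToySimilarity {
        double idf(int docFreq, int numDocs) {
            return Math.log((double) numDocs / (docFreq + 1)) + 1.0;
        }

        double fieldWeight(String field) {
            return 1.0;
        }

        // Simplified score: sqrt(freq) * idf * field weight.
        double score(String field, int freq, int docFreq, int numDocs) {
            return Math.sqrt(freq) * idf(docFreq, numDocs) * fieldWeight(field);
        }
    }

    // Ignores how common a term is, and doubles the weight of the "title" field.
    static class TitleHeavyNoIdf extends ToySimilarity {
        @Override
        double idf(int docFreq, int numDocs) {
            return 1.0;
        }

        @Override
        double fieldWeight(String field) {
            return "title".equals(field) ? 2.0 : 1.0;
        }
    }

    public static void main(String[] args) {
        ToySimilarity custom = new TitleHeavyNoIdf();
        System.out.println("title: " + custom.score("title", 4, 3, 100));
        System.out.println("body:  " + custom.score("body", 4, 3, 100));
    }
}
```

Only the overridden factors change; the rest of the scoring logic is inherited, which is the appeal of customizing by subclassing.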

Queries in Lucene
• Lucene supports many types of queries: RangeQuery, PrefixQuery, WildcardQuery, BooleanQuery, PhraseQuery, …

Analyzers
• Basic analyzers
• Analyzers for different languages (in analyzers-common): Chinese, Japanese, Arabic, German, Greek, …

Analysis in Action
Input: "The quick brown fox jumped over the lazy dogs"
    WhitespaceAnalyzer: [The] [quick] [brown] [fox] [jumped] [over] [the] [lazy] [dogs]
    SimpleAnalyzer:     [the] [quick] [brown] [fox] [jumped] [over] [the] [lazy] [dogs]
    StopAnalyzer:       [quick] [brown] [fox] [jumped] [over] [lazy] [dogs]
    StandardAnalyzer:   [quick] [brown] [fox] [jumped] [over] [lazy] [dogs]

Analysis in Action
Input: "XY&Z Corporation - ..."
    WhitespaceAnalyzer: [XY&Z] [Corporation] [-]
    SimpleAnalyzer:     [xy] [z] [corporation] [xyz] [example] [com]
    StopAnalyzer:       [xy] [z] [corporation] [xyz] [example] [com]
    StandardAnalyzer:   [xy&z] [corporation]
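The differences between the first three analyzers can be imitated with a few lines of plain Java. These are rough stand-ins written for illustration, not Lucene's analyzers: `ToyAnalyzers` is a hypothetical class, the stop-word set is an assumed three-word list, and StandardAnalyzer is left out because its grammar-based tokenization is not reproduced here.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Toy illustration only: rough imitations of three analyzers' behaviour.
class ToyAnalyzers {
    // Assumed stop-word list, much shorter than a real one.
    static final Set<String> STOP_WORDS = Set.of("the", "a", "an");

    // Split on whitespace only; case and punctuation are kept.
    static List<String> whitespace(String text) {
        return Arrays.asList(text.trim().split("\\s+"));
    }

    // Split on anything that is not a letter, then lowercase.
    static List<String> simple(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.split("[^\\p{L}]+")) {
            if (!t.isEmpty()) {
                tokens.add(t.toLowerCase(Locale.ROOT));
            }
        }
        return tokens;
    }

    // Same as simple(), with stop words removed afterwards.
    static List<String> stop(String text) {
        List<String> tokens = new ArrayList<>(simple(text));
        tokens.removeAll(STOP_WORDS);
        return tokens;
    }

    public static void main(String[] args) {
        String text = "The quick brown fox jumped over the lazy dogs";
        System.out.println("whitespace: " + whitespace(text));
        System.out.println("simple:     " + simple(text));
        System.out.println("stop:       " + stop(text));
    }
}
```

Running the same input through all three makes the slide's table easy to check: only the stop variant drops both occurrences of "the".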

Analyzer: Key Structure
• Breaks text into a stream of tokens enumerated by the TokenStream class.
• One method (shown as code on the slide, not preserved in this transcript) is the only required method for an Analyzer.
• A second method allows the TokenStream to be reused, which saves memory allocation and garbage collection.

TokenStream Class
• Two types of TokenStream:
  ◦ Tokenizer: a TokenStream that tokenizes the input from a Reader, i.e., chunks the input into Tokens.
  ◦ TokenFilter: allows you to chain TokenStreams together, i.e., to further modify the Tokens, including removing them, stemming them, and other actions.
• A chain usually includes one Tokenizer and N TokenFilters.

TokenStream Class
• Example: StopAnalyzer
    Text → LowerCaseTokenizer → StopFilter → TokenStream
    Text → LetterTokenizer → LowerCaseFilter → StopFilter → TokenStream
• Order matters!
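Why the order of filters matters can be shown with a toy chain in plain Java. This is a sketch of the idea and not Lucene's TokenStream API: `FilterChainSketch`, `letterTokenizer`, `LOWER_CASE`, and `STOP` are hypothetical names, and the stop-word set is assumed to contain only the lowercase word "the".

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.function.UnaryOperator;
import java.util.stream.Collectors;

// Toy illustration only: one tokenizer followed by a chain of filters.
class FilterChainSketch {
    // Assumed stop-word list; note that it is lowercase.
    static final Set<String> STOP_WORDS = Set.of("the");

    // Toy tokenizer: split on runs of non-letter characters, keeping case.
    static List<String> letterTokenizer(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.split("[^\\p{L}]+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    // Filter that lowercases every token.
    static final UnaryOperator<List<String>> LOWER_CASE = tokens ->
            tokens.stream().map(t -> t.toLowerCase(Locale.ROOT)).collect(Collectors.toList());

    // Filter that removes tokens found in the stop-word list.
    static final UnaryOperator<List<String>> STOP = tokens ->
            tokens.stream().filter(t -> !STOP_WORDS.contains(t)).collect(Collectors.toList());

    // Run the tokenizer, then apply the filters in the given order.
    @SafeVarargs
    static List<String> analyze(String text, UnaryOperator<List<String>>... filters) {
        List<String> tokens = letterTokenizer(text);
        for (UnaryOperator<List<String>> filter : filters) {
            tokens = filter.apply(tokens);
        }
        return tokens;
    }

    public static void main(String[] args) {
        String text = "The quick fox";
        System.out.println("lowercase, then stop: " + analyze(text, LOWER_CASE, STOP));
        System.out.println("stop, then lowercase: " + analyze(text, STOP, LOWER_CASE));
    }
}
```

With the stop filter first, the capitalized "The" does not match the lowercase stop list and survives; lowercasing first lets the stop filter remove it.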
