Lucene: An open source set of Java classes (search engine / document classifier / indexer)


1 Lucene

2 Lucene
An open source set of Java classes
  ◦ Search engine / document classifier / indexer
    ▪ http://lucene.sourceforge.net/talks/pisa/
  ◦ Developed by Doug Cutting, 1996
    ▪ Contributed to the Apache project
    ▪ Wrote several papers in IR

3 Modules for IR
  ◦ Analysis
    ▪ Tokenization
    ▪ Where tokens are indexed
  ◦ Document
    ▪ Where the document ID is created
    ▪ Date of document is extracted
    ▪ Title of document is extracted
  ◦ Index
    ▪ Provides access to indexes
    ▪ Maintains indexes
  ◦ Query Parser
    ▪ Where the magic of query happens
  ◦ Search
    ▪ Searches across indexes

4 Modules for IR
  ◦ Search Spans
    ▪ Spans
    ▪ Matches within ±K words
    ▪ Example: find a document that has "Rachael Ray" and "Alton Brown" within 100 words of each other and that also contains the term "cooking"
  ◦ Store/Util
    ▪ Stores the indexes and handles other housekeeping
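
The span idea above can be sketched in plain Java. This is only an illustration of a "within K words" check, not Lucene's span API: the class name `ProximitySketch`, the method `within`, and the simple tokenizer are assumptions made for the example, and it handles single-word terms rather than full names.

```java
class ProximitySketch {
    /** True if single-word terms a and b occur within k token positions of each other. */
    static boolean within(String text, String a, String b, int k) {
        // Hypothetical tokenizer: lowercase, split on anything that is not a letter or digit.
        String[] tokens = text.toLowerCase().split("[^a-z0-9]+");
        String termA = a.toLowerCase();
        String termB = b.toLowerCase();
        int lastA = -1; // most recent position of termA
        int lastB = -1; // most recent position of termB
        for (int i = 0; i < tokens.length; i++) {
            if (tokens[i].equals(termA)) {
                lastA = i;
                if (lastB >= 0 && i - lastB <= k) return true;
            }
            if (tokens[i].equals(termB)) {
                lastB = i;
                if (lastA >= 0 && i - lastA <= k) return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        String doc = "Ray shared a cooking segment with Brown";
        System.out.println(within(doc, "ray", "brown", 100));
    }
}
```

Tracking only the most recent position of each term is enough here, because the closest pair always involves the latest occurrence seen so far.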

5 Theory
Space Optimizations for Total Ranking
  ◦ Cutting et al 1996
  ◦ RIAO (Computer Assisted IR) 1997
  ◦ http://lucene.sf.net/papers/riao97.ps
Lucene lecture at Pisa
  ◦ Doug Cutting
  ◦ Slides from a lecture at the University of Pisa, 2004

6 Vector
Vectors represent terms and documents as points in a term space
  ◦ A cosine measure determines how close terms/documents are
  ◦ This measure can then be used for WSD (word sense disambiguation), clustering, and IR
  ◦ Example:
    ▪ bass, fishing: 0.6506
    ▪ bass, guitar: 0.000423
    ▪ This tells us the document is about fishing, not about guitars

7 Vectors-IR
“Vector-space search engines use the notion of a term space, where each document is represented as a vector in a high-dimensional space. There are as many dimensions as there are unique words in the entire collection. Because a document's position in the term space is determined by the words it contains, documents with many words in common end up close together, while documents with few shared words end up far apart.”
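
A minimal sketch of the cosine measure over term-frequency vectors, assuming raw word counts as weights and a naive tokenizer. The class `CosineSketch` and its methods are hypothetical names for illustration; this is not how Lucene itself scores documents.

```java
import java.util.HashMap;
import java.util.Map;

class CosineSketch {
    /** Builds a sparse term-frequency vector: one dimension per distinct word. */
    static Map<String, Integer> termFreq(String text) {
        Map<String, Integer> tf = new HashMap<>();
        for (String token : text.toLowerCase().split("[^a-z0-9]+")) {
            if (!token.isEmpty()) tf.merge(token, 1, Integer::sum);
        }
        return tf;
    }

    /** Cosine of the angle between two sparse vectors; 0 if either vector is empty. */
    static double cosine(Map<String, Integer> a, Map<String, Integer> b) {
        double dot = 0.0, normA = 0.0, normB = 0.0;
        for (Map.Entry<String, Integer> e : a.entrySet()) {
            normA += e.getValue() * e.getValue();
            Integer other = b.get(e.getKey());
            if (other != null) dot += e.getValue() * other; // only shared terms contribute
        }
        for (int v : b.values()) normB += v * v;
        if (normA == 0.0 || normB == 0.0) return 0.0;
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    public static void main(String[] args) {
        Map<String, Integer> doc = termFreq("bass fishing lake");
        System.out.println(cosine(doc, termFreq("bass guitar")));
        System.out.println(cosine(doc, termFreq("fishing lake trip")));
    }
}
```

Documents that share many words get a value near 1, and documents with no words in common get 0, which matches the "close together" versus "far apart" picture in the quote.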

8 Inverted Index
Term / Doc Id / Weight
  ◦ Term
    ▪ “A Token, the basic unit of indexing in Lucene, represents a single word to be indexed after any document domain transformation -- such as stop-word elimination, stemming, filtering, term normalization, or language translation -- has been applied.”
    ▪ http://www.javaworld.com/javaworld/jw-09-2000/jw-0915-lucene-p2.html

9 Inverted Index
Doc Id
  ◦ A unique “key” that identifies each document
Weight
  ◦ Binary
  ◦ Frequency count
  ◦ Weighting algorithm
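
The term / doc id / weight layout can be sketched as a nested map, here using a plain frequency count as the weight. `InvertedIndexSketch` and its methods are hypothetical names for this illustration and say nothing about Lucene's on-disk format.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

class InvertedIndexSketch {
    // term -> (doc id -> frequency of the term in that document)
    private final Map<String, Map<Integer, Integer>> postings = new TreeMap<>();

    /** Tokenizes the text naively and records one posting per (term, document) pair. */
    void add(int docId, String text) {
        for (String token : text.toLowerCase().split("[^a-z0-9]+")) {
            if (token.isEmpty()) continue;
            postings.computeIfAbsent(token, t -> new HashMap<>())
                    .merge(docId, 1, Integer::sum);
        }
    }

    /** Returns doc id -> frequency for a term, or an empty map if the term is unknown. */
    Map<Integer, Integer> lookup(String term) {
        Map<Integer, Integer> found = postings.get(term.toLowerCase());
        return found == null ? Collections.<Integer, Integer>emptyMap() : found;
    }

    public static void main(String[] args) {
        InvertedIndexSketch index = new InvertedIndexSketch();
        index.add(1, "Bass fishing for bass");
        index.add(2, "Bass guitar");
        System.out.println(index.lookup("bass"));
    }
}
```

A binary weight would store only the presence of the doc id, and a weighting algorithm would replace the count with a computed score.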

10 Index Merge
Basic / Basket / Basketball
  ◦ Only keeps track of the differences between words
  ◦ Periodically merges indexes
    ▪ Allows new documents to be added easily
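
One reading of "only keeps track of the differences between words" is prefix (front) coding of a sorted term list: each term stores how many leading characters it shares with the previous term, plus the remaining suffix. The sketch below assumes that reading; `FrontCodingSketch` and the `count|suffix` text encoding are made up for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class FrontCodingSketch {
    /** Encodes sorted terms as "sharedPrefixLength|suffix" relative to the previous term. */
    static List<String> encode(List<String> sortedTerms) {
        List<String> out = new ArrayList<>();
        String prev = "";
        for (String term : sortedTerms) {
            int shared = 0;
            int max = Math.min(prev.length(), term.length());
            while (shared < max && prev.charAt(shared) == term.charAt(shared)) shared++;
            out.add(shared + "|" + term.substring(shared));
            prev = term;
        }
        return out;
    }

    /** Rebuilds the original terms by reusing the shared prefix of the previous term. */
    static List<String> decode(List<String> encoded) {
        List<String> out = new ArrayList<>();
        String prev = "";
        for (String entry : encoded) {
            int bar = entry.indexOf('|');
            int shared = Integer.parseInt(entry.substring(0, bar));
            String term = prev.substring(0, shared) + entry.substring(bar + 1);
            out.add(term);
            prev = term;
        }
        return out;
    }

    public static void main(String[] args) {
        // basic, basket, basketball -> 0|basic, 3|ket, 6|ball
        System.out.println(encode(Arrays.asList("basic", "basket", "basketball")));
    }
}
```

The saving grows with the length of the shared prefixes, which is why the term list has to be kept in sorted order.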

11 Query
Boolean Search
  ◦ Only searches documents with at least 1 term in the query
  ◦ “Boolean Search Engine”
Parallel Search
  ◦ Each term in the query is searched in parallel
  ◦ Partial scores are added to a queue of docs
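
A rough sketch of how per-term partial scores can be combined per document. For simplicity it walks the query terms one after another instead of in parallel, and the posting weights are assumed to be precomputed; `AccumulatorSketch` and `score` are hypothetical names.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class AccumulatorSketch {
    /**
     * postings: term -> (doc id -> weight of the term in that document).
     * Returns doc id -> sum of the partial scores contributed by the query terms.
     */
    static Map<Integer, Double> score(Map<String, Map<Integer, Double>> postings,
                                      List<String> queryTerms) {
        Map<Integer, Double> accumulators = new HashMap<>();
        for (String term : queryTerms) {
            Map<Integer, Double> termPostings = postings.get(term);
            if (termPostings == null) continue; // term not in the index
            for (Map.Entry<Integer, Double> e : termPostings.entrySet()) {
                // Each posting adds a partial score to its document's running total.
                accumulators.merge(e.getKey(), e.getValue(), Double::sum);
            }
        }
        return accumulators;
    }

    public static void main(String[] args) {
        Map<String, Map<Integer, Double>> postings = new HashMap<>();
        Map<Integer, Double> bass = new HashMap<>();
        bass.put(1, 0.5);
        bass.put(2, 0.25);
        postings.put("bass", bass);
        System.out.println(score(postings, Arrays.asList("bass", "guitar")));
    }
}
```

Only documents that appear in at least one posting list ever receive an accumulator, which is the "at least 1 term in the query" behavior described above.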

12 Query
Threshold
  ◦ If a document's partial score is too low for it to reach the N-best list, the document is dropped before its scoring is complete
    ▪ Example
    ▪ Potential New Doc [0,0,0,0,0,0,i]
    ▪ Document ranked 14 [233,202,109,100,i]
    ▪ Potential New Doc is ignored
  ◦ A small loss of recall greatly increases the speed of search
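
The threshold idea can be sketched as early termination against the current N-th best score. This version assumes a known upper bound on each term's contribution, so it never drops a true top-N document; the heuristic on the slide trades a little recall for more speed, and Lucene's actual rule may differ. `ThresholdSketch`, `topN`, and `maxPerTerm` are names invented for the example.

```java
import java.util.PriorityQueue;

class ThresholdSketch {
    /**
     * docTermScores[d][t] is the score term t contributes to document d.
     * maxPerTerm[t] is an assumed upper bound on any document's score for term t.
     * Returns the n best total scores in descending order.
     */
    static double[] topN(double[][] docTermScores, double[] maxPerTerm, int n) {
        if (n <= 0) return new double[0];
        int terms = maxPerTerm.length;
        // remainingMax[t] = the most that terms t..end could still add.
        double[] remainingMax = new double[terms + 1];
        for (int t = terms - 1; t >= 0; t--) {
            remainingMax[t] = remainingMax[t + 1] + maxPerTerm[t];
        }
        PriorityQueue<Double> best = new PriorityQueue<>(); // min-heap: head is the N-th best
        for (double[] doc : docTermScores) {
            double partial = 0.0;
            boolean pruned = false;
            for (int t = 0; t < terms; t++) {
                // Stop scoring once even a perfect remainder cannot beat the threshold.
                if (best.size() == n && partial + remainingMax[t] <= best.peek()) {
                    pruned = true;
                    break;
                }
                partial += doc[t];
            }
            if (pruned) continue;
            if (best.size() < n) {
                best.add(partial);
            } else if (partial > best.peek()) {
                best.poll();
                best.add(partial);
            }
        }
        double[] result = new double[best.size()];
        for (int i = result.length - 1; i >= 0; i--) result[i] = best.poll();
        return result;
    }

    public static void main(String[] args) {
        double[][] docs = { {5, 4}, {0, 0}, {3, 3}, {1, 0} };
        System.out.println(java.util.Arrays.toString(topN(docs, new double[] {5, 4}, 2)));
    }
}
```

Tightening the bound below the true maximum would prune more documents and run faster, at the cost of occasionally missing a document that belonged in the top N.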

13 Evaluation of Lucene
Quantitative Evaluation of Passage Retrieval Algorithms for Question Answering
  ◦ Tellex et al, MIT AI Lab, 2003
Compared Prise to Lucene for question answering tasks
  ◦ Question & Answer

14 Evaluation of Lucene
Prise
  ◦ An IR system developed by NIST that, according to the paper, uses “modern” search engine techniques
Findings
  ◦ Found that Prise was better than Lucene: “Boolean” query engines are considered old school, and Prise's answers to questions were better

15 Evaluation of Lucene
Lucene
  ◦ Found that although Prise gave more correct answers, Lucene found more documents containing relevant information
MIT used Lucene, not Prise, in their 2005 TREC submission

16 Users
Lucene is used widely
  ◦ TREC
  ◦ Document retrieval in enterprise systems
  ◦ Part of database/web engines
  ◦ Part of Nutch
  ◦ Used by academics for large projects

17 Conclusions
Lucene is a good set of classes
  ◦ Designed to allow customization without having to “reinvent the wheel”
  ◦ Robust
  ◦ Fast
  ◦ Large development groups
  ◦ Used widely in academia and industry

