Published by Christian Barker. Modified over 9 years ago.
1
Lucene
2
Lucene
An open-source set of Java classes
◦ Search engine/document classifier/indexer: http://lucene.sourceforge.net/talks/pisa/
◦ Developed by Doug Cutting, 1996; contributed to the Apache project; wrote several papers in IR
3
Modules for IR
◦ Analysis: tokenization; where tokens are indexed
◦ Document: where the document ID is created and the document's date and title are extracted
◦ Index: provides access to indexes; maintains indexes
◦ Query Parser: where the magic of the query happens (the query string is parsed into a query)
◦ Search: searches across indexes
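The tokenization step of the Analysis module can be illustrated with a minimal sketch in plain Java. This is not Lucene's analyzer API: the class name `SimpleAnalyzer`, the tiny stop-word list, and the split-on-non-alphanumerics rule are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical stand-in for an analysis step; not one of Lucene's analyzer classes.
final class SimpleAnalyzer {
    // Tiny illustrative stop-word list (an assumption, not Lucene's default list).
    private static final Set<String> STOP_WORDS = Set.of("a", "an", "and", "of", "the");

    // Lowercase the text, split on runs of non-alphanumeric characters, drop stop words.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String raw : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!raw.isEmpty() && !STOP_WORDS.contains(raw)) {
                tokens.add(raw);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("The Magic of the Query Parser"));
    }
}
```

The tokens that come out of a step like this are what the Index module would store.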
4
Modules for IR
◦ Search Spans: matches terms that occur within ±K words of each other. Example: find a document that has Rachael Ray and Alton Brown within 100 words of each other and that also contains the term cooking
◦ Store/Util: stores the indexes and handles other housekeeping
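The span example can be sketched without Lucene as a proximity check over token positions. This is a simplified illustration, not the Search Spans implementation: the class `SpanCheck` is hypothetical, and each name is reduced to a single token ("ray", "brown") to keep the code short.

```java
import java.util.List;

// Hypothetical proximity check; a simplification of what a span query expresses.
final class SpanCheck {
    // True if some occurrence of termA is within maxDistance token positions
    // of some occurrence of termB.
    static boolean within(List<String> tokens, String termA, String termB, int maxDistance) {
        int lastA = -1;
        int lastB = -1;
        for (int i = 0; i < tokens.size(); i++) {
            String t = tokens.get(i);
            if (t.equals(termA)) lastA = i;
            if (t.equals(termB)) lastB = i;
            // Comparing the most recent positions is enough: any closer pair
            // would have been seen at the later of its two positions.
            if (lastA >= 0 && lastB >= 0 && Math.abs(lastA - lastB) <= maxDistance) {
                return true;
            }
        }
        return false;
    }

    // The slide's example: both names within 100 words, plus the term "cooking".
    static boolean matchesExample(List<String> tokens) {
        return within(tokens, "ray", "brown", 100) && tokens.contains("cooking");
    }

    public static void main(String[] args) {
        List<String> doc = List.of("rachael", "ray", "likes", "cooking", "with", "alton", "brown");
        System.out.println(matchesExample(doc));
    }
}
```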
5
Theory
Space Optimization for Total Ranking
◦ Cutting et al 1996
◦ RIAO (Computer Assisted IR) 1997
◦ http://lucene.sf.net/papers/riao97.ps
Lucene lecture at Pisa
◦ Doug Cutting
◦ Slides from lecture at University of Pisa 2004
6
Vector
Vectors give a mathematical measure of how close terms are to each other
◦ Uses the cosine measure to determine how close terms/documents are (a higher score means closer)
◦ This measure can then be used for WSD/clustering/IR
◦ Example: Bass,fishing: .6506; Bass,guitar: .000423. This tells us the document is about fishing, not about guitars
7
Vectors-IR
“Vector-space search engines use the notion of a term space, where each document is represented as a vector in a high-dimensional space. There are as many dimensions as there are unique words in the entire collection. Because a document's position in the term space is determined by the words it contains, documents with many words in common end up close together, while documents with few shared words end up far apart.”
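The term-space idea can be made concrete with a small sketch: each document becomes a map from term to count, and closeness is the cosine of the angle between two such vectors. The class `TermVectors` and the use of raw counts as weights are assumptions for illustration; real engines typically apply a weighting scheme on top.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper: documents as sparse term-count vectors.
final class TermVectors {
    // Count how often each term occurs in a tokenized document.
    static Map<String, Integer> termCounts(List<String> tokens) {
        Map<String, Integer> counts = new HashMap<>();
        for (String t : tokens) {
            counts.merge(t, 1, Integer::sum);
        }
        return counts;
    }

    // Euclidean length of a sparse vector.
    static double norm(Map<String, Integer> v) {
        double sum = 0.0;
        for (int c : v.values()) {
            sum += (double) c * c;
        }
        return Math.sqrt(sum);
    }

    // Cosine of the angle between two vectors; 0.0 if either vector is empty.
    static double cosine(Map<String, Integer> a, Map<String, Integer> b) {
        double dot = 0.0;
        for (Map.Entry<String, Integer> e : a.entrySet()) {
            Integer other = b.get(e.getKey());
            if (other != null) {
                dot += (double) e.getValue() * other;
            }
        }
        double normA = norm(a);
        double normB = norm(b);
        if (normA == 0.0 || normB == 0.0) {
            return 0.0;
        }
        return dot / (normA * normB);
    }

    public static void main(String[] args) {
        Map<String, Integer> fishing = termCounts(List.of("bass", "fishing", "lake"));
        Map<String, Integer> boating = termCounts(List.of("bass", "fishing", "boat"));
        Map<String, Integer> music = termCounts(List.of("guitar", "amp"));
        System.out.println(cosine(fishing, boating)); // shares two of three terms
        System.out.println(cosine(fishing, music));   // shares no terms
    }
}
```

Documents that share many words score near 1, and documents with no words in common score 0.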
8
Inverted Index
Term/Doc Id/Weight
◦ Term: “A Token, the basic unit of indexing in Lucene, represents a single word to be indexed after any document domain transformation -- such as stop-word elimination, stemming, filtering, term normalization, or language translation -- has been applied.” http://www.javaworld.com/javaworld/jw-09-2000/jw-0915-lucene-p2.html
9
Inverted Index
Doc Id
◦ A unique “key” that identifies each document
Weight
◦ Binary
◦ Freq Count
◦ Weighting Algorithm
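The term/doc ID/weight structure can be sketched as a nested map. This is an in-memory illustration only, with the hypothetical class `TinyInvertedIndex`; it shows the binary and frequency-count weights from the slide and leaves out any weighting algorithm and on-disk format.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical in-memory inverted index: term -> (doc ID -> frequency count).
final class TinyInvertedIndex {
    private final Map<String, Map<Integer, Integer>> postings = new TreeMap<>();

    // Record every token of a document under that document's ID.
    void add(int docId, List<String> tokens) {
        for (String term : tokens) {
            postings.computeIfAbsent(term, k -> new TreeMap<>()).merge(docId, 1, Integer::sum);
        }
    }

    private Map<Integer, Integer> postingsFor(String term) {
        Map<Integer, Integer> p = postings.get(term);
        if (p == null) {
            return Collections.emptyMap();
        }
        return p;
    }

    // Frequency-count weight: how often the term occurs in the document (0 if absent).
    int frequency(String term, int docId) {
        return postingsFor(term).getOrDefault(docId, 0);
    }

    // Binary weight: does the term occur in the document at all?
    boolean contains(String term, int docId) {
        return frequency(term, docId) > 0;
    }

    // Doc IDs that contain the term, in ascending order.
    List<Integer> docsFor(String term) {
        return new ArrayList<>(postingsFor(term).keySet());
    }

    public static void main(String[] args) {
        TinyInvertedIndex index = new TinyInvertedIndex();
        index.add(1, List.of("bass", "fishing", "bass"));
        index.add(2, List.of("bass", "guitar"));
        System.out.println(index.docsFor("bass"));
        System.out.println(index.frequency("bass", 1));
    }
}
```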
10
Index Merge
Basic/Basket/Basketball
◦ Only keeps track of the differences between successive words
◦ Periodically merges indexes, which allows new documents to be added easily
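One common way to keep only the differences between sorted words is front coding: each term is stored as the number of leading characters it shares with the previous term, plus the remaining suffix. The sketch below assumes that reading of the slide. The class `FrontCoding` and the `shared:suffix` text format are illustrative and are not Lucene's file format. Index merging is not shown.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical front-coding sketch for a sorted term list.
final class FrontCoding {
    // Encode each term as "<shared prefix length>:<suffix>" relative to the previous term.
    static List<String> encode(List<String> sortedTerms) {
        List<String> out = new ArrayList<>();
        String prev = "";
        for (String term : sortedTerms) {
            int max = Math.min(prev.length(), term.length());
            int shared = 0;
            while (shared < max && prev.charAt(shared) == term.charAt(shared)) {
                shared++;
            }
            out.add(shared + ":" + term.substring(shared));
            prev = term;
        }
        return out;
    }

    // Rebuild the full terms by reusing the prefix of the previously decoded term.
    static List<String> decode(List<String> encoded) {
        List<String> out = new ArrayList<>();
        String prev = "";
        for (String entry : encoded) {
            int colon = entry.indexOf(':');
            int shared = Integer.parseInt(entry.substring(0, colon));
            String term = prev.substring(0, shared) + entry.substring(colon + 1);
            out.add(term);
            prev = term;
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> encoded = encode(List.of("basic", "basket", "basketball"));
        System.out.println(encoded);
        System.out.println(decode(encoded));
    }
}
```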
11
Query
Boolean Search
◦ Only searches documents with at least 1 term in the query
◦ “Boolean Search Engine”
Parallel Search
◦ Each term in the query is searched in parallel
◦ Partial scores are added to a queue of docs
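The accumulation of partial scores can be sketched as follows. The assumptions here are substantial: the class `PartialScores` is hypothetical, the postings hold precomputed per-term weights, and the terms are processed one after another rather than in parallel, which gives the same totals in a simpler form.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical scorer: sums each query term's contribution per document.
final class PartialScores {
    // postings: term -> (doc ID -> weight of that term in that document).
    // Only documents containing at least one query term ever get a score entry.
    static Map<Integer, Double> score(Map<String, Map<Integer, Double>> postings,
                                      List<String> queryTerms) {
        Map<Integer, Double> scores = new HashMap<>();
        for (String term : queryTerms) {
            Map<Integer, Double> termPostings = postings.get(term);
            if (termPostings == null) {
                continue; // term does not occur in the collection
            }
            for (Map.Entry<Integer, Double> p : termPostings.entrySet()) {
                scores.merge(p.getKey(), p.getValue(), Double::sum); // add the partial score
            }
        }
        return scores;
    }

    // Doc IDs ordered by descending score, ties broken by ascending doc ID.
    static List<Integer> rank(Map<Integer, Double> scores) {
        List<Integer> ids = new ArrayList<>(scores.keySet());
        ids.sort((x, y) -> {
            int byScore = Double.compare(scores.get(y), scores.get(x));
            return byScore != 0 ? byScore : Integer.compare(x, y);
        });
        return ids;
    }

    public static void main(String[] args) {
        Map<String, Map<Integer, Double>> postings = new HashMap<>();
        postings.put("bass", Map.of(1, 2.0, 2, 1.0));
        postings.put("fishing", Map.of(1, 1.0));
        postings.put("guitar", Map.of(3, 5.0));
        System.out.println(rank(score(postings, List.of("bass", "fishing"))));
    }
}
```

Document 3 in the usage example never receives a score because it contains none of the query terms.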
12
Query
Threshold
◦ If a document's partial score is too low for it to make the N-best list, the document is ignored even before the search is complete. Example: Potential New Doc [0,0,0,0,0,0,i]; Document ranked 14 [233,202,109,100,i]; Potential New Doc is ignored
◦ A small loss of recall greatly increases the speed of the search
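A rough sketch of the threshold idea follows. It is a simplified heuristic written for illustration, not the algorithm from the Cutting et al paper: while processing a later term, a document that has no score yet is admitted only if its first partial score beats the current N-th best score. The class `ThresholdPruning` and its method names are hypothetical.

```java
import java.util.Collection;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.PriorityQueue;

// Hypothetical pruning sketch: late-arriving, low-scoring documents are skipped.
final class ThresholdPruning {
    // The n-th best score among the given scores, or 0.0 if there are fewer than n.
    static double threshold(Collection<Double> scores, int n) {
        PriorityQueue<Double> best = new PriorityQueue<>(); // min-heap of the n best so far
        for (double s : scores) {
            best.add(s);
            if (best.size() > n) {
                best.poll(); // discard the smallest, keeping only n scores
            }
        }
        return best.size() < n ? 0.0 : best.peek();
    }

    // termPostings: one (doc ID -> weight) map per query term, in processing order.
    static Map<Integer, Double> score(List<Map<Integer, Double>> termPostings, int n) {
        Map<Integer, Double> scores = new HashMap<>();
        for (Map<Integer, Double> postings : termPostings) {
            // Cutoff is taken once per term, before that term's postings are applied.
            double cutoff = threshold(scores.values(), n);
            for (Map.Entry<Integer, Double> p : postings.entrySet()) {
                Double partial = scores.get(p.getKey());
                if (partial != null) {
                    scores.put(p.getKey(), partial + p.getValue()); // known doc: keep accumulating
                } else if (p.getValue() > cutoff) {
                    scores.put(p.getKey(), p.getValue()); // new doc that can still compete
                }
                // Otherwise the potential new doc is ignored (possible small loss of recall).
            }
        }
        return scores;
    }

    public static void main(String[] args) {
        List<Map<Integer, Double>> terms = List.of(
                Map.of(1, 5.0, 2, 4.0, 3, 3.0),
                Map.of(4, 1.0, 1, 2.0, 5, 6.0));
        System.out.println(score(terms, 2).keySet());
    }
}
```

In the usage example, document 4 first appears at the second term with a weight below the current second-best score, so it never gets an accumulator.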
13
Evaluation of Lucene
Quantitative Evaluation of Passage Retrieval Algorithms for Question Answering
◦ Tellex et al, MIT AI Lab 2003
Compared Prise to Lucene for question-answering tasks
◦ Question & Answer
14
Evaluation of Lucene
Prise
◦ An IR system developed by NIST that, according to the paper, uses “modern” search engine techniques
Findings
◦ Found Prise was better than Lucene, since “Boolean” query engines are considered old school and Prise's answers to questions were better
15
Evaluation of Lucene
Lucene
◦ Although Prise gave more correct answers, Lucene found more documents containing relevant information
MIT used Lucene, not Prise, in their 2005 TREC submission
16
Users
Lucene is used widely
◦ TREC
◦ Document Retrieval Enterprise Systems
◦ Part of Database/Web engine
◦ Part of Nutch
◦ Used by academics for large projects
17
Conclusions
Lucene is a good set of classes
◦ Designed to allow customization without having to “reinvent the wheel”
◦ Robust
◦ Fast
◦ Large development groups
◦ Used widely in academia and industry