1
Lucene
Brian Nisonger
Feb 08, 2006
2
What is it?
- Doug Cutting’s grandmother’s middle name
- An open-source set of Java classes
  - Search Engine/Document Classifier/Indexer
  - http://lucene.sourceforge.net/talks/pisa/
- Developed by Doug Cutting 1996
  - Xerox/Apple/Excite/Nutch
  - Wrote several papers in IR
3
What is it-Nuts and Bolts
- Modules for IR
  - Analysis
    - Tokenization
    - Where tokens are indexed
  - Document
    - Where the Document ID is created
    - Date of document is extracted
    - Title of document is extracted
4
Nuts and Bolts-II
- Modules (cont’d)
  - Index
    - Provides access to indexes
    - Maintains indexes
  - Query Parser
    - Where the magic of query happens
  - Search
    - Searches across indexes
5
Nuts and Bolts-III
- Modules (cont’d)
  - Search Spans
    - Spans
    - Matches terms within +/- K words of each other
    - Example:
      - Find me a document that has Rachael Ray and Alton Brown within 100 words of each other that also has the term cooking
  - Store/Util
    - Stores the indexes and handles other housekeeping
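The span example above can be sketched in a few lines of plain Python. This is only an illustration of the idea, not Lucene's span query API: the function names (`positions`, `matches`) are hypothetical, tokenization is a bare whitespace split, and the distance is measured between the start positions of the two names.

```python
from itertools import product


def positions(tokens, phrase):
    """Return every start index at which the word list `phrase` occurs in `tokens`."""
    n = len(phrase)
    return [i for i in range(len(tokens) - n + 1) if tokens[i:i + n] == phrase]


def matches(text, phrase_a, phrase_b, max_gap, required_term):
    """True if both phrases occur within `max_gap` words and `required_term` is present."""
    tokens = text.lower().split()
    if required_term.lower() not in tokens:
        return False
    pos_a = positions(tokens, phrase_a.lower().split())
    pos_b = positions(tokens, phrase_b.lower().split())
    # Compare every occurrence of the first phrase with every occurrence of the second.
    return any(abs(a - b) <= max_gap for a, b in product(pos_a, pos_b))


doc = "rachael ray shared a cooking tip with alton brown"
print(matches(doc, "Rachael Ray", "Alton Brown", 100, "cooking"))
```

Tightening `max_gap` or dropping the word "cooking" from the document makes the same call return False.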
6
Theory
- Space Optimizations for Total Ranking
  - Cutting et al 1996
  - RIAO (Computer Assisted IR) 1997
  - http://lucene.sf.net/papers/riao97.ps
- Lucene lecture at Pisa
  - Doug Cutting
  - Slides from lecture at University of Pisa 2004
  - See previous link
7
Vector
- Terms and documents are represented as vectors, and the distance between vectors measures how closely they are related
- Uses cosine similarity to determine how close terms/documents are (a higher score means closer)
- This score can then be used for WSD/Clustering/IR
- Example:
  - Bass, fishing: .6506
  - Bass, guitar: .000423
  - This tells us the document is about fishing, not about guitars
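A minimal sketch of the cosine measure, assuming raw term counts as vector weights (the slide does not say which weighting produced its numbers, so the scores here will not match .6506 and .000423). The helper names are hypothetical.

```python
import math
from collections import Counter


def term_vector(text):
    """Represent a text as a sparse vector of term counts."""
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse term vectors (0.0 if either is empty)."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


doc = term_vector("bass fishing lake bass boat")
print(cosine_similarity(doc, term_vector("bass fishing")))
print(cosine_similarity(doc, term_vector("bass guitar")))
```

The first score comes out higher than the second, which is the same kind of signal the slide uses to decide that the document is about fishing.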
8
Vectors-IR
- “Vector-space search engines use the notion of a term space, where each document is represented as a vector in a high-dimensional space. There are as many dimensions as there are unique words in the entire collection. Because a document's position in the term space is determined by the words it contains, documents with many words in common end up close together, while documents with few shared words end up far apart.”
  - http://www.perl.com/pub/a/2003/02/19/engine.html
- Intro to Comp Ling and its applications to IR
  - Nisonger 2005 :P
9
Inverted Index
- Term/Doc Id/Weight
- Term
  - “A Token, the basic unit of indexing in Lucene, represents a single word to be indexed after any document domain transformation -- such as stop-word elimination, stemming, filtering, term normalization, or language translation -- has been applied.”
  - http://www.javaworld.com/javaworld/jw-09-2000/jw-0915-lucene-p2.html
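The quoted definition describes tokens as what is left after the analysis step. A rough sketch of such a step, assuming only lowercasing and stop-word elimination (no stemming or translation), might look like this. The stop-word list and the `analyze` function are illustrative assumptions, not Lucene's analyzer classes.

```python
import re

# A tiny, made-up stop-word list for illustration only.
STOP_WORDS = frozenset({"a", "an", "and", "of", "the", "to"})


def analyze(text, stop_words=STOP_WORDS):
    """Lowercase the text, split it into alphanumeric tokens, and drop stop words."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [t for t in tokens if t not in stop_words]


print(analyze("The Bass and the Guitar"))
```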
10
Inverted Index (cont’d)
- Doc Id
  - A unique “key” that identifies each document
- Weight
  - Binary
  - Freq Count
  - Weighting Algorithm
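The Term/Doc Id/Weight layout can be sketched as a dictionary from each term to a list of (doc id, weight) pairs. This sketch assumes the "Freq Count" weighting from the slide and a plain whitespace tokenizer. It is not how Lucene stores its index on disk.

```python
from collections import Counter, defaultdict


def build_inverted_index(docs):
    """Map each term to a list of (doc_id, frequency) postings.

    `docs` is a dict of doc_id -> text. Postings are in ascending doc_id order.
    """
    index = defaultdict(list)
    for doc_id in sorted(docs):
        counts = Counter(docs[doc_id].lower().split())
        for term, freq in counts.items():
            index[term].append((doc_id, freq))
    return dict(index)


index = build_inverted_index({1: "bass fishing bass", 2: "bass guitar"})
print(index["bass"])
```

Switching to binary weights would mean storing 1 in place of `freq`; a weighting algorithm would replace `freq` with a computed score.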
11
Index Merge
- Basic/Basket/Basketball
  - Only keeps track of the differences between adjacent words in the sorted term list
- Periodically merges indexes
  - Allows new documents to be added easily
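Both ideas on this slide can be sketched briefly. The first two functions store each term as "how many leading characters it shares with the previous term, plus the remaining suffix" (often called front coding). The third merges two small indexes into one. All names are hypothetical, the index layout is a simple term -> postings dictionary, and the merge assumes the two indexes cover different documents.

```python
def front_encode(sorted_terms):
    """Encode each term as (shared_prefix_length, suffix) relative to the previous term."""
    encoded = []
    prev = ""
    for term in sorted_terms:
        shared = 0
        limit = min(len(prev), len(term))
        while shared < limit and prev[shared] == term[shared]:
            shared += 1
        encoded.append((shared, term[shared:]))
        prev = term
    return encoded


def front_decode(encoded):
    """Rebuild the original term list from (shared_prefix_length, suffix) pairs."""
    terms = []
    prev = ""
    for shared, suffix in encoded:
        term = prev[:shared] + suffix
        terms.append(term)
        prev = term
    return terms


def merge_indexes(a, b):
    """Merge two term -> [(doc_id, weight), ...] indexes into a new, sorted index."""
    merged = {}
    for term in sorted(a.keys() | b.keys()):
        merged[term] = sorted(a.get(term, []) + b.get(term, []))
    return merged


print(front_encode(["basic", "basket", "basketball"]))
# → [(0, 'basic'), (3, 'ket'), (6, 'ball')]
```

With merging available, new documents can go into a small separate index first and be folded into the larger one later.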
12
Query
- Boolean Search
  - Only searches documents with at least 1 term in query
  - “Boolean Search Engine”
- Parallel Search
  - Each term in the query is searched in parallel
  - Partial scores are added to a queue of docs
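A simplified sketch of this scoring scheme follows. It walks the postings of each query term and adds that term's weight to the document's partial score, so a document with none of the query terms is never touched. The terms are processed one after another here, not in parallel, and the `search` function and index layout are assumptions made for illustration.

```python
from collections import defaultdict


def search(index, query_terms, n_best):
    """Return the n_best (doc_id, score) pairs, highest score first.

    `index` maps term -> [(doc_id, weight), ...].
    """
    scores = defaultdict(float)
    for term in query_terms:
        for doc_id, weight in index.get(term, []):
            scores[doc_id] += weight  # accumulate this term's partial score
    # Highest score first; ties are broken by the smaller doc_id.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))[:n_best]


index = {
    "bass": [(1, 2.0), (2, 1.0)],
    "guitar": [(2, 1.0)],
    "lake": [(3, 1.0)],
}
print(search(index, ["bass", "guitar"], 2))
```

Document 3 contains neither query term, so it never receives a score.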
13
Query-II
- Threshold
  - If a document’s partial score is too low for it to make the N-best list, the document is ignored even before the search is complete
  - Example
    - Potential New Doc [0,0,0,0,0,0,i]
    - Document ranked 14 [233,202,109,100,i]
    - Potential New Doc is ignored
  - A small loss of recall greatly increases the speed of search
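One way to sketch the threshold idea is shown here: before processing each query term, compare the most that a not-yet-seen document could still score against the current N-th best partial score, and stop admitting new documents once they can no longer get ahead of it. This is a conservative simplification assuming non-negative weights, not the exact heuristic from the paper. The slide's version is more aggressive and accepts a small loss of recall. All names are hypothetical.

```python
def search_with_threshold(index, query_terms, n_best):
    """Top-n search that stops admitting new documents once they cannot reach the N-best.

    `index` maps term -> [(doc_id, weight), ...] with non-negative weights.
    """
    def max_weight(term):
        return max((w for _, w in index.get(term, [])), default=0.0)

    # Process the terms with the largest possible contribution first.
    ordered = sorted(query_terms, key=max_weight, reverse=True)

    # remaining[i] = the most a document could still gain from terms i, i+1, ...
    remaining = [0.0] * (len(ordered) + 1)
    for i in range(len(ordered) - 1, -1, -1):
        remaining[i] = remaining[i + 1] + max_weight(ordered[i])

    scores = {}
    for i, term in enumerate(ordered):
        ranked = sorted(scores.values(), reverse=True)
        # Current N-th best partial score (0.0 while fewer than N docs are known).
        threshold = ranked[n_best - 1] if len(ranked) >= n_best else 0.0
        admit_new = remaining[i] > threshold
        for doc_id, weight in index.get(term, []):
            if doc_id in scores:
                scores[doc_id] += weight
            elif admit_new:
                scores[doc_id] = weight
            # Otherwise: a new doc that cannot get ahead of the N-best is ignored.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))[:n_best]


index = {
    "a": [(1, 5.0), (2, 4.0)],
    "b": [(1, 3.0), (2, 3.0)],
    "c": [(3, 1.0), (1, 1.0)],
}
print(search_with_threshold(index, ["a", "b", "c"], 2))
```

In this run document 3 first appears under the last term, when at most 1.0 can still be gained while the second-best document already holds 7.0, so it is skipped without being scored.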
14
Evaluation of Lucene
- Quantitative Evaluation of Passage Retrieval Algorithms for Question Answering
  - Tellex et al, MIT AI Lab 2003
  - Compared Prise to Lucene for question answering tasks
- Question & Answer
15
Evaluation-II
- Prise
  - An IR system developed by NIST that, according to the paper, uses “modern” search engine techniques
- Findings
  - Found Prise was better than Lucene, since “Boolean” query engines are considered old school and Prise’s answers to questions were better
16
Eval-III
- Lucene
  - Found that although Prise did better on correct answers, Lucene found more documents containing relevant information
17
Eval-Conclusion
- External Knowledge Sources for Question Answering
  - http://people.csail.mit.edu/gremio/publications/TREC2005.ps
  - Katz et al, MIT Lab 2005
- MIT used Lucene, not Prise, in their 2005 TREC submission
18
Users
- Lucene is used widely
  - TREC
  - Document Retrieval Enterprise Systems
  - Part of Database/Web engine
  - Part of Nutch
- Used by academics for large projects
  - MIT, AI Lab
  - Know-It-All Project (UW)
19
Conclusions
- Lucene is a good set of classes
  - Designed to allow customization without having to “reinvent the wheel”
  - Robust
  - Fast
- Large development groups
- Used widely in academia and industry
20
Questions?
- Feel free to ask questions, make comments, tell jokes.
21
That’s ALL Folks!!!!!