Lucene. Brian Nisonger, Feb 08, 2006.

Presentation transcript:

Lucene
Brian Nisonger, Feb 08, 2006

What is it?
- Doug Cutting’s grandmother’s middle name
- An open source set of Java classes
  - Search engine / document classifier / indexer
- Developed by Doug Cutting, 1996
  - Xerox / Apple / Excite / Nutch
  - Wrote several papers in IR

What is it: Nuts and Bolts
- Modules for IR
  - Analysis
    - Tokenization
    - Where tokens are indexed
  - Document
    - Where the document ID is created
    - Date of document is extracted
    - Title of document is extracted

Nuts and Bolts II
- Modules (cont’d)
  - Index
    - Provides access to indexes
    - Maintains indexes
  - Query Parser
    - Where the magic of query happens
  - Search
    - Searches across indexes

Nuts and Bolts III
- Modules (cont’d)
  - Search Spans
    - Spans
      - K +/- words
      - Example: find me a document that has Rachael Ray and Alton Brown within 100 words of each other and that also has the term cooking
  - Store/Util
    - Store the indexes and other housekeeping
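The span example above can be sketched in a few lines of plain Python. This is only an illustration of the idea of a proximity constraint, not Lucene's span query implementation; the function names (`phrase_positions`, `within_span`, `span_match`) and the whitespace tokenizer are assumptions made for the sketch.

```python
def phrase_positions(tokens, phrase):
    """Return the start positions where every word of `phrase` occurs in order."""
    words = phrase.lower().split()
    n = len(words)
    return [i for i in range(len(tokens) - n + 1) if tokens[i:i + n] == words]


def within_span(positions_a, positions_b, max_distance):
    """True if some position in A is within max_distance tokens of some position in B."""
    a, b = sorted(positions_a), sorted(positions_b)
    i = j = 0
    # Walk both sorted lists together, always advancing the smaller position.
    while i < len(a) and j < len(b):
        if abs(a[i] - b[j]) <= max_distance:
            return True
        if a[i] < b[j]:
            i += 1
        else:
            j += 1
    return False


def span_match(text, phrase_a, phrase_b, max_distance, required_term):
    """Check the proximity constraint plus one additional required term."""
    tokens = text.lower().split()  # hypothetical, very naive tokenizer
    if required_term.lower() not in tokens:
        return False
    return within_span(
        phrase_positions(tokens, phrase_a),
        phrase_positions(tokens, phrase_b),
        max_distance,
    )


doc = "Rachael Ray hosted a cooking show with Alton Brown"
print(span_match(doc, "Rachael Ray", "Alton Brown", 100, "cooking"))
```

The check works on token positions, which is why an index that supports spans has to record where each term occurs and not just which documents contain it.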

Theory
- Space Optimization for Total Ranking
  - Cutting et al 1996
  - RIAO (Computer Assisted IR) 1997
- Lucene lecture at Pisa
  - Doug Cutting
  - Slides from lecture at University of Pisa 2004
  - See previous link

Vector
- Terms are represented as vectors, and the distance between two vectors measures how closely the terms are related
- Uses a cosine measure to determine how close terms/documents are
- This measure can then be used for WSD/Clustering/IR
- Example:
  - Bass,fishing: .6506
  - Bass,guitar:
- This tells us the document is about fishing, not about guitars
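The slide does not give the formula, so the following is the standard cosine measure, stated here as an assumption about what is meant. For two term (or document) vectors a and b:

```latex
\cos(\mathbf{a}, \mathbf{b})
  = \frac{\mathbf{a} \cdot \mathbf{b}}{\lVert \mathbf{a} \rVert \, \lVert \mathbf{b} \rVert}
  = \frac{\sum_i a_i b_i}{\sqrt{\sum_i a_i^2} \, \sqrt{\sum_i b_i^2}}
```

With non-negative counts as components, the value lies between 0 (nothing in common) and 1 (same direction). Read this way, the higher score is the closer pair, which matches the reading of the Bass,fishing example.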

Vectors: IR
“Vector-space search engines use the notion of a term space, where each document is represented as a vector in a high-dimensional space. There are as many dimensions as there are unique words in the entire collection. Because a document's position in the term space is determined by the words it contains, documents with many words in common end up close together, while documents with few shared words end up far apart.”
- Intro to Comp Ling and its applications to IR
- Nisonger 2005 :P
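A minimal sketch of the term space described in the quote, assuming raw word counts as vector components and a whitespace tokenizer. The three toy documents and all function names are made up for illustration; a real engine would use sparse vectors and weighted terms.

```python
import math
from collections import Counter


def build_vocabulary(docs):
    """One dimension per unique word in the whole collection."""
    return sorted({word for doc in docs for word in doc.lower().split()})


def to_vector(doc, vocab):
    """Place a document in the term space using raw word counts."""
    counts = Counter(doc.lower().split())
    return [counts[word] for word in vocab]


def cosine(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


# Hypothetical toy collection.
docs = [
    "bass fishing lake boat",
    "fishing boat lake trip",
    "bass guitar amp strings",
]
vocab = build_vocabulary(docs)
vectors = [to_vector(doc, vocab) for doc in docs]

print(len(vocab))                       # number of dimensions
print(cosine(vectors[0], vectors[1]))   # many shared words
print(cosine(vectors[0], vectors[2]))   # few shared words
```

The first two documents share three of their four words and end up close together, while the first and third share only one word and end up far apart.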

Inverted Index
- Term / Doc Id / Weight
- Term
  - “A Token, the basic unit of indexing in Lucene, represents a single word to be indexed after any document domain transformation -- such as stop-word elimination, stemming, filtering, term normalization, or language translation -- has been applied.”
  - /jw-0915-lucene-p2.html

Inverted Index (cont’d)
- Doc Id
  - A unique “key” that identifies each document
- Weight
  - Binary
  - Freq Count
  - Weighting Algorithm
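The Term / Doc Id / Weight layout can be sketched as a dictionary that maps each term to its postings. This is a toy model, not Lucene's on-disk format: the stop-word list, the `analyze` function and the sample documents are assumptions, and the weight used here is the plain frequency count from the list above.

```python
from collections import defaultdict

# Illustrative stop-word list; a real analyzer would use a larger one.
STOP_WORDS = {"a", "the", "of", "and", "in"}


def analyze(text):
    """Lowercase, strip punctuation and drop stop words to produce terms."""
    terms = []
    for raw in text.lower().split():
        word = raw.strip(".,;:!?\"'()")
        if word and word not in STOP_WORDS:
            terms.append(word)
    return terms


def build_index(docs):
    """Map term -> {doc_id: frequency count}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for term in analyze(text):
            index[term][doc_id] = index[term].get(doc_id, 0) + 1
    return index


# Hypothetical documents keyed by their unique Doc Id.
docs = {
    1: "The bass swam in the lake.",
    2: "Bass guitar and bass amp.",
}
index = build_index(docs)
print(index["bass"])
```

A binary weight only needs to know whether a Doc Id appears in a term's postings. A weighting algorithm would replace the raw count with a computed score.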

Index Merge
- Basic/Basket/Basketball
  - Only keeps track of the differences between words
- Periodically merges indexes
  - Allows new documents to be added easily
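One way to read the Basic/Basket/Basketball bullet is as prefix compression of a sorted term list, where each term is stored as the number of characters it shares with the previous term plus the remaining suffix. That reading is an assumption, and the sketch below is not Lucene's file format. The `merge_segments` helper is likewise a hypothetical, simplified picture of combining a small new index into an older one.

```python
def front_encode(sorted_terms):
    """Store each term as (shared prefix length with previous term, suffix)."""
    encoded = []
    previous = ""
    for term in sorted_terms:
        shared = 0
        limit = min(len(previous), len(term))
        while shared < limit and previous[shared] == term[shared]:
            shared += 1
        encoded.append((shared, term[shared:]))
        previous = term
    return encoded


def front_decode(encoded):
    """Rebuild the full terms from the (shared, suffix) pairs."""
    terms = []
    previous = ""
    for shared, suffix in encoded:
        term = previous[:shared] + suffix
        terms.append(term)
        previous = term
    return terms


def merge_segments(older, newer):
    """Combine two term -> {doc_id: weight} indexes; doc ids are assumed distinct."""
    merged = {term: dict(postings) for term, postings in older.items()}
    for term, postings in newer.items():
        merged.setdefault(term, {}).update(postings)
    return merged


encoded = front_encode(["basic", "basket", "basketball"])
print(encoded)
```

Under this scheme new documents can go into a small separate index first and be folded into the larger one later, which is the sense in which periodic merging makes additions easy.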

Query
- Boolean Search
  - Only searches documents with at least 1 term in query
  - “Boolean Search Engine”
- Parallel Search
  - Each term in query is searched in parallel
  - Partial scores added to queue of docs
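The two ideas on this slide can be sketched together: only documents that appear in at least one query term's postings are ever touched, and each term contributes a partial score to those documents. The loop below visits terms one after another for simplicity, where the slide describes them being searched in parallel. The toy index and the function name `score_query` are assumptions.

```python
from collections import defaultdict


def score_query(index, query_terms):
    """Accumulate per-term partial scores for documents containing any query term."""
    partial = defaultdict(float)
    for term in query_terms:
        # Documents with none of the query terms never get an entry.
        for doc_id, weight in index.get(term, {}).items():
            partial[doc_id] += weight
    # Highest score first; ties broken by doc id for a stable order.
    return sorted(partial.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical index: term -> {doc_id: weight}.
index = {
    "bass": {1: 1.0, 2: 2.0},
    "guitar": {2: 1.0},
    "lake": {1: 1.0, 3: 1.0},
}
print(score_query(index, ["bass", "guitar"]))
```

Document 3 contains neither query term, so it is never scored at all.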

Query II
- Threshold
  - If a document's partial score is too low for it to make the N-best list, the document is ignored even before the search is complete
  - Example
    - Potential New Doc [0,0,0,0,0,0,i]
    - Document ranked 14 [233,202,109,100,i]
    - Potential New Doc is ignored
  - Small loss of recall greatly increases speed of search
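A rough sketch of threshold pruning, under stated assumptions: terms are processed from highest to lowest maximum weight, and a document seen for the first time is only admitted if the most it could still earn exceeds the current N-th best partial score. This is one conservative variant written for illustration, not the algorithm from the slides or from Lucene; a more aggressive threshold would skip more documents and trade some recall for speed, as the slide notes. The index and the name `top_n_with_threshold` are hypothetical.

```python
import heapq


def top_n_with_threshold(index, query_terms, n):
    """Return (top-n ranked docs, number of candidate docs skipped by the threshold)."""
    terms = [t for t in query_terms if index.get(t)]
    # Highest-impact terms first, so strong candidates are found early.
    terms.sort(key=lambda t: max(index[t].values()), reverse=True)
    max_weights = [max(index[t].values()) for t in terms]

    partial = {}
    skipped = 0
    for i, term in enumerate(terms):
        # Upper bound on the score of a document first seen at this term.
        remaining = sum(max_weights[i:])
        threshold = None
        if len(partial) >= n:
            threshold = heapq.nlargest(n, partial.values())[-1]
        for doc_id, weight in index[term].items():
            if doc_id in partial:
                partial[doc_id] += weight
            elif threshold is None or remaining > threshold:
                partial[doc_id] = weight
            else:
                skipped += 1  # cannot reach the N-best list, so ignore it

    ranked = sorted(partial.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:n], skipped


# Hypothetical index: term -> {doc_id: weight}.
index = {
    "bass": {1: 5.0, 2: 4.0, 3: 3.0},
    "guitar": {2: 2.0, 4: 1.0},
}
top, skipped = top_n_with_threshold(index, ["guitar", "bass"], 2)
print(top, skipped)
```

In this run document 4 first shows up while processing "guitar", where it could earn at most 2.0 against a second-best partial score of 4.0, so it never gets an accumulator.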

Evaluation of Lucene
- Quantitative Evaluation of Passage Retrieval Algorithms for Question Answering
  - Tellex et al, MIT AI Lab 2003
  - Compared Prise to Lucene for question and answer tasks
  - Question & Answer

Evaluation II
- Prise
  - An IR system developed by NIST that, according to the paper, uses “modern” search engine techniques
- Findings
  - Found Prise was better than Lucene, since “Boolean” query engines are considered old school and its answers to questions were better

Eval III
- Lucene
  - Found that although Prise had better correct answers, Lucene found more documents containing relevant information

Eval: Conclusion
- External Knowledge Sources for Question Answering
  - tions/TREC2005.ps
  - Katz et al, MIT Lab 2005
- MIT used Lucene in their 2005 TREC submission, not Prise

Users
- Lucene is used widely
  - TREC
  - Document Retrieval Enterprise Systems
  - Part of Database/Web engine
  - Part of Nutch
- Used by academics for large projects
  - MIT, AI Lab
  - Know-It-All Project (UW)

Conclusions
- Lucene is a good set of classes
  - Designed to allow customization without having to “reinvent the wheel”
- Robust
- Fast
- Large development groups
- Used widely in academia and industry

Questions?
Feel free to ask questions, make comments, tell jokes.

That’s ALL Folks!!!!!