Advanced Lucene Grant Ingersoll Center for Natural Language Processing ApacheCon 2005 December 12, 2005.


Overview
Quick Lucene Refresher
Term Vectors
Span Queries
Building an IR Framework
Case Studies from CNLP
–Collection analysis for domain specialization
–Crosslingual/multilingual retrieval in Arabic, English and Dutch
–Sublanguage analysis for commercial trouble ticket analysis
–Passage retrieval and analysis for Question Answering application
Resources

Lucene Refresher
Indexing
–A Document is a collection of Fields
–A Field is free text, keywords, dates, etc.
–A Field can have several characteristics: indexed, tokenized, stored, term vectors
Apply Analyzer to alter Tokens during indexing
–Stemming
–Stopword removal
–Phrase identification
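As a quick refresher in code, a minimal indexing sketch in the Lucene 1.9-era API used throughout these slides (the index path, field names and analyzer choice are illustrative assumptions, not part of the original demo):

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;

public class IndexingSketch {
  public static void main(String[] args) throws Exception {
    // Create (or overwrite) an index in ./index; the Analyzer is applied
    // to tokenized fields when documents are added
    IndexWriter writer = new IndexWriter("index", new StandardAnalyzer(), true);

    Document doc = new Document();
    // Stored, untokenized keyword field
    doc.add(new Field("id", "doc-1", Field.Store.YES, Field.Index.UN_TOKENIZED));
    // Stored, tokenized free-text field
    doc.add(new Field("contents", "Lucene is a Java search library",
        Field.Store.YES, Field.Index.TOKENIZED));
    writer.addDocument(doc);

    writer.optimize();
    writer.close();
  }
}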

Lucene Refresher
Searching
–Uses a modified Vector Space Model (VSM)
–We convert a user's query into an internal representation that can be searched against the Index
–Queries are usually analyzed in the same manner as the index
–Get back Hits and use in an application
(Diagram: vector space model with term axes t1, t2, t3 and document vectors d1, d2, d3)
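A matching search sketch (again the 1.9-era API; the field name and query string are assumptions):

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;

public class SearchSketch {
  public static void main(String[] args) throws Exception {
    IndexSearcher searcher = new IndexSearcher("index");

    // Analyze the query the same way the field was analyzed at indexing time
    QueryParser parser = new QueryParser("contents", new StandardAnalyzer());
    Query query = parser.parse("java search library");

    // Hits are ranked by Lucene's modified vector space model
    Hits hits = searcher.search(query);
    for (int i = 0; i < hits.length(); i++) {
      System.out.println(hits.score(i) + "\t" + hits.doc(i).get("id"));
    }
    searcher.close();
  }
}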

Lucene Demo++
Based on original Lucene demo
Sample, working code for:
–Search
–Relevance Feedback
–Span Queries
–Collection Analysis
–Candidate Identification for QA
Requires: Lucene 1.9 RC1-dev
Available at

Lucene Term Vectors (TV)
In Lucene, a TermFreqVector is a representation of all of the terms and term counts in a specific Field of a Document instance
As a tuple: termFreq = <term, term frequency> for each term in the Field
As Java (parallel arrays):
public String getField();
public String[] getTerms();
public int[] getTermFrequencies();

Term Vector Usages
Relevance Feedback and “More Like This”
Domain specialization
Clustering
Cosine similarity between two documents
Highlighter
–Needs offset info
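As an example of one usage above, cosine similarity between two documents can be computed straight from the parallel arrays. A minimal sketch using raw term frequencies only (the field name and document numbers are assumptions, and a real implementation would usually apply IDF weighting):

import java.util.HashMap;
import java.util.Map;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.TermFreqVector;

public class CosineSketch {
  // Cosine similarity over raw term frequencies from two term vectors
  public static double cosine(TermFreqVector a, TermFreqVector b) {
    Map bFreqByTerm = new HashMap();
    String[] bTerms = b.getTerms();
    int[] bFreqs = b.getTermFrequencies();
    for (int i = 0; i < bTerms.length; i++) {
      bFreqByTerm.put(bTerms[i], new Integer(bFreqs[i]));
    }
    String[] aTerms = a.getTerms();
    int[] aFreqs = a.getTermFrequencies();
    double dot = 0.0, aNorm = 0.0, bNorm = 0.0;
    for (int i = 0; i < aTerms.length; i++) {
      aNorm += (double) aFreqs[i] * aFreqs[i];
      Integer bf = (Integer) bFreqByTerm.get(aTerms[i]);
      if (bf != null) {
        dot += (double) aFreqs[i] * bf.intValue();
      }
    }
    for (int i = 0; i < bFreqs.length; i++) {
      bNorm += (double) bFreqs[i] * bFreqs[i];
    }
    if (aNorm == 0.0 || bNorm == 0.0) {
      return 0.0;
    }
    return dot / (Math.sqrt(aNorm) * Math.sqrt(bNorm));
  }

  public static void main(String[] args) throws Exception {
    IndexReader reader = IndexReader.open("index");
    // Assumes term vectors were stored for the "contents" field
    TermFreqVector tv1 = reader.getTermFreqVector(0, "contents");
    TermFreqVector tv2 = reader.getTermFreqVector(1, "contents");
    System.out.println("cosine = " + cosine(tv1, tv2));
    reader.close();
  }
}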

Creating Term Vectors
During indexing, create a Field that stores Term Vectors:
new Field("title", parser.getTitle(), Field.Store.YES, Field.Index.TOKENIZED, Field.TermVector.YES);
Options are:
Field.TermVector.YES
Field.TermVector.NO
Field.TermVector.WITH_POSITIONS – token positions
Field.TermVector.WITH_OFFSETS – character offsets
Field.TermVector.WITH_POSITIONS_OFFSETS – both positions and offsets

Accessing Term Vectors
Term Vectors are acquired from the IndexReader using:
TermFreqVector getTermFreqVector(int docNumber, String field)
TermFreqVector[] getTermFreqVectors(int docNumber)
Can be cast to TermPositionVector if the vector was created with offset or position information
TermPositionVector API:
int[] getTermPositions(int index);
TermVectorOffsetInfo[] getOffsets(int index);
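A short sketch of pulling position and offset information back out, e.g. for highlighting (assumes the "contents" field was indexed with Field.TermVector.WITH_POSITIONS_OFFSETS; the document number is illustrative):

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.TermFreqVector;
import org.apache.lucene.index.TermPositionVector;
import org.apache.lucene.index.TermVectorOffsetInfo;

public class OffsetsSketch {
  public static void main(String[] args) throws Exception {
    IndexReader reader = IndexReader.open("index");
    TermFreqVector tfv = reader.getTermFreqVector(0, "contents");

    // The cast is only valid if positions and/or offsets were stored
    TermPositionVector tpv = (TermPositionVector) tfv;
    String[] terms = tpv.getTerms();
    for (int i = 0; i < terms.length; i++) {
      int[] positions = tpv.getTermPositions(i);          // token positions
      TermVectorOffsetInfo[] offsets = tpv.getOffsets(i); // character offsets
      System.out.println(terms[i] + " occurs " + positions.length + " time(s)");
      for (int j = 0; offsets != null && j < offsets.length; j++) {
        System.out.println("  chars [" + offsets[j].getStartOffset()
            + ", " + offsets[j].getEndOffset() + ")");
      }
    }
    reader.close();
  }
}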

Relevance Feedback
Expand the original query using terms from documents
Manual Feedback:
–User selects which documents are most relevant and, optionally, non-relevant
–Get the terms from the term vector for each of the documents and construct a new query
–Can apply boosts based on term frequencies
Pseudo Feedback:
–Application assumes the top X documents are relevant and the bottom Y are non-relevant, and constructs a new query based on the terms in those documents
See Modern Information Retrieval by Baeza-Yates et al. for an in-depth discussion of feedback

Example
From the demo, SearchServlet.java
Code to get the top X terms from a TV:
protected Collection getTopTerms(TermFreqVector tfv, int numTermsToReturn) {
  String[] terms = tfv.getTerms(); //get the terms
  int[] freqs = tfv.getTermFrequencies(); //get the frequencies
  List result = new ArrayList(terms.length);
  for (int i = 0; i < terms.length; i++) {
    //create a container for the Term and Frequency information
    result.add(new TermFreq(terms[i], freqs[i]));
  }
  Collections.sort(result, comparator); //sort by frequency
  if (numTermsToReturn < result.size()) {
    result = result.subList(0, numTermsToReturn);
  }
  return result;
}
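The terms that come back can then be folded into an expanded query. A minimal sketch of that step (the field name and the choice to boost by raw frequency are assumptions; here every term from one relevant document's vector is used, where in practice you would restrict to the top terms returned by getTopTerms):

import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermFreqVector;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;

public class FeedbackQuerySketch {
  // OR the original query with terms drawn from a relevant document's
  // term vector, boosting each term by its frequency in that document
  public static Query expand(Query original, TermFreqVector tfv, String field) {
    BooleanQuery expanded = new BooleanQuery();
    expanded.add(original, BooleanClause.Occur.SHOULD);
    String[] terms = tfv.getTerms();
    int[] freqs = tfv.getTermFrequencies();
    for (int i = 0; i < terms.length; i++) {
      TermQuery tq = new TermQuery(new Term(field, terms[i]));
      tq.setBoost(freqs[i]);
      expanded.add(tq, BooleanClause.Occur.SHOULD);
    }
    return expanded;
  }
}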

Span Queries
Provide info about where a match took place within a document
SpanTermQuery is the building block for more complicated queries
Other SpanQuery classes:
–SpanFirstQuery, SpanNearQuery, SpanNotQuery, SpanOrQuery

Better Phrase Matching
SpanNearQuery provides functionality similar to PhraseQuery
Uses position distance instead of edit distance
Advantages:
–Less slop is required to match terms
–Can be built up of other SpanQuery instances

Spans
The Spans object provides document and position info about the current match
Interface definitions:
boolean next() //Move to the next match
int doc() //the doc id of the match
int start() //the match start position
int end() //the match end position
boolean skipTo(int target) //skip to a doc
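Iterating over matches with that interface looks roughly like this (the field, terms and slop are assumptions; the terms are lowercased to match StandardAnalyzer output):

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.spans.SpanNearQuery;
import org.apache.lucene.search.spans.SpanQuery;
import org.apache.lucene.search.spans.SpanTermQuery;
import org.apache.lucene.search.spans.Spans;

public class SpansSketch {
  public static void main(String[] args) throws Exception {
    IndexReader reader = IndexReader.open("index");
    SpanQuery[] clauses = {
        new SpanTermQuery(new Term("contents", "lucene")),
        new SpanTermQuery(new Term("contents", "developers"))
    };
    // Terms must occur within 5 positions of each other, in any order
    SpanNearQuery near = new SpanNearQuery(clauses, 5, false);

    Spans spans = near.getSpans(reader);
    while (spans.next()) {  // one iteration per match
      System.out.println("doc " + spans.doc()
          + " positions [" + spans.start() + ", " + spans.end() + ")");
    }
    reader.close();
  }
}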

Example
SpanTermQuery section = new SpanTermQuery(new Term("test", "section"));
SpanTermQuery lucene = new SpanTermQuery(new Term("test", "Lucene"));
SpanTermQuery dev = new SpanTermQuery(new Term("test", "developers"));
SpanFirstQuery first = new SpanFirstQuery(section, 2);
Spans spans = first.getSpans(indexReader); //do something with the spans
SpanQuery[] clauses = {lucene, dev};
SpanNearQuery near = new SpanNearQuery(clauses, 2, true);
spans = near.getSpans(indexReader); //do something with the spans
clauses = new SpanQuery[]{section, near};
SpanNearQuery allNear = new SpanNearQuery(clauses, 3, false);
spans = allNear.getSpans(indexReader); //do something with the spans
Example text: “This section is for Lucene Java developers.” (the slide diagram marks the spans matched by first, near and allNear)

What is QA?
Return an answer, not just a list of hits that the user has to sift through
Question ambiguity:
–Who is the President?
Often built on top of IR Systems:
–IR: Query usually has the terms you are interested in
–QA: Query usually does not have the terms you are interested in
Future of search?!?

QA Needs
Candidate Identification
Determine Question types and focus
–Person, place, yes/no, number, etc.
Document Analysis
–Determine if candidates align with the question
Answering non-collection based questions
Results display

Candidate Identification for QA
A candidate is a potential answer for a question
Can be as short as a word or as large as multiple documents, based on system goals
Example:
–Q: How many Lucene properties can be tuned?
–Candidate: Lucene has a number of properties that can be tuned.

Span Queries and QA
Retrieve candidates using SpanNearQuery, built out of the keywords of the user's question
–Expand to include window larger than span
Score the sections based on system criteria: keywords, distances, ordering, others
Sample code in QAService.java in the demo

Example
QueryTermVector queryVec = new QueryTermVector(question, analyzer);
String[] terms = queryVec.getTerms();
int[] freqs = queryVec.getTermFrequencies();
SpanQuery[] clauses = new SpanQuery[terms.length];
for (int i = 0; i < terms.length; i++) {
  String term = terms[i];
  SpanTermQuery termQuery = new SpanTermQuery(new Term("contents", term));
  termQuery.setBoost(freqs[i]);
  clauses[i] = termQuery;
}
SpanQuery query = new SpanNearQuery(clauses, 15, false);
Spans spans = query.getSpans(indexReader);
//Now loop through the spans and analyze the documents
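A sketch of the loop the final comment refers to: group matches by document and favor tight, frequent windows. The scoring formula here is only a stand-in for illustration, not CNLP's actual criteria:

import java.util.HashMap;
import java.util.Map;
import org.apache.lucene.search.spans.Spans;

public class CandidateScoringSketch {
  // Accumulate a simple per-document score: each match contributes more
  // the narrower its window is. A real system would also weigh keyword
  // coverage, term ordering and the surrounding context.
  public static Map scoreSpans(Spans spans) throws java.io.IOException {
    Map docScores = new HashMap();
    while (spans.next()) {
      int width = spans.end() - spans.start();
      double score = 1.0 / (1.0 + width);
      Integer doc = new Integer(spans.doc());
      Double old = (Double) docScores.get(doc);
      docScores.put(doc, new Double(old == null ? score : old.doubleValue() + score));
    }
    return docScores;  // document id -> candidate score
  }
}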

Tying It All Together
(Architecture diagram connecting the Applications layer (Authoring, QA, L2L) with IR Tools, Collection Mgr, Extractor and TextTagger (NLP), backed by the search engine(s) (Lucene))

Building an IR Framework
Goals:
–Support rapid experimentation and deployment
–Configurable by non-programmers
–Pluggable search engines (Lucene, Lucene LM, others)
–Indexing and searching
–Relevance Feedback (manual and pseudo)
–Multilingual support
–Common search API across projects

Lucene IR Tools
Write implementations for Interfaces:
–Indexing
–Searching
–Analysis
Wrap Filters and Tokenizers, providing Bean properties for configuration
–Explanations
Use Digester to create components from the XML configuration
–Uses reflection, but only during startup
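One way such a wrapper could look, as a bean that Digester can instantiate and configure from XML before the framework asks it to wrap a TokenStream (the class name, property and wrap() method are hypothetical, not the actual CNLP framework API):

import org.apache.lucene.analysis.StopFilter;
import org.apache.lucene.analysis.TokenStream;

public class StopFilterWrapper {
  // Bean property, settable by non-programmers from the XML configuration
  private String stopWords = "a,an,the";

  public void setStopWords(String stopWords) {
    this.stopWords = stopWords;
  }

  // Called by the framework to insert the filter into the analysis chain
  public TokenStream wrap(TokenStream input) {
    return new StopFilter(input, stopWords.split(","));
  }
}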

Sample Configuration
Declare a Tokenizer:
<tokenizer name="standardTokenizer" class="StandardTokenizerWrapper"/>
Declare a Token Filter:
Declare an Analyzer: test standardTokenizer stop
Can also use existing Lucene Analyzers
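The tokenizer line above shows the configuration pattern; a token filter and an analyzer declaration in the same style might look something like the following, though the element and class names here are guesses rather than the actual configuration:

<tokenfilter name="stop" class="StopFilterWrapper"/>

<analyzer name="test">
  <tokenizer>standardTokenizer</tokenizer>
  <filter>stop</filter>
</analyzer>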

Case Studies
Domain Specialization
AIR (Arabic Information Retrieval)
Commercial QA
Trouble Ticket Analysis

Domain Specialization
At CNLP, we often work within specific domains
Users are Subject Matter Experts (SMEs) who require advanced search and discovery capabilities
We often cannot see the documents when developing solutions for customers
Domains include: Terrorism, CRM, Trouble Tickets, Insurance, Scientific, Banking

Domain Specialization
We provide the SME with tools to tailor our out-of-the-box processing for their domain
Users often want to see NLP features like:
–Phrase and entity identification
–Phrases and entities in the context of other terms
–Entity categorizations
–Occurrences and co-occurrences
Upon examining these, users can choose to alter (or remove) how terms were processed
Benefits:
–Less ambiguity yields better results for searches and extractions
–The SME has a better understanding of the collection

KBB
Provides several layers of co-occurrence data to the SME for custom processing:
Taxonomy -> Phrases
Terms -> Context

Lucene and Domains
Build an index of NLP features
Several alternatives for domain analysis:
–Store Term Vectors (including position and offset) and loop through selected documents
–Custom Analyzer/Filter produces Lucene tokens and calculates counts, etc.
–Span Queries for certain features such as term co-occurrence

AIR

AIR and Lucene
AIR uses the IR framework
Wrote several different filters for Arabic and English to provide stemming and phrase identification
One index per language
Do query analysis on both the Source and Target language
Added Dutch in less than 1 week

QA for CRM
Provide NLP-based QA for use in CRM applications
Index knowledge base documents, plus FAQs and PAQs
Uses Lucene to identify passages which are then processed by the QA system to determine answers
Index contains rich NLP features
Set boosts on the Query according to the importance of each term in the query; boosts are determined by our query analysis system (L2L)
We use the score from Lucene as a factor in scoring answers

Trouble Ticket Analysis
Client has Trouble Tickets (TT) which are a combination of:
–Fixed domain fields such as checkboxes
–Free text, often written using shorthand
Objective:
–Use NLP to discover patterns and trends across the TT collection

Using Lucene for Trouble Tickets
Write custom Filters and Tokenizers to process shorthand
–Tokenizer: certain symbols and whitespace delineate tokens
–Filters:
  Standardize shorthand: addresses, dates, equipment and structure tags
  Regular expression stopword removal
Use high-frequency terms to identify lexical clues for use in trend identification by TextTagger
Browse the collection to gain an understanding of the domain
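A sketch of the kind of shorthand-standardizing filter described above (the abbreviation table is invented for illustration; the real filters also tagged addresses, dates, equipment and structure):

import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;

// Rewrites shorthand tokens to a standard form so that variant spellings
// of the same concept collapse to a single index term
public class ShorthandFilter extends TokenFilter {
  private static final Map EXPANSIONS = new HashMap();
  static {
    EXPANSIONS.put("rplcd", "replaced");   // illustrative entries only
    EXPANSIONS.put("eqpt", "equipment");
    EXPANSIONS.put("xfmr", "transformer");
  }

  public ShorthandFilter(TokenStream input) {
    super(input);
  }

  public Token next() throws IOException {
    Token token = input.next();
    if (token == null) {
      return null;
    }
    String standard = (String) EXPANSIONS.get(token.termText().toLowerCase());
    if (standard == null) {
      return token;  // pass unknown tokens through unchanged
    }
    // Keep the original offsets so highlighting still points at the shorthand
    return new Token(standard, token.startOffset(), token.endOffset(), token.type());
  }
}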

Resources
Mailing List –
CNLP –
Lucene In Action –