Lucene Boot Camp I Grant Ingersoll Lucid Imagination Nov. 3, 2008 New Orleans, LA.

Slides:



Advertisements
Similar presentations
Lucene in action Information Retrieval A.A – P. Ferragina, U. Scaiella – – Dipartimento di Informatica – Università di Pisa –
Advertisements

Lucene Tutorial Based on Lucene in Action Michael McCandless, Erik Hatcher, Otis Gospodnetic.
Chapter 5: Introduction to Information Retrieval
© Copyright 2012 STI INNSBRUCK Apache Lucene Ioan Toma based on slides from Aaron Bannert
Advanced Indexing Techniques with Apache Lucene - Payloads Advanced Indexing Techniques with Michael Busch
Advanced Indexing Techniques with
EasySearch Technical Overview. Ever seen a website without a full text search? BUT – Search is expensive Financially Computationally – Search is complicated.
AskMe A Web-Based FAQ Management Tool Alex Albu. Background Fast responses to customer inquiries – key factor in customer satisfaction Costs for customer.
Lucene in action Information Retrieval A.A – P. Ferragina, U. Scaiella – – Dipartimento di Informatica – Università di Pisa –
Lucene Part3‏. Lucene High Level Infrastructure When you look at building your search solution, you often find that the process is split into two main.
Solr has a lot of extensive features Solr Integration and Enhancements Todd Hatcher.
Information Retrieval in Practice
Introduction to Open Source Search with Apache Lucene and Solr Grant Ingersoll.
Lucene Brian Nisonger Feb 08,2006. What is it? Doug Cutting’s grandmother’s middle name Doug Cutting’s grandmother’s middle name A open source set of.
INEX 2003, Germany Searching in an XML Corpus Using Content and Structure INEX 2003, Germany Yiftah Ben-Aharon, Sara Cohen, Yael Grumbach, Yaron Kanza,
Introduction to a Programming Environment
Microsoft ® Official Course Interacting with the Search Service Microsoft SharePoint 2013 SharePoint Practice.
Overview of Search Engines
Web Searching. Web Search Engine A web search engine is designed to search for information on the World Wide Web and FTP servers The search results are.
Full-Text Search with Lucene Yonik Seeley 02 May 2007 Amsterdam, Netherlands.
1 Introduction to Lucene Rong Jin. What is Lucene ?  Lucene is a high performance, scalable Information Retrieval (IR) library Free, open-source project.
Apache Lucene in LexGrid. Lucene Overview High-performance, full-featured text search engine library. Written entirely in Java. An open source project.
Advanced Lucene Grant Ingersoll Center for Natural Language Processing ApacheCon 2005 December 12, 2005.
Lucene Performance Grant Ingersoll November 16, 2007 Atlanta, GA.
Vyhľadávanie informácií Softvérové knižnice a systémy Vyhľadávanie informácií Michal Laclavík.
Lucene Part2. Lucene Jarkarta Lucene ( is a high- performance, full-featured, java, open-source, text search engine.
Chapter 2 Architecture of a Search Engine. Search Engine Architecture n A software architecture consists of software components, the interfaces provided.
A Survey of Patent Search Engine Software Jennifer Lewis April 24, 2007 CSE 8337.
University of North Texas Libraries Building Search Systems for Digital Library Collections Mark E. Phillips Texas Conference on Digital Libraries May.
Revolutionizing enterprise web development Searching with Solr.
Introduction to Nutch CSCI 572: Information Retrieval and Search Engines Summer 2010.
Search - on the Web and Locally Related directly to Web Search Engines: Part 1 and Part 2. IEEE Computer. June & August 2006.
1 Information Retrieval Acknowledgements: Dr Mounia Lalmas (QMW) Dr Joemon Jose (Glasgow)
Sébastien François, EPrints Lead Developer EPrints Developer Powwow, ULCC.
Autumn Web Information retrieval (Web IR) Handout #0: Introduction Ali Mohammad Zareh Bidoki ECE Department, Yazd University
Indexing UMLS concepts with Apache Lucene Julien Thibault University of Utah Department of Biomedical Informatics.
Search Engines. Search Strategies Define the search topic(s) and break it down into its component parts What terms, words or phrases do you use to describe.
Created By: Kevin Cherry. A library that creates a display to run on top of your game allowing you to retrieve/set values and invoke methods.
Introduction to Digital Libraries hussein suleman uct cs honours 2003.
Alexey Kolosoff, Michael Bogatyrev 1 Tula State University Faculty of Cybernetics Laboratory of Information Systems.
Searching the web Enormous amount of information –In 1994, 100 thousand pages indexed –In 1997, 100 million pages indexed –In June, 2000, 500 million pages.
Lucene-Demo Brian Nisonger. Intro No details about Implementation/Theory No details about Implementation/Theory See Treehouse Wiki- Lucene for additional.
Lucene. Lucene A open source set of Java Classses ◦ Search Engine/Document Classifier/Indexer 
XP Tutorial 8 Adding Interactivity with ActionScript.
Copyright © 2006 Pilothouse Consulting Inc. All rights reserved. Search Overview Search Features: WSS and Office Search Architecture Content Sources and.
Lucene Boot Camp Grant Ingersoll Lucid Imagination Nov. 4, 2008 New Orleans, LA.
Design a full-text search engine for a website based on Lucene
Information Retrieval
Compiler Construction By: Muhammad Nadeem Edited By: M. Bilal Qureshi.
JAVA BEANS JSP - Standard Tag Library (JSTL) JAVA Enterprise Edition.
A search engine is a web site that collects and organizes content from all over the internet Search engines look through their own databases of.
Lucene Jianguo Lu.
CS798: Information Retrieval Charlie Clarke Information retrieval is concerned with representing, searching, and manipulating.
©2003 Paula Matuszek GOOGLE API l Search requests: submit a query string and a set of parameters to the Google Web APIs service and receive in return a.
1 CS 8803 AIAD (Spring 2008) Project Group#22 Ajay Choudhari, Avik Sinharoy, Min Zhang, Mohit Jain Smart Seek.
Apache Solr Dima Ionut Daniel. Contents What is Apache Solr? Architecture Features Core Solr Concepts Configuration Conclusions Bibliography.
General Architecture of Retrieval Systems 1Adrienn Skrop.
Information Retrieval in Practice
Why indexing? For efficient searching of a document
Designing Cross-Language Information Retrieval System using various Techniques of Query Expansion and Indexing for Improved Performance  Hello everyone,
Search Engine Architecture
Searching and Indexing
Safe by default, optimized for efficiency
Building Search Systems for Digital Library Collections
Search Techniques and Advanced tools for Researchers
Lucene in action Information Retrieval A.A
Lucene/Solr Architecture
Getting Started With Solr
Information Retrieval and Web Design
Introduction to Search Engines
Presentation transcript:

Lucene Boot Camp I Grant Ingersoll Lucid Imagination Nov. 3, 2008 New Orleans, LA

Intro My Background Goals for Tutorial –Understand Lucene core capabilities –Real examples, real code, real data Ask Questions!!!!!

Schedule Day I – Concepts – Indexing – Searching – Analysis – Lucene contrib: highlighter, spell checking, etc. Day II – In-depth Indexing/Searching Performance, Internals – Terms and Term Vectors – Class Project – Q & A

4 Resources Slides at – Lucene Java – – – Luke: –

5 What is Search? Given a user’s information need (query), find documents relevant to the need –Very Subjective! Information Retrieval –Interdisciplinary –Comp. Sci, Math/Statistics, Library Sci., Linguistics, AI…

6 Search Use Cases Web –Google, Y!, etc. Enterprise –Intranet, Content Repositories, , etc. eCommerce/DB/CMS –Online Stores, websites, etc. Other –QA, Federated Yours? Why do you need Search?

7 Your Content And You Only you know your content! –Key Features Title, body, price, margin, etc. –Important Terms –Synonyms/Jargon –Structures (tables, lists, etc.) –Importance –Priorities

Search Basics Many different Models: –Boolean, Probabilistic, Inference, Neural Net, and: Modified Vector Space Model (VSM) –Boolean + VSM –TF-IDF –The words in the document and the query each define a Vector in an n-dimensional space –Sim(q1, d1) = cos Θ q1q1 d1d1 Θ d j = q= w = weight assigned to term

9 Inverted Index From “Taming Text”

10 Lucene Background Created by Doug Cutting in 1999 Donated to ASF in 2001 Morphed into a Top Level Project (TLP) with many sub projects –Java (flagship) a.k.a. “Lucene” –Solr, Nutch, Mahout, Tika, several Lucene ports From here on out, Lucene refers to “Lucene Java”

Lucene is… NOT a crawler –See Nutch NOT an application –See PoweredBy on the Wiki NOT a library for doing Google PageRank or other link analysis algorithms –See Nutch A library for enabling text based search

12 A Few Words about Solr HTTP-based Search Server XML Configuration XML, JSON, Ruby, PHP, Java support Many, many nice features that Lucene users need –Faceting, spell checking, highlighting –Caching, Replication, Distributed

Indexing Process of preparing and adding text to Lucene, which stores it in an inverted index Key Point: Lucene only indexes Strings –What does this mean? Lucene doesn’t care about XML, Word, PDF, etc. –There are many good open source extractors available It’s our job to convert whatever file format we have into something Lucene can use

Indexing Classes Analyzer –Creates tokens using a Tokenizer and filters them through zero or more TokenFilter s IndexWriter –Responsible for converting text into internal Lucene format Directory –Where the Index is stored –RAMDirectory, FSDirectory, others

Indexing Classes Document –A collection of Field s –Can be boosted Field –Free text, keywords, dates, etc. –Defines attributes for storing, indexing –Can be boosted –Field Constructors and parameters Open up Fieldable and Field in IDE

How to Index Create IndexWriter For each input –Create a Document –Add Field s to the Document –Add the Document to the IndexWriter Close the IndexWriter Optimize (optional)

Indexing in a Nutshell For each Document –For each Field to be tokenized Create the tokens using the specified Tokenizer –Tokens consist of a String, position, type and offset information Pass the tokens through the chained TokenFilter s where they can be changed or removed Add the end result to the inverted index Position information can be altered –Useful when removing words or to prevent phrases from matching

Task 1.a From the Boot Camp Files, use the basic.ReutersIndexer skeleton to start Index the small Reuters Collection using the IndexWriter, a Directory and StandardAnalyzer –Boost every 10 documents by 3 Questions to Answer: –What Field s should I define? –What attributes should each Field have? –Pick a field to boost and give a reason why you think it should be boosted ~30 minutes

Use Luke

5 minute Break

Searching Parse user query Lookup matching Documents Score Documents Return ranked list

22 Key Classes: –Searcher –Provides methods for searching –Take a moment to look at the Searcher class declaration IndexSearcher, MultiSearcher, ParallelMultiSearcher –IndexReader –Loads a snapshot of the index into memory for searching –More tomorrow –TopDocs - The search results –QueryParser – –Query –Logical representation of program’s information need

Query Parsing Basic syntax: title:hockey +(body:stanley AND body:cup) OR/AND must be uppercase Default operator is OR (can be changed) Supports fairly advanced syntax, see the website – Doesn’t always play nice, so beware –Many applications construct queries programmatically or restrict syntax

24 How to Search Create/Get an IndexSearcher Create a Query –Use a QueryParser –Construct it programmatically Display the results from the TopDocs –Retrieve Field values from Document More tomorrow on search lifecyle

Task 1.b Using the ReutersIndexerTest.java skeleton in the boot camp files –Search your newly created index using queries you develop Questions: –What is the default field for the QueryParser ? –What Analyzer to use? ~20 minutes

Task 1 Results Scores across queries are NOT comparable –They may not even be comparable for the same query over time (if the index changes) Performance –Caching –Warming –More Tomorrow

Lunch 1-2:30

28 Discussion/Questions So far, we’ve seen the basics of search and indexing Next going to look into Analysis and Contrib modules

Analysis Analysis is the process of creating Token s to be indexed Analysis is usually done to improve results overall, but it comes with a price Lucene comes with many different Analyzer s, Tokenizer s and TokenFilter s, each with their own goals StandardAnalyzer is included with the core JAR and does a good job for most English and Latin-based tasks Often times you want the same content analyzed in different ways Consider a catch-all Field in addition to other Field s

30 Solr’s Analysis tool If you use nothing else from Solr, the Admin analysis tool can really help you understand analysis Download Solr and unpack it cd apache-solr-1.3.0/example java -jar start.jar

31 Analyzers StandardAnalyzer, WhitespaceAnalyzer, SimpleAnalyzer Contrib/analysis –Suite of Analyzers for many common situations Languages n-grams Payloads Contrib/snowball

Tokenization Split words into Token s to be processed Tokenization is fairly straightforward for most languages that use a space for word segmentation –More difficult for some East Asian languages –See the CJK Analyzer

Modifying Tokens TokenFilter s are used to alter the token stream to be indexed Common tasks: –Remove stopwords –Lower case –Stem/Normalize -> Wi-Fi -> Wi Fi –Add Synonyms StandardAnalyzer does things that you may not want

34 Payloads Associate an arbitrary byte array with a term in the index Uses –Part of Speech –Font weight –URL Currently can search using the BoostingTermQuery

35 n-grams Combine units of content together into a single token Character –2-grams for the word “Lucene”: Lu,uc, ce, en, ne –Can make search possible when data is noisy or hard to tokenize Word (“shingles” in Lucene parlance) –Pseudo Phrases

Custom Analyzers Problem: none of the Analyzers cover my problem Solution: write your own Analyzer Better solution: write a configurable Analyzer so you only need one Analyzer that you can easily change for your projects –See Solr

37 Analysis APIs Have a look at the TokenStream and Token API s Token s and TokenStream s may be reused –Helps reduce allocations and speeds up indexing –Not all Analysis can take advantage: caching – Analyzer.reusableTokenStream() – TokenStream.next(Token)

Special Cases Dates and numbers need special treatment to be searchable –o.a.l.document.DateTools –org.apache.solr.util.NumberUtils Altering Position Information –Increase Position Gap between sentences to prevent phrases from crossing sentence boundaries –Index synonyms at the same position so query can match regardless of synonym used

Task 2 Take minutes and write an Analyzer/Tokenizer/TokenFilter and Unit Test –Examine results in Luke –Run some searches Ideas: –Combine existing Tokenizer s and TokenFilter s –Normalize abbreviations –Add payloads –Filter out all words beginning with the letter A –Identify/Mark sentences

40 Discussion What did you implement? What issues do you face with your content? To Stem or not to Stem? Stopwords: good or bad? Tradeoffs of different techniques

Lucene Contributions Many people have generously contributed code to help solve common problems These are in contrib directory of the source Popular: –Analyzers –Highlighter –Queries and MoreLikeThis –Snowball Stemmers –Spellchecker

42 Highlighter Highlight query keywords in context –Often useful for display purposes Important Classes: – Highlighter - Main entry point, coordinates the work – Fragmenter - Splits up document for scoring – Formatter - Marks up the results – Scorer - Scores the fragments SpanScorer - Can score phrases Use term vectors for performance Look at example usage

43 Spell Checking Suggest spelling corrections based on spellings of words in the index –Will/can suggest incorrectly spelled words Uses a distance measure to determine suggestions –Can also factor in document frequency –Distance Measure is pluggable

44 Spell Checking Classes: Spellchecker, StringDistance See ContribExamplesTest Practical aspects: –It’s not as simple as just turning it on –Good results require testing and tuning Pay attention to accuracy settings Mind your Analysis (simple, no stemming) Consider alternate StringDistance ( JaroWinklerDistance )

45 More Like This Given a Document, find other Document s that are similar –Variation on relevance feedback –“Find Similar” Extracts the most important terms from a Document and creates a new query –Many options available for determining important terms Classes: MoreLikeThis –See ContribExamplesTest

46 Summary Indexing Searching Analysis Contrib Questions?

Resources Modern Information Retrieval by Baeza-Yates and Ribeiro-Neto Lucene In Action by Hatcher and Gospodnetić Wiki Mailing Lists Discussions on how to use Lucene Discussions on how to develop Lucene Issue Tracking – We always welcome patches –Ask on the mailing list before reporting a bug

Resources Lucid Imagination –Support –Training –Value Add