Lucene Boot Camp Grant Ingersoll Lucid Imagination Nov. 4, 2008 New Orleans, LA.

2 Schedule
In-depth Indexing/Searching
 – Performance, Internals
 – Filters, Sorting
Terms and Term Vectors
Class Project
Q & A

3 Day I Recap
Indexing
 – IndexWriter
 – Document / Field
 – Analyzer
Searching
 – IndexSearcher
 – IndexReader
 – QueryParser
Analysis
Contrib

4 Indexing In-Depth
Deletions and Updates
Optimize
Important Internals
 – File Formats
 – Segments, Commits, Merging
 – Compound File System
Performance

5 Lucene File Formats and Structures
A Lucene index is made up of one or more Segments
Lucene tracks Documents internally by an int "id"
This id may change across index operations
 – You should not rely on it unless you know your index isn't changing
You can ask for a Document by this id on the IndexReader

6 Segments
Each Segment is an independent index containing:
 – Field Names
 – Stored Field values
 – Term Dictionary, proximity info and normalization factors
 – Term Vectors (optional)
 – Deleted Docs
The Compound File System (CFS) stores all of these logical pieces in a single file

7 How Lucene Indexes
Lucene indexes Documents into memory
 – At certain trigger points, the in-memory segments are committed/flushed to the Directory
 – Can be forced by calling commit()
Segments are periodically merged (more in a moment)
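A minimal sketch of this flow with the 2.x-era API (the RAMDirectory, field name, and text are illustrative, not from the slides):

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.RAMDirectory;

public class IndexingSketch {
  // returns the number of docs visible after the forced commit
  public static int indexOneDoc() throws Exception {
    RAMDirectory dir = new RAMDirectory();   // swap in FSDirectory for on-disk
    IndexWriter writer = new IndexWriter(dir, new StandardAnalyzer(), true);

    Document doc = new Document();
    doc.add(new Field("title", "Lucene Boot Camp",
                      Field.Store.YES, Field.Index.TOKENIZED));
    writer.addDocument(doc);  // buffered in memory until a flush/commit

    writer.commit();          // force buffered docs out to the Directory
    writer.close();

    IndexReader reader = IndexReader.open(dir);
    int numDocs = reader.numDocs();
    reader.close();
    return numDocs;
  }

  public static void main(String[] args) throws Exception {
    System.out.println("docs visible: " + indexOneDoc());
  }
}
```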

8 Segments and Merging
Segments may be created when new documents are added
Segments are merged from time to time based on segment size, in relation to:
 – MergePolicy
 – MergeScheduler
 – Optimization

9 MergePolicy
Identifies Segments to be merged
Two current implementations:
 – LogDocMergePolicy
 – LogByteSizeMergePolicy
mergeFactor – max # of segments allowed before merging

10 MergeScheduler
Responsible for performing the merge
Two implementations:
 – Serial – blocking
 – Concurrent – new, runs merges in the background

11 Optimize
Optimize is the process of merging segments down into a single segment
This process can yield significant speedups in search
Can be slow
Can also do partial optimizes

12 Final Thoughts on Merging
Usually you don't have to think about it, except when to optimize
In high-update, performance-critical environments you may need to dig into it more, as merging can sometimes cause long pauses
Good to optimize when you can; otherwise, keep a low mergeFactor

13 Deletion
A deletion only marks the Document as deleted
 – It doesn't get physically removed until a merge
Deletions can be a bit confusing
 – Both IndexReader and IndexWriter have delete methods
 – By: id, term(s), Query
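A sketch of delete-by-Term with the 2.x-era API; the untokenized "id" field used as the delete handle is an illustrative convention, not something the slides prescribe:

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;
import org.apache.lucene.store.RAMDirectory;

public class DeleteSketch {
  // index one doc, delete it by Term, return the visible doc count
  public static int remainingDocs() throws Exception {
    RAMDirectory dir = new RAMDirectory();
    IndexWriter writer = new IndexWriter(dir, new StandardAnalyzer(), true);

    Document doc = new Document();
    // UN_TOKENIZED keeps the id as a single indexed term we can delete by
    doc.add(new Field("id", "42", Field.Store.YES, Field.Index.UN_TOKENIZED));
    writer.addDocument(doc);
    writer.commit();

    // only marks the doc deleted; it is physically removed on a later merge
    writer.deleteDocuments(new Term("id", "42"));
    writer.commit();
    writer.close();

    IndexReader reader = IndexReader.open(dir);
    int numDocs = reader.numDocs();  // excludes deleted docs
    reader.close();
    return numDocs;
  }

  public static void main(String[] args) throws Exception {
    System.out.println("docs after delete: " + remainingDocs());
  }
}
```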

14 Task
Build your index from yesterday and then try some deletes
 – By id, term, Query
Also try out an optimize on an FSDirectory against the full Reuters sample
15-20 minutes

15 Updates
Updates are always a delete and an add
 – Yes, that is a repeat!
 – It's the nature of the data structures used in search
See IndexWriter.updateDocument()
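A sketch of updateDocument() with the 2.x-era API, which performs the delete-then-add in one call (field names and values are illustrative):

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;
import org.apache.lucene.store.RAMDirectory;

public class UpdateSketch {
  // update = delete the old doc matching the Term, then add the new version
  public static String updatedTitle() throws Exception {
    RAMDirectory dir = new RAMDirectory();
    IndexWriter writer = new IndexWriter(dir, new StandardAnalyzer(), true);

    Document v1 = new Document();
    v1.add(new Field("id", "42", Field.Store.YES, Field.Index.UN_TOKENIZED));
    v1.add(new Field("title", "old title", Field.Store.YES, Field.Index.TOKENIZED));
    writer.addDocument(v1);
    writer.commit();

    Document v2 = new Document();
    v2.add(new Field("id", "42", Field.Store.YES, Field.Index.UN_TOKENIZED));
    v2.add(new Field("title", "new title", Field.Store.YES, Field.Index.TOKENIZED));
    writer.updateDocument(new Term("id", "42"), v2);  // delete + add in one call
    writer.commit();
    writer.close();

    IndexReader reader = IndexReader.open(dir);
    String title = null;
    for (int i = 0; i < reader.maxDoc(); i++) {
      if (!reader.isDeleted(i)) {
        title = reader.document(i).get("title");
      }
    }
    reader.close();
    return title;
  }

  public static void main(String[] args) throws Exception {
    System.out.println(updatedTitle());
  }
}
```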

16 Performance Factors
setRAMBufferSizeMB
 – New model for automagically controlling indexing factors based on the amount of memory in use
 – Obsoletes setMaxBufferedDocs
maxBufferedDocs
 – Minimum # of docs buffered before a flush occurs and a new segment is created
 – Usually, larger == faster, but more RAM

17 More Factors
mergeFactor
 – How often segments are merged
 – Smaller == less RAM, better for incremental updates
 – Larger == faster, better for batch indexing
maxFieldLength
 – Limits the number of terms in a Document
Analysis
Reuse
 – Document, TokenStream, Token

18 Index Threading
IndexWriter and IndexReader are thread-safe and can be shared between threads without external synchronization
One open IndexWriter per Directory
Parallel Indexing
 – Index to separate Directory instances
 – Merge using IndexWriter.addIndexes
 – Could also distribute and collect
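The parallel-indexing idea above can be sketched as follows (2.x-era API; here two Directories are built sequentially for brevity, standing in for per-thread indexes):

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.RAMDirectory;

public class ParallelIndexSketch {
  static Directory buildIndex(String text) throws Exception {
    RAMDirectory dir = new RAMDirectory();
    IndexWriter writer = new IndexWriter(dir, new StandardAnalyzer(), true);
    Document doc = new Document();
    doc.add(new Field("body", text, Field.Store.YES, Field.Index.TOKENIZED));
    writer.addDocument(doc);
    writer.close();
    return dir;
  }

  // each worker indexes to its own Directory; then merge them into one index
  public static int mergedDocCount() throws Exception {
    Directory d1 = buildIndex("first partition");
    Directory d2 = buildIndex("second partition");

    RAMDirectory merged = new RAMDirectory();
    IndexWriter writer = new IndexWriter(merged, new StandardAnalyzer(), true);
    writer.addIndexes(new Directory[] { d1, d2 });  // merge the partitions
    writer.close();

    IndexReader reader = IndexReader.open(merged);
    int n = reader.numDocs();
    reader.close();
    return n;
  }

  public static void main(String[] args) throws Exception {
    System.out.println("merged docs: " + mergedDocCount());
  }
}
```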

19 Benchmarking Indexing
contrib/benchmark
Try out different algorithms between Lucene 2.2 and 2.3
 – contrib/benchmark/conf: indexing.alg, indexing-multithreaded.alg
Info:
 – Mac Pro, 2 x 2GHz Dual-Core Xeon
 – 4 GB RAM
 – ant run-task -Dtask.alg=./conf/indexing.alg -Dtask.mem=1024M

20 Benchmarking Results

               Records/Sec   Avg. Mem
Trunk          2,122         52M
Trunk-mt (4)   3,680         57M

Your results will depend on analysis, etc.

21 Searching
Earlier we touched on the basics of search using the QueryParser
Now look at:
 – Searcher / IndexReader lifecycle
 – Query classes
 – More details on the QueryParser
 – Filters
 – Sorting

22 Lifecycle
Recall that the IndexReader loads a snapshot of the index into memory
 – This means updates made since loading the index will not be seen
Business rules are needed to define how often to reload the index, if at all
 – IndexReader.isCurrent() can help
Loading an index is an expensive operation
 – Do not open a Searcher/IndexReader for every search

23 Reopen
It is possible to have IndexReader reopen new or changed segments
 – Saves some of the cost of loading a new index
Does not close the old reader, so the application must close it
See DeletionsUpdatesTest.testReopen()

24 Query Classes
TermQuery is the basis for all non-span queries
BooleanQuery combines multiple Query instances as clauses
 – should
 – required
PhraseQuery finds terms occurring near each other, position-wise
 – "slop" is the edit distance between two terms
Take 2-3 minutes to explore the Query implementations
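A minimal sketch of building these query types programmatically (2.x-era API; the field name "body" and the terms are illustrative):

```java
import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.PhraseQuery;
import org.apache.lucene.search.TermQuery;

public class QuerySketch {
  public static void main(String[] args) {
    // a required (MUST) clause plus an optional (SHOULD) clause
    BooleanQuery bq = new BooleanQuery();
    bq.add(new TermQuery(new Term("body", "lucene")), BooleanClause.Occur.MUST);
    bq.add(new TermQuery(new Term("body", "search")), BooleanClause.Occur.SHOULD);

    // a phrase query with slop 1: "open source" with one position of leeway
    PhraseQuery pq = new PhraseQuery();
    pq.add(new Term("body", "open"));
    pq.add(new Term("body", "source"));
    pq.setSlop(1);

    System.out.println(bq);  // +body:lucene body:search
    System.out.println(pq);
  }
}
```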

25 Spans
Spans provide information about where matches took place
Not supported by the QueryParser
Can be used in BooleanQuery clauses
Take 2-3 minutes to explore the SpanQuery classes
 – SpanNearQuery is useful for doing phrase matching

26 QueryParser
MultiFieldQueryParser
Boolean operators cause confusion
 – Better to think in terms of required (+ operator) and not allowed (- operator)
Check JIRA for QueryParser issues
Most applications either modify the QP, create their own, or restrict users to a subset of the syntax
Your users may not need all the "flexibility" of the QP

27 Sorting
Lucene's default sort is by score
Searcher has several methods that take in a Sort object
Sorting should be addressed during indexing
 – Sorting is done on Fields containing a single term that can be used for comparison
The SortField class defines the different sort types available
 – AUTO, STRING, INT, FLOAT, CUSTOM, SCORE, DOC

28 Sorting II
Look at Searcher, Sort and SortField
Custom sorting is done with a SortComparatorSource
Sorting can be very expensive
 – Terms are cached in the FieldCache
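A minimal sorting sketch (2.x-era API): each document gets a single untokenized "date" term, as the sorting slides require, and results come back ordered by that field instead of by score. The field names and dates are illustrative:

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Sort;
import org.apache.lucene.search.SortField;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.store.RAMDirectory;

public class SortSketch {
  public static String firstDate() throws Exception {
    RAMDirectory dir = new RAMDirectory();
    IndexWriter writer = new IndexWriter(dir, new StandardAnalyzer(), true);
    String[] dates = { "2008-11-04", "2008-11-03" };
    for (int i = 0; i < dates.length; i++) {
      Document doc = new Document();
      // a single untokenized term per doc, usable for comparison
      doc.add(new Field("date", dates[i], Field.Store.YES, Field.Index.UN_TOKENIZED));
      doc.add(new Field("type", "doc", Field.Store.NO, Field.Index.UN_TOKENIZED));
      writer.addDocument(doc);
    }
    writer.close();

    IndexSearcher searcher = new IndexSearcher(dir);
    Sort sort = new Sort(new SortField("date", SortField.STRING));  // ascending
    Hits hits = searcher.search(new TermQuery(new Term("type", "doc")), sort);
    String first = hits.doc(0).get("date");
    searcher.close();
    return first;
  }

  public static void main(String[] args) throws Exception {
    System.out.println("earliest: " + firstDate());
  }
}
```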

29 Filters
Filters restrict the search space to a subset of Documents
Use Cases:
 – Search within a search
 – Restrict by date
 – Rating
 – Security
 – Author

30 Filter Classes
QueryWrapperFilter (QueryFilter)
 – Restricts to the subset of Documents that match a Query
RangeFilter
 – Restricts to Documents that fall within a range
 – A better alternative to RangeQuery
CachingWrapperFilter
 – Wraps another Filter and provides caching
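A sketch of QueryWrapperFilter in action (2.x-era API): the search runs only over documents matching a category query. The "category" and "body" fields are illustrative:

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.Filter;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.QueryWrapperFilter;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.store.RAMDirectory;

public class FilterSketch {
  public static int filteredHits() throws Exception {
    RAMDirectory dir = new RAMDirectory();
    IndexWriter writer = new IndexWriter(dir, new StandardAnalyzer(), true);
    String[] categories = { "news", "blog" };
    for (int i = 0; i < categories.length; i++) {
      Document doc = new Document();
      doc.add(new Field("body", "lucene filters", Field.Store.NO, Field.Index.TOKENIZED));
      doc.add(new Field("category", categories[i], Field.Store.YES, Field.Index.UN_TOKENIZED));
      writer.addDocument(doc);
    }
    writer.close();

    IndexSearcher searcher = new IndexSearcher(dir);
    // restrict the "lucene" search to the subset matching category:news
    Filter filter = new QueryWrapperFilter(new TermQuery(new Term("category", "news")));
    Hits hits = searcher.search(new TermQuery(new Term("body", "lucene")), filter);
    int n = hits.length();
    searcher.close();
    return n;
  }

  public static void main(String[] args) throws Exception {
    System.out.println("filtered hits: " + filteredHits());
  }
}
```

Wrapping the filter in a CachingWrapperFilter would let repeated searches reuse the computed bit set.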

31 Task
Modify your program to sort by a field and to filter by a query or some other criteria
 – ~15 minutes

32 Searchers
MultiSearcher
 – Search over multiple Searchables, including remote ones
MultiReader
 – Not a Searcher, but can be used with IndexSearcher to achieve the same results for local indexes
ParallelMultiSearcher
 – Like MultiSearcher, but threaded
RemoteSearchable
 – RMI-based remote searching
Look at MultiSearcherTest in the example code

33 Expert Results
Searcher has several "expert" methods
HitCollector allows low-level access to all Documents as they are scored

34 Search Performance
Search speed is based on a number of factors:
 – Query type(s)
 – Query size
 – Analysis
 – Occurrences of query terms
 – Optimize
 – Index size
 – Index type (RAMDirectory, other)
 – Usual suspects: CPU, memory, I/O
 – Business needs

35 Query Types
Be careful with WildcardQuery, as it rewrites to a BooleanQuery containing all the terms that match the wildcards
Avoid starting a WildcardQuery with a wildcard
Use ConstantScoreRangeQuery instead of RangeQuery
Be careful with range queries and dates
 – The user mailing list and the Wiki have useful tips for optimizing date handling

36 Query Size
Stopword removal
Search an "all" field instead of many fields with the same terms
Disambiguation
 – May be useful when doing synonym expansion
 – Difficult to automate and may be slower
 – Some applications may allow the user to disambiguate
Relevance Feedback / More Like This
 – Use the most important words
 – "Important" can be defined in a number of ways

37 Usual Suspects
CPU
 – Profile your application
Memory
 – Examine your heap size and garbage collection approach
I/O
 – Cache your Searcher
 – Define business logic for refreshing based on indexing needs
 – Warm your Searcher before going live -- see Solr
Business Needs
 – Do you really need to support wildcards?
 – What about date range queries down to the millisecond?

38 FieldSelector
Prior to version 2.1, Lucene always loaded all Fields in a Document
The FieldSelector API addition allows Lucene to skip large Fields
 – Options: Load, Lazy Load, No Load, Load and Break, Load for Merge, Size, Size and Break
Makes storage of original content more viable, without the large cost of loading it when not used
See FieldSelectorTest in the example code

39 Relevance
At some point along your journey, you will get results that you think are "bad"
Is it a big deal?
 – Content, Content, Content!
 – Relevance Judgments
 – Don't break other queries just to "fix" one
Hardcode it!
 – A query doesn't always have to result in a "search"

40 Scoring and Similarity
Lucene has a sophisticated scoring mechanism designed to meet most needs
Has hooks for modifying scores
Scoring is handled by the Query, Weight and Scorer classes

41 Explanations
The explain(Query, int) method is useful for understanding why a Document scored the way it did
Shows all the pieces that went into scoring the result:
 – tf, df, boosts, etc.
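A sketch of calling explain() on a hit (2.x-era API; the indexed text and field name are illustrative):

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.Explanation;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.store.RAMDirectory;

public class ExplainSketch {
  public static String explainFirstHit() throws Exception {
    RAMDirectory dir = new RAMDirectory();
    IndexWriter writer = new IndexWriter(dir, new StandardAnalyzer(), true);
    Document doc = new Document();
    doc.add(new Field("body", "lucene scoring explained",
                      Field.Store.NO, Field.Index.TOKENIZED));
    writer.addDocument(doc);
    writer.close();

    IndexSearcher searcher = new IndexSearcher(dir);
    Query query = new TermQuery(new Term("body", "lucene"));
    Hits hits = searcher.search(query);
    // explain the score of the first hit: tf, idf, boosts, norms, etc.
    Explanation exp = searcher.explain(query, hits.id(0));
    String text = exp.toString();
    searcher.close();
    return text;
  }

  public static void main(String[] args) throws Exception {
    System.out.println(explainFirstHit());
  }
}
```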

42 Tuning Relevance
FunctionQuery from Solr (a variation exists in Lucene)
Override Similarity
Implement your own Query and related classes
Payloads
Boosts

43 Task
Open Luke and try some queries, then use the "explain" button
Or, write some code to do explains on a query and some documents
See how Query type, boosting, and other factors play a role in the score

44 Terms and Term Vectors
Sometimes you need access to the term dictionary:
 – Auto-suggest
 – Frequency information
Sometimes you need a Document-centric view of terms, frequencies, positions and offsets
 – Term Vectors

45 Term Information
TermEnum gives access to terms and how many Documents they occur in
 – IndexReader.terms()
TermDocs gives access to the frequency of a term in a Document
 – IndexReader.termDocs()
TermPositions extends TermDocs and provides access to position and payload info
 – IndexReader.termPositions()
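A sketch of walking the term dictionary with TermEnum (2.x-era API; the indexed text is illustrative):

```java
import java.util.ArrayList;
import java.util.List;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.TermEnum;
import org.apache.lucene.store.RAMDirectory;

public class TermDictionarySketch {
  // walk the term dictionary, collecting "field:text (docFreq)" entries
  public static List listTerms() throws Exception {
    RAMDirectory dir = new RAMDirectory();
    IndexWriter writer = new IndexWriter(dir, new StandardAnalyzer(), true);
    Document doc = new Document();
    doc.add(new Field("body", "hello world hello",
                      Field.Store.NO, Field.Index.TOKENIZED));
    writer.addDocument(doc);
    writer.close();

    IndexReader reader = IndexReader.open(dir);
    List terms = new ArrayList();
    TermEnum te = reader.terms();
    while (te.next()) {
      terms.add(te.term().field() + ":" + te.term().text()
                + " (" + te.docFreq() + ")");
    }
    te.close();
    reader.close();
    return terms;
  }

  public static void main(String[] args) throws Exception {
    System.out.println(listTerms());
  }
}
```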

46 Term Vectors
Term Vectors give access to term frequency information in a given Document
 – IndexReader.getTermFreqVector()
TermVectorMapper provides callbacks for working with Term Vectors

47 TermsTest
Provides samples of working with terms and term vectors

48 Lunch: 1-2:30

49 Recap
Indexing
Searching
Performance
Odds and Ends
 – Explains
 – FieldSelector
 – Relevance
 – Terms and Term Vectors

50 Class Project
Your chance to really dig in and get your hands dirty
Ask questions
Options…

51 Option I
Start building out your Lucene application!
 – Index your data (or any data): threading, updates/deletions, analysis
 – Search: caching/warming, dealing with updates, multi-threading
 – Display

52 Option II
Dig deeper into an area of interest
 – Performance: How fast can you index? Search? Queries per second?
 – Analysis
 – Query Parsing
 – Scoring
 – Contrib

53 Option III
Dig into JIRA issues and find something to fix in Lucene
 – http://wiki.apache.org/lucene-java/HowToContribute

54 Option IV Try out Solr

55 Option V
Other?
 – Architecture review/discussion
 – Use case discussion

56 Project Post-Mortem
Volunteers to share?

57 Open Discussion
Multilingual Best Practices
 – Unicode
 – One index versus many
Advanced Analysis
Distributed Lucene
Crawling
Hadoop
Nutch
Solr

58 Resources
Lucid Imagination
 – Support
 – Training
 – Value Add

59 Finally…
Please take the time to fill out a survey to help me improve this training
 – Located in the base directory of the source
 – E-mail it to me
There are several Lucene-related talks on Wednesday