Lucene Near Realtime Search Jason Rutherglen & Jake Mannix LinkedIn 6/3/2009 SOLR/Lucene Users Group San Francisco.


1 Lucene Near Realtime Search Jason Rutherglen & Jake Mannix LinkedIn 6/3/2009 SOLR/Lucene Users Group San Francisco

2 What is NRT?
- Search on documents nearly as fast as they are indexed
- Delete documents in a way that is immediate and IO efficient
- Good for things like Twitter and other apps that require realtime searching (Social 2.0)

3 Today?
- Users expect to search their data immediately after updating it (Web/Social 2.0 apps)
- Search engines are designed to perform efficient batch indexing (not realtime)
- Batch indexing is slow and updates take a while to be searchable

4 NRT in Lucene
- Uses core Lucene code to make existing batch indexing nearly realtime
- Required retrofitting of some of the core implementation
- Details are hidden
- Hopefully really easy for developers to use

5 Lucene NRT Patches
- LUCENE-1314 – IndexReader.clone
- LUCENE-1516 – IndexWriter.getReader
- LUCENE-1313 – RAMDir in IndexWriter
- LUCENE-1483 – Fast FieldCache loading
- LUCENE-1231 – Column stride fields
- LUCENE-1526 – Incremental copy-on-write

6 LUCENE-1314
- IndexReader.clone is like reopen
- However, it performs a copy-on-write of norms and deletes
- Used by LUCENE-1516 to keep deletes in RAM (rather than flush them to disk)
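The copy-on-write behavior described for IndexReader.clone can be illustrated with a minimal sketch. This is not Lucene's implementation: the class CowReaderSketch and its methods are hypothetical, and a plain java.util.BitSet stands in for the deleted-docs structure. The idea is only that a clone shares the deletes until one side writes.

```java
import java.util.BitSet;

/** Hypothetical stand-in for a segment reader; not Lucene's actual class. */
final class CowReaderSketch {
    private BitSet deleted;       // may be shared with clones until a write happens
    private boolean ownsDeleted;  // true once this instance holds a private copy

    CowReaderSketch(int maxDoc) {
        this.deleted = new BitSet(maxDoc);
        this.ownsDeleted = true;
    }

    private CowReaderSketch(BitSet shared) {
        this.deleted = shared;
        this.ownsDeleted = false;
    }

    /** Cheap clone: shares the bitset, nothing is copied yet. */
    CowReaderSketch cloneReader() {
        // After cloning, neither side may mutate the shared bitset in place.
        this.ownsDeleted = false;
        return new CowReaderSketch(deleted);
    }

    /** Marks a document deleted, copying the bitset on the first write after a clone. */
    void deleteDocument(int docId) {
        if (!ownsDeleted) {
            deleted = (BitSet) deleted.clone();
            ownsDeleted = true;
        }
        deleted.set(docId);
    }

    boolean isDeleted(int docId) {
        return deleted.get(docId);
    }
}
```

A simplification to note: this sketch tracks ownership with a boolean, so it can copy once more than strictly necessary; a reference count on the shared structure would avoid that.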

7 LUCENE-1516
- Adds ability to obtain an IndexReader from IndexWriter
- Efficient in-RAM deletes
- Call IndexWriter.getReader instead of IndexReader.reopen
- All updating, deletes, reopening, and flushing details hidden from user
- Will be in Lucene 2.9

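8 Sample IW.getReader Code
    IndexWriter writer; // assumed to refer to an already-open IndexWriter
    Document doc = new Document();
    writer.addDocument(doc);
    // Obtain a near-realtime reader straight from the writer
    IndexReader reader = writer.getReader();
    Document sameDoc = reader.document(0);
    // Illustrative only: the document just added is visible through the reader
    assert doc.equals(sameDoc);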

9 LUCENE-1313 Near Realtime Search
- Makes IW.getReader faster
- New segments are flushed to an IndexWriter-internal RAMDirectory
- Could increase overall indexing performance because there's no pause while the RAM buffer is being written to disk
- Will be in Lucene 2.9?

10 LUCENE-1483
- Searches on field caches at the segment level
- Means faster field cache loading and more efficient memory usage
- Good for realtime because field cache loading is less of a bottleneck, less RAM usage
- Will be in Lucene 2.9
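Why segment-level field caches help realtime search can be shown with a small sketch. It assumes a cache keyed by segment name, so a reopened reader only pays for segments it has not seen before. SegmentFieldCacheSketch, the loader function, and the segment names are all hypothetical and not part of Lucene's FieldCache API.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

/** Hypothetical per-segment cache of field values; not Lucene's FieldCache. */
final class SegmentFieldCacheSketch {
    private final Map<String, int[]> valuesBySegment = new HashMap<>();
    private int loads = 0; // how many segments were actually loaded

    /** Returns cached values for a segment, loading them only on first use. */
    int[] get(String segmentName, Function<String, int[]> loader) {
        int[] values = valuesBySegment.get(segmentName);
        if (values == null) {
            values = loader.apply(segmentName);
            valuesBySegment.put(segmentName, values);
            loads++;
        }
        return values;
    }

    /** Warms every segment visible to a (re)opened reader. */
    void warm(List<String> segmentNames, Function<String, int[]> loader) {
        for (String name : segmentNames) {
            get(name, loader);
        }
    }

    int loadCount() {
        return loads;
    }
}
```

With a cache keyed on the whole index, adding one small segment would force a full reload; keyed per segment, only the new segment is loaded.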

11 LUCENE-1526 Optimize copy-on-write
- When we're doing IndexReader.clone, we may be creating a huge new array for a small number of deletes or norms updates
- So we need to do incremental copy-on-write of things like deletes, norms, and field caches (?)
- Lucene 3.0?
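One way to picture incremental copy-on-write is a bitset split into fixed-size pages, where a snapshot shares every page and a write copies only the page it touches. The sketch below is an assumption about how this could look, not the approach taken in the LUCENE-1526 patch; PagedCowBitsSketch and the page size are made up for illustration.

```java
import java.util.Arrays;

/** Hypothetical paged bitset with per-page copy-on-write. */
final class PagedCowBitsSketch {
    private static final int PAGE_BITS = 1024;        // assumed page size in bits
    private static final int WORDS = PAGE_BITS / 64;  // longs per page

    private final long[][] pages;   // pages may be shared with snapshots
    private final boolean[] owned;  // owned[p] is true if page p is private to this instance
    private int copiedPages = 0;

    PagedCowBitsSketch(int numBits) {
        int pageCount = (numBits + PAGE_BITS - 1) / PAGE_BITS;
        pages = new long[pageCount][WORDS];
        owned = new boolean[pageCount];
        Arrays.fill(owned, true);
    }

    private PagedCowBitsSketch(long[][] sharedPages) {
        pages = sharedPages.clone();        // copies page references only, not page contents
        owned = new boolean[pages.length];  // all false: every page is shared
    }

    /** Cheap snapshot: both sides share all pages until one of them writes. */
    PagedCowBitsSketch snapshot() {
        Arrays.fill(owned, false);
        return new PagedCowBitsSketch(pages);
    }

    void set(int bit) {
        int p = bit / PAGE_BITS;
        if (!owned[p]) {
            pages[p] = pages[p].clone(); // copy just this page
            owned[p] = true;
            copiedPages++;
        }
        int offset = bit % PAGE_BITS;
        pages[p][offset >>> 6] |= 1L << (offset & 63);
    }

    boolean get(int bit) {
        int offset = bit % PAGE_BITS;
        return (pages[bit / PAGE_BITS][offset >>> 6] & (1L << (offset & 63))) != 0;
    }

    int copiedPages() {
        return copiedPages;
    }
}
```

For a single delete after a snapshot, the cost is one page copy rather than a copy of the whole array, which is the saving the slide is after.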

12 LUCENE-1231
- Column stride fields will make field cache loading faster because data will be loaded sequentially from disk
- Today there are potentially two hard drive seeks per field cache value (TermEnum.next, TermDocs.next)
- Lucene 3.0?

13 Future of Lucene NRT
- LUCENE-1292 – Realtime parallel untokenized field index (for tags)
- Pulsing - Store smaller postings directly in the term dictionary (to avoid seeks) for faster field cache loading
- Replication
- More benchmarks

14 LinkedIn Open Source Projects
- Bobo – Facet library that counts using custom field caches: http://code.google.com/p/bobo-browse/
- Zoie – Realtime search on top of Lucene: http://code.google.com/p/zoie/
- Voldemort – Distributed key-value storage: http://project-voldemort.com/

15 BoboBrowse: facet features
- MultiSelect
- Runtime-defined facets (query-based, etc.)
- Fast (custom field-cache based)
- Custom facet types:
  – Hierarchical (/a/b/c)
  – Range
  – Multivalued
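To make the hierarchical facet type concrete, here is a minimal sketch of counting over a field-cache-like array. It is not Bobo's API: FacetCountSketch and countHierarchical are hypothetical names, and the sketch assumes one path value per document held in an array indexed by doc id. A hit on /a/b/c is counted under /a, /a/b, and /a/b/c.

```java
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical hierarchical facet counter over a per-document value array. */
final class FacetCountSketch {
    /**
     * @param pathByDoc field-cache-like array: pathByDoc[doc] is that document's path, or null
     * @param hits      doc ids matched by the query
     * @return count per path prefix, e.g. "/a", "/a/b", "/a/b/c"
     */
    static Map<String, Integer> countHierarchical(String[] pathByDoc, int[] hits) {
        Map<String, Integer> counts = new TreeMap<>();
        for (int doc : hits) {
            String path = pathByDoc[doc];
            if (path == null) {
                continue; // document has no value for this facet
            }
            // Count every ancestor prefix, then the full path itself.
            int idx = path.indexOf('/', 1);
            while (idx != -1) {
                counts.merge(path.substring(0, idx), 1, Integer::sum);
                idx = path.indexOf('/', idx + 1);
            }
            counts.merge(path, 1, Integer::sum);
        }
        return counts;
    }
}
```

Because the values sit in an in-memory array indexed by doc id, counting is a single pass over the hits with no disk access, which is the point of the field-cache-based design the slide mentions.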

16 Zoie: realtime features
- No modifications to core Lucene
- Multiple read/write: RAMDir + FSDir
- IndexReader on (small) RAMDir opened per request: instantly realtime
- IndexReaderDecorator for custom Reader
- Transparent indexing: implement StreamDataProvider then inject
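The RAMDir + FSDir arrangement can be sketched as a two-tier index: new documents land in a small in-memory tier that is searchable at once, and are periodically moved in batch to the larger on-disk tier. This is only an illustration of the idea, not Zoie's code; TwoTierIndexSketch is hypothetical, plain lists stand in for both directories, and substring matching stands in for a real query.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical two-tier index: a small in-memory tier plus a large batch tier. */
final class TwoTierIndexSketch {
    private final List<String> diskTier = new ArrayList<>(); // stands in for the FSDir index
    private List<String> ramTier = new ArrayList<>();        // stands in for the small RAMDir index

    /** New documents become searchable immediately via the in-memory tier. */
    synchronized void add(String doc) {
        ramTier.add(doc);
    }

    /** Each search consults both tiers, so fresh documents are always visible. */
    synchronized List<String> search(String term) {
        List<String> hits = new ArrayList<>();
        for (String doc : diskTier) {
            if (doc.contains(term)) hits.add(doc);
        }
        for (String doc : ramTier) {
            if (doc.contains(term)) hits.add(doc);
        }
        return hits;
    }

    /** Batch step: move everything from the in-memory tier to the disk tier. */
    synchronized void flush() {
        diskTier.addAll(ramTier);
        ramTier = new ArrayList<>();
    }

    synchronized int ramSize() {
        return ramTier.size();
    }
}
```

Keeping the in-memory tier small is what makes opening a reader on it per request cheap, while the disk tier keeps the efficiency of batch indexing.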

17 Next Steps
- Help work on the patches? https://issues.apache.org/jira/browse/LUCENE
- LinkedIn is hiring
- Contact: jason.rutherglen@gmail.com or jake.mannix@gmail.com

