Advanced Indexing Techniques with Apache Lucene - Payloads, by Michael Busch


1 Advanced Indexing Techniques with Apache Lucene - Payloads. Michael Busch (buschmi@apache.org)

2 Agenda
Part 1: Inverted Index 101
– Posting Lists
– Stored Fields vs. Payloads
Part 2: Use cases for Payloads
– BoostingTermQuery
– Simple facet counting

3 Lucene’s data structures
Diagram: a search runs against the Inverted Index and yields Results; the stored fields of those results are then retrieved from the Store to produce the Hits.

4 c:\docs\shakespeare.txt: To be or not to be.
c:\docs\einstein.txt: The important thing is not to stop questioning.
Query: not
String comparison is slow! Solution: inverted index

5 Inverted index
c:\docs\shakespeare.txt (document 0): To be or not to be.
c:\docs\einstein.txt (document 1): The important thing is not to stop questioning.
Query: not
Term: document IDs
be: 0
important: 1
is: 1
not: 0, 1
or: 0
questioning: 1
stop: 1
to: 0, 1
the: 1
thing: 1
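The term lookup on this slide can be sketched in a few lines of plain Java. This is a toy in-memory index, not Lucene's implementation: the class name, the method names and the tokenizer (lower-casing and splitting on non-letters) are assumptions made for illustration only.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

// Illustrative toy index; class and method names are hypothetical, not Lucene API.
final class InvertedIndexSketch {
    // term -> ascending IDs of the documents that contain the term
    private final Map<String, List<Integer>> postings = new TreeMap<>();
    private int nextDocId = 0;

    // Adds a document and returns the ID assigned to it (0, 1, 2, ...).
    int add(String text) {
        int docId = nextDocId++;
        // Assumed tokenizer: lower-case, then split on anything that is not a letter.
        for (String term : text.toLowerCase(Locale.ROOT).split("[^a-z]+")) {
            if (term.isEmpty()) continue;
            List<Integer> docs = postings.computeIfAbsent(term, k -> new ArrayList<>());
            // IDs only grow, so a repeated term is detected by checking the last entry.
            if (docs.isEmpty() || docs.get(docs.size() - 1) != docId) docs.add(docId);
        }
        return docId;
    }

    // One map lookup instead of a string comparison against every document.
    List<Integer> search(String term) {
        return postings.getOrDefault(term.toLowerCase(Locale.ROOT), Collections.emptyList());
    }

    public static void main(String[] args) {
        InvertedIndexSketch index = new InvertedIndexSketch();
        index.add("To be or not to be.");
        index.add("The important thing is not to stop questioning.");
        System.out.println(index.search("not")); // prints [0, 1]
        System.out.println(index.search("be"));  // prints [0]
    }
}
```

The document IDs here are simply assigned in insertion order, which matches the slide's numbering of the two example files.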

6 Inverted index
c:\docs\shakespeare.txt: To be or not to be.
c:\docs\einstein.txt: The important thing is not to stop questioning.
Query: "not to"
Figure: the inverted index with its document IDs; the tokens of each document are numbered by position (0 to 7).

7 Inverted index
c:\docs\shakespeare.txt (document 0): To be or not to be.
c:\docs\einstein.txt (document 1): The important thing is not to stop questioning.
Query: "not to"
Term: document ID (positions)
be: 0 (1, 5)
important: 1 (1)
is: 1 (3)
not: 0 (3); 1 (4)
or: 0 (2)
questioning: 1 (7)
stop: 1 (6)
to: 0 (0, 4); 1 (5)
the: 1 (0)
thing: 1 (2)
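A phrase query such as "not to" can be answered once positions are stored with each posting. The sketch below is a simplified illustration in plain Java, not Lucene's phrase-matching code. It assumes 0-based token positions, the same hypothetical tokenizer (lower-casing, splitting on non-letters), and a two-term phrase given in lower case.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.SortedMap;
import java.util.TreeMap;

// Illustrative toy index; names are hypothetical, not Lucene API.
final class PositionalIndexSketch {
    // term -> (document ID -> ascending token positions of the term in that document)
    private final Map<String, SortedMap<Integer, List<Integer>>> postings = new HashMap<>();

    void add(int docId, String text) {
        int position = 0;
        for (String token : text.toLowerCase(Locale.ROOT).split("[^a-z]+")) {
            if (token.isEmpty()) continue;
            postings.computeIfAbsent(token, k -> new TreeMap<>())
                    .computeIfAbsent(docId, k -> new ArrayList<>())
                    .add(position++);
        }
    }

    // Returns the IDs of documents in which `first` is immediately followed by `second`.
    List<Integer> phrase(String first, String second) {
        List<Integer> hits = new ArrayList<>();
        SortedMap<Integer, List<Integer>> a = postings.getOrDefault(first, new TreeMap<>());
        SortedMap<Integer, List<Integer>> b = postings.getOrDefault(second, new TreeMap<>());
        for (Map.Entry<Integer, List<Integer>> entry : a.entrySet()) {
            List<Integer> secondPositions = b.get(entry.getKey());
            if (secondPositions == null) continue; // document lacks the second term
            for (int p : entry.getValue()) {
                if (secondPositions.contains(p + 1)) { // adjacent positions form the phrase
                    hits.add(entry.getKey());
                    break;
                }
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        PositionalIndexSketch index = new PositionalIndexSketch();
        index.add(0, "To be or not to be.");
        index.add(1, "The important thing is not to stop questioning.");
        System.out.println(index.phrase("not", "to")); // prints [0, 1]
        System.out.println(index.phrase("to", "be"));  // prints [0]
    }
}
```

A production index would walk both position lists in step rather than call `contains`, but the adjacency test is the same idea.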

8 Inverted index with Payloads
c:\docs\shakespeare.txt: To be or not to be.
c:\docs\einstein.txt: The important thing is not to stop questioning.
Figure: the posting lists now hold three kinds of data per term: document IDs, positions, and a payload attached to each position.

9 So far…
– String comparison is slow
– An inverted index is used to accelerate search
– Positions are stored in posting lists to allow phrase searches
– Payloads are stored in posting lists to attach arbitrary data to each position
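To make the last point concrete, here is a minimal sketch of a posting list whose entries carry a payload next to the document ID and position. It is plain Java with hypothetical names and says nothing about how Lucene encodes payloads on disk; the byte value 5 is only an example.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative only; names are hypothetical, not Lucene API.
final class PayloadPostingSketch {
    // One occurrence of a term: where it is, plus opaque bytes attached at indexing time.
    static final class Occurrence {
        final int docId;
        final int position;
        final byte[] payload;

        Occurrence(int docId, int position, byte[] payload) {
            this.docId = docId;
            this.position = position;
            this.payload = payload;
        }
    }

    private final Map<String, List<Occurrence>> postings = new HashMap<>();

    void add(String term, int docId, int position, byte[] payload) {
        postings.computeIfAbsent(term, k -> new ArrayList<>())
                .add(new Occurrence(docId, position, payload));
    }

    List<Occurrence> occurrences(String term) {
        return postings.getOrDefault(term, Collections.emptyList());
    }

    public static void main(String[] args) {
        PayloadPostingSketch index = new PayloadPostingSketch();
        index.add("not", 0, 3, new byte[] {5}); // e.g. a one-byte boost for this occurrence
        index.add("not", 1, 4, new byte[0]);    // an occurrence without payload data
        for (Occurrence o : index.occurrences("not")) {
            System.out.println("doc " + o.docId + ", position " + o.position
                    + ", payload bytes: " + o.payload.length);
        }
    }
}
```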

10 Lucene’s data structures
Diagram: a search runs against the Inverted Index and yields Results; the stored fields of those results are then retrieved from the Store to produce the Hits.

11 Store
Field 1: title
Field 2: content
Field 3: hashvalue
Documents:
D0: F1 F2 F3
D1: F1 F2 F3
D2: F1 F2 F3

12 Store
D0: F1 F2 F3
D1: F1 F2 F3
D2: F1 F2 F3
– Optimized for random access
– Document-locality

13 Posting list with Payloads
Store: D0: F1 F2 F3; D1: F1 F2 F3; D2: F1 F2 F3
Posting list for F3: document IDs (D0, D1, …), positions (0, 0, 0), payloads (X, X, X)
– Optimized for scanning and skipping
– Value-locality
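The difference between document-locality and value-locality can be pictured with two in-memory layouts. This is a rough sketch with made-up field values; the class, constants and method names are hypothetical and only mirror the layout idea of the two slides, not Lucene's file formats.

```java
import java.util.HashSet;
import java.util.Set;

// Illustrative only; all names and values are hypothetical.
final class LocalitySketch {
    // Store layout: one row per document, holding title, content and hashvalue together.
    static final String[][] STORE = {
        {"title 0", "content 0", "a1"},
        {"title 1", "content 1", "b2"},
        {"title 2", "content 2", "a1"},
    };

    // Posting-list layout: the hashvalue of every document, side by side in document order.
    static final String[] HASHVALUE_PAYLOADS = {"a1", "b2", "a1"};

    // Random access: everything about one hit comes from a single row.
    static String[] fetchDocument(int docId) {
        return STORE[docId];
    }

    // Scanning: one field is read for all documents without touching titles or contents.
    static int distinctHashvalues() {
        Set<String> seen = new HashSet<>();
        for (String value : HASHVALUE_PAYLOADS) {
            seen.add(value);
        }
        return seen.size();
    }

    public static void main(String[] args) {
        System.out.println(fetchDocument(1)[0]);  // prints title 1
        System.out.println(distinctHashvalues()); // prints 2
    }
}
```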

14 Agenda
Part 1: Inverted Index 101
– Posting Lists
– Stored Fields vs. Payloads
Part 2: Use cases for Payloads
– BoostingTermQuery
– Simple facet counting

15 Payloads - API
org.apache.lucene.analysis.Token
    void setPayload(Payload payload)
org.apache.lucene.index.TermPositions
    int getPayloadLength();
    byte[] getPayload(byte[] data, int offset)

16 Example: BoostingTermQuery
Analyzer:
    final byte BoldBoost = 5;
    …
    Token token = new Token(…);
    …
    if (isBold) {
        // attach a one-byte payload carrying the boost to bold tokens
        token.setPayload(new Payload(new byte[] {BoldBoost}));
    }
    …
    return token;

17 Example: BoostingTermQuery
Similarity:
    Similarity boostingSimilarity = new DefaultSimilarity() {
        // @Override
        public float scorePayload(byte[] payload, int offset, int length) {
            if (length == 1) return payload[offset];
            return 1; // no one-byte payload: leave the score unchanged
        }
    };

18 Example: BoostingTermQuery
BoostingTermQuery:
    Query btq = new BoostingTermQuery(new Term("field", "searchterm"));
Searching:
    Searcher searcher = new IndexSearcher(…);
    searcher.setSimilarity(boostingSimilarity);
    …
    Hits hits = searcher.search(btq);
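How a payload can end up influencing a score is sketched below in plain Java, without any Lucene dependency. The `scorePayload` method mirrors the one-byte rule from the Similarity slide. The `score` method rests on an assumption that is not stated in the slides: that the payload scores of a term's occurrences in a document are averaged and multiplied into the base score. Both names are used here for illustration only.

```java
// Illustrative only; not Lucene code. The combination rule in score() is an assumption.
final class PayloadBoostSketch {
    // A one-byte payload is read as the boost factor; anything else leaves the score alone.
    static float scorePayload(byte[] payload, int offset, int length) {
        if (payload != null && length == 1) return payload[offset];
        return 1.0f;
    }

    // Assumed rule: base score times the average payload score over all occurrences.
    static float score(float baseScore, byte[][] payloadsOfOccurrences) {
        if (payloadsOfOccurrences.length == 0) return baseScore;
        float sum = 0.0f;
        for (byte[] payload : payloadsOfOccurrences) {
            sum += scorePayload(payload, 0, payload.length);
        }
        return baseScore * (sum / payloadsOfOccurrences.length);
    }

    public static void main(String[] args) {
        byte boldBoost = 5;
        // Two bold occurrences of the search term in one document.
        System.out.println(score(2.0f, new byte[][] {{boldBoost}, {boldBoost}})); // prints 10.0
        // One bold occurrence and one plain occurrence (empty payload).
        System.out.println(score(1.0f, new byte[][] {{boldBoost}, {}}));          // prints 3.0
    }
}
```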

19 Example: Simple facet counting
Analyzer:
    public TokenStream tokenStream(String fieldName, Reader reader) {
        if (fieldName.equals("_facet")) {
            return new TokenStream() {
                boolean done = false;
                public Token next() {
                    if (done) return null; // emit exactly one token for this field
                    Token token = new Token(…);
                    // the payload carries a hash of the document's URL
                    token.setPayload(new Payload(computeHash(url)));
                    done = true;
                    return token;
                }
            };
        }
        …
    }

20 Example: Simple facet counting
HitCollector:
– Use different PriorityQueues for different sites
– Instead of returning the top-n results of the whole data set, return the top-n results per site
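The per-site collection described on this slide might look roughly like the following in plain Java. This is a sketch, not a Lucene HitCollector: the class, its `collect` and `top` methods, and the use of an int site hash as the key are all assumptions for illustration. In the slides' scenario the site hash would be read from the payload of the `_facet` token.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.PriorityQueue;

// Illustrative only; names are hypothetical, not Lucene API.
final class PerSiteTopN {
    static final class Hit {
        final int docId;
        final float score;

        Hit(int docId, float score) {
            this.docId = docId;
            this.score = score;
        }
    }

    private final int n;
    // site hash -> min-heap holding the best n hits seen so far for that site
    private final Map<Integer, PriorityQueue<Hit>> queues = new HashMap<>();

    PerSiteTopN(int n) {
        this.n = n;
    }

    // Called once per matching document.
    void collect(int docId, float score, int siteHash) {
        PriorityQueue<Hit> queue = queues.computeIfAbsent(siteHash,
                k -> new PriorityQueue<>(Comparator.comparingDouble((Hit h) -> h.score)));
        queue.add(new Hit(docId, score));
        if (queue.size() > n) queue.poll(); // drop the weakest hit of this site
    }

    // Document IDs of the best hits for one site, highest score first.
    List<Integer> top(int siteHash) {
        List<Integer> ids = new ArrayList<>();
        PriorityQueue<Hit> queue = queues.get(siteHash);
        if (queue == null) return ids;
        List<Hit> hits = new ArrayList<>(queue);
        hits.sort(Comparator.comparingDouble((Hit h) -> h.score).reversed());
        for (Hit hit : hits) ids.add(hit.docId);
        return ids;
    }

    public static void main(String[] args) {
        PerSiteTopN collector = new PerSiteTopN(2);
        collector.collect(1, 0.5f, 100);
        collector.collect(2, 0.9f, 100);
        collector.collect(3, 0.7f, 100);
        collector.collect(4, 0.1f, 200);
        System.out.println(collector.top(100)); // prints [2, 3]
        System.out.println(collector.top(200)); // prints [4]
    }
}
```

Using a min-heap per site keeps memory bounded at n hits per site, because the weakest hit is evicted as soon as the queue overflows.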

21 Example: Simple facet counting
Summary:
– In this example the facet (site) is used for scoring, but the approach can be extended to facet counting
– Good performance due to locality of facet values
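The extension to facet counting mentioned above could be sketched as follows. The slides do not say how `computeHash` encodes the site, so this example assumes a 4-byte big-endian integer payload; the class and method names are hypothetical.

```java
import java.nio.ByteBuffer;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Illustrative only; the 4-byte payload encoding is an assumption.
final class FacetCountSketch {
    // Decodes a site hash from a payload, assuming 4 bytes in big-endian order.
    static int decodeSite(byte[] payload) {
        return ByteBuffer.wrap(payload).getInt();
    }

    // Counts matching documents per site, given the facet payload of each hit.
    static Map<Integer, Integer> countBySite(List<byte[]> payloadsOfHits) {
        Map<Integer, Integer> counts = new TreeMap<>();
        for (byte[] payload : payloadsOfHits) {
            counts.merge(decodeSite(payload), 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        List<byte[]> hits = Arrays.asList(
                new byte[] {0, 0, 0, 7},
                new byte[] {0, 0, 0, 7},
                new byte[] {0, 0, 0, 9});
        System.out.println(countBySite(hits)); // prints {7=2, 9=1}
    }
}
```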

22 Conclusion
– Payloads offer great flexibility
– Payloads are stored very space-efficiently
– Sophisticated data structures enable efficient skipping over payloads
– Payloads should be used whenever special data is required for finding hits and scoring

23 Outlook
– Finalize API (currently Beta)
– Add more out-of-the-box query types
– Per-document Payloads

24 Questions?

