Advanced Indexing Techniques with Apache Lucene - Payloads

Presentation transcript:

Advanced Indexing Techniques with Apache Lucene - Payloads
Michael Busch (buschmi@apache.org)
http://people.apache.org/~buschmi/apachecon/

Agenda
Part 1: Inverted Index 101
  - Posting Lists
  - Stored Fields vs. Payloads
Part 2: Use cases for Payloads
  - BoostingTermQuery
  - Simple facet counting

Lucene's data structures
[Diagram: Inverted Index -> search -> Hits; Store -> retrieve stored fields -> Results]

Query: not
c:\docs\einstein.txt: The important thing is not to stop questioning.
c:\docs\shakespeare.txt: To be or not to be.
String comparison slow!
Solution: Inverted index

Inverted index
Query: not
c:\docs\einstein.txt: The important thing is not to stop questioning.
c:\docs\shakespeare.txt: To be or not to be.
[Figure: dictionary of terms (be, important, is, not, or, questioning, stop, to, the, thing), each pointing to a posting list of Document IDs]

Inverted index
Query: "not to"
c:\docs\einstein.txt: The important thing is not to stop questioning. (word positions 0 to 7)
c:\docs\shakespeare.txt: To be or not to be. (word positions 0 to 5)
[Figure: dictionary of terms (be, important, is, not, or, questioning, stop, to, the, thing), each pointing to a posting list of Document IDs]

Inverted index
Query: "not to"
c:\docs\einstein.txt: The important thing is not to stop questioning. (word positions 0 to 7)
c:\docs\shakespeare.txt: To be or not to be. (word positions 0 to 5)
[Figure: dictionary of terms (be, important, is, not, or, questioning, stop, to, the, thing), each pointing to a posting list of Document IDs and Positions]
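The positional index on this slide can be imitated in a few lines of plain Java. This is a minimal sketch, not Lucene code: the class name PositionalIndexSketch, the tokenization rule, and the choice of einstein.txt as document 0 and shakespeare.txt as document 1 are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Toy positional inverted index: term -> (docId -> positions). Not Lucene code. */
final class PositionalIndexSketch {
    private final Map<String, Map<Integer, List<Integer>>> postings = new TreeMap<>();

    /** Lowercases the text, splits on non-letters, and records every token position. */
    void add(int docId, String text) {
        int position = 0;
        for (String token : text.toLowerCase().split("[^a-z]+")) {
            if (token.isEmpty()) {
                continue; // split() can yield a leading empty string
            }
            postings.computeIfAbsent(token, t -> new TreeMap<>())
                    .computeIfAbsent(docId, d -> new ArrayList<>())
                    .add(position);
            position++;
        }
    }

    /** Returns the IDs of documents in which first is immediately followed by second. */
    List<Integer> phrase(String first, String second) {
        Map<Integer, List<Integer>> empty = Collections.emptyMap();
        Map<Integer, List<Integer>> a = postings.getOrDefault(first, empty);
        Map<Integer, List<Integer>> b = postings.getOrDefault(second, empty);
        List<Integer> hits = new ArrayList<>();
        for (Map.Entry<Integer, List<Integer>> entry : a.entrySet()) {
            List<Integer> secondPositions = b.get(entry.getKey());
            if (secondPositions == null) {
                continue; // second term does not occur in this document
            }
            for (int p : entry.getValue()) {
                if (secondPositions.contains(p + 1)) {
                    hits.add(entry.getKey());
                    break; // one adjacent pair is enough for a match
                }
            }
        }
        return hits;
    }
}
```

After adding the two example sentences as documents 0 and 1, phrase("not", "to") matches both documents, while phrase("to", "be") matches only the Shakespeare line. Without stored positions the index could only say that both words occur somewhere in a document.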

Inverted index with Payloads
c:\docs\einstein.txt: The important thing is not to stop questioning. (word positions 0 to 7)
c:\docs\shakespeare.txt: To be or not to be. (word positions 0 to 5)
[Figure: dictionary of terms (be, important, is, not, or, questioning, stop, to, the, thing), each pointing to a posting list of Document IDs, Positions and Payloads]

So far…
  - String comparison is slow
  - An inverted index is used to accelerate search
  - Positions are stored in posting lists to allow phrase searches
  - Payloads are stored in posting lists to keep arbitrary data with each position

Lucene's data structures
[Diagram: Inverted Index -> search -> Hits; Store -> retrieve stored fields -> Results]

Store
Fields:
  Field 1: title
  Field 2: content
  Field 3: hashvalue
Documents: D0, D1, D2 (fields F1, F2, F3)

Store
Layout: D0 [F1 F2 F3] D1 [F1 F2 F3] D2 [F1 F2 F3]
  - Optimized for random access
  - Document-locality

Store
Layout: D0 [F1 F2 F3] D1 [F1 F2 F3] D2 [F1 F2 F3]
Posting list with Payloads
[Figure: a posting list over D0, D1, … with Document IDs, Positions and Payloads; F3 is shown in the payloads and marked with an X in the Store]
  - Optimized for scanning and skipping
  - Space-efficient encoding

Agenda
Part 1: Inverted Index 101
  - Posting Lists
  - Stored Fields vs. Payloads
Part 2: Use cases for Payloads
  - BoostingTermQuery
  - Simple facet counting

Payloads - API
org.apache.lucene.analysis.Token
    void setPayload(Payload payload)
org.apache.lucene.index.Payload
    Payload(byte[] data)
    Payload(byte[] data, int offset, int length)

Payloads - API
org.apache.lucene.index.TermPositions
    boolean next();
    int doc();
    int freq();
    int nextPosition();
    int getPayloadLength();
    byte[] getPayload(byte[] data, int offset);

Example: BoostingTermQuery
Use case: Score certain occurrences of a term higher than others
E.g.: Query: "warning"
  doc1: "HURRICANE WARNING"
  doc2: "The Warning Label Generator is a fun way to generate your own warning labels!" (www.warninglabelgenerator.com)

Example: BoostingTermQuery
Analyzer:
    final byte BoldBoost = 5;
    …
    Token token = new Token(…);
    if (isBold) {
        // Attach the boost as a one-byte payload to bold occurrences only.
        token.setPayload(new Payload(new byte[] {BoldBoost}));
    }
    return token;

Example: BoostingTermQuery
Similarity:
    Similarity boostingSimilarity = new DefaultSimilarity() {
        // @Override
        public float scorePayload(byte[] payload, int offset, int length) {
            // A one-byte payload carries the boost factor.
            if (length == 1) return payload[offset];
            // No payload or an unexpected length: neutral factor.
            return 1.0f;
        }
    };

Example: BoostingTermQuery
BoostingTermQuery:
    Query btq = new BoostingTermQuery(new Term("field", "searchterm"));
Searching:
    Searcher searcher = new IndexSearcher(…);
    searcher.setSimilarity(boostingSimilarity);
    …
    Hits hits = searcher.search(btq);
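To make the effect of the payload on ranking concrete, the following plain-Java sketch scales a base term score by the average payload factor of a term's occurrences in one document. It is a simplified model for illustration, not Lucene's scoring code: the class PayloadBoostSketch, the averaging rule and the base score values are assumptions.

```java
/** Simplified model of payload-based boosting. Not Lucene code. */
final class PayloadBoostSketch {

    /** Same rule as the slide's scorePayload: a one-byte payload is the boost, anything else is neutral. */
    static float scorePayload(byte[] payload, int offset, int length) {
        if (payload != null && length == 1) {
            return payload[offset];
        }
        return 1.0f;
    }

    /**
     * Scales baseScore by the average payload factor over all occurrences
     * of the term in one document. A null entry means "no payload".
     */
    static float score(float baseScore, byte[][] payloadPerOccurrence) {
        if (payloadPerOccurrence.length == 0) {
            return baseScore;
        }
        float sum = 0f;
        for (byte[] payload : payloadPerOccurrence) {
            int length = (payload == null) ? 0 : payload.length;
            sum += scorePayload(payload, 0, length);
        }
        return baseScore * (sum / payloadPerOccurrence.length);
    }
}
```

With a bold boost of 5, a document whose single occurrence is bold gets five times its base score, while a document with two plain occurrences keeps its base score. One bold and one plain occurrence average to a factor of 3.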

Example from java-user: Unique Doc Ids
Use case:
  - Store a unique document id (UID) that maps to a row in a database table
  - Retrieve the UID at search time to influence matching/scoring
  - FieldCache takes too long to load

Example from java-user: Unique Doc Ids
Solution:
  - Index one special term for each document, e.g. ID:UID
  - Index one occurrence for each document
  - Store the UID in the Payload of the occurrence

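Example from java-user: Unique Doc Ids
For indexing: TokenStream
    class SinglePayloadTokenStream extends TokenStream {
        boolean done = false;

        public void setUID(int uid) {...}

        // Emits exactly one token per document and attaches the UID as its payload.
        public Token next() throws IOException {
            if (done) return null;
            // Offsets (0, 0) are placeholders; only the payload matters here.
            Token token = new Token("UID", 0, 0);
            // Payload takes a byte[]; intToBytes is an assumed helper (not on the
            // slide) that encodes the int as 4 bytes, the reverse of bytesToInt.
            token.setPayload(new Payload(intToBytes(uid)));
            done = true;
            return token;
        }
    }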

Example from java-user: Unique Doc Ids
For retrieving: TermPositions
    public int[] getCachedUIDs(IndexReader reader) throws IOException {
        int[] cache = new int[reader.maxDoc()];
        TermPositions tp = reader.termPositions(new Term("ID", "UID"));
        byte[] buffer = new byte[4];
        while (tp.next()) {            // iterate over docs
            tp.nextPosition();         // only one pos per doc
            tp.getPayload(buffer, 0);  // copy the 4 payload bytes into buffer
            cache[tp.doc()] = bytesToInt(buffer);
        }
        return cache;
    }
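The retrieval code calls bytesToInt without showing it, and the indexing side needs the reverse conversion to turn a UID into payload bytes. A minimal sketch of such a pair follows. The class name UidCodec, the method intToBytes and the big-endian byte order are assumptions; any encoding works as long as both sides agree on it.

```java
/** Hypothetical helpers for storing an int UID in a 4-byte payload (big-endian). */
final class UidCodec {

    /** Encodes value into 4 bytes, most significant byte first. */
    static byte[] intToBytes(int value) {
        return new byte[] {
            (byte) (value >>> 24),
            (byte) (value >>> 16),
            (byte) (value >>> 8),
            (byte) value
        };
    }

    /** Decodes the first 4 bytes of b; the masks undo Java's sign extension of bytes. */
    static int bytesToInt(byte[] b) {
        return ((b[0] & 0xFF) << 24)
             | ((b[1] & 0xFF) << 16)
             | ((b[2] & 0xFF) << 8)
             |  (b[3] & 0xFF);
    }
}
```

A fixed 4-byte width fits the fixed-size buffer used in the retrieval loop, so one buffer can be reused for every document.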

Example from java-user: Unique Doc Ids
Performance: Load UIDs for 2M docs into memory
  - FieldCache: 16.5 s
  - Payloads: 430 ms

Example: (Very) Simple facet counting
Use case:
  - Collection with docs from different sources
  - Show top-n results from each source instead of top-n results from the entire collection

Example: (Very) Simple facet counting
Analyzer:
    public TokenStream tokenStream(String fieldName, Reader reader) {
        if (fieldName.equals("_facet")) {
            // One token per document; its payload holds the hash of the source URL.
            return new TokenStream() {
                boolean done = false;

                public Token next() {
                    if (done) return null;
                    Token token = new Token(…);
                    token.setPayload(new Payload(computeHash(url)));
                    done = true;
                    return token;
                }
            };
        }
        // Handling of all other fields is not shown on the slide.
    }

Example: (Very) Simple facet counting
HitCollector:
  - Use different PriorityQueues for different sites
  - Instead of returning top-n results of the whole data set, return top-n results per site
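The collector idea can be sketched without Lucene: keep one bounded priority queue per site and evict the weakest hit whenever a queue grows past n. The class PerSiteTopN and its methods are hypothetical. In a real collector the siteHash argument would be read from the payload of the _facet term for each matching document.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.PriorityQueue;

/** Keeps the n best hits per site instead of the n best hits overall. Not Lucene code. */
final class PerSiteTopN {

    static final class Hit {
        final int docId;
        final float score;

        Hit(int docId, float score) {
            this.docId = docId;
            this.score = score;
        }
    }

    /** Lowest score first, so the head of each queue is the hit to evict. */
    private static final Comparator<Hit> BY_SCORE = new Comparator<Hit>() {
        @Override
        public int compare(Hit a, Hit b) {
            return Float.compare(a.score, b.score);
        }
    };

    private final int n;
    private final Map<Integer, PriorityQueue<Hit>> queues = new HashMap<>();

    PerSiteTopN(int n) {
        this.n = n;
    }

    /** Called once per matching document. */
    void collect(int docId, float score, int siteHash) {
        PriorityQueue<Hit> queue = queues.get(siteHash);
        if (queue == null) {
            queue = new PriorityQueue<Hit>(n + 1, BY_SCORE);
            queues.put(siteHash, queue);
        }
        queue.add(new Hit(docId, score));
        if (queue.size() > n) {
            queue.poll(); // drop the lowest-scoring hit of this site
        }
    }

    /** Doc IDs of the best hits for one site, highest score first. */
    List<Integer> top(int siteHash) {
        List<Integer> ids = new ArrayList<>();
        PriorityQueue<Hit> queue = queues.get(siteHash);
        if (queue == null) {
            return ids;
        }
        List<Hit> hits = new ArrayList<>(queue);
        hits.sort(BY_SCORE.reversed());
        for (Hit hit : hits) {
            ids.add(hit.docId);
        }
        return ids;
    }
}
```

Because each queue holds at most n + 1 entries at any time, memory grows with the number of sites rather than with the number of matching documents.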

Example: (Very) Simple facet counting
Summary:
  - In this example the facet (site) is used for scoring, but the approach is extendable to facet counting
  - Good performance due to locality of facet values

Example: Efficient Numeric Search
Use case: Find documents that have a numeric value in a specific range, e.g. all docs with a date >2006 and <2007
Currently in Lucene: RangeQuery
  - Store all values in the dictionary
  - Query expansion

Example: Efficient Numeric Search
Dictionary (one postinglist per entry):
  01/01/2006
  01/02/2006
  01/04/2006
  ...
  12/30/2006
Query: [01/05/2006 TO 11/25/2006]
Problem: A large number of postinglists have to be processed

Example: Efficient Numeric Search
Idea: Index a special term, e.g. 'numeric:date', and store the actual value in a Payload for each doc
Problem: The postinglist can become very big -> the entire list has to be processed
Solution: Hybrid approach

Example: Efficient Numeric Search
Dictionary (one postinglist per entry, holding Document IDs, Positions and Payloads):
  date:01/2006
  date:02/2006
  ...
  date:12/2006
  - Store the day in the payload
  - Store the position where the date occurred

Example: Efficient Numeric Search
  - Tradeoff between the number of postinglists to process and the size of the postinglists
  - Significant speedup possible with a good choice of chunk size
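The hybrid scheme can be modeled in plain Java with one dictionary entry per month and the day of month as a one-byte payload. This is a sketch under assumptions, not Lucene code: the class HybridDateIndexSketch, the monthly chunk size and the use of java.time are choices made for illustration. For the example query [01/05/2006 TO 11/25/2006], a per-day dictionary expands to as many as 325 terms, while the monthly dictionary touches at most 11 lists and reads payloads only in the first and last month.

```java
import java.time.LocalDate;
import java.time.YearMonth;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Toy hybrid date index: one posting list per month, day of month kept as payload. */
final class HybridDateIndexSketch {

    static final class Posting {
        final int docId;
        final byte day; // payload: day of month, 1..31

        Posting(int docId, byte day) {
            this.docId = docId;
            this.day = day;
        }
    }

    private final TreeMap<YearMonth, List<Posting>> dictionary = new TreeMap<>();

    void add(int docId, LocalDate date) {
        dictionary.computeIfAbsent(YearMonth.from(date), m -> new ArrayList<>())
                  .add(new Posting(docId, (byte) date.getDayOfMonth()));
    }

    /** Doc IDs whose date lies in [from, to], both bounds inclusive. */
    List<Integer> range(LocalDate from, LocalDate to) {
        List<Integer> hits = new ArrayList<>();
        if (from.isAfter(to)) {
            return hits;
        }
        YearMonth first = YearMonth.from(from);
        YearMonth last = YearMonth.from(to);
        Map<YearMonth, List<Posting>> months = dictionary.subMap(first, true, last, true);
        for (Map.Entry<YearMonth, List<Posting>> entry : months.entrySet()) {
            YearMonth month = entry.getKey();
            boolean boundary = month.equals(first) || month.equals(last);
            for (Posting posting : entry.getValue()) {
                if (!boundary) {
                    hits.add(posting.docId); // whole month is inside the range
                    continue;
                }
                // Only boundary months need the payload to decide.
                LocalDate date = month.atDay(posting.day);
                if (!date.isBefore(from) && !date.isAfter(to)) {
                    hits.add(posting.docId);
                }
            }
        }
        return hits;
    }
}
```

Choosing a coarser chunk (for example a year) would reduce the number of lists further but make each boundary list longer, which is the tradeoff named on the slide.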

Conclusion
  - Payloads offer great flexibility
  - Payloads are stored very space-efficiently
  - Sophisticated data structures enable efficient skipping over payloads
  - Payloads should be used whenever special data is required for finding hits and scoring

Outlook
  - Finalize API (currently Beta)
  - Add more out-of-the-box query types
  - Per-document Payloads – updateable
  - FieldCache implementation that uses Payloads

Questions?
http://people.apache.org/~buschmi/apachecon/