MG4J: Managing Gigabytes for Java Introduction Ida Mele.

Slides:



Advertisements
Similar presentations
Information Retrieval in Practice
Advertisements

Chapter 6: Memory Management
© Copyright 2012 STI INNSBRUCK Apache Lucene Ioan Toma based on slides from Aaron Bannert
Crawling, Ranking and Indexing. Organizing the Web The Web is big. Really big. –Over 3 billion pages, just in the indexable Web The Web is dynamic Problems:
CompSci Searching & Sorting. CompSci Searching & Sorting The Plan  Searching  Sorting  Java Context.
1 Foundations of Software Design Fall 2002 Marti Hearst Lecture 18: Hash Tables.
Chapter 11: File System Implementation
Information Retrieval in Practice
Modern Information Retrieval
Inverted Indices. Inverted Files Definition: an inverted file is a word-oriented mechanism for indexing a text collection in order to speed up the searching.
Anatomy of a Large-Scale Hypertextual Web Search Engine (e.g. Google)
Spring 2003 ECE569 Lecture ECE 569 Database System Engineering Spring 2003 Yanyong Zhang
Sets and Maps Chapter 9. Chapter 9: Sets and Maps2 Chapter Objectives To understand the Java Map and Set interfaces and how to use them To learn about.
1 Overview of Storage and Indexing Yanlei Diao UMass Amherst Feb 13, 2007 Slides Courtesy of R. Ramakrishnan and J. Gehrke.
1.1 CAS CS 460/660 Introduction to Database Systems File Organization Slides from UC Berkeley.
MG4J: Managing Gigabytes for Java Exercise Ida Mele.
Overview of Search Engines
Hadoop: The Definitive Guide Chap. 8 MapReduce Features
Form Handling, Validation and Functions. Form Handling Forms are a graphical user interfaces (GUIs) that enables the interaction between users and servers.
XHTML Introductory1 Forms Chapter 7. XHTML Introductory2 Objectives In this chapter, you will: Study elements Learn about input fields Use the element.
Chapter. 8: Indexing and Searching Sections: 8.1 Introduction, 8.2 Inverted Files 9/13/ Dr. Almetwally Mostafa.
1 Physical Data Organization and Indexing Lecture 14.
Basic Web Applications 2. Search Engine Why we need search ensigns? Why we need search ensigns? –because there are hundreds of millions of pages available.
The Anatomy of a Large-Scale Hypertextual Web Search Engine Presented By: Sibin G. Peter Instructor: Dr. R.M.Verma.
Michael Cafarella Alon HalevyNodira Khoussainova University of Washington Google, incUniversity of Washington Data Integration for Relational Web.
Chapter 6 1 © Prentice Hall, 2002 The Physical Design Stage of SDLC (figures 2.4, 2.5 revisited) Project Identification and Selection Project Initiation.
March 16 & 21, Csci 2111: Data and File Structures Week 9, Lectures 1 & 2 Indexed Sequential File Access and Prefix B+ Trees.
Search - on the Web and Locally Related directly to Web Search Engines: Part 1 and Part 2. IEEE Computer. June & August 2006.
Inverted index, Compressing inverted index And Computing score in complete search system Chintan Mistry Mrugank dalal.
IT253: Computer Organization
Physical Database Design I, Ch. Eick 1 Physical Database Design I About 25% of Chapter 20 Simple queries:= no joins, no complex aggregate functions Focus.
Introduction n How to retrieval information? n A simple alternative is to search the whole text sequentially n Another option is to build data structures.
Search Engines. Search Strategies Define the search topic(s) and break it down into its component parts What terms, words or phrases do you use to describe.
The Anatomy of a Large-Scale Hypertextual Web Search Engine Sergey Brin & Lawrence Page Presented by: Siddharth Sriram & Joseph Xavier Department of Electrical.
Web Search Algorithms By Matt Richard and Kyle Krueger.
2007. Software Engineering Laboratory, School of Computer Science S E Web-Harvest Web-Harvest: Open Source Web Data Extraction tool 이재정 Software Engineering.
File Storage Organization The majority of space on a device is reserved for the storage of files. When files are created and modified physical blocks are.
The College of Saint Rose CSC 460 / CIS 560 – Search and Information Retrieval David Goldschmidt, Ph.D. from Search Engines: Information Retrieval in Practice,
1 CS 430: Information Discovery Sample Midterm Examination Notes on the Solutions.
Physical Database Design Purpose- translate the logical description of data into the technical specifications for storing and retrieving data Goal - create.
Physical Database Design I, Ch. Eick 1 Physical Database Design I Chapter 16 Simple queries:= no joins, no complex aggregate functions Focus of this Lecture:
Lecture 10 Page 1 CS 111 Summer 2013 File Systems Control Structures A file is a named collection of information Primary roles of file system: – To store.
CS6502 Operating Systems - Dr. J. Garrido Memory Management – Part 1 Class Will Start Momentarily… Lecture 8b CS6502 Operating Systems Dr. Jose M. Garrido.
Introduction to Information Retrieval Introduction to Information Retrieval Lecture 4: Index Construction Related to Chapter 4:
11.1 Silberschatz, Galvin and Gagne ©2005 Operating System Principles 11.5 Free-Space Management Bit vector (n blocks) … 012n-1 bit[i] =  1  block[i]
File Processing : Query Processing 2008, Spring Pusan National University Ki-Joune Li.
for all Hyperion video tutorial/Training/Certification/Material Essbase Optimization Techniques by Amit.
Sets and Maps Chapter 9. Chapter Objectives  To understand the Java Map and Set interfaces and how to use them  To learn about hash coding and its use.
MG4J – Managing GigaBytes for Java Ida Mele. Overview Document –Document –DocumentCollection –FileSetDocumentCollection –DocumentFactory Index Query –HttpQueryServer.
Query processing: optimizations Paolo Ferragina Dipartimento di Informatica Università di Pisa Reading 2.3.
The Anatomy of a Large-Scale Hypertextual Web Search Engine S. Brin and L. Page, Computer Networks and ISDN Systems, Vol. 30, No. 1-7, pages , April.
General Architecture of Retrieval Systems 1Adrienn Skrop.
Lecture 3 – MapReduce: Implementation CSE 490h – Introduction to Distributed Computing, Spring 2009 Except as otherwise noted, the content of this presentation.
Information Retrieval in Practice
Why indexing? For efficient searching of a document
Information Retrieval in Practice
Module 11: File Structure
Search Engine Architecture
Subject Name: File Structures
CS522 Advanced database Systems
Query processing: phrase queries and positional indexes
Implementation Issues & IR Systems
CS 430: Information Discovery
MG4J – Managing GigaBytes for Java Introduction
Optimizing Malloc and Free
Computer Architecture
Introduction to Database Systems
Query processing: phrase queries and positional indexes
Virtual Memory: Working Sets
Information Retrieval and Web Design
Presentation transcript:

MG4J: Managing Gigabytes for Java Introduction Ida Mele

Contact information Ida Mele Dipartimento di Ingegneria Informatica, Automatica e Gestionale “A. Ruberti” Via Ariosto, Rome, Italy Room A220-A221 (second floor) Home page: MG4J - intro1 Ida Mele

MG4J: introduction (1) Free full-text search engine for large document collections Written in Java Developed by the Department of Computer Science (University of Milan) MG4J home: Documentation: Manual: MG4J - intro2 Ida Mele

MG4J can be used for indexing and querying a large collection of documents INPUT: set of documents with the same number and type of fields The fields can be textual or virtual (example of virtual fields are the anchors of HTML pages) OUTPUT: inverted index MG4J - intro3 Ida Mele MG4J: introduction (2)

The document factory is responsible for turning raw byte sequences into documents. In particular, the factory transforms a sequence of bytes into a number of fields Every field has a name and a type Textual fields are words that appear in the title and in the text of a document The words are processed by a term processor The dictionary is the set of terms that are indexed MG4J - intro4 Ida Mele Document factory

MG4J - intro5 Ida Mele Example of index (1)

An occurrence is a group of three numbers (t,d,p) (t,d,p) means that the term whose index is t appears in document d at position p Inverted lists can be obtained by re-sorting the occurrences in increasing term order, so that occurrences relative to the same term appear consecutively MG4J - intro6 Ida Mele Example of index (2)

MG4J scans the whole document collection producing batches Batches are sub-indices limited to a subset of documents, and they are created each time the number of indexed documents reaches a user-provided threshold, or when the available memory is too little Indexer operations: 1.scan all documents and extract the occurrences 2.gather new terms, sort them in alphabetical ordering, and renumber all the occurrences 3.(if required) renumber and sort the documents in increasing order 4.sort (at least partially) the occurrences in increasing term order 5.create the subindex of the current batch of occurrences MG4J - intro7 Ida Mele Building batches

Once the batches are created, they are combined in a single index MG4J has three type of index combination: Concatenation Merging Pasting MG4J - intro8 Ida Mele Batch combination

The first document of the second index is renumbered to the number of documents of the first index, and the others follow; the first document of the third index is renumbered to the sum of number of documents of the first and second index, and so on The resulting index is identical to the index that would be produced by indexing the concatenation of document sequences producing each index This is the kind of combination that is applied to batches, unless documents were renumbered MG4J - intro9 Ida Mele Batch combination: concatenation

Assuming that each index contains a separate subset of documents, with non-overlapping number, we can merge the lists accordingly In case a document appears in two indexes, the merge operation is stopped Note that no renumbering is performed This is the kind of combination that is applied to batches when documents have been renumbered, and each batch contains potentially non-consecutive document numbers MG4J - intro10 Ida Mele Batch combination: merging

Each index is assumed to index a (possibly empty) part of a document For each term and document, the positions of the term in the document are gathered (and possibly suitably renumbered) If the inputs that have been indexed are text files with newline as separator, the resulting index is identical to the one that would be obtained by applying the UNIX command paste to the text files This is the kind of combination that is applied to virtual documents MG4J - intro11 Ida Mele Batch combination: pasting

Scanning phase is time and space consuming MG4J is able to work with little memory, but more memory allows to create larger batches which can be merged quickly The indexer produces a number of subindexes which is proportional to the number of occurrences The cost in term of time for combining the subindexes increases logarithmically with the number of subindexes For each subindex the on-the-fly combiner needs to allocate buffers, so the memory cost increases linearly with the number of subindexes Good strategy: Try to create batches as large as possible and check the logs, because working with an almost full heap can slow down Java MG4J - intro12 Ida Mele Time and space requirements

Once the index is built we can query it using a web server MG4J allows to use command line and the web browser Flexibility to improve query answer time: a portion of the index can loaded into main memory (caching) MG4J - intro13 Ida Mele Querying the index

MG4J provides very sophisticated query tuning To use this features, we must use the command line interface. All settings are then used for subsequent queries MG4J allows to choose the scorer for the ranking of the results of a query Scorers assign a score to the document, depending on some criterion (e.g., the frequency of the term in the document) Documents are ordered using the scores, so that more relevant documents appears at the beginning of the result list MG4J - intro14 Ida Mele Sophisticated query: scorer

A virtual field produces pieces of text that refer to other documents (possibly belonging to the collection) Referrer: the document that is referring to another document Referee: the document to which a piece of text of the Referrer is referring to Intuitively, the Referrer gives us information about the Referee The Referrer produces in a virtual field a number of fragments of text, each referring to a Referee The content of a virtual field is a list of pairs made by the piece of text (called virtual fragment) and by some string that is aimed at representing the Referee (called the document spec) Virtual fields MG4J - intro15 Ida Mele

In the case of the HTML document: the document spec is a URL (as specified in the href attribute) the virtual fragment is the content of the anchor element and some surrounding text (anchor context) The HTMLDocumentFactory produces the pairs (document spec, virtual fragment) Virtual fields: HTMLDocumentFactory MG4J - intro16 Ida Mele

The document factory just produces fields out of a HTML document There is no fixed way to map document spec into actual references to documents in the collection. This is resolved by the notion of document resolver The document resolver maps the document spec, which are produced by some document factory, into actual references to the documents in the collection MG4J - intro17 Ida Mele Virtual fields: document resolver

All the virtual fragments that refer to a given document of the collection are like a single text, called the virtual text Virtual fragments coming from different anchors are concatenated, and this may produce false positive results To avoid such kinds of false positives, we can use virtual gap The virtual gap is a positive integer, representing the virtual space left between different virtual fragments Virtual gap MG4J - intro18 Ida Mele

For metadata (dates, integers, etc.) MG4J provides a special kind of index called payload-based index Payload-based index leverages the structure of the text- based index, with the difference that each posting has a payload Searching a payload-based index is different from searching a text-based index. For example, instead of keyword queries and Boolean queries, we can use range queries [..] Assuming to have the field date, we can use the following query: [ 20/2/ /2/2007 ] to get all the documents whose date is between Feb. 20 and Feb. 23, 2007 MG4J - intro19 Ida Mele Payload-based index

MG4J provides a great flexibility for the index construction, and all the choices have a significant impact on performance Building a collection during the indexing phase will slow down the whole process If we have enough memory, building large batches is a good trick We can choose the codes used for compression: Nonparametric codes are quicker than parametric codes. The default choice is  -code for frequencies and counts, while  -code for pointers and positions. Max speed can be obtained by using  -code for everything MG4J - intro20 Ida Mele Performance (1)

Another trick is discarding what is not necessary: pointers, counts, positions, etc. For example, if we use BM25 or TF/IDF scoring, we do not need to store positions in the index (use -c POSITIONS:none) Indexes contain skipping structures to skip index entries. However, skipping structures introduce a slight overhead when scanning sequentially a list (use --no- skips option to disable skipping structure) MG4J - intro21 Ida Mele Performance (2)

The index can be: read from disk memory mapped directly loaded into main memory The three solutions work with increasing speed and increased main memory usage The default is to read the index from disk We can add suitable options to the index URI (e.g. mapped=1 or inmemory=1) to use the other solutions MG4J - intro22 Ida Mele Performance (3)

MG4J provides a generic way of combining indexes into clusters This feature can be used for supporting the incremental indexing For example, we can index two sets of documents separately, and then we use the two resulting indexes as a single index (concatenation-based cluster), or we can combine them and get a new index. Clusters can be documental or lexical MG4J - intro23 Ida Mele Clustering

Cluster: set of local indexes as a global index Documental clusters: Each document of the global index appears exactly once in each local index Documental clusters can be used to keep a set of documents with high static rank in a separate index living on faster storage MG4J - intro24 Ida Mele Documental clusters

Cluster: set of local indexes as a global index Lexical clusters: Each term of the global index appears exactly once in each local index Lexical cluster can be used to load in memory the inverted lists of terms that appear more frequently in user queries MG4J - intro25 Ida Mele Lexical clusters

The opposite of clustering is partitioning Partitioning an index means dividing its inverted lists using some criterion Partitioning can be: Documental Lexical Personalized Improving performance: we can partition our index, and once we have several subindexes, we can decide which ones will be loaded into main memory, which ones will be mapped, etc. MG4J - intro26 Ida Mele Partitioning