MG4J – Managing GigaBytes for Java Introduction

MG4J – Managing GigaBytes for Java Introduction
Ida Mele

Home page: www.dis.uniroma1.it/~mele
Contact information Ida Mele Dipartimento di Ingegneria Informatica, Automatica e Gestionale “A. Ruberti” Via Ariosto, Rome, Italy Room B221 (second floor) Home page: Ida Mele MG4J 1

Free full-text search engine for large document collections.
MG4J: introduction (1) Free full-text search engine for large document collections. Written in Java. Developed by the Department of Computer Science (University of Milan). Release 5.0: Documentation: Manual: Ida Mele MG4J 2

MG4J can be used to index and query a large collection of documents.
MG4J: introduction (2) MG4J can be used to index and query a large collection of documents. INPUT: set of documents with same number and type of fields. The fields can be textual or virtual (example of virtual fields are the anchors of HTML pages). OUTPUT: inverted index. Ida Mele MG4J 3

Every field has a name and a type.
Document factory The document factory is responsible for turning raw byte sequences into documents. In particular, the factory transforms a sequence of bytes into a number of fields. Every field has a name and a type. Textual fields are words that appear in the title and in the text of a document. The words are processed by a term processor. The dictionary is the set of terms that appear in the index. Ida Mele MG4J 4

Example of index Ida Mele MG4J 5

Building batches An occurrence is a group of three numbers, say (t,d,p), meaning that term with index t appears in document d at position p. Inverted lists can be obtained by re-sorting the occurrences in increasing term order, so that occurrences relative to the same term appear consecutively. MG4J scans the whole document collection producing batches. Batches are sub-indices limited to a subset of documents, and they are created each time the number of indexed documents reaches a user-provided threshold, or when the available memory is too little. Ida Mele MG4J 6

Once the batches are created, they are combined in a single index.
Batch combination Once the batches are created, they are combined in a single index. MG4J has three type of index combination: Concatenation Merging Pasting Ida Mele MG4J 7

Batch combination: concatenation
The first document of the second index is renumbered to the number of documents of the first index, and the others follow; the first document of the third index is renumbered to the sum of number of documents of the first and second index, and so on. The resulting index is identical to the index that would be produced by indexing the concatenation of document sequences producing each index. This is the kind of combination that is applied to batches, unless documents were renumbered. Ida Mele MG4J 8

Batch combination: merging
Assuming that each index contains a separate subset of documents, with non-overlapping number, we can merge the lists accordingly. In case a document appears in two indices, the merge operation is stopped. Note that no renumbering is performed. This is the kind of combination that is applied to batches when documents have been renumbered, and each batch contains potentially non-consecutive document numbers. Ida Mele MG4J 9

Batch combination: pasting
Each index is assumed to index a (possibly empty) part of a document. For each term and document, the positions of the term in the document are gathered (and possibly suitably renumbered). If the inputs that have been indexed are text files with newline as separator, the resulting index is identical to the one that would be obtained by applying the UNIX command paste to the text files. This is the kind of combination that is applied to virtual documents. Ida Mele MG4J 10

Once the index is built we can query it using a web server.
Querying the index Once the index is built we can query it using a web server. MG4J allows to use command line and the web browser. Flexibility to improve query answer time: a portion of the index can loaded into main memory (caching). Ida Mele MG4J 11

Sophisticated Query: scorer
MG4J provides very sophisticated query tuning. To use this features, we must use the command line interface. For example, we can choose the scorer to use. The scorers are important for ranking the documents satisfying a query depending on some criterion (ex. the frequency of the term in the document). Ida Mele MG4J 12

Virtual fields (1) A virtual field produces pieces of text that are to be referred to other documents. Referrer: the document that has a link to another document. Referee: the document pointed by the Referrer. Intuitively the Referrer gives us information about the Referee. Hence, the Referrer produces in a virtual field a number of fragments of text, each referring to a certain Referee. The content of a virtual field is a list of pairs made by the piece of text (called virtual fragment) and by some string that is aimed at representing the Referee (called the document spec). Ida Mele MG4J 13

the document spec is a URL (as specified in the href attribute);
Virtual fields (2) In the case of the HTMLDocumentFactory, the anchor field is the list of all anchors contained in the document: the document spec is a URL (as specified in the href attribute); the virtual fragment is the content of the anchor element. Ida Mele MG4J 14

The document factory just produces fields out of a HTML document.
Document resolver The document factory just produces fields out of a HTML document. There is no fixed way to map document spec into actual references to documents in the collection. This is resolved, by the notion of document resolver. The document resolver maps the document spec produced by some document factory into actual references to documents in the collection. Ida Mele MG4J 15

Virtual gap All the virtual fragments that refer to a given document of the collection are like a single text, called the virtual text. Virtual fragments coming from different anchors are concatenated, and this may produce false positive results. To avoid such kinds of false positives, we can use virtual gap: a positive integer, representing the virtual space left between different virtual fragments. Ida Mele MG4J 16

Payload-based index For metadata (dates, integers, etc), MG4J provides a special kind of index, called payload-based index. Searching a payload-based index is rather different form searching an index. For example, instead of term-based operators and Boolean you just get range queries [..]. Assuming to have the field of the date, we can use: [ 20/2/ /2/2007 ] : Ida Mele MG4J 17

Performance (1) MG4J provides a great flexibility in index construction, and all the choices have a significant impact on performance. Building a collection during the indexing phase will of course slow down the whole process. Nonparametric codes are quicker than parametric codes. Discarding what you do not need: pointers, counts, positions, etc. For example, if we will use BM25 or TF/IDF scoring, we do not need to store positions in the index. Indices contain a skipping structure that makes skipping index entries faster, however the skipping structures introduce a slight overhead when scanning sequentially a list. Ida Mele MG4J 18

directly loaded into main memory.
Performance (2) The index can be: read from disk, memory-mapped, directly loaded into main memory. These three solutions work with increasing speed and increased main memory usage. The default is to read an index from disk. We can add suitable options to the index URI (mapped=1 or inmemory=1) to use the other solutions. Ida Mele MG4J 19

MG4J provides a generic way of combining indices into clusters.
Clustering MG4J provides a generic way of combining indices into clusters. Example: we can index separately two sets of documents and then use the two resulting indices as a single index with a concatenation-based cluster index. A cluster exhibits a set of local indices as a single global index. Clusters can be documental or lexical. Ida Mele MG4J 20

Documental Clusters Each document of the global index appears exactly once in each local index. Documental clusters can be used to keep a set of documents with high static rank in a separate index living on faster storage. Ida Mele MG4J 21

Lexical Clusters Each term of the global index appears exactly once in each local index. Lexical cluster can be used to load in memory the inverted lists of terms that appear more frequently in user queries. Ida Mele MG4J 22

The opposite of clustering is partitioning.
Partitioning an index means dividing its inverted lists using some criterion. Partitioning can be: Documental, Lexical, Personalized. Ida Mele MG4J 23

MG4J – Managing GigaBytes for Java Introduction

Similar presentations

Presentation on theme: "MG4J – Managing GigaBytes for Java Introduction"— Presentation transcript:

Similar presentations

About project

Feedback

Log in

Auth with social network:

MG4J – Managing GigaBytes for Java Introduction

Similar presentations

Presentation on theme: "MG4J – Managing GigaBytes for Java Introduction"— Presentation transcript:

Similar presentations

About project

Feedback