MG4J – Managing GigaBytes for Java Introduction


MG4J – Managing GigaBytes for Java Introduction Ida Mele

Contact information
Ida Mele
Dipartimento di Ingegneria Informatica, Automatica e Gestionale “A. Ruberti”
Via Ariosto, 25, 00185 Rome, Italy
Room B221 (second floor)
Email: mele@dis.uniroma1.it
Home page: www.dis.uniroma1.it/~mele
Ida Mele MG4J 1

MG4J: introduction (1)
Free full-text search engine for large document collections. Written in Java. Developed by the Department of Computer Science (University of Milan).
Release 5.0: http://mg4j.di.unimi.it/
Documentation: http://mg4j.di.unimi.it/docs/
Manual: http://mg4j.di.unimi.it/man/manual.pdf

MG4J: introduction (2)
MG4J can be used to index and query a large collection of documents.
INPUT: a set of documents with the same number and type of fields. The fields can be textual or virtual (an example of a virtual field is the set of anchors of an HTML page).
OUTPUT: an inverted index.

Document factory
The document factory is responsible for turning raw byte sequences into documents. In particular, the factory transforms a sequence of bytes into a number of fields. Every field has a name and a type. Textual fields contain the words that appear in the title and in the text of a document. The words are processed by a term processor. The dictionary is the set of terms that appear in the index.
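The pipeline described above (bytes, then fields, then processed terms, then dictionary) can be sketched as follows. This is only an illustration in Python, not MG4J code: the function names, the "first line is the title" layout, and the lowercasing term processor are all assumptions made for the example.

```python
def lowercase_term_processor(word):
    """Hypothetical term processor: trims edge punctuation and lowercases."""
    cleaned = word.strip(".,;:!?\"'()")
    return cleaned.lower() if cleaned else None


def toy_document_factory(raw_bytes):
    """Turn a raw byte sequence into named, typed fields.

    Assumed layout (for illustration only): the first line is the title,
    everything after it is the text.
    """
    decoded = raw_bytes.decode("utf-8")
    title, _, text = decoded.partition("\n")
    fields = {}
    for name, content in (("title", title), ("text", text)):
        terms = [lowercase_term_processor(w) for w in content.split()]
        # Drop words that the term processor discarded entirely.
        fields[name] = {"type": "text", "terms": [t for t in terms if t]}
    return fields


doc = toy_document_factory(b"Managing Gigabytes\nIndexing, compressing and querying.")

# The dictionary is the set of terms that end up in the index.
dictionary = sorted({t for field in doc.values() for t in field["terms"]})
```

Here the term processor decides what counts as a term, so "Indexing," and "indexing" end up as the same dictionary entry.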

Example of index (figure not reproduced in this transcript)

Building batches
An occurrence is a group of three numbers, say (t,d,p), meaning that the term with index t appears in document d at position p. Inverted lists can be obtained by re-sorting the occurrences in increasing term order, so that occurrences relative to the same term appear consecutively. MG4J scans the whole document collection producing batches. Batches are sub-indices limited to a subset of documents, and they are created each time the number of indexed documents reaches a user-provided threshold, or when the available memory runs low.
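A minimal sketch of the idea, in Python rather than in MG4J itself: occurrences (t,d,p) are generated, sorted by term, and grouped into inverted lists; batches are cut every `threshold` documents. The function names are hypothetical, documents are assumed to be numbered from 0 inside each batch, and the low-memory trigger is not modelled.

```python
from itertools import groupby


def occurrences(documents, term_index):
    """Yield (t, d, p): term number t occurs in document d at position p."""
    for d, text in enumerate(documents):
        for p, word in enumerate(text.split()):
            yield (term_index[word], d, p)


def inverted_lists(occs):
    """Re-sort occurrences by term (then document, then position) and group them."""
    ordered = sorted(occs)
    return {t: [(d, p) for _, d, p in group]
            for t, group in groupby(ordered, key=lambda occ: occ[0])}


def build_batches(documents, term_index, threshold):
    """Create one sub-index every `threshold` documents (memory check omitted)."""
    batches = []
    for start in range(0, len(documents), threshold):
        chunk = documents[start:start + threshold]
        batches.append(inverted_lists(occurrences(chunk, term_index)))
    return batches


docs = ["a b a", "b c", "c a"]
terms = {"a": 0, "b": 1, "c": 2}

index = inverted_lists(occurrences(docs, terms))
batches = build_batches(docs, terms, threshold=2)
```

With a threshold of 2, the three documents produce two batches: one covering the first two documents and one covering the last.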

Batch combination
Once the batches are created, they are combined into a single index. MG4J has three types of index combination: concatenation, merging, and pasting.

Batch combination: concatenation
The first document of the second index is renumbered to the number of documents of the first index, and the others follow; the first document of the third index is renumbered to the sum of the numbers of documents of the first and second index, and so on. The resulting index is identical to the index that would be produced by indexing the concatenation of the document sequences that produced each index. This is the kind of combination that is applied to batches, unless documents were renumbered.
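The renumbering rule can be sketched in a few lines of Python. This is an illustration under assumptions, not MG4J's implementation: each index is modelled as a dictionary from term to a list of (document, positions) postings, and the number of documents in each index is passed in explicitly.

```python
def concatenate(indices, sizes):
    """Concatenate indices, shifting document numbers by the sizes of earlier indices.

    indices: list of {term: [(doc, [positions]), ...]} with local document numbers.
    sizes:   number of documents in each index (assumed known).
    """
    result = {}
    offset = 0
    for index, size in zip(indices, sizes):
        for term, postings in index.items():
            result.setdefault(term, []).extend(
                (doc + offset, positions) for doc, positions in postings)
        # Documents of the next index start where this one ends.
        offset += size
    return result


first = {"java": [(0, [1]), (1, [0, 4])]}
second = {"java": [(0, [2])], "index": [(1, [0])]}

combined = concatenate([first, second], sizes=[2, 2])
```

Document 0 of the second index becomes document 2 of the result, because the first index holds two documents.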

Batch combination: merging
Assuming that each index contains a separate subset of documents, with non-overlapping document numbers, we can merge the lists accordingly. If a document appears in two indices, the merge operation is stopped. Note that no renumbering is performed. This is the kind of combination that is applied to batches when documents have been renumbered, and each batch contains potentially non-consecutive document numbers.
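A rough Python sketch of merging, using the same toy representation (term mapped to a list of (document, positions) postings). The names are hypothetical, and raising an exception is just one way to model "the merge operation is stopped".

```python
def merge(indices):
    """Merge indices over disjoint document sets; no renumbering is performed."""
    seen = set()
    for index in indices:
        docs = {doc for postings in index.values() for doc, _ in postings}
        if seen & docs:
            # A document appearing in two indices stops the merge.
            raise ValueError(f"documents in more than one index: {sorted(seen & docs)}")
        seen |= docs

    result = {}
    for index in indices:
        for term, postings in index.items():
            result.setdefault(term, []).extend(postings)
    for postings in result.values():
        postings.sort(key=lambda posting: posting[0])  # keep lists in document order
    return result


batch_a = {"java": [(0, [1]), (4, [0])]}
batch_b = {"java": [(1, [3])], "index": [(3, [2])]}

merged = merge([batch_a, batch_b])
```

Unlike concatenation, the document numbers in the result are exactly those found in the inputs, interleaved in increasing order.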

Batch combination: pasting
Each index is assumed to index a (possibly empty) part of a document. For each term and document, the positions of the term in the document are gathered (and possibly suitably renumbered). If the inputs that have been indexed are text files with newline as separator, the resulting index is identical to the one that would be obtained by applying the UNIX command paste to the text files. This is the kind of combination that is applied to virtual documents.
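The following Python sketch shows one plausible reading of the renumbering step: positions coming from a later part are shifted by the total length of the earlier parts of the same document. The data layout (term mapped to {document: positions}) and the explicit `part_sizes` argument are assumptions made for the example.

```python
def paste(indices, part_sizes):
    """Paste indices that each cover a part of the same documents.

    indices[i]:    {term: {doc: [positions]}} for part i of every document.
    part_sizes[i]: {doc: number of words in part i of that document}.
    """
    result = {}
    offsets = {}  # doc -> total length of the parts pasted so far
    for index, sizes in zip(indices, part_sizes):
        for term, postings in index.items():
            for doc, positions in postings.items():
                shift = offsets.get(doc, 0)
                result.setdefault(term, {}).setdefault(doc, []).extend(
                    p + shift for p in positions)
        for doc, size in sizes.items():
            offsets[doc] = offsets.get(doc, 0) + size
    return result


# Part A of document 0 is "big data", part B is "data index".
part_a = {"big": {0: [0]}, "data": {0: [1]}}
part_b = {"data": {0: [0]}, "index": {0: [1]}}

pasted = paste([part_a, part_b], part_sizes=[{0: 2}, {0: 2}])
```

The result matches what indexing the single text "big data data index" would give for document 0.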

Querying the index
Once the index is built, we can query it using a web server. MG4J lets us query the index from the command line or from a web browser. Flexibility to improve query answer time: a portion of the index can be loaded into main memory (caching).

Sophisticated query: scorer
MG4J provides very sophisticated query tuning. To use these features, we must use the command-line interface. For example, we can choose which scorer to use. Scorers are important for ranking the documents that satisfy a query according to some criterion (e.g., the frequency of the term in the document).
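To make the role of a scorer concrete, here is a small Python sketch of a frequency-based scorer. It is not one of MG4J's scorers: the function names are invented, the query is assumed to be a non-empty conjunction of terms, and ties are broken by document number.

```python
def count_scorer(index, term, doc):
    """Score = number of occurrences of the term in the document."""
    return len(index.get(term, {}).get(doc, []))


def rank(index, query_terms):
    """Rank the documents containing all query terms by total term frequency."""
    candidates = set.intersection(*(set(index.get(t, {})) for t in query_terms))
    scored = [(sum(count_scorer(index, t, d) for t in query_terms), d)
              for d in candidates]
    # Highest score first; document number breaks ties.
    return [d for score, d in sorted(scored, key=lambda s: (-s[0], s[1]))]


# Toy index: term -> {document: [positions]}
toy_index = {
    "java": {0: [0, 3, 5], 1: [2]},
    "index": {0: [1], 1: [0, 4], 2: [1]},
}

both = rank(toy_index, ["java", "index"])
single = rank(toy_index, ["index"])
```

Swapping `count_scorer` for another function changes the ranking without changing which documents satisfy the query.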

Virtual fields (1)
A virtual field produces pieces of text that are to be attributed to other documents.
Referrer: the document that has a link to another document.
Referee: the document pointed to by the Referrer.
Intuitively, the Referrer gives us information about the Referee. Hence, the Referrer produces in a virtual field a number of fragments of text, each referring to a certain Referee. The content of a virtual field is a list of pairs, each made of a piece of text (called the virtual fragment) and a string that represents the Referee (called the document spec).

Virtual fields (2)
In the case of the HTMLDocumentFactory, the anchor field is the list of all anchors contained in the document: the document spec is a URL (as specified in the href attribute); the virtual fragment is the content of the anchor element.
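As an illustration of what such a list of pairs looks like, the Python sketch below extracts (document spec, virtual fragment) pairs from an HTML snippet with the standard library parser. It does not reproduce how HTMLDocumentFactory works internally; the class name `AnchorExtractor` is made up for the example.

```python
from html.parser import HTMLParser


class AnchorExtractor(HTMLParser):
    """Collect (document spec, virtual fragment) pairs from anchor elements."""

    def __init__(self):
        super().__init__()
        self.anchors = []
        self._href = None   # href of the anchor currently open, if any
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.anchors.append((self._href, "".join(self._text).strip()))
            self._href = None


extractor = AnchorExtractor()
extractor.feed(
    '<p>See the <a href="http://mg4j.di.unimi.it/">MG4J home page</a> and the '
    '<a href="http://mg4j.di.unimi.it/docs/">API documentation</a>.</p>'
)
extractor.close()
```

Each pair says: "this text describes the document found at that URL", which is exactly the Referrer/Referee relation.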

Document resolver
The document factory just produces fields out of an HTML document. There is no fixed way to map a document spec into an actual reference to a document in the collection. This problem is solved by the notion of document resolver: the document resolver maps the document specs produced by some document factory into actual references to documents in the collection.
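A resolver can be as simple as a lookup table from URL to document number. The Python sketch below assumes exactly that; the function names and example URLs are hypothetical, and specs that point outside the collection are assumed to resolve to nothing.

```python
def make_resolver(collection_urls):
    """Build a resolver from the URLs of the documents in the collection."""
    url_to_doc = {url: doc for doc, url in enumerate(collection_urls)}

    def resolve(document_spec):
        # Returns the document number, or None if the spec points outside the collection.
        return url_to_doc.get(document_spec)

    return resolve


resolve = make_resolver(["http://example.org/a", "http://example.org/b"])

inside = resolve("http://example.org/b")
outside = resolve("http://example.org/missing")
```

Fragments whose spec cannot be resolved have no document to be attached to, so a real resolver must decide what to do with them; here they are simply reported as `None`.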

Virtual gap
All the virtual fragments that refer to a given document of the collection are treated as a single text, called the virtual text. Virtual fragments coming from different anchors are concatenated, and this may produce false positives: for example, a phrase query could match the last word of one fragment followed by the first word of the next. To avoid such false positives, we can use a virtual gap: a positive integer representing the virtual space left between different virtual fragments.
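The effect of the gap on positions can be sketched as follows. This Python example is an assumption-laden illustration: positions are word offsets, the gap is added after every fragment, and the value 64 is an arbitrary choice.

```python
def virtual_text_positions(fragments, virtual_gap):
    """Assign positions to the words of concatenated fragments, leaving a gap between fragments."""
    positions = []
    next_position = 0
    for fragment in fragments:
        for word in fragment.split():
            positions.append((word, next_position))
            next_position += 1
        next_position += virtual_gap  # empty space before the next fragment
    return positions


def phrase_match(positions, first, second):
    """True if `second` occurs at the position right after some occurrence of `first`."""
    where = {}
    for word, p in positions:
        where.setdefault(word, set()).add(p)
    return any(p + 1 in where.get(second, ()) for p in where.get(first, ()))


fragments = ["search engine", "java library"]

no_gap = virtual_text_positions(fragments, virtual_gap=0)
with_gap = virtual_text_positions(fragments, virtual_gap=64)
```

Without a gap, the phrase "engine java" matches even though no anchor contains it; with a gap of 64 it no longer does, while "search engine" still matches.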

Payload-based index
For metadata (dates, integers, etc.), MG4J provides a special kind of index, called payload-based index. Searching a payload-based index is rather different from searching an ordinary index: instead of term-based and Boolean operators, you just get range queries [..]. Assuming we have a date field, we can use: [ 20/2/2007 .. 23/2/2007 ]
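Conceptually, a range query selects the documents whose payload falls between two bounds. The Python sketch below models payloads as a plain dictionary from document number to date; the function name is hypothetical and the bounds are assumed to be inclusive.

```python
from datetime import date


def range_query(payloads, low, high):
    """Return the documents whose payload lies in [low, high] (inclusive bounds assumed)."""
    return sorted(doc for doc, value in payloads.items() if low <= value <= high)


# One date payload per document.
dates = {
    0: date(2007, 2, 19),
    1: date(2007, 2, 20),
    2: date(2007, 2, 22),
    3: date(2007, 2, 24),
}

hits = range_query(dates, date(2007, 2, 20), date(2007, 2, 23))
```

This corresponds to the range [ 20/2/2007 .. 23/2/2007 ] on the toy data: documents 1 and 2 are selected.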

Performance (1)
MG4J provides great flexibility in index construction, and all the choices have a significant impact on performance. Building a collection during the indexing phase will of course slow down the whole process. Nonparametric codes are quicker than parametric codes. Discard what you do not need (pointers, counts, positions, etc.): for example, if we use BM25 or TF/IDF scoring, we do not need to store positions in the index. Indices contain a skipping structure that makes skipping index entries faster; however, the skipping structure introduces a slight overhead when a list is scanned sequentially.

Performance (2)
The index can be: read from disk, memory-mapped, or directly loaded into main memory. These three solutions work with increasing speed and increasing main memory usage. The default is to read an index from disk. We can add suitable options to the index URI (mapped=1 or inmemory=1) to use the other solutions.
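The three access modes are general techniques, and the difference between them can be illustrated outside MG4J. The Python sketch below reads the same four bytes from a temporary file in the three ways; it is only an analogy and says nothing about how MG4J implements the mapped=1 and inmemory=1 options.

```python
import mmap
import os
import tempfile

# Create a small file standing in for an index (4096 bytes).
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(bytes(range(256)) * 16)
    path = f.name

# 1. Read from disk: seek and read on every access, minimal memory usage.
with open(path, "rb") as f:
    f.seek(300)
    from_disk = f.read(4)

# 2. Memory-mapped: the operating system pages the file in on demand.
with open(path, "rb") as f, mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mapped:
    from_mapping = mapped[300:304]

# 3. Loaded into main memory: the whole file is read once, then sliced.
with open(path, "rb") as f:
    in_memory = f.read()
from_memory = in_memory[300:304]

os.remove(path)
```

All three accesses return the same bytes; what changes is how much of the file sits in main memory and how much work each access costs.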

Clustering
MG4J provides a generic way of combining indices into clusters. Example: we can index separately two sets of documents and then use the two resulting indices as a single index with a concatenation-based cluster index. A cluster exhibits a set of local indices as a single global index. Clusters can be documental or lexical.

Documental clusters
Each document of the global index appears in exactly one local index. Documental clusters can be used to keep a set of documents with high static rank in a separate index living on faster storage.

Lexical clusters
Each term of the global index appears in exactly one local index. Lexical clusters can be used to load into memory the inverted lists of the terms that appear most frequently in user queries.

Partitioning
The opposite of clustering is partitioning. Partitioning an index means dividing its inverted lists using some criterion. Partitioning can be: documental, lexical, or personalized.
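The two standard criteria can be sketched on a toy index (term mapped to a list of (document, positions) postings). This Python example is illustrative only: the function names, the single cutpoint, and the choice to keep global document numbers in the local indices are assumptions.

```python
def partition_documental(index, cutpoint):
    """Split every inverted list by document number: docs < cutpoint vs. the rest."""
    low, high = {}, {}
    for term, postings in index.items():
        for doc, positions in postings:
            target = low if doc < cutpoint else high
            target.setdefault(term, []).append((doc, positions))
    return low, high


def partition_lexical(index, is_frequent):
    """Split the index by term: each inverted list goes whole to one local index."""
    frequent = {t: p for t, p in index.items() if is_frequent(t)}
    rare = {t: p for t, p in index.items() if not is_frequent(t)}
    return frequent, rare


global_index = {"java": [(0, [1]), (2, [0])], "index": [(1, [3])]}

low, high = partition_documental(global_index, cutpoint=2)
frequent, rare = partition_lexical(global_index, lambda term: term == "java")
```

A documental partition cuts each list into pieces, while a lexical partition moves whole lists; in both cases every posting of the global index ends up in exactly one local index.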