MG4J: Managing Gigabytes for Java Exercise Ida Mele.



Document
Indexing in MG4J is centered around documents.
Package: it.unimi.di.big.mg4j.document
A Document object represents a single document that can be indexed.
Different documents have different numbers and types of fields. For example:
  Email: from, to, date, subject, body
  HTML page: title, url, body

Document: summary of methods (the methods table from the slide is not preserved in this transcript)

DocumentCollection
Package: it.unimi.di.big.mg4j.document
A DocumentCollection is a randomly addressable list of documents.

FileSetDocumentCollection
Package: it.unimi.di.big.mg4j.document
The main method of FileSetDocumentCollection builds and serializes a collection of documents specified by their filenames.

Document Factory
Package: it.unimi.di.big.mg4j.document
A factory turns a raw stream of bytes (a file) into a document made of several fields (for example, title and text).

Standard MG4J Document Factories
  CompositeDocumentFactory
  HtmlDocumentFactory
  IdentityDocumentFactory
  MailDocumentFactory
  PdfDocumentFactory
  ReplicatedDocumentFactory
  PropertyBasedDocumentFactory
  TRECHeaderDocumentFactory
  ZipDocumentCollection.ZipFactory

Query
Package: it.unimi.di.big.mg4j.query
To query the index, we can use the main method of the class Query.
We can submit queries from:
  the command line
  a web browser
QueryEngine: receives the query and returns the ranked list of results.
HttpQueryServer: a simple web server for query processing.

Indexing and querying: exercise
Technical requirements:
  UNIX operating system
  Java (>= 6)
The document collection and the libraries are available at:

Set the classpath
Download and extract htmlDIS.tar.gz.
Download and extract lib.zip.
Download the file set-classpath.sh.
Edit the first line of set-classpath.sh: replace your_directory with the path of the folder containing all the .jar files (the lib folder).
Set the CLASSPATH:
  source set-classpath.sh

Building the collection of documents (1)
Help:
  java it.unimi.di.big.mg4j.document.FileSetDocumentCollection --help
Create the collection:
  find htmlDIS -iname \*.html | java it.unimi.di.big.mg4j.document.FileSetDocumentCollection -f HtmlDocumentFactory -p encoding=UTF-8 dis.collection
find returns the list of files, one per line. This list is passed as input to the main method of FileSetDocumentCollection.

Building the collection of documents (2)
We also need to specify a factory (the -f option) and the encoding as a property (the -p option).
The name of the collection is dis.collection.
The collection does not contain the files, only their names.
Deleting or modifying files in the htmlDIS directory may therefore make the collection inconsistent.

Building the index
Help:
  java it.unimi.di.big.mg4j.tool.IndexBuilder --help
Create the index:
  java it.unimi.di.big.mg4j.tool.IndexBuilder --downcase -S dis.collection dis
--downcase: forces all terms to be downcased.
-S: specifies the collection to index. If the option is omitted, IndexBuilder expects to index a document sequence read from standard input.
dis: the basename of the index.
If you run out of memory, use -Xmx to allocate more memory to Java:
  java -Xmx512M it.unimi.di.big.mg4j.tool.IndexBuilder --downcase -S dis.collection dis

Index files (1)
dis-{text,title}.terms: contain the terms of the dictionary, one term per line.
  more dis-text.terms
dis-{text,title}.stats: contain statistics.
  more dis-text.stats
dis-{text,title}.properties: contain global information.
  more dis-text.properties

Index files (2)
dis-{text,title}.frequencies: for each term, the number of documents containing the term (γ-code).
dis-{text,title}.globcounts: for each term, the number of occurrences of the term (γ-code).
dis-{text,title}.offsets: for each term, the offset (γ-code).

Index files (3)
dis-{title,text}.sizes: contain the list of document sizes. The size of a document is the number of words it contains (γ-code).
dis-{text,title}.batch: temporary files with sub-indices (γ-code). Use the option --keep-batches to keep the temporary files.
dis-{text,title}.index: contain the index (γ-code).

Web server
Help:
  java it.unimi.di.big.mg4j.query.Query --help
Querying the index:
  java it.unimi.di.big.mg4j.query.Query -h -i FileSystemItem -c dis.collection dis-text dis-title
Command line:
  {text, title} > computer
Web browser:

Query (1)
Single word: the result is the set of documents that contain the specified word.
  Example: computer
AND: two or more terms separated by whitespace, AND, or &. The result is the set of documents that contain all the specified words.
  Example: computer science
  Example: computer AND science
  Example: computer & science

Query (2)
OR: two or more terms separated by OR or |. The result is the set of documents that contain any of the given words.
  Example: conference | workshop
NOT: the operator NOT or ! is used for negation.
  Example: conference & ! workshop
Parentheses: used to enforce priority in complex queries.
  Example: university & (rome | california)
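The Boolean operators above map naturally onto set operations over posting lists. The following is a minimal Python sketch on a toy inverted index, not MG4J code; the index contents, `all_docs`, and the helper `docs` are hypothetical names introduced only for illustration.

```python
# Toy inverted index: term -> set of document ids (hypothetical data).
index = {
    "conference": {0, 1, 3},
    "workshop": {1, 2},
    "university": {0, 2, 3},
    "rome": {0},
    "california": {3},
}
all_docs = {0, 1, 2, 3}


def docs(term):
    """Return the posting set of a term (empty if the term is unknown)."""
    return index.get(term, set())


# conference | workshop: documents containing any of the words.
or_result = docs("conference") | docs("workshop")

# conference & ! workshop: negation is the complement within the collection.
and_not_result = docs("conference") & (all_docs - docs("workshop"))

# university & (rome | california): parentheses fix the evaluation order.
nested_result = docs("university") & (docs("rome") | docs("california"))
```

A real engine evaluates these operators lazily over compressed posting lists rather than materializing whole sets; the sketch only shows the semantics.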

Query (3)
Proximity restriction: the words must appear within a limited portion of the document.
  Example: (university rome)~6
Phrase: using double quotes, we can look for documents that contain the exact phrase.
  Example: "university of rome la sapienza"
Ordered AND: two or more terms separated by <. The terms must appear in the given order.
  Example: computer < science < department
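These three operators all depend on word positions. The sketch below illustrates one possible reading of their semantics on a single tokenized document; it is not how MG4J evaluates them. The function names are hypothetical, and the proximity check assumes that "within N" means all terms fit in a window of N consecutive words.

```python
from itertools import product


def positions(tokens):
    """Map each term to the sorted list of its positions in the document."""
    pos = {}
    for i, tok in enumerate(tokens):
        pos.setdefault(tok, []).append(i)
    return pos


def within(pos, terms, width):
    """Proximity: one occurrence of every term fits in a window of `width` words."""
    lists = [pos.get(t, []) for t in terms]
    return any(max(c) - min(c) < width for c in product(*lists))


def phrase(pos, terms):
    """Phrase: the terms occur at consecutive positions, in order."""
    return any(
        all(p + k in pos.get(t, []) for k, t in enumerate(terms))
        for p in pos.get(terms[0], [])
    )


def ordered(pos, terms):
    """Ordered AND: the terms occur in the given order, with gaps allowed."""
    last = -1
    for t in terms:
        later = [p for p in pos.get(t, []) if p > last]
        if not later:
            return False
        last = later[0]  # earliest usable occurrence keeps the most room
    return True


doc = "the department of computer science of the university of rome".split()
pos = positions(doc)
```

With this document, `within(pos, ["university", "rome"], 6)` holds, while `ordered(pos, ["computer", "science", "department"])` does not, because "department" appears before "computer".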

Query (4)
Wildcard (*): wildcard queries can be submitted by appending * to the end of a term.
  Example: infor*
Index specifiers: prefixing a query with the name of an index followed by : restricts the search to that index.
  Example: title:computer
  Example: text:computer science AND title:FOCS

Sophisticated queries (1)
MG4J provides sophisticated query tuning.
To use these features, we must use the command line interface.
$: shows help on the available options.
Some examples:
$mode: chooses the kind of results.
  Example: > $mode short
$selector: chooses the way snippets or intervals are shown.
  Example: > $selector 3 40

Sophisticated queries (2)
Other examples:
$mplex: when multiplexing is on, each query is multiplexed to all indices. When a scorer is used, it is a good idea to turn multiplexing on.
  Example: > $mplex on
$score: chooses the scorer.
  Example: > $score VignaScorer
$weight: changes the weight of the indices. This is useful when multiplexing is on.
  Example: > $weight text:1 title:3

Scorer (1)
Scorers determine how the documents returned by a query are ranked.
Default: BM25Scorer and VignaScorer.
ConstantScorer: each document has a constant score (default is 0).
  > $score ConstantScorer
CountScorer: the score is the product of the number of occurrences of the term in the document and the weight assigned to the index.
  > $score CountScorer
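The count-based score described above can be sketched in a few lines of Python. This is an illustration, not MG4J's CountScorer: the field names, the sample document, and the choice to sum the weighted counts over all indices are assumptions made for the example (the weights mirror the `$weight text:1 title:3` setting).

```python
def count_score(doc_fields, term, weights):
    """Sum, over the weighted fields, of (occurrences of term) * (field weight)."""
    return sum(
        doc_fields[field].count(term) * weight
        for field, weight in weights.items()
        if field in doc_fields
    )


# Hypothetical document with two indexed fields, already tokenized.
doc = {
    "title": ["computer", "science"],
    "text": "the computer lab has one computer".split(),
}
weights = {"text": 1, "title": 3}

score = count_score(doc, "computer", weights)  # 2 * 1 + 1 * 3
```

Raising the title weight makes a single title match count more than several matches in the body, which is the usual motivation for weighting indices.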

Scorer (2)
TfIdfScorer: implements TF/IDF scoring.
TF is the term frequency of the term t in the document d: c/l, where c is the number of occurrences of t in d and l is the length of d.
IDF is the inverse document frequency of the term t in the collection: log(N/f), where N is the number of documents in the collection and f is the number of documents in which t appears.
  > $score TfIdfScorer
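The two formulas above can be checked on a toy collection. The sketch below is not MG4J's TfIdfScorer; it assumes the score is the product TF × IDF, uses the natural logarithm, and returns 0 for terms that appear in no document. The collection is hypothetical.

```python
import math


def tf(term, doc):
    """Term frequency c / l: occurrences of the term over document length."""
    return doc.count(term) / len(doc)


def idf(term, collection):
    """Inverse document frequency log(N / f); 0.0 if the term never appears."""
    n = len(collection)
    f = sum(1 for d in collection if term in d)
    return math.log(n / f) if f else 0.0


def tf_idf(term, doc, collection):
    """Score of a term for a document, assumed here to be TF * IDF."""
    return tf(term, doc) * idf(term, collection)


collection = [
    ["computer", "science", "department"],
    ["computer", "lab"],
    ["rome", "university"],
]

score = tf_idf("computer", collection[1], collection)  # 0.5 * log(3 / 2)
```

A term that appears in every document gets IDF log(1) = 0, so it contributes nothing to the ranking, which matches the intuition that very common words do not discriminate between documents.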

Scorer (3)
DocumentRankScorer: the scores of the documents are stored in a text file.
  > $score DocumentRankScorer nameFile

Virtual fields (1)
A virtual field produces pieces of text that refer to other documents (possibly belonging to the collection).
Referrer: the document that refers to another document.
Referee: the document to which a piece of text of the Referrer refers.
Intuitively, the Referrer gives us information about the Referee.
In a virtual field, the Referrer produces a number of fragments of text, each referring to a Referee.
The content of a virtual field is a list of pairs, each made of a piece of text (called the virtual fragment) and a string that represents the Referee (called the document spec).

Virtual fields (2)
In the case of an HTML document:
  the document spec is a URL (as specified in the href attribute)
  the virtual fragment is the content of the anchor element and some surrounding text (the anchor context)
The HtmlDocumentFactory produces the pairs (document spec, virtual fragment).

Virtual fields (3)
Create the list of URLs of the documents in the collection:
  java it.unimi.di.big.mg4j.tool.ScanMetadata -S dis.collection -u dis.urls
Create the document resolver. It maps the document specs produced by a document factory into actual references to documents in the collection.
Given a document spec, the resolver decides whether the spec really refers to a document in the collection and, if so, finds out which document it refers to:
  java it.unimi.di.big.mg4j.tool.URLMPHVirtualDocumentResolver -o dis.urls dis-anchor.resolver
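The role of the resolver can be illustrated with a toy Python sketch. It is not the URLMPHVirtualDocumentResolver: a plain dictionary stands in for the resolver, and the URLs, the pairs, and the names `resolve` and `virtual_text` are hypothetical.

```python
# URL list in collection order (hypothetical stand-in for dis.urls).
urls = [
    "http://example.org/a.html",
    "http://example.org/b.html",
    "http://example.org/c.html",
]
resolver = {url: i for i, url in enumerate(urls)}


def resolve(spec):
    """Return the index of the referee, or -1 if it is not in the collection."""
    return resolver.get(spec, -1)


# (document spec, virtual fragment) pairs as a factory might produce them.
pairs = [
    ("http://example.org/b.html", "conference page"),
    ("http://elsewhere.example/x.html", "external site"),
    ("http://example.org/b.html", "call for papers"),
]

# Group the fragments by the document they refer to, dropping unresolved specs.
virtual_text = {}
for spec, fragment in pairs:
    doc = resolve(spec)
    if doc != -1:
        virtual_text.setdefault(doc, []).append(fragment)
```

Here both fragments pointing at `b.html` end up attached to document 1, while the link to a page outside the collection is discarded.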

Virtual fields (4)
Building the index:
  java it.unimi.di.big.mg4j.tool.IndexBuilder -a -v anchor:dis-anchor.resolver --downcase -S dis.collection dis
Querying the index:
  java it.unimi.di.big.mg4j.query.Query -h -i FileSystemItem -c dis.collection dis-text dis-title dis-anchor
  {text, title, anchor} > anchor:conference
  {text, title, anchor} > title:combinatorial algorithms AND anchor:conference
  {text, title, anchor} > text:RoboCup AND anchor:info

Virtual gap (1)
All the virtual fragments that refer to a given document of the collection are treated as a single text, called the virtual text.
Virtual fragments coming from different anchors are concatenated into a text file.
This may produce false positives.
For example, the query anchor:(computer AND science) returns documents whose anchors contain both words, but not necessarily in the same anchor.

Virtual gap (2)
To avoid this kind of false positive, we can use virtual gaps.
The virtual gap is a positive integer representing the virtual space left between different virtual fragments.
For example, if the virtual gap is 64 (the default), anchors are concatenated by leaving 64 "empty words" between subsequent fragments.
We can submit the query:
  > anchor:(computer AND science)~64
and be sure that only documents containing both terms in the same anchor are retrieved.
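The effect of the gap can be simulated in Python. The sketch below is not MG4J code: it assigns word positions to concatenated fragments, skips `gap` positions between consecutive fragments, and assumes that a proximity restriction of width N matches two terms whose positions differ by less than N. The function names and fragments are hypothetical.

```python
def virtual_positions(fragments, gap):
    """Assign word positions to concatenated fragments, leaving `gap`
    empty positions between consecutive fragments."""
    pos, offset = {}, 0
    for fragment in fragments:
        words = fragment.split()
        for i, word in enumerate(words):
            pos.setdefault(word, []).append(offset + i)
        offset += len(words) + gap
    return pos


def both_within(pos, a, b, width):
    """True if some occurrences of a and b are less than `width` positions apart."""
    return any(abs(p - q) < width for p in pos.get(a, []) for q in pos.get(b, []))


# Two anchors pointing at the same document, each with only one of the words.
split_anchors = ["computer lab", "science museum"]
no_gap = virtual_positions(split_anchors, 0)
with_gap = virtual_positions(split_anchors, 64)

# One anchor containing both words.
same_anchor = virtual_positions(["computer science department", "rome"], 64)
```

Without a gap the two anchors look like adjacent text and the proximity query matches; with a gap of 64 the words from different anchors are too far apart, while words in the same anchor still match.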

Virtual gap (3)
If an anchor is longer than 64 words, we can still get false positives.
In the indexing phase, it is possible to specify a different virtual gap. For example:
  java it.unimi.di.big.mg4j.tool.IndexBuilder -a -g anchor:100 -v anchor:dis-anchor.resolver --downcase -S dis.collection dis
This sets the virtual gap of the anchor field to 100.

Term map (1)
A simple representation of a dictionary is the term list (the .terms file): a text file containing the whole dictionary, one term per line, in index order (the first line contains the term with index 0, the second line the term with index 1, and so on).
A more efficient representation is based on a monotone minimal perfect hash function: a very compact data structure that can answer the question "What is the index of the term XXX?"
You can build such a function from a sorted term list using:
  java it.unimi.dsi.sux4j.mph.MinimalPerfectHashFunction titles.mph dis-title.terms

Term map (2)
Monotone minimal perfect hash functions have a serious limitation: they answer the question "What is the index of the term XXX?" correctly only if the term appears in the dictionary.
To solve this problem, we can use a signed function.
For terms not in the dictionary, the function answers with a special value (-1) that means "the word is not in the dictionary".
  java it.unimi.dsi.util.ShiftAddXorSignedStringMap titles.mph titles.map mycollection-title.terms
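The idea of signing a hash function can be shown with a toy Python sketch. It is not the Sux4J or DSI Utilities implementation: the minimal perfect hash is found by brute-force seed search (feasible only for a handful of terms), the signature is a 32-bit hash, and the class and method names are hypothetical. Note that the structure stores no terms, only slots and signatures.

```python
import hashlib


def h(term, seed):
    """64-bit hash of a term under a given seed."""
    data = f"{seed}:{term}".encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


class SignedTermMap:
    def __init__(self, terms):
        self.n = len(terms)
        # Toy minimal perfect hash: search for a seed with no collisions.
        self.seed = next(
            s for s in range(1_000_000)
            if len({h(t, s) % self.n for t in terms}) == self.n
        )
        self.index = [0] * self.n      # slot -> index in the term list
        self.signature = [0] * self.n  # slot -> 32-bit signature of its term
        for i, term in enumerate(terms):
            slot = h(term, self.seed) % self.n
            self.index[slot] = i
            self.signature[slot] = h(term, "sig") & 0xFFFFFFFF

    def unsigned_get(self, term):
        """Always returns some valid index, even for unknown terms."""
        return self.index[h(term, self.seed) % self.n]

    def get(self, term):
        """Returns the term's index, or -1 if the signature does not match."""
        slot = h(term, self.seed) % self.n
        if self.signature[slot] == h(term, "sig") & 0xFFFFFFFF:
            return self.index[slot]
        return -1


terms = ["computer", "department", "rome", "science", "university"]
term_map = SignedTermMap(terms)
```

The unsigned lookup returns a meaningless but valid index for a word such as "zebra", whereas the signed lookup returns -1 except in the very unlikely case of a signature collision.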

Term map (3)
Wildcard searches require the use of a prefix map.
A prefix map can answer the question "What are the indices of the terms starting with the characters YYY?"
If the terms are lexicographically sorted, the answer is a pair of integers representing the first and the last index of the terms satisfying the property.
We can build a prefix map by using:
  java it.unimi.dsi.util.ImmutableExternalPrefixMap -b4Ki -o dis-title.terms dis-title.dict
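On a sorted term list, the pair of indices can be found with two binary searches. The Python sketch below illustrates this; it is not how ImmutableExternalPrefixMap is implemented. The function name and the term list are hypothetical, and the upper bound assumes that no term contains the code point U+10FFFF.

```python
from bisect import bisect_left


def prefix_range(sorted_terms, prefix):
    """Return (first, last) indices of the terms starting with `prefix`,
    or None if no term has that prefix."""
    first = bisect_left(sorted_terms, prefix)
    # Smallest index whose term sorts after every string with this prefix.
    end = bisect_left(sorted_terms, prefix + "\U0010ffff")
    return (first, end - 1) if first < end else None


# Lexicographically sorted dictionary (hypothetical).
terms = ["inform", "information", "informed", "infrared", "rome"]

infor_range = prefix_range(terms, "infor")  # matches the wildcard query infor*
```

Because the matching terms are contiguous in a sorted list, a wildcard query such as `infor*` reduces to the union of the posting lists of the terms in that index range.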

Homework
1. Read the MG4J (big) manual: -mg4j.pdf
2. Repeat the exercise.
3. Create your own document collection, build the inverted index (with or without virtual fields), then submit some queries and try the different scorers.