IR Homework #1, by J. H. Wang, Mar. 21, 2014. Programming Exercise #1: Vector Space Retrieval

Presentation transcript:

IR Homework #1 By J. H. Wang Mar. 21, 2014

Programming Exercise #1: Vector Space Retrieval
Goal: to build an inverted index for a text collection, and to search for relevant documents for a given query
Input: a set of text documents, and a user query
Output: relevant documents in a ranked list
Tools: either use open source tools or write your own code in any programming language

Major Tasks
Indexing
  – Given a set of text documents, build an inverted index
Searching
  – Given a user query, find the most relevant documents in a ranked list

Steps in Vector Space Retrieval (IR, Spring 2014, NTUT CSIE)
[Figure: diagram of the steps in vector space retrieval, with two steps marked 1 and 2]

Some Open Source Tools
Apache Lucene/Solr (in Java)
The Lemur Project: Indri (in C++), Galago (in Java) – by CMU/UMass
Terrier – by U. Glasgow (in Java)
…

Input 1: the Test Collection
ClueWeb09 dataset
  – 1,040,809,705 Web pages in 10 languages, in Jan.-Feb.
  – 5TB, compressed (25TB, uncompressed)
  – File format: WARC (Web ARChive file format)
Sample Files: index.php?page=Sample+Files
  – Each file contains about 40,000 Web pages, in 1GB
  – Each team will be randomly allocated different files!

Other Test Collections
Reuters-RCV1 (in the textbook)
  – About 810,000 English news stories from 1996/08/20 to 1997/08/19 (2.5GB uncompressed)
  – Requires signing agreements
Reuters-21578: s/reuters21578/
  – 21,578 news articles in 1987 (28.0MB uncompressed)
Test collections held at University of Glasgow: ections/
  – LISA, NPL, CACM, CISI, Cranfield, Time, Medline, ADI
  – Ex: The Time Collection: 423 documents (1.5MB)

Indexing: Building Inverted Index
E.g.: Using the standard positional index as the format (Chap. 1 & 2):
  – Dictionary file: a sorted list of vocabulary terms (one per line)
  – Postings list: for each term, a list of occurrences in the original text
      term_i, df_i: <doc1, tf_i1: <pos1, pos2, …>; doc2, tf_i2: <pos1, pos2, …>; …>
      (as in Fig. 2.11, Sec. 2.4, p.38)
  – df_i: document frequency of term i
  – tf_ij: term frequency of term i in doc j
  – Ex: to, …: <…: <…>; 2, 5: <…>; …> …
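
The positional index format above can be illustrated with a minimal Python sketch. It is only one possible approach, not the reference solution for the homework: the function names `build_positional_index` and `format_entry`, the whitespace tokenization, and the two toy documents are all assumptions made for illustration.

```python
from collections import defaultdict


def build_positional_index(docs):
    """Map each term to {doc_id: [token positions]} for a dict of doc_id -> text."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        # Assumption: naive lowercasing and whitespace tokenization.
        for pos, token in enumerate(text.lower().split(), start=1):
            index[token].setdefault(doc_id, []).append(pos)
    return index


def format_entry(term, postings):
    """Render one term as: term, df: <doc, tf: <pos, ...>; doc, tf: <pos, ...>>."""
    parts = [
        f"{doc_id}, {len(positions)}: <{', '.join(map(str, positions))}>"
        for doc_id, positions in sorted(postings.items())
    ]
    return f"{term}, {len(postings)}: <{'; '.join(parts)}>"


# Toy collection (hypothetical data, not from ClueWeb09).
docs = {1: "to be or not to be", 2: "to do is to be"}
index = build_positional_index(docs)

# The dictionary is the sorted vocabulary; each line shows df and the postings.
for term in sorted(index):
    print(format_entry(term, index[term]))
```

Here df_i is the number of documents in a term's postings and tf_ij is the length of the position list for document j, so neither needs to be stored separately in this in-memory form.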

Design Issues
pos means the token positions in the body of documents
  – This can facilitate easier implementation in the following steps, e.g., proximity search
You can design different index formats, as long as
  – The necessary information can be accessed for ranking
      Dictionary: terms t_i and the corresponding document frequency df_i
      Postings: (DocID, term frequency tf_ij, Loc) for each term
Preprocessing should be handled with care
  – Different formats for different collections
  – Digits, hyphens, punctuation marks, …
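
As a rough illustration of why storing token positions helps with proximity search, the sketch below checks whether two terms occur within k tokens of each other in the same document. The index layout ({term: {doc_id: [positions]}}), the function names, and the sample data are assumptions for illustration only.

```python
def within_k(positions_a, positions_b, k):
    """Return True if some position in a and some position in b differ by at most k.

    Both lists must be sorted in ascending order.
    """
    i = j = 0
    while i < len(positions_a) and j < len(positions_b):
        if abs(positions_a[i] - positions_b[j]) <= k:
            return True
        # Advance the pointer at the smaller position to shrink the gap.
        if positions_a[i] < positions_b[j]:
            i += 1
        else:
            j += 1
    return False


def proximity_search(index, term_a, term_b, k):
    """Return sorted doc IDs where term_a and term_b occur within k tokens."""
    docs_a = index.get(term_a, {})
    docs_b = index.get(term_b, {})
    common = docs_a.keys() & docs_b.keys()
    return sorted(d for d in common if within_k(docs_a[d], docs_b[d], k))


# Hypothetical positional index: term -> {doc_id: sorted token positions}.
index = {
    "open": {1: [3, 40], 2: [10]},
    "source": {1: [4], 2: [90]},
}
print(proximity_search(index, "open", "source", 1))
```

With k = 1 only document 1 qualifies, since "open" at position 3 is adjacent to "source" at position 4.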

Optional Functionality
Efficiency issues
  – A separate data structure (e.g. trie) can be used to store the vocabularies and postings in your indexer
  – Skip pointers
Tokenization
  – Case folding
  – Stopword removal
  – Stemming
  – Able to be turned on/off by a parameter trigger
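
One way to make the tokenization options switchable by parameters is sketched below. The stopword list is a toy example and `naive_stem` is a crude suffix stripper standing in for a real stemmer (such as Porter's); both, along with the parameter names, are assumptions rather than part of the assignment.

```python
import re

# Toy stopword list (assumption); a real system would use a fuller list.
STOPWORDS = frozenset({"a", "an", "and", "of", "or", "the", "to"})


def naive_stem(token):
    """Crude suffix stripping, used here only as a placeholder for a real stemmer."""
    for suffix in ("ing", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token


def tokenize(text, case_folding=True, remove_stopwords=True, stemming=True):
    """Split text into alphanumeric tokens; each step can be switched on or off."""
    tokens = re.findall(r"[A-Za-z0-9]+", text)
    if case_folding:
        tokens = [t.lower() for t in tokens]
    if remove_stopwords:
        tokens = [t for t in tokens if t.lower() not in STOPWORDS]
    if stemming:
        tokens = [naive_stem(t) for t in tokens]
    return tokens


print(tokenize("The engines are indexing pages"))
print(tokenize("The engines are indexing pages", stemming=False))
```

The same `tokenize` function should be applied to both documents and queries, so that the terms produced at query time match the terms stored in the index.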

Input 2: User Query
Simple queries
  – Single keywords
      Ex: Tucson, Microsoft, …
  – Free text with multiple words
      Ex: United States, Mount Carmel, …
  – Simple Boolean search
      Ex: open source AND Linux, software engineer OR project manager, …
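
For the simple Boolean queries above, AND and OR reduce to intersecting or merging sorted postings lists. The following is a minimal sketch over single-term operands; the postings data and function names are hypothetical, and phrase operands such as "open source" would additionally need the positional information in the index.

```python
def intersect(p1, p2):
    """Boolean AND: linear merge of two sorted doc ID lists."""
    result = []
    i = j = 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            result.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return result


def union(p1, p2):
    """Boolean OR: all doc IDs from either list, sorted without duplicates."""
    return sorted(set(p1) | set(p2))


# Hypothetical postings: term -> sorted list of doc IDs.
postings = {"linux": [1, 3, 5, 8], "source": [2, 3, 8, 9]}

print(intersect(postings["source"], postings["linux"]))  # source AND linux
print(union(postings["source"], postings["linux"]))      # source OR linux
```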

Output: Ranked List
A ranked list of search results from the ClueWeb09 collection
  – Ranking: vector space model
      Term weighting scheme: TF-IDF
      Similarity estimation: cosine similarity between query and document vectors

Searching: Scoring and Ranking Documents
Vector space model
  – Term weighting: TF-IDF
      $w_{ij} = (1 + \log tf_{ij}) \times \log(N / df_i)$, where N is the number of documents in the collection
  – Similarity estimation: cosine similarity between the query vector q and the document vectors d_j
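
The weighting and ranking steps can be sketched as follows. This is an illustrative sketch under several assumptions: natural logarithms are used (the slide does not fix the base), the query is weighted with the same formula as the documents, documents are given as pre-tokenized lists, and the function names and toy data are hypothetical.

```python
import math
from collections import Counter


def tf_idf_vector(tokens, df, n_docs):
    """Weights w = (1 + log tf) * log(N / df); terms unseen in the collection are skipped."""
    vector = {}
    for term, tf in Counter(tokens).items():
        if df.get(term, 0) > 0:
            vector[term] = (1 + math.log(tf)) * math.log(n_docs / df[term])
    return vector


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(term, 0.0) for term, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank(query_tokens, docs):
    """Return (doc_id, score) pairs sorted by decreasing cosine similarity."""
    n_docs = len(docs)
    df = Counter(term for tokens in docs.values() for term in set(tokens))
    q_vec = tf_idf_vector(query_tokens, df, n_docs)
    scores = {
        doc_id: cosine(q_vec, tf_idf_vector(tokens, df, n_docs))
        for doc_id, tokens in docs.items()
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Toy pre-tokenized collection (hypothetical data).
docs = {
    "d1": ["hong", "kong", "harbour"],
    "d2": ["hong", "kong", "hong", "kong", "airport"],
    "d3": ["tokyo", "airport"],
}
for doc_id, score in rank(["hong", "kong"], docs):
    print(doc_id, round(score, 4))
```

In this toy collection d2 ranks first because the query terms dominate its vector, d1 follows because its rarer term "harbour" carries more weight, and d3 scores zero because it shares no term with the query. A real system would compute these scores from the inverted index rather than rebuilding every document vector per query.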

Example Output
Ex:
  – Query: “Hong Kong”
  – Result: E.g.: …

Optional Features
Optional functionalities
  – Better user interface for search
  – Complex queries: phrase, wildcard, substring, proximity search, combinations of Boolean operators, … (Ch.2 & 3)
  – Query processing: spelling correction, phonetic correction, … (Ch.3)
  – Different term weighting schemes: variants of TF-IDF, … (Ch.6)
  – Inexact top-k retrieval: index elimination, champion lists, impact ordering, tiered index, … (Ch.7)
  – Able to be turned on/off by a parameter trigger
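
As one example of the optional query-processing features, spelling correction can be approximated by suggesting vocabulary terms within a small edit distance of the query word. The sketch below uses plain Levenshtein distance; the function names, the distance threshold, and the sample vocabulary are assumptions, and a practical system would avoid scanning the whole vocabulary (for example by using a k-gram index).

```python
def edit_distance(a, b):
    """Levenshtein distance between strings a and b (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        curr = [i]
        for j, char_b in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,                        # delete char_a
                curr[j - 1] + 1,                    # insert char_b
                prev[j - 1] + (char_a != char_b),   # substitute or match
            ))
        prev = curr
    return prev[-1]


def suggest(word, vocabulary, max_distance=2):
    """Return vocabulary terms within max_distance edits, closest first."""
    candidates = sorted((edit_distance(word, term), term) for term in vocabulary)
    return [term for distance, term in candidates if distance <= max_distance]


# Hypothetical vocabulary drawn from an index dictionary.
vocabulary = ["tucson", "texas", "microsoft"]
print(suggest("tuscon", vocabulary))
```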

Submission
Your submission *should* include
  – The source code (or your configuration of the installed open source tool)
  – A one-page description that includes the following
      Major features in your work (ex: high efficiency, low storage, multiple input formats, huge corpus, …)
      Instructions for compilation/execution environments (ex: Java Runtime Environment, special compilers, …)
      Major difficulties encountered
      Team members list: the names and the responsible parts of each individual member should be clearly identified
Due: three weeks (Apr. 18, 2014)

Submission Instructions
Programs or homework in electronic files must be submitted directly on the submission site:
  – Submission site:
      FTP server: localhost
      User name & password: your student ID
  – Preparing your submission file: as one single compressed file
      Remember to specify the names and student IDs of your team members in the files and documentation
  – If you cannot successfully submit your work, please contact the TA (Mr. R1424, Technology Building)
      Available time: Mon. morning or Tue. afternoon
      E-mail: …gmail.com

Evaluation
Minimum requirement: correctness for simple queries in vector space retrieval
  – Using the (partial) ClueWeb09 test collection and some sample queries as the input, the ranked list of documents retrieved by your system will be checked
  – Optional features will be considered as bonus
You might be required to give a demo if the TA is unable to compile or run the submitted program

Any Questions or Comments?