IR Homework #1 By J. H. Wang Mar. 25, 2009

Programming Exercise #1: Indexing
Goal: to build an index for a text collection using inverted files
Input: a set of text documents
–(to be described later)
Output: inverted index files
–(exact format to be described later)

Input: the Test Collection
Reuters-RCV1:
–About 810,000 English news stories from 1996/08/20 to 1997/08/19 (2.5GB uncompressed)
–Requires signed agreements
Reuters-21578: s/reuters21578/
–21,578 news stories from 1987 (28.0MB uncompressed)
Test collections held at the University of Glasgow: ections/
–LISA, NPL, CACM, CISI, Cranfield, Time, Medline, ADI
–Ex: the Time collection: 423 documents (1.5MB)

Output: Inverted Index
Using the standard inverted index (Chap. 1 & 2)
Output format:
–Dictionary file: a sorted list of vocabulary terms (one per line)
–Postings list: for each word, a list of occurrences in the original text
term_i, df_i: <doc1, tf_i1: <pos1, pos2, …>; doc2, tf_i2: <pos1, pos2, …>; …> (as in Fig. 2.11, Sec. 2.4)
–df_i: document frequency of term i
–tf_ij: term frequency of term i in doc j
–Ex: to, …: <…; 2, 5: <…>; …>
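As an illustration of the two output files, here is a minimal sketch in Python. It assumes each postings line has the shape `term, df: <doc, tf: <pos, …>; …>` in the style of Fig. 2.11, that tokens are split on whitespace, and that positions start at 1. The function names and the two toy documents are hypothetical and not part of the assignment.

```python
from collections import defaultdict


def build_positional_index(docs):
    """Map each term to {doc_id: [token positions]}; positions start at 1."""
    index = defaultdict(dict)
    for doc_id in sorted(docs):
        # Whitespace tokenization is an assumption made for this sketch.
        for pos, term in enumerate(docs[doc_id].split(), start=1):
            index[term].setdefault(doc_id, []).append(pos)
    return index


def format_postings(term, postings):
    """Render one line as: term, df: <doc, tf: <pos, ...>; doc, tf: <pos, ...>>"""
    entries = []
    for doc_id in sorted(postings):
        positions = postings[doc_id]
        pos_text = ", ".join(str(p) for p in positions)
        # tf is the number of positions recorded for this term in this document.
        entries.append(f"{doc_id}, {len(positions)}: <{pos_text}>")
    # df is the number of documents that contain the term.
    return f"{term}, {len(postings)}: <{'; '.join(entries)}>"


def dictionary_lines(index):
    """Sorted vocabulary, one term per line of the dictionary file."""
    return sorted(index)


# Toy collection (hypothetical data).
docs = {1: "to be or not to be", 2: "to do is to be"}
index = build_positional_index(docs)
line_to = format_postings("to", index["to"])
print("\n".join(dictionary_lines(index)))
print(line_to)
```

Writing the dictionary and the postings to separate files is then a matter of iterating over `dictionary_lines(index)` and calling `format_postings` for each term.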

Implementation Issues
Note: pos means the token position in the body of a document
–This simplifies later steps after indexing, such as proximity search
Document preprocessing should be handled with care
–Different collections use different formats
–Digits, hyphens, punctuation marks, …
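The following sketch shows one possible way to handle digits, hyphens, and punctuation during tokenization, and how stored positions can support a simple proximity check later. The tokenization rule (keep digit strings and hyphenated words as single tokens, drop other punctuation) is an assumption chosen for illustration, and the names `TOKEN_RE`, `tokenize_with_positions`, and `within_k` are hypothetical.

```python
import re

# Assumed rule: a token is a run of letters/digits, optionally joined by hyphens.
TOKEN_RE = re.compile(r"[A-Za-z0-9]+(?:-[A-Za-z0-9]+)*")


def tokenize_with_positions(text):
    """Return (position, token) pairs; positions start at 1."""
    return list(enumerate(TOKEN_RE.findall(text), start=1))


def within_k(positions_a, positions_b, k):
    """True if some position in each sorted list lies at most k tokens apart."""
    i = j = 0
    while i < len(positions_a) and j < len(positions_b):
        if abs(positions_a[i] - positions_b[j]) <= k:
            return True
        # Advance the pointer at the smaller position; it cannot match anything later.
        if positions_a[i] < positions_b[j]:
            i += 1
        else:
            j += 1
    return False


tokens = tokenize_with_positions("The state-of-the-art index, built in 1987, works.")
print(tokens)
print(within_k([1, 5], [7], 3))
```

Other choices are equally defensible, for example splitting hyphenated words or discarding digits; what matters is that the same rule is applied consistently across the collection.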

Implementation Issues
You can use a separate in-memory data structure (e.g. a trie, for efficient lookup) to store the vocabulary and occurrences and speed up indexing, but the output must be in the designated format
Optional functionality
–Case folding
–Stopword removal
–Stemming
–Each should be switchable on or off by a parameter
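One way to make the optional steps switchable by a parameter is sketched below with Python's standard `argparse` module. The flag names, the tiny stopword list, and `naive_stem` are all hypothetical; in particular, `naive_stem` is only a placeholder suffix stripper and not a real stemming algorithm such as Porter's.

```python
import argparse

# Tiny illustrative stopword list (an assumption, not an official list).
STOPWORDS = frozenset({"a", "an", "and", "in", "of", "the", "to"})


def naive_stem(token):
    """Placeholder suffix stripper; NOT the Porter algorithm."""
    for suffix in ("ing", "es", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token


def preprocess(tokens, case_folding=True, remove_stopwords=True, stemming=True):
    """Return (position, term) pairs; positions refer to the original token stream."""
    result = []
    for pos, tok in enumerate(tokens, start=1):
        if case_folding:
            tok = tok.lower()
        # Stopwords are matched case-insensitively even when case folding is off.
        if remove_stopwords and tok.lower() in STOPWORDS:
            continue
        if stemming:
            tok = naive_stem(tok)
        result.append((pos, tok))
    return result


def parse_args(argv=None):
    """Each optional step is on by default and can be turned off by a flag."""
    parser = argparse.ArgumentParser(description="Inverted index builder (sketch)")
    parser.add_argument("--no-case-folding", action="store_true")
    parser.add_argument("--no-stopword-removal", action="store_true")
    parser.add_argument("--no-stemming", action="store_true")
    return parser.parse_args(argv)


args = parse_args(["--no-stemming"])
terms = preprocess(
    ["The", "Indexing", "of", "News", "Stories"],
    case_folding=not args.no_case_folding,
    remove_stopwords=not args.no_stopword_removal,
    stemming=not args.no_stemming,
)
print(terms)
```

Positions are kept from the original token stream, so removing stopwords leaves gaps rather than renumbering the remaining terms. Renumbering is also a valid design; whichever is chosen should be applied consistently.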

Submission
Your submission *should* include
–The source code (and optionally your executable file)
–A one-page description that includes the following
  –Major features of your work (ex: high efficiency, low storage, multiple input formats, huge corpus, …)
  –Major difficulties encountered
  –Special requirements for the execution environment (ex: Java Runtime Environment, special compilers, …)
  –For team work, the name and responsibilities of each member, clearly identified
Due: extended to three weeks (Apr. 1, 2009)

Submission Instructions
Programs or homework in electronic files must be submitted directly to the TA as follows
–Team member list: please send your team member list to the TA (…ntut.edu.tw), even if you're the only team member
–Preparing the submission file: one single compressed file named, for example, IR0901-HW1.ZIP
  –Remember to specify the names and student IDs of your team members in the files and documentation
– or online submission: TBD
–If you cannot submit your work successfully, please contact the TA or the instructor

Evaluation
Minimum requirement: with the Reuters test collection as input, the inverted index generated by your program will be checked
Optional features such as case folding, stemming, and stopword removal will be counted as bonus
You may be required to give a demo if the TA is unable to compile or run the submitted program
Questions?