IR Homework #1 By J. H. Wang Mar. 16, 2015. Programming Exercise #1: Vector Space Retrieval - Indexing Goal: to build an inverted index for a text collection.


1 IR Homework #1 By J. H. Wang Mar. 16, 2015

2 Programming Exercise #1: Vector Space Retrieval - Indexing
Goal: to build an inverted index for a text collection
Input: a set of text documents
Output: inverted index
Tools: either open source tools (libraries, APIs) or your own code in any programming language

3 The Major Task Indexing –Given a set of text documents, build an inverted index

4 Steps in Vector Space Retrieval (IR, Spring 2014, NTUT CSIE)
[Figure: steps in vector space retrieval; labels 1 and 2 appear on the diagram]

5 Some Open Source Tools
Apache Lucene/Solr (in Java)
The Lemur Project (Indri in C++, Galago in Java) – by CMU/UMass
Terrier – by U. Glasgow (in Java)
…

6 Input 1: the Test Collection
ClueWeb09 dataset
–http://lemurproject.org/clueweb09.php/
–1,040,809,705 Web pages in 10 languages, collected in Jan.-Feb. 2009
–5TB compressed (25TB uncompressed)
–File format: WARC (Web ARChive file format): http://www.digitalpreservation.gov/formats/fdd/fdd000236.shtml
Sample files: http://boston.lti.cs.cmu.edu/clueweb09/wiki/tiki-index.php?page=Sample+Files
–Each file is about 1GB and contains about 40,000 Web pages
Each team will be randomly allocated different files!
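As a rough illustration of what reading the WARC input involves, the sketch below splits an uncompressed WARC byte stream into records using only the Python standard library. It assumes each record starts with a `WARC/` version line, followed by `Name: value` header lines, a blank line, and then exactly `Content-Length` bytes of content. Real ClueWeb09 files may deviate from this, so the parser is a starting point rather than a robust reader. The function name `read_warc_records` and the sample record are made up for this example.

```python
import io

def read_warc_records(stream):
    """Yield (headers, content) for each record in an uncompressed WARC byte stream."""
    while True:
        line = stream.readline()
        if not line:
            return                        # end of stream
        if not line.startswith(b"WARC/"):
            continue                      # skip blank separator lines between records
        headers = {}
        while True:
            line = stream.readline()
            if not line or line.strip() == b"":
                break                     # a blank line ends the header block
            name, _, value = line.decode("utf-8", "replace").partition(":")
            headers[name.strip()] = value.strip()
        # Assumption: Content-Length is present and accurate.
        length = int(headers.get("Content-Length", 0))
        yield headers, stream.read(length)

# A hypothetical one-record sample, built in memory for illustration.
http_body = b"HTTP/1.1 200 OK\r\n\r\n<html>hi</html>"
sample = (
    b"WARC/0.18\r\n"
    b"WARC-Type: response\r\n"
    b"WARC-Target-URI: http://example.com/\r\n"
    b"Content-Length: " + str(len(http_body)).encode() + b"\r\n\r\n"
    + http_body + b"\r\n\r\n"
)
records = list(read_warc_records(io.BytesIO(sample)))
for headers, content in records:
    print(headers["WARC-Type"], headers["WARC-Target-URI"], len(content))
```

For a compressed `.warc.gz` file, the stream could instead come from `gzip.open(path, "rb")`. An existing WARC library is likely a safer choice for the full collection.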

7 Web Test Collections
The ClueWeb12 dataset
–a successor to the ClueWeb09 dataset
–http://lemurproject.org/clueweb12.php/
–733,019,372 English Web pages, collected in Feb.-May 2012
–5.5TB compressed (27.3TB uncompressed)
TREC datasets: WT2g, WT10g, .GOV, .GOV2, Blogs06, Blogs08
–http://ir.dcs.gla.ac.uk/test_collections/

8 Previous Test Collections
Test collections held at University of Glasgow: http://www.dcs.gla.ac.uk/idom/ir_resources/test_collections/
–LISA, NPL, CACM, CISI, Cranfield, Time, Medline, ADI
–Ex: The Time Collection: 423 documents (1.5MB)
Reuters-21578: http://www.daviddlewis.com/resources/testcollections/reuters21578/
–21,578 news articles from 1987 (28.0MB uncompressed)
Reuters-RCV1 (in the textbook): http://trec.nist.gov/data/reuters/reuters.html
–About 810,000 English news stories from 1996/08/20 to 1997/08/19 (2.5GB uncompressed)
–Access requires signing agreements

9 Output: Inverted Index
E.g.: Using the standard positional index as the format (Chap. 1 & 2):
–Dictionary file: a sorted list of vocabulary terms (one per line)
–Postings list: for each term, a list of occurrences in the original text
term_i, df_i: <doc1, tf_i1: <pos1, pos2, …>; doc2, tf_i2: <pos1, pos2, …>; …> (as in Fig. 2.11, Sec. 2.4, p.38)
–df_i: document frequency of term i
–tf_ij: term frequency of term i in doc j
to, 993427: <1, 6: <7, 18, 33, 72, 86, 231>; 2, 5: <1, 17, 74, 222, 255>; …>
…
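A minimal sketch of building such a positional index in memory and printing each entry as "term, df: <doc, tf: <positions>; …>" might look like the following. The tokenizer is a deliberate simplification (lowercase, alphanumeric runs only), positions are counted from 1, and the names `tokenize`, `build_positional_index`, and `format_postings` are illustrative choices, not part of the assignment.

```python
import re
from collections import defaultdict

def tokenize(text):
    """Lowercase and keep runs of letters/digits (a simplifying assumption)."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_positional_index(docs):
    """docs maps doc_id -> text. Returns {term: {doc_id: [positions]}}."""
    index = defaultdict(dict)
    for doc_id in sorted(docs):
        for pos, term in enumerate(tokenize(docs[doc_id]), start=1):
            index[term].setdefault(doc_id, []).append(pos)
    return dict(index)

def format_postings(term, postings):
    """Render one entry as: term, df: <doc, tf: <pos, ...>; ...>"""
    parts = []
    for doc_id in sorted(postings):
        positions = postings[doc_id]
        pos_text = ", ".join(str(p) for p in positions)
        parts.append(f"{doc_id}, {len(positions)}: <{pos_text}>")
    return f"{term}, {len(postings)}: <{'; '.join(parts)}>"

# Two toy documents, made up for illustration.
docs = {1: "to be or not to be", 2: "to sleep"}
index = build_positional_index(docs)
for term in sorted(index):  # sorted terms double as the dictionary
    print(format_postings(term, index[term]))
```

Here df is the number of documents in a term's postings and tf is the length of each position list, so neither needs to be stored separately in memory. A real indexer for a 1GB file would also need to write the dictionary and postings to disk.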

10 Design Issues
pos means the token positions in the body of documents
–This makes later steps, e.g., proximity search, easier to implement
You can design different index formats, as long as
–The necessary information can be accessed for ranking
Dictionary: terms t_i and the corresponding document frequency df_i
Postings: (DocID, term frequency tf_ij, Loc) for each term
Preprocessing should be handled with care
–Different formats for different collections
–Digits, hyphens, punctuation marks, …
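To show why stored positions help with proximity search, here is a small sketch that finds documents in which two terms occur within k tokens of each other. It assumes an in-memory index of the form {term: {doc_id: sorted position list}}. That layout and the names `within_k` and `proximity_docs` are assumptions made for this example, and the tiny index is invented.

```python
def within_k(positions_a, positions_b, k):
    """True if some position in a is within k tokens of one in b (both lists sorted)."""
    i = j = 0
    while i < len(positions_a) and j < len(positions_b):
        if abs(positions_a[i] - positions_b[j]) <= k:
            return True
        # Advance the pointer at the smaller position to close the gap.
        if positions_a[i] < positions_b[j]:
            i += 1
        else:
            j += 1
    return False

def proximity_docs(index, term_a, term_b, k):
    """Doc IDs where term_a and term_b appear within k tokens of each other."""
    postings_a = index.get(term_a, {})
    postings_b = index.get(term_b, {})
    common = sorted(set(postings_a) & set(postings_b))
    return [d for d in common if within_k(postings_a[d], postings_b[d], k)]

# Hypothetical index: "vector" and "space" are adjacent only in doc 1.
index = {
    "vector": {1: [3, 40], 2: [7]},
    "space": {1: [4], 2: [90], 3: [1]},
}
print(proximity_docs(index, "vector", "space", 1))
```

Without positions in the postings, the same query would require rescanning the original documents.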

11 Optional Functionality
Efficiency issues
–A separate data structure (e.g. trie) can be used to store the vocabulary and postings in your indexer
–Skip pointers (to be used in query processing)
Tokenization
–Case folding
–Stopword removal
–Stemming
–Each can be turned on/off by a parameter
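One possible way to make the tokenization options switchable is to expose each one as a keyword parameter, as sketched below. The stopword list is a tiny made-up sample, and `crude_stem` is a placeholder suffix stripper, not the Porter algorithm. A real submission would likely substitute a proper stemmer and stopword list.

```python
import re

STOPWORDS = frozenset({"a", "an", "and", "of", "the", "to"})  # tiny illustrative list

def crude_stem(token):
    """Placeholder suffix stripper (an assumption for this sketch, NOT Porter)."""
    for suffix in ("ing", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

def preprocess(text, case_folding=True, remove_stopwords=True, stemming=True):
    """Tokenize text; each optional step can be switched off by its parameter."""
    tokens = re.findall(r"[A-Za-z0-9]+", text)
    if case_folding:
        tokens = [t.lower() for t in tokens]
    if remove_stopwords:
        tokens = [t for t in tokens if t.lower() not in STOPWORDS]
    if stemming:
        tokens = [crude_stem(t) for t in tokens]
    return tokens

print(preprocess("Indexing the Documents"))
print(preprocess("Indexing the Documents", remove_stopwords=False, stemming=False))
```

The same flags could be mapped to command-line options so that one indexer binary can produce indexes with different preprocessing settings.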

12 Submission
Your submission *should* include
–The source code (and the configuration of any extra libraries)
If you use open source tools, please also submit your source code that calls the APIs or libraries
–A one-page document including
Major features, e.g., high efficiency, low storage, multiple input formats, huge corpus, …
Major difficulties encountered
Instructions for the compilation/execution environment (e.g., Java Runtime Environment, special compilers, …)
Team member list: the name and responsible parts of each member should be clearly identified
Due: three weeks (extended to Apr. 6, 2015)

13 Submission Instructions
Programs and related electronic files in your homework must be submitted directly on the submission site:
–Submission site: https://140.124.183.13/
Username: your student ID
Password: (please change it at your first login)
–Preparing your submission file: as one single compressed file
Name your file according to your ID such as _HW1.zip
Remember to specify the names and student IDs of your team members in the files and documentation
If you cannot successfully submit your work, please contact the TA (TBD, @ R1424, Technology Building)

14 Evaluation
Minimum requirement: correctness for sample documents
–The (partial) ClueWeb09 Test Collection will be used as the input, and the inverted index generated by your program will be checked
–Optional features will be considered as bonus
You may be required to give a demo if the TA is unable to compile or run the submitted program

15 Any Questions or Comments?

