IR Homework #1 By J. H. Wang Mar. 21, 2014. Programming Exercise #1: Vector Space Retrieval Goal: to build an inverted index for a text collection, and.


1 IR Homework #1 By J. H. Wang Mar. 21, 2014

2 Programming Exercise #1: Vector Space Retrieval
Goal: to build an inverted index for a text collection, and to search for documents relevant to a given query
Input: a set of text documents and a user query
Output: relevant documents in a ranked list
Tools: either open source tools or your own code in any programming language

3 Major Tasks
Indexing
  – Given a set of text documents, build an inverted index
Searching
  – Given a user query, find the most relevant documents and return them in a ranked list

4 Steps in Vector Space Retrieval (IR, Spring 2014, NTUT CSIE)
[Figure: steps in vector space retrieval, with two steps labeled 1 and 2]

5 Some Open Source Tools
Apache Lucene/Solr (in Java)
The Lemur Project, Indri, Galago – by CMU/UMass (in C++)
Terrier – by U. Glasgow (in Java)
…

6 Input 1: the Test Collection
ClueWeb09 dataset
  – http://lemurproject.org/clueweb09.php/
  – 1,040,809,705 Web pages in 10 languages, collected in Jan.-Feb. 2009
  – 5TB compressed (25TB uncompressed)
  – File format: WARC (Web ARChive file format)
    http://www.digitalpreservation.gov/formats/fdd/fdd000236.shtml
Sample files: http://boston.lti.cs.cmu.edu/clueweb09/wiki/tiki-index.php?page=Sample+Files
  – Each file contains about 40,000 Web pages, in 1GB
Each team will be randomly allocated different files!

7 Other Test Collections
Reuters-RCV1 (in the textbook): http://trec.nist.gov/data/reuters/reuters.html
  – About 810,000 English news stories from 1996/08/20 to 1997/08/19 (2.5GB uncompressed)
  – Requires signing agreements
Reuters-21578: http://www.daviddlewis.com/resources/testcollections/reuters21578/
  – 21,578 news articles from 1987 (28.0MB uncompressed)
Test collections held at University of Glasgow: http://www.dcs.gla.ac.uk/idom/ir_resources/test_collections/
  – LISA, NPL, CACM, CISI, Cranfield, Time, Medline, ADI
  – Ex: The Time Collection: 423 documents (1.5MB)

8 Indexing: Building Inverted Index
E.g.: using the standard positional index as the format (Chap. 1 & 2):
  – Dictionary file: a sorted list of vocabulary terms (one per line)
  – Postings list: for each term, a list of its occurrences in the original text
    term_i, df_i: <doc1, tf_i1: <pos1, pos2, …>; doc2, tf_i2: <pos1, pos2, …>; …>
    (as in Fig. 2.11, Sec. 2.4, p.38)
  – df_i: document frequency of term i
  – tf_ij: term frequency of term i in doc j
Example:
    to, 993427: <1, 6: <7, 18, 33, 72, 86, 231>; 2, 5: <1, 17, 74, 222, 255>; …>
    …
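The positional index format above can be sketched in a few lines of Python. This is a minimal illustration, not the required implementation: the function names `build_positional_index` and `format_postings` are hypothetical, tokenization is assumed to be a plain lowercase whitespace split, and positions are assumed to be counted from 1.

```python
from collections import defaultdict


def build_positional_index(docs):
    """Build {term: {doc_id: [positions]}} from {doc_id: text}.

    Assumption: tokens are produced by lowercasing and splitting on
    whitespace; token positions start at 1.
    """
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for pos, token in enumerate(text.lower().split(), start=1):
            index[token].setdefault(doc_id, []).append(pos)
    return index


def format_postings(term, postings):
    """Render one entry as: term, df: <doc, tf: <pos, ...>; ...>"""
    parts = []
    for doc_id in sorted(postings):
        positions = postings[doc_id]
        pos_str = ", ".join(str(p) for p in positions)
        # tf is the number of positions recorded for this document
        parts.append(f"{doc_id}, {len(positions)}: <{pos_str}>")
    # df is the number of documents that contain the term
    return f"{term}, {len(postings)}: <" + "; ".join(parts) + ">"


docs = {1: "to be or not to be", 2: "to do"}
index = build_positional_index(docs)
for term in sorted(index):  # the dictionary is the sorted term list
    print(format_postings(term, index[term]))
```

Here df is derived from the number of documents in a term's postings and tf from the number of stored positions, so neither needs to be stored separately in memory.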

9 Design Issues
pos means the token position in the body of a document
  – Storing positions makes later steps, e.g., proximity search, easier to implement
You can design a different index format, as long as
  – The information necessary for ranking can be accessed
    Dictionary: terms t_i and the corresponding document frequency df_i
    Postings: (DocID, term frequency tf_ij, Loc) for each term
Preprocessing should be handled with care
  – Different formats for different collections
  – Digits, hyphens, punctuation marks, …
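To illustrate why storing token positions helps, here is a hedged sketch of a proximity search over a positional index. The index layout `{term: {doc_id: [sorted positions]}}` and the names `within_k` and `proximity_search` are assumptions made for this example only.

```python
def within_k(positions_a, positions_b, k):
    """Return True if some position in a is within k tokens of some
    position in b. Both lists are assumed to be sorted ascending."""
    i = j = 0
    while i < len(positions_a) and j < len(positions_b):
        if abs(positions_a[i] - positions_b[j]) <= k:
            return True
        # Advance the pointer at the smaller position to close the gap.
        if positions_a[i] < positions_b[j]:
            i += 1
        else:
            j += 1
    return False


def proximity_search(index, term_a, term_b, k):
    """Return sorted doc IDs where term_a and term_b occur within k tokens."""
    docs_a = index.get(term_a, {})
    docs_b = index.get(term_b, {})
    common = docs_a.keys() & docs_b.keys()
    return sorted(d for d in common if within_k(docs_a[d], docs_b[d], k))


# Hypothetical toy index: {term: {doc_id: [positions]}}
toy_index = {
    "open": {1: [1, 10], 2: [4]},
    "source": {1: [2], 2: [20], 3: [1]},
}
print(proximity_search(toy_index, "open", "source", 1))
```

Without positions in the postings, answering this query would require re-reading the original documents.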

10 Optional Functionality
Efficiency issues
  – A separate data structure (e.g., a trie) can be used to store the vocabulary and postings in your indexer
  – Skip pointers
Tokenization
  – Case folding
  – Stopword removal
  – Stemming
  – Each option should be switchable on/off by a parameter
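One possible way to make the tokenization options switchable is to expose them as keyword parameters. The sketch below is only an illustration: the stopword list is a tiny made-up sample, and `crude_stem` is a naive suffix stripper standing in for a real stemmer such as Porter's, which it does not implement.

```python
import re

# Tiny illustrative stopword list (an assumption, not a standard list).
STOPWORDS = {"a", "an", "and", "the", "of", "to", "in", "is"}


def crude_stem(token):
    """Strip a few common English suffixes.

    This is NOT the Porter stemmer; it is a placeholder so that the
    stemming switch has something to toggle.
    """
    for suffix in ("ing", "ed", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def tokenize(text, case_folding=True, remove_stopwords=False, stemming=False):
    """Split text into alphanumeric tokens, applying the enabled options."""
    tokens = re.findall(r"[A-Za-z0-9]+", text)
    if case_folding:
        tokens = [t.lower() for t in tokens]
    if remove_stopwords:
        tokens = [t for t in tokens if t.lower() not in STOPWORDS]
    if stemming:
        tokens = [crude_stem(t) for t in tokens]
    return tokens


print(tokenize("The engineers are indexing documents"))
print(tokenize("The engineers are indexing documents",
               remove_stopwords=True, stemming=True))
```

Whatever options are enabled at indexing time should also be applied to the query, otherwise query terms may not match dictionary terms.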

11 Input 2: User Query
Simple queries
  – Single keywords
    Ex: Tucson, Microsoft, …
  – Free text with multiple words
    Ex: United States, Mount Carmel, …
  – Simple Boolean search
    Ex: open source AND Linux, software engineer OR project manager, …
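Simple Boolean queries can be answered by merging postings lists. The following sketch assumes each term maps to a sorted list of document IDs; the names `intersect` and `union` and the toy postings are hypothetical, and treating a multi-word operand such as "open source" as an AND of its words is a simplifying assumption rather than something the assignment specifies.

```python
def intersect(p1, p2):
    """AND: merge two sorted doc ID lists, keeping IDs present in both."""
    result = []
    i = j = 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            result.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return result


def union(p1, p2):
    """OR: all doc IDs that appear in either list, in sorted order."""
    return sorted(set(p1) | set(p2))


# Hypothetical postings: term -> sorted list of doc IDs
postings = {
    "open": [1, 3, 4],
    "source": [1, 2, 3],
    "linux": [1, 3, 5],
}

# "open source AND Linux", with "open source" treated as open AND source
open_source = intersect(postings["open"], postings["source"])
print(intersect(open_source, postings["linux"]))
```

The linear merge in `intersect` is the step that skip pointers are designed to speed up.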

12 Output: Ranked List
A ranked list of search results from the ClueWeb09 collection
  – Ranking: vector space model
    Term weighting scheme: TF-IDF
    Similarity estimation: cosine similarity between query and document vectors

13 Searching: Scoring and Ranking Documents
Vector space model
  – Term weighting: TF-IDF
    w_ij = (1 + log tf_ij) * log(N / df_i)
    where N is the total number of documents in the collection
  – Similarity estimation: cosine similarity between the query vector q and each document vector d_j
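The weighting and similarity steps can be sketched as follows. This is a minimal illustration under stated assumptions: logarithms are taken base 10 (the slide does not fix the base), the query is weighted with the same formula as documents, and the function names and toy documents are hypothetical.

```python
import math
from collections import Counter


def tfidf_vector(tokens, df, n_docs):
    """Map each term to w = (1 + log10 tf) * log10(N / df).

    Terms that do not occur in the collection (df == 0) are skipped.
    """
    weights = {}
    for term, tf in Counter(tokens).items():
        if df.get(term, 0) > 0:
            weights[term] = (1 + math.log10(tf)) * math.log10(n_docs / df[term])
    return weights


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v[t] for t, w in u.items() if t in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank(query_tokens, docs_tokens):
    """Return [(doc_id, score)] sorted by descending cosine similarity."""
    n_docs = len(docs_tokens)
    # Document frequency: number of documents containing each term.
    df = Counter(t for tokens in docs_tokens.values() for t in set(tokens))
    q_vec = tfidf_vector(query_tokens, df, n_docs)
    scored = [
        (doc_id, cosine(q_vec, tfidf_vector(tokens, df, n_docs)))
        for doc_id, tokens in docs_tokens.items()
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


docs_tokens = {
    1: ["hong", "kong", "harbour"],
    2: ["hong", "kong", "hong", "kong"],
    3: ["taipei", "city"],
}
ranked = rank(["hong", "kong"], docs_tokens)
for doc_id, score in ranked:
    print(doc_id, round(score, 2))
```

Recomputing every document vector per query is only practical for toy data; a real system would score documents by walking the postings lists of the query terms and use precomputed document norms.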

14 Example Output
Ex:
  – Query: "Hong Kong"
  – Result: E.g.:
      261   0.85
      135   0.67
      324   0.3
      …

15 Optional Features
Optional functionalities
  – Better user interface for search
  – Complex queries: phrase, wildcard, substring, proximity search, combinations of Boolean operators, … (Ch.2 & 3)
  – Query processing: spell-correction, phonetic correction, … (Ch.3)
  – Different term weighting schemes: variants of TF-IDF, … (Ch.6)
  – Inexact top-k retrieval: index elimination, champion lists, impact-ordering, tiered index, … (Ch.7)
  – Each feature should be switchable on/off by a parameter
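As one example of inexact top-k retrieval, champion lists keep only the r documents with the highest term frequency for each term, so that query-time scoring looks at far fewer candidates. The sketch below assumes an index of the form `{term: {doc_id: tf}}`; the function names and the toy data are hypothetical.

```python
import heapq


def build_champion_lists(index, r):
    """For each term, keep the r doc IDs with the highest term frequency."""
    return {
        term: heapq.nlargest(r, postings, key=postings.get)
        for term, postings in index.items()
    }


def candidate_docs(champions, query_terms):
    """Union of the champion lists of all query terms."""
    candidates = set()
    for term in query_terms:
        candidates.update(champions.get(term, []))
    return candidates


# Hypothetical toy index: {term: {doc_id: tf}}
tf_index = {
    "hong": {1: 3, 2: 7, 3: 1, 4: 5},
    "kong": {1: 2, 2: 6, 5: 9},
}
champions = build_champion_lists(tf_index, r=2)
print(champions)
print(sorted(candidate_docs(champions, ["hong", "kong"])))
```

Only the candidate documents are then scored with cosine similarity, which is why the returned top k may differ from the exact answer.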

16 Submission
Your submission *should* include
  – The source code (or the configuration of your installed open source tool)
  – A one-page description that includes the following
    Major features of your work (ex: high efficiency, low storage, multiple input formats, huge corpus, …)
    Instructions for the compilation/execution environment (ex: Java Runtime Environment, special compilers, …)
    Major difficulties encountered
    Team member list: the name and responsible parts of each individual member should be clearly identified
Due: three weeks (Apr. 18, 2014)

17 Submission Instructions
Programs or homework in electronic files must be submitted directly on the submission site:
  – Submission site: http://140.124.183.31/net2ftp
    FTP server: localhost
    User name & password: your student ID
  – Preparing your submission file: as one single compressed file
    Remember to specify the names and student IDs of your team members in the files and documentation
  – If you cannot successfully submit your work, please contact the TA (Mr. Huang, @ R1424, Technology Building)
    Available time: Mon. morning or Tue. afternoon
    E-mail: jsn900211 @ gmail. com

18 Evaluation
Minimum requirement: correctness for simple queries in vector space retrieval
  – Using the (partial) ClueWeb09 test collection and some sample queries as the input, the ranked list of documents retrieved by your system will be checked
  – Optional features will be considered as bonus
You might be required to give a demo if the TA is unable to compile or run the submitted program

19 Any Questions or Comments?

