IR Homework #2 By J. H. Wang Apr. 13, 2011
Programming Exercise #2: Query Processing and Searching
Goal: to search for relevant documents
Input: a query
  – Simple search: keyword, Boolean
Output: a ranked list of search results from the Reuters collection
  – Details are described later
Input: User Query
Simple search
  – Keyword
      Ex: Malaysia, Nuclear, …
  – Free text
      Ex: United Nations, Nuclear Submarine Fleet, …
  – Simple Boolean search
      Ex: Israel OR Pakistan, …
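The three query types above can be handled by one small front end. The following is a minimal sketch, not a required design: the function names (`tokenize`, `build_index`, `parse_query`, `candidates`) are hypothetical, and it assumes that an uppercase `OR` is the only Boolean operator and that free-text queries match any document containing at least one query term.

```python
import re

def tokenize(text):
    """Lowercase the text and split it on non-alphanumeric characters."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_index(docs):
    """Build an inverted index: term -> set of IDs of documents containing it."""
    index = {}
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index.setdefault(term, set()).add(doc_id)
    return index

def parse_query(query):
    """Split a query on the uppercase OR operator (an assumption of this sketch).

    Returns one list of terms per operand, e.g.
    "Israel OR Pakistan" -> [["israel"], ["pakistan"]].
    """
    groups = [tokenize(part) for part in re.split(r"\s+OR\s+", query.strip())]
    return [g for g in groups if g]

def candidates(query, index):
    """Return IDs of documents that contain at least one query term."""
    result = set()
    for group in parse_query(query):
        for term in group:
            result |= index.get(term, set())
    return result
```

The candidate set only decides which documents are scored; the order of the final list still comes from the ranking step.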
Output: Ranked List
A ranked list of search results from the Reuters collection
  – Term weighting scheme: TF-IDF
  – Ranking: vector space model, i.e. the cosine similarity between the query and document vectors
      $w_{ij} = (1 + \log \mathrm{tf}_{ij}) \cdot \log(N / \mathrm{df}_i)$
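As an illustration of the weighting and ranking above, here is a minimal sketch that computes the TF-IDF weight (1 + log tf) * log(N / df) for each term and ranks documents by cosine similarity to the query. The function names are hypothetical. The slide does not state the base of the logarithm, so base 10 is an assumption here, as is dropping documents whose score is zero.

```python
import math
from collections import Counter

def tfidf_vector(tokens, df, n_docs):
    """Weight each term as (1 + log10 tf) * log10(N / df).

    Base-10 logarithms are an assumption of this sketch.
    Terms that do not occur in the collection are skipped.
    """
    vec = {}
    for term, tf in Counter(tokens).items():
        if df.get(term, 0) > 0:
            vec[term] = (1 + math.log10(tf)) * math.log10(n_docs / df[term])
    return vec

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def rank(query_tokens, docs):
    """Rank documents (doc_id -> token list) against the query tokens.

    Returns (doc_id, score) pairs, highest score first; zero scores are dropped.
    """
    n_docs = len(docs)
    # Document frequency: number of documents in which each term occurs.
    df = Counter(t for tokens in docs.values() for t in set(tokens))
    q_vec = tfidf_vector(query_tokens, df, n_docs)
    scored = [(doc_id, cosine(q_vec, tfidf_vector(tokens, df, n_docs)))
              for doc_id, tokens in docs.items()]
    scored = [(d, s) for d, s in scored if s > 0]
    return sorted(scored, key=lambda pair: (-pair[1], pair[0]))
```

Note that a term occurring in every document gets weight log(N / N) = 0 under this scheme, so it contributes nothing to the score.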
Example Output
Ex:
  – Query: “Bangladesh”
  – Result: …
Optional Features
Optional functionalities
  – Better user interface for search
  – Complex queries: phrase, wildcard, substring, proximity search, combinations of Boolean operators, … (Ch.2 & 3)
  – Spell-correction, phonetic correction, … (Ch.3)
  – Champion lists, impact-ordering, tiered index, … (Ch.7)
  – Different ranking/term weighting schemes: variants of TF-IDF, … (Ch.6)
  – Each optional feature should be able to be turned on or off by a parameter
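One way an optional feature and its on/off parameter might look is sketched below for spell-correction. This is only an illustration under stated assumptions: it uses plain Levenshtein edit distance, the names `edit_distance` and `correct_term` and the parameters `enabled` and `max_distance` are hypothetical, and a real system would restrict the candidate terms (for example with a k-gram index) rather than scan the whole vocabulary.

```python
def edit_distance(a, b):
    """Levenshtein distance between strings a and b (dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]

def correct_term(term, vocabulary, enabled=True, max_distance=2):
    """Return the closest vocabulary term, or the term itself.

    The term is left unchanged if the feature is disabled, if the term is
    already in the vocabulary, or if no term is within max_distance edits.
    """
    if not enabled or term in vocabulary:
        return term
    # Break distance ties alphabetically so the result is deterministic.
    best = min(vocabulary, key=lambda v: (edit_distance(term, v), v), default=term)
    return best if edit_distance(term, best) <= max_distance else term
```

Passing `enabled=False` returns the query term untouched, which is one simple way to satisfy the on/off requirement.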
Submission
Your submission *should* include
  – The source code (and optionally your executable file)
  – A one-page description that includes the following
      Major features in your work (ex: high efficiency, low storage, multiple input formats, huge corpus, …)
      Major difficulties encountered
      Special requirements for execution environments (ex: Java Runtime Environment, special compilers, …)
      For team work, the name of each member and the parts that member was responsible for, clearly identified
Due: two weeks (Apr. 27, 2011)
Submission Instructions
Programs or homework in electronic files must be submitted directly on the submission site:
  – Submission site:
      Username: your student ID
      Password: (please change your default password at your first login)
  – Preparing your submission file: as one single compressed file
      Remember to specify the names and student IDs of your team members in the files and documentation
  – If you cannot successfully submit your work, please contact the TA
Evaluation
Some example queries will be submitted to your program, and the ranked list will be checked for effectiveness (recall and precision)
  – Minimum requirement: simple keyword and Boolean queries
Optional features will be considered as bonus
  – Various query types, weighting schemes, efficient scoring and ranking, …
You may be required to give a demo if the TA is unable to run your submitted program
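For self-checking before submission, precision and recall for a single query can be computed as in the sketch below. The function name `precision_recall` is hypothetical, and the sketch assumes you have your own set of relevant document IDs for the query; the actual evaluation queries and relevance judgments are not given here.

```python
def precision_recall(retrieved, relevant):
    """Compute (precision, recall) for one query.

    precision = relevant retrieved / total retrieved
    recall    = relevant retrieved / total relevant
    Either value is 0.0 when its denominator is zero.
    """
    retrieved_set = set(retrieved)
    relevant_set = set(relevant)
    hits = len(retrieved_set & relevant_set)
    precision = hits / len(retrieved_set) if retrieved_set else 0.0
    recall = hits / len(relevant_set) if relevant_set else 0.0
    return precision, recall
```

For example, if four documents are retrieved and two of them are among three relevant documents, precision is 2/4 and recall is 2/3.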
Questions?