Information Retrieval and Extraction 2009 Term Project – Modern Web Search
Advisor: 陳信希
TAs: 蔡銘峰 & 許名宏
Overview (in English)
The goal
–Use advanced approaches to improve the performance of basic IR models
Group
–1~3 person(s) per group; submit the list of members to the TA
Approach
–No limitations; any resource on the Web may be used
Date of system demo and report submission
–6/18, Thursday (provisional)
Criteria for the grade
–Originality and reasonableness of your approach
–Implementation effort per person
–Retrieval performance (training & testing)
–Completeness of the report, division of the work, and analysis of the retrieval results
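Overview (translated from the Chinese slide)
Project goal
–Improve the performance of basic retrieval models with advanced IR techniques
Groups
–1~3 people per group; the group leader gives the member list (student ID, name) to the TA
Approach
–No restrictions; any toolkit or resource on the Web may be used
Demo and report submission
–Tentatively 6/18, Thursday
Grading criteria
–Creativity and reasonableness of the approach
–Implementation effort per person
–Retrieval performance (training & testing)
–Completeness of the report, division of the work, and analysis of the retrieval results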
Content in the Report
Detailed description of your approach
Parameter settings (if parametric)
System performance on the training topics
–The baseline performance
–The performance of your approach
Division of the work (how the work was split)
What you have learned (reflections)
Others (optional)
Basic IR Models
Vector space model
–Lucene
Probabilistic model
–Okapi-BM25
Language model
–Indri (Lemur toolkit)
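To make the probabilistic baseline concrete, here is a minimal sketch of Okapi-BM25 scoring over an in-memory list of tokenized documents. It is not the implementation used by any of the toolkits named above: the function name `bm25_scores`, the defaults `k1=1.2` and `b=0.75`, and the non-negative IDF variant `log(1 + (N - df + 0.5) / (df + 0.5))` are all assumptions chosen for illustration.

```python
import math
from collections import Counter

def bm25_scores(query_terms, docs, k1=1.2, b=0.75):
    """Score each tokenized document in `docs` against `query_terms`.

    `docs` is a non-empty list of token lists. k1 and b are common
    default values (an assumption), not values given in the slides.
    """
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: number of documents containing each term.
    df = Counter()
    for d in docs:
        df.update(set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for t in query_terms:
            if tf[t] == 0:
                continue  # term absent from this document
            # Non-negative IDF variant (assumed; other variants exist).
            idf = math.log(1 + (n_docs - df[t] + 0.5) / (df[t] + 0.5))
            # Length-normalized term-frequency saturation.
            norm = tf[t] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[t] * (k1 + 1) / norm
        scores.append(score)
    return scores
```

With equal term frequency, a shorter document scores higher than a longer one, which is the length normalization controlled by `b`.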
Possible Approaches
Pseudo relevance feedback (PRF)
–Supported by the Lemur API
–Simple and effective, but not original
Query expansion
–Using external resources, e.g., WordNet, Wikipedia, query logs, etc.
Word sense disambiguation in documents/queries
Combining results from 2 or more IR systems
Learning to rank
–What are the useful features?
Others
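As one illustration of combining results from two or more IR systems, the sketch below fuses score lists with min-max normalization followed by CombSUM (summing the normalized scores). CombSUM is only one possible fusion rule and is swapped in here as an example; the function name `comb_sum` and the `{doc_id: score}` input shape are assumptions.

```python
def comb_sum(runs, top_k=100):
    """Fuse several {doc_id: score} dicts for one topic.

    Each run is min-max normalized to [0, 1], then normalized scores
    are summed per document (CombSUM). Returns the top_k
    (doc_id, fused_score) pairs, best first.
    """
    fused = {}
    for run in runs:
        if not run:
            continue  # an empty run contributes nothing
        lo, hi = min(run.values()), max(run.values())
        span = hi - lo
        for doc_id, score in run.items():
            # Normalize so scores from different systems are comparable;
            # a run with identical scores maps every document to 1.0.
            norm = (score - lo) / span if span > 0 else 1.0
            fused[doc_id] = fused.get(doc_id, 0.0) + norm
    # Sort by fused score descending; break ties by doc_id for stability.
    ranked = sorted(fused.items(), key=lambda kv: (-kv[1], kv[0]))
    return ranked[:top_k]
```

Documents retrieved by several systems accumulate score, so agreement between systems is rewarded; the default `top_k=100` matches the evaluation depth used in this project.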
Experimental Dataset
A partial collection of TREC WT10g
–Link information is provided
30 topics for system development
Another 30 topics for the demo
Topic Example
<top>
Number: 476
Jennifer Aniston
Description:
Find documents that identify movies and/or television programs that Jennifer Aniston has appeared in.
Narrative:
Relevant documents include movies and/or television programs that Jennifer Aniston has appeared in.
</top>
Document Example
<DOC>
<DOCNO>WTX010-B01-2</DOCNO>
<DOCOLDNO>IA B </DOCOLDNO>
<DOCHDR>
text/html 264 HTTP/ OK
Date: Sunday, 16-Feb-97 18:19:32 GMT
Server: NCSA/SMI-1.0
MIME-version: 1.0
Content-type: text/html
Last-modified: Friday, 02-Feb-96 19:51:15 GMT
Content-length: 82
</DOCHDR>
1 Mr. Delleney did not participate in deliberation of this candidate.
</DOC>
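A document file in this layout can be split into records with a few regular expressions. The sketch below assumes that records are delimited by `<DOC>`...`</DOC>` as in the example, that each record carries one `<DOCNO>`, and that the header blocks should be dropped before indexing; the function name `parse_docs` is hypothetical.

```python
import re

DOC_RE = re.compile(r"<DOC>(.*?)</DOC>", re.DOTALL)
DOCNO_RE = re.compile(r"<DOCNO>\s*(.*?)\s*</DOCNO>", re.DOTALL)
OLDNO_RE = re.compile(r"<DOCOLDNO>.*?</DOCOLDNO>", re.DOTALL)
DOCHDR_RE = re.compile(r"<DOCHDR>.*?</DOCHDR>", re.DOTALL)

def parse_docs(text):
    """Yield (docno, body) pairs from concatenated <DOC> records."""
    for match in DOC_RE.finditer(text):
        record = match.group(1)
        docno = DOCNO_RE.search(record)
        if docno is None:
            continue  # skip malformed records without an identifier
        # Remove the identifier and header blocks, keeping the page body.
        body = DOCNO_RE.sub("", record)
        body = OLDNO_RE.sub("", body)
        body = DOCHDR_RE.sub("", body)
        yield docno.group(1), body.strip()
```

The body is returned as raw text; stripping any remaining HTML markup and tokenizing are left to the indexing step.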
Link Information
In-links
–“A B C”: B and C contain links to A
–ex: WTX010-B WTX010-B WTX010-B
Out-links
–“A B C”: A contains links pointing to B and C
–ex: WTX010-B WTX010-B01-89 WTX010-B01-119
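Both link files use the same "A B C" line layout, so one reader can serve for either. The sketch below assumes one record per line with whitespace-separated document IDs; the function names `parse_links` and `in_degree` are hypothetical, and the in-link count is shown only as one example of a feature that could be derived from the link information.

```python
def parse_links(lines):
    """Map the first ID on each line to the list of IDs that follow it.

    For an in-link file the values are the pages linking TO the key;
    for an out-link file they are the pages the key links to.
    """
    links = {}
    for line in lines:
        ids = line.split()
        if not ids:
            continue  # ignore blank lines
        head, rest = ids[0], ids[1:]
        # Merge in case the same document appears on several lines.
        links.setdefault(head, []).extend(rest)
    return links

def in_degree(in_links):
    """Count distinct linking pages per document (a simple link feature)."""
    return {doc: len(set(sources)) for doc, sources in in_links.items()}
```

An open file object can be passed directly as `lines`, since iterating over it yields one line at a time.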
Evaluation
Evaluate the top 100 retrieved documents
Evaluation metrics
–Mean average precision (MAP)
Use the program “ireval” to evaluate system performance
–Usage of ireval
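For checking results during development, MAP over the top 100 documents can be sketched as follows. This uses the standard textbook definition (precision summed at each relevant rank, divided by the number of relevant documents for the topic); whether ireval follows exactly the same conventions is not stated in the slides, so treat differences as possible. The function names and the input shapes are assumptions.

```python
def average_precision(ranked, relevant, cutoff=100):
    """Average precision of one ranked list of doc IDs.

    `relevant` is the set of relevant doc IDs for the topic.
    Only the first `cutoff` results are considered.
    """
    if not relevant:
        return 0.0  # assumed convention for topics with no relevant docs
    hits, total, seen = 0, 0.0, set()
    for rank, doc_id in enumerate(ranked[:cutoff], start=1):
        # Count each relevant document once, even if it is listed twice.
        if doc_id in relevant and doc_id not in seen:
            hits += 1
            total += hits / rank  # precision at this rank
        seen.add(doc_id)
    return total / len(relevant)

def mean_average_precision(run, qrels, cutoff=100):
    """Mean of per-topic average precision over all topics in `qrels`.

    `run` maps topic -> ranked doc IDs; `qrels` maps topic -> relevant set.
    Topics missing from `run` contribute an average precision of 0.
    """
    if not qrels:
        return 0.0
    aps = [average_precision(run.get(t, []), rel, cutoff)
           for t, rel in qrels.items()]
    return sum(aps) / len(aps)
```

Dividing by the total number of relevant documents, rather than by the number retrieved, penalizes a system for relevant documents it fails to return within the cutoff.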
Example Result for Evaluation
465 Q0 WTX017-B test
465 Q0 WTX017-B test
465 Q0 WTX017-B test
465 Q0 WTX017-B test
465 Q0 WTX017-B test
465 Q0 WTX018-B test
465 Q0 WTX018-B test
465 Q0 WTX012-B test
465 Q0 WTX019-B test
465 Q0 WTX019-B test
474 Q0 WTX012-B test
474 Q0 WTX017-B test
474 Q0 WTX018-B test
474 Q0 WTX013-B test
474 Q0 WTX018-B test
474 Q0 WTX015-B test
474 Q0 WTX019-B test
474 Q0 WTX014-B test
474 Q0 WTX018-B test
474 Q0 WTX018-B test
Dataset Description (1/2)
“training_topics.txt” (file)
–30 topics for system development
“qrels_training_topics.txt” (file)
–Relevance judgments for the training topics
“documents” (directory)
–Contains 10 .rar files of raw documents
“in_links.txt” (file)
–In-link information
“out_links.txt” (file)
–Out-link information
Dataset Description (2/2)
“ireval.jar” (file)
–A Java program for evaluation
“irevalGUI.jar” (file)
–GUI of ireval.jar