Information Retrieval and Extraction 2009 Term Project – Modern Web Search Advisor: 陳信希 TA: 蔡銘峰&許名宏.


1 Information Retrieval and Extraction 2009 Term Project – Modern Web Search Advisor: 陳信希 TA: 蔡銘峰&許名宏

2 Overview (in English)
The goal
–Use advanced approaches to improve the performance of basic IR models
Group
–1–3 persons per group; email the name list to the TA
Approach
–No limitations; any resource on the Web may be used
Date of system demo and report submission
–6/18 Thursday (provisional)
Criteria for the grade
–Originality and reasonableness of your approach
–Implementation effort per person
–Retrieval performance (training & testing)
–Completeness of the report, division of the work, and analysis of the retrieval results

3 Overview (in Chinese)
Project goal
–Use advanced IR techniques to improve the performance of the basic retrieval models
Groups
–1–3 persons per group; the group leader should e-mail the member list (student IDs and names) to the TA
Approach
–No limitations; any toolkit or resource on the Web may be used
Demo and report submission
–Tentatively 6/18 Thursday
Grading criteria
–Originality and reasonableness of the approach
–Implementation effort per person
–Retrieval performance (training & testing)
–Completeness of the report, division of the work, and analysis of the retrieval results

4 Content in the Report
Detailed description of your approach
Parameter settings (if parametric)
System performance on the training topics
–The baseline performance
–The performance of your approach
Division of the work (how tasks were split among members)
What you have learned (reflections)
Others (optional)

5 Basic IR Models
Vector space model
–Lucene
Probabilistic model
–Okapi BM25
Language model
–Indri (Lemur toolkit)
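To make the probabilistic model concrete, the Okapi BM25 score can be sketched as below. This is a minimal toy implementation, not the toolkit's: the corpus, tokenization, and the k1/b defaults are illustrative assumptions.

```python
import math
from collections import Counter

# A minimal Okapi BM25 sketch (toy corpus; k1 and b defaults are
# common textbook values, not settings required by the project).
def bm25_score(query_terms, doc_terms, corpus, k1=1.2, b=0.75):
    """Score one document (a token list) against a query with BM25."""
    N = len(corpus)
    avgdl = sum(len(d) for d in corpus) / N
    tf = Counter(doc_terms)
    score = 0.0
    for term in query_terms:
        df = sum(1 for d in corpus if term in d)
        if df == 0:
            continue  # a term unseen in the collection contributes nothing
        idf = math.log(1 + (N - df + 0.5) / (df + 0.5))  # smoothed idf, never negative
        f = tf[term]
        score += idf * f * (k1 + 1) / (f + k1 * (1 - b + b * len(doc_terms) / avgdl))
    return score

corpus = [
    "jennifer aniston movie friends".split(),
    "television program guide".split(),
    "aniston television movie".split(),
]
query = "aniston movie".split()
ranked = sorted(range(len(corpus)),
                key=lambda i: bm25_score(query, corpus[i], corpus),
                reverse=True)
```

Note how the length normalization term favors the shorter matching document: the third document outranks the first even though both contain every query term.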

6 Possible Approaches
Pseudo relevance feedback (PRF)
–Supported by the Lemur API; simple and effective, but not original
Query expansion
–Using external resources, e.g. WordNet, Wikipedia, query logs, etc.
Word sense disambiguation in documents/queries
Combining results from two or more IR systems
Learning to rank
–What are the useful features?
Others
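The PRF idea above can be sketched in a few lines: treat the top-ranked documents as if they were relevant and append their most frequent unseen terms to the query. This is a deliberately naive illustration; Lemur's built-in PRF uses a proper relevance model rather than raw term counts, and the example documents are made up.

```python
from collections import Counter

# Naive pseudo relevance feedback (illustrative only; real systems
# weight feedback terms and filter stopwords).
def prf_expand(query_terms, ranked_docs, fb_docs=2, fb_terms=3):
    """Assume the top-ranked documents are relevant and append their
    most frequent terms not already in the query."""
    counts = Counter()
    for doc in ranked_docs[:fb_docs]:  # feedback documents
        counts.update(doc)
    expansion = [t for t, _ in counts.most_common() if t not in query_terms]
    return list(query_terms) + expansion[:fb_terms]

ranked = [
    "aniston starred in the movie office space".split(),
    "friends television series aniston".split(),
    "stock market report".split(),
]
expanded = prf_expand("aniston movie".split(), ranked)
```

The toy output also shows why raw counts are a weak signal: without a stopword filter, function words from the feedback documents enter the expanded query alongside useful terms.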

7 Experimental Dataset
A partial collection of TREC WT10g
–Link information is provided
30 topics for system development
Another 30 topics for the demo

8 Topic Example
<top>
<num> Number: 476
<title> Jennifer Aniston
<desc> Description:
Find documents that identify movies and/or television programs that Jennifer Aniston has appeared in.
<narr> Narrative:
Relevant documents include movies and/or television programs that Jennifer Aniston has appeared in.
</top>
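A topic file in this shape can be split into fields with a few regular expressions. The sketch below assumes the standard TREC tag layout (`<num>`, `<title>`, `<desc>`, `<narr>` inside each `<top>` block); the embedded sample reuses topic 476.

```python
import re

# Sample topic in standard TREC layout (topic 476 from the slide).
TOPIC_FILE = """<top>
<num> Number: 476
<title> Jennifer Aniston
<desc> Description:
Find documents that identify movies and/or television programs
that Jennifer Aniston has appeared in.
<narr> Narrative:
Relevant documents include movies and/or television programs
that Jennifer Aniston has appeared in.
</top>"""

def parse_topics(text):
    """Return {topic number: {"title", "desc", "narr"}} for each <top> block."""
    topics = {}
    for block in re.findall(r"<top>(.*?)</top>", text, re.S):
        num = int(re.search(r"<num>\s*Number:\s*(\d+)", block).group(1))
        title = re.search(r"<title>\s*(.+)", block).group(1).strip()
        desc = re.search(r"<desc>[^\n]*\n(.*?)<narr>", block, re.S).group(1).strip()
        narr = re.search(r"<narr>[^\n]*\n(.*)", block, re.S).group(1).strip()
        topics[num] = {"title": title, "desc": desc, "narr": narr}
    return topics

topics = parse_topics(TOPIC_FILE)
```

The short `<title>` field is usually what goes into the retrieval engine as the query; the description and narrative are mainly useful for query expansion and for judging relevance during error analysis.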

9 Document Example
<DOC>
<DOCNO>WTX010-B01-2</DOCNO>
<DOCOLDNO>IA011-000115-B026-169</DOCOLDNO>
<DOCHDR>
http://www.lpitr.state.sc.us:80/reports/jsrf14.htm 167.7.18.68 19970216181104 text/html 264
HTTP/1.0 200 OK
Date: Sunday, 16-Feb-97 18:19:32 GMT
Server: NCSA/SMI-1.0
MIME-version: 1.0
Content-type: text/html
Last-modified: Friday, 02-Feb-96 19:51:15 GMT
Content-length: 82
</DOCHDR>
1 Mr. Delleney did not participate in deliberation of this candidate.
</DOC>

10 Link Information
In-links
–A line "A B C" means B and C contain links to A
ex: WTX010-B01-118 WTX010-B01-114 WTX010-B01-121
Out-links
–A line "A B C" means A contains links pointing to B and C
ex: WTX010-B01-127 WTX010-B01-89 WTX010-B01-119
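Loading either link file into a dictionary is straightforward given the one-record-per-line layout above; the sketch below assumes whitespace-separated document IDs, as in the examples.

```python
# Parse in_links.txt / out_links.txt, assuming one whitespace-separated
# record per line as shown in the slide's examples.
def load_links(lines):
    """'A B C' -> {A: [B, C]}. For in_links.txt the list holds documents
    linking TO A; for out_links.txt it holds documents A links to."""
    links = {}
    for line in lines:
        parts = line.split()
        if parts:
            links[parts[0]] = parts[1:]
    return links

in_links = load_links(["WTX010-B01-118 WTX010-B01-114 WTX010-B01-121"])
out_links = load_links(["WTX010-B01-127 WTX010-B01-89 WTX010-B01-119"])
```

These dictionaries are the natural input for link-based features (e.g. in-link counts, or a PageRank-style score) if you pursue a learning-to-rank approach.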

11 Evaluation
Evaluate the top 100 retrieved documents
Evaluation metrics
–Mean average precision (MAP)
–NDCG@15
Use the program "ireval" to evaluate system performance
–Usage of ireval

12 Example Result for Evaluation
(columns: topic number, Q0, document number, rank, score, run tag)
465 Q0 WTX017-B13-74  1  5    test
465 Q0 WTX017-B38-11  2  4.5  test
465 Q0 WTX017-B38-41  3  4.3  test
465 Q0 WTX017-B38-42  4  4.2  test
465 Q0 WTX017-B40-46  5  4.1  test
465 Q0 WTX018-B44-359 6  3.5  test
465 Q0 WTX018-B44-300 7  3    test
465 Q0 WTX012-B01-121 8  2.5  test
465 Q0 WTX019-B37-27  9  2    test
465 Q0 WTX019-B37-31  10 1.9  test
474 Q0 WTX012-B01-151 1  9    test
474 Q0 WTX017-B38-46  2  8    test
474 Q0 WTX018-B44-35  3  7    test
474 Q0 WTX013-B03-335 4  6    test
474 Q0 WTX018-B44-30  5  5    test
474 Q0 WTX015-B25-285 6  4    test
474 Q0 WTX019-B37-27  7  3    test
474 Q0 WTX014-B39-281 8  2    test
474 Q0 WTX018-B14-294 9  1.5  test
474 Q0 WTX018-B20-109 10 1    test
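Given a ranked run like the one above plus the qrels file, MAP can be computed as below. This is a sketch for intuition only; ireval is the authoritative scorer for the project, and the toy run/qrels data here are made up (NDCG@15 is omitted for brevity).

```python
# Mean average precision over the top 100 results, the metric the
# project evaluates alongside NDCG@15 (re-implemented for intuition).
def average_precision(ranked_docnos, relevant, cutoff=100):
    """AP for one topic: mean of precision@i taken at each relevant hit."""
    hits, precision_sum = 0, 0.0
    for i, doc in enumerate(ranked_docnos[:cutoff], start=1):
        if doc in relevant:
            hits += 1
            precision_sum += hits / i
    return precision_sum / len(relevant) if relevant else 0.0

def mean_average_precision(runs, qrels, cutoff=100):
    """runs: {topic: ranked docno list}; qrels: {topic: set of relevant docnos}."""
    aps = [average_precision(docs, qrels.get(topic, set()), cutoff)
           for topic, docs in runs.items()]
    return sum(aps) / len(aps)

runs = {465: ["d1", "d2", "d3", "d4"]}   # hypothetical ranked run
qrels = {465: {"d1", "d3"}}              # hypothetical judgments
map_score = mean_average_precision(runs, qrels)
```

In the toy run, the relevant documents sit at ranks 1 and 3, so AP is (1/1 + 2/3) / 2 = 5/6; averaging this over all topics gives MAP.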

13 Dataset Description (1/2)
"training_topics.txt" (file)
–30 topics for system development
"qrels_training_topics.txt" (file)
–Relevance judgments for the training topics
"documents" (directory)
–Contains 10 .rar files of raw documents
"in_links.txt" (file)
–In-link information
"out_links.txt" (file)
–Out-link information

14 Dataset Description (2/2)
"ireval.jar" (file)
–A Java program for evaluation
"irevalGUI.jar" (file)
–A GUI front end for ireval.jar

