Information Retrieval and Extraction 2010 Term Project – Modern Web Search Advisor: 陳信希 TA: 許名宏 & 王界人.


1 Information Retrieval and Extraction 2010 Term Project – Modern Web Search Advisor: 陳信希 TA: 許名宏 & 王界人

2 Overview (in English)
Goal
– Use advanced approaches to enhance Okapi BM25
Group
– 1–3 person(s) per group; e-mail the name list to the TA
Approach
– No limitations; any resource on the Web is usable
Date of system demo and report submission
– 6/24 Thursday (provisional)
Grading criteria
– Originality and reasonableness of your approach
– Implementation effort per person
– Retrieval performance (training & testing)
– Completeness of the report (division of work, analysis of results)

3 Overview (in Chinese)
Project goal
– Use advanced IR techniques to improve the performance of Okapi BM25
Groups
– 1–3 persons per group; the group leader should e-mail the member list (student IDs and names) to the TA
Approach
– No limitations; any toolkit or resource on the Web may be used
Demo and report submission
– 6/25 Friday
Grading criteria
– Originality and reasonableness of the chosen approach
– Implementation effort per person
– Retrieval performance (training and testing)
– Completeness of the report, division of work, and analysis of retrieval results

4 Content of Report
Detailed description of your approach
Parameter settings (if parametric)
System performance on the training topics
– The baseline (Okapi BM25) performance
– The performance of your approach
Division of the work
What you have learned
Others (optional)

5 Baseline Implementation: Okapi BM25
Parametric probabilistic model
Parameter setting
– k1 = 1.2, k2 = 0, k3 = 0, b = 0.75, R = r = 0 (initial guess)
Stemming: Porter's stemmer
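The baseline above can be sketched in a few lines. This is a minimal illustration under the slide's settings, not the course's reference implementation; the function and argument names (`bm25_score`, `doc_tf`, `df`) are invented for the example. With k2 = k3 = 0 and R = r = 0, the query-side factors reduce to 1 and the term weight becomes log((N − n + 0.5) / (n + 0.5)):

```python
import math

def bm25_score(query_terms, doc_tf, doc_len, avg_doc_len, df, N,
               k1=1.2, b=0.75):
    """Okapi BM25 with k2 = k3 = 0 and R = r = 0.

    doc_tf: term -> frequency in this document
    df:     term -> number of documents containing the term
    N:      total number of documents in the collection
    """
    score = 0.0
    # Document-length normalization factor K = k1 * ((1-b) + b * dl/avdl)
    K = k1 * ((1 - b) + b * doc_len / avg_doc_len)
    for t in query_terms:
        tf = doc_tf.get(t, 0)
        n = df.get(t, 0)
        if tf == 0 or n == 0:
            continue
        # RSJ weight with no relevance information (R = r = 0)
        w = math.log((N - n + 0.5) / (n + 0.5))
        score += w * (k1 + 1) * tf / (K + tf)
    return score
```

Ranking the collection then amounts to computing this score for every document and sorting in descending order.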

6 Possible Approaches
Pseudo relevance feedback (PRF)
– Supported by the Lemur API
– Simple and effective, but no originality
Query expansion
– Using external resources, e.g. WordNet, Wikipedia, the AOL query log, etc.
Word sense disambiguation in documents/queries
Combining results from two or more IR systems
Latent semantic analysis (LSI)
Others
– Learning to rank, clustering/classification, …
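Lemur provides PRF out of the box; as a toolkit-independent sketch of the idea (all names here are hypothetical, and real systems weight candidate terms by more than raw frequency), one can build an expanded query from the top of the initial ranking and retrieve again:

```python
from collections import Counter

def prf_expand(query_terms, ranked_docs, doc_terms,
               top_docs=5, top_terms=10):
    """Pseudo relevance feedback: assume the top-ranked documents are
    relevant, take the most frequent terms in them (excluding the
    original query terms), and append them to the query.

    ranked_docs: docnos from the initial retrieval, best first
    doc_terms:   docno -> list of terms in that document
    """
    counts = Counter()
    for docno in ranked_docs[:top_docs]:
        counts.update(doc_terms[docno])
    for t in query_terms:
        counts.pop(t, None)          # never re-add original terms
    expansion = [t for t, _ in counts.most_common(top_terms)]
    return list(query_terms) + expansion
```

The expanded query is then run through the BM25 baseline a second time; `top_docs` and `top_terms` are exactly the kind of parameters the report's "parameter setting" section should document.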

7 Experimental Dataset
A partial collection of TREC WT10g
– ~10k documents
– Link information is provided
30 topics for system development (training)
Another 20 topics for the demo (testing)

8 Topic Example
<top>
Number: 476
Jennifer Aniston
Description:
Find documents that identify movies and/or television programs that Jennifer Aniston has appeared in.
Narrative:
Relevant documents include movies and/or television programs that Jennifer Aniston has appeared in.
</top>

9 Document Example
<DOC>
<DOCNO>WTX010-B01-2</DOCNO>
<DOCOLDNO>IA011-000115-B026-169</DOCOLDNO>
<DOCHDR>
http://www.lpitr.state.sc.us:80/reports/jsrf14.htm 167.7.18.68 19970216181104 text/html 264
HTTP/1.0 200 OK
Date: Sunday, 16-Feb-97 18:19:32 GMT
Server: NCSA/SMI-1.0
MIME-version: 1.0
Content-type: text/html
Last-modified: Friday, 02-Feb-96 19:51:15 GMT
Content-length: 82
</DOCHDR>
1 Mr. Delleney did not participate in deliberation of this candidate.
</DOC>
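A sketch of how the raw files might be split into (docno, text) pairs for indexing. The regexes assume well-formed records like the example above (every `<DOC>` contains a `<DOCNO>` and the metadata blocks shown); real WT10g files may need more robust handling:

```python
import re

_DOC = re.compile(r'<DOC>(.*?)</DOC>', re.S)
_DOCNO = re.compile(r'<DOCNO>\s*(\S+?)\s*</DOCNO>')
# Metadata blocks to discard before indexing the body text.
_DROP = re.compile(
    r'<DOCNO>.*?</DOCNO>|<DOCOLDNO>.*?</DOCOLDNO>|<DOCHDR>.*?</DOCHDR>',
    re.S)

def parse_docs(raw):
    """Yield (docno, body-text) pairs from one raw document file,
    dropping the DOCNO/DOCOLDNO/DOCHDR metadata and any HTML tags."""
    for m in _DOC.finditer(raw):
        chunk = m.group(1)
        docno = _DOCNO.search(chunk).group(1)
        body = _DROP.sub(' ', chunk)
        body = re.sub(r'<[^>]+>', ' ', body)  # strip leftover HTML tags
        yield docno, ' '.join(body.split())
```

The resulting text would then be stemmed (Porter) and tokenized before computing term and document frequencies for the baseline.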

10 Link Information
For approaches using PageRank/HITS
In-links
– "A B C" → B and C contain links to A
– ex: WTX010-B01-118 WTX010-B01-114 WTX010-B01-121
Out-links
– "A B C" → A contains links pointing to B and C
– ex: WTX010-B01-127 WTX010-B01-89 WTX010-B01-119
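A minimal sketch of how the out-link file could feed a PageRank computation (function names are my own; this is the standard iterative formulation, with mass from dangling documents spread uniformly, not a tuned implementation):

```python
def load_links(lines):
    """Parse the link-file format above: 'A B C' maps docno A to [B, C]."""
    links = {}
    for line in lines:
        parts = line.split()
        if parts:
            links[parts[0]] = parts[1:]
    return links

def pagerank(out_links, d=0.85, iterations=50):
    """Iterative PageRank with damping factor d over an out-link map."""
    nodes = set(out_links)
    for targets in out_links.values():
        nodes.update(targets)
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iterations):
        # Rank held by dangling nodes (no out-links) is spread uniformly.
        dangling = sum(rank[v] for v in nodes if not out_links.get(v))
        new = {v: (1 - d) / n + d * dangling / n for v in nodes}
        for src, targets in out_links.items():
            if targets:
                share = d * rank[src] / len(targets)
                for t in targets:
                    new[t] += share
        rank = new
    return rank
```

The resulting scores could be combined with the BM25 score (e.g. a weighted sum of normalized values) to re-rank the top retrieved documents; how to combine them is itself a design choice worth reporting.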

11 Evaluation
Evaluate the top 100 retrieved documents
Evaluation metrics
– Mean average precision (MAP)
– P@20
Use the program "trec_eval" to evaluate system performance
– Usage of trec_eval

12 Example Result for Evaluation
(topic-num) (dummy) (docno) (rank) (score) (run-tag)
465 Q0 WTX017-B13-74 1 5 test
465 Q0 WTX017-B38-11 2 4.5 test
465 Q0 WTX017-B38-41 3 4.3 test
465 Q0 WTX017-B38-42 4 4.2 test
465 Q0 WTX017-B40-46 5 4.1 test
465 Q0 WTX018-B44-359 6 3.5 test
465 Q0 WTX018-B44-300 7 3 test
465 Q0 WTX012-B01-121 8 2.5 test
465 Q0 WTX019-B37-27 9 2 test
465 Q0 WTX019-B37-31 10 1.9 test
474 Q0 WTX012-B01-151 1 9 test
474 Q0 WTX017-B38-46 2 8 test
474 Q0 WTX018-B44-35 3 7 test
474 Q0 WTX013-B03-335 4 6 test
474 Q0 WTX018-B44-30 5 5 test
474 Q0 WTX015-B25-285 6 4 test
474 Q0 WTX019-B37-27 7 3 test
474 Q0 WTX014-B39-281 8 2 test
474 Q0 WTX018-B14-294 9 1.5 test
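trec_eval is the official scorer, but for quick sanity checks during development the two metrics and the run-line format are easy to sketch (function names here are my own):

```python
def run_line(topic, docno, rank, score, tag='test'):
    """Format one line of the result file in the layout shown above."""
    return f'{topic} Q0 {docno} {rank} {score} {tag}'

def average_precision(ranked, relevant, cutoff=100):
    """Per-topic component of MAP: mean of precision at each relevant
    hit, divided by the number of known relevant documents."""
    hits, total = 0, 0.0
    for i, docno in enumerate(ranked[:cutoff], start=1):
        if docno in relevant:
            hits += 1
            total += hits / i
    return total / len(relevant) if relevant else 0.0

def precision_at(ranked, relevant, k=20):
    """P@k: fraction of the top k retrieved documents that are relevant."""
    return sum(1 for d in ranked[:k] if d in relevant) / k
```

MAP is then the mean of `average_precision` over all topics; final reported numbers should still come from trec_eval, since its tie-breaking and edge-case handling are the reference behavior.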

13 Example of Relevance Judgments
(topic-num) (dummy) (docno) (relevance)
465 0 WTX017-B13-74 1
465 0 WTX017-B38-46 1
465 0 WTX018-B44-359 1
465 0 WTX019-B37-27 2
474 0 WTX012-B01-151 1
474 0 WTX013-B03-335 1
474 0 WTX014-B39-281 1
474 0 WTX015-B25-285 1
474 0 WTX018-B20-109 2
474 0 WTX018-B14-294 1

14 Summary of What to Do
1. Implement Okapi BM25 (baseline) with the fixed settings
2. Evaluate the baseline approach on the training topics, using the terms in the topic title as the query
3. Survey or design your enhanced approach
4. Evaluate and optimize your approach on the training topics
5. Submit the report and demo with the testing topics
6. Evaluate Okapi BM25 and your approach on the testing topics

15 Dataset Description (1/2)
"training_topics.txt" (file)
– 30 topics for system development
"qrels_training_topics.txt" (file)
– Relevance judgments for the training topics
"documents" (directory)
– Contains 10 .rar files of raw documents
"in_links.txt" (file)
– In-link information
"out_links.txt" (file)
– Out-link information

16 Dataset Description (2/2)
"trec_eval.exe" (file)
– Binary evaluation program
"trec_eval.8.1.rar" (file)
– Source of trec_eval, for building on UNIX
