WBIA Project 2 – Retrieval & Evaluation LI Geng Nov.10, 2008.

1 WBIA Project 2 – Retrieval & Evaluation LI Geng Nov.10, 2008

2 Guidelines Information retrieval evaluation – a brief review Goals of this assignment Tools & work environment  Nutch-0.9  Lucene-2.1.0 Assignment instructions Submission & grading policies

3 Previously in Project 1 - Crawling Tool: Nutch Target network: ccer.pku.edu.cn What we already have:  A web database that contains web pages of CCER;  Inverted index of your data (you may not have noticed yet);  Global PageRank results

4 Previously in Project 1 (Cont.) What we don't have yet for a complete IR service:  Interpreting the user's information need;  Query → Web page (at least page URLs);  Online retrieval service.

5 I. Information Retrieval Evaluation – A Brief Review Project 2's Focus: Query → Web Page What do we need to evaluate retrieval results?  Retrieval model implementation & optimization;  A standard test data set;  Pre-defined queries and their corresponding answer set;  Evaluating with well-known metrics (MAP, P@10, etc.)
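For reference, the two metrics named on this slide are commonly defined as below. The notation is chosen here and does not come from the slides: rel(i) is 1 if the result at rank i belongs to the answer set and 0 otherwise, n is the length of the ranked list, R_q is the answer set of query q, and Q is the set of test queries.

```latex
\[
P@k = \frac{1}{k}\sum_{i=1}^{k}\mathrm{rel}(i),
\qquad
AP(q) = \frac{1}{|R_q|}\sum_{k=1}^{n} P@k \cdot \mathrm{rel}(k),
\qquad
MAP = \frac{1}{|Q|}\sum_{q \in Q} AP(q)
\]
```

P@10 is simply the first formula with k = 10.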

6 II. Goals of this Assignment  Set up an online web search engine (using Nutch)  Understand the information retrieval evaluation process  Refine the existing retrieval model (by improving its evaluation metric scores)

7 How? A standard web page test set (Done.) Pre-defined queries and their corresponding answer set (Done.) Retrieval model implementation Evaluating with well-known metrics (MAP, P@10, etc.)

8 III. Tools & work environment Nutch’s major modules:  Crawling  Indexing  Retrieval  Web search ……  Of which indexing and retrieval modules are built on top of Lucene.

9 Lucene A framework for document retrieval using the Vector Space Model  Inverted index construction  Query matching
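The two bullets on this slide can be illustrated with a toy example. The sketch below is not Lucene code and does not reflect Lucene's index format; the class `TinyInvertedIndex`, its methods, and the simple tf × idf weighting are assumptions made only to show how posting lists are built and then traversed at query time.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

/** Toy in-memory inverted index (hypothetical; not Lucene's implementation). */
final class TinyInvertedIndex {
    // term -> (docId -> term frequency in that document)
    private final Map<String, Map<Integer, Integer>> postings = new TreeMap<>();
    private int numDocs = 0;

    /** Lowercases the text and splits it on anything that is not a letter or digit. */
    private static List<String> tokenize(String text) {
        List<String> terms = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!t.isEmpty()) terms.add(t);
        }
        return terms;
    }

    /** Index construction: record each term occurrence in the term's posting list. */
    void addDocument(int docId, String text) {
        numDocs++;
        for (String term : tokenize(text)) {
            postings.computeIfAbsent(term, t -> new TreeMap<>()).merge(docId, 1, Integer::sum);
        }
    }

    /** Returns the posting list of a term, or an empty map if the term is unknown. */
    Map<Integer, Integer> postingsFor(String term) {
        Map<Integer, Integer> list = postings.get(term.toLowerCase(Locale.ROOT));
        return list == null ? Collections.<Integer, Integer>emptyMap() : list;
    }

    /** Query matching: sum tf * idf over the query terms, return docIds by descending score. */
    List<Integer> search(String query) {
        final Map<Integer, Double> scores = new HashMap<>();
        for (String term : tokenize(query)) {
            Map<Integer, Integer> list = postingsFor(term);
            if (list.isEmpty()) continue;
            double idf = Math.log(1.0 + (double) numDocs / list.size());
            for (Map.Entry<Integer, Integer> e : list.entrySet()) {
                scores.merge(e.getKey(), e.getValue() * idf, Double::sum);
            }
        }
        List<Integer> ranked = new ArrayList<>(scores.keySet());
        ranked.sort((a, b) -> {
            int byScore = Double.compare(scores.get(b), scores.get(a));
            return byScore != 0 ? byScore : Integer.compare(a, b); // ties: lower docId first
        });
        return ranked;
    }

    public static void main(String[] args) {
        TinyInvertedIndex index = new TinyInvertedIndex();
        index.addDocument(1, "Lucene builds an inverted index");
        index.addDocument(2, "Nutch uses Lucene for indexing and retrieval");
        index.addDocument(3, "PageRank ranks web pages");
        System.out.println(index.postingsFor("lucene"));
        System.out.println(index.search("lucene retrieval"));
    }
}
```

Only documents that appear in at least one query term's posting list are ever scored, which is the main reason an inverted index makes query matching cheap.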

10 Lucene (Cont.) It does not handle (from http://darksleep.com/lucene):  managing the process (instantiating the objects and hooking them together, both for indexing and for searching)  selecting the data files  parsing the data files (e.g., Chinese word segmentation)  getting the search string from the user  displaying the search results to the user It is a "library" rather than a stand-alone application

11 Lucene (Cont.) But a library with useful utilities as standard extensions  E.g. package org.apache.lucene.analysis.standard; Default document analysis (and tokenizing) utilities (i.e. they will be used if you don't implement your own.)

12 Lucene in Nutch As a third-party library  try listing the $NUTCH_HOME/lib directory [Diagram labels: Crawled Web Page; org.apache.lucene.analysis; org.apache.lucene.index; org.apache.lucene.search; Inverted Index; HitSet; Web Page; Posting Lists; Matched Documents]

13 Lucene in Nutch (Cont.) Nutch implements Lucene interfaces and imports Lucene classes so as to reuse its indexing and retrieval functionality.  E.g. in package org.apache.nutch.analysis: public final class NutchDocumentTokenizer extends org.apache.lucene.analysis.Tokenizer implements NutchAnalysisConstants  Refer to these packages for more details: package org.apache.nutch.indexer; package org.apache.nutch.analysis; package org.apache.nutch.searcher; [Diagram labels: Index Construction; Retrieval]

14 Towards a complete IR Application Nutch's major modules:  Crawling  Indexing (try listing the root directory of your WebDB: crawldb, indexes, linkdb, segments)  Retrieval  Web search …

15 IV. Assignment Instructions The test set and answer set:  Taken from one group’s previous crawl  Will be put online soon Retrieval  Enhance retrieval quality using your PageRank results Web search  Set up online search engine with Nutch

16 Step 1 - Web Search Engine Setup This is the recommended first step in this assignment.  It is relatively simple; Nutch's online tutorial covers it in sufficient detail: http://wiki.apache.org/nutch/NutchTutorial  It will give you a first impression of the vector space retrieval model implemented by Lucene. Important: To save time with Nutch configuration, refer to my instructions in addition to the Nutch online tutorial:  http://162.105.80.59/WBIA_NutchConfigHelp.txt

17 Step 1 - Web Search Engine Setup (Cont.) Your task:  Compute the retrieval metrics MAP and P@10 for the unmodified engine, as the baseline for later comparison.
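A minimal sketch of how the baseline metrics could be computed once you have, for each test query, the ranked list of result URLs and the answer set. The class `RetrievalMetrics` and its method names are hypothetical and not part of Nutch or Lucene; the code assumes each ranked list contains no duplicate URLs.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical helper for P@k, average precision and MAP over ranked URL lists. */
final class RetrievalMetrics {

    /** Fraction of the top k results that are in the answer set (missing ranks count as misses). */
    static double precisionAtK(List<String> ranked, Set<String> relevant, int k) {
        if (k <= 0) throw new IllegalArgumentException("k must be positive");
        int hits = 0;
        for (int i = 0; i < Math.min(k, ranked.size()); i++) {
            if (relevant.contains(ranked.get(i))) hits++;
        }
        return (double) hits / k;
    }

    /** Mean of the precision values at each rank where a relevant result appears. */
    static double averagePrecision(List<String> ranked, Set<String> relevant) {
        if (relevant.isEmpty()) return 0.0;
        int hits = 0;
        double sum = 0.0;
        for (int i = 0; i < ranked.size(); i++) {
            if (relevant.contains(ranked.get(i))) {
                hits++;
                sum += (double) hits / (i + 1);
            }
        }
        return sum / relevant.size(); // relevant pages never retrieved contribute zero
    }

    /** MAP over all queries in the answer set; a query with no run counts as AP = 0. */
    static double meanAveragePrecision(Map<String, List<String>> runs,
                                       Map<String, Set<String>> answers) {
        if (answers.isEmpty()) return 0.0;
        double total = 0.0;
        for (Map.Entry<String, Set<String>> e : answers.entrySet()) {
            List<String> ranked = runs.get(e.getKey());
            if (ranked == null) ranked = Collections.<String>emptyList();
            total += averagePrecision(ranked, e.getValue());
        }
        return total / answers.size();
    }

    public static void main(String[] args) {
        // Made-up query, URLs and judgments, for illustration only.
        Map<String, List<String>> runs = new HashMap<>();
        runs.put("q1", Arrays.asList("a", "b", "c", "d"));
        Map<String, Set<String>> answers = new HashMap<>();
        answers.put("q1", new HashSet<>(Arrays.asList("a", "c", "x")));
        System.out.println("P@10 = " + precisionAtK(runs.get("q1"), answers.get("q1"), 10));
        System.out.println("MAP  = " + meanAveragePrecision(runs, answers));
    }
}
```

Running the same code on the unmodified engine and again after Step 3 keeps the two sets of numbers directly comparable.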

18 Step 2 – Lucene Retrieval Ranking Analysis Entry point:  class org.apache.lucene.search.IndexSearcher Hint: related classes, for reference:  class org.apache.lucene.search.BooleanQuery  class org.apache.lucene.search.BooleanQuery.BooleanWeight

19 Step 2 – Lucene Retrieval Ranking Analysis (Cont.) Your task:  Work out the formula Lucene uses to compute document scores.
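As a starting point for the code reading, the javadoc of org.apache.lucene.search.Similarity describes a score of roughly the following shape. Treat this as a guide to be verified against the 2.1.0 source rather than as the final answer, since the exact factors and their defaults can differ between releases. The defaults listed are those of DefaultSimilarity as far as I recall them.

```latex
\[
\mathrm{score}(q,d) \;=\; \mathrm{coord}(q,d)\cdot \mathrm{queryNorm}(q)\cdot
\sum_{t \in q} \mathrm{tf}(t,d)\cdot \mathrm{idf}(t)^{2}\cdot \mathrm{boost}(t)\cdot \mathrm{norm}(t,d)
\]
\[
\mathrm{tf}(t,d) = \sqrt{\mathrm{freq}(t,d)},\qquad
\mathrm{idf}(t) = 1 + \ln\frac{\mathrm{numDocs}}{\mathrm{docFreq}(t)+1},\qquad
\mathrm{coord}(q,d) = \frac{\mathrm{overlap}}{\mathrm{maxOverlap}}
\]
```

Here norm(t,d) bundles the field boost and the length normalization stored at indexing time, and queryNorm(q) only rescales scores for a given query without changing the ranking.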

20 Step 3 – Integrate PageRank results with VSM Your task:  Figure out a solution to combine PageRank and VSM score effectively to enhance retrieval quality.  Any ideas now? Required coding: edit  package org.apache.lucene.search
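One possible starting idea, offered only as a sketch and not as the intended solution: normalize both signals to a common range and interpolate them linearly. The class `ScoreCombiner`, the min-max normalization, and the weight `lambda` are all assumptions introduced here; the code works on plain arrays outside Lucene, so wiring it into org.apache.lucene.search is left open.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical linear interpolation of a VSM score and a PageRank value per document. */
final class ScoreCombiner {

    /** Min-max normalizes values into [0, 1]; a constant input maps to all zeros. */
    static double[] minMax(double[] values) {
        double min = Double.POSITIVE_INFINITY;
        double max = Double.NEGATIVE_INFINITY;
        for (double v : values) {
            min = Math.min(min, v);
            max = Math.max(max, v);
        }
        double range = max - min;
        double[] out = new double[values.length];
        for (int i = 0; i < values.length; i++) {
            out[i] = range > 0 ? (values[i] - min) / range : 0.0;
        }
        return out;
    }

    /** combined[i] = lambda * vsmNorm[i] + (1 - lambda) * pageRankNorm[i], with lambda in [0, 1]. */
    static double[] combine(double[] vsm, double[] pageRank, double lambda) {
        if (vsm.length != pageRank.length) {
            throw new IllegalArgumentException("score arrays must have the same length");
        }
        if (lambda < 0.0 || lambda > 1.0) {
            throw new IllegalArgumentException("lambda must be in [0, 1]");
        }
        double[] v = minMax(vsm);
        double[] p = minMax(pageRank);
        double[] out = new double[vsm.length];
        for (int i = 0; i < out.length; i++) {
            out[i] = lambda * v[i] + (1.0 - lambda) * p[i];
        }
        return out;
    }

    /** Returns document positions ordered by descending score (ties keep input order). */
    static List<Integer> rank(double[] scores) {
        List<Integer> order = new ArrayList<>();
        for (int i = 0; i < scores.length; i++) order.add(i);
        order.sort((a, b) -> Double.compare(scores[b], scores[a]));
        return order;
    }

    public static void main(String[] args) {
        // Made-up scores for three candidate documents of one query.
        double[] vsm = {0.9, 0.8, 0.1};
        double[] pageRank = {0.01, 0.05, 0.02};
        System.out.println(rank(combine(vsm, pageRank, 0.5)));
    }
}
```

With `lambda = 1` the ranking reduces to pure VSM, so sweeping `lambda` and recomputing MAP and P@10 gives a direct view of how much PageRank helps or hurts.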

21 Step 4 – Re-evaluate and Improve Based on your new model and retrieval results, recompute  MAP, P@10 Compare the newly computed values with the previous ones; go back to Step 3 if there is still room for improvement.

22 Challenge Task 1 Edit Lucene to implement the language model (then repeat the evaluation process and compare the results with VSM + PageRank)  Hint: Find out how Lucene stores and reads the posting lists, and figure out a way to use the data in them for LM similarity computation. Alternatively, consider reformatting the posting list store and inserting additional useful information.
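The slide does not say which language model to use. As one common choice, the sketch below scores a document by query likelihood with Dirichlet smoothing, log P(q|d) = Σ_t log((tf(t,d) + μ·P(t|C)) / (|d| + μ)). The class `DirichletLm`, its parameters, and the decision to skip query terms that never occur in the collection are assumptions made here. It also shows which statistics the posting lists would need to supply: term frequency per document, document length, and collection term counts.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical query-likelihood scorer with Dirichlet smoothing (not Lucene code). */
final class DirichletLm {

    /** Counts how often each term occurs in a token list. */
    static Map<String, Integer> termCounts(List<String> terms) {
        Map<String, Integer> counts = new HashMap<>();
        for (String t : terms) counts.merge(t, 1, Integer::sum);
        return counts;
    }

    /**
     * Returns log P(query | doc) under a Dirichlet-smoothed unigram model.
     * collectionCounts holds term counts over the whole collection and
     * collectionLength is the total number of tokens in the collection.
     */
    static double score(List<String> query, List<String> doc,
                        Map<String, Integer> collectionCounts,
                        long collectionLength, double mu) {
        Map<String, Integer> docCounts = termCounts(doc);
        double logLikelihood = 0.0;
        for (String term : query) {
            int cf = collectionCounts.getOrDefault(term, 0);
            if (cf == 0) continue; // term unseen in the collection: skipped (an assumption)
            double pCollection = (double) cf / collectionLength;
            int tf = docCounts.getOrDefault(term, 0);
            logLikelihood += Math.log((tf + mu * pCollection) / (doc.size() + mu));
        }
        return logLikelihood;
    }

    public static void main(String[] args) {
        // Made-up two-document collection.
        List<String> d1 = Arrays.asList("lucene", "index", "index");
        List<String> d2 = Arrays.asList("nutch", "crawl", "web");
        Map<String, Integer> collection =
                termCounts(Arrays.asList("lucene", "index", "index", "nutch", "crawl", "web"));
        List<String> query = Arrays.asList("index");
        System.out.println("d1: " + score(query, d1, collection, 6, 2.0));
        System.out.println("d2: " + score(query, d2, collection, 6, 2.0));
    }
}
```

Unlike the VSM score, a document that lacks a query term still receives a nonzero probability through the collection model, so scores must be compared as log values rather than summed weights.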

23 Challenge Task 2 Implement LSI (Latent Semantic Indexing) and evaluate it  In this case, could Lucene's document scoring module still be reused? ……
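As background for the question on this slide, the standard LSI formulation is summarized below. The notation is chosen here: A is the term-document matrix, k is the number of latent dimensions kept, q is the query as a term vector, and d̂_j is the j-th row of V_k.

```latex
\[
A \;\approx\; A_k = U_k \Sigma_k V_k^{\top},
\qquad
\hat{q} = \Sigma_k^{-1} U_k^{\top} q,
\qquad
\mathrm{sim}(q, d_j) = \frac{\hat{q}\cdot \hat{d}_j}{\lVert \hat{q}\rVert\,\lVert \hat{d}_j\rVert}
\]
```

Because both the folded-in query and the documents are dense k-dimensional vectors, matching no longer proceeds term by term over posting lists, which is worth keeping in mind when deciding how much of Lucene's scoring code can be reused.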

24 V. Submission & Grading Deadline: Dec. 3, 23:59 The Challenge tasks are optional.

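25 Submission Contents A project report document containing the following parts: 1. Group members and division of work; 2. The scoring formula Lucene uses for document matching; 3. How did you integrate the PageRank results?  Explain the approach; do not paste program code. 4. How effective was the integration? What further improvements did you try afterwards?  Illustrate with the two evaluation metrics. 5. (Optional part) A brief description of your approach to implementing the language model or LSI.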

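26 Submission Contents (Cont.) Code package:  At minimum, the Lucene jar that incorporates the document ranking algorithm combining VSM and PageRank, with a note listing the files you modified;  If you did a Challenge task, add an extra text file to the code package describing it. Submission format:  Pack the two parts above into a zip or rar archive, named: (group name)_(Project leader student ID).zip(rar)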

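27 Grading Policy Base score: 100  Challenge 1: +30 bonus  Challenge 2: +40 bonus Groups that complete the work independently receive at least 75% of the score. Depending on how well the project is completed, the Project Leader receives a bonus of 0 - 20%.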

28 Any Questions?

29 Online References http://wiki.apache.org/nutch/NutchTutorial http://darksleep.com/lucene http://lucene.apache.org/java/2_1_0/

