Published by Madeleine Bradford. Modified over 9 years ago.
1
WBIA Project 2 – Retrieval & Evaluation. LI Geng, Nov. 10, 2008
2
Guidelines: Information retrieval evaluation (a brief review); Goals of this assignment; Tools & work environment (Nutch-0.9, Lucene-2.1.0); Assignment instructions; Submission & grading policies
3
Previously in Project 1 - Crawling. Tool: Nutch. Target network: ccer.pku.edu.cn. What we already have: a web database that contains the web pages of CCER; an inverted index of your data (you may not have noticed it yet); global PageRank results.
4
Previously in Project 1 (Cont.) What we don't have yet for a complete IR service: interpretation of the user's information need, i.e. mapping a query to web pages (at least to page URLs); an online retrieval service.
5
I. Information Retrieval Evaluation – A Brief Review. Project 2's focus: from query to web page. What do we need to evaluate retrieval results? Retrieval model implementation & optimization; a standard test data set; pre-defined queries and their corresponding answer sets; evaluation with well-known metrics (MAP, P@10, etc.).
6
II. Goals of this Assignment. Set up an online web search engine (using Nutch); understand the information retrieval evaluation process; refine the existing retrieval model (by improving its evaluation metric scores).
7
How? A standard web page test set (done); pre-defined queries and their corresponding answer sets (done); retrieval model implementation; evaluation with well-known metrics (MAP, P@10, etc.).
8
III. Tools & Work Environment. Nutch's major modules: crawling, indexing, retrieval, web search, ... Of these, the indexing and retrieval modules are built on top of Lucene.
9
Lucene: a framework for document retrieval using the Vector Space Model. It provides inverted index construction and query matching.
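To make "inverted index construction" and "query matching" concrete, here is a minimal Java sketch. It is a toy illustration only and does not reflect Lucene's actual data structures; the class name TinyInvertedIndex, the tokenization rule, and the conjunctive (AND) matching are all assumptions chosen for brevity.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.TreeSet;

/** Toy inverted index (hypothetical class, not part of Lucene or Nutch). */
final class TinyInvertedIndex {
    // term -> sorted set of ids of the documents containing it (the posting list)
    private final Map<String, TreeSet<Integer>> postings = new TreeMap<>();

    /** Splits the text on non-alphanumeric characters and records each term's document id. */
    void add(int docId, String text) {
        for (String token : text.toLowerCase().split("[^a-z0-9]+")) {
            if (!token.isEmpty()) {
                postings.computeIfAbsent(token, k -> new TreeSet<>()).add(docId);
            }
        }
    }

    /** Returns the ids of documents that contain every query term. */
    List<Integer> matchAll(String query) {
        TreeSet<Integer> result = null;
        for (String token : query.toLowerCase().split("[^a-z0-9]+")) {
            if (token.isEmpty()) continue;
            TreeSet<Integer> docs = postings.getOrDefault(token, new TreeSet<>());
            if (result == null) {
                result = new TreeSet<>(docs);   // first term: copy its posting list
            } else {
                result.retainAll(docs);         // later terms: intersect posting lists
            }
        }
        return result == null ? new ArrayList<>() : new ArrayList<>(result);
    }

    public static void main(String[] args) {
        TinyInvertedIndex index = new TinyInvertedIndex();
        index.add(1, "China Center for Economic Research");
        index.add(2, "Economic policy seminar");
        index.add(3, "Research seminar schedule");
        System.out.println(index.matchAll("economic research")); // prints [1]
        System.out.println(index.matchAll("seminar"));           // prints [2, 3]
    }
}
```

A real engine additionally stores term frequencies and positions in the posting lists so that matched documents can be ranked rather than just listed.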
10
Lucene (Cont.) It does not handle (from http://darksleep.com/lucene): managing the process (instantiating the objects and hooking them together, both for indexing and for searching); selecting the data files; parsing the data files (e.g. Chinese word segmentation); getting the search string from the user; displaying the search results to the user. It is a "library" rather than a stand-alone application.
11
Lucene (Cont.) But it is a library with useful utilities as standard extensions. E.g. package org.apache.lucene.analysis.standard: default document analysis (and tokenizing) utilities (i.e. they will be used if you don't implement your own).
12
Lucene in Nutch. Lucene is included as a third-party library; try listing the $NUTCH-HOME/lib directory. [Figure: data-flow diagram with the labels Crawled Web Page, org.apache.lucene.analysis, org.apache.lucene.index, org.apache.lucene.search, Inverted Index, HitSet, Web Page, Posting Lists, Matched Documents.]
13
Lucene in Nutch (Cont.) Nutch implements Lucene interfaces and imports Lucene classes in order to reuse Lucene's indexing and retrieval functionality. E.g. in package org.apache.nutch.analysis: public final class NutchDocumentTokenizer extends org.apache.lucene.analysis.Tokenizer implements NutchAnalysisConstants. Refer to these packages for more details: org.apache.nutch.indexer; org.apache.nutch.analysis; org.apache.nutch.searcher. [Figure labels: Index Construction, Retrieval.]
14
Towards a Complete IR Application. Nutch's major modules: crawling; indexing (try listing the root directory of your WebDB: crawldb, indexes, linkdb, segments); retrieval; web search; ...
15
IV. Assignment Instructions. The test set and answer set: taken from one group's previous crawl; will be put online soon. Retrieval: enhance retrieval quality using your PageRank results. Web search: set up an online search engine with Nutch.
16
Step 1 - Web Search Engine Setup. This is the recommended first step in this assignment. It is relatively simple; Nutch's online tutorial ( http://wiki.apache.org/nutch/NutchTutorial ) covers it in enough detail. It will give you an impression of the vector space retrieval model implemented by Lucene. Important: to save time on Nutch configuration, refer to my instructions in addition to the Nutch online tutorial: http://162.105.80.59/WBIA_NutchConfigHelp.txt
17
Step 1 - Web Search Engine Setup (Cont.) Your task: compute the retrieval metrics MAP and P@10 as the baseline for later comparison.
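As a reference for the baseline computation, the following Java sketch shows one common way to compute P@k and MAP from a ranked result list and an answer set. The class and method names are hypothetical, results are identified by URL strings, and each ranked list is assumed to contain no duplicate URLs; the format of the course's actual answer set may differ.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical helper for the standard P@k and MAP definitions. */
final class RetrievalMetrics {
    /** P@k: number of relevant documents among the top k results, divided by k. */
    static double precisionAtK(List<String> ranked, Set<String> relevant, int k) {
        int hits = 0;
        for (int i = 0; i < Math.min(k, ranked.size()); i++) {
            if (relevant.contains(ranked.get(i))) hits++;
        }
        return (double) hits / k;
    }

    /** AP: sum of the precision at each rank holding a relevant document, divided by the number of relevant documents. */
    static double averagePrecision(List<String> ranked, Set<String> relevant) {
        if (relevant.isEmpty()) return 0.0;
        int hits = 0;
        double sum = 0.0;
        for (int i = 0; i < ranked.size(); i++) {
            if (relevant.contains(ranked.get(i))) {
                hits++;
                sum += (double) hits / (i + 1);
            }
        }
        return sum / relevant.size();
    }

    /** MAP: mean of the per-query average precision values. */
    static double meanAveragePrecision(List<List<String>> rankings, List<Set<String>> answers) {
        if (rankings.isEmpty()) return 0.0;
        double total = 0.0;
        for (int q = 0; q < rankings.size(); q++) {
            total += averagePrecision(rankings.get(q), answers.get(q));
        }
        return total / rankings.size();
    }

    public static void main(String[] args) {
        List<String> ranked = Arrays.asList("a", "b", "c", "d");
        Set<String> relevant = new HashSet<>(Arrays.asList("a", "c"));
        // Relevant documents sit at ranks 1 and 3, so AP = (1/1 + 2/3) / 2.
        System.out.println(averagePrecision(ranked, relevant));
        System.out.println(precisionAtK(ranked, relevant, 10));
    }
}
```

Note that P@10 divides by 10 even when fewer than 10 results are returned, so short result lists are penalized.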
18
Step 2 – Lucene Retrieval Ranking Analysis. Entry point: class org.apache.lucene.search.IndexSearcher. (Hint) Related classes, for reference: org.apache.lucene.search.BooleanQuery and org.apache.lucene.search.BooleanQuery.BooleanWeight.
19
Step 2 – Lucene Retrieval Ranking Analysis (Cont.) Your task: figure out the formula Lucene uses to compute document scores.
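As a starting point for checking your own derivation, the Similarity Javadoc of later Lucene 2.x releases summarizes the default scoring roughly as below. Treat this as an outline under that assumption rather than as the answer: the exact factors in Lucene-2.1.0 should be confirmed by reading the source of the classes named above.

```latex
\[
\mathrm{score}(q,d) \;=\; \mathrm{coord}(q,d)\cdot\mathrm{queryNorm}(q)\cdot
\sum_{t \in q} \mathrm{tf}(t,d)\cdot\mathrm{idf}(t)^{2}\cdot\mathrm{boost}(t)\cdot\mathrm{norm}(t,d)
\]
\[
\mathrm{tf}(t,d)=\sqrt{\mathrm{freq}(t,d)},\qquad
\mathrm{idf}(t)=1+\ln\frac{N}{\mathrm{df}(t)+1}
\]
```

Here coord(q,d) rewards documents that match more of the query terms, queryNorm(q) makes scores comparable across queries, boost(t) is the query-time term boost, and norm(t,d) combines index-time boosts with a length normalization of the field. N is the number of documents in the index and df(t) the number of documents containing t; the tf and idf forms shown are those of DefaultSimilarity.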
20
Step 3 – Integrate PageRank Results with VSM. Your task: figure out a way to combine the PageRank and VSM scores effectively to enhance retrieval quality. Any ideas now? Required coding: edit package org.apache.lucene.search.
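One possible starting idea, sketched below in plain Java, is a linear interpolation between the VSM score and a log-damped PageRank value. This is only one of many options and is not prescribed by the assignment; the class PageRankReranker, the Hit type, the weight lambda, and the log damping are all assumptions made for illustration, and the sketch works on already-retrieved hits instead of inside org.apache.lucene.search.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Hypothetical re-ranker: combines a VSM score with PageRank outside of Lucene. */
final class PageRankReranker {
    /** A retrieved page with its text-match score and its global PageRank. */
    static final class Hit {
        final String url;
        final double vsmScore;
        final double pageRank;

        Hit(String url, double vsmScore, double pageRank) {
            this.url = url;
            this.vsmScore = vsmScore;
            this.pageRank = pageRank;
        }
    }

    /** Linear interpolation; lambda = 0 ignores PageRank, lambda = 1 ignores the VSM score. */
    static double combine(double vsmScore, double pageRank, double lambda) {
        if (lambda < 0.0 || lambda > 1.0) {
            throw new IllegalArgumentException("lambda must be in [0, 1]");
        }
        double prior = Math.log1p(pageRank); // log(1 + PR) damps the heavy tail of PageRank values
        return (1.0 - lambda) * vsmScore + lambda * prior;
    }

    /** Returns a new list sorted by descending combined score; the input list is not modified. */
    static List<Hit> rerank(List<Hit> hits, double lambda) {
        List<Hit> sorted = new ArrayList<>(hits);
        sorted.sort(Comparator.comparingDouble(
                (Hit h) -> combine(h.vsmScore, h.pageRank, lambda)).reversed());
        return sorted;
    }

    public static void main(String[] args) {
        List<Hit> hits = new ArrayList<>();
        hits.add(new Hit("http://example.org/a", 0.9, 0.0));
        hits.add(new Hit("http://example.org/b", 0.8, 5.0));
        for (Hit h : rerank(hits, 0.5)) {
            System.out.println(h.url);
        }
    }
}
```

The weight lambda is a tuning parameter: sweeping it and recomputing MAP and P@10 for each value is one way to judge whether the combination helps.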
21
Step 4 – Re-evaluate and Improve. Based on your new model and retrieval results, recompute MAP and P@10. Compare the newly computed values with the previous ones, and go back to Step 3 if there is still room for improvement.
22
Challenge Task 1. Edit Lucene to implement the language model (then repeat the evaluation process and compare the results with VSM + PageRank). Hint: find out how Lucene stores and reads the posting lists, and figure out a way to use the data in them for LM similarity computation. Alternatively, you may consider reformatting the posting list store and inserting additional useful information.
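To show which statistics an LM similarity needs from the posting lists, here is a stand-alone Java sketch of query-likelihood scoring with Dirichlet smoothing. The slides do not say which language model variant is intended, so the choice of Dirichlet smoothing, the class name DirichletLm, and the parameter mu are assumptions; the sketch works on in-memory token lists, not on Lucene's index.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical query-likelihood scorer with Dirichlet smoothing (not Lucene code). */
final class DirichletLm {
    private final Map<String, Long> collectionFreq = new HashMap<>(); // term -> count in the whole collection
    private long collectionLength = 0;                                // total number of tokens
    private final double mu;                                          // smoothing parameter

    DirichletLm(List<List<String>> docs, double mu) {
        this.mu = mu;
        for (List<String> doc : docs) {
            for (String term : doc) {
                collectionFreq.merge(term, 1L, Long::sum);
                collectionLength++;
            }
        }
    }

    /** log P(q|d) = sum over query terms of log((tf(t,d) + mu * P(t|C)) / (|d| + mu)). */
    double score(List<String> query, List<String> doc) {
        Map<String, Integer> tf = new HashMap<>();
        for (String term : doc) tf.merge(term, 1, Integer::sum);

        double logProb = 0.0;
        for (String term : query) {
            long cf = collectionFreq.getOrDefault(term, 0L);
            if (cf == 0) continue; // term unseen in the collection: skipped to avoid log(0)
            double pCollection = (double) cf / collectionLength;
            int freq = tf.getOrDefault(term, 0);
            logProb += Math.log((freq + mu * pCollection) / (doc.size() + mu));
        }
        return logProb;
    }

    public static void main(String[] args) {
        List<String> d1 = Arrays.asList("nutch", "search", "engine");
        List<String> d2 = Arrays.asList("economic", "research", "center");
        DirichletLm lm = new DirichletLm(Arrays.asList(d1, d2), 10.0);
        List<String> query = Arrays.asList("search");
        System.out.println(lm.score(query, d1) > lm.score(query, d2)); // prints true
    }
}
```

The per-document term frequency, the document length, and the collection-wide term frequency are the three quantities the scorer reads, which is why the hint points at what the posting lists store.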
23
Challenge Task 2. Implement LSI (Latent Semantic Indexing) and evaluate it. In this case, could Lucene's document scoring module still be reused? ...
24
V. Submission & Grading. Deadline: 12.3 23:59. The Challenge tasks are optional.
25
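What to Submit. A project report containing the following parts: 1. Group members and division of work; 2. The scoring formula Lucene uses for document matching; 3. How did you integrate the PageRank results? Explain the approach; do not paste program code. 4. How effective was the integration, and what further improvements did you try afterwards? Support this with the two evaluation metrics. 5. (Optional part) A brief description of your approach to implementing the language model or LSI.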
26
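What to Submit (Cont.) A code package: it must include at least the lucene jar containing the document ranking algorithm that combines VSM and PageRank, with a note on which files were modified. If you did a Challenge, add an extra explanatory text file to the code package. Submission format: pack the two parts above into a zip or rar archive, named (group name)_(Project leader student ID).zip(rar)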
27
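Grading Policy. Base score: 100. Challenge 1: +30 bonus. Challenge 2: +40 bonus. Groups that complete the project independently will receive at least 75% of the score. Depending on how well the project is completed, the Project Leader receives a 0-20% bonus.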
28
Any Questions?
29
Online References http://wiki.apache.org/nutch/NutchTutorial http://darksleep.com/lucene http://lucene.apache.org/java/2_1_0/