
1 Retrieval Evaluation. Modern Information Retrieval, Chapter 3, Ricardo Baeza-Yates and Berthier Ribeiro-Neto; 圖書與資訊學刊, No. 29 (May 1999); NTU Graduate Institute of Library and Information Science master's thesis, 江玉婷, 陳光華.

2 Outline
Introduction
Retrieval Performance Evaluation
  - Recall and precision
  - Alternative measures
Reference Collections
  - TREC Collection
  - CACM & ISI Collections
  - CF Collection
Trends and Research Issues

3 Introduction
Types of evaluation
  - Functional analysis phase and error analysis phase
  - Performance evaluation
Performance evaluation
  - Response time and space required
Retrieval performance evaluation
  - The evaluation of how precise the answer set is

4 Retrieval Performance Evaluation
Evaluates IR systems driven by batch queries.
Given a collection, a query with relevant document set R, and an answer set A (sorted by relevance), let Ra be the relevant documents that appear in the answer set:
  - Recall = |Ra| / |R|
  - Precision = |Ra| / |A|
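A minimal sketch of these two ratios in Python, assuming the relevant set and the answer set are available as plain sets of document ids (the function name and sample data are illustrative, not from the chapter):

```python
# Recall and precision for a single query, given the set of relevant
# documents R and the answer set A returned by the system.

def recall_precision(relevant, answer):
    """Return (recall, precision) for one query."""
    ra = relevant & answer              # relevant documents in the answer set
    recall = len(ra) / len(relevant)    # |Ra| / |R|
    precision = len(ra) / len(answer)   # |Ra| / |A|
    return recall, precision

# Example: 10 relevant documents, 15 retrieved, 5 of them relevant.
R = {3, 5, 9, 25, 39, 44, 56, 71, 89, 123}
A = {123, 84, 56, 6, 8, 9, 511, 129, 187, 25, 38, 48, 250, 11, 3}
print(recall_precision(R, A))           # (0.5, 0.3333...)
```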

5 Precision versus recall curve
Rq = {d3, d5, d9, d25, d39, d44, d56, d71, d89, d123}
Ranking for query q (relevant documents marked with *):
  1. d123*  2. d84  3. d56*  4. d6  5. d8  6. d9*  7. d511  8. d129  9. d187  10. d25*  11. d38  12. d48  13. d250  14. d11  15. d3*
P = 100% at R = 10%, P = 66% at R = 20%, P = 50% at R = 30%.
Usually based on 11 standard recall levels: 0%, 10%, ..., 100%.
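The (recall, precision) points behind such a curve can be computed by walking down the ranking and taking a measurement each time a relevant document is seen. A small sketch, reusing the ranking and Rq listed on this slide (the function name is illustrative):

```python
def pr_points(ranking, relevant):
    """Return (recall, precision) after each relevant document seen in the ranking."""
    seen_relevant = 0
    points = []
    for i, doc in enumerate(ranking, start=1):
        if doc in relevant:
            seen_relevant += 1
            points.append((seen_relevant / len(relevant), seen_relevant / i))
    return points

ranking = [123, 84, 56, 6, 8, 9, 511, 129, 187, 25, 38, 48, 250, 11, 3]
Rq = {3, 5, 9, 25, 39, 44, 56, 71, 89, 123}
print(pr_points(ranking, Rq))
# [(0.1, 1.0), (0.2, 0.666...), (0.3, 0.5), (0.4, 0.4), (0.5, 0.333...)]
```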

6 Precision versus recall curve for a single query (Fig. 3.2).

7 Average Over Multiple Queries
$\overline{P}(r) = \sum_{i=1}^{N_q} \frac{P_i(r)}{N_q}$
where $\overline{P}(r)$ is the average precision at recall level r, $N_q$ is the number of queries used, and $P_i(r)$ is the precision at recall level r for the i-th query.
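A small sketch of this query-by-query average, assuming each query's precision values at the 11 standard recall levels are already available as a list (the numbers below are made up for illustration):

```python
def average_precision_at_levels(per_query_curves):
    """Average precision level-by-level over N_q queries.

    per_query_curves: list of lists, one list of 11 precision values
    (at recall 0%, 10%, ..., 100%) per query.
    """
    n_q = len(per_query_curves)
    return [sum(curve[level] for curve in per_query_curves) / n_q
            for level in range(11)]

# Two hypothetical queries:
q1 = [1.0, 1.0, 0.66, 0.5, 0.4, 0.33, 0.0, 0.0, 0.0, 0.0, 0.0]
q2 = [1.0, 0.5, 0.5, 0.33, 0.25, 0.25, 0.2, 0.2, 0.2, 0.2, 0.2]
print(average_precision_at_levels([q1, q2]))
```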

8 Interpolated precision
Rq = {d3, d56, d129}
Ranking for query q (relevant documents marked with *):
  1. d123  2. d84  3. d56*  4. d6  5. d8  6. d9  7. d511  8. d129*  9. d187  10. d25  11. d38  12. d48  13. d250  14. d11  15. d3*
Observed precision: P = 33% at R = 33%, P = 25% at R = 66%, P = 20% at R = 100%.
Interpolation rule: $P(r_j) = \max_{r_j \le r \le r_{j+1}} P(r)$

9 Interpolated precision
Let $r_j$, $j \in \{0, 1, 2, \ldots, 10\}$, be a reference to the j-th standard recall level.
$P(r_j) = \max_{r_j \le r \le r_{j+1}} P(r)$
For the example above: at R = 30%, the interpolated precision ($P_3(r)$ through $P_4(r)$) is 33%; at R = 40%, 50%, and 60% ($P_4(r)$ through $P_7(r)$) it is 25%.
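A sketch of the interpolation step. Instead of the pairwise rule above, it uses the common monotone variant that takes the maximum precision observed at any recall point at or above $r_j$, which reproduces the figures quoted on this slide (the function name is illustrative):

```python
def interpolated_precision(points, levels=11):
    """Interpolate observed (recall, precision) points to standard recall levels.

    Uses the usual monotone rule: precision at level r_j is the maximum
    precision observed at any recall >= r_j (0 if none).
    """
    result = []
    for j in range(levels):
        r_j = j / (levels - 1)
        candidates = [p for r, p in points if r >= r_j]
        result.append(max(candidates, default=0.0))
    return result

# Observed points for Rq = {d3, d56, d129} and the ranking on this slide:
points = [(1/3, 1/3), (2/3, 0.25), (1.0, 0.2)]
print([round(p, 2) for p in interpolated_precision(points)])
# [0.33, 0.33, 0.33, 0.33, 0.25, 0.25, 0.25, 0.2, 0.2, 0.2, 0.2]
```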

10 Average recall vs. precision figure

11 Single Value Summaries
Average precision versus recall:
  - Compares retrieval algorithms over a set of example queries
Sometimes we need to compare performance on individual queries:
  - Averaged precision figures may hide anomalous behaviour of an algorithm
  - We may need to know how two algorithms perform on one particular query
Need a single value summary:
  - The single value should be interpreted as a summary of the corresponding precision versus recall curve

12 Single Value Summaries
Average Precision at Seen Relevant Documents
  - Average the precision figures obtained after each new relevant document is observed.
  - Example (Figure 3.2): (1 + 0.66 + 0.5 + 0.4 + 0.3) / 5 = 0.57
  - This measure favours systems that find relevant documents quickly: the earlier a relevant document is ranked, the higher its precision value.
R-Precision
  - The precision at the R-th position in the ranking
  - R: the total number of relevant documents for the current query (the size of Rq)
  - Fig. 3.2: R = 10, value = 0.4
  - Fig. 3.3: R = 3, value = 0.33
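A sketch of both single-value summaries, reusing the Fig. 3.2 ranking and relevant set from the earlier slides (function names are illustrative):

```python
def avg_precision_at_seen_relevant(ranking, relevant):
    """Average of the precision values observed at each relevant document."""
    precisions = []
    seen = 0
    for i, doc in enumerate(ranking, start=1):
        if doc in relevant:
            seen += 1
            precisions.append(seen / i)
    return sum(precisions) / len(precisions) if precisions else 0.0

def r_precision(ranking, relevant):
    """Precision after examining the first R documents, R = |relevant|."""
    r = len(relevant)
    return sum(1 for doc in ranking[:r] if doc in relevant) / r

ranking = [123, 84, 56, 6, 8, 9, 511, 129, 187, 25, 38, 48, 250, 11, 3]
Rq = {3, 5, 9, 25, 39, 44, 56, 71, 89, 123}
print(avg_precision_at_seen_relevant(ranking, Rq))  # ≈ 0.58 (the slide's 0.57 rounds the last term, 5/15, to 0.3)
print(r_precision(ranking, Rq))                     # 4/10 = 0.4
```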

13 Precision Histograms
Use the R-precision measures to compare the retrieval history of two algorithms through visual inspection.
$RP_{A/B}(i) = RP_A(i) - RP_B(i)$
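A sketch of the quantity plotted in each histogram bar, assuming the per-query R-precision values for both algorithms have already been computed (the numbers below are illustrative):

```python
def r_precision_differences(rp_a, rp_b):
    """RP_{A/B}(i) = RP_A(i) - RP_B(i) for each query i.

    Positive bars favour algorithm A, negative bars favour algorithm B.
    """
    return [a - b for a, b in zip(rp_a, rp_b)]

rp_a = [0.4, 0.6, 0.2, 0.8]   # R-precision of algorithm A on queries 1..4
rp_b = [0.3, 0.7, 0.2, 0.5]   # R-precision of algorithm B on the same queries
print([round(d, 2) for d in r_precision_differences(rp_a, rp_b)])   # [0.1, -0.1, 0.0, 0.3]
```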

14 Summary Table Statistics
Collect the single value summaries for all the queries in one table, for example:
  - the number of queries,
  - the total number of documents retrieved by all queries,
  - the total number of relevant documents effectively retrieved when all queries are considered,
  - the total number of relevant documents that could have been retrieved by all queries, ...

15 Applicability of Precision and Recall
Estimating maximum recall requires knowledge of all the documents in the collection that are relevant to the query.
Recall and precision are relative measures; they are more meaningful when used together.
Measures which quantify the informativeness of the retrieval process might be more appropriate in settings such as interactive retrieval.
Recall and precision are easy to define only when a linear ordering of the retrieved documents is enforced.

16 Alternative Measures
The harmonic mean F of recall and precision, which lies between 0 and 1:
  $F(j) = \frac{2}{\frac{1}{r(j)} + \frac{1}{P(j)}}$
The E measure, which adds a user preference weight b between recall and precision:
  $E(j) = 1 - \frac{1 + b^2}{\frac{b^2}{r(j)} + \frac{1}{P(j)}}$
  - b = 1: E(j) = 1 - F(j)
  - b > 1: the user is more interested in precision
  - b < 1: the user is more interested in recall
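A sketch of both measures at a single position j in the ranking, following the formulas above (edge cases where recall or precision is 0 are handled explicitly; the input values are illustrative):

```python
def harmonic_mean(recall, precision):
    """F(j) = 2 / (1/r(j) + 1/P(j)); 0 when either value is 0."""
    if recall == 0 or precision == 0:
        return 0.0
    return 2 / (1 / recall + 1 / precision)

def e_measure(recall, precision, b=1.0):
    """E(j) = 1 - (1 + b^2) / (b^2 / r(j) + 1 / P(j))."""
    if recall == 0 or precision == 0:
        return 1.0
    return 1 - (1 + b ** 2) / (b ** 2 / recall + 1 / precision)

print(harmonic_mean(0.5, 0.33))        # ≈ 0.40
print(e_measure(0.5, 0.33, b=1.0))     # ≈ 0.60, i.e. 1 - F
```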

17 User-Oriented Measures
Assumption: relevance depends on the user; different users may have different sets of relevant documents for the same query.
Let U be the set of relevant documents known to the user, Rk the retrieved relevant documents the user already knew, and Ru the retrieved relevant documents previously unknown to the user:
  - Coverage = |Rk| / |U|
  - Novelty = |Ru| / (|Ru| + |Rk|)
Higher coverage: the system finds more of the documents the user expected to see.
Higher novelty: the system reveals more relevant documents the user did not previously know about.
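A sketch of the two user-oriented ratios, assuming the user's known relevant documents, the full relevant set, and the answer set are available as sets of document ids (all data below is illustrative):

```python
def coverage_and_novelty(known_relevant, answer, all_relevant):
    """Coverage = |Rk|/|U|, Novelty = |Ru|/(|Ru| + |Rk|)."""
    retrieved_relevant = answer & all_relevant
    rk = retrieved_relevant & known_relevant     # relevant docs retrieved that the user knew
    ru = retrieved_relevant - known_relevant     # relevant docs retrieved that were new to the user
    coverage = len(rk) / len(known_relevant)
    novelty = len(ru) / (len(ru) + len(rk)) if (ru or rk) else 0.0
    return coverage, novelty

U = {1, 2, 3, 4}                      # relevant docs the user already knew
relevant = {1, 2, 3, 4, 7, 9}         # all relevant docs for the query
answer = {1, 2, 7, 8, 10}             # documents retrieved by the system
print(coverage_and_novelty(U, answer, relevant))   # (0.5, 0.333...)
```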

18 Reference Collections
Reference test collections used to evaluate IR systems:
  - TIPSTER/TREC: large, intended for experimentation
  - CACM, ISI: of historical significance
  - Cystic Fibrosis: a small collection whose relevant documents were determined by experts after discussion

19 Criticisms of IR Systems
Lack of a solid formal framework as a basic foundation
  - No real solution: whether a document is relevant to a query is inherently subjective.
Lack of robust and consistent testbeds and benchmarks
  - Early on, only small-scale experimental test collections were developed.
  - After 1990, TREC was established; it gathered a large document collection and made it available to research groups for evaluating IR systems.

20 TREC (Text REtrieval Conference)
Initiated under the National Institute of Standards and Technology (NIST)
Goals:
  - Provide a large test collection
  - Uniform scoring procedures
  - A forum
7th TREC conference in 1998:
  - Document collection: test collections, example information requests (topics), relevant documents
  - The benchmark tasks

21 The Document Collection
Documents are marked up with SGML. Example document (WSJ880406-0090):
  WSJ880406-0090
  AT&T Unveils Services to Upgrade Phone Networks Under Global Plan
  Janet Guyon (WSJ Staff), New York
  American Telephone & Telegraph Co. introduced the first of a new generation of phone services with broad...

22 TREC1-6 Documents

23 The Example Information Requests (Topics)
The information need is described in natural language.
Topic numbers identify the different sets of topics. Example:
  Number: 168
  Topic: Financing AMTRAK
  Description: .....
  Narrative: A .....

24 TREC ~ Topics
Topic structure and length
Topic construction
Topic selection
  - pre-search
  - estimating the number of relevant documents

25 Topic selection procedure in TREC-6

26 TREC ~ Relevance Judgments
Judgment methods
  - pooling method
  - manual judgment
Judgment criterion: binary, relevant or not relevant
Quality of the relevance judgments
  - completeness
  - consistency

27 The Pooling Method
For each query topic, the top n (= 100) documents are taken from the results submitted by each participating system and merged into a single pool.
This pool is treated as the candidate set of possibly relevant documents for that topic; after duplicates are removed, it is sent back to the original author of the topic for relevance judgment.
The idea is to exploit many different systems and retrieval techniques to capture as many of the possibly relevant documents as possible, while reducing the burden of manual judgment.
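A minimal sketch of the pooling step, assuming each participating run is simply an ordered list of document ids and leaving out the subsequent manual judgment (names and data are illustrative):

```python
def build_pool(runs, n=100):
    """Merge the top-n documents of each system run into one duplicate-free pool.

    runs: list of ranked document-id lists, one per participating system.
    The pool is then sent to the topic author for manual relevance judgment.
    """
    pool = set()
    for run in runs:
        pool.update(run[:n])
    return pool

run_a = ["d12", "d7", "d3", "d44"]
run_b = ["d7", "d99", "d12", "d5"]
print(sorted(build_pool([run_a, run_b], n=3)))   # ['d12', 'd3', 'd7', 'd99']
```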

28 TREC: comparison table of the candidate pools versus the actual relevant documents

29 The (Benchmark) Tasks at the TREC Conferences
Ad hoc task
  - Receive new requests and execute them against a pre-specified document collection
Routing task
  - Receive test information requests and two document collections
  - first collection: for training and tuning the retrieval algorithm
  - second collection: for testing the tuned retrieval algorithm

30 Other tasks:
  - *Chinese
  - Filtering
  - Interactive
  - *NLP (natural language processing)
  - Cross languages
  - High precision
  - Spoken document retrieval
  - Query task (TREC-7)

31 TREC ~ Evaluation results

32 TREC ~ Criticisms and Negative Evaluations
Test collections
  - Query topics: not real user needs, too artificial; lack any description of the need's context
  - Relevance judgments: binary judgments are unrealistic; the pooling method can miss relevant documents, making recall estimates inaccurate; quality and consistency issues
Effectiveness measurement
  - Focuses only on quantitative measures
  - Problems with recall
  - Suitable for comparing systems, but not for evaluating a single system in absolute terms

33 TREC ~ Criticisms and Negative Evaluations (continued)
Evaluation procedure
  - Interactive retrieval: lack of user involvement
  - Static information needs are unrealistic

34 Evaluation Measures at the TREC Conferences
  - Summary table statistics
  - Recall-precision averages
  - Document level averages*
  - Average precision histogram

35 The CACM Collection
A small collection of computer science literature.
Text of each document plus structured subfields:
  - word stems from the title and abstract sections
  - categories
  - direct references between articles: a list of pairs of documents [d_a, d_b]
  - bibliographic coupling connections: a list of triples [d_1, d_2, n_cited]
  - number of co-citations for each pair of articles: [d_1, d_2, n_citing]
A unique environment for testing retrieval algorithms based on information derived from cross-citing patterns (a small representation sketch follows below).
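A sketch, under illustrative data, of how the pair and triple structures above might be represented; the helper recomputes bibliographic coupling strength from the direct-reference pairs:

```python
# Direct references: pairs [d_a, d_b] meaning document d_a cites document d_b.
direct_references = [(1, 5), (1, 9), (2, 5)]

# Bibliographic coupling: triples [d1, d2, n_cited]; d1 and d2 cite n_cited
# documents in common.
bibliographic_coupling = [(1, 2, 1)]     # documents 1 and 2 both cite document 5

# Co-citations: triples [d1, d2, n_citing]; d1 and d2 are cited together by
# n_citing documents.
co_citations = [(5, 9, 1)]               # documents 5 and 9 are both cited by document 1

def coupling_strength(references, a, b):
    """Number of documents cited by both a and b (bibliographic coupling)."""
    cited_by_a = {d for (src, d) in references if src == a}
    cited_by_b = {d for (src, d) in references if src == b}
    return len(cited_by_a & cited_by_b)

print(coupling_strength(direct_references, 1, 2))   # 1 (the shared reference, document 5)
```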

36 The ISI Collection
The ISI test collection was assembled from documents used earlier by Small at ISI (the Institute for Scientific Information).
Most of the documents were selected from the cross-citation study in Small's original project.
It was built to support similarity studies based on terms and on cross-citation patterns.

37 The Cystic Fibrosis Collection
Documents about cystic fibrosis.
Topics and relevance judgments were produced by experts with clinical or research experience in the area.
Relevance scores:
  - 0: non-relevant
  - 1: marginally relevant
  - 2: highly relevant

38 Characteristics of the CF Collection
All relevance scores were assigned by experts.
A good number of information requests (relative to the collection size).
The respective query vectors overlap among themselves:
  - earlier queries can be used to improve the effectiveness of later retrievals.

39 Trends and Research Issues
Interactive user interfaces
  - It is generally agreed that retrieval with user feedback improves effectiveness.
  - How should evaluation measures be defined in this interactive setting?
Research on evaluation measures other than precision and recall.

