1
Retrieval Evaluation Modern Information Retrieval, Chapter 3
Ricardo Baeza-Yates, Berthier Ribeiro-Neto
2
Outline
Introduction
Retrieval Performance Evaluation: recall and precision; alternative measures
Reference Collections: the TREC collection; the CACM and ISI collections; the CF collection
3
Introduction: Types of Evaluation
Functional analysis (including an error analysis phase): does the system work properly, with no errors?
Performance evaluation: response time and space required.
Retrieval performance evaluation (the focus here): how precise is the answer set?
4
Retrieval Performance Evaluation
In batch mode, a query is run against the document collection. Let R be the set of relevant documents, A the answer set (sorted by relevance), and Ra the set of relevant documents in the answer set.
Recall = |Ra| / |R|, Precision = |Ra| / |A|.
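As a minimal sketch (not part of the original slides), the two measures can be computed directly from a relevant set and an answer set; the document identifiers below are hypothetical.

```python
# Minimal sketch, not from the book: recall and precision for one query,
# given the relevant set R and the answer set A (doc ids are made up).

def recall_precision(relevant: set, answer: list) -> tuple:
    ra = [d for d in answer if d in relevant]   # Ra = relevant docs in the answer set
    return len(ra) / len(relevant), len(ra) / len(answer)   # (|Ra|/|R|, |Ra|/|A|)

R = {"d3", "d56", "d129"}                 # hypothetical relevant documents
A = ["d123", "d84", "d56", "d6", "d8"]    # hypothetical ranked answer set
print(recall_precision(R, A))             # (0.333..., 0.2)
```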
5
Precision and Recall
Precision: the ratio of the number of relevant documents retrieved to the total number of documents retrieved (the fraction of the hits that are relevant).
Recall: the ratio of the number of relevant documents retrieved to the total number of relevant documents (the fraction of the relevant documents that are hits).
6
Precision and Recall
Figure: the retrieved documents plotted against the relevant document space, illustrating the four possible combinations: low precision/low recall, high precision/low recall, low precision/high recall, and high precision/high recall.
7
Precision and Recall
In the information space, let A be the set of retrieved documents, R the set of relevant documents, and RA = R ∩ A.
Recall = |RA| / |R|, Precision = |RA| / |A|.
The user is not usually presented with the whole answer set A at once: the documents in A are sorted by degree of relevance (a ranking) which the user examines in order, so recall and precision vary as the user proceeds through the answer set A, as discussed in the following slides.
8
Precision and Recall Trade-Off
Increasing the number of documents retrieved is likely to retrieve more of the relevant documents, and thus increase recall, but it typically also retrieves more non-relevant documents, and thus decreases precision.
9
Precision versus recall curve
Example: Rq = {d3, d5, d9, d25, d39, d44, d56, d71, d89, d123}, so |Rq| = 10.
Ranking for query q (relevant documents marked with *):
1. d123*  2. d84  3. d56*  4. d6  5. d8  6. d9*  7. d511  8. d129  9. d187  10. d25*  11. d38  12. d48  13. d250  14. d11  15. d3*
P = 100% at R = 10% (d123 at rank 1), P = 66% at R = 20% (d56 at rank 3), P = 50% at R = 30% (d9 at rank 6), and so on.
The curve is usually plotted at 11 standard recall levels: 0%, 10%, ..., 100%.
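A small sketch, not taken from the book, that reproduces the precision/recall figures quoted on this slide; the helper name precision_recall_points is made up.

```python
# Walk down the ranking and record (recall, precision) each time a
# relevant document appears.

def precision_recall_points(relevant, ranking):
    points, hits = [], 0
    for i, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            points.append((hits / len(relevant), hits / i))
    return points

Rq = {"d3", "d5", "d9", "d25", "d39", "d44", "d56", "d71", "d89", "d123"}
ranking = ["d123", "d84", "d56", "d6", "d8", "d9", "d511", "d129",
           "d187", "d25", "d38", "d48", "d250", "d11", "d3"]
for r, p in precision_recall_points(Rq, ranking):
    print(f"R = {r:.0%}  P = {p:.0%}")
# R = 10% P = 100%, R = 20% P = 67%, R = 30% P = 50%,
# R = 40% P = 40%, R = 50% P = 33%
```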
10
Precision versus recall curve for a single query (Figure 3.2).
11
Average Over Multiple Queries
P(r) = Σ_{i=1}^{Nq} Pi(r) / Nq
where P(r) is the average precision at recall level r, Nq is the number of queries used, and Pi(r) is the precision at recall level r for the i-th query.
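A brief sketch of this averaging step (the curves below are hypothetical, just to show the element-wise mean over queries).

```python
# Average precision over N_q queries at each of the 11 standard recall
# levels: P_bar(r) = sum_i P_i(r) / N_q. Each curve is a list of 11 values.

def average_curve(per_query_curves):
    n_q = len(per_query_curves)
    return [sum(curve[j] for curve in per_query_curves) / n_q for j in range(11)]

# Hypothetical interpolated curves for two queries:
q1 = [1.0, 1.0, 0.67, 0.5, 0.5, 0.4, 0.33, 0.33, 0.2, 0.2, 0.2]
q2 = [0.8, 0.8, 0.6, 0.6, 0.4, 0.4, 0.4, 0.3, 0.3, 0.1, 0.1]
print(average_curve([q1, q2]))   # element-wise mean of the two curves
```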
12
Interpolated precision
Example: Rq = {d3, d56, d129}, so |Rq| = 3.
Ranking for query q (relevant documents marked with *):
1. d123  2. d84  3. d56*  4. d6  5. d8  6. d9  7. d511  8. d129*  9. d187  10. d25  11. d38  12. d48  13. d250  14. d11  15. d3*
P = 33% at R = 33% (d56 at rank 3), P = 25% at R = 66% (d129 at rank 8), P = 20% at R = 100% (d3 at rank 15).
Interpolation rule: P(rj) = max over rj ≤ r ≤ rj+1 of P(r).
13
Interpolated precision
Let rj, j ∈ {0, 1, 2, ..., 10}, denote the j-th standard recall level (r0 = 0%, r1 = 10%, ..., r10 = 100%).
Interpolated precision at level rj: P(rj) = max over rj ≤ r ≤ rj+1 of P(r).
Applying the rule to the example above, the interpolated precision is 33% at recall levels 0% through 30%, 25% at 40% through 60%, and 20% at 70% through 100%.
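A sketch of 11-point interpolated precision. It uses the common TREC-style rule (the maximum precision observed at any recall greater than or equal to rj), which yields the same values as the slide's example; the function name is illustrative.

```python
# Sketch of 11-point interpolated precision, assuming TREC-style
# interpolation: P_interp(r_j) = max precision observed at any recall >= r_j.

def interpolated_11pt(relevant, ranking):
    # Collect (recall, precision) at every relevant document in the ranking.
    points, hits = [], 0
    for i, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            points.append((hits / len(relevant), hits / i))
    # Interpolate at the 11 standard recall levels 0.0, 0.1, ..., 1.0.
    interp = []
    for j in range(11):
        level = j / 10
        candidates = [p for r, p in points if r >= level]
        interp.append(max(candidates) if candidates else 0.0)
    return interp

Rq = {"d3", "d56", "d129"}
ranking = ["d123", "d84", "d56", "d6", "d8", "d9", "d511", "d129",
           "d187", "d25", "d38", "d48", "d250", "d11", "d3"]
print(interpolated_11pt(Rq, ranking))
# 0.33 at recall levels 0%-30%, 0.25 at 40%-60%, 0.20 at 70%-100%
```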
14
Average precision versus recall figures are used to compare the retrieval performance of several algorithms.
15
Single Value Summaries
Average precision versus recall figures compare retrieval algorithms over a set of example queries. Sometimes, however, we need to compare retrieval performance on individual queries; for this we need a single value summary, which should be interpreted as a summary of the corresponding precision versus recall curve.
16
Single Value Summaries
Average Precision at Seen Relevant Documents: average the precision figures obtained after each new relevant document is observed in the ranking.
Example (Figure 3.2): averaging the precision values at the five relevant documents retrieved gives roughly 0.57.
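A sketch of this single value summary, following the slide's convention of dividing by the number of relevant documents actually seen (the helper name is made up).

```python
# Average precision at seen relevant documents: average the precision
# obtained each time a new relevant document shows up in the ranking
# (the example of Figure 3.2, reproduced here).

def avg_precision_at_seen_relevant(relevant, ranking):
    hits, precisions = 0, []
    for i, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            precisions.append(hits / i)
    return sum(precisions) / len(precisions) if precisions else 0.0

Rq = {"d3", "d5", "d9", "d25", "d39", "d44", "d56", "d71", "d89", "d123"}
ranking = ["d123", "d84", "d56", "d6", "d8", "d9", "d511", "d129",
           "d187", "d25", "d38", "d48", "d250", "d11", "d3"]
print(round(avg_precision_at_seen_relevant(Rq, ranking), 2))
# 0.58 with exact fractions (the slide reports 0.57, rounding each precision first)
```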
17
Single Value Summaries
R-Precision: the precision at the R-th position in the ranking, where R is the total number of relevant documents for the current query (the total number in Rq).
Figure 3.2: R = 10, R-precision = 0.4. Figure 3.3: R = 3, R-precision ≈ 0.33.
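A short sketch of R-precision, checked against the two figures quoted above (the function name is illustrative).

```python
# R-precision: precision at position R, where R = |Rq| is the total
# number of relevant documents for the query.

def r_precision(relevant, ranking):
    r = len(relevant)
    return sum(1 for doc in ranking[:r] if doc in relevant) / r

ranking = ["d123", "d84", "d56", "d6", "d8", "d9", "d511", "d129",
           "d187", "d25", "d38", "d48", "d250", "d11", "d3"]
print(r_precision({"d3", "d5", "d9", "d25", "d39", "d44", "d56",
                   "d71", "d89", "d123"}, ranking))   # Fig 3.2: 4/10 = 0.4
print(r_precision({"d3", "d56", "d129"}, ranking))    # Fig 3.3: 1/3 ~ 0.33
```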
18
Single Value Summaries
Precision Histograms: use the R-precision measures of two algorithms over several queries to compare their retrieval performance by visual inspection.
RPA/B(i) = RPA(i) - RPB(i), the difference in R-precision for the i-th query; a positive value means algorithm A outperforms B on that query, a negative value the opposite.
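A minimal sketch of the quantity plotted in a precision histogram; the R-precision values below are hypothetical.

```python
# Per-query R-precision difference RP_A/B(i) = RP_A(i) - RP_B(i);
# positive bars favour algorithm A, negative bars favour algorithm B.

def rp_differences(rp_a, rp_b):
    return [a - b for a, b in zip(rp_a, rp_b)]

# Hypothetical R-precision values for 5 queries under two algorithms:
rp_a = [0.40, 0.33, 0.50, 0.20, 0.60]
rp_b = [0.35, 0.40, 0.50, 0.10, 0.45]
print([round(d, 2) for d in rp_differences(rp_a, rp_b)])
# [0.05, -0.07, 0.0, 0.1, 0.15]
```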
19
Summary Table Statistics
A single table summarizing all the queries: the number of queries used, the total number of documents retrieved by all queries, the total number of relevant documents effectively retrieved when all queries are considered, the total number of relevant documents that could have been retrieved by all queries, and so on.
20
Precision and Recall: Limitations
Proper estimation of maximum recall requires knowledge of all the documents in the collection that are relevant to the query.
Recall and precision capture different aspects of the same retrieval; measures which quantify the informativeness of the retrieval process might be more appropriate in interactive settings.
Recall and precision are easy to define only when a linear ordering of the retrieved documents is enforced.
21
Alternative Measures: The Harmonic Mean and the E Measure
Harmonic mean of recall and precision at the j-th position in the ranking: F(j) = 2 / (1/r(j) + 1/P(j)).
E measure: E(j) = 1 - (1 + b^2) / (b^2/r(j) + 1/P(j)), where b is a user-specified parameter.
b = 1: E(j) = 1 - F(j); b > 1: the user is more interested in precision; b < 1: the user is more interested in recall.
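A small sketch of the two measures as written above (hypothetical helper functions, given recall r(j) and precision P(j) at position j).

```python
# Harmonic mean F(j) and E measure at a ranking position, given recall r
# and precision p; b is the user-set parameter of the E measure.

def f_measure(r, p):
    return 0.0 if r == 0 or p == 0 else 2 / (1 / r + 1 / p)

def e_measure(r, p, b=1.0):
    return 1 - (1 + b ** 2) / (b ** 2 / r + 1 / p)

print(f_measure(0.5, 0.4))           # 0.444...
print(e_measure(0.5, 0.4, b=1.0))    # equals 1 - F(j) = 0.555...
print(e_measure(0.5, 0.4, b=2.0))    # b != 1 weights recall and precision differently
```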
22
User-Oriented Measures
Let U be the set of relevant documents known to the user, Rk the set of known relevant documents that were retrieved, and Ru the set of retrieved relevant documents previously unknown to the user.
Coverage = |Rk| / |U|: the fraction of the documents known to the user to be relevant that has actually been retrieved.
Novelty = |Ru| / (|Ru| + |Rk|): the fraction of the relevant documents retrieved that was previously unknown to the user.
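A sketch of coverage and novelty using the sets defined above; all document identifiers are hypothetical.

```python
# Coverage and novelty: U = relevant docs the user already knows,
# Rk = known relevant docs retrieved, Ru = relevant docs new to the user.

def coverage_novelty(known_to_user, relevant, answer):
    retrieved_relevant = set(answer) & relevant
    rk = retrieved_relevant & known_to_user        # known relevant docs retrieved
    ru = retrieved_relevant - known_to_user        # relevant docs new to the user
    coverage = len(rk) / len(known_to_user)
    novelty = len(ru) / len(retrieved_relevant) if retrieved_relevant else 0.0
    return coverage, novelty

U = {"d1", "d2", "d3"}                     # relevant docs the user already knows
relevant = {"d1", "d2", "d3", "d7", "d9"}  # all relevant docs
answer = ["d1", "d7", "d4", "d9", "d2"]    # retrieved documents
print(coverage_novelty(U, relevant, answer))   # coverage 2/3, novelty 2/4 = 0.5
```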
23
Reference Collections
Reference collections have been used over the years for the evaluation of IR systems. Reference test collections include TIPSTER/TREC (large) and the CACM, ISI, and Cystic Fibrosis collections (small collections with known relevant documents).
24
TREC (Text REtrieval Conference)
Research in IR has lacked a solid formal framework as a basic foundation, as well as robust and consistent testbeds and benchmarks.
25
TREC (Text REtrieval Conference)
Initiated under the National Institute of Standards and Technology (NIST).
Goals: to provide a large test collection, uniform scoring procedures, and a forum for comparing results. The 1st TREC conference was held in 1992 and the 7th in 1998.
The TREC test material consists of the document collections, the example information requests (topics), and the relevant documents for each topic, together with a set of benchmark tasks.
26
The Document Collection
The TREC collection has been growing steadily over the years (TREC-3: 2 gigabytes; TREC-6: 5.8 gigabytes). The collection can be obtained for a small fee and is distributed on 6 CDs of compressed text. Documents from the sub-collections are tagged with SGML to allow easy parsing.
27
TREC1-6 Documents
28
The Document Collection
Example: a TREC document numbered WSJ from the Wall Street Journal sub-collection, tagged with SGML:
<doc>
<docno> WSJ </docno>
<hl> AT&T Unveils Services to Upgrade Phone Networks Under Global Plan </hl>
<author> Janet Guyon (WSJ Staff) </author>
<dateline> New York </dateline>
<text>
American Telephone & Telegraph Co. introduced the first of a new generation of phone services with broad…
</text>
</doc>
29
The Example Information Requests (Topics)
The example information requests (topics) can be used for testing a new ranking algorithm. A request is a description of an information need in natural language.
Example: topic number 168:
<top>
<num> Number: 168
<title> Topic: Financing AMTRAK
<desc> Description: ...
<narr> Narrative: A ...
</top>
30
The Relevant Documents for Each Example Information Request
There are more than 350 topics. The set of relevant documents for each example information request (topic) is obtained from a pool of possible relevant documents.
Pooling method: the pool is created by taking the top k documents (k = 100) in the rankings generated by the various participating IR systems. The documents in the pool are then shown to human assessors, who ultimately decide on the relevance of each document.
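A minimal sketch of the pooling step described above; the run data below is made up for illustration.

```python
# Pooling method: merge the top-k documents (k = 100 at TREC) from the
# rankings produced by the participating systems; the resulting pool is
# then judged for relevance by the human assessors.

def build_pool(rankings, k=100):
    pool = set()
    for ranking in rankings:
        pool.update(ranking[:k])
    return pool

# Hypothetical rankings from three systems for one topic:
runs = [["d12", "d7", "d45"], ["d7", "d99", "d12"], ["d3", "d12", "d61"]]
print(sorted(build_pool(runs, k=2)))   # ['d12', 'd3', 'd7', 'd99']
```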
31
The (Benchmark) Tasks at the TREC Conferences
Ad hoc task: the system receives new requests and executes them on a pre-specified document collection.
Routing task: the requests are fixed while the document collection changes. A first collection is used for training and tuning the retrieval algorithm, and a second collection for testing the tuned algorithm. Routing is similar to filtering, but a ranking of the retrieved documents is still produced.
32
Other tasks (ref. p. 90): Chinese, filtering, interactive, NLP (natural language processing), cross languages, high precision, spoken document retrieval, and the query task (TREC-7).
33
TREC
34
Evaluation Measures at the TREC Conferences
At the TREC conferences, four basic types of evaluation measures are used (ref. p. 91): summary table statistics, recall-precision averages, document level averages, and average precision histograms.
35
The CACM Collection
A small collection of computer science literature (articles from the Communications of the ACM). Besides the text of each document, it includes structured subfields: word stems from the title and abstract sections; categories; direct references between articles (a list of pairs of documents [da, db]); bibliographic coupling connections (a list of triples [d1, d2, ncited]); and the number of co-citations for each pair of articles ([d1, d2, nciting]).
It provides a unique environment for testing retrieval algorithms that are based on information derived from cross-citing patterns.
36
The ISI Collection
A test collection of 1,460 documents selected from a previous collection assembled at the Institute of Scientific Information (ISI). Its main purpose is to support the investigation of similarities based on terms and on cross-citation patterns.
37
CFC Collection
1,239 documents indexed with the term cystic fibrosis in the National Library of Medicine's MEDLINE database. Each document record is composed of: MEDLINE accession number, author, title, source, major subjects, minor subjects, abstract, references, and citations.
38
CFC Collection
100 information requests with extensive relevance judgements: 4 separate relevance scores for each request, provided by human experts and by a medical bibliographer. Each score is 0 (not relevant), 1 (marginally relevant), or 2 (strongly relevant).
39
CFC Collection
A small but useful collection for experimentation: the number of information requests is large relative to the collection size, the relevance judgements are of good quality, and the collection is available for online access.