Information Retrieval: Quality of a Search Engine
Is it good?
- How fast does it index: number of documents/hour (for a given average document size)
- How fast does it search: latency as a function of index size
- Expressiveness of the query language
Measures for a search engine
- All of the preceding criteria are measurable
- The key measure: user happiness… useless answers won't make a user happy
Happiness: elusive to measure
- The commonest proxy is the relevance of search results
- How do we measure relevance? It requires 3 elements:
  1. A benchmark document collection
  2. A benchmark suite of queries
  3. A binary assessment, Relevant or Irrelevant, for each query-doc pair
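A minimal sketch of how these three ingredients might be represented in code; the variable names (docs, queries, qrels) and the toy contents are illustrative, not part of any standard benchmark:

```python
# 1. Benchmark document collection: doc_id -> text
docs = {
    "d1": "information retrieval evaluation",
    "d2": "cooking recipes",
    "d3": "search engine precision and recall",
}

# 2. Benchmark suite of queries: query_id -> query text
queries = {
    "q1": "evaluating search engines",
}

# 3. Binary relevance judgments: (query_id, doc_id) -> Relevant (True) or Irrelevant (False)
qrels = {
    ("q1", "d1"): True,
    ("q1", "d2"): False,
    ("q1", "d3"): True,
}
```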
Evaluating an IR system
- Standard benchmarks
  - TREC: the National Institute of Standards and Technology (NIST) has run a large IR testbed for many years
  - Other doc collections: marked by human experts, who judge each query-doc pair as Relevant or Irrelevant
- On the Web everything is more complicated, since we cannot mark the entire corpus!
General scenario
[Figure: Venn diagram of the collection, with the set of Relevant docs and the set of Retrieved docs]
Precision vs. Recall
- Precision: % of retrieved docs that are relevant [the issue: how much "junk" is found]
- Recall: % of relevant docs that are retrieved [the issue: how much of the relevant "info" is found]
[Figure: Venn diagram of Relevant and Retrieved docs within the collection]
How to compute them
- Precision: fraction of retrieved docs that are relevant
- Recall: fraction of relevant docs that are retrieved

                  Relevant               Not Relevant
  Retrieved       tp (true positive)     fp (false positive)
  Not Retrieved   fn (false negative)    tn (true negative)

- Precision P = tp / (tp + fp)
- Recall    R = tp / (tp + fn)
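A sketch of these two formulas, assuming the retrieved and relevant results are available as sets of doc ids (the helper name precision_recall is illustrative, not from any library):

```python
def precision_recall(retrieved, relevant):
    """Compute precision and recall from the contingency counts.

    retrieved, relevant: sets of document ids.
    """
    tp = len(retrieved & relevant)   # relevant docs we did retrieve
    fp = len(retrieved - relevant)   # junk we retrieved
    fn = len(relevant - retrieved)   # relevant docs we missed
    precision = tp / (tp + fp) if retrieved else 0.0
    recall = tp / (tp + fn) if relevant else 0.0
    return precision, recall

# Example: 3 of the 5 retrieved docs are relevant, out of 4 relevant docs overall
p, r = precision_recall({"d1", "d2", "d3", "d4", "d5"}, {"d1", "d3", "d5", "d9"})
print(p, r)   # 0.6 0.75
```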
Some considerations
- You can get 100% recall (but low precision) by retrieving all docs for every query!
- Recall is a non-decreasing function of the number of docs retrieved
- Precision usually decreases as more docs are retrieved
Precision vs. Recall: extreme cases
[Figure: Retrieved vs. Relevant Venn diagram with highest precision, very low recall]
[Figure: Retrieved vs. Relevant Venn diagram with lowest precision and recall]
[Figure: Retrieved vs. Relevant Venn diagram with low precision and very high recall]
[Figure: Retrieved vs. Relevant Venn diagram with very high precision and recall]
Precision-Recall curve
- We measure Precision at various levels of Recall
- Note: the curve is an AVERAGE over many queries
[Figure: precision (y-axis) vs. recall (x-axis) with measured points]
A common picture
[Figure: precision vs. recall curve for a typical system]
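A sketch of how the points of such a curve can be computed for a single ranked result list; the function name pr_curve and the toy doc ids are illustrative, not from the lecture or any library. Averaging these per-query curves over the benchmark queries gives plots like the ones above.

```python
def pr_curve(ranked, relevant):
    """(recall, precision) after each rank position, for one query.

    ranked: list of doc ids in ranked order.
    relevant: non-empty set of relevant doc ids for this query.
    """
    points, hits = [], 0
    for k, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
        points.append((hits / len(relevant), hits / k))
    return points

print(pr_curve(["d3", "d7", "d1", "d4"], {"d1", "d3"}))
# [(0.5, 1.0), (0.5, 0.5), (1.0, 0.666...), (1.0, 0.5)]
```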
Interpolated precision
- If you can increase precision by increasing recall, then you should get to count that…
- So the interpolated precision at recall level r is defined as the maximum precision observed at any recall level ≥ r
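A sketch of that definition (interpolated_precision is an illustrative helper; the points list reuses the toy values from the pr_curve example above):

```python
def interpolated_precision(points, r):
    """Interpolated precision at recall level r: the maximum precision
    observed at any recall level >= r (0.0 if no such point exists).
    `points` is a list of (recall, precision) pairs for one query."""
    return max((p for rec, p in points if rec >= r), default=0.0)

# With the pr_curve example above, interpolation "counts" the later,
# higher precision at the lower recall levels as well.
points = [(0.5, 1.0), (0.5, 0.5), (1.0, 2 / 3), (1.0, 0.5)]
print(interpolated_precision(points, 0.5))   # 1.0
print(interpolated_precision(points, 1.0))   # 0.666...
```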
Other measures
- Precision at k, i.e. the precision of the top k retrieved results: most appropriate for web search, where users mostly look at the first page (e.g. k = 10 results)
- 11-point interpolated average precision, the standard measure for TREC: take the interpolated precision at 11 recall levels, from 0% to 100% in steps of 10%, and average them
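A sketch of the 11-point measure for a single query, assuming (recall, precision) points such as those produced by the pr_curve sketch above; averaging the returned value over all benchmark queries gives the TREC-style score. The function name is illustrative.

```python
def eleven_point_average_precision(points):
    """11-point interpolated average precision for one query.

    points: list of (recall, precision) pairs, e.g. from pr_curve above.
    Interpolated precision at level r = max precision at any recall >= r.
    """
    total = 0.0
    for r in (i / 10 for i in range(11)):   # recall levels 0.0, 0.1, ..., 1.0
        total += max((p for rec, p in points if rec >= r), default=0.0)
    return total / 11

print(eleven_point_average_precision([(0.5, 1.0), (0.5, 0.5), (1.0, 2 / 3), (1.0, 0.5)]))
```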
F measure
- Combined measure (weighted harmonic mean of Precision and Recall):
  F = 1 / (α·(1/P) + (1 - α)·(1/R)) = (β² + 1)·P·R / (β²·P + R), with β² = (1 - α)/α
- People usually use the balanced F1 measure, i.e. with β = 1 or α = ½, thus 1/F1 = ½ (1/P + 1/R), i.e. F1 = 2PR / (P + R)
- Use this if you need to optimize a single measure that balances precision and recall
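A sketch of the weighted F measure as defined above (f_measure is an illustrative name; beta = 1 gives the balanced F1):

```python
def f_measure(precision, recall, beta=1.0):
    """Weighted harmonic mean of precision and recall.

    beta = 1 gives the balanced F1 measure: F1 = 2PR / (P + R).
    beta > 1 weights recall more heavily, beta < 1 weights precision more.
    """
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)

print(f_measure(0.6, 0.75))   # 0.666..., the F1 of the earlier example
```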