1
Chapter 3: Retrieval Evaluation (Dr. Almetwally Mostafa)
2
Functional Evaluation Performance Evaluation Precision & Recall Collection Evaluation Interface Evaluation User Satisfaction Users Experiments
3
Functional analysis: Does the system provide most of the functions that the user expects? What are the unique functions of this system? How user-friendly is the system? Error analysis: How often does the system fail? How easily does the user make errors?
4
Given a query, how well will the system perform? How do we define retrieval performance? Is finding all the related information our goal? Is it possible to know that the system has found all the information? Given the user's information needs, how well will the system perform? Is the information found useful? -- Relevance
5
Relevance — Dictionary Definition: 1. Pertinence to the matter at hand. 2. Applicability to social issues. 3. Computer Science. The capability of an information retrieval system to select and retrieve data appropriate to a user's needs.
6
A measurement of the outcome of a search; the judgment on what should or should not be retrieved. There are no simple answers to what is relevant and what is not relevant: it is difficult to define and subjective, depending on knowledge, needs, time, situation, etc. It is the central concept of information retrieval.
7
Information needs: problems? requests? queries? The final test of relevance is whether users find the information useful: whether they can use the information to solve the problems they have, and whether it fills the information gap they perceived.
8
The user's judgment: how well the retrieved documents satisfy the user's information needs, and how useful they are. If a document is related but not useful, it is still not relevant. The system's judgment: how well the retrieved documents match the query, and how likely the user would be to judge this information as useful.
9
Subject: judged by subject relatedness. Novelty: how much new information is in the retrieved document. Uniqueness/Timeliness. Quality/Accuracy/Truth. Availability: source or pointer? Accessibility. Cost. Language: English or non-English. Readability.
10
Binary: relevant or not relevant. Likert scale: not relevant, somewhat relevant, relevant, highly relevant.
11
Given a query, how many documents should a system retrieve? Are all the retrieved documents relevant? Have all the relevant documents been retrieved? Measures for system performance: the first question is about the precision of the search, the second about its completeness (recall).
12
Contingency table of retrieval outcomes:

                 Relevant    Not relevant
Retrieved           a             b
Not retrieved       c             d

P = a / (a + b)
R = a / (a + c)
13
Precision = (number of relevant documents retrieved) / (total number of documents retrieved)

Recall = (number of relevant documents retrieved) / (number of all the relevant documents in the database)
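A minimal sketch of these two formulas in Python, assuming the retrieved results and the full set of relevant documents are available as sets of document IDs (the function and variable names are illustrative, not from the lecture):

```python
def precision_recall(retrieved, relevant):
    """Compute precision and recall from two sets of document IDs."""
    retrieved = set(retrieved)
    relevant = set(relevant)
    hits = len(retrieved & relevant)          # relevant documents retrieved (cell a)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Example: 3 of the 4 retrieved documents are relevant, out of 6 relevant overall.
p, r = precision_recall({"d1", "d2", "d3", "d7"}, {"d1", "d2", "d3", "d4", "d5", "d6"})
print(p, r)   # 0.75 0.5
```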
14
Precision measures how precise a search is: the higher the precision, the fewer unwanted documents. Recall measures how complete a search is: the higher the recall, the fewer missing documents.
15
Theoretically, R and P do not depend on each other. In practice, high recall is achieved at the expense of precision, and high precision at the expense of recall. When will P = 0? Only when none of the retrieved documents is relevant. When will P = 1? Only when every retrieved document is relevant.
16
What does P = 0.75 mean? What does R = 0.25 mean? What is your goal (in terms of P and R) when conducting a search? It depends on the purpose of the search, on the information needs, and on the system. What values of P and R would indicate a good system or a good search? There is no fixed value.
17
Why does increasing recall often mean decreasing precision? In order not to miss anything and to cover all possible sources, one would have to retrieve and examine many more materials, many of which might not be relevant.
18
An ideal IR system would have P = 1 and R = 1 for all queries. Is it possible? Why? If information needs could be defined very precisely, and if relevance judgments could be made unambiguously, and if query matching could be designed perfectly, then we would have an ideal system. But then it would not be an information retrieval system.
19
Combining recall and precision: the harmonic mean F of recall and precision, F = 2 / (1/R + 1/P), attempts to find the best possible compromise between R and P.
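As a small illustration, the harmonic mean can be computed directly from precision and recall values (a sketch only; the function name is made up for this example):

```python
def f_measure(recall, precision):
    """Harmonic mean of recall and precision: F = 2 / (1/R + 1/P)."""
    if recall == 0.0 or precision == 0.0:
        return 0.0
    return 2.0 / (1.0 / recall + 1.0 / precision)

# F lies between R and P but is pulled toward the smaller of the two.
print(f_measure(0.5, 0.75))   # 0.6
```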
20
By van Rijsbergen: the idea is to allow the user to specify whether he is more interested in recall or in precision. E = 1 - (1 + k^2) / (k^2/R + 1/P). Values of k greater than 1 give more weight to recall, while values smaller than 1 give more weight to precision.
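A sketch of the weighted measure under the definition above (function names are illustrative; the weighting direction in the comments follows van Rijsbergen's formulation):

```python
def f_k(recall, precision, k=1.0):
    """Weighted combination (1 + k^2) / (k^2/R + 1/P).
    k > 1 gives more weight to recall; k < 1 gives more weight to precision;
    k = 1 reduces to the plain harmonic mean F."""
    if recall == 0.0 or precision == 0.0:
        return 0.0
    return (1.0 + k * k) / (k * k / recall + 1.0 / precision)

def e_measure(recall, precision, k=1.0):
    """Van Rijsbergen's E measure: E = 1 - F_k (lower is better)."""
    return 1.0 - f_k(recall, precision, k)

print(f_k(0.5, 0.75, k=1.0))   # 0.6, same as the plain F measure
print(f_k(0.5, 0.75, k=2.0))   # ~0.536, pulled toward the (lower) recall
```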
21
Diagram: the set of retrieved documents overlapping the set of relevant documents, with the relevant documents known to the user and the relevant documents retrieved but unknown to the user marked separately.
22
Coverage: the fraction of the documents known to the user to be relevant that have actually been retrieved. Coverage = (relevant docs retrieved and known to the user) / (relevant docs known to the user). If coverage = 1, everything the user knows about has been retrieved.
23
Novelty: the fraction of the relevant documents retrieved that were previously unknown to the user. Novelty = (relevant docs retrieved that were unknown to the user) / (relevant docs retrieved).
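A sketch of both ratios, assuming three sets of document IDs are available: the retrieved documents, the relevant documents, and the relevant documents the user already knew about (all names are illustrative):

```python
def coverage_and_novelty(retrieved, relevant, known_to_user):
    """Coverage: fraction of the user's known relevant docs that were retrieved.
    Novelty: fraction of the relevant docs retrieved that were previously unknown."""
    retrieved, relevant, known = set(retrieved), set(relevant), set(known_to_user)
    relevant_retrieved = retrieved & relevant
    known_relevant = known & relevant
    coverage = (len(relevant_retrieved & known_relevant) / len(known_relevant)
                if known_relevant else 0.0)
    novelty = (len(relevant_retrieved - known) / len(relevant_retrieved)
               if relevant_retrieved else 0.0)
    return coverage, novelty

# The user already knew d1 and d2; the system retrieved d1, d3, d9; d1-d4 are relevant.
print(coverage_and_novelty({"d1", "d3", "d9"}, {"d1", "d2", "d3", "d4"}, {"d1", "d2"}))
# (0.5, 0.5): half of the known relevant docs were found, and half of the
# relevant docs retrieved were new to the user.
```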
24
Using recall & precision: conduct query searches and try many different queries (results may depend on the sample of queries), then compare the precision & recall results. Recall & precision need to be considered together.
25
P / R          Query 1      Query 2      Query 3      Query 4      Query 5
System A       0.9 / 0.1    0.7 / 0.4    0.45 / 0.5   0.3 / 0.6    0.1 / 0.8
System B       0.8 / 0.2    0.5 / 0.3    0.4 / 0.5    0.3 / 0.7    0.2 / 0.8
System C       0.9 / 0.4    0.7 / 0.6    0.5 / 0.7    0.3 / 0.8    0.2 / 0.9
26
Precision-recall plot (P and R from 0.1 to 1.0) comparing System A, System B, and System C.
27
Precision      System 1    System 2    System 3
R = .25        0.6         0.7         0.9
R = .50        0.5         0.4         0.7
R = .75        0.2         0.3         0.4
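One simple way to compare systems from such a table, shown as a sketch (the numbers are copied from the table above): average each system's precision over the fixed recall levels.

```python
# Precision at fixed recall levels (0.25, 0.50, 0.75) for the three systems above.
precision_at_recall = {
    "System 1": [0.6, 0.5, 0.2],
    "System 2": [0.7, 0.4, 0.3],
    "System 3": [0.9, 0.7, 0.4],
}

for system, precisions in precision_at_recall.items():
    mean_p = sum(precisions) / len(precisions)
    print(f"{system}: mean precision over recall levels = {mean_p:.2f}")
# System 3 has the highest precision at every recall level, so it also has
# the best average (0.67 vs. 0.43 and 0.47).
```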
28
Number of relevant documents retrieved by System A at each cutoff (N = number of documents retrieved):

N (retrieved)   Query 1   Query 2   Query 3   Average precision
10              4         5         6         0.5
20              4         5         16        0.41
30              5         5         17        0.3
40              8         6         24        0.31
50              10        6         25        0.27
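A sketch of how the averages in this table can be computed: precision is evaluated at each cutoff (number of documents retrieved) for every query and then averaged across the three queries. The counts are taken from the table above; small differences from the slide's averages are due to rounding.

```python
# Relevant documents retrieved by System A for each query at each cutoff N
# (number of documents retrieved), taken from the table above.
cutoffs = [10, 20, 30, 40, 50]
relevant_retrieved = {          # query -> counts at N = 10, 20, 30, 40, 50
    "Query 1": [4, 4, 5, 8, 10],
    "Query 2": [5, 5, 5, 6, 6],
    "Query 3": [6, 16, 17, 24, 25],
}

for i, n in enumerate(cutoffs):
    precisions = [counts[i] / n for counts in relevant_retrieved.values()]
    average = sum(precisions) / len(precisions)
    print(f"N={n}: average precision = {average:.2f}")
# Average precision tends to drop as more documents are retrieved,
# while recall goes up: the usual precision/recall trade-off.
```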
29
For a real-world system, recall is always an estimate, and results depend on the sample of queries. Recall and precision do not capture the interactive aspect of the retrieval process. Recall & precision are only one aspect of system performance: high recall/high precision is desirable, but not necessarily the most important thing the user considers. R and P are based on the assumption that the set of relevant documents for a query is the same, independent of the user.
30
Data quality: coverage of the database (it will not be found if it is not in the database); completeness and accuracy of data. Indexing methods and indexing quality: it will not be found if it is not indexed; indexing types; currency of indexing (is it updated often?); indexing sizes.
33
How do you evaluate http://scholar.google.com? Functional Evaluation, Performance Evaluation, Precision & Recall, Collection Evaluation, Interface Evaluation, User Satisfaction, Users Experiments.
34
User-friendly interface: How long does it take for a user to learn advanced features? How well can the user explore or interact with the query output? How easy is it to customize output displays?
35
User satisfaction: the final test is the user! User satisfaction is more important than precision and recall. Measuring user satisfaction: surveys, usage statistics, user experiments.
36
Observe and collect data on system behaviors, user search behaviors, and user-system interaction. Interpret experiment results for system comparisons, for understanding users' information-seeking behaviors, and for developing new retrieval systems/interfaces.
37
"An evaluation of retrieval effectiveness for a full-text document retrieval system", 1985, by David Blair and M. E. Maron. The first large-scale evaluation of full-text retrieval. Significant and controversial results. Good experimental design.
38
An IBM full-text retrieval system with 40,000 documents (about 350,000 pages): documents to be used in the defense of a large corporate lawsuit. Large by 1985 standards; a typical size today. Mostly Boolean searching functions, with some ranking functions added. Full-text automatic indexing.
39
Two lawyers generated 51 requests. Two paralegals conducted searches again and again until the lawyers were satisfied with the results, i.e., until the lawyers believed that more than 75% of the relevant documents had been found. The paralegals and lawyers could have as many discussions as needed.
40
Average precision = .79; average recall = .20.
41
The lawyers judged documents as "vital", "satisfactory", "marginally relevant", or "irrelevant". The first three categories were all counted as "relevant" in the precision calculation.
42
Sampling from subsets of the database believed to be rich in relevant documents, mixed with the retrieved sets and sent to the lawyers for relevance judgments.
43
The recall is low. Even though the recall was only 20%, the lawyers were satisfied (and believed that 75% of the relevant documents had been retrieved).
44
Why was the recall so low? Do we really need high recall? If the study were run today on search engines like Google, would the results be the same or different?
45
Levels of Evaluation On the engineering level On the input level On the processing level On the output level On the use and user level On the social level --- Tefko Saracevic, SIGIR’95
46
Focus of this week: understand the challenges of IR system evaluation and the pros and cons of several IR evaluation methods.