1 INFO 624 Week 3 Retrieval System Evaluation
Dr. Xia Lin, Associate Professor, College of Information Science and Technology, Drexel University

2 Assignment 1 How did you select the search engines?
How did you find the search engines? How did you evaluate the systems? How did you compare the systems? Did you test the systems? For functionality? For performance? Systematically?

3 Assignment 2 Get your account ready. Understand what you need to do.
Use some "cut-and-paste" for answers.

4 IR System Evaluation Functional Evaluation
Performance Evaluation
Precision & Recall
Collection Evaluation
Interface Evaluation
User Satisfaction
User Experiments

5 Functional Evaluation
Functional analysis: Does the system provide most of the functions that the user expects? What are the unique functions of this system? How user-friendly is the system?
Error analysis: How often does the system fail? How easily can the user make errors?

6 Performance Evaluation
Given a query, how well will the system perform? How do we define retrieval performance? Is finding all the related information our goal? Is it possible to know that the system has found all the information? Given the user's information needs, how well will the system perform? Is the information found useful? -- Relevance

7 Relevance Relevance — Dictionary Definition:
1. Pertinence to the matter at hand. 2. Applicability to social issues. 3. Computer Science. The capability of an information retrieval system to select and retrieve data appropriate to a user's needs.

8 Relevance for IR A measurement of the outcome of a search
The judgment on what should or should not be retrieved. There are no simple answers to what is relevant and what is not: relevance is difficult to define and subjective, depending on knowledge, needs, time, situation, etc. It is the central concept of information retrieval.

9 Relevance to What? Information needs? Problems? Requests? Queries?
The final test of relevance is whether users find the information useful, whether they can use the information to solve the problems they have, and whether it fills the information gap they perceive.

10 Relevance Judgment The user's judgment
How well the retrieved documents satisfy the user's information needs; how useful the retrieved documents are. If a document is related but not useful, it is still not relevant.
The system's judgment
How well the retrieved documents match the query; how likely the user would be to judge this information as useful.

11 Factors for Relevance Judgment
Subject: judged by subject relatedness
Novelty: how much new information is in the retrieved document
Uniqueness / Timeliness
Quality / Accuracy / Truth
Availability: source or pointer?
Accessibility
Cost
Language: English or non-English
Readability

12 Relevance Measurement
Binary: relevant or not relevant
Likert scale: not relevant, somewhat relevant, relevant, highly relevant

13 Precision and Recall Given a query, how many documents should a system retrieve? Are all the retrieved documents relevant? Have all the relevant documents been retrieved? Measures for system performance: the first question is about the precision of the search; the second is about the completeness (recall) of the search.

14
                 Relevant     Not relevant
Retrieved           a              b
Not retrieved       c              d

P = a / (a + b)          R = a / (a + c)

15 Precision = (number of relevant documents retrieved) / (total number of documents retrieved)
Recall = (number of relevant documents retrieved) / (number of all the relevant documents in the database)
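To make the two formulas concrete, here is a minimal Python sketch (not part of the original slides); the document IDs are invented for illustration, and the function name precision_recall is my own.

def precision_recall(retrieved, relevant):
    # "hits" = relevant documents that were retrieved (the "a" cell on slide 14)
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall

# Illustrative data (not from the lecture):
retrieved = ["d1", "d2", "d3", "d4"]     # documents the system returned
relevant = ["d2", "d4", "d7", "d9"]      # all relevant documents in the database
p, r = precision_recall(retrieved, relevant)
print(f"P = {p:.2f}, R = {r:.2f}")       # P = 0.50, R = 0.50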

16 Precision measures how precise a search is: the higher the precision, the fewer unwanted documents.
Recall measures how complete a search is: the higher the recall, the fewer missed documents.

17 Relationship of R and P Theoretically,
R and P do not depend on each other. Practically, high recall is achieved at the expense of precision, and high precision is achieved at the expense of recall. When will P = 0? Only when none of the retrieved documents is relevant. When will P = 1? Only when every retrieved document is relevant.

18 What does P = 0.75 mean? What does R = .25 mean? What is your goal (in terms of P and R) when conducting a search? It depends on the purpose of the search, on the information needs, and on the system. What values of P and R would indicate a good system or a good search? There is no fixed value.

19 Why does increasing recall often mean decreasing precision?
In order not to miss anything and to cover all possible sources, one would have to scan many more materials, many of which might not be relevant.

20 Ideal Retrieval Systems
An ideal IR system would have P = 1 and R = 1 for all queries. Is it possible? Why?
If information needs could be defined very precisely, and
if relevance judgments could be made unambiguously, and
if query matching could be designed perfectly,
then we would have an ideal system. But such a system is not an information retrieval system.

21 Alternative measures Combining recall and precision:
F = 2 / (1/R + 1/P)   (the harmonic mean of recall and precision)
E = 1 - (1 + k²) / (k²/R + 1/P)   (the E measure; k sets the relative weight of recall and precision)
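A small sketch of the two combined measures as reconstructed above; the function names are my own, k is the weighting parameter from the E formula, and the example values are the Blair & Maron averages quoted later (slide 41).

def f_measure(recall, precision):
    # Harmonic mean of recall and precision: F = 2 / (1/R + 1/P)
    if recall == 0 or precision == 0:
        return 0.0
    return 2.0 / (1.0 / recall + 1.0 / precision)

def e_measure(recall, precision, k=1.0):
    # E = 1 - (1 + k^2) / (k^2/R + 1/P); k sets the relative weight of R vs. P
    if recall == 0 or precision == 0:
        return 1.0
    return 1.0 - (1.0 + k * k) / (k * k / recall + 1.0 / precision)

print(round(f_measure(0.20, 0.79), 2))   # 0.32 for the Blair & Maron averages
print(round(e_measure(0.20, 0.79), 2))   # 0.68, i.e. 1 - F when k = 1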

22 User-Oriented Measures
[Diagram: the overlap of the retrieved documents, the relevant documents, the relevant documents known to the user, and the relevant documents retrieved but unknown to the user]

23 Measure: Coverage Coverage: the fraction of the documents known to the user to be relevant that have actually been retrieved.
Coverage = (relevant docs retrieved and known to the user) / (relevant docs known to the user)
If coverage = 1, everything the user knows has been retrieved.

24 Measure: Novelty Novelty: the fraction of the relevant documents retrieved that were unknown to the user.
Novelty = (relevant docs retrieved but unknown to the user) / (relevant docs retrieved)
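A minimal sketch of coverage and novelty computed from sets; the document IDs, the assumption that the user already knew two of the relevant documents, and the helper name coverage_novelty are all invented for illustration.

def coverage_novelty(retrieved, relevant, known_to_user):
    # Coverage = |relevant retrieved and known| / |relevant known|
    # Novelty  = |relevant retrieved but unknown| / |relevant retrieved|
    retrieved, relevant, known = set(retrieved), set(relevant), set(known_to_user)
    relevant_retrieved = retrieved & relevant
    relevant_known = relevant & known
    coverage = (len(relevant_retrieved & known) / len(relevant_known)) if relevant_known else 0.0
    novelty = (len(relevant_retrieved - known) / len(relevant_retrieved)) if relevant_retrieved else 0.0
    return coverage, novelty

# Illustrative data: the user already knew about d2 and d8.
cov, nov = coverage_novelty(
    retrieved=["d1", "d2", "d3", "d5"],
    relevant=["d2", "d3", "d5", "d8"],
    known_to_user=["d2", "d8"],
)
print(cov, nov)   # 0.5 and 0.67: two of the three relevant hits were new to the user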

25 Evaluation of IR Systems
Using recall & precision: conduct query searches and try many different queries (results may depend on the sample of queries), then compare the precision & recall results. Recall & precision need to be considered together.

26 Use Precision and Recall to Evaluate IR Systems (precision / recall for each query)

            Query 1      Query 2      Query 3      Query 4      Query 5
System A    0.9 / 0.1    0.7 / 0.4    0.45 / 0.5   0.3 / 0.6    0.1 / 0.8
System B    0.8 / 0.2    0.5 / 0.3    0.4 / 0.5    0.3 / 0.7    0.2 / 0.8
System C    0.9 / 0.4    0.7 / 0.6    0.5 / 0.7    0.3 / 0.8    0.2 / 0.9

27 P-R diagram
[Figure: precision (P, vertical axis, 0 to 1.0) plotted against recall (R, horizontal axis, 0 to 1.0) for System A, System B, and System C]
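A P-R diagram like this can be approximated with a few lines of matplotlib, using the pairs from slide 26 and assuming each cell there is precision / recall; this is an illustrative sketch, not the original figure.

import matplotlib.pyplot as plt

# (recall, precision) pairs per system, read off the slide-26 table.
systems = {
    "System A": [(0.1, 0.9), (0.4, 0.7), (0.5, 0.45), (0.6, 0.3), (0.8, 0.1)],
    "System B": [(0.2, 0.8), (0.3, 0.5), (0.5, 0.4), (0.7, 0.3), (0.8, 0.2)],
    "System C": [(0.4, 0.9), (0.6, 0.7), (0.7, 0.5), (0.8, 0.3), (0.9, 0.2)],
}

for name, points in systems.items():
    recall, precision = zip(*sorted(points))   # plot in order of increasing recall
    plt.plot(recall, precision, marker="o", label=name)

plt.xlabel("Recall (R)")
plt.ylabel("Precision (P)")
plt.xlim(0, 1.0)
plt.ylim(0, 1.0)
plt.title("P-R diagram")
plt.legend()
plt.show()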

28 Use fixed interval levels of Recall to compare Precision

            System 1    System 2    System 3
R = .25       0.6         0.7         0.9
R = .50       0.5         0.4         0.7
R = .75       0.2         0.3         0.4

29 Use fixed intervals of the number of retrieved documents to compare Precision (System A)

Number of documents      Number of relevant documents retrieved        Average
retrieved                Query 1        Query 2        Query 3         precision
N = 10                      4              5              6              0.5
N = 20                      4              5             16              0.41
N = 30                      5             17              5              0.3
N = 40                      8              6             24              0.31
N = 50                     10             25              6              0.27
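The averages in this table come from dividing each query's count of relevant documents in the top N by N and averaging over the three queries; a short sketch of that calculation, with the counts copied from the rows above (variable names are mine):

# Relevant-document counts per query at each cutoff N, from the table above.
relevant_found = {
    10: [4, 5, 6],
    20: [4, 5, 16],
    30: [5, 17, 5],
    40: [8, 6, 24],
    50: [10, 25, 6],
}

for n, counts in relevant_found.items():
    precisions = [c / n for c in counts]        # precision@N for each query
    avg = sum(precisions) / len(precisions)     # averaged over the three queries
    print(f"N={n}: average precision@{n} = {avg:.3f}")
# The slide appears to truncate these to two decimals (e.g. 0.417 -> 0.41).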

30 Problems using P/R for Evaluation
For a real-world system, recall is always an estimate. Results depend on the sample of queries. Recall and precision do not capture the interactive aspects of the retrieval process. Recall & precision are only one aspect of system performance: high recall and high precision are desirable, but not necessarily the most important thing the user considers. R and P are based on the assumption that the set of relevant documents for a query is the same, independent of the user.

31 Collection Evaluation
Data quality
Coverage of the database: it will not be found if it is not in the database. Completeness and accuracy of the data.
Indexing methods and indexing quality: it will not be found if it is not indexed. Indexing types, currency of indexing (is it updated often?), indexing size.

32 Web Coverage: total 320 million pages

33 Examples: Invalid links

34 Example How do you evaluate http://scholar.google.com?
Functional Evaluation
Performance Evaluation
Precision & Recall
Collection Evaluation
Interface Evaluation
User Satisfaction
User Experiments

35 Interface Consideration
User-friendly interface: How long does it take for a user to learn advanced features? How well can the user explore or interact with the query output? How easy is it to customize output displays?

36 User Satisfaction The final test is the user!
User satisfaction is more important than precision and recall. Measuring user satisfaction: surveys, usage statistics, user experiments.

37 User Experiments Observe and collect data on: system behaviors, user search behaviors, user-system interaction.
Interpret experiment results: for system comparisons, for understanding users' information-seeking behaviors, for developing new retrieval systems/interfaces.

38 A Landmark Study "An Evaluation of Retrieval Effectiveness for a Full-Text Document-Retrieval System" (1985), by David C. Blair and M. E. Maron. The first large-scale evaluation of full-text retrieval. Significant and controversial results. Good experimental design.

39 The Setting An IBM full-text retrieval system with 40,000 documents (about 350,000 pages). Documents to be used in the defense of a large corporate lawsuit. Large by 1985 standards; typical today. Mostly Boolean searching functions, with some ranking functions added. Full-text automatic indexing.

40 The Experiment Two lawyers generated 51 requests.
Two paralegals conducted searches repeatedly until the lawyers were satisfied with the results, i.e., until the lawyers believed that more than 75% of the relevant documents had been found. The paralegals and lawyers could have as many discussions as needed.

41 The results Average precision = .79, average recall = .20

42 Precision Calculation
The lawyers judged each document as "vital", "satisfactory", "marginally relevant", or "irrelevant". The first three categories were all counted as "relevant" in the precision calculation.
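As a small illustration of that counting rule, graded judgments can be collapsed to binary relevance before computing precision; the judgments listed below are invented, not taken from the study.

# Grades that count as "relevant" for the precision calculation (per the slide):
RELEVANT_GRADES = {"vital", "satisfactory", "marginally relevant"}

# Invented judgments for one retrieved set, just to show the counting rule:
judgments = ["vital", "irrelevant", "satisfactory", "marginally relevant", "irrelevant"]

relevant_retrieved = sum(grade in RELEVANT_GRADES for grade in judgments)
precision = relevant_retrieved / len(judgments)
print(precision)   # 0.6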

43 Recall Calculation Samples were drawn from subsets of the database believed to be rich in relevant documents, mixed with the retrieved sets, and sent to the lawyers for relevance judgments.

44 The most significant results
The recall was low: even though recall was only 20%, the lawyers were satisfied (and believed that 75% of the relevant documents had been retrieved).

45 Questions Why was the recall so low? Do we really need high recall?
If the study were run today on search engines like Google, would the results be the same or different?

46 Discussion: Levels of Evaluation On the engineering level
On the input level
On the processing level
On the output level
On the use and user level
On the social level
--- Tefko Saracevic, SIGIR'95

47 Summary Focus of this week:
Understand the challenges of IR system evaluation. Pros and cons of several IR evaluation methods.

