Advanced Information Retrieval

Presentation transcript:

Advanced Information Retrieval Meeting #3

Performance Evaluation How do you evaluate the performance of an information retrieval system? Or compare two different systems?

Relevance for IR
• A measurement of the outcome of a search
• The judgment on what should or should not be retrieved
• There are no simple answers to what is relevant and what is not: relevance is difficult to define and subjective, depending on knowledge, needs, time, situation, etc.
• The central concept of information retrieval

Relevance to What? Problems? Requests? Queries? Information needs? The final test of relevance is whether users find the information useful: whether they can use it to solve the problems they have and to fill the information gap they perceive.

Relevance Judgment
• The user's judgment: how well the retrieved documents satisfy the user's information needs, and how useful they are (related but not useful is still not relevant)
• The intermediary's judgment: how likely is the user to judge the information as useful? How important will the user consider the information to be?
• The system's judgment: how well the retrieved documents match the query

What is the goal of an IR system? A “good” IR system is able to extract meaningful (relevant) information while withholding non-relevant information. Why is this difficult? What are we testing?

What are the components of relevance? Some criteria: depth/scope, accuracy/validity, content novelty, document novelty, tangibility, ability to understand, recency, external validation, effectiveness, access, clarity, source quality, etc.

Determining relevance? Relevance is subjective in nature; it may be determined by:
• The user who posed the retrieval problem: realistic, but based on many personal factors; relates to the problem/information need
• An external judge: relates to the statement/query; assumes independence of judgments
Should relevance be binary or n-ary?

Precision vs. Recall: which one is more important? It depends on the task! For a generic web search engine: precision! For an index of court cases, where all legal precedents are needed: recall!

Relationship of R and P. Theoretically, R and P do not depend on each other. Practically, high recall is achieved at the expense of precision, and high precision is achieved at the expense of recall. When will P = 0? Only when none of the retrieved documents is relevant. When will P = 1? Only when every retrieved document is relevant.
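To make the P = 0 and P = 1 cases concrete, here is a minimal sketch of set-based precision and recall; the document IDs and sets are hypothetical.

```python
# Minimal sketch: set-based precision and recall for one query.
# 'retrieved' and 'relevant' are hypothetical document-ID sets.

def precision_recall(retrieved: set, relevant: set) -> tuple:
    """Return (precision, recall) for a retrieved set against a relevant set."""
    hits = len(retrieved & relevant)                 # relevant documents actually retrieved
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

retrieved = {"d1", "d2", "d3", "d4", "d5"}
relevant = {"d2", "d4", "d7", "d9"}
p, r = precision_recall(retrieved, relevant)
print(f"P = {p:.2f}, R = {r:.2f}")                   # P = 0.40, R = 0.50
```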

Ideal Retrieval Systems. An ideal IR system would have P = 1 and R = 1 for all queries. Is that possible? Why not? If information needs could be defined very precisely, and if relevance judgments could be made unambiguously, and if query matching could be designed perfectly, then we would have an ideal system. In practice, no information retrieval system is ideal.

Precision vs. Recall: inversely related. As recall increases, precision decreases. (Figure: precision plotted against recall.)

Evaluation of IR Systems Using Recall & Precision. Conduct query searches, try many different queries, and compare the resulting precision and recall. Recall and precision need to be considered together, and results vary depending on the test data and queries. Recall and precision capture only one aspect of system performance: high recall/high precision is desirable, but not necessarily the most important thing the user considers.

Precision vs. Recall. Precision and recall depend on the size of the selected set, which in turn depends on the user interface, the user, and the user's task. Boolean system: assume all documents are presented to and viewed by the user. Ranked system: depends on the number of documents the user views from the ranked list.

Implementing Precision & Recall. A common method is to measure precision at several levels of recall, i.e., at increasing sizes of the “selected” set: for each query, calculate precision at 11 levels of recall (0%, 10%, …, 100%), average across all queries, and plot the precision vs. recall curve. A sketch of this procedure follows.
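The sketch below implements the 11-point method just described: interpolated precision at recall levels 0.0, 0.1, …, 1.0 for each query, then an average across queries. The rankings and relevance sets are hypothetical.

```python
# 11-point interpolated precision, averaged across queries.
# Rankings and relevance judgments below are hypothetical.

def eleven_point_precision(ranking, relevant):
    """Interpolated precision at the 11 standard recall levels for one query."""
    points, hits = [], 0
    for i, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
        points.append((hits / len(relevant), hits / i))   # (recall, precision)
    levels = [i / 10 for i in range(11)]
    # Interpolated precision at level r = max precision at any recall >= r
    return [max((p for rec, p in points if rec >= r), default=0.0) for r in levels]

queries = [
    (["d3", "d1", "d9", "d2", "d7"], {"d1", "d2"}),
    (["d5", "d4", "d8", "d6", "d0"], {"d4", "d6", "d9"}),
]
per_query = [eleven_point_precision(rk, rel) for rk, rel in queries]
averaged = [sum(vals) / len(vals) for vals in zip(*per_query)]
print(averaged)   # averaged precision at recall 0.0, 0.1, ..., 1.0
```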

Data quality considerations: coverage of the database; completeness and accuracy of the data; indexing methods and indexing quality: indexing types, currency of indexing (is it updated often?), and index sizes.

Precision vs. Recall: average precision across all user queries. (Rong Jin, Alex G. Hauptmann, ChengXiang Zhai. Title language model for information retrieval. Proceedings of the 25th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 2002.)

Precision & Recall for a single query. Example: 14 documents are retrieved in rank order, and documents 1, 2, 4, 6, and 13 are relevant; precision and recall are computed after each (ith) document retrieved.
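A short sketch of this worked example, computing precision and recall after each retrieved document:

```python
# The single-query example above: 14 documents retrieved in rank order,
# with the documents at positions 1, 2, 4, 6, and 13 being relevant.

relevant = {1, 2, 4, 6, 13}
hits = 0
for i in range(1, 15):                      # ith document retrieved
    if i in relevant:
        hits += 1
    precision = hits / i                    # relevant seen so far / documents seen
    recall = hits / len(relevant)           # relevant seen so far / total relevant
    print(f"i={i:2d}  P={precision:.2f}  R={recall:.2f}")
```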

Example: how do you show that a search engine is better than other search engines? At each level of recall, the precision of your system is higher than the precision of the other systems (three-point average: r = .25, .5, .75; 11-point average: r = 0.0, .1, .2, …, .9, 1.0). Or: comparing the first 10, 20, 30, … items returned by the search engines, your system always has more relevant documents than the other systems.

Use fixed interval levels of recall to compare precision:

          System 1   System 2   System 3
R = .25     0.6        0.7        0.9
R = .50     0.5        0.4        0.7
R = .75     0.2        0.3        0.4

Use fixed intervals of the number of documents retrieved (N) to compare precision. Number of relevant documents retrieved by System A, and the resulting average precision:

          Query 1   Query 2   Query 3   Average precision
N = 10       4         5         6            0.50
N = 20       4         5        16            0.42
N = 30       5        17         5            0.30
N = 40       8         6        24            0.32
N = 50      10        25         6            0.27
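A minimal sketch reproducing the averages in this table from the per-query counts (precision@N = relevant documents in the top N divided by N, averaged over the three queries):

```python
# Average precision at fixed cutoffs N for System A, using the
# relevant-document counts from the table above.

relevant_in_top_n = {
    10: [4, 5, 6],
    20: [4, 5, 16],
    30: [5, 17, 5],
    40: [8, 6, 24],
    50: [10, 25, 6],
}

for n, counts in relevant_in_top_n.items():
    precisions = [c / n for c in counts]          # precision@N per query
    avg = sum(precisions) / len(precisions)
    print(f"N={n}: average precision@N = {avg:.2f}")
```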

Use precision and recall to evaluate IR systems (precision / recall at five points in the ranking):

             1           2           3            4           5
System A   0.9 / 0.1   0.7 / 0.4   0.45 / 0.5   0.3 / 0.6   0.1 / 0.8
System B   0.8 / 0.2   0.5 / 0.3   0.4 / 0.5    0.3 / 0.7   0.2 / 0.8
System C   0.9 / 0.4   0.7 / 0.6   0.5 / 0.7    0.3 / 0.8   0.2 / 0.9

P-R diagram: precision (P, vertical axis) plotted against recall (R, horizontal axis) for Systems A, B, and C.
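A sketch of how such a P-R diagram can be drawn, assuming matplotlib is available; the points are the hypothetical System A/B/C precision/recall values from the table above.

```python
# Plot a P-R diagram for three systems (precision vs. recall).
import matplotlib.pyplot as plt

systems = {   # (recall, precision) pairs taken from the table above
    "System A": [(0.1, 0.9), (0.4, 0.7), (0.5, 0.45), (0.6, 0.3), (0.8, 0.1)],
    "System B": [(0.2, 0.8), (0.3, 0.5), (0.5, 0.4), (0.7, 0.3), (0.8, 0.2)],
    "System C": [(0.4, 0.9), (0.6, 0.7), (0.7, 0.5), (0.8, 0.3), (0.9, 0.2)],
}

for name, points in systems.items():
    recall, precision = zip(*points)
    plt.plot(recall, precision, marker="o", label=name)

plt.xlabel("Recall")
plt.ylabel("Precision")
plt.title("P-R diagram")
plt.legend()
plt.show()
```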

Problems with Recall/Precision
• Poor match with user needs
• Limited usefulness of recall
• Does not handle interactivity well
• Computation of recall?
• Relevance is not utility
• Averages ignore individual differences in queries

Interface Considerations. Is the interface user friendly? How long does it take for a user to learn the advanced features? How well can the user explore or interact with the query output? How easy is it to customize the output displays?

User-Centered IR Evaluation
• More user-oriented measures: satisfaction, informativeness
• Other types of measures: time, cost-benefit, error rate, task analysis
• Evaluation of user characteristics
• Evaluation of the interface
• Evaluation of the process or interaction

User satisfaction. The final test is the user! User satisfaction is more important than precision and recall. Measuring user satisfaction: surveys, usage statistics, user experiments.

Retrieval Effectiveness. Designing an information retrieval system involves many decisions: manual or automatic indexing? Natural language or controlled vocabulary? Which stoplists? Which stemming methods? What query syntax? etc. How do we know which of these methods is most effective? Is everything a matter of judgment?

Studies of Retrieval Effectiveness • The Cranfield Experiments, Cyril W. Cleverdon, Cranfield College of Aeronautics, 1957-1968 • SMART System, Gerald Salton, Cornell University, 1964-1988 • TREC, Donna Harman, National Institute of Standards and Technology (NIST), 1992-

Cranfield Experiments (Example). Comparative efficiency of indexing systems: Universal Decimal Classification, an alphabetical subject index, a special facet classification, and the Uniterm system of co-ordinate indexing. Four indexes were prepared manually for each document in three batches of 6,000 documents -- 18,000 documents in total, each indexed four times. The documents were reports and papers in aeronautics. Indexes for testing were prepared on index cards and other cards, with very careful control of the indexing procedures.

Cranfield Experiments (continued) Searching: • 1,200 test questions, each satisfied by at least one document • Reviewed by expert panel • Searches carried out by 3 expert librarians • Two rounds of searching to develop testing methodology • Subsidiary experiments at English Electric Whetstone Laboratory and Western Reserve University

The Cranfield Data. The Cranfield data was made widely available and used by other researchers. • Salton used the Cranfield data with the SMART system (a) to study the relationship between recall and precision, and (b) to compare automatic indexing with human indexing. • Spärck Jones and van Rijsbergen used the Cranfield data for experiments in relevance weighting, clustering, definition of test corpora, etc.

The Cranfield Experiments 1950s/1960s Time Lag. The interval between the demand being made and the answer being given. Presentation. The physical form of the output. User effort. The effort, intellectual or physical, demanded of the user. Recall. The ability of the system to present all relevant documents. Precision. The ability of the system to withhold non-relevant documents.

Cranfield Experiments -- Analysis. Cleverdon introduced recall and precision, based on the concept of relevance. (Figure: recall (%) vs. precision (%), showing the trade-off region occupied by practical systems.)

Why not the others? According to Cleverdon: time lag is a function of the hardware; presentation is successful if the user can read and understand the list of references returned; and user effort can be measured with a straightforward examination of a small number of cases.

In Reality. The user task needs to be considered carefully: Cleverdon was focusing on batch interfaces, while interactive browsing interfaces are very significant (Turpin & Hersh). For interactive systems, user effort and presentation are very important.

In Spite of That. Precision and recall are extensively evaluated; usability, not so much.

Why Not Usability? Usability requires a user study: every new feature needs a new study (expensive), and there is high variance with many confounding factors. Offline analysis of accuracy, once a dataset is found, is easy to control, repeatable, automatic, and free. And if the system isn't accurate, it isn't going to be usable.

Measures
• From IR: (user) precision, aspectual recall
• From experimental psychology: quantitative (time, number of errors, …) and qualitative (user opinions)
Example evaluation measures:

                 System viewpoint    User viewpoint
Effectiveness    recall/precision    quality of solution
Efficiency       retrieval time      task completion time
Satisfaction     preference          confidence

Another view… “The omission of the user from the traditional IR model, whether it is made explicit or not, stems directly from the user’s absence from the Cranfield experiment”. (Harter and Hert, 1997)

The TREC era. The Text REtrieval Conference (TREC), sponsored and hosted by NIST, has run since 1992, with participants from academia, industry, and government. It provides standard test collections and queries; relevance judging and data analysis are done at NIST. http://trec.nist.gov

TREC Goals… to encourage research in information retrieval based on large test collections; to increase communication among industry, academia, and government by creating an open forum for the exchange of research ideas; to speed the transfer of technology from research labs into commercial products by demonstrating substantial improvements in retrieval methodologies on real-world problems; to increase the availability of appropriate evaluation techniques for use by industry and academia, including development of new evaluation techniques more applicable to current systems

TREC databases. About 5 gigabytes of text; sources include WSJ, AP, Computer Selects, Federal Register, SJ Mercury, FT, Congressional Record, FBIS, and LA Times. Simple SGML tagging; no correction of errors in the text.

TREC Experiments
1. NIST provides the text corpus on CD-ROM; each participant builds an index using its own technology.
2. NIST provides 50 natural-language topic statements; participants convert them to queries (automatically or manually).
3. Participants run the searches and return up to 1,000 hits per topic to NIST, which analyzes the results for recall and precision (all TREC participants use rank-based methods of searching).
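For step 3, submitted results are conventionally written one document per line in the TREC run format (topic, the literal "Q0", document ID, rank, score, run tag). A minimal sketch, with hypothetical topic and document IDs:

```python
# Write ranked results in the conventional TREC run format:
#   topic_id Q0 doc_id rank score run_tag
# Topic and document IDs below are hypothetical.

results = {                       # topic_id -> ranked list of (doc_id, score)
    "051": [("FT911-3032", 12.7), ("AP880212-0047", 11.9)],
    "052": [("LA123190-0134", 9.4)],
}

with open("myrun.txt", "w") as out:
    for topic_id, ranked in results.items():
        for rank, (doc_id, score) in enumerate(ranked, start=1):
            out.write(f"{topic_id} Q0 {doc_id} {rank} {score} myrun\n")
```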

TREC Evaluation
• Summary table statistics: number of topics, documents retrieved, relevant documents retrieved, and relevant documents available
• Recall-precision averages: average precision at 11 levels of recall
• Document level averages: average precision at specified document cutoff values
• Average precision histogram: a single measure for each topic
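The per-topic single measure is commonly non-interpolated average precision (AP), whose mean over topics gives MAP. A minimal sketch with hypothetical rankings and judgments:

```python
# Average precision (AP) per topic and its mean over topics (MAP).
# Rankings and relevance judgments below are hypothetical.

def average_precision(ranking, relevant):
    """Mean of the precision values at the ranks of the relevant documents."""
    hits, precisions = 0, []
    for i, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            precisions.append(hits / i)
    return sum(precisions) / len(relevant) if relevant else 0.0

topics = [
    (["d2", "d5", "d1", "d9"], {"d2", "d1"}),
    (["d7", "d3", "d8", "d4"], {"d3", "d4", "d6"}),
]
aps = [average_precision(rk, rel) for rk, rel in topics]
print("AP per topic:", [round(a, 3) for a in aps])    # [0.833, 0.333]
print("MAP:", round(sum(aps) / len(aps), 3))          # 0.583
```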

Evaluation of Web Search Engines
• Dynamic nature of the database
• Differences between databases (content, indexing)
• Operational systems: recall is “practically unknowable”
• Generally relatively little overlap between engines
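One simple way to quantify the "little overlap" observation is the fraction of top-k results two engines share; the result lists here are hypothetical.

```python
# Overlap of the top-k results of two search engines.
# Result lists below are hypothetical URLs.

def overlap(results_a, results_b, k=10):
    """Size of the intersection of the two top-k result sets, divided by k."""
    return len(set(results_a[:k]) & set(results_b[:k])) / k

engine_a = ["url1", "url2", "url3", "url4", "url5"]
engine_b = ["url3", "url9", "url1", "url8", "url7"]
print(overlap(engine_a, engine_b, k=5))   # 0.4
```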

What IR still lacks…
• A wider range of tasks and evaluation suites, especially for interfaces
• Appropriate measures for interactive evaluation
• Standard tests for standard users
• Flexibility to match users' needs and outcomes
• Mechanisms to study process and strategy as well as outcomes