1 CS 430 / INFO 430 Information Retrieval Lecture 9 Evaluation of Retrieval Effectiveness 2

2 Course administration

3 Precision-recall graph
[Figure: precision-recall curves for two systems, red and black; x-axis: recall, y-axis: precision]
The red system appears better than the black, but is the difference statistically significant?
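
To show how such a comparison is typically drawn, here is a minimal Python sketch, assuming matplotlib is available; the interpolated precision values at the 11 standard recall levels are hypothetical, not measurements from real systems:

    import matplotlib.pyplot as plt

    # Hypothetical interpolated precision at the 11 standard recall levels (0.0 to 1.0).
    recall_levels = [i / 10 for i in range(11)]
    red_system   = [0.90, 0.82, 0.75, 0.68, 0.60, 0.54, 0.47, 0.40, 0.33, 0.25, 0.18]
    black_system = [0.85, 0.76, 0.68, 0.60, 0.53, 0.46, 0.40, 0.34, 0.27, 0.20, 0.14]

    plt.plot(recall_levels, red_system, color="red", label="red system")
    plt.plot(recall_levels, black_system, color="black", label="black system")
    plt.xlabel("recall")
    plt.ylabel("precision")
    plt.legend()
    plt.show()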

4 Statistical tests
Suppose that a search is carried out on systems i and j. System i is superior to system j if, for all test cases,
recall(i) >= recall(j)
precision(i) >= precision(j)
In practice, we have data from only a limited number of test cases. What conclusions can we draw?
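
This dominance criterion can be checked mechanically. Here is a minimal Python sketch, assuming per-test-case recall and precision scores for both systems are already available as equal-length lists (all names and values are hypothetical):

    # Does system i dominate system j on every test case?
    def dominates(recall_i, recall_j, precision_i, precision_j):
        # True only if system i is at least as good as system j on both
        # recall and precision for every test case.
        return (all(ri >= rj for ri, rj in zip(recall_i, recall_j)) and
                all(pi >= pj for pi, pj in zip(precision_i, precision_j)))

    # Hypothetical per-query scores for two systems:
    print(dominates([0.6, 0.7], [0.5, 0.7], [0.40, 0.50], [0.40, 0.45]))  # True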

5 Statistical tests
The t-test is the standard statistical test for comparing two sets of paired measurements, but it depends on statistical assumptions of independence and normally distributed values that do not hold for this data.
The sign test makes no assumption of normality and uses only the sign (not the magnitude) of the differences in the sample values, but it assumes independent samples.
The Wilcoxon signed-rank test uses the signed ranks of the differences rather than their raw magnitudes; it makes no assumption of normality but likewise assumes independent samples.
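
To make the comparison concrete, here is a minimal Python sketch, assuming SciPy 1.7 or later is available and that per-topic scores (e.g., average precision) for the two systems have already been computed; the score lists are illustrative, not real data:

    # Paired significance tests on per-topic scores for two systems.
    from scipy import stats

    system_i = [0.42, 0.55, 0.31, 0.60, 0.48, 0.39, 0.52, 0.45]  # hypothetical scores
    system_j = [0.38, 0.51, 0.33, 0.54, 0.41, 0.36, 0.50, 0.40]

    # Paired t-test: assumes the differences are roughly normally distributed.
    t_stat, t_p = stats.ttest_rel(system_i, system_j)

    # Sign test: a binomial test on the number of positive (non-zero) differences.
    diffs = [a - b for a, b in zip(system_i, system_j)]
    positives = sum(d > 0 for d in diffs)
    nonzero = sum(d != 0 for d in diffs)
    sign_p = stats.binomtest(positives, nonzero, p=0.5).pvalue

    # Wilcoxon signed-rank test: uses the signed ranks of the differences.
    w_stat, w_p = stats.wilcoxon(system_i, system_j)

    print(f"t-test p={t_p:.3f}  sign test p={sign_p:.3f}  Wilcoxon p={w_p:.3f}")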

6 Text Retrieval Conferences (TREC)
Led by Donna Harman (NIST) and Ellen Voorhees, with DARPA support, since 1992.
Corpus of several million textual documents, a total of more than five gigabytes of data.
Researchers attempt a standard set of tasks, e.g.:
-> search the corpus for topics provided by surrogate users
-> match a stream of incoming documents against standard queries
Participants include large commercial companies, small information retrieval vendors, and university research groups.

7 Characteristics of Evaluation Experiments
Corpus: a standard set of documents that can be used for repeated experiments.
Topic statements: formal statements of user information needs, not tied to any query language or approach to searching.
Result set for each topic statement: identify all relevant documents (or apply a well-defined procedure for estimating all relevant documents).
Publication of results: description of the testing methodology, metrics, and results.

8 TREC Ad Hoc Track
1. NIST provides a text corpus on CD-ROM. Each participant builds an index using its own technology.
2. NIST provides 50 natural language topic statements. Participants convert them to queries (automatically or manually).
3. Participants run searches (possibly using relevance feedback and other iterations) and return up to 1,000 hits per topic to NIST.
4. NIST uses the pooled results to estimate the set of relevant documents.
5. NIST analyzes the runs for recall and precision (all TREC participants use rank-based methods of searching); a minimal sketch of this computation follows below.
6. NIST publishes the methodology and results.
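
As an illustration of step 5, the following minimal Python sketch computes precision and recall at a rank cutoff for one topic, given a participant's ranked list and the set of documents judged relevant. The function and the data are hypothetical; this is not NIST's actual evaluation software (trec_eval).

    # Precision and recall at a cutoff for a single topic.
    # ranked_hits: a participant's ranked list of doc ids (up to 1,000).
    # relevant:    the set of doc ids judged relevant for the topic.
    def precision_recall_at_k(ranked_hits, relevant, k):
        top_k = ranked_hits[:k]
        retrieved_relevant = sum(1 for doc_id in top_k if doc_id in relevant)
        precision = retrieved_relevant / k
        recall = retrieved_relevant / len(relevant) if relevant else 0.0
        return precision, recall

    # Hypothetical example: 3 of the top 10 hits are relevant, out of 5 relevant in total.
    ranked_hits = ["d7", "d2", "d9", "d4", "d1", "d8", "d3", "d6", "d5", "d0"]
    relevant = {"d2", "d4", "d5", "d11", "d12"}
    print(precision_recall_at_k(ranked_hits, relevant, 10))  # (0.3, 0.6)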

9 Notes on the TREC Corpus
The TREC corpus consists mainly of general articles. The Cranfield data was in a specialized engineering domain.
The TREC data is raw data:
-> no stop words are removed; no stemming
-> words are alphanumeric strings
-> no attempt is made to correct spelling, sentence fragments, etc.

10 Relevance Assessment: TREC
Problem: too many documents to inspect each one for relevance.
Solution: for each topic statement, a pool of potentially relevant documents is assembled, using the top 100 ranked documents from each participant. The human expert who set the query looks at every document in the pool and determines whether it is relevant. Documents outside the pool are not examined.
In a TREC-8 example, with 71 participants:
7,100 documents in the pool
1,736 unique documents (eliminating duplicates)
94 judged relevant
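
A minimal Python sketch of the pooling step (the data structures are hypothetical, not NIST's tooling): take the top 100 ranked documents from each participant's run for a topic and merge them, so that duplicates collapse into a single pool for the assessor.

    # Build the judging pool for one topic from many participants' runs.
    # runs: maps a participant id to that participant's ranked list of doc ids.
    def build_pool(runs, depth=100):
        pool = set()
        for ranked_list in runs.values():
            pool.update(ranked_list[:depth])  # top 'depth' documents from each run
        return pool

    # With 71 runs and depth 100, at most 7,100 entries enter the pool;
    # duplicates collapse (to 1,736 unique documents in the TREC-8 example above).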

11 Some Other TREC Tracks (not all tracks are offered every year)
Cross-Language Track: retrieve documents written in different languages using topics that are in one language.
Filtering Track: in a stream of incoming documents, retrieve those that match the user's interest as represented by a query. Adaptive filtering modifies the query based on relevance feedback.
Genome Track: study the retrieval of genomic data: gene sequences and supporting documentation, e.g., research papers, lab reports, etc.

12 Some Other TREC Tracks (continued)
HARD Track: high-accuracy retrieval, leveraging additional information about the searcher and/or the search context.
Question Answering Track: systems that answer questions, rather than return documents.
Video Track: content-based retrieval of digital video.
Web Track: search techniques and repeatable experiments on Web documents.

13 A Cornell Footnote
The TREC analysis uses a program developed by Chris Buckley, who spent 17 years at Cornell before completing his Ph.D. Buckley has continued to maintain the SMART software and has been a participant at every TREC conference. SMART has been used as the basis against which other systems are compared.
During the early TREC conferences, tuning SMART against the TREC corpus led to steady improvements in retrieval effectiveness, but after about TREC-5 a plateau was reached. TREC-8, in 1999, was the final year for the ad hoc experiment.

14 Searching and Browsing: The Human in the Loop
[Diagram: the user in the loop; searching the index returns hits, browsing the repository returns objects]

15 Evaluation: User criteria
System-centered and user-centered evaluation:
-> Is the user satisfied?
-> Is the user successful?
System efficiency:
-> What effort is involved in carrying out the search?
Suggested criteria (none very satisfactory):
recall and precision
response time
user effort
form of presentation
content coverage

16 D-Lib Working Group on Metrics
DARPA-funded attempt (1997) to develop a TREC-like approach to digital libraries, with a human in the loop.
"This Working Group is aimed at developing a consensus on an appropriate set of metrics to evaluate and compare the effectiveness of digital libraries and component technologies in a distributed environment. Initial emphasis will be on (a) information discovery with a human in the loop, and (b) retrieval in a heterogeneous world."
Very little progress was made.

17 MIRA
Evaluation Frameworks for Interactive Multimedia Information Retrieval Applications
European study. Chair: Keith van Rijsbergen, Glasgow University.
Expertise:
Multimedia Information Retrieval
Information Retrieval
Human Computer Interaction
Case Based Reasoning
Natural Language Processing

18 MIRA Starting Point
Information Retrieval techniques are beginning to be used in complex goal- and task-oriented systems whose main objectives are not just the retrieval of information.
New, original research in Information Retrieval is being blocked or hampered by the lack of a broader framework for evaluation.

19 Some MIRA Aims
Bring the user back into the evaluation process.
Understand the changing nature of Information Retrieval tasks and their evaluation.
Evaluate traditional evaluation methodologies.
Understand how interaction affects evaluation.
Understand how new media affects evaluation.
Make evaluation methods more practical for smaller groups.

20 MIRA Approaches
Developing methods and tools for evaluating interactive Information Retrieval.
Studying real users and their overall goals.
Designing a multimedia test collection.
Bringing together collaborative projects. (TREC was organized as a competition.)
Pooling tools and data.

21 Market Evaluation
Systems that are successful in the marketplace must be satisfying some group of users.

Example                   System                Documents                    Approach
Library catalogs          Library of Congress   catalog records              fielded data, Boolean search
Scientific information    Medline               index records + abstracts    thesaurus, ranked search
Web search                Google                web pages                    similarity + document rank

22 Market Research Methods of Evaluation
Expert opinion (e.g., consultant)
Competitive analysis
Focus groups
Observing users (user protocols)
Measurements: effectiveness in carrying out tasks, speed
Usage logs

23 Market Research Methods
[Table mapping evaluation methods (expert opinions, competitive analysis, focus groups, observing users, measurements, usage logs) to the development stages at which each is used: initial, mock-up, prototype, production]

24 Focus Group
A focus group is a group interview:
Interviewer
Potential users, typically 5 to 12, with similar characteristics (e.g., same viewpoint)
Structured set of questions; may show mock-ups
Group discussions
Repeated with contrasting user groups

25 The Search Explorer Application: Reconstructing a User Session