Is Relevance Associated with Successful Use of Information Retrieval Systems?
William Hersh, Professor and Head, Division of Medical Informatics & Outcomes Research, Oregon Health & Science University

Goal of talk
- Answer the question of whether relevance-based evaluation measures are associated with successful use of information retrieval (IR) systems
- By describing two sets of experiments in different subject domains
- Since the focus of the talk is on one question assessed in different studies, I will necessarily provide only partial details of the studies

For more information on these studies…
- Hersh W et al., Challenging conventional assumptions of information retrieval with real users: Boolean searching and batch retrieval evaluations, Info. Processing & Management, 2001, 37:
- Hersh W et al., Further analysis of whether batch and user evaluations give the same results with a question-answering task, Proceedings of TREC-9, Gaithersburg, MD, 2000
- Hersh W et al., Factors associated with success for searching MEDLINE and applying evidence to answer clinical questions, Journal of the American Medical Informatics Association, 2002, 9:

Outline of talk
- Information retrieval system evaluation
  - Text REtrieval Conference (TREC)
  - Medical IR
- Methods and results of experiments
  - TREC Interactive Track
  - Medical searching
- Implications

Information retrieval system evaluation

Evaluation of IR systems
- Important not only to researchers but also to users, so we can
  - Understand how to build better systems
  - Determine better ways to teach those who use them
  - Cut through the hype of those promoting them
- There are a number of classifications of evaluation, each with a different focus

Lancaster and Warner (Information Retrieval Today, 1993)
- Effectiveness, e.g., cost, time, quality
- Cost-effectiveness, e.g., per relevant citation, new citation, document
- Cost-benefit, e.g., per benefit to user

Hersh and Hickam (JAMA, 1998)
- Was system used?
- What was it used for?
- Were users satisfied?
- How well was system used?
- Why did system not perform well?
- Did system have an impact?

Most research has focused on relevance-based measures
- Measure quantities of relevant documents retrieved
- Most common measures of IR evaluation in published research
- Assumptions commonly applied in experimental settings
  - Documents are relevant or not to the user information need
  - Relevance is fixed across individuals and time

Recall and precision defined
- Recall = (number of relevant documents retrieved) / (total number of relevant documents in the collection)
- Precision = (number of relevant documents retrieved) / (total number of documents retrieved)
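
The formulas on this slide did not survive the transcript; the definitions above are the standard ones. As a worked illustration (the function and variable names are mine, not from the talk), a minimal sketch in Python:

```python
def recall_precision(retrieved, relevant):
    """Compute recall and precision for one search.

    retrieved: iterable of document IDs returned by the system
    relevant:  iterable of document IDs judged relevant to the information need
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant                                  # relevant documents that were retrieved
    recall = len(hits) / len(relevant) if relevant else 0.0
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    return recall, precision

# Example: 3 of 10 retrieved documents are relevant; 8 relevant documents exist in total.
r, p = recall_precision(range(10), [0, 1, 2, 50, 51, 52, 53, 54])
print(f"recall={r:.2f} precision={p:.2f}")  # recall=0.38 precision=0.30
```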

Some issues with relevance-based measures
- Some IR systems return retrieval sets of vastly different sizes, which can be problematic for "point" measures
- Sometimes it is unclear what a "retrieved document" is
  - Surrogate vs. actual document
- Users often perform multiple searches on a topic, with changing needs over time
- There are differing definitions of what is a "relevant document"

What is a relevant document?
- Relevance is intuitive yet hard to define (Saracevic, various)
- Relevance is not necessarily fixed
  - Changes across people and time
- Two broad views
  - Topical – document is on topic
  - Situational – document is useful to user in specific situation (aka psychological relevance, Harter, JASIS, 1992)

Other limitations of recall and precision
- Magnitude of a "clinically significant" difference unknown
- Serendipity – sometimes we learn from information not relevant to the need at hand
- External validity of results – many experiments test in "batch" mode without real users; it is not clear that results translate to real searchers

Alternatives to recall and precision
- "Task-oriented" approaches that measure how well the user performs an information task with the system
- "Outcomes" approaches that determine whether the system leads to a better outcome or a surrogate for outcome
- Qualitative approaches to assessing the user's cognitive state as they interact with the system

Text REtrieval Conference (TREC)
- Organized by the National Institute of Standards and Technology (NIST)
- Annual cycle consisting of
  - Distribution of test collections and queries to participants
  - Determination of relevance judgments and results
  - Annual conference for participants at NIST (each fall)
- TREC-1 began in 1992 and has continued annually
- Web site: trec.nist.gov

TREC goals
- Assess many different approaches to IR with a common large test collection, set of real-world queries, and relevance judgments
- Provide a forum for academic and industrial researchers to share results and experiences

Organization of TREC
- Began with two major tasks
  - Ad hoc retrieval – standard searching
    - Discontinued with TREC 2001
  - Routing – identify new documents with queries developed for known relevant ones
    - In some ways, a variant of relevance feedback
    - Discontinued with TREC-7
- Has evolved to a number of tracks
  - Interactive, natural language processing, spoken documents, cross-language, filtering, Web, etc.

What has been learned in TREC?
- Approaches that improve performance
  - e.g., passage retrieval, query expansion, 2-Poisson weighting
- Approaches that may not improve performance
  - e.g., natural language processing, stop words, stemming
- Do these kinds of experiments really matter?
  - Criticisms of batch-mode evaluation from Swanson, Meadow, Saracevic, Hersh, Blair, etc.
  - Results that question their findings from the Interactive Track, e.g., Hersh, Belkin, Wu & Wilkinson, etc.
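
Query expansion is named above only as an example of an approach that tends to improve batch performance; purely as an illustration of the general idea (not the specific techniques evaluated at TREC), here is a minimal Rocchio-style pseudo-relevance feedback sketch, with invented parameter defaults and example terms:

```python
from collections import Counter

def rocchio_expand(query_terms, feedback_docs, alpha=1.0, beta=0.75, n_new_terms=5):
    """Rocchio-style expansion: keep the original query terms and add the
    highest-weighted terms from (pseudo-)relevant feedback documents.
    Purely illustrative; alpha/beta are common textbook defaults."""
    weights = Counter({t: alpha for t in query_terms})
    centroid = Counter()
    for doc in feedback_docs:                       # each doc is a list of tokens
        centroid.update(doc)
    for term, freq in centroid.items():
        weights[term] += beta * freq / len(feedback_docs)
    new_terms = [t for t, _ in weights.most_common() if t not in query_terms]
    return list(query_terms) + new_terms[:n_new_terms]

print(rocchio_expand(["hubble", "telescope"],
                     [["hubble", "mirror", "nasa"], ["telescope", "orbit", "nasa"]]))
# e.g. ['hubble', 'telescope', 'nasa', 'mirror', 'orbit']
```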

The TREC Interactive Track
- Developed out of interest in how real users might search using TREC queries, documents, etc.
- TREC 6-8 (1997-1999) used an instance recall task
- TREC 9 (2000) and subsequent years used a question-answering task
- Now being folded into the Web track

TREC-8 Interactive Track
- Task for searcher: retrieve instances of a topic in a query
- Performance measured by instance recall
  - Proportion of all instances retrieved by the user
  - Differs from document recall in that multiple documents on the same topic count as one instance
- Used Financial Times collection (1991-1994)
- Queries derived from ad hoc collection
- Six 20-minute topics for each user
- Balanced design: "experimental" vs. "control"
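
Instance recall as described on this slide can be computed directly from the instances covered by the documents a searcher saves; a minimal sketch (illustrative names, not the track's actual scoring code):

```python
def instance_recall(instances_in_saved_docs, all_instances):
    """Instance recall: proportion of all known instances of a topic covered by
    the searcher's saved documents. Multiple documents describing the same
    instance count only once."""
    covered = set()
    for doc_instances in instances_in_saved_docs:    # instances covered by each saved document
        covered.update(doc_instances)
    return len(covered & set(all_instances)) / len(all_instances)

# Two saved documents both mention instance "A"; one also mentions "B";
# four instances are known for the topic in total.
print(instance_recall([{"A"}, {"A", "B"}], {"A", "B", "C", "D"}))  # 0.5
```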

TREC-8 sample topic
- Title: Hubble Telescope Achievements
- Description: Identify positive accomplishments of the Hubble telescope since it was launched in 1991
- Instances: In the time allotted, please find as many DIFFERENT positive accomplishments of the sort described above as you can

TREC-9 Interactive Track
- Same general experimental design with
  - A new task: question-answering
  - A new collection: newswire from TREC disks 1-5
  - New topics: eight questions

Issues in medical IR
- Searching priorities vary by setting
  - In a busy clinical environment, users usually want a quick, short answer
  - Outside the clinical environment, users may be willing to explore in more detail
  - As in other scientific fields, researchers are likely to want more exhaustive information
- Clinical searching task has many similarities to the Interactive Track design, so methods are comparable

Some results of medical IR evaluations (Hersh, 2003)
- In large bibliographic databases (e.g., MEDLINE), recall and precision comparable to those seen in other domains (e.g., 50%-50%, minimal overlap across searchers)
- Bibliographic databases not amenable to busy clinical setting, i.e., not used often, information retrieved not preferred
- Biggest challenges now in digital library realm, i.e., interoperability of disparate resources

Methods and results
Research question: Is relevance associated with successful use of information retrieval systems?

TREC Interactive Track and our research question
- Do the results of batch IR studies correspond to those obtained with real users?
  - i.e., do term weighting approaches that work better in batch studies do better for real users?
- Methodology
  - Identify a prior test collection that shows a large batch performance differential over some baseline
  - Use the Interactive Track to see if this difference is maintained with interactive searching and a new collection
  - Verify that the previous batch difference is maintained with the new collection

TREC-8 experiments
- Determine the best-performing measure
  - Use instance recall data from previous years as a batch test collection, with relevance defined as documents containing ≥1 instance
- Perform user experiments
  - TREC-8 Interactive Track protocol
- Verify the optimal measure holds
  - Use TREC-8 instance recall data as a batch test collection, similar to the first experiment

IR system used for our TREC-8 (and 9) experiments
- MG
  - Public-domain IR research system
  - Described in Witten et al., Managing Gigabytes, 1999
  - Experimental version implements all "modern" weighting schemes (e.g., TFIDF, Okapi, pivoted normalization) via Q-expressions, cf. Zobel and Moffat, SIGIR Forum, 1998
- Simple Web-based front end
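
For readers unfamiliar with the weighting schemes being compared, the following is a minimal sketch of Okapi BM25-style term weighting in Python. The constants k1 and b are common defaults, and the function signature is invented; the exact Q-expression parameterizations used in MG are described by Zobel and Moffat, so treat this as illustrative rather than what these experiments ran.

```python
import math

def bm25_score(query_terms, doc_tf, doc_len, avg_doc_len, df, n_docs, k1=1.2, b=0.75):
    """Okapi BM25-style score of one document for a bag-of-words query.

    doc_tf: dict mapping term -> frequency of the term in the document
    df:     dict mapping term -> number of documents containing the term
    """
    score = 0.0
    for t in query_terms:
        if t not in doc_tf:
            continue
        idf = math.log(1 + (n_docs - df[t] + 0.5) / (df[t] + 0.5))        # rarer terms weigh more
        tf_norm = (doc_tf[t] * (k1 + 1)) / (
            doc_tf[t] + k1 * (1 - b + b * doc_len / avg_doc_len))         # length-normalized TF
        score += idf * tf_norm
    return score

print(bm25_score(["hubble", "telescope"], {"hubble": 3, "telescope": 1},
                 doc_len=120, avg_doc_len=100,
                 df={"hubble": 50, "telescope": 400}, n_docs=100000))
```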

Experiment 1 – Determine best “batch” performance Okapi term weighting performs much better than TFIDF.

Experiment 2 – Did benefit occur with interactive task?
- Methods
  - Two user populations: professional librarians and graduate students
  - Using a simple natural language interface: MG system with Web front end
  - With two different term weighting schemes: TFIDF (baseline) vs. Okapi

User interface

Results showed benefit for better batch system (Okapi) +18%, BUT...

All differences were due to one query

Experiment 3 – Did batch results hold with TREC-8 data? Yes, but still with high variance and without statistical significance.

TREC-9 Interactive Track experiments
- Similar to approach used in TREC-8
- Determine the best-performing weighting measure
  - Use all previous TREC data, since no baseline
- Perform user experiments
  - Follow protocol of track
  - Use MG
- Verify the optimal measure holds
  - Use TREC-9 relevance data as a batch test collection, analogous to the first experiment

Determine best “batch” performance Okapi+PN term weighting performs better than TFIDF.

Interactive experiments – comparing systems Little difference across systems but note wide differences across questions.

Do batch results hold with new data? Batch results show improved performance whereas user results do not.

Further analysis (Turpin, SIGIR 2001)
- Okapi searches definitely retrieve more relevant documents
  - Okapi+PN user searches have 62% better MAP
  - Okapi+PN user searches have 101% better documents
- But
  - Users do 26% more cycles with TFIDF
  - Users get overall the same results in the experiments
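
MAP (mean average precision) is the batch measure cited in the first bullet; a minimal sketch of how it is computed from ranked result lists and relevance judgments (illustrative only, not trec_eval):

```python
def average_precision(ranked_docs, relevant):
    """Average precision of one ranked list: mean of precision values at the
    rank of each relevant document retrieved, divided by total relevant."""
    relevant = set(relevant)
    hits, total = 0, 0.0
    for rank, doc in enumerate(ranked_docs, start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank                    # precision at this rank
    return total / len(relevant) if relevant else 0.0

def mean_average_precision(runs):
    """MAP over a set of (ranked_docs, relevant) pairs, one per topic."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)

print(mean_average_precision([(["d1", "d2", "d3"], {"d1", "d3"}),
                              (["d4", "d5"], {"d5"})]))  # (0.833 + 0.5) / 2 ≈ 0.67
```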

Possible explanations for our TREC Interactive Track results
- Batch searching results may not generalize
  - User data show wide variety of differences (e.g., search terms, documents viewed) which may overwhelm system measures
- Or we cannot detect that they do
  - Increase task, query, or system diversity
  - Increase statistical power

Medical IR study design
- Orientation to experiment and system
- Brief training in searching and evidence-based medicine (EBM)
- Collect data on factors of users
- Subjects given questions and asked to search to find and justify answer
- Statistical analysis to find associations among user factors and successful searching
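
The study's actual statistical models are described in the JAMIA paper; as a hedged sketch only, here is one generic way an association between user factors and successful searching could be examined, using simulated data and a plain logistic regression (the variable names, data, and model are invented for illustration, not the study's):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 60                                         # hypothetical searcher-question records
spatial = rng.normal(50, 10, n)                # e.g., a spatial visualization test score
experience = rng.integers(0, 6, n)             # e.g., years of online searching experience

# Simulate an outcome in which the chance of a correct post-search answer
# rises with both factors (invented relationship, for illustration only).
p = 1 / (1 + np.exp(-(-4 + 0.06 * spatial + 0.3 * experience)))
correct = rng.binomial(1, p)

df = pd.DataFrame({"correct": correct, "spatial": spatial, "experience": experience})
model = smf.logit("correct ~ spatial + experience", data=df).fit(disp=False)
print(model.params)                            # estimated association of each factor with success
```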

System used – OvidWeb MEDLINE

Experimental design
- Recruited
  - 45 senior medical students
  - 21 second (last) year NP students
- Large-group session
  - Demographic/experience questionnaire
  - Orientation to experiment, OvidWeb
  - Overview of basic MEDLINE and EBM skills

Experimental design (cont.)
- Searching sessions
  - Two hands-on sessions in library
  - For each of three questions, randomly selected from 20, measured:
    - Pre-search answer with certainty
    - Searching and answering with justification and certainty
    - Logging of system-user interactions
  - User interface questionnaire (QUIS)

Searching questions
- Derived from two sources
  - Medical Knowledge Self-Assessment Program (Internal Medicine board review)
  - Clinical questions collection of Paul Gorman
- Worded to have answer of either
  - Yes with good evidence
  - Indeterminate evidence
  - No with good evidence
- Answers graded by expert clinicians

Assessment of recall and precision
- Aimed to perform a "typical" recall and precision study and determine if they were associated with successful searching
- Designated "end queries" to have terminal set for analysis
- Half of all retrieved MEDLINE records judged by three physicians each as definitely relevant, possibly relevant, or not relevant
- Also measured reliability of raters
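
Reliability of the physician raters could be quantified with a chance-corrected agreement statistic; as a hedged illustration (not necessarily the statistic reported in the study), a minimal Cohen's kappa for one pair of raters, with invented ratings:

```python
from collections import Counter

def cohens_kappa(ratings_a, ratings_b):
    """Chance-corrected agreement between two raters over the same items.
    (With three raters, pairwise kappas or Fleiss' kappa could be used.)"""
    assert len(ratings_a) == len(ratings_b)
    n = len(ratings_a)
    observed = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
    counts_a, counts_b = Counter(ratings_a), Counter(ratings_b)
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / n ** 2   # agreement expected by chance
    return (observed - expected) / (1 - expected)

a = ["definitely", "possibly", "not", "definitely", "not", "possibly"]
b = ["definitely", "not",      "not", "possibly",   "not", "possibly"]
print(round(cohens_kappa(a, b), 2))  # 0.5
```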

Overall results
- Prior to searching, rate of correctness (32.1%) about equal to chance for both groups
  - Rating of certainty low for both groups
- With searching, medical students increased rate of correctness to 51.6%, but NP students remained virtually unchanged at 34.7%

Overall results Medical students were better able to convert incorrect into correct answers, whereas NP students were hurt as often as helped by searching.

Recall and precision Recall and precision were not associated with successful answering of questions and were nearly identical for medical and NP students.

Conclusions from results
- Medical students improved ability to answer questions with searching; NP students did not
  - Spatial visualization ability may explain the difference
- Answering questions required >30 minutes whether correct or incorrect
  - This content not amenable to clinical setting
- Recall and precision had no relation to successful searching

Implications

Limitations of studies
- Domains
  - Many more besides newswire and medicine
- Numbers of users and questions
  - Small and not necessarily representative
- Experimental setting
  - Real-world users may behave differently

But I believe we can conclude
- Although batch evaluations are useful early in system development, their results cannot be assumed to apply to real users
- Recall and precision are important components of searching but not the most important determinants of success
- Further research should investigate what makes documents relevant to users and helps them solve their information problems

Thank you for inviting me… It’s great to be back in the Midwest!