1
Is Relevance Associated with Successful Use of Information Retrieval Systems? William Hersh Professor and Head Division of Medical Informatics & Outcomes Research Oregon Health & Science University hersh@ohsu.edu
2
Goal of talk Answer question of association of relevance-based evaluation measures with successful use of information retrieval (IR) systems By describing two sets of experiments in different subject domains Since focus of talk is on one question assessed in different studies, I will necessarily provide only partial details of the studies
3
For more information on these studies… Hersh W et al., Challenging conventional assumptions of information retrieval with real users: Boolean searching and batch retrieval evaluations, Information Processing & Management, 2001, 37: 383-402. Hersh W et al., Further analysis of whether batch and user evaluations give the same results with a question-answering task, Proceedings of TREC-9, Gaithersburg, MD, 2000, 407-416. Hersh W et al., Factors associated with success for searching MEDLINE and applying evidence to answer clinical questions, Journal of the American Medical Informatics Association, 2002, 9: 283-293.
4
Outline of talk Information retrieval system evaluation Text REtrieval Conference (TREC) Medical IR Methods and results of experiments TREC Interactive Track Medical searching Implications
5
Information retrieval system evaluation
6
Evaluation of IR systems Important not only to researchers but also users so we can Understand how to build better systems Determine better ways to teach those who use them Cut through hype of those promoting them There are a number of classifications of evaluation, each with a different focus
7
Lancaster and Warner (Information Retrieval Today, 1993) Effectiveness e.g., cost, time, quality Cost-effectiveness e.g., per relevant citation, new citation, document Cost-benefit e.g., per benefit to user
8
Hersh and Hickam (JAMA, 1998) Was system used? What was it used for? Were users satisfied? How well was system used? Why did system not perform well? Did system have an impact?
9
Most research has focused on relevance-based measures Measure quantities of relevant documents retrieved Most common measures of IR evaluation in published research Assumptions commonly applied in experimental settings Documents are relevant or not to user information need Relevance is fixed across individuals and time
10
Recall and precision defined Recall = number of relevant documents retrieved / number of relevant documents in the collection Precision = number of relevant documents retrieved / total number of documents retrieved
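As a concrete illustration beyond the slide's definitions, here is a minimal Python sketch of how the two measures are computed for a single search; the document IDs and relevance judgments are invented for this example:

relevant = {"d1", "d3", "d5", "d7"}     # all documents judged relevant to the information need
retrieved = {"d1", "d2", "d3", "d9"}    # documents returned by the system

hits = relevant & retrieved             # relevant documents that were actually retrieved

recall = len(hits) / len(relevant)      # 2 / 4 = 0.50
precision = len(hits) / len(retrieved)  # 2 / 4 = 0.50

print(f"recall={recall:.2f}, precision={precision:.2f}")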
11
Some issues with relevance-based measures Some IR systems return retrieval sets of vastly different sizes, which can be problematic for “point” measures Sometimes it is unclear what a “retrieved document” is Surrogate vs. actual document Users often perform multiple searches on a topic, with changing needs over time There are differing definitions of what is a “relevant document”
12
What is a relevant document? Relevance is intuitive yet hard to define (Saracevic, various) Relevance is not necessarily fixed Changes across people and time Two broad views Topical – document is on topic Situational – document is useful to user in specific situation (aka, psychological relevance, Harter, JASIS, 1992)
13
Other limitations of recall and precision Magnitude of a “clinically significant” difference unknown Serendipity – sometimes we learn from information not relevant to the need at hand External validity of results – many experiments test using “batch” mode without real users; is not clear that results translate to real searchers
14
Alternatives to recall and precision “Task-oriented” approaches that measure how well user performs information task with system “Outcomes” approaches that determine whether system leads to better outcome or a surrogate for outcome Qualitative approaches to assessing user’s cognitive state as they interact with system
15
Text Retrieval Conference (TREC) Organized by the National Institute of Standards and Technology (NIST) Annual cycle consisting of Distribution of test collections and queries to participants Determination of relevance judgments and results Annual conference for participants at NIST (each fall) TREC-1 began in 1992 and has continued annually Web site: trec.nist.gov
16
TREC goals Assess many different approaches to IR with a common large test collection, set of real-world queries, and relevance judgments Provide forum for academic and industrial researchers to share results and experiences
17
Organization of TREC Began with two major tasks Ad hoc retrieval – standard searching Discontinued with TREC 2001 Routing – identify new documents with queries developed for known relevant ones In some ways, a variant of relevance feedback Discontinued with TREC-7 Has evolved to a number of tracks Interactive, natural language processing, spoken documents, cross-language, filtering, Web, etc.
18
What has been learned in TREC? Approaches that improve performance e.g., passage retrieval, query expansion, 2-Poisson weighting Approaches that may not improve performance e.g., natural language processing, stop words, stemming Do these kinds of experiments really matter? Criticisms of batch-mode evaluation from Swanson, Meadow, Saracevic, Hersh, Blair, etc. Results that question their findings from Interactive Track, e.g., Hersh, Belkin, Wu & Wilkinson, etc.
19
The TREC Interactive Track Developed out of interest in how real users might search using TREC queries, documents, etc. TREC 6-8 (1997-1999) used instance recall task TREC 9 (2000) and subsequent years used question-answering task Now being folded into Web track
20
TREC-8 Interactive Track Task for searcher: retrieve instances of a topic in a query Performance measured by instance recall Proportion of all instances retrieved by user Differs from document recall in that multiple documents on same topic count as one instance Used Financial Times collection (1991-1994) Queries derived from ad hoc collection Six 20-minute topics for each user Balanced design: “experimental” vs. “control”
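To make the distinction between instance recall and document recall concrete, here is a small illustrative Python sketch; the instance labels and saved documents are hypothetical:

# Instances (distinct aspects) of the topic identified by the assessors.
all_instances = {"i1", "i2", "i3", "i4", "i5"}

# Instances covered by each document the searcher saved (hypothetical).
saved_docs = {
    "docA": {"i1", "i2"},
    "docB": {"i2"},          # duplicates i2, so it adds no credit
    "docC": {"i4"},
}

covered = set().union(*saved_docs.values())
instance_recall = len(covered & all_instances) / len(all_instances)
print(instance_recall)       # 3 of 5 instances found = 0.60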
21
TREC-8 sample topic Title Hubble Telescope Achievements Description Identify positive accomplishments of the Hubble telescope since it was launched in 1991 Instances In the time allotted, please find as many DIFFERENT positive accomplishments of the sort described above as you can
22
TREC-9 Interactive Track Same general experimental design with A new task Question-answering A new collection Newswire from TREC disks 1-5 New topics Eight questions
23
Issues in medical IR Searching priorities vary by setting In busy clinical environment, users usually want quick, short answer Outside clinical environment, users may be willing to explore in more detail As in other scientific fields, researchers likely to want more exhaustive information Clinical searching task has many similarities to Interactive Track design, so methods are comparable
24
Some results of medical IR evaluations (Hersh, 2003) In large bibliographic databases (e.g., MEDLINE), recall and precision comparable to those seen in other domains (e.g., 50%-50%, minimal overlap across searchers) Bibliographic databases not amenable to busy clinical setting, i.e., not used often, information retrieved not preferred Biggest challenges now in digital library realm, i.e., interoperability of disparate resources
25
Methods and results Research question: Is relevance associated with successful use of information retrieval systems?
26
TREC Interactive Track and our research question Do the results of batch IR studies correspond to those obtained with real users? i.e., Do term weighting approaches that work better in batch studies also do better for real users? Methodology Identify a prior test collection that shows a large batch performance differential over some baseline Use interactive track to see if this difference is maintained with interactive searching and new collection Verify that previous batch difference is maintained with new collection
27
TREC-8 experiments Determine the best-performing measure Use instance recall data from previous years as batch test collection with relevance defined as documents containing at least one instance Perform user experiments TREC-8 Interactive Track protocol Verify optimal measure holds Use TREC-8 instance recall data as batch test collection similar to first experiment
28
IR system used for our TREC-8 (and 9) experiments MG Public domain IR research system Described in Witten et al., Managing Gigabytes, 1999 Experimental version implements all “modern” weighting schemes (e.g., TFIDF, Okapi, pivoted normalization) via Q-expressions, cf. Zobel and Moffat, SIGIR Forum, 1998 Simple Web-based front end
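To show roughly what distinguishes the weighting schemes named above, here is a simplified Python sketch of a basic TFIDF weight and the Okapi BM25 weight for a single query term; the exact Q-expression variants MG implements may differ, so treat the constants and formulas as illustrative assumptions rather than the configuration used in the experiments:

import math

def tfidf_weight(tf, df, n_docs):
    # Basic TFIDF: raw term frequency times inverse document frequency.
    return tf * math.log(n_docs / df)

def okapi_bm25_weight(tf, df, n_docs, doc_len, avg_doc_len, k1=1.2, b=0.75):
    # Okapi BM25: saturating term frequency plus document-length normalization.
    idf = math.log((n_docs - df + 0.5) / (df + 0.5) + 1)   # one common BM25 idf variant
    norm_tf = (tf * (k1 + 1)) / (tf + k1 * (1 - b + b * doc_len / avg_doc_len))
    return idf * norm_tf

# Toy comparison for one term occurring 1, 5, and 20 times in a document.
n_docs, df = 100000, 500
for tf in (1, 5, 20):
    print(tf,
          round(tfidf_weight(tf, df, n_docs), 2),
          round(okapi_bm25_weight(tf, df, n_docs, doc_len=300, avg_doc_len=250), 2))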
29
Experiment 1 – Determine best “batch” performance Okapi term weighting performs much better than TFIDF.
30
Experiment 2 – Did benefit occur with interactive task? Methods Two user populations Professional librarians and graduate students Using a simple natural language interface MG system with Web front end With two different term weighting schemes TFIDF (baseline) vs. Okapi
31
User interface
32
Results showed benefit for better batch system (Okapi) +18%, BUT...
33
All differences were due to one query
34
Experiment 3 – Did batch results hold with TREC-8 data? Yes, but still with high variance and without statistical significance.
35
TREC-9 Interactive Track experiments Similar to approach used in TREC-8 Determine the best-performing weighting measure Use all previous TREC data, since no baseline Perform user experiments Follow protocol of track Use MG Verify optimal measure holds Use TREC-9 relevance data as batch test collection analogous to first experiment
36
Determine best “batch” performance Okapi+PN term weighting performs better than TFIDF.
37
Interactive experiments – comparing systems Little difference across systems but note wide differences across questions.
38
Do batch results hold with new data? Batch results show improved performance whereas user results do not.
39
Further analysis (Turpin, SIGIR 2001) Okapi searches definitely retrieve more relevant documents Okapi+PN user searches have 62% better MAP Okapi+PN user searches have 101% better Precision@5 documents But Users do 26% more cycles with TFIDF Users get overall the same results across experiments
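For readers less familiar with the batch measures mentioned here, a brief Python sketch of average precision (the per-query quantity behind MAP) and Precision@5 for one ranked list; the relevance pattern and counts are invented:

# Ranked result list for one query: True = relevant, False = not relevant.
ranked = [True, False, True, True, False, False, True, False, False, False]
total_relevant = 5    # assume one judged-relevant document was never retrieved

hits, precisions = 0, []
for rank, rel in enumerate(ranked, start=1):
    if rel:
        hits += 1
        precisions.append(hits / rank)   # precision at each relevant document's rank

average_precision = sum(precisions) / total_relevant   # about 0.60 here
precision_at_5 = sum(ranked[:5]) / 5                   # 3 / 5 = 0.60

print(average_precision, precision_at_5)   # MAP averages this quantity over all queries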
40
Possible explanations for our TREC Interactive Track results Batch searching results may not generalize User data show wide variety of differences (e.g., search terms, documents viewed) which may overwhelm system measures Or we cannot detect that they generalize, in which case we could increase task, query, or system diversity or increase statistical power
41
Medical IR study design Orientation to experiment and system Brief training in searching and evidence- based medicine (EBM) Collect data on factors of users Subjects given questions and asked to search to find and justify answer Statistical analysis to find associations among user factors and successful searching
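The slide does not spell out which statistical methods were used, so as one hedged illustration, here is how an association between a single user factor (group membership) and successful answering could be checked with a chi-square test on a contingency table; the counts are hypothetical, chosen only to be roughly consistent with the post-search correctness rates reported later in the deck:

from scipy.stats import chi2_contingency

# Hypothetical 2x2 table: rows = user group, columns = questions answered correctly / incorrectly.
#                 correct  incorrect
table = [[70, 65],        # medical students
         [22, 41]]        # NP students

chi2, p_value, dof, expected = chi2_contingency(table)
print(f"chi2={chi2:.2f}, p={p_value:.3f}")   # a small p suggests group and success are associated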
42
System used – OvidWeb MEDLINE
43
Experimental design Recruited 45 senior medical students and 21 second (last) year NP students Large-group session Demographic/experience questionnaire Orientation to experiment, OvidWeb Overview of basic MEDLINE and EBM skills
44
Experimental design (cont.) Searching sessions Two hands-on sessions in library For each of three questions, randomly selected from 20, measured: Pre-search answer with certainty Searching and answering with justification and certainty Logging of system-user interactions User interface questionnaire (QUIS)
45
Searching questions Derived from two sources Medical Knowledge Self-Assessment Program (Internal Medicine board review) Clinical questions collection of Paul Gorman Worded to have answer of either Yes with good evidence Indeterminate evidence No with good evidence Answers graded by expert clinicians
46
Assessment of recall and precision Aimed to perform a “typical” recall and precision study and determine if they were associated with successful searching Designated “end queries” to have terminal set for analysis Half of all retrieved MEDLINE records judged by three physicians each as definitely relevant, possibly relevant, or not relevant Also measured reliability of raters
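The slide notes that rater reliability was also measured; the specific statistic is not given here, but a common choice for chance-corrected agreement between two raters is Cohen's kappa, sketched below in Python with hypothetical judgments (DR = definitely relevant, PR = possibly relevant, NR = not relevant):

from collections import Counter

def cohens_kappa(rater_a, rater_b):
    # Chance-corrected agreement between two raters over the same items.
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    return (observed - expected) / (1 - expected)

# Hypothetical relevance judgments from two raters on eight MEDLINE records.
a = ["DR", "PR", "NR", "NR", "DR", "PR", "DR", "NR"]
b = ["DR", "NR", "NR", "PR", "DR", "PR", "DR", "NR"]
print(cohens_kappa(a, b))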
47
Overall results Prior to searching, rate of correctness (32.1%) about equal to chance for both groups Rating of certainty low for both groups With searching, medical students increased rate of correctness to 51.6% but NP students remained virtually unchanged at 34.7%
48
Overall results Medical students were better able to convert incorrect into correct answers, whereas NP students were hurt as often as helped by searching.
49
Recall and precision Recall and precision were not associated with successful answering of questions and were nearly identical for medical and NP students.
50
Conclusions from results Medical students improved ability to answer questions with searching, NP students did not Spatial visualization ability may explain Answering questions required >30 minutes whether correct or incorrect This content not amenable to clinical setting Recall and precision had no relation to successful searching
51
Implications
52
Limitations of studies Domains Many more besides newswire and medicine Numbers of users and questions Small and not necessarily representative Experimental setting Real-world users may behave differently
53
But I believe we can conclude Although batch evaluations are useful early in system development, their results cannot be assumed to apply to real users Recall and precision are important components of searching but not the most important determiners of success Further research should investigate what makes documents relevant to users and helps them solve their information problems
54
Thank you for inviting me… www.irbook.org It’s great to be back in the Midwest!