Evaluation Experiments and Experience from the Perspective of Interactive Information Retrieval
Ross Wilkinson, Mingfang Wu
ICT Centre, CSIRO, Australia

Outline
A history of information retrieval (IR) evaluation
–System-oriented evaluation
–User-oriented evaluation
Our experience with user-oriented evaluation
Our observations
Lessons learnt

Information Retrieval

Why evaluate an IR system?
To select between alternative systems, algorithms, and models. Which choice is best for:
–Ranking function (dot product, cosine, …)
–Term selection (stop-word removal, stemming, …)
–Term weighting (TF, TF-IDF, …)
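
For concreteness, here is a minimal, self-contained sketch (in Python, not from the original slides) of the kind of choices being compared: TF-IDF term weighting combined with cosine ranking over a toy collection. The function names and the toy data are illustrative only.

```python
import math
from collections import Counter

def build_tfidf(docs):
    """Return (idf, vectors): IDF weights and one sparse TF-IDF vector per document."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    idf = {term: math.log(n / count) for term, count in df.items()}
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append({t: tf[t] * idf[t] for t in tf})
    return idf, vectors

def cosine(u, v):
    """Cosine similarity between two sparse term -> weight dictionaries."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm = math.sqrt(sum(w * w for w in u.values())) * math.sqrt(sum(w * w for w in v.values()))
    return dot / norm if norm else 0.0

# Toy collection of pre-tokenised documents and a query.
docs = [["interactive", "retrieval", "evaluation"],
        ["ranked", "list", "retrieval"],
        ["cluster", "based", "evaluation"]]
query = ["retrieval", "evaluation"]

idf, doc_vecs = build_tfidf(docs)
query_vec = {t: c * idf.get(t, 0.0) for t, c in Counter(query).items()}
ranking = sorted(range(len(docs)), key=lambda i: cosine(query_vec, doc_vecs[i]), reverse=True)
print(ranking)  # document indices, best match first
```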

The traditional IR evaluation
Test collection: a collection of documents, a set of queries, and the relevance judgements
Process: load the document collection, submit each query to the system, and collect the output
Measurement: usually precision and recall
[Diagram: the document collection and the query set feed the algorithm/system under test; the retrieved results are evaluated against the relevance judgements to yield precision and recall.]
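
As a small illustration of the measurement step (not part of the original slides), set-based precision and recall can be computed directly from the retrieved set and the relevance judgements:

```python
def precision_recall(retrieved, relevant):
    """Set-based precision and recall for one query.

    retrieved: iterable of document ids returned by the system
    relevant:  set of document ids judged relevant for the query
    """
    retrieved = set(retrieved)
    relevant = set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# e.g. the system returns 5 documents, 3 of which are among the 4 judged relevant
print(precision_recall({"d1", "d2", "d3", "d7", "d9"}, {"d1", "d2", "d3", "d4"}))  # (0.6, 0.75)
```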

Early Test Collections
Different research groups used different, small test collections:
–Hard to generalise the research outcomes
–Hard to compare systems/algorithms across sites

The TREC Benchmark
Text REtrieval Conference – organised by NIST, started in 1992; about 93 groups from 22 countries have participated.
Purposes:
–To encourage research in IR based on large text collections
–To provide a common ground/task evaluation that allows cross-site comparison
–To develop new evaluation techniques, particularly for new applications, e.g. filtering, cross-language retrieval, web retrieval, high precision, question answering

Problems with the system-oriented experiment
Pros:
–Advanced system development
Cons:
–The system is treated as an input-output device, while most real searches involve interaction.
–Relevance is binary and judged independently of context, while relevance is in fact:
Subjective: depends upon a specific user's judgement.
Situational: relates to the user's current needs.
Cognitive: depends on human perception and behaviour.
Dynamic: changes over time.

TREC interactive track
Goal: to investigate searching as an interactive task by examining the process as well as the outcome.

Interactive track tasks
TREC 3-4: finding relevant documents
TREC 5-9: finding any N short answers to a question to which there are multiple answers of the same type
TREC 10-11: finding any N short answers to a question, and finding any N websites that meet the need specified in the task statement
TREC 12: topic distillation

How to measure outcome?
Aspectual precision
–The proportion of the documents identified by a subject that were deemed to contain topic aspects.
Aspectual recall
–The proportion of the known topic aspects covered by the documents identified by a subject.
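
A minimal sketch of how these two measures could be computed from the aspect judgements; the data layout here is an assumption for illustration, not the official TREC scoring code.

```python
def aspectual_measures(saved_docs, doc_aspects, all_aspects):
    """Aspectual precision and recall for one searcher on one topic.

    saved_docs:  documents the subject identified (saved) during the session
    doc_aspects: mapping from document id to the set of topic aspects it was
                 judged to contain (empty set if it contains none)
    all_aspects: the full set of known aspects for the topic
    """
    saved = list(saved_docs)
    docs_with_aspects = [d for d in saved if doc_aspects.get(d)]
    covered = set().union(*(doc_aspects.get(d, set()) for d in saved)) if saved else set()
    precision = len(docs_with_aspects) / len(saved) if saved else 0.0
    recall = len(covered) / len(all_aspects) if all_aspects else 0.0
    return precision, recall

# Example: the subject saved three documents; d2 contains no aspect.
print(aspectual_measures(
    ["d1", "d2", "d3"],
    {"d1": {"a1", "a2"}, "d2": set(), "d3": {"a2", "a3"}},
    {"a1", "a2", "a3", "a4"},
))  # (0.67, 0.75) for this toy session
```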

How to measure process?
Objective measures:
–No. of query iterations
–No. of document surrogates seen
–No. of documents read
–No. of documents saved
–Actual time used
Subjective measures:
–Searchers' satisfaction with the interaction
–Searchers' self-perception of their task completeness
–Searchers' preference for a search system/interface
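
To make the objective measures concrete, here is a hedged sketch that aggregates them from a logged search session; the event names and structure are hypothetical, not the track's actual logging format.

```python
from dataclasses import dataclass

@dataclass
class Event:
    kind: str         # e.g. "query", "surrogate_view", "doc_read", "doc_save"
    timestamp: float  # seconds since session start

def objective_measures(events):
    """Summarise the objective process measures from one search session."""
    counts = {"query": 0, "surrogate_view": 0, "doc_read": 0, "doc_save": 0}
    for e in events:
        if e.kind in counts:
            counts[e.kind] += 1
    elapsed = max((e.timestamp for e in events), default=0.0)
    return {
        "query_iterations": counts["query"],
        "surrogates_seen": counts["surrogate_view"],
        "documents_read": counts["doc_read"],
        "documents_saved": counts["doc_save"],
        "time_used_seconds": elapsed,
    }
```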

Experimental Design
Factors: searchers, topic, and system
Latin square experimental design:

Searcher   Session 1 (System, Topic)   Session 2 (System, Topic)
1          E, B1                       C, B2
2          C, B2                       E, B1
3          E, B2                       C, B1
4          C, B1                       E, B2

E: experimental system, C: control system
B1 and B2 are two blocks of (4) topics
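
The counterbalancing can be sketched in a few lines; this generates the same four searcher assignments as the table above (possibly in a different row order), assuming two systems and two topic blocks.

```python
from itertools import product

# A minimal sketch of the counterbalanced (Latin-square style) assignment above:
# every searcher sees both systems, and both the system order and the topic-block
# pairing are rotated across searchers.
system_orders = [("E", "C"), ("C", "E")]
block_orders = [("B1", "B2"), ("B2", "B1")]

for searcher, (sys_order, blk_order) in enumerate(product(system_orders, block_orders), start=1):
    session1 = (sys_order[0], blk_order[0])  # (system, topic block) for the first session
    session2 = (sys_order[1], blk_order[1])  # (system, topic block) for the second session
    print(f"Searcher {searcher}: session 1 -> {session1}, session 2 -> {session2}")
```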

Experimental Procedure (steps in time order)
Entry questionnaire
For each system:
–Tutorial and demo
–Hands-on practice
–For each topic: pre-search questionnaire, search on the topic, after-search questionnaire
–After-system questionnaire
Exit questionnaire

Experiment I – clustering vs ranked list (I)
Hypothesis: a clustering structure is more effective than a ranked list for the aspect-finding task.

Experiment I – clustering vs ranked list (II)
Stage I – Can subjects recognise good clusters?
Experimental task: to judge the relevance of a cluster to the topic based only on the cluster's description
Non-standard TREC experiment; four subjects were involved.

The interface for judging the relevance of clusters

Experiment I – clustering vs ranked list (III)
Stage II – Can clusters be used effectively for the aspect-finding task?
TREC experiment: 8 topics, 16 searchers

The list interface

The clustering interface

Experiment I – findings
The clustering structure works for some topics, but overall there is no significant difference between the clustering structure and the ranked list.
Subjects preferred the clustering interface.

Experiment II – document summary
The relevant facts may exist within small chunks of a document, and these chunks may not be related to the main theme of the document.
Such a chunk usually contains the query keywords and takes the form of a complete sentence. We call this sentence the answer-indicative sentence (AIS).
When a user scans a document in search of facts, s/he usually uses a zoom-out strategy: keywords -> sentence -> document.

Experiment II – hypothesis
Hypothesis: answer-indicative sentences are a better surrogate of a document than the first N words for the purpose of interactive fact finding.

The AIS
An AIS must contain at least one query word and be at least ten words long.
The AISs are first ranked according to the number of unique query words each contains.
If two AISs contain the same number of unique query words, they are ranked by their order of appearance in the document.
The top three AISs are then selected.
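
A minimal sketch of these selection rules; the sentence splitting and tokenisation here are naive placeholders, not the implementation used in the experiment.

```python
import re

def select_ais(document_text, query_terms, top_k=3, min_len=10):
    """Select answer-indicative sentences (AIS) following the rules on the slide:
    keep sentences with at least one query word and at least ten words, rank by the
    number of unique query words (ties broken by position), return the top three."""
    query = {t.lower() for t in query_terms}
    # Naive sentence split on ., ! and ? -- a real system would use a proper tokeniser.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", document_text) if s.strip()]
    candidates = []
    for pos, sent in enumerate(sentences):
        words = [w.lower().strip(".,;:!?()\"'") for w in sent.split()]
        unique_hits = len(query & set(words))
        if unique_hits >= 1 and len(words) >= min_len:
            candidates.append((unique_hits, pos, sent))
    # Most unique query words first; earlier position in the document wins ties.
    candidates.sort(key=lambda c: (-c[0], c[1]))
    return [sent for _, _, sent in candidates[:top_k]]
```

Calling select_ais(text, ["query", "terms"]) returns up to three sentences to use as the document surrogate.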

Control System (FIRST20)

Experimental System (AIS3)

Experiment II – findings
Topic by topic, AIS3 had more successful sessions than First20 on 7 of the 8 topics.
Subject by subject, 10 subjects were more successful with AIS3 than with First20, while 2 subjects were more successful with First20.
Subjects found AIS3 easier to use, preferred it, and needed fewer interactions with it.

Experience
TREC interactive track evaluation platform
–Pros:
Leverages the shared effort to build the evaluation platform
Well-developed experimental design and procedure
–Cons:
Small number of subjects and topics
Hard to repeat experiments
Difficult to interpret results (e.g. performance vs. preference; effective delivery works in the right context)