Chapter 3: Retrieval Evaluation (Dr. Almetwally Mostafa, 1/2/2016)

 Functional Evaluation  Performance Evaluation  Precision & Recall  Collection Evaluation  Interface Evaluation  User Satisfaction  User Experiments

 Functional analysis  Does the system provide most of the functions that the user expects?  What are the unique functions of this system?  How user-friendly is the system?  Error analysis  How often does the system fail?  How easy is it for the user to make errors?

 Given a query, how well will the system perform?  How do we define retrieval performance?  Is finding all the related information our goal?  Is it possible to know that the system has found all the information?  Given the user's information needs, how well will the system perform?  Is the information found useful? -- Relevance

 Relevance — Dictionary Definition:  1. Pertinence to the matter at hand.  2. Applicability to social issues.  3. Computer Science. The capability of an information retrieval system to select and retrieve data appropriate to a user's needs.

 A measurement of the outcome of a search  The judgment on what should or should not be retrieved  There are no simple answers to what is relevant and what is not relevant  difficult to define  subjective  depending on knowledge, needs, time, situation, etc.  The central concept of information retrieval

 Information needs  problems?  requests?  queries?  The final test of relevance is  if users find the information useful  if users can use the information to solve the problems they have  if users can fill the information gap they perceived.

 The user's judgment  How well the retrieved documents satisfy the user's information needs  How useful the retrieved documents are  If a document is related but not useful, it is still not relevant  The system's judgment  How well the retrieved documents match the query  How likely would the user judge this information as useful?

 Subject: judged by subject relatedness  Novelty: how much new information is in the retrieved document  Uniqueness/Timeliness  Quality/Accuracy/Truth  Availability  Source or pointer?  Accessibility  Cost  Language: English or non-English  Readability

 Binary  relevant or not relevant  Likert scale  Not relevant,  somewhat relevant,  relevant,  highly relevant

 Given a query, how many documents should a system retrieve:  Are all the retrieved documents relevant?  Have all the relevant documents been retrieved ?  Measures for system performance:  The first question is about the precision of the search  The second is about the completeness (recall) of the search.

                Relevant    Not Relevant
Retrieved          a             b
Not retrieved      c             d

P = a / (a + b)        R = a / (a + c)

Precision = (number of relevant documents retrieved) / (total number of documents retrieved)

Recall = (number of relevant documents retrieved) / (number of relevant documents in the database)

 Precision measures how precise a search is.  The higher the precision,  the fewer unwanted documents.  Recall measures how complete a search is.  The higher the recall,  the fewer missed documents.
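To make the two measures concrete, here is a minimal sketch (not part of the original slides) that computes precision and recall for a single query from a set of retrieved document IDs and a set of relevance judgments; the document IDs and the helper name precision_recall are invented for illustration.

```python
# Minimal sketch: precision and recall for one query.
# Document IDs below are made-up examples, not data from the slides.

def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant                                    # a: relevant documents retrieved
    precision = len(hits) / len(retrieved) if retrieved else 0.0   # a / (a + b)
    recall = len(hits) / len(relevant) if relevant else 0.0        # a / (a + c)
    return precision, recall

# Example: 4 of the 10 retrieved documents are relevant,
# out of 8 relevant documents in the whole collection.
retrieved = [f"d{i}" for i in range(1, 11)]
relevant = ["d2", "d3", "d7", "d9", "d20", "d21", "d22", "d23"]
p, r = precision_recall(retrieved, relevant)
print(f"P = {p:.2f}, R = {r:.2f}")   # P = 0.40, R = 0.50
```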

 Theoretically,  R and P do not depend on each other.  Practically,  High recall is achieved at the expense of precision.  High precision is achieved at the expense of recall.  When will P = 0?  Only when none of the retrieved documents is relevant.  When will P = 1?  Only when every retrieved document is relevant.

 What does P = 0.75 mean?  What does R = .25 mean?  What is your goal (in terms of P & R) when conducting a search?  Depending on the purpose of the search  Depending on information needs  Depending on the system  What values of P and R would indicate a good system or a good search?  There is no fixed value.

 Why does increasing recall often mean decreasing precision?  In order not to miss anything and to cover all possible sources, one would have to scan many more materials, many of which might not be relevant.

 An ideal IR system would have P = 1 and R = 1 for all queries  Is it possible? Why?  If information needs could be defined very precisely, and  If relevance judgments could be done unambiguously, and  If query matching could be designed perfectly,  Then we would have an ideal system.  But then it would no longer be an information retrieval problem.

 Combining recall and precision  The harmonic mean F of recall & precision: F = 2 / (1/R + 1/P)  An attempt to find the best possible compromise between R & P

 By van Rijsbergen: the idea is to allow the user to specify whether he is more interested in recall or in precision  E = 1 - (1 + k^2) / (k^2/R + 1/P), where lower E is better  With this formula, values of k greater than 1 give more weight to recall, while values smaller than 1 give more weight to precision
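A small sketch (again not from the slides) of both combined measures, assuming the formulas as reconstructed above; the function names and the sample precision/recall values are invented.

```python
# Hedged sketch of the combined measures: F is the harmonic mean of recall
# and precision, E is van Rijsbergen's measure with user-chosen parameter k.
# Lower E means better retrieval; the sample values below are invented.

def f_measure(p, r):
    """F = 2 / (1/R + 1/P), the harmonic mean of precision and recall."""
    if p == 0 or r == 0:
        return 0.0
    return 2 / (1 / r + 1 / p)

def e_measure(p, r, k=1.0):
    """E = 1 - (1 + k^2) / (k^2/R + 1/P); k > 1 weights recall more heavily."""
    if p == 0 or r == 0:
        return 1.0
    return 1 - (1 + k ** 2) / (k ** 2 / r + 1 / p)

p, r = 0.40, 0.50
print(f_measure(p, r))           # harmonic mean of 0.40 and 0.50 (about 0.44)
print(e_measure(p, r, k=2.0))    # k > 1: recall weighted more heavily
print(e_measure(p, r, k=0.5))    # k < 1: precision weighted more heavily
```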

(Diagram: the retrieved documents overlapping the relevant documents, with the relevant documents split into those known to the user and those retrieved but unknown to the user)

 Coverage: the fraction of the documents known to the user to be relevant which has actually been retrieved  Coverage = (relevant docs retrieved and known to the user) / (relevant docs known to the user)  If coverage = 1,  everything the user knows has been retrieved.

 Novelty: the fraction of the relevant documents retrieved which was unknown to the user.  Novelty = (relevant docs retrieved that were unknown to the user) / (relevant docs retrieved)
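The two user-oriented measures can be sketched as follows (not from the slides); the sets of document IDs and the function name are invented for illustration.

```python
# Hedged sketch of coverage and novelty. All document IDs are invented.

def coverage_and_novelty(retrieved, relevant, known_to_user):
    """Coverage = |relevant retrieved already known to the user| / |relevant known to the user|
    Novelty  = |relevant retrieved unknown to the user| / |relevant retrieved|"""
    retrieved, relevant, known = set(retrieved), set(relevant), set(known_to_user)
    relevant_retrieved = retrieved & relevant
    known_relevant = relevant & known
    coverage = len(relevant_retrieved & known) / len(known_relevant) if known_relevant else 0.0
    novelty = len(relevant_retrieved - known) / len(relevant_retrieved) if relevant_retrieved else 0.0
    return coverage, novelty

retrieved = {"d1", "d2", "d3", "d4"}        # what the system returned
relevant = {"d2", "d3", "d4", "d9"}         # all relevant documents
known_to_user = {"d2", "d9"}                # relevant documents the user already knew
print(coverage_and_novelty(retrieved, relevant, known_to_user))   # (0.5, 0.666...)
```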

 Using Recall & Precision  Conduct query searches  Try many different queries  Results may depend on the sample of queries.  Compare results on Precision & Recall  Recall & Precision need to be considered together.

P/R          Query 1    Query 2    Query 3    Query 4    Query 5
System A     0.9/…      …          …/0.5      0.3/0.6    0.1/0.8
System B     0.8/…      …          …          …/0.7      0.2/0.8
System C     0.9/…      …          …          …/0.8      0.2/0.9

(Figure: recall-precision curves comparing System A, System B, and System C)

(Table: precision at recall levels R = .25, .50, and .75 for System 1, System 2, and System 3)
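The table above reports precision at the recall levels 0.25, 0.50, and 0.75. One common way to obtain such numbers is interpolated precision: at each recall level, take the highest precision observed at that level or any higher one. The sketch below is not from the slides; the ranked list, the relevance judgments, and the function names are invented.

```python
# Hedged sketch: interpolated precision at fixed recall levels,
# computed from one invented ranked result list.

def precision_recall_points(ranked_docs, relevant):
    """(recall, precision) after each rank position."""
    points, hits = [], 0
    for i, doc in enumerate(ranked_docs, start=1):
        if doc in relevant:
            hits += 1
        points.append((hits / len(relevant), hits / i))
    return points

def interpolated_precision(points, recall_level):
    """Maximum precision at recall >= recall_level."""
    candidates = [p for r, p in points if r >= recall_level]
    return max(candidates) if candidates else 0.0

ranked = ["d2", "d7", "d1", "d5", "d9", "d3", "d8", "d4", "d6", "d10"]
relevant = {"d2", "d5", "d9", "d4"}
pts = precision_recall_points(ranked, relevant)
for level in (0.25, 0.50, 0.75):
    print(level, interpolated_precision(pts, level))   # 1.0, 0.6, 0.6
```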

(Table: worked example for System A showing the number of relevant documents for each query, precision at N = 10, 20, 30, 40, 50 documents retrieved, and the resulting average precision)
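The worked example above apparently measures System A's precision at several cutoffs (N documents retrieved) for each query and then averages across queries. A hedged sketch of that kind of computation, using invented rankings and judgments and smaller cutoffs than the slide's N = 10..50, might look like this.

```python
# Hedged sketch: precision at fixed cutoffs for each query, averaged across
# queries. Rankings and relevance judgments are invented.

def precision_at(ranked_docs, relevant, n):
    """Precision after the top-n retrieved documents."""
    top_n = ranked_docs[:n]
    return sum(1 for d in top_n if d in relevant) / n

# ranked result lists and judged-relevant sets for three example queries
runs = {
    "q1": (["d3", "d9", "d1", "d4", "d7", "d2", "d8", "d5", "d6", "d10"], {"d3", "d4", "d7", "d11"}),
    "q2": (["d12", "d5", "d13", "d2", "d1", "d9", "d4", "d6", "d8", "d3"], {"d5", "d2", "d9"}),
    "q3": (["d1", "d2", "d3", "d4", "d5", "d6", "d7", "d8", "d9", "d10"], {"d1", "d6"}),
}

for n in (5, 10):
    per_query = [precision_at(docs, rel, n) for docs, rel in runs.values()]
    avg = sum(per_query) / len(per_query)
    print(f"P@{n}: per query = {per_query}, average = {avg:.2f}")
```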

 For a real-world system, Recall is always an estimate.  Results depend on the sample of queries.  Recall and Precision do not capture the interactive aspects of the retrieval process.  Recall & Precision are only one aspect of system performance  High recall/high precision is desirable, but not necessarily the most important thing that the user considers.  R and P are based on the assumption that the set of relevant documents for a query is the same, independent of the user.

 Data quality  Coverage of database  It will not be found if it is not in the database.  Completeness and accuracy of data  Indexing methods and indexing quality  It will not be found if it is not indexed.  indexing types  currency of indexing ( Is it updated often?)  indexing sizes

 How do you evaluate?  Functional Evaluation  Performance Evaluation  Precision & Recall  Collection Evaluation  Interface Evaluation  User Satisfaction  User Experiments

 User-friendly interface  How long does it take for a user to learn advanced features?  How well can the user explore or interact with the query output?  How easy is it to customize output displays?

 User satisfaction  The final test is the user!  User satisfaction is more important than precision and recall  Measuring user satisfaction  Surveys  Usage statistics  User experiments

 Observe and collect data on  System behaviors  User search behaviors  User-system interaction  Interpret experiment results  for system comparisons  for understanding user’s information seeking behaviors  for developing new retrieval systems/interfaces

 An evaluation of retrieval effectiveness for a full-text document retrieval system  1985, by David Blair and M. E. Maron  The first large-scale evaluation of full-text retrieval  Significant and controversial results  Good experimental design

 An IBM full-text retrieval system with 40,000 documents (about 350,000 pages).  Documents to be used in the defense of a large corporate lawsuit.  Large by 1985 standards; a typical size today  Mostly Boolean searching functions, with some ranking functions added.  Full-text automatic indexing.

 Two lawyers generated 51 requests.  Two paralegals conducted searches repeatedly until the lawyers were satisfied with the results,  i.e., until the lawyers believed that more than 75% of the relevant documents had been found.  The paralegals and lawyers could have as many discussions as needed.

 Average precision = .79  Average recall = .20

 The lawyers judged documents as “vital”, “satisfactory”, “marginally relevant”, or “irrelevant”  All of the first three categories were counted as “relevant” in the precision calculation.

 Sampling from a subset of the database believed to be rich in relevant documents  Mixed with the retrieved sets and sent to the lawyers for relevance judgments

 The recall is low.  Even though the recall is only 20%, the lawyers were satisfied (and believed that 75% of relevant documents had been retrieved).

 Why was the recall so low?  Do we really need high recall?  If the study were run today on search engines like Google, would the results be the same or different?

 Levels of Evaluation  On the engineering level  On the input level  On the processing level  On the output level  On the use and user level  On the social level --- Tefko Saracevic, SIGIR’95

 Focus of this week  Understand challenges of IR system evaluations  Pros and cons of several IR evaluation methods