1
Chapter 3: Retrieval Evaluation (Dr. Almetwally Mostafa)
2
Functional Evaluation Performance Evaluation Precision & Recall Collection Evaluation Interface Evaluation User Satisfaction Users Experiments
3
Functional analysis: Does the system provide most of the functions that the user expects? What are the unique functions of this system? How user-friendly is the system? Error analysis: How often does the system fail? How easily does the user make errors?
4
Given a query, how well will the system perform? How do we define retrieval performance? Is finding all the related information our goal? Is it possible to know that the system has found all the information? Given the user's information needs, how well will the system perform? Is the information found useful? -- Relevance
5
Relevance — Dictionary Definition: 1. Pertinence to the matter at hand. 2. Applicability to social issues. 3. Computer Science. The capability of an information retrieval system to select and retrieve data appropriate to a user's needs.
6
A measurement of the outcome of a search; the judgment on what should or should not be retrieved. There are no simple answers to what is relevant and what is not relevant: it is difficult to define and subjective, depending on knowledge, needs, time, situation, etc. It is the central concept of information retrieval.
7
Information needs: problems? requests? queries? The final test of relevance is whether users find the information useful: whether they can use the information to solve the problems they have, and whether it fills the information gap they perceived.
8
The user's judgment: how well the retrieved documents satisfy the user's information needs, and how useful they are. If a document is related but not useful, it is still not relevant. The system's judgment: how well the retrieved documents match the query, and how likely the user would be to judge this information as useful.
9
Subject: judged by subject relatedness. Novelty: how much new information is in the retrieved document. Uniqueness/Timeliness. Quality/Accuracy/Truth. Availability: source or pointer? Accessibility. Cost. Language: English or non-English. Readability.
10
Binary: relevant or not relevant. Likert scale: not relevant, somewhat relevant, relevant, highly relevant.
11
Given a query, how many documents should a system retrieve? Are all the retrieved documents relevant? Have all the relevant documents been retrieved? Measures for system performance: the first question is about the precision of the search, the second about its completeness (recall).
12
Contingency table of retrieval outcomes:

                 Relevant    Not relevant
Retrieved           a             b
Not retrieved       c             d

P = a / (a + b)
R = a / (a + c)
13
Precision = (number of relevant documents retrieved) / (total number of documents retrieved)

Recall = (number of relevant documents retrieved) / (number of all the relevant documents in the database)
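A minimal sketch of these two formulas in Python, assuming the retrieved results and the full set of relevant documents are available as sets of document IDs (the function and variable names are illustrative, not from the lecture):

```python
def precision_recall(retrieved, relevant):
    """Compute precision and recall from two sets of document IDs."""
    retrieved = set(retrieved)
    relevant = set(relevant)
    hits = len(retrieved & relevant)          # relevant documents retrieved (cell a)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Example: 3 of the 4 retrieved documents are relevant, out of 6 relevant overall.
p, r = precision_recall({"d1", "d2", "d3", "d7"}, {"d1", "d2", "d3", "d4", "d5", "d6"})
print(p, r)   # 0.75 0.5
```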
14
Precision measures how precise a search is: the higher the precision, the fewer unwanted documents. Recall measures how complete a search is: the higher the recall, the fewer missing documents.
15
Theoretically, R and P do not depend on each other. In practice, high recall is achieved at the expense of precision, and high precision at the expense of recall. When will P = 0? Only when none of the retrieved documents is relevant. When will P = 1? Only when every retrieved document is relevant.
16
What does P = 0.75 mean? What does R = 0.25 mean? What is your goal (in terms of P and R) when conducting a search? It depends on the purpose of the search, on the information needs, and on the system. What values of P and R would indicate a good system or a good search? There is no fixed value.
17
Why does increasing recall often mean decreasing precision? In order not to miss anything and to cover all possible sources, one would have to retrieve and examine many more materials, many of which might not be relevant.
18
An ideal IR system would have P = 1 and R = 1 for all queries. Is it possible? Why? If information needs could be defined very precisely, and if relevance judgments could be made unambiguously, and if query matching could be designed perfectly, then we would have an ideal system. But then it would not be an information retrieval system.
19
Combining recall and precision: the harmonic mean F of recall and precision, F = 2 / (1/R + 1/P), attempts to find the best possible compromise between R and P.
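As a small illustration, the harmonic mean can be computed directly from precision and recall values (a sketch only; the function name is made up for this example):

```python
def f_measure(recall, precision):
    """Harmonic mean of recall and precision: F = 2 / (1/R + 1/P)."""
    if recall == 0.0 or precision == 0.0:
        return 0.0
    return 2.0 / (1.0 / recall + 1.0 / precision)

# F lies between R and P but is pulled toward the smaller of the two.
print(f_measure(0.5, 0.75))   # 0.6
```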
20
By van Rijsbergen: the idea is to allow the user to specify whether he is more interested in recall or in precision. E = 1 - (1 + k^2) / (k^2/R + 1/P). Values of k greater than 1 give more weight to recall, while values smaller than 1 give more weight to precision.
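A sketch of the weighted measure under the definition above (function names are illustrative; the weighting direction in the comments follows van Rijsbergen's formulation):

```python
def f_k(recall, precision, k=1.0):
    """Weighted combination (1 + k^2) / (k^2/R + 1/P).
    k > 1 gives more weight to recall; k < 1 gives more weight to precision;
    k = 1 reduces to the plain harmonic mean F."""
    if recall == 0.0 or precision == 0.0:
        return 0.0
    return (1.0 + k * k) / (k * k / recall + 1.0 / precision)

def e_measure(recall, precision, k=1.0):
    """Van Rijsbergen's E measure: E = 1 - F_k (lower is better)."""
    return 1.0 - f_k(recall, precision, k)

print(f_k(0.5, 0.75, k=1.0))   # 0.6, same as the plain F measure
print(f_k(0.5, 0.75, k=2.0))   # ~0.536, pulled toward the (lower) recall
```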
21
Diagram: the set of retrieved documents overlapping the set of relevant documents, with the relevant documents known to the user and the relevant documents retrieved but unknown to the user marked separately.
22
Coverage: the fraction of the documents known to the user to be relevant that have actually been retrieved. Coverage = (relevant docs retrieved and known to the user) / (relevant docs known to the user). If coverage = 1, everything the user knows about has been retrieved.
23
Novelty: the fraction of the relevant documents retrieved that were previously unknown to the user. Novelty = (relevant docs retrieved that were unknown to the user) / (relevant docs retrieved).
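A sketch of both ratios, assuming three sets of document IDs are available: the retrieved documents, the relevant documents, and the relevant documents the user already knew about (all names are illustrative):

```python
def coverage_and_novelty(retrieved, relevant, known_to_user):
    """Coverage: fraction of the user's known relevant docs that were retrieved.
    Novelty: fraction of the relevant docs retrieved that were previously unknown."""
    retrieved, relevant, known = set(retrieved), set(relevant), set(known_to_user)
    relevant_retrieved = retrieved & relevant
    known_relevant = known & relevant
    coverage = (len(relevant_retrieved & known_relevant) / len(known_relevant)
                if known_relevant else 0.0)
    novelty = (len(relevant_retrieved - known) / len(relevant_retrieved)
               if relevant_retrieved else 0.0)
    return coverage, novelty

# The user already knew d1 and d2; the system retrieved d1, d3, d9; d1-d4 are relevant.
print(coverage_and_novelty({"d1", "d3", "d9"}, {"d1", "d2", "d3", "d4"}, {"d1", "d2"}))
# (0.5, 0.5): half of the known relevant docs were found, and half of the
# relevant docs retrieved were new to the user.
```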
24
Using recall & precision: conduct query searches and try many different queries (results may depend on the sample of queries), then compare the precision & recall results. Recall & precision need to be considered together.
25
P / R          Query 1      Query 2      Query 3      Query 4      Query 5
System A       0.9 / 0.1    0.7 / 0.4    0.45 / 0.5   0.3 / 0.6    0.1 / 0.8
System B       0.8 / 0.2    0.5 / 0.3    0.4 / 0.5    0.3 / 0.7    0.2 / 0.8
System C       0.9 / 0.4    0.7 / 0.6    0.5 / 0.7    0.3 / 0.8    0.2 / 0.9
26
Precision-recall plot (P and R from 0.1 to 1.0) comparing System A, System B, and System C.
27
Precision      System 1    System 2    System 3
R = .25        0.6         0.7         0.9
R = .50        0.5         0.4         0.7
R = .75        0.2         0.3         0.4
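One simple way to compare systems from such a table, shown as a sketch (the numbers are copied from the table above): average each system's precision over the fixed recall levels.

```python
# Precision at fixed recall levels (0.25, 0.50, 0.75) for the three systems above.
precision_at_recall = {
    "System 1": [0.6, 0.5, 0.2],
    "System 2": [0.7, 0.4, 0.3],
    "System 3": [0.9, 0.7, 0.4],
}

for system, precisions in precision_at_recall.items():
    mean_p = sum(precisions) / len(precisions)
    print(f"{system}: mean precision over recall levels = {mean_p:.2f}")
# System 3 has the highest precision at every recall level, so it also has
# the best average (0.67 vs. 0.43 and 0.47).
```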
28
Number of relevant documents retrieved by System A at each cutoff (N = number of documents retrieved):

N (retrieved)   Query 1   Query 2   Query 3   Average precision
10              4         5         6         0.5
20              4         5         16        0.41
30              5         5         17        0.3
40              8         6         24        0.31
50              10        6         25        0.27
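A sketch of how the averages in this table can be computed: precision is evaluated at each cutoff (number of documents retrieved) for every query and then averaged across the three queries. The counts are taken from the table above; small differences from the slide's averages are due to rounding.

```python
# Relevant documents retrieved by System A for each query at each cutoff N
# (number of documents retrieved), taken from the table above.
cutoffs = [10, 20, 30, 40, 50]
relevant_retrieved = {          # query -> counts at N = 10, 20, 30, 40, 50
    "Query 1": [4, 4, 5, 8, 10],
    "Query 2": [5, 5, 5, 6, 6],
    "Query 3": [6, 16, 17, 24, 25],
}

for i, n in enumerate(cutoffs):
    precisions = [counts[i] / n for counts in relevant_retrieved.values()]
    average = sum(precisions) / len(precisions)
    print(f"N={n}: average precision = {average:.2f}")
# Average precision tends to drop as more documents are retrieved,
# while recall goes up: the usual precision/recall trade-off.
```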
29
For a real-world system, recall is always an estimate, and results depend on the sample of queries. Recall and precision do not capture the interactive aspect of the retrieval process. Recall & precision are only one aspect of system performance: high recall/high precision is desirable, but not necessarily the most important thing the user considers. R and P are based on the assumption that the set of relevant documents for a query is the same, independent of the user.
30
Data quality: coverage of the database (it will not be found if it is not in the database); completeness and accuracy of data. Indexing methods and indexing quality: it will not be found if it is not indexed; indexing types; currency of indexing (is it updated often?); indexing sizes.
33
How do you evaluate http://scholar.google.com? Functional Evaluation, Performance Evaluation, Precision & Recall, Collection Evaluation, Interface Evaluation, User Satisfaction, Users Experiments.
34
User-friendly interface: How long does it take for a user to learn advanced features? How well can the user explore or interact with the query output? How easy is it to customize output displays?
35
User satisfaction: the final test is the user! User satisfaction is more important than precision and recall. Measuring user satisfaction: surveys, usage statistics, user experiments.
36
Observe and collect data on system behaviors, user search behaviors, and user-system interaction. Interpret experiment results for system comparisons, for understanding users' information-seeking behaviors, and for developing new retrieval systems/interfaces.
37
"An evaluation of retrieval effectiveness for a full-text document retrieval system", 1985, by David Blair and M. E. Maron. The first large-scale evaluation of full-text retrieval. Significant and controversial results. Good experimental design.
38
An IBM full-text retrieval system with 40,000 documents (about 350,000 pages): documents to be used in the defense of a large corporate lawsuit. Large by 1985 standards; a typical size today. Mostly Boolean searching functions, with some ranking functions added. Full-text automatic indexing.
39
Two lawyers generated 51 requests. Two paralegals conducted searches again and again until the lawyers were satisfied with the results, i.e., until the lawyers believed that more than 75% of the relevant documents had been found. The paralegals and lawyers could have as many discussions as needed.
40
Average precision = .79; average recall = .20.
41
The lawyers judged documents as "vital", "satisfactory", "marginally relevant", or "irrelevant". The first three categories were all counted as "relevant" in the precision calculation.
42
Sampling from subsets of the database believed to be rich in relevant documents, mixed with the retrieved sets and sent to the lawyers for relevance judgments.
43
The recall is low. Even though the recall was only 20%, the lawyers were satisfied (and believed that 75% of the relevant documents had been retrieved).
44
Why was the recall so low? Do we really need high recall? If the study were run today on search engines like Google, would the results be the same or different?
45
Levels of Evaluation On the engineering level On the input level On the processing level On the output level On the use and user level On the social level --- Tefko Saracevic, SIGIR’95
46
Focus of this week: understand the challenges of IR system evaluation and the pros and cons of several IR evaluation methods.