
Special Topics in Computer Science
Advanced Topics in Information Retrieval
Chapter 3: Retrieval Evaluation
Alexander Gelbukh, www.Gelbukh.com

2 Previous chapter
- Models are needed for formal operations
- The Boolean model is the simplest
- The vector model is the best combination of quality and simplicity
  o TF-IDF term weighting
  o This (or similar) weighting is used in all further models
- Many interesting and not well-investigated variations
  o possible future work

3 Previous chapter: Research issues
- How do people judge relevance?
  o ranking strategies
- How to combine different sources of evidence?
- What interfaces can help users understand and formulate their information need?
  o user interfaces: an open issue
- Meta-search engines: how to combine results from different Web search engines?
  o These results almost do not intersect
  o How to combine the rankings?

4 To write a paper: Evaluation!
- How do you measure whether a system is good or bad?
- To go in the right direction, you need to know where you want to get.
- "We can do it this way" vs. "This way it performs better":
  o "I think it is better..."
  o "We do it this way..."
  o "Our method takes into account syntax and semantics..."
  o "I like the results..."
- A criterion of truth. Crucial for any science.
- Enables competition, funding policy, attracts people
  o TREC international competitions

5 Methodology to write a paper
- Define your task and constraints formally
- Define your evaluation criterion formally (argue for it if needed)
  o One numerical value is better than several
- Show that your method gives a better value than:
  o the baseline (the simple obvious way), such as: retrieve all, retrieve none, retrieve at random, use Google
  o the state of the art (the best reported method), in the same setting and with the same evaluation method!
- ...and that your parameter settings are optimal
  o Consider extreme settings: 0, ...

6 ...Methodology
- The only valid way of reasoning
- But we want the clusters to be non-trivial
  o Add this as a penalty to your criteria, or as constraints
- Divide your acceptability considerations into:
  o Constraints: yes/no
  o Evaluation: better/worse
- Check that your evaluation criteria are well justified; not:
  o "My formula gives it this way"
  o "My result is correct since this is what my algorithm gives"
  o Reason in terms of the user task, not your algorithm / formulas
- Are your good/bad judgments in accord with intuition?

7 Evaluation? (Possible? How?)
- IR: user satisfaction
  o Difficult to model formally
  o Expensive to measure directly (experiments with human subjects)
- At least two contradicting parameters
  o Completeness vs. quality
  o No good way to combine them into one single numerical value
  o Some user-defined weights of importance of the two
- Not formal, depends on the situation
- An art

8 Parameters to evaluate
- Performance (in the general sense)
  o Speed
  o Space
  o Tradeoff
  o Common for all systems; not discussed here
- Retrieval performance (quality?)
  o = goodness of a retrieval strategy
  o A test reference collection: docs and queries
  o The correct set (or ordering) provided by experts
  o A similarity measure to compare the system output with the correct one

9 Evaluation: Model User Satisfaction
- User task
  o Batch query processing? Interaction? Mixed?
- Way of use
  o Real-life situation: what factors matter?
  o Interface type
- In this chapter: laboratory settings
  o Repeatability
  o Scalability

10 Sets (Boolean): Precision & Recall
- Tradeoff (as with time and space)
- Assumes the retrieval results are sets
  o as in the Boolean model; in the vector model, use a threshold
- Measures the closeness between two sets
- Recall: of the relevant docs, how many (%) were retrieved? The others are lost.
- Precision: of the retrieved docs, how many (%) are relevant? The others are noise.
- Nowadays, with huge collections, precision is more important!

11 Precision & Recall
- Let R be the set of relevant documents, A the answer set (retrieved documents), and Ra = R ∩ A
- Recall = |Ra| / |R|
- Precision = |Ra| / |A|   (see the example sketch below)
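
A minimal sketch of the set-based measures on slide 11, added for illustration (not part of the original slides); the document IDs and sets are made-up toy data.

```python
# Set-based precision and recall for one query, given the expert-judged
# relevant set R and the system's answer set A.

def precision_recall(relevant: set, retrieved: set) -> tuple[float, float]:
    """Return (precision, recall) for a single query."""
    ra = relevant & retrieved                      # Ra = R intersect A
    precision = len(ra) / len(retrieved) if retrieved else 0.0
    recall = len(ra) / len(relevant) if relevant else 0.0
    return precision, recall

# Example: 3 of the 4 retrieved docs are relevant; 3 of the 6 relevant docs were found.
p, r = precision_recall(relevant={1, 2, 3, 4, 5, 6}, retrieved={2, 3, 5, 99})
print(f"precision={p:.2f} recall={r:.2f}")        # precision=0.75 recall=0.50
```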

12 Ranked Output (Vector): ?
- Truth: an ordering built by experts
- System output: the guessed ordering
- Ways to compare the two rankings: ?
- Building the truth ordering is not possible or too expensive
- So it is not used (rarely used?) in practice
- One could build the truth set automatically
  o Research topic for us?

13 Ranked Output (Vector) vs. Set
- Truth: an unordered relevant set
- Output: an ordered guess
- Compare an ordered set with an unordered one

14 ...Ranked Output vs. Set (one query)
- Plot the precision vs. recall curve (see the sketch below)
- In the initial part of the list containing n% of all relevant docs, what is the precision?
  o 11 standard recall levels: 0%, 10%, ..., 90%, 100%
  o 0%: interpolated
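
An illustrative sketch of how the precision/recall points behind such a curve can be computed for one ranked list; the ranking and relevant set below are made-up toy data, not from the slides.

```python
# Precision at each point where a relevant document appears in one ranked list.

def pr_points(ranking: list, relevant: set) -> list[tuple[float, float]]:
    """Return (recall, precision) pairs, one per relevant doc found."""
    points, hits = [], 0
    for i, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            points.append((hits / len(relevant), hits / i))
    return points

ranking = ["d123", "d84", "d56", "d6", "d8", "d9", "d511", "d129", "d187", "d25"]
relevant = {"d123", "d56", "d9", "d25"}
for recall, precision in pr_points(ranking, relevant):
    print(f"recall={recall:.0%}  precision={precision:.2f}")
```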

15 ...Many queries
- Average precision and recall
- Ranked output: average precision at each recall level
- To get equal (standard) recall levels, interpolation is needed (see the sketch below)
  o with 3 relevant docs, there is no 10% level!
  o Interpolated value at level n = maximum known value between n and n + 1
  o If none is known, use the nearest known value
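
A sketch of 11-point interpolated precision averaged over queries, added for illustration. It uses the common convention "interpolated precision at level r = maximum precision at any recall >= r", a slight variant of the local rule stated on the slide; the (recall, precision) pairs are toy data.

```python
# 11-point interpolated precision for one query, then averaged over queries.

RECALL_LEVELS = [i / 10 for i in range(11)]   # 0.0, 0.1, ..., 1.0

def interpolated_11pt(points: list[tuple[float, float]]) -> list[float]:
    """points: (recall, precision) pairs for one query, in any order."""
    out = []
    for level in RECALL_LEVELS:
        candidates = [p for r, p in points if r >= level]
        out.append(max(candidates) if candidates else 0.0)
    return out

def average_over_queries(per_query: list[list[tuple[float, float]]]) -> list[float]:
    curves = [interpolated_11pt(pts) for pts in per_query]
    return [sum(col) / len(curves) for col in zip(*curves)]

q1 = [(0.25, 1.00), (0.50, 0.67), (0.75, 0.50), (1.00, 0.40)]
q2 = [(0.33, 0.50), (0.67, 0.40), (1.00, 0.30)]
print(average_over_queries([q1, q2]))          # averaged curve at the 11 levels
```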

16 Precision vs. Recall Figures
- Alternative method: document cutoff values
  o Precision at the first 5, 10, 15, 20, 30, 50, 100 docs (sketch below)
- Used to compare algorithms
  o Simple
  o Intuitive
- NOT a one-value comparison!
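
A small illustrative sketch of precision at the document cutoffs listed above; the ranking and relevant set are invented for the example.

```python
# Precision at fixed document cutoffs (P@k) for one ranked list.

CUTOFFS = [5, 10, 15, 20, 30, 50, 100]

def precision_at_cutoffs(ranking: list, relevant: set) -> dict[int, float]:
    return {k: len([d for d in ranking[:k] if d in relevant]) / k for k in CUTOFFS}

ranking = [f"d{i}" for i in range(1, 101)]
relevant = {"d1", "d3", "d7", "d20", "d42"}
print(precision_at_cutoffs(ranking, relevant))   # e.g. P@5 = 0.4
```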

17 Which one is better?
- [Figure: precision vs. recall curves of two algorithms compared]

18 Single-value summaries
- Curves cannot be used for averaging over multiple queries
- We need a single-value performance figure for each query (see the sketch below)
  o Can be averaged over several queries
  o A histogram over several queries can be made
  o Tables can be made
- Precision at the first relevant doc?
- Average precision at (each) seen relevant doc
  o Favors systems that return several relevant docs first
- R-precision
  o precision at the R-th retrieved doc (R = total number of relevant docs)
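
An illustrative sketch of the two single-value summaries named above, on toy data. As the slide's name suggests, the average here is taken over the relevant documents actually seen; note that the MAP-style variant divides by the total number of relevant documents instead.

```python
# Average precision at seen relevant documents, and R-precision, for one query.

def average_precision_at_seen(ranking: list, relevant: set) -> float:
    hits, precisions = 0, []
    for i, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            precisions.append(hits / i)        # precision at each seen relevant doc
    return sum(precisions) / len(precisions) if precisions else 0.0

def r_precision(ranking: list, relevant: set) -> float:
    r = len(relevant)                          # R = total number of relevant docs
    return len([d for d in ranking[:r] if d in relevant]) / r if r else 0.0

ranking = ["d3", "d9", "d1", "d7", "d5"]
relevant = {"d3", "d7", "d8"}
print(average_precision_at_seen(ranking, relevant))  # (1/1 + 2/4) / 2 = 0.75
print(r_precision(ranking, relevant))                # 1 relevant in top 3 -> 0.33
```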

19 Precision histogram
- Two algorithms: A, B
- For each query, plot R(A) - R(B), the difference of their R-precision values (sketch below)
- Which one is better?
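
A tiny sketch of the quantity plotted in such a histogram, with invented per-query R-precision values.

```python
# Per-query R-precision difference R(A) - R(B) between two systems.

r_prec_a = {"q1": 0.45, "q2": 0.30, "q3": 0.70}
r_prec_b = {"q1": 0.50, "q2": 0.10, "q3": 0.65}

diff = {q: r_prec_a[q] - r_prec_b[q] for q in r_prec_a}
wins_a = sum(1 for d in diff.values() if d > 0)   # positive bars favor A
print(diff, wins_a)
```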

20 Alternative measures for Boolean
- Problems with the Precision & Recall measures:
  o Recall cannot be estimated with large collections
  o Two values, but we need one value to compare
  o Designed for batch mode, not interactive use. Informativeness!
  o Designed for a linear ordering of docs (not a weak ordering)
- Alternative measures combine both into one (see the sketch below):
  o F-measure (harmonic mean): F = 2 / (1/r + 1/P)
  o E-measure: E = 1 - (1 + b^2) / (b^2/r + 1/P), where b expresses the user's preference for recall vs. precision
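
A sketch of the two combined measures as reconstructed above (the formulas were lost in the transcript); b > 1 emphasizes recall, b < 1 emphasizes precision, and the sample values are arbitrary.

```python
# Harmonic-mean F-measure and E-measure with user preference parameter b.

def f_measure(precision: float, recall: float) -> float:
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

def e_measure(precision: float, recall: float, b: float = 1.0) -> float:
    if precision == 0 or recall == 0:
        return 1.0
    return 1.0 - (1 + b * b) / (b * b / recall + 1 / precision)

print(f_measure(0.75, 0.5))           # 0.6
print(e_measure(0.75, 0.5, b=1.0))    # 0.4 (= 1 - F when b = 1)
```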

21 User-oriented measures: Definitions
- R: the set of relevant documents; A: the answer set
- U: the relevant documents previously known to the user
- Rk: the retrieved documents that are in U (relevant docs the user already knew)
- Ru: the retrieved relevant documents previously unknown to the user

22 User-oriented measures
- Coverage ratio = |Rk| / |U|
  o high when many of the docs the user expected are found
- Novelty ratio = |Ru| / (|Ru| + |Rk|)
  o high when many of the relevant docs found are new to the user (sketch below)
- Relative recall: # of relevant docs found / # the user expected to find
- Recall effort: # of docs the user expected / # examined until those are found
- Others:
  o expected search length (good for weak orderings)
  o satisfaction (considers only relevant docs)
  o frustration (considers only non-relevant docs)
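
A sketch of the coverage and novelty ratios defined above, using the notation of slide 21; the document sets are made-up examples.

```python
# Coverage and novelty ratios for one query.
# known  = relevant docs the user already knew (U)
# retrieved = answer set (A); relevant = all relevant docs (R)

def coverage_and_novelty(retrieved: set, relevant: set, known: set):
    rk = retrieved & known                  # known relevant docs that were retrieved
    ru = (retrieved & relevant) - known     # new relevant docs the user did not know
    coverage = len(rk) / len(known) if known else 0.0
    novelty = len(ru) / (len(ru) + len(rk)) if (ru or rk) else 0.0
    return coverage, novelty

print(coverage_and_novelty(retrieved={1, 2, 3, 7}, relevant={1, 2, 3, 4, 5}, known={1, 2, 4}))
# (0.67, 0.33): 2 of the 3 known docs were found; 1 of the 3 relevant docs found is new
```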

23 Reference collections
- Texts with queries and relevant docs known
- TREC: Text REtrieval Conference; different in different years
- Wide variety of topics; document structure marked up; 6 GB
- See the NIST website; available at small cost
- Not all relevant docs are marked!
  o Pooling method (see the sketch below):
  o the top 100 docs in the rankings of many search engines
  o manually verified
  o it was tested that this is a good approximation to the real set
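
A minimal sketch of the pooling idea described above: the union of the top-k documents from several systems' rankings is the only set judged manually (k = 100 in TREC; the rankings here are toy data).

```python
# Pooling: union of the top-k documents from several runs.

def pool(rankings: list[list[str]], k: int = 100) -> set[str]:
    pooled = set()
    for ranking in rankings:
        pooled.update(ranking[:k])
    return pooled

run_a = ["d1", "d7", "d3", "d9"]
run_b = ["d7", "d2", "d1", "d8"]
print(sorted(pool([run_a, run_b], k=3)))   # ['d1', 'd2', 'd3', 'd7']
```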

24 ...TREC tasks
- Ad hoc (conventional: query -> answer)
- Routing (ranked filtering of a changing collection)
- Chinese ad hoc
- Filtering (changing collection; no ranking)
- Interactive (no ranking)
- NLP: does it help?
- Cross-language (ad hoc)
- High precision (only 10 docs in the answer)
- Spoken document retrieval (written transcripts)
- Very large corpus (ad hoc, 20 GB = 7.5 M docs)
- Query task (several query versions; does the strategy depend on them?)
- Query transforming
  o Automatic
  o Manual

25 ...TREC evaluation
- Summary table statistics
  o # of requests used in the task
  o # of retrieved docs; # of relevant docs retrieved and not retrieved
- Recall-precision averages
  o 11 standard points, interpolated (and not)
- Document-level averages
  o can also include the average R-precision value
- Average precision histogram
  o by topic
  o e.g., the difference between the R-precision of this system and the average over all systems

26 Smaller collections
- Simpler to use
- Can include info that TREC does not
- Can be of a specialized type (e.g., include co-citations)
- Less sparse, greater overlap between queries
- Examples:
  o CACM
  o ISI
  o there are others

27 CACM collection
- Communications of the ACM articles
- Computer science
- Structure info (author, date, citations, ...)
- Stems (only title and abstract)
- Good for algorithms relying on cross-citations
  o If a paper cites another one, they are related
  o If two papers cite the same ones, they are related
- 52 queries with Boolean form and answer sets

28 ISI collection
- On information sciences
- 1460 docs
- For similarity in terms and cross-citations
- Includes:
  o Stems (titles and abstracts)
  o Number of cross-citations
- 35 natural-language queries with Boolean form and answer sets

29 Cystic Fibrosis (CF) collection
- Medical
- 1239 docs
- MEDLINE data
  o keywords assigned manually!
- 100 requests
- 4 relevance judgments for each doc
  o Good for seeing the degree of agreement
- Degrees of relevance, from 0 to 2
- Good answer-set overlap
  o can be used for learning from previous queries

30 Research issues
- Different types of interfaces; interactive systems:
  o What measures to use?
  o Such as informativeness

31 Conclusions
- Main measures: Precision & Recall
  o For sets
  o Rankings are evaluated through initial subsets
- There are measures that combine them into one
  o They involve user-defined preferences
- Many (other) characteristics
  o An algorithm can be good at some and bad at others
  o Averages are used, but they are not always meaningful
- Reference collections exist with known answers, for evaluating new algorithms

32 Thank you! Till... ??