
1 Evaluation of IR LIS531H

2 Why eval? When designing and using a system there are decisions to be made:
- Manual or automatic indexing?
- Controlled vocabularies or natural language?
- Stem or not? Use stop lists?
- What kind of query syntax?
- How to display the retrieved items to the user?
- What would be best for your user population and company/library?
Are human judgments consistent? Reliable? We need to demonstrate empirically that a system is good, and to find ways to improve it.

3 How to evaluate Consider IR evaluation as any other research project: we need repeatable criteria for evaluating how effective an IR system is at meeting the information needs of its users. How do we define the task the human is attempting? What criteria define success?

4 How to evaluate? Big question! Evaluate a specific component or the whole system?
- Speed of retrieval?
- Resources required
- Presentation of documents
- Market appeal
Evaluation is generally comparative: system A vs. system B. Cost-benefit analysis is also possible. Most commonly evaluated: retrieval effectiveness.

5 How to evaluate Relevance and test collections. Metrics:
- Recall and precision
- E and F
- Swets and operating characteristics
- Expected search length
Case studies (known item searches). Significance tests. Other issues and problems.

6 Relevance A relevant doc is one judged useful in the context of a query, but...
- Who judges? What's useful?
- Serendipitous utility
- Humans aren't consistent in judgments
With real collections, one never knows the full set of relevant documents in the collection. Retrieval models incorporate the notion of relevance:
- Probability of relevance, P(relevance | query, document)
- How similar the document is to the query
- What language model is used in the IR system

7 How to find all relevant docs? The collection needs complete judgments: "yes", "no", or some score for every query-doc pair! For small test collections, assessors can review all docs for all queries. Not practical for medium or large collections. Other approaches:
- Pooling, sampling, search-based

8 Finding relevant docs (2) Search based:
- Rather than read every doc, use manually-guided search; read retrieved docs until convinced all relevant documents have been found
Pooling (see the sketch below):
- Retrieve docs using several, usually automatic, techniques; judge the top n docs for each technique. The relevant set is the union; a subset of the true relevant set
Sampling:
- Estimate the size of the true relevant set by sampling
All are incomplete, so when testing...
- How should unjudged documents be treated?
- How might this affect results?
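To make the pooling idea concrete, here is a minimal Python sketch; the run contents, document IDs, and cutoff n below are hypothetical.

```python
# Minimal sketch of pooling: the judgment pool for a query is the union of the
# top-n documents from every run; only pooled documents are judged by assessors.

def build_pool(runs, n=100):
    """runs: list of ranked lists of doc IDs; returns the set of docs to judge."""
    pool = set()
    for ranking in runs:
        pool.update(ranking[:n])   # top-n from each technique
    return pool

# Example with three hypothetical runs
run_a = ["d3", "d7", "d1", "d9"]
run_b = ["d7", "d2", "d3", "d5"]
run_c = ["d8", "d3", "d6", "d1"]
print(build_pool([run_a, run_b, run_c], n=2))   # {'d2', 'd3', 'd7', 'd8'} (order may vary)
```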

9 Evaluation Fundamentals D = set of documents; A = set of documents that satisfy some user-based criterion; B = set of documents identified by the search system.

10 Evaluation Fundamentals
recall = retrieved relevant / relevant = |A ∩ B| / |A|
precision = retrieved relevant / retrieved = |A ∩ B| / |B|
fallout = retrieved not-relevant / not-relevant = |B − A ∩ B| / |D − A|
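A small Python sketch of these set-based definitions, using hypothetical document IDs for D, A, and B:

```python
# Minimal sketch of the set-based measures (doc IDs are hypothetical).
D = {f"d{i}" for i in range(1, 21)}          # all documents in the collection
A = {"d1", "d2", "d3", "d4", "d5"}           # relevant documents (user criterion)
B = {"d2", "d3", "d6", "d7"}                 # documents retrieved by the system

recall    = len(A & B) / len(A)              # retrieved relevant / relevant
precision = len(A & B) / len(B)              # retrieved relevant / retrieved
fallout   = len(B - A) / len(D - A)          # retrieved non-relevant / non-relevant

print(recall, precision, fallout)            # 0.4 0.5 0.133...
```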

11 Evaluation Fundamentals Recall and precision depend on the concept of relevance. Relevance is a context- and task-dependent property of documents. [What's linguistically and philosophically wrong about this approach?] "Relevance is the correspondence in context between an information requirement statement... and an article (a document), that is, the extent to which the article covers the material that is appropriate to the requirement statement." F. W. Lancaster, 1979

12 Evaluation Fundamentals Relevancy judgments vary in reliability:  For text documents, users knowledgeable about a subject are usually in close agreement in deciding whether a document is relevant to an info requirement (the query)  Less consistency with non-textual docs, such as photos  Attempts to use a level of relevance such as a Likert scale haven’t worked. [Do you believe it? Why not? Other tests?]

13 Effectiveness Measures - Recall & Precision
Precision
- Proportion of a retrieved set that's relevant
- Precision = |relevant ∩ retrieved| ÷ |retrieved|
Recall
- Proportion of all relevant docs in the collection included in the retrieved set
- Recall = |relevant ∩ retrieved| ÷ |relevant| = P(retrieved | relevant)
Precision and recall are well-defined for sets. For ranked retrieval:
- compute a P/R point for each relevant document; compare a value at fixed recall points (e.g., precision at 20% recall); compute a value at fixed rank cutoffs (e.g., precision at rank 20)

14 Another common representation
                 Relevant   Not relevant
Retrieved            A            B
Not retrieved        C            D
Relevant = A+C; Retrieved = A+B; Collection size = A+B+C+D
Precision = A÷(A+B); Recall = A÷(A+C); Miss = C÷(A+C); False alarm (fallout) = B÷(B+D)

15 Precision & Recall example

16 Average precision of a query We often want a single-number effectiveness measure; average precision is widely used. Calculate it by averaging the precision values obtained each time recall increases (i.e., at each relevant document retrieved), as sketched below:
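A minimal Python sketch of (non-interpolated) average precision for a single query; the ranking and relevance judgments below are hypothetical.

```python
# Minimal sketch: average precision for one query.
# `ranking` is the system's ranked list of doc IDs; `relevant` the judged-relevant set.
def average_precision(ranking, relevant):
    hits, precisions = 0, []
    for i, doc in enumerate(ranking, start=1):
        if doc in relevant:                  # recall increases only at relevant docs
            hits += 1
            precisions.append(hits / i)      # precision at this rank
    return sum(precisions) / len(relevant) if relevant else 0.0

# Hypothetical example: relevant docs retrieved at ranks 1, 3 and 5
print(average_precision(["a", "x", "b", "y", "c"], {"a", "b", "c"}))  # ~0.756
```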

17 Precision and Recall - ex. 2

18 Evaluation Fundamentals - History Cranfield Experiments - Cyril Cleverdon, Cranfield College of Aeronautics, 1957-68. SMART System, G. Salton, Cornell, 1964-1988. TREC, Donna Harman, National Institute of Standards and Technology (NIST), 1992-
- Various TREC tracks

19 Sample test collections
Collection            Cranfield   CACM      ISI      West         TREC-2
Coll size (docs)      1,400       3,204     1,460    11,953       742,611
Size (MB)             1.5         2.3       2.2      254          2,162
Year                  1968        1983      1983     1990         1991
Unique stems          8,226       5,493     5,448    196,707      1,040,415
Stem occurrences      123,200     117,568   98,304   21,798,833   243,800,000
Max within-doc freq   27          —         —        1,309        —
Mean doc length       88          36.7      67.3     1,823        328
# of queries          225         50        35       44           100

20 Sample Experiment (Cranfield) Comparative efficiency of indexing systems: Universal Decimal Classification, alphabetical subject index, a special facet classification, and the Uniterm system of co-ordinate indexing. Four indexes were prepared manually for each document in three batches of 6,000 documents -- a total of 18,000 documents, each indexed four times. The documents were reports and papers in aeronautics. Indexes for testing were prepared on index cards and other cards. Very careful control of indexing procedures.

21 Sample Experiment (Cranfield) Searching: 1,200 test questions, each satisfied by at least one document; reviewed by an expert panel; searches carried out by 3 expert librarians; two rounds of searching to develop the testing methodology; subsidiary experiments at English Electric Whetstone Laboratory and Western Reserve University.

22 Sample Experiment (Cranfield) The Cranfield data was made widely available and used by other researchers. Salton used the Cranfield data with the SMART system (a) to study the relationship between recall and precision, and (b) to compare automatic indexing with human indexing. Sparck Jones and van Rijsbergen used the Cranfield data for experiments in relevance weighting, clustering, definition of test corpora, etc.

23 Sample Experiment (Cranfield) Cleverdon's work was applied to matching methods. He made extensive use of recall and precision, based on the concept of relevance. [Scatter plot of precision (%) vs. recall (%); each x represents one search. The graph illustrates the trade-off between precision and recall.]

24 Typical precision-recall graph for different queries [Graph: precision vs. recall curves for a broad, general query and a narrow, specific query, using Boolean-type queries.]

25 Some Cranfield Results The various manual indexing systems have similar retrieval efficiency. Retrieval effectiveness using automatic indexing can be at least as effective as manual indexing with controlled vocabularies:
-> original results from the Cranfield + SMART experiments (published in 1967)
-> considered counter-intuitive
-> other results since then have supported this conclusion

26 Precision and Recall with Ranked Results Precision and recall are defined for a fixed set of hits, e.g., Boolean retrieval. Their use needs to be modified for a ranked list of results.

27 Ranked retrieval: Recall and precision after retrieval of n documents. SMART system using Cranfield data, 200 documents in aeronautics of which 5 are relevant.
n    relevant   recall   precision
1    yes        0.2      1.0
2    yes        0.4      1.0
3    no         0.4      0.67
4    yes        0.6      0.75
5    no         0.6      0.60
6    yes        0.8      0.67
7    no         0.8      0.57
8    no         0.8      0.50
9    no         0.8      0.44
10   no         0.8      0.40
11   no         0.8      0.36
12   no         0.8      0.33
13   yes        1.0      0.38
14   no         1.0      0.36
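The table can be reproduced with a short Python sketch; the relevance flags are taken from the table above, and the total number of relevant documents in the collection is 5.

```python
# Sketch: recall and precision after each rank n, given the relevance flag of each
# retrieved document and the total number of relevant documents (5 here).
rels = [True, True, False, True, False, True, False, False,
        False, False, False, False, True, False]
total_relevant = 5

found = 0
for n, rel in enumerate(rels, start=1):
    found += rel
    print(n, rel, round(found / total_relevant, 2), round(found / n, 2))
```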

28 Evaluation Fundamentals: Graphs [Recall-precision graph plotting the points from the table above, labeled with ranks 1, 2, 3, 4, 5, 6, 12, 13, 200.] Note: Some authors plot recall against precision.

29 11 Point Precision (Recall Cut Off) p(n) is the precision at the point where recall has first reached n. Define 11 standard recall points p(r_0), p(r_1), ..., p(r_10), where p(r_j) = p(j/10). Note: if p(r_j) is not an exact data point, use interpolation.
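A sketch of 11-point interpolated precision, using the common convention that interpolated precision at recall level r is the maximum precision at any recall >= r; the (recall, precision) points are those of the running example.

```python
# Sketch of 11-point interpolated precision (step-function interpolation).
def eleven_point(rp_points):
    """rp_points: list of (recall, precision) pairs for one query."""
    levels = [j / 10 for j in range(11)]
    return [max((p for r, p in rp_points if r >= level), default=0.0)
            for level in levels]

# (recall, precision) at each relevant document from the example above
points = [(0.2, 1.0), (0.4, 1.0), (0.6, 0.75), (0.8, 0.67), (1.0, 0.38)]
print(eleven_point(points))
# [1.0, 1.0, 1.0, 1.0, 1.0, 0.75, 0.75, 0.67, 0.67, 0.38, 0.38]
```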

30 Recall cutoff graph: choice of interpolation points [Graph: the recall-precision points from the example (ranks 1-200); the blue line is the recall cutoff graph.]

31 Example 2: SMART
Recall   Precision
0.0      1.0
0.1      1.0
0.2      1.0
0.3      1.0
0.4      1.0
0.5      0.75
0.6      0.75
0.7      0.67
0.8      0.67
0.9      0.38
1.0      0.38
Precision values in blue are actual data; precision values in red are by interpolation (by convention, equal to the next actual data value).

32 Average precision Average precision for a single topic is the mean of the precision obtained after each relevant document is retrieved. Example: p = (1.0 + 1.0 + 0.75 + 0.67 + 0.38) / 5 = 0.76. Mean average precision for a run consisting of many topics is the mean of the average precision scores for each individual topic in the run. Definitions from TREC-8.
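A minimal sketch of MAP as the mean of the per-topic average precision values; the topic names and AP scores below are hypothetical.

```python
# Sketch: MAP is simply the mean of the per-topic average precision values.
ap_per_topic = {"topic_1": 0.76, "topic_2": 0.42, "topic_3": 0.55}   # hypothetical
map_score = sum(ap_per_topic.values()) / len(ap_per_topic)
print(round(map_score, 3))   # 0.577
```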

33 Normalized recall measure [Diagram: recall plotted against the ranks of the retrieved documents (5, 10, 15, 20, ..., 195, 200), comparing the ideal ranks, the actual ranks, and the worst-case ranks.]

34 Normalized recall Normalized recall = (area between actual and worst) / (area between best and worst). After some mathematical manipulation:
R_norm = 1 - (sum_{i=1..n} r_i - sum_{i=1..n} i) / (n(N - n))
where r_i is the rank at which the i-th relevant document is retrieved, n is the number of relevant documents, and N is the collection size.
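A small sketch of this formula, assuming `ranks` holds the (1-based) ranks at which the n relevant documents were retrieved; the example uses the ranks 1, 2, 4, 6, 13 from the earlier table with N = 200.

```python
# Sketch of normalized recall: r_i are the ranks of the n relevant documents,
# N is the collection size.
def normalized_recall(ranks, N):
    n = len(ranks)
    ideal = sum(range(1, n + 1))             # best case: relevant docs at ranks 1..n
    return 1 - (sum(ranks) - ideal) / (n * (N - n))

print(normalized_recall([1, 2, 4, 6, 13], N=200))   # ~0.989
```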

35 Combining Recall and Precision: Normalized Symmetric Difference A = relevant set, B = retrieved set, D = set of documents. Symmetric difference: the set of elements belonging to one but not both of two given sets, S = A ∪ B - A ∩ B.
Normalized symmetric difference = |S| / (|A| + |B|) = 1 - 1 / (1/2 (1/recall + 1/precision))

36 Statistical tests Suppose that a search is carried out on systems i and j. System i is superior to system j if, for all test cases, recall(i) >= recall(j) and precision(i) >= precision(j). In practice, we have data from a limited number of test cases. What conclusions can we draw?

37 Recall-precision graph [Graph: two recall-precision curves, one red and one black.] The red system appears better than the black, but is the difference statistically significant?

38 Statistical tests The t-test is the standard statistical test for comparing two tables of numbers, but it depends on assumptions of independence and normal distributions that do not apply to these data. The sign test makes no assumption of normality and uses only the sign (not the magnitude) of the differences in the sample values, but assumes independent samples. The Wilcoxon signed-rank test uses the ranks of the differences, not their magnitudes; it makes no assumption of normality but also assumes independent samples.
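A sketch of both tests applied to hypothetical per-query average precision scores for two systems, using scipy (binomtest requires scipy >= 1.7; the score values are made up for illustration).

```python
# Sketch: comparing per-query average precision of two systems with the sign test
# (a binomial test on the signs of the differences) and the Wilcoxon signed-rank test.
from scipy.stats import binomtest, wilcoxon

ap_system_a = [0.31, 0.45, 0.27, 0.60, 0.52, 0.38, 0.44, 0.29]   # hypothetical
ap_system_b = [0.25, 0.40, 0.30, 0.48, 0.47, 0.33, 0.41, 0.26]   # hypothetical

diffs = [a - b for a, b in zip(ap_system_a, ap_system_b)]

# Sign test: how often does A beat B?
wins = sum(d > 0 for d in diffs)
print(binomtest(wins, n=len(diffs), p=0.5).pvalue)

# Wilcoxon signed-rank test: uses the ranks of the |differences|
print(wilcoxon(ap_system_a, ap_system_b).pvalue)
```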

39 Test collections Compare retrieval performance using a test collection:
- a set of documents, a set of queries, and a set of relevance judgments (which docs are relevant to each query)
To compare the performance of two techniques:
- Each technique is used to evaluate the test queries
- Results (set or ranked list) are compared using some performance measure; most commonly precision and recall
Usually multiple measures are used to get different views of performance; usually test with multiple collections (performance is collection dependent).

40 Recall/Precision graphs Average precision hides info Sometimes better to show tradeoff in table or graph

41 Averaging across queries Hard to compare P/R graphs or tables for individual queries (too much data). Two main types of averaging:
- Micro-average (each relevant doc is a point in the avg)
- Macro-average (each query is a point in the avg)
Also done with the average precision value:
- Average of many queries' average precision values
- Called mean average precision (MAP)

42 Averaging Graphs: a false start The raw plots of P/R usually require interpolation

43 Interpolation As recall increases, precision decreases (on average!). One approach takes a set of recall/precision points (R, P) and makes a step function:

44 Step graph

45 Another example Van Rijsbergen, p. 117 (1979)

46 Interpolation and Averaging

47 Interpolated average precision Average precision at standard recall points. For a given query, compute a P/R point for every relevant doc. Interpolate precision at standard recall levels:
- 11-pt is usually 100%, 90%, ..., 10%, 0%
- 3-pt is usually 75%, 50%, 25%
Average over all queries to get average precision at each recall level. Average the interpolated recall levels to get a single result:
- "mean average precision" is common.

48 Improvements in IR over the years

49 Effectiveness Measures: E & F E measure (van Rijsbergen), used to emphasize precision (or recall):
E = 1 - 1 / (alpha(1/P) + (1 - alpha)(1/R))
- Essentially a weighted average of precision and recall
- Large alpha increases the importance of precision
- Can transform by alpha = 1/(beta^2 + 1), where beta reflects the relative importance of recall vs. precision
- When beta = 1 (alpha = 1/2), P and R have equal importance
- E is the normalized symmetric difference of the retrieved and relevant sets

50 Symmetric Difference and E

51 F measure F = 1 - E is often used - good results mean larger values of F. The "F1" measure is popular, especially with classification researchers: F with beta = 1.

52 F measure as an average The harmonic mean of P and R (the reciprocal of the average of their reciprocals!); it heavily penalizes low values of P or R.
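A minimal sketch of the F measure in its weighted (F-beta) form; the precision and recall values below are hypothetical.

```python
# Sketch of the F measure as the (weighted) harmonic mean of precision and recall:
# F_beta = (beta^2 + 1) * P * R / (beta^2 * P + R)
def f_measure(precision, recall, beta=1.0):
    if precision == 0 or recall == 0:
        return 0.0
    b2 = beta ** 2
    return (b2 + 1) * precision * recall / (b2 * precision + recall)

print(f_measure(0.5, 0.4))          # F1 ~ 0.444 (heavily penalizes the lower value)
print(f_measure(0.5, 0.4, beta=2))  # beta > 1 weights recall more heavily
```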

53 F measure, geometric interpretation

54 Other single-valued measures Swets (next) Expected search length Normalized precision or recall (measures area between actual and ideal curves) Breakeven point (point at which precision = recall) Utility measures  Assign costs to each cell in the contingency table; sum costs for all queries Many others!

55 Swets Model, 1963 Based on signal detection and statistical decision theory. "Properties of a desirable measure of retrieval performance":
- Based on the ability of the system to distinguish between wanted and unwanted items
- Should express discrimination power independent of "any acceptance criterion" employed by the system or user!
Characterizes the recall-fallout curve generated by variation of a control variable, e.g., P(Q|D). Assume this variable is normally distributed on the set of relevant docs and also on the set of non-relevant documents.

56 Probability distributions for relevant and non-relevant documents. Two normal distributions of the score variable: one on the set of relevant docs, the other on the non-relevant documents. Shaded areas represent recall and fallout.

57 Swets model, cont. Note - we're skipping a lot more about interpreting Swets. Recognize that Swets doesn't use recall/precision but recall/fallout. From this come the DARPA-style studies that plot the probability of missing a document vs. the probability of a false alarm (false hits).

58 Expected Search Length Evaluation is based on the type of info need:
- Only one relevant doc is required
- Some arbitrary number n
- All relevant documents
- A given proportion of relevant documents
The search strategy's output is assumed to be a weak ordering (a simple ordering means never having 2+ docs at the same level of the ordering). Search length in a single ordering is the number of non-relevant docs a user must scan before the info need is satisfied. Expected search length is the appropriate measure for a weak ordering (see the sketch below).
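A sketch of Cooper-style expected search length under the assumptions above: `levels` lists (relevant, non-relevant) document counts per level of the weak ordering, and q is the number of relevant documents the user needs; the example counts are hypothetical.

```python
# Sketch of expected search length for a weak ordering.
def expected_search_length(levels, q):
    need, esl = q, 0.0
    for R, s in levels:                     # R relevant, s non-relevant at this level
        if need > R:                        # level exhausted: scan all its docs
            esl += s
            need -= R
        else:                               # need satisfied within this level
            esl += need * s / (R + 1)       # expected non-relevant seen before the
            return esl                      # need-th relevant in a random order
    return esl                              # not enough relevant docs in the ordering

# Hypothetical weak ordering: level 1 has 2 relevant + 3 non-relevant, level 2 has 4 + 6
print(expected_search_length([(2, 3), (4, 6)], q=3))   # 3 + 1*6/5 = 4.2
```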

59 Expected Search Length

60 Evaluation Case Study: Known Item Search What if purpose of query is to find something you know exists? (e.g., a homepage) Effective Site Finding using Anchor Information (Craswell, Hawking, and Robertson, SIGIR 2001) Links turn out to be more useful than content! Turning this idea into a researchable problem …

61 Problem statement Site finding  A site is an organized collection of pages on a specific topic, maintained by a single person or group  Not the same as a domain (e.g., aol.com which has numerous sites)  Topic can be extremely broad Examples (from a query log)  Where can I find the official Harry Potter site?  [Not: “Who was Cleopatra”, “Where’s Paris”] Therefore, given a query, find a URL.

62 Setting up for an evaluation Recall: need docs, queries, and relevance judgments. The doc set is a collection of 18.5 million web pages. Here, we will create query-URL pairs. Query-first selection:
- Randomly choose a query from an Excite query log
- Have a human find the URL that is a match
- Must ensure that the page exists in the test collection
Document-first selection:
- Randomly select a web page
- Navigate to its "entry page"
- Create a query that matches it
- May not be representative, though

63 Evaluating the Known Item Usually there's only one possible answer (the site's page):
- So recall is either 0 or 1
- Recall/precision graphs are not very interesting, though
Instead, measure the rank at which the site's URL was found:
- Sometimes scored as 1/rank ("reciprocal rank")
- When averaged over many queries, called "mean reciprocal rank" (see the sketch below)
Can also plot it as a graph.
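A minimal sketch of reciprocal rank and MRR for known-item search; the result lists and URLs below are hypothetical.

```python
# Sketch: each query has exactly one correct URL; the score is 1/rank of that URL
# in the result list (0 if it is not retrieved). MRR is the mean over queries.
def reciprocal_rank(ranking, correct_url):
    for i, url in enumerate(ranking, start=1):
        if url == correct_url:
            return 1.0 / i
    return 0.0

# Hypothetical results for three queries
rr = [reciprocal_rank(["u2", "u1", "u5"], "u1"),   # found at rank 2 -> 0.5
      reciprocal_rank(["u7", "u8", "u9"], "u7"),   # found at rank 1 -> 1.0
      reciprocal_rank(["u3", "u4"], "u6")]         # not found      -> 0.0
print(sum(rr) / len(rr))                           # MRR = 0.5
```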

64 Rank where site found The anchor method is vastly more likely to get the answer at the top rank:
- Anchor: MRR = .791
- Content: MRR = .321
Intuition:
- On average, the anchor method finds the site at rank 1/.791 = 1.25; the content method finds it at rank 3.1
Precision at rank 1:
- Anchor: 68%; content: 21%

65 University example

66 Evaluation: Significance Tests System A beats System B on one query...
- Just lucky? Maybe System B is better for some things
System A and B are identical on all but one query:
- If A beats B by enough on that one query, the average will make A look better than B
System A is 0.00001% better ... so what? Significance tests address these issues.

67 Significance Tests Are the observed differences statistically different? Can’t make assumptions about underlying distribution  Most significance tests do make such assumptions  See linguistic corpora for distributions, but … Sign test or Wilcoxon signed-ranks tests are typical  Don’t require normally distributed data; sign test answers “how often”; Wilcoxon answers “how much” But are these differences detectable by users?

68 Feedback Evaluation Will review relevance feedback later, too. How to treat documents that have been seen before? Rank freezing:  Ranks of relevant docs are fixed for subsequent iterations; compare ranking with original ranking; performance can’t get worse Residual collection:  All previously-seen docs (e.g., the top n) are removed from the collection; compare reranking with original (n+1…D) Both approaches are problematic:  Users probably want to see the good documents move to the top of the list

69 Feedback Eval: user perceptions Effectiveness measures give one view of a user's sense of quality, but we might also consider:
- Time to complete a retrieval task
- User 'satisfaction'
- How well users believe the system works
An "intelligent" IR system is one that doesn't look stupid to its users. User studies are difficult to design and expensive to conduct: subjects can't repeat queries; it's hard to isolate the effects of the search engine and the user interface; hard to control for individual performance differences. See the TREC "interactive" track.

70 Problems with evaluation Retrieval techniques are highly collection- and query-specific:
- Single techniques must be tested on multiple collections
- Comparisons of techniques must be made on the same collection
- Isolated tests aren't very useful; scalability matters too
Standard methods assume the user knows the right collection. It is usually impossible to control all variables. It is hard to separate the effects of the retrieval model and the interface when the model requires user interaction. Good test collections are hard and expensive to produce. Usually we can't do a cost-benefit analysis. We haven't satisfied the criteria specified by Swets and others in the 1960s ... in short, the question of how to evaluate retrieval effectiveness is wide open.

