
1 2004.09.30 SLIDE 1 IS 202 – FALL 2004 Lecture 10: IR Evaluation Workshop
Prof. Ray Larson & Prof. Marc Davis, UC Berkeley SIMS
Tuesday and Thursday 10:30 am - 12:00 pm, Fall 2004
http://www.sims.berkeley.edu/academics/courses/is202/f04/
SIMS 202: Information Organization and Retrieval

2 2004.09.30 SLIDE 2 IS 202 – FALL 2004 Lecture Overview
Review
–Evaluation of IR systems
–Precision vs. Recall
–Cutoff Points and other measures
–Test Collections/TREC
–Blair & Maron Study
Discussion
Credit for some of the slides in this lecture goes to Marti Hearst and Warren Sack

3 2004.09.30 SLIDE 3 IS 202 – FALL 2004 What to Evaluate?
What can be measured that reflects users’ ability to use the system? (Cleverdon 66)
–Coverage of information
–Form of presentation
–Effort required/ease of use
–Time and space efficiency
–Effectiveness:
Recall: proportion of relevant material actually retrieved
Precision: proportion of retrieved material actually relevant
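In set notation (standard definitions consistent with the bullets above, not copied from the slide), with Rel the set of relevant documents and Ret the set of retrieved documents:

\mathrm{Recall} = \frac{|Rel \cap Ret|}{|Rel|} \qquad \mathrm{Precision} = \frac{|Rel \cap Ret|}{|Ret|}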

4 2004.09.30 SLIDE 4 IS 202 – FALL 2004 Relevant vs. Retrieved (Venn diagram of the Retrieved and Relevant sets within All Docs)

5 2004.09.30 SLIDE 5 IS 202 – FALL 2004 Precision vs. Recall (Venn diagram of the Retrieved and Relevant sets within All Docs)

6 2004.09.30 SLIDE 6 IS 202 – FALL 2004 Retrieved vs. Relevant Documents: very high precision, very low recall (diagram)

7 2004.09.30 SLIDE 7 IS 202 – FALL 2004 Retrieved vs. Relevant Documents: very low precision, very low recall (0 in fact) (diagram)

8 2004.09.30 SLIDE 8 IS 202 – FALL 2004 Retrieved vs. Relevant Documents: high recall, but low precision (diagram)

9 2004.09.30 SLIDE 9 IS 202 – FALL 2004 Retrieved vs. Relevant Documents: high precision, high recall (at last!) (diagram)

10 2004.09.30 SLIDE 10 IS 202 – FALL 2004 Precision/Recall Curves
There is a well-known tradeoff between Precision and Recall, so we typically measure Precision at different (fixed) levels of Recall.
Note: this is an AVERAGE over MANY queries.
(Plot: precision on the y axis against recall on the x axis.)
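The averaging the note refers to can be written as follows (my notation, a standard formulation rather than anything shown on the slide): for a set of queries Q and a fixed recall level r,

\bar{P}(r) = \frac{1}{|Q|} \sum_{q \in Q} P_q(r)

where P_q(r) is the precision observed for query q at recall level r; the curve plots \bar{P}(r) against r.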

11 2004.09.30 SLIDE 11 IS 202 – FALL 2004

12 2004.09.30 SLIDE 12 IS 202 – FALL 2004 Sample TREC Query (Topic)
Number: 168
Topic: Financing AMTRAK
Description: A document will address the role of the Federal Government in financing the operation of the National Railroad Transportation Corporation (AMTRAK)
Narrative: A relevant document must provide information on the government’s responsibility to make AMTRAK an economically viable entity. It could also discuss the privatization of AMTRAK as an alternative to continuing government subsidies. Documents comparing government subsidies given to air and bus transportation with those provided to AMTRAK would also be relevant.

13 2004.09.30 SLIDE 13 IS 202 – FALL 2004

14 2004.09.30 SLIDE 14 IS 202 – FALL 2004

15 2004.09.30 SLIDE 15 IS 202 – FALL 2004

16 2004.09.30 SLIDE 16 IS 202 – FALL 2004

17 2004.09.30 SLIDE 17 IS 202 – FALL 2004

18 2004.09.30 SLIDE 18 IS 202 – FALL 2004 Other Test Forums/Collections
CLEF (Cross-Language Evaluation Forum)
–Collections in English, French, German, Spanish, and Italian, with new languages (Russian, Finnish, etc.) being added. Primarily European.
NTCIR (NII-NACSIS Test Collection for IR Systems)
–Primarily Japanese, Chinese, and Korean, with partial English.
INEX (Initiative for the Evaluation of XML Retrieval)
–Main track uses about 525 MB of XML data from IEEE. Combines structure and content.

19 2004.09.30 SLIDE 19 IS 202 – FALL 2004 Blair and Maron 1985
A classic study of retrieval effectiveness
–Earlier studies used unrealistically small collections
Studied an archive of documents for a lawsuit
–~350,000 pages of text
–40 queries
–Focus on high recall
–Used IBM’s STAIRS full-text system
Main result:
–The system retrieved less than 20% of the relevant documents for a particular information need
–The lawyers thought they had retrieved 75%
–But many queries had very high precision

20 2004.09.30 SLIDE 20 IS 202 – FALL 2004 Lecture Overview
Review
–Evaluation of IR systems
–Precision vs. Recall
–Cutoff Points and other measures
–Test Collections/TREC
–Blair & Maron Study
Discussion
Evaluation Exercise
Credit for some of the slides in this lecture goes to Marti Hearst and Warren Sack

21 2004.09.30 SLIDE 21 IS 202 – FALL 2004 An Evaluation of Retrieval Effectiveness (Blair & Maron)
Questions from Shufei Lei
Blair and Maron concluded that a full-text retrieval system such as IBM’s STAIRS was ineffective because Recall was very low (average 20%) when searching a large database of documents (about 40,000 documents). However, the lawyers who were asked to perform this test were quite satisfied with the results of their search. Think about how you search the web today. How do you evaluate the effectiveness of a full-text retrieval system (user satisfaction or Recall rate)?
The design of a full-text retrieval system is based on the assumption that it is a simple matter for users to foresee the exact words and phrases that will be used in the documents they will find useful, and only in those documents. The authors pointed out some factors that invalidate this assumption: misspellings, using different terms to refer to the same event, synonyms, etc. What can we do to help overcome these problems?

22 2004.09.30 SLIDE 22 IS 202 – FALL 2004 Rave Reviews (Belew)
Questions from Scott Fisher
What are the drawbacks of using an "expert" to evaluate documents in a collection for relevance?
RAVEUnion follows the pooling procedure used by many evaluators. What is a weakness of this procedure? How do the RAVE researchers try to overcome this weakness?

23 2004.09.30 SLIDE 23 IS 202 – FALL 2004 A Case for Interaction (Koenemann & Belkin)
Questions from Lulu Guo
It is reported that people thought that using the feedback component as a suggestion device made them "lazy," since the task of generating terms was replaced by selecting terms. Is there any potential problem with this "laziness"?
In evaluating the effectiveness of the second search task, the authors reported median precision (M) instead of mean (X-bar) precision. What is the difference between the two, and which do you think is more appropriate?

24 2004.09.30 SLIDE 24 IS 202 – FALL 2004 Work Tasks and Socio-Cognitive Relevance (Hjørland & Christensen)
Questions from Kelly Snow
Schizophrenia research has a number of different theories (psychosocial, biochemical) leading to different courses of treatment. According to the reading, finding a 'focus' is crucial for the search process. When prevailing consensus has not been reached, how might a Google-like page-rank approach be a benefit? How might it pose problems?
The article discusses relevance ranking by the user as a subjective measure. Relevance ranking can be a reflection of a user's uncertainty about an item's relevance. It can also reflect relevance to a specific situation at a certain time: a document might be relevant for discussion with a colleague but not for clinical treatment. Does this insight change the way you've been thinking about relevance as discussed in the course so far?

25 2004.09.30 SLIDE 25 IS 202 – FALL 2004 Social Information Filtering (Shardanand & Maes)
Questions from Judd Antin
Would carelessly rating albums or artists 'break' Ringo? Why or why not? How would you break Ringo if you wanted to?
Is the accuracy or precision of predicted target values a good measure of system performance? What good is a social filtering system if it never provides information which leads to new or different behavior? How do we measure performance in a practical sense?
One important criticism of Social Information Filtering is that it does not situate information in its sociocultural context: liking or disliking a piece of music is an evolving relationship between the music and the listening environment. In this view, Social Information Filtering fails because a quantitative, statistical measure of preference is not enough to account for the reality of any individual user's preference. How might a system account for this failing? Would it be enough to include additional metadata such as 'Mood,' 'Genre,' 'First Impression,' etc.?

26 2004.09.30 SLIDE 26 IS 202 – FALL 2004 Evaluation Exercise
Evaluating three systems, “Bear”, “Cardinal”, and “Wolf”, using the following relevance information and rankings for three queries…
Ranking and relevance data available in an Excel spreadsheet at:
–HTTP://sims.berkeley.edu/courses/is202/f04/eval-data_blank.xls

27 2004.09.30 SLIDE 27 IS 202 – FALL 2004 What do we need to know and have?
What are the four most important things required for an IR evaluation?
1. A collection of data to test with…
2. Some IR systems that produce ranked output in response to queries…
3. Some queries to test with…
4. RELEVANCE JUDGEMENTS on the queries!

28 2004.09.30 SLIDE 28 IS 202 – FALL 2004 Once the systems have run the queries?
We have a ranked list of document IDs for each query and each system in the evaluation.
Someone (ideally the query’s author) evaluates the results and assigns relevance judgements.
–In TREC and most other evaluations these judgements are binary (relevant or not relevant)
–INEX uses two scales
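As a concrete illustration of these two inputs, here is a minimal sketch in Python. The system names come from the exercise, but the query numbers, document IDs, and relevance judgements below are invented placeholders, not the values in the exercise spreadsheet.

# Ranked output: system -> query -> list of document IDs, best first.
# (Only query 0 is shown; the real exercise has three queries per system.)
runs = {
    "Bear":     {0: ["d03", "d11", "d01", "d14", "d07", "d09", "d05", "d12", "d02", "d08"]},
    "Cardinal": {0: ["d11", "d03", "d14", "d01", "d12", "d07", "d02", "d05", "d08", "d09"]},
    "Wolf":     {0: ["d01", "d03", "d05", "d07", "d09", "d11", "d12", "d14", "d02", "d08"]},
}

# Binary relevance judgements: query -> set of relevant document IDs.
qrels = {0: {"d01", "d03", "d05", "d07", "d09"}}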

29 2004.09.30 SLIDE 29 IS 202 – FALL 2004 Relevance Information

30 2004.09.30 SLIDE 30 IS 202 – FALL 2004 Query 0 for all Three Systems
Using the relevance info for query 0 we can fill in the number of relevant documents “seen” at each rank…
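A minimal sketch of that fill-in step, reusing the illustrative runs and qrels from the earlier sketch (not the actual spreadsheet data): walk down a ranked list and keep a running count of the relevant documents seen so far.

def relevant_seen_at_each_rank(ranked_docs, relevant_set):
    """Entry k-1 of the returned list is the number of relevant documents
    seen in the top k positions of the ranked list."""
    counts, seen = [], 0
    for doc_id in ranked_docs:
        if doc_id in relevant_set:
            seen += 1
        counts.append(seen)
    return counts

# Example with the toy data:
# relevant_seen_at_each_rank(runs["Bear"][0], qrels[0])
# -> [1, 1, 2, 2, 3, 4, 5, 5, 5, 5]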

31 2004.09.30 SLIDE 31 IS 202 – FALL 2004 Query 1 for all three systems

32 2004.09.30 SLIDE 32 IS 202 – FALL 2004 Query 2 for all three systems

33 2004.09.30 SLIDE 33 IS 202 – FALL 2004 Once we have filled in the tables…
We have the number of RELEVANT documents at each ranking level

34 2004.09.30 SLIDE 34 IS 202 – FALL 2004 Query 0 for all Three Systems

35 2004.09.30 SLIDE 35 IS 202 – FALL 2004 Query 1 for all three systems

36 2004.09.30 SLIDE 36 IS 202 – FALL 2004 Query 2 for all three systems

37 2004.09.30 SLIDE 37 IS 202 – FALL 2004 Now for Recall and Precision…
We now have, for each query and system, the number of relevant documents seen (retrieved) at each rank…
With that information, and the TOTAL NUMBER of relevant documents for each query, we can calculate, at each rank:
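The formulas themselves do not appear in the transcript (they were presumably shown graphically on the slide); a standard reconstruction consistent with the quantities just described is, at rank k:

\mathrm{Recall@}k = \frac{r_k}{R} \qquad \mathrm{Precision@}k = \frac{r_k}{k}

where r_k is the number of relevant documents seen in the top k results and R is the total number of relevant documents for the query.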

38 2004.09.30 SLIDE 38 IS 202 – FALL 2004 For our purposes Precision and Recall become…
What is the total Recall for each query and system?
Why do we get that Recall value?

39 2004.09.30 SLIDE 39 IS 202 – FALL 2004 By systems… for Bear…

40 2004.09.30 SLIDE 40 IS 202 – FALL 2004 For Cardinal…

41 2004.09.30 SLIDE 41 IS 202 – FALL 2004 For Wolf…

42 2004.09.30 SLIDE 42 IS 202 – FALL 2004 The answers…
Apply the formulas we have seen FOR EACH RANK…
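A sketch of applying those formulas rank by rank, again using the invented toy data rather than the real spreadsheet answers (the function itself only assumes a ranked list and a set of relevant document IDs):

def precision_recall_by_rank(ranked_docs, relevant_set):
    """Return a list of (rank, precision, recall) tuples for one ranked list."""
    total_relevant = len(relevant_set)
    table, seen = [], 0
    for rank, doc_id in enumerate(ranked_docs, start=1):
        if doc_id in relevant_set:
            seen += 1
        precision = seen / rank
        recall = seen / total_relevant if total_relevant else 0.0
        table.append((rank, precision, recall))
    return table

# Example with the toy data (values rounded):
# precision_recall_by_rank(runs["Bear"][0], qrels[0])
# -> [(1, 1.00, 0.2), (2, 0.50, 0.2), (3, 0.67, 0.4), (4, 0.50, 0.4), (5, 0.60, 0.6),
#     (6, 0.67, 0.8), (7, 0.71, 1.0), (8, 0.63, 1.0), (9, 0.56, 1.0), (10, 0.50, 1.0)]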

43 2004.09.30 SLIDE 43 IS 202 – FALL 2004 By systems… for Bear…

44 2004.09.30 SLIDE 44 IS 202 – FALL 2004 For Cardinal…

45 2004.09.30 SLIDE 45 IS 202 – FALL 2004 For Wolf…

46 2004.09.30 SLIDE 46 IS 202 – FALL 2004 For our final table…
What is the AVERAGE precision at the “fixed” recall levels: 20% (0.20), 40% (0.40), 60% (0.60), 80% (0.80), and 100% (1.00)?
How? For each query, find the HIGHEST RANKED precision value where the matching Recall value is equal to the desired level, and take the average of those precision values across all of the queries…
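A sketch of that averaging step, following the rule stated on the slide: take the precision at the earliest (highest-ranked) position whose recall exactly equals the target level, then average across queries. This is the exercise's exact-match rule, not the interpolated averaging used in some other evaluations. The helper precision_recall_by_rank and the toy runs/qrels come from the earlier sketches.

FIXED_LEVELS = (0.20, 0.40, 0.60, 0.80, 1.00)

def precision_at_recall_level(pr_table, level, tol=1e-9):
    """Precision at the earliest rank whose recall equals `level` (None if never reached)."""
    for rank, precision, recall in pr_table:
        if abs(recall - level) < tol:
            return precision
    return None

def average_precision_at_levels(system_runs, qrels, levels=FIXED_LEVELS):
    """Average, over all queries, of the precision at each fixed recall level."""
    averages = {}
    for level in levels:
        values = []
        for query, ranked_docs in system_runs.items():
            table = precision_recall_by_rank(ranked_docs, qrels[query])
            p = precision_at_recall_level(table, level)
            if p is not None:
                values.append(p)
        averages[level] = sum(values) / len(values) if values else None
    return averages

# Example: average_precision_at_levels(runs["Bear"], qrels)
# -> {0.2: 1.0, 0.4: 0.67, 0.6: 0.6, 0.8: 0.67, 1.0: 0.71} (rounded, one toy query)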

47 2004.09.30 SLIDE 47 IS 202 – FALL 2004 Averages for all systems…

48 2004.09.30 SLIDE 48 IS 202 – FALL 2004 Averages for all systems…

49 2004.09.30 SLIDE 49 IS 202 – FALL 2004 Graphing the results
Put precision on the y axis and recall on the x axis for each of the points in the previous table…
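A plotting sketch for that step (assumes matplotlib is available; the precision values below are invented placeholders standing in for the averages computed above, not the exercise's actual results):

import matplotlib.pyplot as plt

recall_levels = [0.2, 0.4, 0.6, 0.8, 1.0]

# Placeholder averaged-precision values, one curve per system.
avg_precision = {
    "Bear":     [1.00, 0.67, 0.60, 0.67, 0.71],
    "Cardinal": [0.80, 0.70, 0.55, 0.45, 0.40],
    "Wolf":     [1.00, 1.00, 1.00, 0.90, 0.80],
}

for system, precisions in avg_precision.items():
    plt.plot(recall_levels, precisions, marker="o", label=system)

plt.xlabel("Recall")     # recall on the x axis
plt.ylabel("Precision")  # precision on the y axis
plt.title("Average precision at fixed recall levels")
plt.legend()
plt.show()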

50 2004.09.30 SLIDE 50 IS 202 – FALL 2004 Graphing the results

51 2004.09.30 SLIDE 51 IS 202 – FALL 2004 Cutoff Levels for all systems…
Note: we are using DOCUMENT cutoffs instead of the REL DOC cutoffs which are often used
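For a document cutoff, precision is computed after examining a fixed number of top-ranked documents. A sketch (the cutoff values 5 and 10 are illustrative choices, not necessarily the ones used in the exercise tables):

def precision_at_document_cutoffs(ranked_docs, relevant_set, cutoffs=(5, 10)):
    """Precision after examining the top n documents, for each document cutoff n."""
    results = {}
    for n in cutoffs:
        hits = sum(1 for doc_id in ranked_docs[:n] if doc_id in relevant_set)
        results[n] = hits / n
    return results

# Example with the toy data:
# precision_at_document_cutoffs(runs["Bear"][0], qrels[0]) -> {5: 0.6, 10: 0.5}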

52 2004.09.30 SLIDE 52 IS 202 – FALL 2004

53 2004.09.30 SLIDE 53 IS 202 – FALL 2004 Next Time
Databases and Database Design
Readings

