Slide 1: Evaluation
INST 734, Module 5
Doug Oard
Slide 2: Agenda
- Evaluation fundamentals
- Test collections: evaluating sets
- Test collections: evaluating rankings
- Interleaving
- User studies
Slide 3: IR as an Empirical Discipline
- Formulate a research question (the hypothesis)
- Design an experiment to answer the question
- Perform the experiment
  – Compare with a baseline “control”
- Does the experiment answer the question?
  – Are the results significant, or is it just luck? (see the sketch below)
  – Are the results important, or imperceptible?
- Report the results
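To make the significance question concrete, here is a minimal sketch of a paired significance test over per-query scores against a baseline. The scores, the query set size, and the choice of a paired t-test are illustrative assumptions, not part of the slides; any per-query effectiveness measure (e.g., average precision) could stand in for the numbers below.

```python
# Minimal sketch: is the experimental system really better than the baseline,
# or is the difference just luck? Assumes one effectiveness score per query
# (e.g., average precision); the numbers below are invented for illustration.
from scipy import stats

baseline   = [0.21, 0.34, 0.12, 0.45, 0.28, 0.39, 0.17, 0.50]  # per-query scores
experiment = [0.25, 0.33, 0.19, 0.51, 0.30, 0.42, 0.22, 0.55]

# Paired test: the same queries are run on both systems, so pair the scores.
t_stat, p_value = stats.ttest_rel(experiment, baseline)
print(f"t = {t_stat:.3f}, p = {p_value:.3f}")
# A small p (conventionally < 0.05) suggests the improvement is unlikely to be
# luck; it says nothing about whether users would ever notice the difference.
```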
Slide 4: Types of Evaluation
- Intrinsic
  – Does it do what we want?
- Extrinsic
  – Does it do what we need?
- Formative
  – Provide a basis for system development
- Summative
  – Determine whether objectives were met
Slide 5: Experiment Design Examples
- Can morphology improve effectiveness?
  – Does stemming beat an unstemmed baseline? (a sketch follows this list)
- Does query expansion improve effectiveness?
  – Does synonym expansion beat an unexpanded baseline?
- Does highlighting help users evaluate utility?
  – Build two interfaces, one with highlighting, one without
  – Ask users which one they prefer and why
- Is letting users weight query terms a good idea?
  – Build two systems, one with weighting, one without
  – Measure which yields more relevant docs in 10 minutes
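The stemming experiment hinges on changing exactly one thing between conditions. A minimal sketch, assuming NLTK's Porter stemmer and a toy term-overlap score standing in for a real ranking function (both are illustrative choices, not from the slides):

```python
# Minimal sketch of the stemming experiment: two retrieval conditions that
# differ only in whether terms are stemmed. The toy scoring (shared-term
# count) stands in for a real retrieval model.
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

def tokenize(text, stem):
    tokens = text.lower().split()
    return [stemmer.stem(t) for t in tokens] if stem else tokens

def score(query, doc, stem):
    # Toy relevance score: number of terms the query and document share.
    return len(set(tokenize(query, stem)) & set(tokenize(doc, stem)))

doc = "evaluating retrieval systems with test collections"
query = "evaluation of retrieval system"
for stem in (False, True):
    print("stemmed" if stem else "unstemmed", score(query, doc, stem))
# Everything except the stemming step is held constant, so any difference
# in effectiveness can be attributed to morphology.
```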
Slide 6: Evaluation Criteria
- Effectiveness
  – System-only
  – Human + system
- Efficiency
  – Retrieval time, indexing time, index size, …
- Usability
  – Learnability, novice use, expert use, …
Slide 7: IR Effectiveness Evaluation
- User-centered strategy
  – Given several users, and at least 2 retrieval systems
  – Have each user try the same task on both systems
  – Measure which system works the “best”
- System-centered strategy
  – Given documents, queries, and relevance judgments
  – Try several variations on the retrieval system
  – Measure which ranks more good docs near the top (see the sketch below)
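The system-centered strategy boils down to scoring a ranked list against relevance judgments. A minimal sketch using precision at k, with binary judgments keyed by document ID (the rankings and judgments here are invented for illustration):

```python
# Minimal sketch of system-centered scoring: given a ranked list and binary
# relevance judgments, count how many good docs appear near the top.
# Document IDs and judgments are made up for illustration.

def precision_at_k(ranking, relevant, k):
    """Fraction of the top-k retrieved documents that are relevant."""
    return sum(1 for doc in ranking[:k] if doc in relevant) / k

relevant = {"d2", "d5", "d9"}                  # judged-relevant doc IDs
system_a = ["d2", "d7", "d5", "d1", "d9"]      # ranked output of system A
system_b = ["d7", "d1", "d2", "d8", "d3"]      # ranked output of system B

for name, ranking in [("A", system_a), ("B", system_b)]:
    print(f"System {name}: P@5 = {precision_at_k(ranking, relevant, 5):.2f}")
# System A puts more good docs near the top (P@5 = 0.60 vs 0.20),
# so it scores higher under this measure.
```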
Slide 8: Good Measures of Effectiveness
- Capture some aspect of what the user wants
- Have predictive value for other situations
  – Different queries, different document collection
- Easily replicated by other researchers
- Easily compared
  – Optimally, expressed as a single number (e.g., average precision; see below)
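One common way to collapse a whole ranking into a single number is average precision: the mean of the precision values at the rank of each relevant document. A minimal sketch, reusing the same style of invented binary judgments as above:

```python
# Minimal sketch of average precision (AP): a single number summarizing an
# entire ranking. Precision is computed at the rank of each relevant document
# retrieved, then averaged over all judged-relevant documents.

def average_precision(ranking, relevant):
    hits, precision_sum = 0, 0.0
    for rank, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            precision_sum += hits / rank   # precision at this rank
    return precision_sum / len(relevant) if relevant else 0.0

relevant = {"d2", "d5", "d9"}
ranking = ["d2", "d7", "d5", "d1", "d9"]
print(f"AP = {average_precision(ranking, relevant):.3f}")
# AP = (1/1 + 2/3 + 3/5) / 3 ≈ 0.756. Averaging AP across queries gives MAP,
# a single number that makes systems easy to compare and results easy to
# replicate.
```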
Slide 9: Agenda
- Evaluation fundamentals
- Test collections: evaluating sets
- Test collections: evaluating rankings
- Interleaving
- User studies