Slide 1: Evaluation
INST 734, Module 5
Doug Oard
Slide 2: Agenda
- Evaluation fundamentals
- Test collections: evaluating sets
- Test collections: evaluating rankings
- Interleaving
- User studies
Slide 3: IR as an Empirical Discipline
- Formulate a research question (the hypothesis)
- Design an experiment to answer the question
- Perform the experiment
  – Compare with a baseline “control”
- Does the experiment answer the question?
  – Are the results significant, or is it just luck? (see the sketch below)
  – Are the results important, or imperceptible?
- Report the results
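To make the significance question concrete, here is a minimal sketch of a paired significance test over per-query scores against a baseline. The scores, the query set size, and the choice of a paired t-test are illustrative assumptions, not part of the slides; any per-query effectiveness measure (e.g., average precision) could stand in for the numbers below.

```python
# Minimal sketch: is the experimental system really better than the baseline,
# or is the difference just luck? Assumes one effectiveness score per query
# (e.g., average precision); the numbers below are invented for illustration.
from scipy import stats

baseline   = [0.21, 0.34, 0.12, 0.45, 0.28, 0.39, 0.17, 0.50]  # per-query scores
experiment = [0.25, 0.33, 0.19, 0.51, 0.30, 0.42, 0.22, 0.55]

# Paired test: the same queries are run on both systems, so pair the scores.
t_stat, p_value = stats.ttest_rel(experiment, baseline)
print(f"t = {t_stat:.3f}, p = {p_value:.3f}")
# A small p (conventionally < 0.05) suggests the improvement is unlikely to be
# luck; it says nothing about whether users would ever notice the difference.
```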
Slide 4: Types of Evaluation
- Intrinsic
  – Does it do what we want?
- Extrinsic
  – Does it do what we need?
- Formative
  – Provide a basis for system development
- Summative
  – Determine whether objectives were met
Slide 5: Experiment Design Examples
- Can morphology improve effectiveness?
  – Does stemming beat an unstemmed baseline? (a sketch follows this list)
- Does query expansion improve effectiveness?
  – Does synonym expansion beat an unexpanded baseline?
- Does highlighting help users evaluate utility?
  – Build two interfaces, one with highlighting, one without
  – Ask users which one they prefer and why
- Is letting users weight query terms a good idea?
  – Build two systems, one with weighting, one without
  – Measure which yields more relevant docs in 10 minutes
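The stemming experiment hinges on changing exactly one thing between conditions. A minimal sketch, assuming NLTK's Porter stemmer and a toy term-overlap score standing in for a real ranking function (both are illustrative choices, not from the slides):

```python
# Minimal sketch of the stemming experiment: two retrieval conditions that
# differ only in whether terms are stemmed. The toy scoring (shared-term
# count) stands in for a real retrieval model.
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

def tokenize(text, stem):
    tokens = text.lower().split()
    return [stemmer.stem(t) for t in tokens] if stem else tokens

def score(query, doc, stem):
    # Toy relevance score: number of terms the query and document share.
    return len(set(tokenize(query, stem)) & set(tokenize(doc, stem)))

doc = "evaluating retrieval systems with test collections"
query = "evaluation of retrieval system"
for stem in (False, True):
    print("stemmed" if stem else "unstemmed", score(query, doc, stem))
# Everything except the stemming step is held constant, so any difference
# in effectiveness can be attributed to morphology.
```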
Slide 6: Evaluation Criteria
- Effectiveness
  – System-only
  – Human + system
- Efficiency
  – Retrieval time, indexing time, index size, …
- Usability
  – Learnability, novice use, expert use, …
Slide 7: IR Effectiveness Evaluation
- User-centered strategy
  – Given several users, and at least 2 retrieval systems
  – Have each user try the same task on both systems
  – Measure which system works the “best”
- System-centered strategy
  – Given documents, queries, and relevance judgments
  – Try several variations on the retrieval system
  – Measure which ranks more good docs near the top (see the sketch below)
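The system-centered strategy boils down to scoring a ranked list against relevance judgments. A minimal sketch using precision at k, with binary judgments keyed by document ID (the rankings and judgments here are invented for illustration):

```python
# Minimal sketch of system-centered scoring: given a ranked list and binary
# relevance judgments, count how many good docs appear near the top.
# Document IDs and judgments are made up for illustration.

def precision_at_k(ranking, relevant, k):
    """Fraction of the top-k retrieved documents that are relevant."""
    return sum(1 for doc in ranking[:k] if doc in relevant) / k

relevant = {"d2", "d5", "d9"}                  # judged-relevant doc IDs
system_a = ["d2", "d7", "d5", "d1", "d9"]      # ranked output of system A
system_b = ["d7", "d1", "d2", "d8", "d3"]      # ranked output of system B

for name, ranking in [("A", system_a), ("B", system_b)]:
    print(f"System {name}: P@5 = {precision_at_k(ranking, relevant, 5):.2f}")
# System A puts more good docs near the top (P@5 = 0.60 vs 0.20),
# so it scores higher under this measure.
```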
Slide 8: Good Measures of Effectiveness
- Capture some aspect of what the user wants
- Have predictive value for other situations
  – Different queries, different document collection
- Easily replicated by other researchers
- Easily compared
  – Optimally, expressed as a single number (e.g., average precision; see below)
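One common way to collapse a whole ranking into a single number is average precision: the mean of the precision values at the rank of each relevant document. A minimal sketch, reusing the same style of invented binary judgments as above:

```python
# Minimal sketch of average precision (AP): a single number summarizing an
# entire ranking. Precision is computed at the rank of each relevant document
# retrieved, then averaged over all judged-relevant documents.

def average_precision(ranking, relevant):
    hits, precision_sum = 0, 0.0
    for rank, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            precision_sum += hits / rank   # precision at this rank
    return precision_sum / len(relevant) if relevant else 0.0

relevant = {"d2", "d5", "d9"}
ranking = ["d2", "d7", "d5", "d1", "d9"]
print(f"AP = {average_precision(ranking, relevant):.3f}")
# AP = (1/1 + 2/3 + 3/5) / 3 ≈ 0.756. Averaging AP across queries gives MAP,
# a single number that makes systems easy to compare and results easy to
# replicate.
```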
Slide 9: Agenda
- Evaluation fundamentals
- Test collections: evaluating sets
- Test collections: evaluating rankings
- Interleaving
- User studies