CSA4080: Adaptive Hypertext Systems II
Topic 8: Evaluation Methods
Dr. Christopher Staff, Department of Computer Science & AI, University of Malta

Aims and Objectives
Background to evaluation methods in user-adaptive systems
Brief overviews of the evaluation of IR, QA, User Modelling, Recommender Systems, Intelligent Tutoring Systems, and Adaptive Hypertext Systems

Background to Evaluation Methods
Systems need to be evaluated to demonstrate (prove) that the hypothesis on which they are based is correct
In IR, we need to know that the system is retrieving all and only the relevant documents for the given query

Background to Evaluation Methods
In QA, we need to know the correct answers to questions, and measure performance against them
In User Modelling, we need to determine that the model is an accurate reflection of the information needed to adapt to the user
In Recommender Systems, we need to associate user preferences either with other similar users, or with product features

Background to Evaluation Methods
In Intelligent Tutoring Systems, we need to know that learning through an ITS is beneficial, or at least not (too) harmful
In Adaptive Hypertext Systems, we need to measure the system's ability to automatically represent user interests, to direct the user to relevant information, and to present that information in the best way

Measuring Performance
Information Retrieval:
–Recall and Precision (overall, and also at top-n)
Question Answering:
–Mean Reciprocal Rank

Measuring Performance
User Modelling:
–Precision and Recall: if the user is given all and only relevant information, or if the system behaves exactly as the user needs, then the model is probably correct
–Accuracy and predicted probability: to predict a user's actions, location, or goals
–Utility: the benefit derived from using the system

Measuring Performance
Recommender Systems:
–Content-based recommenders may be evaluated using precision and recall
–Collaborative filtering is harder to evaluate, because it depends on the other users the system knows about
 Quality of individual item predictions
 Precision and Recall at top-n

Measuring Performance
Intelligent Tutoring Systems:
–Ideally, being able to show that students can learn more efficiently using the ITS than without it
–Usually, showing that no harm is done
 Then "releasing the tutor" and enabling self-paced learning becomes a huge advantage
–Difficult to evaluate
 Cannot compare the same student with and without the ITS
 Students who volunteer are usually very motivated

Measuring Performance
Adaptive Hypertext Systems:
–Can mix UM, IR, and (content-based) RS methods of evaluation
–Use an empirical approach
 Different sets of users solve the same task, one group with adaptivity, the other without
 How should participants be chosen?

Evaluation Methods: IR
IR system performance is normally measured using precision and recall
–Precision: the percentage of retrieved documents that are relevant
–Recall: the percentage of relevant documents that are retrieved
Who decides which documents are relevant?
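
A minimal sketch of these two set-based measures, with illustrative function and variable names (not from the lecture):

def precision_recall(retrieved, relevant):
    """Set-based precision and recall for a single query."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)  # relevant documents that were retrieved
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# 3 of the 4 retrieved documents are relevant (P = 0.75),
# but only 3 of the 6 relevant documents were found (R = 0.5)
print(precision_recall(["d1", "d2", "d3", "d9"], ["d1", "d2", "d3", "d4", "d5", "d6"]))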

Evaluation Methods: IR
Query Relevance Judgements
–For each test query, the document collection is divided into two sets: relevant and non-relevant
–Systems are compared using precision and recall
–In early collections, humans would classify the documents (p3-cleverdon.pdf)
 Cranfield collection: 1400 documents / 221 queries
 CACM: 3204 documents / 50 queries

Evaluation Methods: IR
Do humans always agree on relevance judgements?
–No: judgements can vary considerably (mizzaro96relevance.pdf)
–So only documents on which there is full agreement are used

Evaluation Methods: IR
Text REtrieval Conference (TREC)
–Runs competitions every year
–QRels and document collections are made available for a number of tracks (e.g., ad hoc, routing, question answering, cross-language, interactive, Web, terabyte, ...)

Evaluation Methods: IR
What happens when the collection grows?
–E.g., the Web track has 1GB of data! A terabyte track is in the pipeline
–Pooling
 Give the different systems the same document collection to index and the same queries
 Take the top-n retrieved documents from each system
 Either documents that are present in all retrieved sets are treated as relevant and the others as not, OR assessors judge the relevance of the unique documents in the pool
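
A minimal sketch of the pooling step, assuming each system's run is simply an ordered list of document ids (the names are illustrative):

def build_pool(runs, n=100):
    """Union of the top-n documents from each system's ranked run.
    Assessors judge only the pool; anything outside it is treated as non-relevant."""
    pool = set()
    for ranked_docs in runs:
        pool.update(ranked_docs[:n])
    return pool

runs = [
    ["d3", "d7", "d1", "d9"],  # system A's ranking
    ["d7", "d2", "d3", "d8"],  # system B's ranking
]
print(build_pool(runs, n=3))   # the documents assessors would be asked to judge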

Evaluation Methods: IR
Advantages:
–Possible to compare system performance
–Relatively cheap
 QRels and document collections can be purchased for a moderate price, rather than organising expensive user trials
–Can use standard IR systems (e.g., SMART) and build another layer on top, or build a new IR model
–Automatic and repeatable

Evaluation Methods: IR
Common criticisms:
–Judgements are subjective
 The same assessor may make different judgements at different times!
 This doesn't affect the ranking of systems
–Judgements are binary
–Some relevant documents are missed by pooling (QRels are incomplete)
 This doesn't affect relative system performance

Evaluation Methods: IR
Common criticisms (contd.):
–Queries are too long
 Queries under test conditions can have several hundred terms
 The average Web query is 2.35 terms long (p5-jansen.pdf)

Evaluation Methods: IR
In massive document collections there may be hundreds, thousands, or even millions of relevant documents
Must all of them be retrieved?
Measure precision at top-5, 10, 20, 50, 100, 500, and take a weighted average over the results (Mean Average Precision)
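
A minimal sketch of precision at a cutoff, and of average precision in its now-standard form (precision averaged at the rank of each relevant document, then averaged over queries to give MAP); names are illustrative:

def precision_at_k(ranking, relevant, k):
    """Fraction of the top-k retrieved documents that are relevant."""
    return sum(1 for doc in ranking[:k] if doc in relevant) / k

def average_precision(ranking, relevant):
    """Average of precision@rank taken at each relevant document's rank.
    The mean of this value over all test queries is MAP."""
    hits, total = 0, 0.0
    for rank, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0

ranking = ["d4", "d1", "d8", "d2", "d9"]
relevant = {"d1", "d2", "d3"}
print(precision_at_k(ranking, relevant, 5))  # 0.4
print(average_precision(ranking, relevant))  # (1/2 + 2/4) / 3 ≈ 0.33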

The E-Measure
Combines precision and recall into one number:
–E = 1 - ((1 + b^2) * P * R) / (b^2 * P + R)
–P = precision, R = recall
–b = a measure of the relative importance of P or R
 E.g., b = 0.5 means the user is twice as interested in precision as in recall
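
A minimal sketch of the E-measure as reconstructed above (the closely related F-measure is simply 1 - E); this assumes the b-weighted van Rijsbergen form is the one intended:

def e_measure(precision, recall, b=1.0):
    """van Rijsbergen's E: 0 is best, 1 is worst.
    b < 1 weights precision more heavily; b > 1 weights recall."""
    if precision == 0 or recall == 0:
        return 1.0
    f = (1 + b**2) * precision * recall / (b**2 * precision + recall)
    return 1 - f

# b = 0.5: precision counts for twice as much as recall
print(e_measure(0.75, 0.5, b=0.5))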

Evaluation Methods: QA
The aim in Question Answering is not to ensure that the overwhelming majority of relevant documents are retrieved, but to return an accurate answer
Precision and recall are not suitable measures for this
The usual measure is Mean Reciprocal Rank (MRR)

Evaluation Methods: QA
MRR averages the reciprocal rank of the first correct answer over all queries (1/rank, or 0 if no correct answer appears in the top-5)
Ideally, the first correct answer is at rank 1
qa_report.pdf
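
A minimal sketch of MRR, assuming each system response is a ranked list of candidate answers and the correct answer to each question is known (names are illustrative):

def mean_reciprocal_rank(ranked_answers, correct, cutoff=5):
    """Mean of 1/rank of the first correct answer per question;
    a question contributes 0 if its correct answer is not within the cutoff."""
    total = 0.0
    for answers, truth in zip(ranked_answers, correct):
        for rank, answer in enumerate(answers[:cutoff], start=1):
            if answer == truth:
                total += 1.0 / rank
                break
    return total / len(correct)

runs = [["Paris", "Lyon"], ["1492", "1066", "1215"], ["blue"]]
truths = ["Paris", "1215", "red"]
print(mean_reciprocal_rank(runs, truths))  # (1 + 1/3 + 0) / 3 ≈ 0.44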

Evaluation Methods: UM
Information Retrieval evaluation has matured to the extent that it is very unusual to find an academic publication that does not follow a standard approach to evaluation
On the other hand, up to 2001 only one-third of the user models presented in UMUAI had been evaluated, and most of those were ITS-related (see later)
p181-chin.pdf

Evaluation Methods: UM
Unlike IR systems, user models are difficult to evaluate automatically
–Unless they are stereotype/coarse-grained classification systems
So they tend to need to be evaluated empirically
–User studies
–We want to measure how well participants do with and without a UM supporting their task
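
A minimal sketch of such a with/without comparison, assuming task-completion times were recorded for a group using the UM-supported system and a control group without it; an independent-samples t-test (here via scipy) is one reasonable choice of analysis:

from scipy import stats

# Hypothetical task-completion times in seconds
with_um = [212, 185, 240, 198, 176, 220, 205, 190]
without_um = [260, 231, 275, 244, 289, 238, 252, 266]

# Two-sided independent-samples t-test: is the difference in means significant?
t_stat, p_value = stats.ttest_ind(with_um, without_um)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")  # p < 0.05 suggests a real difference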

Evaluation Methods: UM
Difficulties/problems include:
–Ensuring a large enough number of participants to make the results statistically meaningful
–Catering for participants improving over successive rounds
–Failure to use a control group
–Ensuring that nothing happens to modify a participant's behaviour (e.g., thinking aloud)

Evaluation Methods: UM
Difficulties/problems (contd.):
–Biasing the results
–Not using blind/double-blind testing when needed
–...

Evaluation Methods: UM
Proposed reporting standards:
–Number, source, and relevant background of participants
–Independent, dependent, and covariant variables
–Analysis method
–Post-hoc probabilities
–Raw data (in the paper, or on-line via the WWW)
–Effect size and power (at least 0.8)
p181-chin.pdf
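
A minimal sketch of the kind of power calculation these reporting standards imply, asking how many participants per group are needed to detect a medium-sized effect at power 0.8 (the effect size of 0.5 is an illustrative assumption; statsmodels provides the calculation):

from statsmodels.stats.power import TTestIndPower

# Cohen's d = 0.5 (medium effect), alpha = 0.05, target power = 0.8
n_per_group = TTestIndPower().solve_power(effect_size=0.5, alpha=0.05, power=0.8)
print(f"Participants needed per group: {n_per_group:.0f}")  # roughly 64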

Evaluation Methods: RS
Recommender Systems
Two types of recommender system:
–Content-based
–Collaborative
Both (tend to) use the VSM to plot users/product features into an n-dimensional space
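
A minimal sketch of the vector-space idea: users and items live in the same feature space and are matched by cosine similarity (the feature values are invented for illustration):

import math

def cosine(u, v):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

# Hypothetical weights over three product features (e.g., genre dimensions)
user_profile = [0.9, 0.1, 0.4]
items = {"item_a": [0.8, 0.0, 0.5], "item_b": [0.1, 0.9, 0.2]}

# Recommend the items most similar to the user's profile
for name in sorted(items, key=lambda i: cosine(user_profile, items[i]), reverse=True):
    print(name, round(cosine(user_profile, items[name]), 3))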

Evaluation Methods: RS
If we know the "correct" recommendations to make to a user with a specific profile, then we can use Precision, Recall, E-measure, F-measure, Mean Average Precision, MRR, etc.

Evaluation Methods: ITS
Intelligent Tutoring Systems
Evaluation aims to demonstrate that learning through an ITS is at least as effective as traditional learning
–Cost benefit of freeing up the tutor, and of permitting self-paced learning
Show, at a minimum, that the student is not harmed at all, or is only minimally harmed

Evaluation Methods: ITS
Difficult to "prove" that an individual student learns better/the same/worse with an ITS than without
–Cannot make a student unlearn the material between experiments!
Attempt to use a statistically significant number of students, to show a probable overall effect

Evaluation Methods: ITS
Usually suffers from the same problems as evaluating UMs and ubiquitous multimedia systems
Students volunteer to evaluate ITSs
–So they are more likely to be motivated, and therefore to perform better
–The novelty of the system is also a motivator
–There are too many variables that are difficult to cater for

Evaluation Methods: ITS
However, an empirical evaluation is usually performed
Volunteers work with the system
Pass rates, retention rates, etc., may be compared to a conventional learning environment (quantitative analysis)
Volunteers are asked for feedback about, e.g., usability (qualitative analysis)
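
A minimal sketch of the quantitative comparison: pass/fail counts for the ITS cohort versus a conventional cohort, tested with a chi-square test on a 2x2 contingency table (the counts are invented):

from scipy.stats import chi2_contingency

its_cohort = [42, 8]      # passed, failed
conventional = [35, 15]   # passed, failed

chi2, p_value, dof, expected = chi2_contingency([its_cohort, conventional])
print(f"chi2 = {chi2:.2f}, p = {p_value:.3f}")  # a small p suggests the pass rates differ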

Evaluation Methods: ITS
Frequently, students are split into groups (control and test) and their performance is measured against each other
The control is usually the ITS without the "I": students must find their own way through the learning material
–However, this is difficult to assess, because the performance of the control group may be worse than with traditional learning!

Evaluation Methods: ITS
"Learner achievement" metric (Muntean, 2004)
–How much has the student learnt from the ITS?
–Compare pre-learning knowledge to post-learning knowledge
Can compare different systems (as long as they use the same learning material), but with different users: so the same problem as before
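
A minimal sketch of one common way to operationalise the pre-/post-test comparison, the normalised learning gain; this particular formula is a widely used convention and is not claimed to be Muntean's own metric:

def normalised_gain(pre_score, post_score, max_score=100):
    """(post - pre) / (max - pre): the fraction of the possible
    improvement that the student actually achieved."""
    room_to_improve = max_score - pre_score
    if room_to_improve == 0:
        return 0.0
    return (post_score - pre_score) / room_to_improve

# A student who moves from 40% to 70% realises half of the possible gain
print(normalised_gain(40, 70))  # 0.5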

Evaluation Methods: AHS
Adaptive Hypertext Systems
There are currently no standard metrics for evaluating AHSs
Best practices are taken from fields like ITS, IR, and UM and applied to AHS
The typical evaluation reports "experiences" of using the system with and without its adaptive features

Evaluation Methods: AHS
If a test collection existed for AHS (like TREC), what might it look like?
–Descriptions of user models + relevance judgements for relevant links, relevant documents, and relevant presentation styles
–Would we need a standard "open" user model description? Are all user models capturing the same information about the user?

Evaluation Methods: AHS
What about following paths through hyperspace to pre-specified points and then having sets of judgements at those points?
Currently, adaptive hypertext systems appear to be performing very different tasks, but even if we take just one of the two things that can be adapted (e.g., links), it appears to be beyond our current ability to agree on how adapting links should be evaluated, mainly because of the UM!

Evaluation Methods: AHS
HyperContext (HCT) (HCTCh8.pdf)
HCT builds a short-term user model as the user navigates through hyperspace
We evaluated HCT's ability to make "See Also" recommendations
Ideally, we would have had a hyperspace with independent relevance judgements at particular points along a path of traversal

Evaluation Methods: AHS
Instead, we used two mechanisms for deriving the UM (one using the interpretation, the other using the whole document)
After 5 link traversals we automatically generated a query from each user model, submitted it to a search engine, and found a relevant interpretation/document respectively
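
A minimal sketch of the general idea of turning a term-weighted user model into a query by taking its strongest terms; this is only an illustration of the technique, not HCT's actual implementation, and every name here is invented:

def query_from_user_model(term_weights, k=10):
    """Use the k highest-weighted terms in the user model as the query."""
    top_terms = sorted(term_weights, key=term_weights.get, reverse=True)[:k]
    return " ".join(top_terms)

# Hypothetical short-term model accumulated over 5 link traversals
user_model = {"retrieval": 0.9, "adaptive": 0.7, "hypertext": 0.6,
              "evaluation": 0.3, "malta": 0.1}
print(query_from_user_model(user_model, k=3))  # "retrieval adaptive hypertext"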

Evaluation Methods: AHS
Users were asked to read all the documents in the path and then give a relevance judgement for each "See Also" recommendation
Recommendations were shown in random order
Users didn't know which was the HCT recommendation and which was not
We assumed that if the user considered a document to be relevant, then the UM is accurate

Evaluation Methods: AHS
There were not really enough participants to make strong claims about the HCT approach to AH
There were no really significant differences in relevance judgements between the different ways of deriving the UM (although both performed reasonably well!)
However, there were significant findings if reading time is taken as an indication of skim-/deep-reading!

Evaluation Methods: AHS
Should users have been shown both documents?
–Could reading two documents, instead of just one, have affected the judgement of the document read second?
Were users disaffected because it wasn't a task that they needed to perform?

Evaluation Methods: AHS
Ideally, systems are tested in "real world" conditions in which the evaluators are performing their own tasks
Normally, experimental set-ups require users to perform artificial tasks, and it is difficult to measure performance because relevance is subjective!

Evaluation Methods: AHS
This is one of the criticisms of the TREC collections, but it does allow systems to be compared, even if the story is completely different once the system is in real use
Building a system robust enough for use in the real world is expensive
But then, so is conducting lab-based experiments

Modular Evaluation of AUIs
Adaptive User Interfaces, or User-Adaptive Systems
It is difficult to evaluate "monolithic" systems
So break UASs up into "modules" that can be evaluated separately

Modular Evaluation of AUIs
Paramythis et al. recommend:
–identifying the "evaluation objects" that can be evaluated separately and in combination
–presenting the "evaluation purpose": the rationale for the modules and the criteria for their evaluation
–identifying the "evaluation process": the methods and techniques for evaluating the modules during the AUI life cycle
paramythis.pdf
