Published by Audrey Chandler. Modified over 9 years ago.
1
University of Malta CSA4080: Topic 8 © 2004- Chris Staff 1 of 49 cstaff@cs.um.edu.mt
CSA4080: Adaptive Hypertext Systems II
Dr. Christopher Staff
Department of Computer Science & AI
University of Malta
Topic 8: Evaluation Methods
2
Aims and Objectives
Background to evaluation methods in user-adaptive systems
Brief overviews of the evaluation of IR, QA, User Modelling, Recommender Systems, Intelligent Tutoring Systems, Adaptive Hypertext Systems
3
Background to Evaluation Methods
Systems need to be evaluated to demonstrate (prove) that the hypothesis on which they are based is correct
In IR, we need to know that the system is retrieving all and only relevant documents for the given query
4
Background to Evaluation Methods
In QA, we need to know the correct answer to questions, and measure performance
In User Modelling, we need to determine that the model is an accurate reflection of information needed to adapt to the user
In Recommender Systems, we need to associate user preferences either with other similar users, or with product features
5
Background to Evaluation Methods
In Intelligent Tutoring Systems, we need to know that learning through an ITS is beneficial or at least not (too) harmful
In Adaptive Hypertext Systems, we need to measure the system’s ability to automatically represent user interests, to direct the user to relevant information, and to present the information in the best way
6
Measuring Performance
Information Retrieval:
– Recall and Precision (overall, and also at top-n)
Question Answering:
– Mean Reciprocal Rank
7
Measuring Performance
User Modelling:
– Precision and Recall: if the user is given all and only relevant info, or if the system behaves exactly as the user needs, then the model is probably correct
– Accuracy and predicted probability: to predict a user’s actions, location, or goals
– Utility: the benefit derived from using the system
8
Measuring Performance
Recommender Systems:
– Content-based may be evaluated using precision and recall
– Collaborative is harder to evaluate, because it depends on other users the system knows about
  – Quality of individual item prediction
  – Precision and Recall at top-n
9
Measuring Performance
Intelligent Tutoring Systems:
– Ideally, being able to show that a student can learn more efficiently using an ITS than without
– Usually, show that no harm is done
  – Then, “releasing the tutor” and enabling self-paced learning becomes a huge advantage
– Difficult to evaluate
  – Cannot compare the same student with and without the ITS
  – Students who volunteer are usually very motivated
10
Measuring Performance
Adaptive Hypertext Systems:
– Can mix UM, IR, RS (content-based) methods of evaluation
– Use empirical approach
  – Different sets of users solve the same task, one group with adaptivity, the other without
  – How to choose participants?
11
Evaluation Methods: IR
IR systems’ performance is normally measured using precision and recall
– Precision: percentage of retrieved documents that are relevant
– Recall: percentage of relevant documents that are retrieved
Who decides which documents are relevant?
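The two definitions above can be sketched in a few lines of Python. This is only an illustration: the function name `precision_recall` and the document identifiers are made up for the example, and relevance is treated as binary, as in the QRels discussed in these slides.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a single query.

    retrieved: documents the system returned
    relevant:  documents judged relevant for the query
    """
    retrieved = set(retrieved)
    relevant = set(relevant)
    hits = len(retrieved & relevant)  # relevant documents that were retrieved
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical query: four documents retrieved, three judged relevant.
p, r = precision_recall(["d1", "d2", "d3", "d4"], ["d2", "d4", "d7"])
print(p, r)
```

In this made-up case two of the four retrieved documents are relevant, and two of the three relevant documents were found, so precision and recall pull in different directions.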
12
Evaluation Methods: IR
Query Relevance Judgements
– For each test query, the document collection is divided into two sets: relevant and non-relevant
– Systems are compared using precision and recall
– In early collections, humans would classify documents (p3-cleverdon.pdf)
  – Cranfield collection: 1400 documents/221 queries
  – CACM: 3204 documents/50 queries
13
Evaluation Methods: IR
Do humans always agree on relevance judgements?
– No: can vary considerably (mizzaro96relevance.pdf)
– So only use documents on which there is full agreement
14
Evaluation Methods: IR
TExt Retrieval Conference (TREC) (http://trec.nist.gov)
– Runs competitions every year
– QRels and document collection made available in a number of tracks (e.g., ad hoc, routing, question answering, cross-language, interactive, Web, terabyte,...)
15
Evaluation Methods: IR
What happens when the collection grows?
– E.g., Web track has 1GB of data! Terabyte track in the pipeline
– Pooling
  – Give different systems the same document collection to index and the same queries
  – Take the top-n retrieved documents from each
  – Documents that are present in all retrieved sets are relevant, others not, OR assessors judge the relevance of unique documents in the pool
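The pooling steps above can be sketched as follows. The system names, rankings, and function names are hypothetical; the sketch only shows how the pool handed to assessors differs from the stricter "present in all retrieved sets" variant.

```python
def build_pool(runs, n):
    """Union of the top-n documents from every system's ranked list.

    These are the unique documents that assessors would judge.
    """
    pool = set()
    for ranking in runs.values():
        pool.update(ranking[:n])
    return pool


def in_all_runs(runs, n):
    """Documents that appear in the top-n of every system."""
    tops = [set(ranking[:n]) for ranking in runs.values()]
    return set.intersection(*tops) if tops else set()


# Hypothetical ranked lists from three systems for one query.
runs = {
    "sysA": ["d1", "d2", "d3"],
    "sysB": ["d2", "d4", "d1"],
    "sysC": ["d2", "d5", "d6"],
}
pool = build_pool(runs, 2)     # documents to be judged by assessors
agreed = in_all_runs(runs, 2)  # documents every system retrieved
print(sorted(pool), sorted(agreed))
```

Any document outside the pool is never judged, which is why QRels built this way are incomplete.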
16
Evaluation Methods: IR
Advantages:
– Possible to compare system performance
– Relatively cheap
  – QRels and document collection can be purchased for a moderate price rather than organising expensive user trials
– Can use standard IR systems (e.g., SMART) and build another layer on top, or build a new IR model
– Automatic and Repeatable
17
Evaluation Methods: IR
Common criticisms:
– Judgements are subjective
  – Same assessor may change judgement at different times!
  – Doesn’t affect ranking
– Judgements are binary
– Some relevant documents are missed by pooling (QRels are incomplete)
  – Doesn’t affect system performance
18
Evaluation Methods: IR
Common criticisms (contd.):
– Queries are too long
  – Queries under test conditions can have several hundred terms
  – Average Web query length 2.35 terms (p5-jansen.pdf)
19
Evaluation Methods: IR
In massive document collections there may be hundreds, thousands, or even millions of relevant documents
Must all of them be retrieved?
Measure precision at top-5, 10, 20, 50, 100, 500 and take weighted average over results (Mean Average Precision)
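A rough sketch of precision at a cutoff and of average precision follows. Note the assumption: the slide describes averaging precision over fixed cutoffs, whereas the sketch uses the formulation commonly found in the IR literature, where precision is taken at each rank that holds a relevant document and Mean Average Precision is the mean of that value over queries. All names and data are hypothetical.

```python
def precision_at_k(ranking, relevant, k):
    """Fraction of the top-k ranked documents that are relevant."""
    return sum(1 for doc in ranking[:k] if doc in relevant) / k


def average_precision(ranking, relevant):
    """Mean of the precision values at each rank holding a relevant document.

    Relevant documents that are never retrieved contribute zero.
    """
    hits = 0
    total = 0.0
    for rank, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0


def mean_average_precision(queries):
    """queries: list of (ranking, relevant_set) pairs, one per query."""
    return sum(average_precision(r, rel) for r, rel in queries) / len(queries)


# Hypothetical ranked list; d9 is relevant but never retrieved.
ranking = ["d1", "d2", "d3", "d4", "d5"]
relevant = {"d1", "d3", "d9"}
p5 = precision_at_k(ranking, relevant, 5)
ap = average_precision(ranking, relevant)
print(p5, ap)
```

Because the measure rewards relevant documents near the top of the list, a system does not need to retrieve every relevant document to score well on the early ranks.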
20
The E-Measure
Combine Precision and Recall into one number (http://www.dcs.gla.ac.uk/Keith/Chapter.7/Ch.7.html)
P = precision
R = recall
b = measure of relative importance of P or R
E.g., b = 0.5 means user is twice as interested in precision as recall
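The formula itself did not survive on this slide. A minimal sketch, assuming the usual van Rijsbergen form E = 1 - (1 + b^2)PR / (b^2 P + R), is given below; the function name and the sample values are illustrative only. Lower E is better, and 1 - E is the familiar F-measure.

```python
def e_measure(precision, recall, b=1.0):
    """E = 1 - (1 + b^2) * P * R / (b^2 * P + R); lower is better."""
    if precision == 0.0 or recall == 0.0:
        return 1.0  # worst case, and avoids dividing by zero
    b2 = b * b
    return 1.0 - ((1.0 + b2) * precision * recall) / (b2 * precision + recall)


# With b = 0.5 precision is weighted more heavily, so a high-precision
# system gets a lower (better) E than a high-recall one.
high_precision = e_measure(0.8, 0.4, b=0.5)
high_recall = e_measure(0.4, 0.8, b=0.5)
print(high_precision, high_recall)
```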
21
Evaluation Methods: QA
The aim in Question Answering is not to ensure that the overwhelming majority of relevant documents are retrieved, but to return an accurate answer
Precision and recall are not accurate enough
Usual measure is Mean Reciprocal Rank
22
Evaluation Methods: QA
MRR is the mean, over all queries, of the reciprocal rank of the first correct answer (1/rank, or 0 if the correct answer is not in the top-5)
Ideally, the first correct answer is put into rank 1
qa_report.pdf
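As a small illustration of the scoring rule above, the following sketch computes MRR from the rank at which each question's first correct answer appeared. The function name, the use of `None` for "no correct answer returned", and the sample ranks are assumptions made for the example.

```python
def mean_reciprocal_rank(first_correct_ranks, cutoff=5):
    """Mean of 1/rank of the first correct answer, one entry per question.

    A question scores 0 if no correct answer was returned (None) or if
    the first correct answer falls below the cutoff.
    """
    total = 0.0
    for rank in first_correct_ranks:
        if rank is not None and rank <= cutoff:
            total += 1.0 / rank
    return total / len(first_correct_ranks)


# Hypothetical results for four questions: correct at rank 1, rank 3,
# not found at all, and rank 2.
mrr = mean_reciprocal_rank([1, 3, None, 2])
print(mrr)
```

A system that always places the correct answer first scores 1.0, which is the ideal case mentioned on the slide.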
23
Evaluation Methods: UM
Information Retrieval evaluation has matured to the extent that it is very unusual to find an academic publication without a standard approach to evaluation
On the other hand, up to 2001, only one-third of user models presented in UMUAI had been evaluated, and most of those were ITS related (see later)
p181-chin.pdf
24
Evaluation Methods: UM
Unlike IR systems, it is difficult to evaluate UMs automatically
– Unless they are stereotypes/coarse-grained classification systems
So they tend to need to be evaluated empirically
– User studies
– Want to measure how well participants do with and without a UM supporting their task
25
Evaluation Methods: UM
Difficulties/problems include:
– Ensuring a large enough number of participants to make results statistically meaningful
– Catering for participants improving during rounds
– Failure to use a control group
– Ensuring that nothing happens to modify participants’ behaviour (e.g., thinking aloud)
26
Evaluation Methods: UM
Difficulties/problems (contd.):
– Biasing the results
– Not using blind-/double-blind testing when needed
– ...
27
Evaluation Methods: UM
Proposed reporting standards
– Number, source, and relevant background of participants
– Independent, dependent and covariant variables
– Analysis method
– Post-hoc probabilities
– Raw data (in the paper, or on-line via WWW)
– Effect size and power (at least 0.8)
p181-chin.pdf
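To make "effect size" concrete, here is a sketch of one widely used measure, Cohen's d, for a test group versus a control group. This is an assumption about which measure to illustrate, not necessarily the one the cited paper prescribes, and the scores are invented for the example.

```python
from math import sqrt
from statistics import mean, stdev


def cohens_d(test, control):
    """Difference between group means in units of the pooled standard deviation."""
    n1, n2 = len(test), len(control)
    s1, s2 = stdev(test), stdev(control)  # sample standard deviations
    pooled = sqrt(((n1 - 1) * s1 ** 2 + (n2 - 1) * s2 ** 2) / (n1 + n2 - 2))
    return (mean(test) - mean(control)) / pooled


# Hypothetical task scores: with the user model (test) and without (control).
test_scores = [70, 75, 80, 85, 90]
control_scores = [60, 65, 70, 75, 80]
d = cohens_d(test_scores, control_scores)
print(round(d, 3))
```

Reporting a value like this alongside the significance test lets readers judge how large a difference was observed, not just whether one was detected.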
28
Evaluation Methods: RS
Recommender Systems
Two types of recommender system
– Content-based
– Collaborative
Both (tend to) use VSM to plot users/product features into n-dimensional space
29
Evaluation Methods: RS
If we know the “correct” recommendations to make to a user with a specific profile, then we can use Precision, Recall, E-Measure, F-Measure, Mean Average Precision, MRR, etc.
30
Evaluation Methods: ITS
Intelligent Tutoring Systems
Evaluation to demonstrate that learning through ITS is at least as effective as traditional learning
– Cost benefit of freeing up tutor, and permitting self-paced learning
Show at a minimum that student is not harmed at all or is minimally harmed
31
Evaluation Methods: ITS
Difficult to “prove” that an individual student learns better/same/worse with ITS than without
– Cannot make student unlearn material in between experiments!
Attempt to use a statistically significant number of students, to show probable overall effect
32
Evaluation Methods: ITS
Usually suffers from same problems as evaluating UMs, and ubiquitous multimedia systems
Students volunteer to evaluate ITSs
– So are more likely to be motivated and so perform better
– Novelty of system is also a motivator
– Too many variables that are difficult to cater for
33
Evaluation Methods: ITS
However, usually empirical evaluation is performed
Volunteers work with system
Pass rates, retention rates, etc., may be compared to conventional learning environment (quantitative analysis)
Volunteers asked for feedback about, e.g., usability (qualitative analysis)
34
Evaluation Methods: ITS
Frequently, students are split into groups (control and test) and performance measured against each other
Control is usually ITS without the I - students must find their own way through learning material
– However, this is difficult to assess, because performance of control group may be worse than traditional learning!
35
Evaluation Methods: ITS
“Learner achievement” metric (Muntean, 2004)
– How much has student learnt from ITS?
– Compare pre-learning knowledge to post-learning knowledge
Can compare different systems (as long as they use same learning material), but with different users: so same problem as before
36
Evaluation Methods: AHS
Adaptive Hypertext Systems
There are currently no standard metrics for evaluating AHSs
Best practices are taken from fields like ITS, IR, and UM and applied to AHS
Typical evaluation is “experiences” of using system with and without adaptive features
37
Evaluation Methods: AHS
If a test collection existed for AHS (like TREC) what might it look like?
– Descriptions of user models + relevance judgements for relevant links, relevant documents, relevant presentation styles
– Would we need a standard “open” user model description?
  – Are all user models capturing the same information about the user?
38
Evaluation Methods: AHS
– What about following paths through hyperspace to pre-specified points and then having the sets of judgements?
– Currently, adaptive hypertext systems appear to be performing very different tasks, but even if we take just one of the two things that can be adapted (e.g., links), it appears to be beyond our current ability to agree on how adapting links should be evaluated, mainly due to UM!
39
Evaluation Methods: AHS
HyperContext (HCT) (HCTCh8.pdf)
HCT builds a short-term user model as a user navigates through hyperspace
We evaluated HCT’s ability to make “See Also” recommendations
Ideally, we would have had a hyperspace with independent relevance judgements at particular points in the path of traversal
40
Evaluation Methods: AHS
Instead, we used two mechanisms for deriving UM (one using interpretation, the other using whole document)
After 5 link traversals we automatically generated a query from each user model, submitted it to a search engine and found a relevant interpretation/document respectively
41
Evaluation Methods: AHS
Users asked to read all documents in the path and then give a relevance judgement for each “See Also” recommendation
Recommendations shown in random order
Users didn’t know which was HCT recommended and which was not
Assumed that if user considered doc to be relevant, then UM is accurate
42
Evaluation Methods: AHS
Not really enough participants to make strong claims about HCT approach to AH
Not really significant differences in RJs between different ways of deriving UM (although both performed reasonably well!)
However, significant findings if reading time is indication of skim-/deep-reading!
43
Evaluation Methods: AHS
Should users have been shown both documents?
– Could reading two documents, instead of just one, have affected judgement of the doc read second?
Were users disaffected because it wasn’t a task that they needed to perform?
44
Evaluation Methods: AHS
Ideally, systems are tested in “real world” conditions in which evaluators are performing tasks
Normally, experimental set-ups require users to perform artificial tasks, and it is difficult to measure performance because relevance is subjective!
45
Evaluation Methods: AHS
This is one of the criticisms of the TREC collections, but it does allow systems to be compared - even if the story is completely different once the system is in real use
Building a robust enough system for use in the real world is expensive
But then, so is conducting lab-based experiments
46
Modular Evaluation of AUIs
Adaptive User Interfaces, or User-Adaptive Systems
Difficult to evaluate “monolithic” systems
So break up UASs into “modules” that can be evaluated separately
47
Modular Evaluation of AUIs
Paramythis et al. recommend
– identifying the “evaluation objects” - that can be evaluated separately and in combination
– presenting the “evaluation purpose” - the rationale for the modules and criteria for their evaluation
– identifying the “evaluation process” - methods and techniques for evaluating modules during the AUI life cycle
paramythis.pdf
48
Modular Evaluation of AUIs
49
Modular Evaluation of AUIs