
1 Lecture 5 Experimental Design Lecturer – Prof Jim Warren (with references to Dix et al. chapter 9)

2 Evaluating Implementations Requires an artefact: a simulation, a prototype, or a full implementation

3 Experimental evaluation
– controlled evaluation of specific aspects of interactive behaviour
– the evaluator chooses a hypothesis to be tested
– a number of experimental conditions are considered, which differ only in the value of some controlled variable
– changes in the behavioural measure are attributed to the different conditions

4 Experimental factors
Subjects (i.e., the users, aka 'participants')
– who: a representative, sufficient sample – not the programmer's friend, boss, etc.
– huge variability in the performance of individuals
Variables
– things to modify and measure
Hypothesis
– what you'd like to show
Experimental design
– how you are going to show it
– includes the 'protocol' – what the subjects do

5 Variables
independent variable (IV)
– the characteristic changed to produce different conditions
– e.g. interface style, number of menu items
dependent variable (DV)
– the characteristic measured in the experiment
– e.g. time taken, number of errors

6 Hypothesis
prediction of outcome
– framed in terms of IV and DV
– e.g. "error rate will increase as font size decreases"
null hypothesis:
– states no difference between conditions
– aim is to disprove this
– e.g. null hyp. = "no change with font size"

7 Experimental design
"within groups" design (also called "repeated measures")
– each subject performs the experiment under each condition
– transfer of learning possible (practice makes performance better; alternatively, fatigue or boredom makes it worse)
– less costly and less likely to suffer from user variation (each user is compared to themselves)
"between groups" design
– each subject performs under only one condition
– no transfer of learning
– more users required

8 Within v. Between
Consider a test of the effect of beer v. vodka martinis on reaction time
– null hypothesis: no difference in the increase in reaction time between the two beverages
Design 1:
– 30 people try beer; 30 other people try vodka
– D.V. is the change in reaction time pre- v. post-drinking
– not bad – be sure to randomize who goes into the beer group v. the vodka group
– but the 'power' of the experiment will be reduced by the great variability among individuals in their reaction to alcohol

9 Within v. Between (contd.)
Design 2:
– all 60 people first try beer, then immediately try vodka
– problem of carryover effect
Better design:
– all 60 try beer, then a week later try vodka
– now each individual is compared with themselves
– still a possible problem of ordering effect (e.g., they might get a little better at the reaction time test)
Best design:
– 30 try beer, then a week later vodka; 30 try vodka, then a week later beer
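Setting up the counterbalanced "best design" is mostly a matter of random assignment. Here is a minimal sketch, assuming Python is available; the participant IDs are hypothetical and only illustrate splitting 60 subjects into the two order groups.

```python
import random

# Hypothetical IDs for the 60 participants in the example above
participants = [f"P{i:02d}" for i in range(1, 61)]

random.shuffle(participants)      # randomise before splitting into order groups
beer_first = participants[:30]    # beer now, vodka a week later
vodka_first = participants[30:]   # vodka now, beer a week later

print("Beer first: ", beer_first)
print("Vodka first:", vodka_first)
```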

10 Analysis of data
Before you start to do any statistics:
– look at the data (e.g. average = 5.25 – but 4.9 without the "outlier")
Choice of statistical technique depends on
– type of data
– information required
Type of data
– discrete: a finite number of values, which may be ordered or unordered (e.g., colors)
– continuous: any value
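As a concrete illustration of "look at the data first", here is a minimal sketch with invented task times chosen so that they reproduce the averages quoted above; the one large value drags the mean from 4.9 up to 5.25.

```python
# Hypothetical task times (seconds); the last value is the suspect outlier
times = [4.8, 5.1, 4.7, 5.0, 6.65]

mean_all = sum(times) / len(times)                  # 5.25 with the outlier
mean_trimmed = sum(times[:-1]) / (len(times) - 1)   # 4.90 without it

print(f"mean with outlier:    {mean_all:.2f}")
print(f"mean without outlier: {mean_trimmed:.2f}")
```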

11 ANOVA – analysis of variance
Quite easy to test whether there's a significant difference between groups in Excel
– need to enable the Analysis ToolPak via Tools/Add-ins
– then just apply Tools/Data Analysis/ANOVA: Single Factor to the data

12 ANOVA from Excel
Say we have three columns of numbers representing the time to complete a task for 5, 5 and 7 users using three variations of an interface
If the P-value < 0.05 then we usually say the result is 'significant' (the result is more than expected chance variation)
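The same single-factor ANOVA can also be run outside Excel, e.g. with SciPy. The sketch below uses invented timing data shaped like the example above (5, 5 and 7 users on three interface variants); scipy.stats.f_oneway returns the F statistic and the P-value to compare against 0.05.

```python
from scipy.stats import f_oneway

# Hypothetical task-completion times (seconds) for three interface variants
variant_a = [22.1, 25.3, 24.8, 23.0, 26.5]              # 5 users
variant_b = [30.2, 28.7, 33.1, 29.5, 31.0]              # 5 users
variant_c = [24.0, 23.5, 26.1, 25.2, 24.8, 27.3, 23.9]  # 7 users

f_stat, p_value = f_oneway(variant_a, variant_b, variant_c)
print(f"F = {f_stat:.2f}, P = {p_value:.4f}")
if p_value < 0.05:
    print("Significant difference between the interface variants")
```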

13 When is a difference a difference?
In the world of parametric stats, we look for a statistic to be large enough to be 'significant'
– on the Gaussian ('normal') curve, |Z| = 1.96 encloses the central 95% of the area (leaving 2.5% in each tail), so 1.96 is a common 'critical value' for claiming significance
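That 1.96 is simply the two-tailed 5% critical value of the standard normal distribution; a quick check (assuming SciPy is installed):

```python
from scipy.stats import norm

z_crit = norm.ppf(0.975)                   # upper 2.5% point of the standard normal
area = norm.cdf(1.96) - norm.cdf(-1.96)    # central area between -1.96 and +1.96

print(f"critical value: {z_crit:.3f}")     # ~1.960
print(f"central area:   {area:.3f}")       # ~0.950
```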

14 Parametric assumptions
Parametric statistics assume that some mathematically elegant assumptions hold true for the data
– e.g., ANOVA (and standard 'regression') assume, among other things, normally distributed random error
– trivia: the mathematical form of the probability density function for the normal distribution is remarkably formidable
– it centres on the mean, μ, and is flattened by the standard deviation, σ
– a Galton machine simulates the normal distribution (aka the 'bell curve')
– the exponential distribution models the time between events happening at a constant average rate
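For reference, the "remarkably formidable" normal density, and the exponential density for waiting times at a constant average rate λ, are (standard results, quoted here for completeness):

f(x) = \frac{1}{\sigma\sqrt{2\pi}} \exp\!\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)

g(t) = \lambda e^{-\lambda t}, \quad t \ge 0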

15 So, what to measure?
Usually one (or several) of these things:
– Speed / efficiency: how many items (or whatever) per unit time can the user process with this interface?
– Accuracy (errors)
– Learnability (time to acquire the ability to do something in particular with the interface), and retention (how well they can do it some particular time later)
– Satisfaction: subjective assessment – how does the user feel about the interface?

16 Subjective assessment
Very important
– if the user seriously doesn't like it, there's probably something really wrong with the design (they just may not be able to articulate what)
Quantifiable
– Yes/No, or better, a Likert scale (ratings), through interview or questionnaire
Need to ask the right questions
– e.g., don't use leading questions (BAD: "Is this the absolute worst system you have ever used?")
And need to ask the questions well (so the user reliably expresses what they mean)
– e.g., don't trip them up with double negatives

17 Likert Scale
Can be from 4 to 7 "points"
Usually about agreement with a phrase
– e.g., "I found the search function easy to use"
– Strongly Agree, Agree, Neutral (optional), Disagree, Strongly Disagree
May also be about importance
– e.g., "A site search function is…"
– "not very important" to "extremely important"
Or a general assessment
– e.g., "The performance of the search function was…"
– Poor, Fair, Satisfactory, Good, Excellent
Great to ask open-ended questions, too
– e.g., "What was the best aspect of the search function?"
– but it's the Likert scale data that you can quantify
See Heim – pages
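To actually quantify Likert data you code the labels as numbers and then summarise. A minimal sketch, assuming a 5-point agreement scale coded 1–5 (the responses below are hypothetical, not from the lecture):

```python
from statistics import mean, median

# Hypothetical responses to "I found the search function easy to use"
responses = ["Agree", "Strongly Agree", "Neutral", "Agree", "Disagree",
             "Agree", "Strongly Agree", "Agree"]

# Code the agreement scale as 1 (Strongly Disagree) .. 5 (Strongly Agree)
coding = {"Strongly Disagree": 1, "Disagree": 2, "Neutral": 3,
          "Agree": 4, "Strongly Agree": 5}

scores = [coding[r] for r in responses]
print(f"n = {len(scores)}, median = {median(scores)}, mean = {mean(scores):.2f}")
```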

18 Dimensions and validity
When designing a questionnaire…
– have in mind a few underlying issues that you are trying to assess
– ask a few different questions that are coordinated around each issue
– ask in different ways – vary whether a positive or a negative answer favours the issue in question
– ideally, verify the questionnaire by having people role-play particularly happy, particularly angry, and middle-of-the-road users – and see if they answer the questions the way you'd expect!

19 Questionnaire administration
Face-to-face or telephone interviews
– esp. efficient (and unbiased) to have a third party give the telephone survey
Mail-out (including email) questionnaire
Web-based questionnaire (maybe email out the URL)
Even a questionnaire is an experiment on humans from the point of view of research ethics
– easier to achieve ethical questionnaire administration if it's truly anonymous

20 Response rates and bias
A low response rate is a problem
– below a 50% response rate, one wonders whether the respondents were exceptional (the happiest, the angriest, or a mix; but not the "normal" folks)
– better if you have some authority to motivate a response
– but back to issues of ethics – e.g., it's not truly anonymous if you know who to nag about non-response; and the "pressure" may be unfair
Similar bias problems arise when you use volunteers for any experiment
– are these volunteers representative of your "normal" users?

21 Experimental studies on groups
More difficult than single-user experiments
Problems with:
– subject groups
– choice of task
– data gathering
– analysis
Unfortunately (in terms of experimental requirements), a lot of the things that are interesting in the real world involve computers mediating group behaviour

22 Subject groups
– larger number of subjects ⇒ more expensive
– longer time to 'settle down' … even more variation!
– difficult to timetable
– so … often only three or four groups

23 Groups (contd.): Data gathering
several video cameras + direct logging of the application
problems:
– synchronisation
– sheer volume!
one solution:
– record from each perspective

24 Groups (contd.): Analysis
N.B. vast variation between groups
solutions:
– 'within groups' experiments (each group works under various conditions)
– micro-analysis (e.g., gaps in speech)
– anecdotal and qualitative analysis
controlled experiments may 'waste' resources!
– experiments dominated by the dynamics of group formation
– field studies are apt to be more realistic

25 About statistics
It's an amazingly complex field
– a lot of hidden complexities in running experiments and in saying that the observed differences really make a difference
'Threats to validity' are the things that make it possible that your experimental conclusion is in error
– threats to internal validity: e.g. carryover effects, or a lack of randomization
– threats to external validity: e.g. your whole population of subjects was unusual in some way, or the task was not representative of real use of the tool
When the outcomes are serious (e.g., medical trials), professional statisticians are always used in the design of the experiment as well as in the analysis and reporting of the findings
Plenty of texts and courses on stats are available (Wikipedia is pretty good on these topics, too – e.g., for ANOVA)

26 Unwanted biases in studies
You can't always take a study result at face value… you must be attentive to what the subjects are feeling
Hawthorne effect
– the worker is more productive when observed
John Henry effect
– the worker is [stubbornly] more productive when using his old tools
Placebo effect
– the patient [usually] gets some benefit just because they expect a benefit
Pygmalion effect
– the student performs better simply because they are expected to do so

27 Usability Analysis – Conclusion
Remember: the ultimate goal is to learn
– learn what's working and, most critically, what isn't working for the end user
– do the usability testing that helps you make the best possible interface
Test within your constraints
– a quick talk-aloud protocol session is far better than nothing and will probably find the most critical flaws
– then again, if it's a "bet the business" interface, and it's a big business, then organise testing on an appropriate scale!
The hardest bit might be finding the time; nobody likes to delay a product release (but nobody wants to release a failure, either)