Evaluation (cont.): Empirical Studies CS352

Announcements Where we are in PRICPE: –Predispositions: Did this in Project Proposal. –RI: Research was studying users. Hopefully led to Insights. –CP: Concept and initial (very low-fi) Prototypes due Fri 7/16 at midnight. –Evaluate throughout, repeat iteratively!!

Evaluation Analytical – based on your head Empirical – based on data Advantages/disadvantages of empirical –More expensive (time, $) to do. +Greater accuracy in what users REALLY do. +You’ll get more surprises. +Greater credibility with your bosses.

Empirical Work with Humans What do you want from it? –List of problems: Usability study (e.g., 5 users in a lab). –List of behaviors/strategies/needs: Think-aloud study or field observation. –Compare, boolean outcome (e.g., A>B): Statistical study. Note: –Impossible to “prove” no problems exist. Can only find problems.

The Usability Study Returns a list of UI problems. Metrics (recorded per session; see the sketch below): –Time to complete task –Errors made –Difficulty to use (via questionnaire) –Emotional response, e.g., stressed out, discouraged, fun, enjoyable… Pick a task and user profile. Users do the task with your prototype –in this class: paper OR CogTool
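A minimal sketch of how those metrics might be recorded per session; the type and field names are illustrative, not part of any prescribed toolkit:

```python
from dataclasses import dataclass, field

@dataclass
class SessionRecord:
    """One participant's usability-study metrics (names are illustrative)."""
    participant_id: str
    task: str
    seconds_to_complete: float   # time to complete the task
    errors_made: int             # errors made during the task
    difficulty_rating: int       # e.g., 1 (easy) to 7 (hard), from the questionnaire
    emotional_notes: list = field(default_factory=list)  # "stressed out", "fun", ...

record = SessionRecord("P01", "save a file", 312.5, 4, 5, ["discouraged"])
```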

Examples A Xerox Palo Alto Research Center (PARC) employee wrote that PARC used extensive usability testing in creating the Xerox Star, introduced in 1981. Only about 25,000 were sold, leading many to consider the Xerox Star a commercial failure. [2] The book Inside Intuit says (page 22, 1984), "... in the first instance of the Usability Testing that later became standard industry practice, LeFevre recruited people off the streets... and timed their Kwik-Chek (Quicken) usage with a stopwatch. After every test... programmers worked to improve the program." [1] Scott Cook, Intuit co-founder, said, "... we did usability testing in 1984, five years before anyone else... there's a very big difference between doing it and having marketing people doing it as part of their... design... a very big difference between doing it and having it be the core of what engineers focus on." [3]

Usability Study: How How many: 5 users typically find 60-80% of the problems (see the sketch below). How: –Be organized!! Have everything ready,... –Test it first. –Users do task (one user at a time). Data IS the result (no stats needed).
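The 5-user rule of thumb comes from a simple discovery model often attributed to Nielsen and Landauer: the share of problems found by n users is 1 - (1 - λ)^n, where λ is the average chance that one user hits a given problem. A minimal sketch in Python, assuming the commonly cited λ ≈ 0.31 (the exact value varies by study):

```python
# Proportion of usability problems found by n users under the
# found(n) = 1 - (1 - lam)**n model. lam = 0.31 is the commonly
# cited average per-user discovery rate -- an assumption, not a constant.
lam = 0.31
for n in range(1, 11):
    print(n, round(1 - (1 - lam) ** n, 2))
```

With λ = 0.31, five users find about 84% of the problems; smaller λ values land in the 60-80% range quoted above.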

Think-Aloud Usually most helpful with a working prototype or system, –but you may be able to get some use out of it for early prototypes. Returns a list of behaviors/strategies/impressions/thoughts... Pick a task and user profile. Users do the task with your prototype.

Think-Aloud: How How many: 5-10 users is usual. –Data analysis is time-consuming. –In this class: 1-2 max! How: –Be organized!! Have everything ready,... –Test it first. –Users do task (one user at a time). Analyze data for patterns, surprises, etc. –No stats: not enough subjects for this.

Think-Aloud: How (cont.) Think-aloud training (in class) Sample think-aloud results: –From VL/HCC'03 (Prabhakararao et al.)

Announcements –Quiz Friday on Evaluation. –Concepts and Prototypes due Mon at 11:59pm. –Midterm on Tuesday.

Statistical Studies We will not do these in this class, but –you need to know some basics. Goal: answer a binary question. –e.g.: does system X help users create animations? –e.g.: are people better debuggers using X than using Y? Advantage: your audience believes it. Disadvantage: you might not find out enough about “why or why not”.

Hypothesis Testing Hypotheses need to be specific, and provable/refutable. –e.g.: “users will debug better using system X than using system Y” –(Strictly speaking we test the “null” hypothesis, which says there won’t be a difference, but it’s a fine point...) –Pick a significance value (rule of thumb: 0.05). If you get a p-value <=0.05, you’ve shown a significant difference, but there’s a 5% chance that the difference is a fluke.
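As a concrete illustration, here is a minimal Python sketch of testing a null hypothesis with an independent-samples t-test; the bug-fix counts are made up for illustration:

```python
from scipy import stats

# Made-up bug-fix counts for two groups: system X vs. system Y.
x = [7, 9, 6, 8, 10, 7, 9, 8]
y = [5, 6, 4, 7, 5, 6, 5, 6]

# Null hypothesis: the two means do not differ. Under the 0.05 rule of
# thumb above, p <= 0.05 is reported as a significant difference.
t_stat, p_value = stats.ttest_ind(x, y)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
```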

Lucid and testable hypothesis State a lucid, testable hypothesis –this is a precise problem statement Example 1: “There is no difference in the number of cavities in children and teenagers using Crest and No-teeth toothpaste when brushing daily over a one-month period.”

Lucid and testable hypothesis Example 2: “There is no difference in user performance (time and error rate) when selecting a single item from a pop-up or a pull-down menu of 4 items, regardless of the subject’s previous expertise in using a mouse or using the different menu types.” [Figure: a pull-down menu and a pop-up menu, each listing File, Edit, View, Insert, New, Open, Close, Save]

Design the Experiment → Identify independent variables (“treatments”) we’ll manipulate: –e.g.: which system they use, X vs. Y? Identify outputs (dependent variables) for the hypotheses: –e.g.: more bugs fixed? –e.g.: fewer minutes to fix same number of bugs? –e.g.: less time to check out?

Independent variables Hypothesis includes the independent variables that are to be altered –the things you manipulate independently of a subject’s behavior –they determine a modification to the conditions the subjects undergo –may arise from subjects being classified into different groups

Independent variables In the toothpaste experiment: –toothpaste type: Crest or No-teeth toothpaste –age: ≤11 years or >11 years In the menu experiment: –menu type: pop-up or pull-down –menu length: 3, 6, 9, 12, 15 items –subject type: expert or novice Crossing these levels yields the experimental conditions (see the sketch below).
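Crossing the independent variables gives every condition a subject could run under. A minimal sketch for the menu experiment (2 menu types × 5 lengths × 2 subject types = 20 conditions):

```python
from itertools import product

menu_types = ["pop-up", "pull-down"]
menu_lengths = [3, 6, 9, 12, 15]
subject_types = ["expert", "novice"]

# Every combination of independent-variable levels is one condition.
conditions = list(product(menu_types, menu_lengths, subject_types))
print(len(conditions))   # 20
print(conditions[0])     # ('pop-up', 3, 'expert')
```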

Design the Experiment Identify independent variables (“treatments”) we’ll manipulate: –e.g.: which system they use, X vs. Y? → Identify outputs (dependent variables) for the hypotheses: –e.g.: more bugs fixed? –e.g.: fewer minutes to fix same number of bugs? –e.g.: less time to check out?

Dependent variables Hypothesis includes the dependent variables that will be measured –variables dependent on the subject’s behavior / reaction to the independent variable –the specific things you set out to quantitatively measure / observe

Dependent variables In the menu experiment: –time to select an item –selection errors made –time to learn to use it to proficiency In the toothpaste experiment: –number of cavities –frequency of brushing –preference

Design the experiment (cont.) Decide on within- vs. between-subjects. –“Within”: 1 group experiences all treatments, in random order. “Within” is best, if possible. (Why?) –“Between”: a different group for each treatment.

Between-Groups Design Wilma and Betty use one interface Dino and Fred use the other

Within-Groups Design Everyone uses both interfaces

Between-Groups vs. Within-Groups Within-groups design –Pros: Is more powerful statistically (can compare the same person across different conditions, thus isolating effects of individual differences) Requires fewer participants than between-groups –Cons: Learning effects (can be reduced by randomizing the order of treatments; see the sketch below) Fatigue effects
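A minimal sketch of the order randomization mentioned above: each participant experiences every treatment, but in an independently shuffled order, so learning effects do not always favor the same interface.

```python
import random

treatments = ["interface A", "interface B"]
participants = ["Wilma", "Betty", "Dino", "Fred"]

for p in participants:
    # random.sample returns a shuffled copy, leaving `treatments` intact.
    order = random.sample(treatments, k=len(treatments))
    print(p, "->", order)
```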

Design the experiment (cont.) How many subjects? –Rule of thumb: 30/treatment. –More subjects → more statistical power → more likely to get p<=0.05 if there really is a difference. (A power-analysis sketch follows.)
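The 30/treatment rule of thumb can be checked with a power analysis. A sketch using statsmodels, assuming a medium effect size (Cohen's d = 0.5); the effect size is an assumption you must supply, not something the tool knows:

```python
from statsmodels.stats.power import TTestIndPower

# Subjects per group needed to detect a medium effect (d = 0.5, assumed)
# at alpha = 0.05 with 80% power, for a two-sample t-test.
n_per_group = TTestIndPower().solve_power(effect_size=0.5, alpha=0.05, power=0.8)
print(round(n_per_group))  # ~64 per group, more than the 30/treatment rule of thumb
```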

StratCell [Grigoreanu et al. ‘10] A spreadsheet system with a debugging aid Target users: spreadsheet users

Control: 11, Treatment: 5

65 males, 67 females; 60+ users in each treatment group

Design the experiment (cont.) Design the task they will do. –Since you usually run a lot of these at once and you’re comparing them, you need to be careful with length: Long enough to get over the learning curve. Big enough to be convincing. Small enough to be do-able in the amount of time subjects have to spend with you. –Vary the order if there are multiple tasks (see the Latin-square sketch below).
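One standard way to vary task order is a Latin square: across participants, every task appears in every position equally often. A minimal sketch (the cyclic construction shown is the simplest form):

```python
def latin_square(tasks):
    """Each row is one participant's task order; every task appears
    exactly once in each position across rows."""
    n = len(tasks)
    return [[tasks[(row + col) % n] for col in range(n)] for row in range(n)]

for order in latin_square(["task A", "task B", "task C"]):
    print(order)
# ['task A', 'task B', 'task C']
# ['task B', 'task C', 'task A']
# ['task C', 'task A', 'task B']
```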

Design the experiment (cont.) Develop the tutorial. –Practice it like crazy! (Must work the same for everyone!) –Example (see mashup study tutorial) Plan the data to gather (a minimal logging sketch follows). –Log files? –Questionnaires before/after? –Saved result files at end?
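For the log files, even a tiny timestamped event log is enough to recover times and error counts later. A minimal sketch; the file name and event names are illustrative:

```python
import csv
import time

with open("session_P01.csv", "w", newline="") as f:
    log = csv.writer(f)
    log.writerow(["seconds", "event"])
    start = time.perf_counter()
    # In a real session these writes would be driven by UI callbacks.
    for event in ["task_start", "menu_open", "item_selected", "task_end"]:
        log.writerow([round(time.perf_counter() - start, 3), event])
```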

Designing the experiment (cont.) Water in the beer: –Sources of uncontrolled variation spoil your statistical power. Sources: –Too much variation in subject background. –Not a good enough tutorial. –Task not a good match for what you wanted to find out. –Etc. Result: no significant difference.

Finally, Analyze the Data Choose an appropriate statistical test; the right choice depends on your design (see the sketch below). –Entire courses exist on this, e.g., ST516 Method of Analysis. Run it –using stats software packages, e.g., R, SPSS, Excel. Hope for p<=0.05 (5% chance the difference is a fluke). Summary: –Statistical studies are a lot of work, too much to do in this class! –Right choice for answering X>Y questions.
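Which test is appropriate depends on the design: the between-subjects sketch earlier used an independent-samples t-test, while a within-subjects design pairs each participant's scores and calls for a paired test. A minimal sketch with made-up completion times:

```python
from scipy import stats

# Each participant's completion time (seconds) on both interfaces,
# so the scores are paired. Values are made up for illustration.
time_a = [12.1, 9.8, 14.0, 11.2, 10.5, 13.3]
time_b = [10.4, 9.1, 12.2, 10.8, 9.9, 11.7]

t_stat, p_value = stats.ttest_rel(time_a, time_b)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
```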

Statistical vs. practical significance When n (sample size) is large, even a trivial difference may show up as a statistically significant result –e.g., menu choice: mean selection time of menu A is 3.00 seconds; menu B is 3.05 seconds Statistical significance does not imply that the difference is important! –a matter of interpretation –statistical significance is often abused and used to misinform
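An effect size such as Cohen's d separates practical from statistical significance. A sketch using the slide's menu numbers; the pooled standard deviation (0.5 s) is an assumed value purely for illustration:

```python
# Cohen's d = (mean difference) / (pooled standard deviation).
mean_a, mean_b = 3.00, 3.05   # means from the slide
pooled_sd = 0.5               # assumed for illustration
d = (mean_b - mean_a) / pooled_sd
print(round(d, 2))  # 0.1 -- a tiny effect, even if a huge n makes p "significant"
```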

The rest of the course Continue to work on your project… (prototype in CogTool, eval plan, final prototype) Introduction to various projects/research, including but not limited to: –Foundations and strategies (e.g., surprise-explain-reward, interruption, information foraging) –Gender issues in software environments –Studies of designers –Usability engineering for programmers –Designing for special populations, e.g., seniors, amnesia, … Extra credit opportunities –Presenting papers from the above areas