Experiments Testing hypotheses…

Agenda Homework assignment Review evaluation planning Observation continued Empirical studies In-class practice

Recall evaluation distinctions Form of results obtained Quantitative Qualitative Who is experimenting with the design End users HCI experts Approach Experimental Naturalistic Predictive

Evaluation techniques Predictive modeling Questionnaire Experiments Heuristic evaluation Cognitive walkthrough Think aloud (protocol analysis) Interviews Experience Sampling Focus Groups

Recall: Observation Watch users perform tasks Verbal protocols: Think-aloud, cooperative Data collection: Paper and pencil, audio, video Software logs

Analysis Many approaches Task based How do users approach the problem What problems do users have Need not be exhaustive, look for interesting cases Performance based Frequency and timing of actions, errors, task completion, etc. Very time consuming!!

Usability Lab Large viewing area through this one-way mirror, which includes an angled sheet of glass that improves light capture and prevents sound transmission between rooms. Doors for the participant and observation rooms are located such that participants are unaware of observers' movements in and out of the observation room.

Observation Room State-of-the-art observation room equipped with three monitors to view participant, participant's monitor, and composite picture in picture. One-way mirror plus angled glass captures light and isolates sound between rooms. Comfortable and spacious for three people, but room enough for six seated observers. Digital mixer for unlimited mixing of input images and recording to VHS, SVHS, or MiniDV recorders.

Experiments Design the experiment to collect the data to test the hypotheses to evaluate the interface to refine the design Generally quantitative, experimental, with end users. See 14.4

Conducting an Experiment Determine the TASK Determine the performance measures Develop the experiment IRB approval Recruit participants Collect the data Inspect & analyze the data Draw conclusions to resolve design problems Redesign and implement the revised interface

The Task Benchmark tasks - gather quantitative data Representative tasks - add breadth, can help understand process Tell them what to do, not how to do it Issues: Lab testing vs. field testing Validity - typical users; typical tasks; typical setting? Run pilot versions to shake out the bugs

“Benchmark” Tasks Specific, clearly stated task for users to carry out Example (e-mail handler): “Find the message from Mary and reply with a response of ‘Tuesday morning at 11’.” Users perform these under a variety of conditions and you measure performance

Defining Performance Based on the task Specific, objective measures/metrics Examples: Speed (reaction time, time to complete) Accuracy (errors, hits/misses) Production (number of files processed) Score (number of points earned) …others…? Preference, satisfaction, etc. (i.e. questionnaire response) are also valid measurements
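
A minimal sketch of how such measures might be logged during a session, assuming a Python test harness; run_trial and the stand-in task are hypothetical names, not something from the slides.

```python
import time

def run_trial(participant, condition, do_task):
    """Time one task attempt and record the errors the observer counted."""
    start = time.time()
    errors = do_task()                 # hypothetical task function returning an error count
    elapsed = time.time() - start
    return {"participant": participant,
            "condition": condition,
            "time_s": round(elapsed, 2),   # speed: time to complete
            "errors": errors}              # accuracy: error count

# Example usage with a stand-in task that makes no errors
print(run_trial("P1", "color", lambda: 0))
```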

Types of Variables Independent What you’re studying, what you intentionally vary (e.g., interface feature, interaction device, selection technique) Dependent Performance measures you record or examine (e.g., time, number of errors)

“Controlling” Variables Prevent a variable from affecting the results in any systematic way Methods of controlling for a variable: Don’t allow it to vary e.g., all males Allow it to vary randomly e.g., randomly assign participants to different groups Counterbalance - systematically vary it e.g., equal number of males, females in each group The appropriate option depends on circumstances
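
A minimal sketch of the last two options (random assignment plus counterbalancing gender), assuming a between-subjects study with two display conditions; the participant list and condition names are invented for illustration.

```python
import random

# Hypothetical participant pool, tagged with gender so it can be counterbalanced
participants = [("P1", "F"), ("P2", "M"), ("P3", "F"), ("P4", "M"),
                ("P5", "F"), ("P6", "M"), ("P7", "F"), ("P8", "M")]

random.seed(42)  # fixed seed so the assignment can be reproduced

assignment = {"color": [], "bw": []}
for gender in ("F", "M"):
    subgroup = [p for p, g in participants if g == gender]
    random.shuffle(subgroup)                      # vary randomly within the stratum
    for i, p in enumerate(subgroup):              # deal alternately so each group
        group = "color" if i % 2 == 0 else "bw"   # gets an equal number of each gender
        assignment[group].append(p)

print(assignment)
```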

Hypotheses What you predict will happen More specifically, the way you predict the dependent variable (e.g., accuracy) will depend on the independent variable(s) “Null” hypothesis (H0) Stating that there will be no effect e.g., “There will be no difference in performance between the two groups” Data are used to try to disprove this null hypothesis

Example Do people complete operations faster with a black-and-white display or a color one? Independent - display type (color or b/w) Dependent - time to complete task (minutes) Controlled variables - same number of males and females in each group Hypothesis: Time to complete the task will be shorter for users with color display H0: Time(color) = Time(b/w) Note: Within/between design issues

Experimental Designs Within Subjects Design Every participant provides a score for all levels or conditions

      Color      B/W
P1    12 secs.   17 secs.
P2    19 secs.   15 secs.
P3    13 secs.   21 secs.
...

Experimental Designs Between Subjects Each participant provides results for only one condition

      Color            B/W
P1    12 secs.    P2   17 secs.
P3    19 secs.    P5   15 secs.
P4    13 secs.    P6   21 secs.
...
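
To make the two layouts concrete, a hedged sketch (assuming pandas) of how the timings above would be recorded in each design; the column names are illustrative only.

```python
import pandas as pd

# Within-subjects: every participant contributes a time in *both* conditions
within = pd.DataFrame({
    "participant": ["P1", "P2", "P3"],
    "color_secs":  [12, 19, 13],
    "bw_secs":     [17, 15, 21],
})

# Between-subjects: each participant appears once, in a single condition
between = pd.DataFrame({
    "participant": ["P1", "P3", "P4", "P2", "P5", "P6"],
    "condition":   ["color"] * 3 + ["bw"] * 3,
    "secs":        [12, 19, 13, 17, 15, 21],
})

print(within)
print(between.groupby("condition")["secs"].mean())
```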

Within vs. Between What are the advantages and disadvantages of the two techniques?

Within Subjects Designs More efficient: Each subject gives you more data - they complete more “blocks” or “sessions” More statistical “power”: Each person is their own control Therefore, can require fewer participants May mean more complicated design to avoid “order effects” e.g. seeing color then b/w may be different from seeing b/w then color
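
A common way to keep those order effects from biasing one condition is to counterbalance presentation order across participants; a minimal sketch with invented names:

```python
from itertools import cycle, permutations

conditions = ["color", "bw"]
orders = list(permutations(conditions))   # ('color', 'bw') and ('bw', 'color')

# Rotate through the orders so each one is used equally often
participants = ["P1", "P2", "P3", "P4", "P5", "P6"]
order_cycle = cycle(orders)
schedule = {p: next(order_cycle) for p in participants}

for p, order in schedule.items():
    print(p, "sees:", " then ".join(order))
```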

Between Subjects Designs Fewer order effects Participant may learn from first condition Fatigue may make second performance worse Simpler design & analysis Easier to recruit participants (only one session) Less efficient

What about subjects? How many? Book advice: at least 10 Other advice: 6 subjects per experimental condition Real advice: depends on statistics Relating subjects and experimental conditions Within/between subjects design
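
“Depends on statistics” usually means a power analysis. A hedged sketch, assuming the statsmodels package and an assumed medium effect size (Cohen's d = 0.5, a guess, not a number from the slides):

```python
from statsmodels.stats.power import TTestIndPower

# Participants needed per group for a two-sample t-test (between-subjects design),
# assuming a medium effect (d = 0.5), alpha = .05, and 80% power.
n_per_group = TTestIndPower().solve_power(effect_size=0.5, alpha=0.05, power=0.8)
print(round(n_per_group))   # roughly 64 per group under these assumptions
```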

Now What…? You’ve got your task, performance measures, experimental design, etc. You have hypotheses about what will happen in the experiment Now you need to gather the data Assign participants so that your variables are properly controlled Same gathering techniques as observation – emphasis on recording and logging

Data Inspection Look at the results First look at each participant’s data Were there outliers, people who fell asleep, anyone who tried to mess up the study, etc.? Then look at aggregate results and descriptive statistics

Inspecting Your Data “What happened in this study?” Keep in mind the goals and hypotheses you had at the beginning Questions: Overall, how did people do? “5 W’s” (Where, what, why, when, and for whom were the problems?)

Descriptive Statistics For all variables, get a feel for results: Total scores, times, ratings, etc. Minimum, maximum Mean, median, ranges, etc. What is the difference between mean & median? Why use one or the other? e.g. “Twenty participants completed both sessions (10 males, 10 females; mean age 22.4, range years).” e.g. “The median time to complete the task in the mouse-input group was 34.5 s (min=19.2, max=305 s).”
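
A small sketch of that first pass using only Python's standard library; the completion times are invented, with one deliberate outlier to show why mean and median can differ.

```python
import statistics

# Hypothetical task-completion times (seconds) for one group
times = [19.2, 28.4, 31.0, 34.5, 36.1, 40.2, 44.8, 52.3, 61.0, 305.0]

print("n      =", len(times))
print("min    =", min(times), " max =", max(times))
print("mean   =", round(statistics.mean(times), 1))   # pulled upward by the 305 s outlier
print("median =", statistics.median(times))           # much less affected by that outlier
```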

Subgroup Stats Look at descriptive stats (means, medians, ranges, etc.) for any subgroups e.g. “The mean error rate for the mouse-input group was 3.4%. The mean error rate for the keyboard group was 5.6%.” e.g. “The median completion time (in seconds) for the three groups were: novices: 4.4, moderate users: 4.6, and experts: 2.6.”
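
The same idea broken out by subgroup, sketched with pandas; the values are invented except that the medians were chosen to match the ones quoted above.

```python
import pandas as pd

results = pd.DataFrame({
    "group":  ["novice"] * 3 + ["moderate"] * 3 + ["expert"] * 3,
    "time_s": [4.4, 4.9, 4.1, 4.6, 5.0, 4.3, 2.6, 2.9, 2.4],
    "errors": [2, 3, 1, 2, 1, 2, 0, 1, 0],
})

# Median completion time and mean error count per experience level
print(results.groupby("group")["time_s"].median())
print(results.groupby("group")["errors"].mean())
```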

Plot the Data Look for the trends graphically

Other Presentation Methods Box plot (median, middle 50%, and low/high range of time in secs.) Scatter plot (e.g., time in secs. vs. age)
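
A quick sketch of both plot types, assuming matplotlib; the times and ages are invented.

```python
import matplotlib.pyplot as plt

times = [19, 22, 25, 28, 30, 31, 34, 36, 41, 55]   # completion times (secs.)
ages  = [21, 34, 27, 45, 52, 23, 38, 60, 29, 48]   # participant ages

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 3))

# Box plot: median, middle 50% (the box), and low/high whiskers of the times
ax1.boxplot(times)
ax1.set_ylabel("Time in secs.")
ax1.set_title("Box plot")

# Scatter plot: is completion time related to age?
ax2.scatter(ages, times)
ax2.set_xlabel("Age")
ax2.set_ylabel("Time in secs.")
ax2.set_title("Scatter plot")

plt.tight_layout()
plt.show()
```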

Experimental Results How does one know if an experiment’s results mean anything or confirm any beliefs? Example: 40 people participated, 28 preferred interface 1, 12 preferred interface 2 What do you conclude?

Inferential (Diagnostic) Stats Tests to determine if what you see in the data (e.g., differences in the means) are reliable (replicable), and if they are likely caused by the independent variables, and not due to random effects e.g. t-test to compare two means e.g. ANOVA (Analysis of Variance) to compare several means e.g. test “significance level” of a correlation between two variables
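
Hedged sketches of the three tests just named, using scipy.stats; all of the sample data are invented.

```python
from scipy import stats

# t-test: compare two means (color vs. b/w completion times)
color = [12, 19, 13, 15, 14, 16]
bw    = [17, 15, 21, 20, 19, 18]
t, p = stats.ttest_ind(color, bw)
print(f"t = {t:.2f}, p = {p:.3f}")

# One-way ANOVA: compare several means (three experience groups)
novice, moderate, expert = [4.4, 4.9, 4.1], [4.6, 5.0, 4.3], [2.6, 2.9, 2.4]
f, p = stats.f_oneway(novice, moderate, expert)
print(f"F = {f:.2f}, p = {p:.3f}")

# Correlation between two variables (age and completion time)
ages  = [21, 34, 27, 45, 52, 23]
times = [12, 19, 13, 20, 22, 14]
r, p = stats.pearsonr(ages, times)
print(f"r = {r:.2f}, p = {p:.3f}")
```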

Means Not Always Perfect

Experiment 1    Group 1: 1, 10, 10 (mean: 7)    Group 2: 3, 6, 21 (mean: 10)
Experiment 2    Group 1: 6, 7, 8 (mean: 7)      Group 2: 8, 11, 11 (mean: 10)
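
Running a t-test on those two data sets shows why the means alone are not enough; the numbers are exactly the ones above, and scipy is assumed to be available.

```python
from scipy import stats

# Experiment 1: means of 7 vs. 10, but very noisy data
t1, p1 = stats.ttest_ind([1, 10, 10], [3, 6, 21])

# Experiment 2: the same means, far less variability
t2, p2 = stats.ttest_ind([6, 7, 8], [8, 11, 11])

print(f"Experiment 1: t = {t1:.2f}, p = {p1:.2f}")   # large p: the difference could easily be chance
print(f"Experiment 2: t = {t2:.2f}, p = {p2:.2f}")   # much smaller p: the difference is more convincing
```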

Inferential Stats and the Data Ask diagnostic questions about the data Are these really different? What would that mean?

Hypothesis Testing Recall: We set up a “null hypothesis” e.g. there should be no difference between the completion times of the three groups Or H0: Time(Novice) = Time(Moderate) = Time(Expert) Our real hypothesis was, say, that experts should perform more quickly than novices

Hypothesis Testing “Significance level” (p): The probability of seeing a difference this large just by chance if the null hypothesis were true In other words, a small p means the difference you observed is unlikely to be due to chance alone The cutoff or threshold level of p (“alpha” level) is often set at 0.05, i.e. no more than a 5% chance of getting the result you saw just by chance e.g. If your statistical t-test (testing the difference between two means) returns a t-value of t=4.5 and a p-value of p=.01, the difference between the means is statistically significant

Errors Errors in analysis do occur Main Types: Type I/False positive - You conclude there is a difference, when in fact there isn’t Type II/False negative - You conclude there is no difference when there is And then there’s the True Negative…

Drawing Conclusions Make your conclusions based on the descriptive stats, but back them up with inferential stats e.g., “The expert group performed faster than the novice group, t(1,34) = 4.6, p <.01.” Translate the stats into words that regular people can understand e.g., “Thus, those who have computer experience will be able to perform better, right from the beginning…”

Feeding Back Into Design What were the conclusions you reached? How can you improve on the design? What are quantitative benefits of the redesign? e.g. 2 minutes saved per transaction, which means 24% increase in production, or $45,000,000 per year in increased profit What are qualitative, less tangible benefit(s)? e.g. workers will be less bored, less tired, and therefore more interested --> better cust. service

Example: Heather’s simple experiment Designing interface for categorizing keywords in a transcript Wanted baseline for comparison Experiment comparing: Pen and paper, not real time Pen and paper, real time Simulated interface, real time

Experiment Hypothesis: fewer keywords in real time, fewer with transcript Independent variables: Time, accuracy of transcript Dependent variables: Number of keywords of each category Controlling variables: Gender, experience, etc. Between subjects design 1 hour, mentally intensive task

Results Table of tagging rates for domain-specific tags, domain-independent tags, and conversation tags across four conditions: overall non-real-time rate, comparable non-real-time rate, real-time rate, and error + delay rate.

Example: Icon design – compare learnability of naturalistic vs. abstract icons Hypothesis? Variables? Tasks? Method? Data analysis?

Example: IM voice chat Make an evaluation plan (observation and/or experiments) Consider: Tasks What data you want to gather How you would gather What analysis you would do after