Observation & Experiments Watch, listen, and learn…

Observing Users
Not as easy as you think.
One of the best ways to gather feedback about your interface: watch, listen, and learn as a person interacts with your system.
Qualitative & quantitative; end users; experimental or naturalistic.

Observation
Direct
- In the same room
- Can be intrusive; users are aware of your presence
- You only see it one time
- May use a 1-way mirror to reduce intrusiveness
Indirect
- Video or app recording
- Reduces intrusiveness, but doesn't eliminate it
- Cameras focused on screen, face & keyboard
- Gives an archival record, but you can spend a lot of time reviewing it

Location
Observations may be:
In lab (maybe a specially built usability lab)
- Easier to control
- Can have the user complete a set of tasks
In field
- Watch their everyday actions
- More realistic
- Harder to control other factors

Observation Room State-of-the-art observation room equipped with three monitors to view participant, participant's monitor, and composite picture in picture. One-way mirror plus angled glass captures light and isolates sound between rooms. Comfortable and spacious for three people, but room enough for six seated observers. Digital mixer for unlimited mixing of input images and recording to VHS, SVHS, or MiniDV recorders.

Task Selection What tasks are people performing? Representative and realistic? Tasks dealing with specific parts of the interface you want to test? Problematic tasks? Don’t forget to pilot your entire evaluation!!

Engaging Users in Evaluation
What's going on in the user's head? Use a verbal protocol where users describe their thoughts.
Qualitative techniques:
- Think-aloud - can be very helpful
- Post-hoc verbal protocol - review video
- Critical incident logging - positive & negative
- Structured interviews - good questions: “What did you like best/least?” “How would you change..?”

Think Aloud
User describes verbally what s/he is thinking and doing:
- What they believe is happening
- Why they take an action
- What they are trying to do
Widely used, popular protocol.
Potential problems:
- Can be awkward for the participant
- Thinking aloud can modify the way the user performs the task

Cooperative approach Another technique: Co-discovery learning (Constructive iteration) Join pairs of participants to work together Use think aloud Perhaps have one person be semi-expert (coach) and one be novice More natural (like conversation) so removes some awkwardness of individual think aloud Variant: let coach be from design team (cooperative evaluation)

Alternative What if thinking aloud during session will be too disruptive? Can use post-event protocol User performs session, then watches video afterwards and describes what s/he was thinking Sometimes difficult to recall Opens up door of interpretation

What if a user gets stuck? Decide ahead of time what you will do. Offer assistance or not? What kind of assistance? You can ask (in cooperative evaluation) “What are you trying to do..?” “What made you think..?” “How would you like to perform..?” “What would make this easier to accomplish..?” Maybe offer hints This is why cooperative approaches are used

Inputs
Need an operational prototype (could use a Wizard of Oz simulation).
Need tasks and descriptions:
- Reflect real tasks
- Avoid choosing only tasks that your design best supports
- Minimize necessary background knowledge
- Pay attention to time and training required

Capturing a Session
1. Paper & pencil
- Can be slow
- May miss things
- Is definitely cheap and easy
Example log sheet:
Time    Task
10:00   Task 1
10:03   Task 2
10:08   Task 3
10:22   …

Capturing a Session 2. Recording (screen, audio and/or video) Good for think-aloud Multiple cameras may be needed Good, rich record of session Can be intrusive Can be painful to transcribe and analyze

Recording 2b: With usability software (such as Morae) Combines audio/video, screen, mouse and keyboard clicks, window opening/closing, etc. Synchronizes everything Allows for annotations Can create clips or presentations

Capturing a Session 3. Software logging Modify software to log user actions Can give time-stamped key press or mouse event Two problems: Too low-level, want higher level events Massive amount of data, need analysis tools
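
A minimal sketch of what such instrumentation might look like, assuming a Python application; the log path, event names, and pipe-delimited format are invented for illustration, not a real logging API:

```python
import time

LOG_PATH = "session.log"  # hypothetical output file

def log_event(user, component, action, *details):
    """Append one time-stamped, pipe-delimited user action to the session log."""
    fields = [f"{time.time():.3f}", user, component, action, *map(str, details)]
    with open(LOG_PATH, "a") as log:
        log.write("|".join(fields) + "\n")

# Example calls sprinkled through the UI code:
log_event("p01", "MV", "START")
log_event("p01", "MV", "TAB", "AGENDA")
log_event("p01", "MV", "SEEK", "PRESENTATION", 566, 604189)
```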

Example logs
|hrichter| |MV|START|
|hrichter| |MV|QUESTION|false|false|false|false|false|false|
|hrichter| |MV|TAB|AGENDA
|hrichter| |MV|TAB|PRESENTATION
|hrichter| |MV|SLIDECHANGE|
|hrichter| |MV|SEEK|PRESENTATION-A|566|604189|
|hrichter| |MV|SEEK|PRESENTATION-A|566|604189|
|hrichter| |MV|SEEK|PRESENTATION-A|566|604189|
|hrichter| |MV|TAB|AGENDA
|hrichter| |MV|SEEK|AGENDA|566|149613|
|hrichter| |MV|TAB|PRESENTATION
|hrichter| |MV|SLIDECHANGE|
|hrichter| |MV|SEEK|PRESENTATION|566|315796|
|hrichter| |MV|PLAY|566|
|hrichter| |MV|TAB|AV
|hrichter| |MV|TAB|PRESENTATION
|hrichter| |MV|SLIDECHANGE|
|hrichter| |MV|SEEK|PRESENTATION|566|271191|
|hrichter| |MV|TAB|AV
|hrichter| |MV|TAB|PRESENTATION
|hrichter| |MV|TAB|AGENDA
|hrichter| |MV|TAB|PRESENTATION
|hrichter| |MV|TAB|AV
|hrichter| |MV|TAB|AGENDA
|hrichter| |MV|TAB|AV
|hrichter| |MV|STOP|566|
|hrichter| |MV|END
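
To turn such low-level events into something analyzable, a small parser can aggregate them into higher-level counts. This sketch assumes the pipe-delimited format from the logging example above (timestamp, user, component, action, details), not the exact format of this log excerpt:

```python
from collections import Counter

def summarize_log(path):
    """Count how often each action type (TAB, SEEK, PLAY, ...) occurs in a session log."""
    actions = Counter()
    with open(path) as log:
        for line in log:
            fields = line.strip().split("|")
            if len(fields) >= 4:
                actions[fields[3]] += 1  # 4th field is the action name
    return actions

# e.g. Counter({'TAB': ..., 'SEEK': ..., 'SLIDECHANGE': ...})
print(summarize_log("session.log"))
```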

Analysis Many approaches Task based How do users approach the problem What problems do users have Need not be exhaustive, look for interesting cases Performance based Frequency and timing of actions, errors, task completion, etc. Can be very time consuming!!

Experiments Testing hypotheses…

Experiments Test hypotheses in your design More controlled examination than just observation Generally quantitative, experimental, with end users.

Types of Variables
Independent: what you're studying, what you intentionally vary (e.g., interface feature, interaction device, selection technique, design)
Dependent: performance measures you record or examine (e.g., time, number of errors)
Controlled: what you do not want to affect your study

“Controlling” Variables
Prevent a variable from affecting the results in any systematic way.
Methods of controlling for a variable:
- Don't allow it to vary (e.g., all males)
- Allow it to vary randomly (e.g., randomly assign participants to different groups)
- Counterbalance - systematically vary it (e.g., equal number of males and females in each group)
The appropriate option depends on circumstances.

Hypotheses
What you predict will happen; more specifically, the way you predict the dependent variable (e.g., accuracy) will depend on the independent variable(s).
“Null” hypothesis (H0)
- States that there will be no effect, e.g., “There will be no difference in performance between the two groups”
- Data are used to try to disprove this null hypothesis

Defining Performance Based on the task Specific, objective measures/metrics Examples: Speed (reaction time, time to complete) Accuracy (errors, hits/misses) Production (number of files processed) Score (number of points earned) …others…? Preference, satisfaction, etc. (i.e. questionnaire response) are also valid measurements

Example
Do people complete operations faster with a black-and-white display or a color one?
- Independent variable - display type (color or b/w)
- Dependent variable - time to complete task (minutes)
- Controlled variables - same number of males and females in each group
Hypothesis: time to complete the task will be shorter for users with the color display.
H0: Time(color) = Time(b/w)

What about subjects?
How many?
- Book advice: at least 10
- Other advice: 6 subjects per experimental condition
- Real advice: it depends on the statistics
Relating subjects and experimental conditions: within- or between-subjects design.

Experimental Designs
Within Subjects Design: every participant provides a score for all levels or conditions.

       Color      B/W
P1     12 secs.   17 secs.
P2     19 secs.   15 secs.
P3     13 secs.   21 secs.
...

Experimental Designs
Between Subjects: each participant provides results for only one condition.

       Color            B/W
P1     12 secs.    P2   17 secs.
P3     19 secs.    P5   15 secs.
P4     13 secs.    P6   21 secs.
...

Within Subjects Designs More efficient: Each subject gives you more data - they complete more “blocks” or “sessions” More statistical “power”: Each person is their own control Therefore, can require fewer participants May mean more complicated design to avoid “order effects” e.g. seeing color then b/w may be different from seeing b/w then color

Between Subjects Designs Fewer order effects Participant may learn from first condition Fatigue may make second performance worse Simpler design & analysis Easier to recruit participants (only one session, less time) Less efficient

Now What…? Performed initial data inspection Removed outliers, have general idea what occurred Descriptive Statistics Totals, Averages, Ranges, etc. Subgroup Statistics Inferential Analysis to test hypotheses

Descriptive Statistics
For all variables and subgroups, get a feel for the results:
- Total scores, times, ratings, etc.
- Minimum, maximum
- Mean, median, ranges, etc.
What is the difference between mean & median? Why use one or the other?
e.g. “Twenty participants completed both sessions (10 males, 10 females; mean age 22.4, range … years).”
e.g. “The median time to complete the task in the mouse-input group was 34.5 s (min=19.2, max=305 s). The median time to complete the task in the keyboard-input group was 32.1 s (min=17.6, max=286 s).”
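
A small sketch of computing these descriptive statistics in Python; the timing data below are invented purely to show why one slow trial separates the mean from the median:

```python
import statistics

# Hypothetical task-completion times in seconds for one condition
times = [19.2, 28.7, 31.4, 34.5, 36.0, 41.3, 305.0]

print("mean  :", round(statistics.mean(times), 1))   # pulled up by the 305 s trial
print("median:", statistics.median(times))           # robust to the one slow trial
print("min   :", min(times))
print("max   :", max(times))
```

This skew is why medians (with min/max) are often reported for task times, as in the examples above.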

Inferential Stats and the Data Are these really different? What would that mean?

Goal of analysis
Get >95% confidence in the significance of the result; that is, the null hypothesis is disproved:
H0: Time(color) = Time(b/w)
In other words, there is an effect; put another way, there is only a 1 in 20 chance that the observed difference occurred due to random chance.

Hypothesis Testing
Tests to determine differences:
- t-test to compare two means
- ANOVA (Analysis of Variance) to compare several means
Need to determine “statistical significance”.
“Significance level” (p): the probability of getting a difference at least this large by chance alone, if the null hypothesis were true.
The alpha level (the threshold for p) is often set at 0.05, i.e. a result is called significant if it would occur by chance less than 5% of the time.
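
A sketch of both tests using SciPy; the completion times are made-up numbers, not results from any real study:

```python
from scipy import stats

# Hypothetical completion times (seconds) for two display conditions
color = [12, 19, 13, 16, 14, 18]
bw    = [17, 15, 21, 20, 19, 22]

# t-test: compares the two means
t, p = stats.ttest_ind(color, bw)
print(f"t = {t:.2f}, p = {p:.3f}")  # p < 0.05 would let us reject the null hypothesis

# One-way ANOVA: compares several means at once (adding a third hypothetical group)
gray = [14, 16, 15, 17, 18, 16]
f, p = stats.f_oneway(color, bw, gray)
print(f"F = {f:.2f}, p = {p:.3f}")
```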

Feeding Back Into Design
What were the conclusions you reached? How can you improve on the design?
What are the quantitative benefits of the redesign?
- e.g. 2 minutes saved per transaction, which means a 24% increase in production, or $45,000,000 per year in increased profit
What are the qualitative, less tangible benefits?
- e.g. workers will be less bored and less tired, and therefore more interested --> better customer service

Example: Resolution Effects Task: Given a filename, locate the icon with that filename on a desktop scattered with other icons (distractors) Compare task time under 3 resolutions (800 x 600, 1024 x 768, 1280 x 1024) Control for: Number of distractors (same) Icon shape/size (all text document icons) Position of target (randomized) Etc.

Resolution Experiment Task: Cursor starts in middle of screen. Filename flashes at top of screen, user has to find the file and click on it once with cursor Hypothesis: time to select target icon increases with higher resolution (because icons and font are smaller) Independent variables: 3 resolution settings Dependent variables: Time to click on target Controlling variables: Experience, # distractors, size/shape, target position, distance from start, direction to target

Resolution Experiment Within-Participants design 1 hour, 3 blocks of 15 minutes each, 100 trials in each condition 24 participants, split into 6 groups of 4 people Why 6 groups? 3 conditions, and 3! = 6 User preference questionnaire, ratings

Group   Block 1   Block 2   Block 3
A       Res 1     Res 2     Res 3
B       Res 1     Res 3     Res 2
C       Res 2     Res 3     Res 1
D       Res 2     Res 1     Res 3
E       Res 3     Res 2     Res 1
F       Res 3     Res 1     Res 2

Res 1: 800 x 600   Res 2: 1024 x 768   Res 3: 1280 x 1024
3! = 3 * 2 * 1 = 6, so there are six possible orderings of the blocks. Divide the 24 participants evenly among these 6 groups.
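
One way to generate these orderings and the group assignment programmatically; a sketch in Python, with the participant IDs and the shuffling scheme invented for illustration:

```python
import itertools
import random

resolutions = ["800x600", "1024x768", "1280x1024"]

# All 3! = 6 possible block orderings
orderings = list(itertools.permutations(resolutions))

# Assign 24 participants evenly across the 6 orderings (4 per group), in random order
participants = [f"P{i:02d}" for i in range(1, 25)]
random.shuffle(participants)
assignment = {p: orderings[i % 6] for i, p in enumerate(participants)}

for p in sorted(assignment):
    print(p, "->", assignment[p])
```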

Analysis Process (Within-Participants)
1. Look at descriptive statistics: average trial time, user preferences on the questionnaire, etc.
2. Remove outliers (any trial with a measure greater than 2 standard deviations from the mean); see the sketch after this list.
3. Check the null hypothesis for your control variables to ensure no effects exist.
4. Check for learning effects. Make sure the counterbalancing worked! Do an ANOVA of trial time vs. group; it should show no effect.
5. Do an ANOVA of trial time vs. resolution to test the significance of the findings. Hopefully there is an effect here!
6. Do post-hoc pairwise tests.
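
Step 2 might look like the following sketch; the 2-standard-deviation cutoff comes from the list above, while the trial times themselves are invented:

```python
import statistics

def drop_outliers(trials, k=2.0):
    """Remove trials more than k standard deviations from the mean."""
    mean = statistics.mean(trials)
    sd = statistics.stdev(trials)
    return [t for t in trials if abs(t - mean) <= k * sd]

# Hypothetical per-trial times (seconds); the 9.8 s trial is more than 2 SDs out
times = [1.2, 1.4, 1.3, 1.5, 9.8, 1.1, 1.6]
print(drop_outliers(times))  # [1.2, 1.4, 1.3, 1.5, 1.1, 1.6]
```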

Results (trial time in seconds)
Mean trial time per group (Group A, Group B, …) under 800 x 600, 1024 x 768, and 1280 x 1024.
Hypothesis: appears to be supported…
NOTE: These are MADE UP numbers. I have not run this study!

ANOVAs
ANOVA for mean trial time by group (source, degrees of freedom, F-ratio, probability):
- The low F-ratio (less than 1) means not much effect of group on trial time.
- The high probability indicates that whatever relationship was found had a 63% chance of being found in random data.
ANOVA for mean trial time by resolution (source, degrees of freedom, F-ratio, probability):
- The higher F-ratio here indicates that resolution does have an effect on trial time (it doesn't tell you what the effect is, exactly).
- The low probability indicates the effect is very unlikely to be found in random data.
Conclusion: the counterbalancing worked, and resolution has a statistically significant effect on selecting a target in this experiment.

Post-Hoc Pairwise Test
Once you've found an effect using an ANOVA, you need to figure out whether each condition is significantly different from the others. You do this with post-hoc tests, such as Tukey HSD:

Conditions                 Difference (seconds)   Probability
800x600 vs. 1024x768       …                      … (Significant)
800x600 vs. 1280x1024      …                      … (Significant)
1024x768 vs. 1280x1024     …                      … (Not Significant)
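
A sketch of a Tukey HSD post-hoc test using scipy.stats.tukey_hsd (available in SciPy 1.8+); the per-condition times below are invented, and statsmodels' pairwise_tukeyhsd would be an equally common choice:

```python
from scipy import stats

# Hypothetical per-participant mean selection times (seconds) per resolution
res_800  = [1.1, 1.0, 1.2, 1.1, 0.9, 1.0]
res_1024 = [1.4, 1.5, 1.3, 1.6, 1.4, 1.5]
res_1280 = [1.5, 1.4, 1.6, 1.5, 1.3, 1.6]

result = stats.tukey_hsd(res_800, res_1024, res_1280)
print(result)         # formatted table of pairwise differences and p-values
print(result.pvalue)  # matrix of pairwise p-values
```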

What could be concluded? (if this data was real…)
In a laboratory setting, target search and selection tasks are faster on 800x600 screens than on 1024x768 or 1280x1024 screens, all other things being equal. There is no statistically significant difference between the two higher resolutions.
What could not be concluded: lower resolution screens are better.
Why not? For other tasks, in other conditions (like different lighting, off-line distractions, etc.), the other resolutions might perform better.

Another Example: Heather’s simple experiment Designing interface for categorizing keywords in a transcript Wanted baseline for comparison Experiment comparing: Pen and paper, not real time Pen and paper, real time Simulated interface, real time

Experiment Hypothesis: fewer keywords in real time, fewer with simulated Independent variables: Time, accuracy of transcript, platform Dependent variables: Number of keywords of each category Controlling variables: Gender, experience, etc. Between subjects design 1 hour, mentally intensive task

Results
Rates of Domain-specific tags, Domain-independent tags, and Conversation tags under the Non-Real Time, Real Time, and Simulated conditions.
For Domain-specific tags, Simulated less than Real Time, p < 0.01
For Domain-independent tags, Simulated less than Real Time, p < 0.01
Hypotheses:
- fewer in Real Time: not supported
- fewer with Simulated: supported for two categories

Example: add video to IM voice chat? Compare voice chat with and without video Plan an experiment: Compare message time or difficulty in communicating or frequency… Consider: Tasks What data you want to gather How you would gather What analysis you would do after

Questions: What are independent variables? What are dependent variables? What could be hypothesis? Between or within subjects? What was controlled? What are the tasks?