How (un)usable is your software?

Evaluation How (un)usable is your software?

Agenda Finish slides from last week Multimodal UIs: Ted Intro to evaluation Experiments

Pen & Mobile dialog Stylus or finger Tradeoffs of each? Pen as a standard mouse (doubleclick?) Variety of platforms Desktop touch screens or input pads (Wacom) Tablet PCs Handheld and Mobile devices Electronic whiteboards Platforms often involve variety of size and other constraints

Mobile devices More common as more platforms available PDA Cell phone Ultra mobile tablets Smaller display (160x160), (320x240) Few buttons, different interactions Free-form ink Soft keyboard Numeric keyboard => text Stroke recognition Hand printing / writing recognition

http://www.intel.com/design/mobile/platform/umpc.htm Palm Z22 handheld http://www.palm.com Ultra-Mobile PC (Samsung) http://www.oqo.com/ http://www.blackberry.com/

Soft Keyboards Common on PDAs and mobile devices Tap on buttons on screen

Soft Keyboard Presents a small diagram of keyboard You click on buttons/keys with pen QWERTY vs. alphabetical Tradeoffs? Alternatives?

Numeric Keypad - T9 Developed by Tegic Communications You press one key per letter of your word, it matches the most likely word, then gives optional choices Faster than multiple presses per key Used in mobile phones http://www.t9.com/
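
The lookup described on this slide can be sketched as a dictionary keyed by digit sequence. This is only an illustration of the idea, not Tegic's implementation: the word list and its frequencies are made up, and `candidates` is a hypothetical helper name.

```python
from collections import defaultdict

# Standard phone keypad letter groups.
KEYPAD = {
    "2": "abc", "3": "def", "4": "ghi", "5": "jkl",
    "6": "mno", "7": "pqrs", "8": "tuv", "9": "wxyz",
}
LETTER_TO_DIGIT = {ch: d for d, letters in KEYPAD.items() for ch in letters}

# Hypothetical word frequencies (assumption); a real system ships a large dictionary.
WORD_FREQ = {"home": 50, "good": 40, "gone": 20, "hood": 5, "hello": 30}


def to_digits(word):
    """Map a word to the key sequence that types it, one press per letter."""
    return "".join(LETTER_TO_DIGIT[ch] for ch in word.lower())


def build_index(word_freq):
    """Group words by key sequence, most frequent word first."""
    index = defaultdict(list)
    for word in word_freq:
        index[to_digits(word)].append(word)
    for words in index.values():
        words.sort(key=lambda w: -word_freq[w])
    return dict(index)


INDEX = build_index(WORD_FREQ)


def candidates(digits):
    """Return the most likely word first, followed by the optional choices."""
    return INDEX.get(digits, [])


print(candidates("4663"))  # four words share this key sequence
```

The first entry plays the role of the "most likely word"; the rest are the optional choices the user can cycle through.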

Cirrin - Stroke Recognition Developed by Jen Mankoff (GT -> Berkeley CS Faculty -> CMU CS Faculty) Word-level unistroke technique UIST ‘98 paper Use stylus to go from one letter to the next ->

Quikwriting - Stroke Recognition Developed by Ken Perlin

Quikwriting Example p l e Said to be as fast as graffiti, but have to learn more http://mrl.nyu.edu/~perlin/demos/Quikwrite2_0.html

Hand Printing / Writing Recognition Recognizing letters and numbers and special symbols Lots of systems (commercial too) English, kanji, etc. Not perfect, but people aren’t either! People - 96% handprinted single characters Computer - >97% is really good OCR (Optical Character Recognition)

Recognition Issues Boxed vs. Free-Form input Sometimes encounter boxes on forms Printed vs. Cursive Cursive is much more difficult Letters vs. Words Cursive is easier to do in words vs individual letters, as words create more context Usually requires existence of a dictionary Real-time vs. off-line

Special Alphabets Graffiti - Unistroke alphabet on Palm PDA What are your experiences with Graffiti? Other alphabets or purposes Gestures for commands

Pen Gesture Commands Might mean delete Insert Paragraph Define a series of (hopefully) simple drawing gestures that mean different commands in a system

Pen Use Modes Often, want a mix of free-form drawing and special commands How does user switch modes? Mode icon on screen Button on pen Button on device

Error Correction Having to correct errors can slow input tremendously Strategies Erase and try again (repetition) When uncertain, system shows list of best guesses (n-best list) Others??

Free-form Ink Ink is the data, take as is Human is responsible for understanding and interpretation Often time-stamped Applications Signature verification Notetaking Electronic whiteboards Sketching

Electronic whiteboards Smartboard and Mimio Can integrate with projection Large surface to interact with Issues? http://www.mimio.com/ http://www.smarttech.com/

Real paper Anoto digital paper and pen technology (http://www.anoto.com/) Issues? Logitech io Digital Writing System http://www.logitech.com/

General Issues – Pen input Who is in control - user or computer Initial training required Learning time to become proficient Speed of use Generality/flexibility/power Special skills - typing Gulf of evaluation / gulf of execution Screen space required Computational resources required

Other interesting interactions Gesture input Specialized hardware, or tracking 3D interaction Stereoscopic displays Virtual reality Immersive displays such as glasses, caves Augmented reality Head trackers and vision based tracking

What’s coming up Upcoming related topics Multimodal UIs: Ted 3D user interfaces: Amy Conversational agents: Evan

When to do evaluation? Summative assess an existing system judge if it meets some criteria Formative assess a system being designed gather input to inform design Summative or formative? Depends on maturity of system how evaluation results will be used Same technique can be used for either

Other distinctions Form of results obtained Quantitative Qualitative Who is experimenting with the design End users HCI experts Approach Experimental Naturalistic Predictive

Evaluation techniques Predictive Evaluation Fitts's law, Hick's law, etc. Observation Think-aloud Cooperative evaluation Watch users perform tasks with your interface Next lecture
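
As a rough illustration of predictive evaluation, the two laws named above can be written as small functions. The formulas are the commonly used forms (Shannon formulation for Fitts's law); the constants `a` and `b` are placeholder assumptions and would normally be fitted to measured data for a given device and user group.

```python
import math


def fitts_time(distance, width, a=0.1, b=0.15):
    """Predicted pointing time in seconds: MT = a + b * log2(D / W + 1).

    a and b are hypothetical constants, not measured values.
    """
    return a + b * math.log2(distance / width + 1)


def hick_time(n_choices, a=0.2, b=0.15):
    """Predicted decision time in seconds: RT = a + b * log2(n + 1).

    a and b are hypothetical constants, not measured values.
    """
    return a + b * math.log2(n_choices + 1)


# Compare two candidate button sizes at the same distance (units cancel out).
small = fitts_time(distance=70, width=10)
large = fitts_time(distance=70, width=20)
print(f"small target: {small:.2f} s, large target: {large:.2f} s")
print(f"menu with 7 items: {hick_time(7):.2f} s")
```

The absolute numbers mean little with made-up constants; the useful part is the comparison between design alternatives.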

More techniques Empirical user studies (experiments) Test hypotheses about your interface Examine dependent variables against independent variables More later… Interviews Questionnaire Focus Groups Get user feedback More next week…

Still more techniques Discount usability techniques Use HCI experts instead of users Fast and cheap method to get broad feedback Heuristic evaluation Several experts examine interface using guiding heuristics (like the ones we used in design) Cognitive Walkthrough Several experts assess learnability of interface for novices In class – two weeks from today

And still more techniques Diary studies Users relate experiences on a regular basis Can write down, call in, etc. Experience Sampling Technique Interrupt users with very short questionnaire on a random-ish basis Good to get idea of regular and long term use in the field (real world)

General Recommendations Identify evaluation goals Include both objective & subjective data e.g. “completion time” and “preference” Use multiple measures, within a type e.g. “reaction time” and “accuracy” Use quantitative measures where possible e.g. preference score (on a scale of 1-7) Note: Only gather the data required; do so with minimum interruption, hassle, time, etc.

Evaluation planning Decide on techniques, tasks, materials What are usability criteria? How much required authenticity? How many people, how long How to record data, how to analyze data Prepare materials – interfaces, storyboards, questionnaires, etc. Pilot the entire evaluation Test all materials, tasks, questionnaires, etc. Find and fix the problems with wording, assumptions Get good feel for length of study

Performing the Study Be well prepared so participant’s time is not wasted Explain procedures without compromising results Session should not be too long, subject can quit anytime Never express displeasure or anger Data to be stored anonymously, securely, and/or destroyed Expect anything and everything to go wrong!! (a little story)

Consent Why important? People can be sensitive about this process and issues Errors will likely be made, participant may feel inadequate May be mentally or physically strenuous What are the potential risks (there are always risks)?

Data Inspection Start just looking at the data Identify issues: Were there outliers, people who fell asleep, anyone who tried to mess up the study, etc.? Identify issues: Overall, how did people do? “5 W’s” (Where, what, why, when, and for whom were the problems?) Compile aggregate results and descriptive statistics

Making Conclusions Where did you meet your criteria? Where didn’t you? What were the problems? How serious are these problems? What design changes should be made? But don’t make things worse… Prioritize and plan changes to the design

Example: Heather’s evaluation Evaluate use of an interface in a realistic task Interface: video + annotated transcript Video was of a requirements gathering session Task was to create a requirements document based on the video H. Richter et al. "An Empirical Investigation of Capture and Access for Software Requirements Activities," in Graphics Interface 2005.

The Interface: TagViewer

The Setup Subjects: 12 CS grad students Task: Watch 1 hour video, take personal notes as desired Return 3-7 days later, asked to create as complete, detailed requirements document as possible in 45 minutes 2 conditions – just video, or the interface to help Recording: Video recorded subject over shoulder Software logs of both video and interface Interview afterwards Kept notes

Some results
Tagging subjects (columns: # Plays, # Seeks, Total Video (min))
Subject 2. 10 20 16.2
Subject 3. 6 7 3.6
Subject 4. 28 13 20.4
Subject 5. 4 7.5
Mean 12 11 11.9
Video-only subjects
Subject 3. 3 10.1 5 23 12.7 17.5 11.4
Condition              Notes       Document     Video/TagViewer
Tagging condition      2.1 (2.0)   25.1 (9.6)   16.9 (11.5)
Video-only condition   8.7 (5.4)   32.4 (5.2)   2.2 (3.5)

Some conclusions Very different use of the video Those with the interface found it useful Those without, didn’t Used the video to clarify details, look for missing information Annotations provided efficient ways to find information, supported a variety of personal strategies Usability pretty good, will need more sophisticated searching for longer videos

Experiments Design the experiment to collect the data to test the hypotheses to evaluate the interface to refine the design A controlled way to determine impact of design parameters on user experience Want results to eliminate possibility of chance Good for comparing things – old system/new system, competitive system/new system, etc.

Experimental Design Determine tasks Need clearly stated, benchmark tasks Determine performance measures Speed (reaction time, time to complete) Accuracy (errors, hits/misses) Production (number of files processed) Score (number of points earned) Preference, satisfaction, etc. (e.g. questionnaire response) also valid Determine variables and hypotheses Validity – typical tasks, typical users? Make sure you compare apples to apples

Types of Variables Independent What you’re studying, what you intentionally vary (e.g., interface feature, interaction device, selection technique) Dependent Performance measures you record or examine (e.g., time, number of errors) Controlled Factors you want to prevent from influencing results Experimental condition = each combined set of independent variables

“Controlling” Variables Prevent a variable from affecting the results in any systematic way Methods of controlling for a variable: Don’t allow it to vary e.g., all males Allow it to vary randomly e.g., randomly assign participants to different groups Counterbalance - systematically vary it e.g., equal number of males, females in each group The appropriate option depends on circumstances
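
Two of the methods above, random assignment and counterbalancing, can be sketched as follows. The participant IDs, the gender labels, and both function names are hypothetical and only serve the illustration.

```python
import random


def random_assignment(participants, conditions, rng):
    """Shuffle participants, then deal them round-robin into the conditions."""
    shuffled = list(participants)
    rng.shuffle(shuffled)
    groups = {c: [] for c in conditions}
    for i, p in enumerate(shuffled):
        groups[conditions[i % len(conditions)]].append(p)
    return groups


def counterbalanced_assignment(participants, attribute, conditions, rng):
    """Balance an attribute across conditions by assigning randomly within each level."""
    groups = {c: [] for c in conditions}
    for level in sorted({attribute[p] for p in participants}):
        members = [p for p in participants if attribute[p] == level]
        for c, assigned in random_assignment(members, conditions, rng).items():
            groups[c].extend(assigned)
    return groups


# Hypothetical participant pool: four women, four men.
PARTICIPANTS = [f"P{i}" for i in range(1, 9)]
GENDER = {p: ("F" if i < 4 else "M") for i, p in enumerate(PARTICIPANTS)}
CONDITIONS = ["color", "b/w"]

groups = counterbalanced_assignment(PARTICIPANTS, GENDER, CONDITIONS, random.Random(1))
for condition, members in groups.items():
    print(condition, members)
```

Each condition ends up with the same number of women and men, while which individual lands in which condition is still left to chance.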

Hypotheses What you predict will happen More specifically, the way you predict the dependent variable (e.g., accuracy) will depend on the independent variable(s) “Null” hypothesis (H0) Stating that there will be no effect e.g., “There will be no difference in performance between the two groups” Data used to try to disprove this null hypothesis

Example Do people complete operations faster with a black-and-white display or a color one? Independent - display type (color or b/w) Dependent - time to complete task (minutes) Controlled variables - same number of males and females in each group Hypothesis: Time to complete the task will be shorter for users with color display H0: Time_color = Time_b/w

Subjects How many? Book advice: at least 10 Other advice: 6 per experimental condition Real advice: depends on statistics Relating subjects and experimental conditions Within/between subjects design The more subjects, the stronger your conclusions will be. If there’s even a small difference between 50 users, less likely that’s by chance than a small difference between 3 users.

Experimental Designs Within Subjects Design Every participant provides a score for all levels or conditions
      Color      B/W
P1    12 secs.   17 secs.
P2    19 secs.   15 secs.
P3    13 secs.   21 secs.
...

Experimental Designs Between Subjects Each participant provides results for only one condition
Color           B/W
P1  12 secs.    P2  17 secs.
P3  19 secs.    P5  15 secs.
P4  13 secs.    P6  21 secs.
...

Within Subjects Designs More efficient: Each subject gives you more data - they complete more “blocks” or “sessions” More statistical “power”: Each person is their own control Therefore, can require fewer participants May mean more complicated design to avoid “order effects” e.g. seeing color then b/w may be different from seeing b/w then color Participant may learn from first condition Fatigue may make second performance worse
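
One common way to keep order effects from biasing a within-subjects comparison is to counterbalance the order of conditions across participants. A minimal sketch, with hypothetical participant IDs and a made-up helper name:

```python
from itertools import permutations


def counterbalanced_orders(participants, conditions):
    """Cycle through every possible condition order so each is used equally often."""
    orders = list(permutations(conditions))
    return {p: orders[i % len(orders)] for i, p in enumerate(participants)}


# Hypothetical participants; half see color first, half see b/w first.
schedule = counterbalanced_orders(["P1", "P2", "P3", "P4"], ["color", "b/w"])
for participant, order in schedule.items():
    print(participant, "->", " then ".join(order))
```

With two conditions there are only two orders. The number of orders grows factorially with the number of conditions, so larger designs typically use a subset of orders such as a Latin square.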

Between Subjects Designs Fewer order effects Simpler design & analysis Easier to recruit participants (only one session) Less efficient, because more subjects

Descriptive Statistics For all variables and subgroups, get a feel for results: Total scores, times, ratings, etc. Minimum, maximum Mean, median, ranges, etc. e.g. “Twenty participants completed both sessions (10 males, 10 females; mean age 22.4, range 18-37 years).” e.g. “The median time to complete the task in the mouse-input group was 34.5 s (min=19.2, max=305 s). The median time to complete the task in the keyboard-input group was 32.1 s (min=17.6, max=286 s).” What is the difference between mean & median? Why use one or the other?
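
The mean versus median question can be explored with a few lines of standard-library Python. The task times below are invented for illustration (they are not data from any study mentioned here); one participant took far longer than the rest.

```python
from statistics import mean, median

# Hypothetical task times in seconds, including one very slow participant.
times = [19.2, 28.0, 31.5, 34.5, 36.0, 41.3, 305.0]


def describe(values):
    """Collect the basic descriptive statistics listed on the slide."""
    return {
        "n": len(values),
        "min": min(values),
        "max": max(values),
        "mean": mean(values),
        "median": median(values),
    }


summary = describe(times)
for name, value in summary.items():
    print(f"{name}: {value}")
```

The single slow participant pulls the mean well above the median, which is one reason task times are often reported as medians.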

Inferential Stats and the Data Are these really different? What would that mean?

Goal of analysis Get >95% confidence in significance of result that is, null hypothesis (H0: Time_color = Time_b/w) disproved OR, there is an influence OR, only 1 in 20 chance that difference occurred due to random chance

Hypothesis Testing Tests to determine differences t-test to compare two means ANOVA (Analysis of Variance) to compare several means Need to determine “statistical significance” p-value (p): The probability of getting a result at least as extreme as the one you saw, simply by chance, if the null hypothesis were true “Significance level” (“alpha” level) is often set at 0.05: reject the null hypothesis only if a result like yours would occur less than 5% of the time by chance alone
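
To make the idea of "a result this extreme just by chance" concrete, here is a sketch that uses an exact permutation test rather than the t-test or ANOVA named above; it is swapped in because it needs only the standard library and makes the chance argument explicit. The completion times are invented, and `permutation_test` is a hypothetical helper.

```python
from itertools import combinations
from statistics import mean


def permutation_test(group_a, group_b):
    """Two-sided exact permutation test on the difference of group means.

    Returns the fraction of all ways to split the pooled data into two groups
    of the original sizes that give a difference at least as large as observed.
    """
    pooled = list(group_a) + list(group_b)
    observed = abs(mean(group_a) - mean(group_b))
    total = 0
    extreme = 0
    for idx in combinations(range(len(pooled)), len(group_a)):
        chosen = set(idx)
        a = [pooled[i] for i in chosen]
        b = [pooled[i] for i in range(len(pooled)) if i not in chosen]
        total += 1
        # Small tolerance guards against floating-point noise in the comparison.
        if abs(mean(a) - mean(b)) >= observed - 1e-12:
            extreme += 1
    return extreme / total


# Hypothetical completion times (seconds) for a between-subjects design.
color = [12, 13, 11, 14, 12, 13]
bw = [17, 15, 21, 16, 18, 19]

p = permutation_test(color, bw)
alpha = 0.05
print(f"p = {p:.4f}; reject H0 at alpha = {alpha}: {p < alpha}")
```

If the labels "color" and "b/w" made no difference, any split of the twelve times would be equally likely, so a small p says the observed split would be unusual under the null hypothesis.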

Example: Heather’s simple experiment Designing interface for categorizing keywords in a transcript Wanted baseline for comparison Experiment comparing: Pen and paper, not real time Pen and paper, real time Simulated interface, real time H. Richter et al. “Tagging Knowledge Acquisition To Facilitate Knowledge Traceability,” International Journal on Software Engineering and Knowledge Engineering, World Scientific, 14(1).

Experiment Hypothesis: fewer keywords in real time, fewer with transcript Independent variables: Time, accuracy of transcript Dependent variables: Number of keywords of each category Controlled variables: Experience Between subjects design 1 hour, mentally intensive task

Results
                          Non-Real Time Rate   Real Time Rate   Error + Delay Rate
Domain-specific tags      7.5                  9.4              5.1
Domain-independent tags   12                   9.8              5.8
Conversation tags         1.8                  3                2.5
For Domain-specific tags, Error+Delay less than RealTime, p < 0.01
For Domain-independent tags, Error+Delay less than RealTime, p < 0.01
Hypotheses fewer in Real Time: not supported fewer with Error+Delay: supported for two categories

Your turn Design an experiment for your project Compare new design to competitor Or compare new design to old way Or test specific performance metric you have Decide on tasks, metrics, hypotheses Subjects – how many, within/between

Next week Observation, Interviews, Questionnaires Evaluation plan feedback Highly recommended: prepare 1-2 slides on your project eval. plan We’ll give each other feedback The more you have, the more feedback you get