How (un)usable is your software?


1 Evaluation
How (un)usable is your software?

2 Agenda
Finish slides from last week
Multimodal UIs: Ted
Intro to evaluation
Experiments

3 Pen & Mobile dialog Stylus or finger
Tradeoffs of each? Pen as a standard mouse (doubleclick?) Variety of platforms Desktop touch screens or input pads (Wacom) Tablet PCs Handheld and Mobile devices Electronic whiteboards Platforms often involve variety of size and other constraints

4

5 Mobile devices
More common as more platforms become available
PDAs, cell phones, ultra-mobile tablets
Smaller displays (160x160, 320x240)
Few buttons, different interactions
Free-form ink
Soft keyboard
Numeric keyboard => text
Stroke recognition
Hand printing / writing recognition

6 Palm Z22 handheld Ultra-Mobile PC (Samsung)

7 Soft Keyboards Common on PDAs and mobile devices
Tap on buttons on screen

8 Soft Keyboard
Presents a small diagram of a keyboard
You click on buttons/keys with the pen
QWERTY vs. alphabetical
Tradeoffs? Alternatives?

9 Numeric Keypad - T9 (http://www.t9.com/)
Developed by Tegic Communications
You press the keys for the letters of your word once each; the system matches the most likely word, then offers alternative choices
Faster than multiple presses per key
Used in mobile phones
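A minimal sketch of how this kind of dictionary lookup can work, assuming a toy word list with made-up frequencies (this is not Tegic's actual algorithm or data):

```python
# T9-style predictive lookup sketch: one key press per letter, then rank
# dictionary words that match the digit sequence by frequency.
KEYS = {'2': 'abc', '3': 'def', '4': 'ghi', '5': 'jkl',
        '6': 'mno', '7': 'pqrs', '8': 'tuv', '9': 'wxyz'}
LETTER_TO_KEY = {ch: key for key, letters in KEYS.items() for ch in letters}

# Toy dictionary with rough frequency counts; a real system would use a large corpus.
WORDS = {'good': 120, 'home': 95, 'gone': 40, 'hood': 10, 'goof': 5}

def word_to_keys(word):
    """Convert a word to the digit sequence a user would press, one press per letter."""
    return ''.join(LETTER_TO_KEY[ch] for ch in word.lower())

def candidates(key_sequence):
    """Return dictionary words matching the key presses, most frequent first."""
    matches = [w for w in WORDS if word_to_keys(w) == key_sequence]
    return sorted(matches, key=lambda w: -WORDS[w])

print(candidates('4663'))   # 'good', 'home', 'gone', 'hood', 'goof' all map to 4663
```

Every word that maps to the same key sequence is offered, most frequent first, which is why a single ambiguous sequence like 4-6-6-3 yields several candidates.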

10 Cirrin - Stroke Recognition
Developed by Jen Mankoff (GT -> Berkeley CS faculty -> CMU CS faculty)
Word-level unistroke technique
UIST '98 paper
Use the stylus to go from one letter to the next

11 Quikwriting - Stroke Recognition
Developed by Ken Perlin

12 Quikwriting Example: p, l, e
Said to be as fast as Graffiti, but requires more learning

13 Hand Printing / Writing Recognition
Recognizing letters, numbers, and special symbols
Lots of systems (commercial too)
English, kanji, etc.
Not perfect, but people aren't either!
People - 96% on handprinted single characters
Computer - >97% is really good
OCR (Optical Character Recognition)

14 Recognition Issues
Boxed vs. free-form input
Sometimes encounter boxes on forms
Printed vs. cursive
Cursive is much more difficult
Letters vs. words
Cursive is easier to do in words vs. individual letters, as words create more context
Usually requires existence of a dictionary
Real-time vs. off-line

15 Special Alphabets
Graffiti - unistroke alphabet on Palm PDAs
What are your experiences with Graffiti?
Other alphabets or purposes
Gestures for commands

16 Pen Gesture Commands
Define a series of (hopefully) simple drawing gestures that mean different commands in a system
Example gestures might mean delete, insert, or new paragraph
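As a toy illustration of the idea, a stroke's net direction could be mapped to a command; the point-to-command mapping below is invented for the example, and real systems use proper gesture recognizers:

```python
# Toy gesture-to-command mapping based only on a stroke's net direction
# (screen coordinates: y grows downward). Illustrative mapping, not a real recognizer.
def classify_stroke(points):
    """points: list of (x, y) samples from pen-down to pen-up."""
    (x0, y0), (x1, y1) = points[0], points[-1]
    dx, dy = x1 - x0, y1 - y0
    if abs(dx) >= abs(dy):
        return 'delete' if dx < 0 else 'insert'      # leftward vs. rightward stroke
    return 'paragraph' if dy > 0 else 'undo'          # downward vs. upward stroke

print(classify_stroke([(100, 50), (60, 52), (20, 55)]))   # leftward stroke -> 'delete'
```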

17 Pen Use Modes
Often want a mix of free-form drawing and special commands
How does the user switch modes?
Mode icon on screen
Button on pen
Button on device

18 Error Correction
Having to correct errors can slow input tremendously
Strategies:
Erase and try again (repetition)
When uncertain, system shows a list of best guesses (n-best list, sketched below)
Others?
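One way an n-best list might be produced is to rank dictionary words by similarity to the raw recognition result; the word list here is illustrative:

```python
# n-best list sketch: when the recognizer is uncertain, offer the closest
# dictionary words instead of committing to a single (possibly wrong) guess.
from difflib import get_close_matches

DICTIONARY = ['evaluation', 'evolution', 'elevation', 'validation', 'valuation']

def n_best(recognized_text, n=3):
    """Return up to n dictionary words ranked by similarity to the raw recognition."""
    return get_close_matches(recognized_text, DICTIONARY, n=n, cutoff=0.5)

print(n_best('evaluatin'))   # e.g. ['evaluation', 'valuation', 'evolution']
```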

19 Free-form Ink
Ink is the data; take it as is
Human is responsible for understanding and interpretation
Often time-stamped
Applications: signature verification, notetaking, electronic whiteboards, sketching

20 Electronic whiteboards
Smartboard and Mimio
Can integrate with projection
Large surface to interact with
Issues?

21 Real paper
Anoto digital paper and pen technology
Logitech io Digital Writing System
Issues?

22 General Issues – Pen input
Who is in control - user or computer?
Initial training required
Learning time to become proficient
Speed of use
Generality/flexibility/power
Special skills - typing
Gulf of evaluation / gulf of execution
Screen space required
Computational resources required

23 Other interesting interactions
Gesture input
Specialized hardware, or tracking
3D interaction
Stereoscopic displays
Virtual reality
Immersive displays such as glasses, CAVEs
Augmented reality
Head trackers and vision-based tracking

24 What's coming up
Upcoming related topics:
Multimodal UIs: Ted
3D user interfaces: Amy
Conversational agents: Evan

25 When to do evaluation?
Summative
Assess an existing system
Judge whether it meets some criteria
Formative
Assess a system being designed
Gather input to inform design
Summative or formative?
Depends on maturity of the system and how evaluation results will be used
The same technique can be used for either

26 Other distinctions
Form of results obtained
Quantitative
Qualitative
Who is experimenting with the design
End users
HCI experts
Approach
Experimental
Naturalistic
Predictive

27 Evaluation techniques
Predictive evaluation
Fitts' law, Hick's law, etc. (sketch below)
Observation
Think-aloud
Cooperative evaluation
Watch users perform tasks with your interface
Next lecture
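For predictive evaluation, Fitts' law gives movement time as MT = a + b·log2(D/W + 1); a small sketch, with placeholder values for the device-specific constants a and b:

```python
import math

def fitts_movement_time(distance, width, a=0.1, b=0.15):
    """Predicted movement time (seconds) from Fitts' law: MT = a + b*log2(D/W + 1).
    The constants a and b are device-specific and are normally fit from data;
    the values here are only placeholders."""
    index_of_difficulty = math.log2(distance / width + 1)   # bits
    return a + b * index_of_difficulty

# Compare two illustrative button designs: a small far target vs. a large near one.
print(fitts_movement_time(distance=400, width=20))   # harder target -> longer predicted time
print(fitts_movement_time(distance=100, width=60))   # easier target -> shorter predicted time
```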

28 More techniques
Empirical user studies (experiments)
Test hypotheses about your interface
Examine dependent variables against independent variables
More later…
Interviews, questionnaires, focus groups
Get user feedback
More next week…

29 Still more techniques
Discount usability techniques
Use HCI experts instead of users
Fast and cheap method to get broad feedback
Heuristic evaluation
Several experts examine the interface using guiding heuristics (like the ones we used in design)
Cognitive walkthrough
Several experts assess learnability of the interface for novices
In class - two weeks from today

30 And still more techniques
Diary studies
Users relate experiences on a regular basis
Can write them down, call in, etc.
Experience Sampling Technique
Interrupt users with a very short questionnaire on a random-ish basis
Good for getting an idea of regular and long-term use in the field (real world)
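A sketch of how experience-sampling prompts might be scheduled, assuming a fixed waking window and an illustrative number of prompts per day:

```python
# Experience-sampling scheduler sketch: pick a few random prompt times per day
# within waking hours. The window and prompt count are illustrative choices.
import random
from datetime import date, datetime, timedelta

def sample_prompt_times(day, n_prompts=4, start_hour=9, end_hour=21, seed=None):
    """Return n_prompts random datetimes between start_hour and end_hour on `day`."""
    rng = random.Random(seed)
    window = (end_hour - start_hour) * 60            # minutes in the sampling window
    minutes = sorted(rng.sample(range(window), n_prompts))
    start = datetime.combine(day, datetime.min.time()) + timedelta(hours=start_hour)
    return [start + timedelta(minutes=m) for m in minutes]

for t in sample_prompt_times(date.today(), seed=42):
    print(t.strftime('%H:%M'))   # four random prompt times during the day
```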

31 General Recommendations
Identify evaluation goals
Include both objective & subjective data
e.g. "completion time" and "preference"
Use multiple measures within a type
e.g. "reaction time" and "accuracy"
Use quantitative measures where possible
e.g. preference score (on a scale of 1-7)
Note: only gather the data required; do so with minimum interruption, hassle, time, etc.

32 Evaluation planning
Decide on techniques, tasks, materials
What are the usability criteria?
How much authenticity is required?
How many people, and for how long?
How to record data, how to analyze data
Prepare materials - interfaces, storyboards, questionnaires, etc.
Pilot the entire evaluation
Test all materials, tasks, questionnaires, etc.
Find and fix problems with wording and assumptions
Get a good feel for the length of the study

33 Performing the Study
Be well prepared so participants' time is not wasted
Explain procedures without compromising results
Session should not be too long; the subject can quit at any time
Never express displeasure or anger
Data should be stored anonymously, securely, and/or destroyed
Expect anything and everything to go wrong!! (a little story)

34 Consent
Why important?
People can be sensitive about this process and these issues
Errors will likely be made; the participant may feel inadequate
May be mentally or physically strenuous
What are the potential risks (there are always risks)?

35 Data Inspection
Start by just looking at the data
Were there outliers, people who fell asleep, anyone who tried to mess up the study, etc.?
Identify issues:
Overall, how did people do?
"5 W's" (where, what, why, when, and for whom were the problems?)
Compile aggregate results and descriptive statistics

36 Making Conclusions
Where did you meet your criteria? Where didn't you?
What were the problems?
How serious are these problems?
What design changes should be made?
But don't make things worse…
Prioritize and plan changes to the design

37 Example: Heather's evaluation
Evaluate use of an interface in a realistic task
Interface: video + annotated transcript
Video was of a requirements-gathering session
Task was to create a requirements document based on the video
H. Richter et al., "An Empirical Investigation of Capture and Access for Software Requirements Activities," in Graphics Interface 2005.

38 The Interface: TagViewer

39 The Setup
Subjects: 12 CS grad students
Task:
Watch a 1-hour video, taking personal notes as desired
Return 3-7 days later; asked to create as complete and detailed a requirements document as possible in 45 minutes
2 conditions - just the video, or the interface to help
Recording:
Video recorded the subject over the shoulder
Software logs of both video and interface
Interview afterwards
Kept notes

40 Some results
                        Notes        Document     Video/TagViewer
Tagging condition       2.1 (2.0)    25.1 (9.6)   16.9 (11.5)
Video-only condition    8.7 (5.4)    32.4 (5.2)   2.2 (3.5)
Video use by tagging subjects: mean of 12 plays, 11 seeks, 11.9 minutes of total video

41 Some conclusions
Very different use of the video
Those with the interface found it useful; those without didn't
Used the video to clarify details and look for missing information
Annotations provided efficient ways to find information and supported a variety of personal strategies
Usability was pretty good; will need more sophisticated searching for longer videos

42 Experiments
Design the experiment to collect the data to test the hypotheses to evaluate the interface to refine the design
A controlled way to determine the impact of design parameters on user experience
Want results that rule out the possibility of chance
Good for comparing things - old system vs. new system, competitive system vs. new system, etc.

43 Experimental Design
Determine tasks
Need clearly stated benchmark tasks
Determine performance measures
Speed (reaction time, time to complete)
Accuracy (errors, hits/misses)
Production (number of files processed)
Score (number of points earned)
Preference, satisfaction, etc. (i.e., questionnaire responses) are also valid
Determine variables and hypotheses
Validity - typical tasks, typical users?
Make sure you compare apples to apples

44 Types of Variables
Independent
What you're studying, what you intentionally vary (e.g., interface feature, interaction device, selection technique)
Dependent
Performance measures you record or examine (e.g., time, number of errors)
Controlled
Factors you want to prevent from influencing results
Experimental condition = each combined set of independent variables
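A small sketch of enumerating experimental conditions as the cross product of independent-variable levels; the variables and levels below are illustrative:

```python
# Each experimental condition is one combination of independent-variable levels.
from itertools import product

independent_variables = {
    'display': ['color', 'b/w'],
    'input_device': ['mouse', 'stylus'],
}

conditions = list(product(*independent_variables.values()))
for display, device in conditions:
    print(display, device)        # 2 x 2 = 4 conditions
```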

45 "Controlling" Variables
Prevent a variable from affecting the results in any systematic way
Methods of controlling for a variable:
Don't allow it to vary
e.g., all males
Allow it to vary randomly
e.g., randomly assign participants to different groups
Counterbalance - systematically vary it (sketch below)
e.g., equal number of males and females in each group
The appropriate option depends on circumstances
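A sketch of the last two options, with made-up participant IDs: random assignment to groups, and a simple two-order counterbalance of condition order:

```python
# Random assignment and counterbalancing sketch; participant IDs and conditions
# are illustrative.
import random

participants = [f'P{i}' for i in range(1, 13)]
random.shuffle(participants)                      # random assignment to two groups
group_a, group_b = participants[:6], participants[6:]

orders = [('color', 'b/w'), ('b/w', 'color')]     # counterbalanced condition orders
assigned_orders = {p: orders[i % 2] for i, p in enumerate(sorted(participants))}

print(group_a, group_b)
print(assigned_orders['P1'])                      # the order this participant sees
```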

46 Hypotheses
What you predict will happen
More specifically, the way you predict the dependent variable (e.g., accuracy) will depend on the independent variable(s)
"Null" hypothesis (H0)
States that there will be no effect
e.g., "There will be no difference in performance between the two groups"
Data are used to try to disprove this null hypothesis

47 Example
Do people complete operations faster with a black-and-white display or a color one?
Independent variable - display type (color or b/w)
Dependent variable - time to complete task (minutes)
Controlled variables - same number of males and females in each group
Hypothesis: time to complete the task will be shorter for users with the color display
H0: Time_color = Time_b/w

48 Subjects
How many?
Book advice: at least 10
Other advice: 6 per experimental condition
Real advice: depends on the statistics
The more subjects, the stronger your conclusions: a small difference between 50 users is less likely to be due to chance than a small difference between 3 users
Relating subjects and experimental conditions
Within/between subjects design

49 Experimental Designs
Within subjects design
Every participant provides a score for all levels or conditions
       Color      B/W
P1     … secs     … secs
P2     … secs     … secs
P3     … secs     … secs
...

50 Experimental Designs
Between subjects design
Each participant provides results for only one condition
Color              B/W
P1  … secs         P4  … secs
P2  … secs         P5  … secs
P3  … secs         P6  … secs
...

51 Within Subjects Designs
More efficient: each subject gives you more data - they complete more "blocks" or "sessions"
More statistical "power": each person is their own control
Therefore, can require fewer participants
May mean a more complicated design to avoid "order effects"
e.g., seeing color then b/w may be different from seeing b/w then color
Participant may learn from the first condition
Fatigue may make second performance worse

52 Between Subjects Designs
Fewer order effects
Simpler design & analysis
Easier to recruit participants (only one session)
Less efficient, because more subjects are needed

53 Descriptive Statistics
For all variables and subgroups, get a feel for results:
Total scores, times, ratings, etc.
Minimum, maximum
Mean, median, ranges, etc.
e.g. "Twenty participants completed both sessions (10 males, 10 females; mean age 22.4, range years)."
e.g. "The median time to complete the task in the mouse-input group was 34.5 s (min=19.2, max=305 s). The median time to complete the task in the keyboard-input group was 32.1 s (min=17.6, max=286 s)."
What is the difference between mean & median? Why use one or the other?
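A short sketch with fabricated task times showing why the two differ: one very slow participant pulls the mean up but barely moves the median:

```python
# Descriptive statistics sketch; the task times below are made up for illustration.
import statistics

mouse_times = [19.2, 28.4, 31.0, 34.5, 36.8, 41.2, 305.0]   # seconds, one extreme outlier

print('mean  :', round(statistics.mean(mouse_times), 1))     # ~70.9 s, dragged up by the outlier
print('median:', statistics.median(mouse_times))              # 34.5 s, robust to the outlier
print('min   :', min(mouse_times), 'max:', max(mouse_times))
```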

54 Inferential Stats and the Data
Are these really different? What would that mean?

55 Goal of analysis
Get >95% confidence in the significance of the result
That is, the null hypothesis is disproved: H0: Time_color = Time_b/w
OR, there is an influence
OR, there is only a 1 in 20 chance that the difference occurred due to random chance

56 Hypothesis Testing
Tests to determine differences
t-test to compare two means (sketch below)
ANOVA (Analysis of Variance) to compare several means
Need to determine "statistical significance"
"Significance level" (p): the probability that your null hypothesis was wrong simply by chance
p (the "alpha" level) is often set at 0.05, i.e., 5% of the time you would get the result you saw just by chance
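A minimal sketch of testing the color vs. b/w example with an independent-samples t-test; the completion times below are fabricated for illustration only:

```python
# t-test sketch for the color vs. b/w example; data are fabricated.
from scipy import stats

color_times = [4.1, 3.8, 4.5, 3.9, 4.2, 4.0]   # minutes to complete the task
bw_times    = [4.9, 5.1, 4.7, 5.3, 4.8, 5.0]

t, p = stats.ttest_ind(color_times, bw_times)
print(f't = {t:.2f}, p = {p:.4f}')
if p < 0.05:
    print('Reject H0: the difference is unlikely to be due to chance alone.')
```

With more than two conditions, scipy.stats.f_oneway would give the corresponding one-way ANOVA.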

57 Example: Heather's simple experiment
Designing an interface for categorizing keywords in a transcript
Wanted a baseline for comparison
Experiment comparing:
Pen and paper, not real time
Pen and paper, real time
Simulated interface, real time
H. Richter et al., "Tagging Knowledge Acquisition To Facilitate Knowledge Traceability," International Journal of Software Engineering and Knowledge Engineering, World Scientific, 14(1).

58 Experiment
Hypotheses: fewer keywords in real time; fewer with the transcript interface
Independent variables: time, accuracy of transcript
Dependent variables: number of keywords of each category
Controlled variables: experience
Between subjects design
1-hour, mentally intensive task

59 Results
                          Non-Real Time Rate   Real Time Rate   Error + Delay Rate
Domain-specific tags      7.5                  9.4              5.1
Domain-independent tags   12                   9.8              5.8
Conversation tags         1.8                  3                2.5
For domain-specific tags, Error+Delay less than Real Time, p < 0.01
For domain-independent tags, Error+Delay less than Real Time, p < 0.01
Hypotheses:
Fewer in real time: not supported
Fewer with Error+Delay: supported for two categories

60 Your turn
Design an experiment for your project
Compare the new design to a competitor
Or compare the new design to the old way
Or test a specific performance metric you have
Decide on tasks, metrics, hypotheses
Subjects - how many, within/between

61 Next week
Observation, interviews, questionnaires
Evaluation plan feedback
Highly recommended: prepare 1-2 slides on your project evaluation plan
We'll give each other feedback
The more you have, the more feedback you get

