
2 SAURO/LEWIS 109 USABILITY TESTING RESEARCH METHODS IN HCI
HCI RESEARCHERS EMPLOY EMPIRICAL METHODS, TECHNIQUES FOR INVESTIGATING THE WORLD AND COLLECTING EVIDENCE TO PROVE OR DISPROVE THEIR HYPOTHESES ABOUT HOW PEOPLE INTERACT WITH COMPUTERS, AND ABOUT THE USABILITY OF INTERFACES.
LAB EXPERIMENT: AN ARTIFICIAL SITUATION, CREATED BY AND HIGHLY CONTROLLED BY THE EXPERIMENTER, THAT TYPICALLY COMPARES ALTERNATIVE USER INTERFACES OR MEASURES HOW USABILITY VARIES WITH SOME DESIGN PARAMETER. EXAMPLE: A TEST OF FONT READABILITY, DONE BY BRINGING SUBJECTS INTO THE EXPERIMENTER’S LAB, ASKING THEM TO READ TEXT SELECTIONS DISPLAYED WITH DIFFERENT FONTS, AND TIMING THEIR READING SPEED.
FIELD STUDY: A REAL SITUATION IN THE ACTUAL ENVIRONMENT WHERE PEOPLE USE THE INTERFACE BEING CONSIDERED, USING REAL TASKS (RATHER THAN TASKS CONCOCTED BY THE EXPERIMENTER). IN HCI, INITIAL FIELD STUDIES JUST OBSERVE WITHOUT INTERVENING (E.G., CONTEXTUAL INQUIRY), WHILE FINAL FIELD STUDIES DELIVER THE NEW UI AND SEE HOW IT’S USED.
SURVEY: A QUESTIONNAIRE, CONDUCTED BY PAPER, PHONE, WEB, OR IN PERSON. IN GENERAL, THE RESULTS OF A SURVEY TEND TO APPLY MORE STRONGLY TO THE WHOLE POPULATION OF PEOPLE RELEVANT TO THE STUDY, SINCE IT IS FAR CHEAPER TO SURVEY A LARGE NUMBER OF PEOPLE, AND GOOD STATISTICAL SAMPLING TECHNIQUES EXIST TO MAKE THE RESULTS MORE GENERALIZABLE.

3 SAURO/LEWIS 110 USABILITY TESTING
[DIAGRAM: FIELD STUDY, SURVEY, AND LAB EXPERIMENT PLACED ON TWO AXES, OBTRUSIVE VS. UNOBTRUSIVE AND ABSTRACT VS. CONCRETE]
IN FIELD STUDIES, SUBJECTS DO THEIR OWN TASKS IN THEIR OWN ENVIRONMENTS.
IN ORDER TO MAKE STRONG STATISTICAL CLAIMS, LAB EXPERIMENTS USE SIMPLIFIED AND HIGHLY CONTROLLED TASKS.
SURVEYS ARE GENERALIZABLE, BUT SUBJECTS ARE AWARE THAT THEY ARE BEING STUDIED AND MAY RESPOND ACCORDINGLY.

4 SAURO/LEWIS 111 USABILITY TESTING QUANTIFYING USABILITY
USABILITY IS THE EXTENT TO WHICH USERS CAN UTILIZE A SYSTEM’S FUNCTIONALITY.
DIMENSIONS OF USABILITY:
LEARNABILITY (IS THE SYSTEM EASY TO LEARN?)
EFFICIENCY (ONCE LEARNED, IS THE SYSTEM FAST TO USE?)
RECOVERABILITY (ARE ERRORS FEW AND RECOVERABLE?)
SATISFACTION (IS THE SYSTEM ENJOYABLE TO USE?)

5 SAURO/LEWIS 112 USABILITY TESTING USABILITY TESTING CONSIDERATIONS
NUMEROUS VARIABLES AFFECT THE VALIDITY OF USABILITY TESTS.
SAMPLE SIZE: HOW MANY PARTICIPANTS ARE NEEDED TO ENSURE THE VALIDITY OF THE TEST?
RANDOMNESS: DO NON-PARTICIPANTS HAVE FUNDAMENTALLY DIFFERENT CHARACTERISTICS THAN PARTICIPANTS?
REPRESENTATIVENESS: HOW WELL DOES THE SAMPLE POPULATION REPRESENT THE PARENT POPULATION?
DATA COLLECTION: SHOULD THE DATA BE GATHERED REMOTELY OR IN A MODERATED LAB SESSION?
COMPLETION RATE: HOW MANY PARTICIPANTS SUCCESSFULLY COMPLETE THE ASSIGNED TASK DURING A USABILITY TEST?
TASK TIME: HOW LONG DOES A USER SPEND ON AN ACTIVITY DURING A USABILITY TEST?

6 SAURO/LEWIS 113 USABILITY TESTING CONTROLLED EXPERIMENT
1. START WITH A TESTABLE HYPOTHESIS. FOR EXAMPLE: “THE MACINTOSH MENU BAR, WHICH IS ANCHORED AT THE TOP OF THE SCREEN, IS FASTER TO ACCESS THAN THE WINDOWS MENU BAR, WHICH IS SEPARATED FROM THE TOP OF THE SCREEN BY A WINDOW TITLE BAR.”
2. CHOOSE THE INDEPENDENT VARIABLES TO MANIPULATE TO TEST THE HYPOTHESIS. IN THIS CASE, THE Y-POSITION OF THE MENU BAR. OTHER POSSIBILITIES: USER CLASSES (NOVICES VS. EXPERTS, MAC USERS VS. WINDOWS USERS), MENU ITEM ARRANGEMENT (ALPHABETIZED VS. FUNCTIONALLY-GROUPED).
3. MEASURE THE DEPENDENT VARIABLES TO TEST THE HYPOTHESIS. TIME, ERROR RATE, NON-ERROR EVENT COUNT (E.G., NUMBER OF TIMES MENU ITEM IS EXPANDED), USER SATISFACTION (USUALLY VIA A QUESTIONNAIRE).
4. USE STATISTICAL TESTS TO ACCEPT OR REJECT THE HYPOTHESIS. ANALYZE HOW CHANGES IN THE INDEPENDENT VARIABLES AFFECTED THE DEPENDENT VARIABLES, AND WHETHER THOSE EFFECTS WERE STATISTICALLY SIGNIFICANT (I.E., UNLIKELY TO HAVE ARISEN BY CHANCE ALONE).

7 SAURO/LEWIS 114 USABILITY TESTING SCHEMATIC VIEW OF EXPERIMENT DESIGN
[DIAGRAM: X (INDEPENDENT VARIABLES) ENTERS A PROCESS THAT PRODUCES Y (DEPENDENT VARIABLES), WITH $Y = F(X)$]
IDEALLY, THE GOAL IS TO DETERMINE THE PRECISE EFFECT THAT THE INDEPENDENT VARIABLES HAVE ON THE DEPENDENT VARIABLES.
[DIAGRAM: THE SAME PROCESS WITH FIVE UNKNOWN/UNCONTROLLED VARIABLES $u_1, \ldots, u_5$ ALSO ENTERING IT, SO THAT $Y = F(X, u_1, \ldots, u_5)$]
IN REALITY, HOWEVER, THERE ARE A NUMBER OF UNKNOWN OR UNCONTROLLED VARIABLES THAT ALSO IMPACT THE DEPENDENT VARIABLES (E.G., IN THE MENU BAR EXAMPLE, THE POINTING DEVICE BEING USED, THE ORIGINAL POSITION OF THE MOUSE POINTER, THE SURFACE ON WHICH THE MOUSE IS BEING DRAGGED, THE USER’S LEVEL OF FATIGUE, THE USER’S PREVIOUS EXPERIENCE WITH A PARTICULAR TYPE OF MENU BAR, ETC.).
THE PURPOSE OF EXPERIMENT DESIGN IS TO ELIMINATE (OR AT LEAST TO RENDER HARMLESS) THE EFFECT OF THE UNKNOWN AND UNCONTROLLED VARIABLES, IN ORDER TO ENABLE CONCLUSIONS TO BE DRAWN REGARDING THE EFFECT OF THE INDEPENDENT VARIABLES ON THE DEPENDENT VARIABLES.

8 SAURO/LEWIS 115 USABILITY TESTING DESIGN OF THE MENU BAR EXPERIMENT
WHAT USER POPULATION SHOULD BE SAMPLED? MAC USERS VS. WINDOWS USERS? YOUNG USERS VS. OLD USERS? LEFT-HANDED USERS VS. RIGHT-HANDED USERS?
HOW SHOULD THE TEST BE IMPLEMENTED? USING REAL MAC AND WINDOWS INTERFACES? IMPLEMENT A SEPARATE INTERFACE THAT AVOIDS CONFOUNDING VARIABLES (SIZE OF THE MENU BAR, READING SPEED OF THE FONT, MOUSE ACCELERATION PARAMETERS, ETC.)?
WHAT TASKS SHOULD THE USERS BE ASSIGNED? REALISTIC TASKS (E.G., E-MAIL) THAT CAN BE GENERALIZED BUT MAY PRODUCE DATA “NOISE”? ARTIFICIAL TASKS THAT WOULD PRODUCE RELIABLE BUT UNREALISTIC RESULTS?
HOW SHOULD THE TIME VARIABLE BE MEASURED? FROM WHEN THE USER IS TOLD WHAT TO DO (“CLICK EDIT”) TO WHEN THE TASK IS COMPLETED? FROM THE TIME THE USER STARTS TO MOVE THE MOUSE UNTIL THE TASK IS FINISHED?
IN WHAT ORDER SHOULD TASKS AND INTERFACE CONDITIONS BE ASSIGNED? WILL THE USER EXPERIENCE FASTER REACTION TIMES WITH PRACTICE? WILL THE USER BECOME FATIGUED IF THE CONDITIONS DON’T VARY?
WHAT HARDWARE SHOULD BE USED? SHOULD EVERY USER USE THE SAME COMPUTER? SHOULD THE INTERACTIVE DEVICE (MOUSE, TRACKBALL, TOUCHPAD, JOYSTICK) VARY?

9 SAURO/LEWIS 116 CONFIDENCE INTERVALS CONFIDENCE
USUALLY, WHEN WE WANT INFORMATION ABOUT A POPULATION (E.G., ALL AMAZON.COM USERS, ALL SENIOR CITIZENS ON FACEBOOK), THE BEST WE CAN DO IS ESTIMATE, BASED ON A MUCH SMALLER SAMPLE. A CONFIDENCE INTERVAL IS A RANGE OF VALUES WITH A SPECIFIC PROBABILITY OF CONTAINING THE POPULATION VALUE WE SEEK TO ESTIMATE. THREE MAIN FACTORS AFFECT THE CONFIDENCE INTERVAL:
1. THE CONFIDENCE LEVEL (I.E., HOW CONFIDENT DO YOU NEED TO BE?) A 90% CONFIDENCE INTERVAL IS NARROWER THAN A 95% CONFIDENCE INTERVAL COMPUTED FROM THE SAME DATA, WHICH NARROWS DOWN THE RANGE OF ESTIMATED VALUES, BUT INCREASES THE CHANCES OF MAKING AN ERROR.
2. THE VARIABILITY (I.E., HOW MUCH DOES THE DATA FLUCTUATE?) VARIABILITY IS ESTIMATED VIA THE SAMPLE’S STANDARD DEVIATION; THE HIGHER THE VARIABILITY IS, THE WIDER THE CONFIDENCE INTERVAL WILL BE.
3. THE SAMPLE SIZE (I.E., HOW MUCH DATA CAN YOU ACCUMULATE?) THE CONFIDENCE INTERVAL SIZE AND THE SAMPLE SIZE HAVE AN INVERSE SQUARE ROOT RELATIONSHIP (E.G., TO CUT THE CONFIDENCE INTERVAL IN HALF, YOU’D NEED TO QUADRUPLE THE SAMPLE SIZE).
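The inverse square root relationship in the third factor can be checked numerically. This is a minimal sketch, assuming a z-based margin of error of the form z·s/√n; the standard deviation of 10 and the sample sizes are made-up illustration values, not data from the slides.

```python
import math

def margin_of_error(s, n, z=1.96):
    """Half-width of a z-based confidence interval for a mean."""
    return z * s / math.sqrt(n)

s = 10.0  # hypothetical sample standard deviation
for n in (25, 100, 400):
    # Each fourfold increase in n should halve the margin of error.
    print(n, round(margin_of_error(s, n), 2))
```

Each step quadruples the sample size, and the printed half-width is cut in half each time.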

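10 SAURO/LEWIS 117 CONFIDENCE INTERVALS COMPLETION RATE CONFIDENCE INTERVALS
THE STANDARD FORMULA FOR THE CONFIDENCE INTERVAL FOR THE PERCENTAGE OF A POPULATION THAT WILL BE ABLE TO COMPLETE A PARTICULAR TASK IS:
$\hat{p} \pm z \sqrt{\dfrac{\hat{p}(1-\hat{p})}{n}}$
WHERE $\hat{p}$ IS THE SAMPLE COMPLETION RATE, $n$ IS THE SAMPLE SIZE, AND $z$ IS THE CRITICAL VALUE FOR THE DESIRED CONFIDENCE LEVEL:
CONFIDENCE LEVEL | z
0.80 | 1.28
0.90 | 1.645
0.95 | 1.96
0.99 | 2.575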

11 SAURO/LEWIS 118 CONFIDENCE INTERVALS COMPLETION RATE EXAMPLE
FORTY-EIGHT STUDENTS ARE ASKED TO FIND THE CLASS SCHEDULES PAGE ON THE NEWLY REDESIGNED SIUE WEB SITE, BUT ONLY THIRTY-FOUR ARE ABLE TO DO SO. WHAT WOULD BE THE 95% CONFIDENCE INTERVAL FOR THE PROPORTION OF THE ENTIRE STUDENT POPULATION ABLE TO PERFORM THIS TASK?
WITH A SAMPLE COMPLETION RATE OF 34/48 ≈ 0.708 AND z = 1.96, THE INTERVAL IS $0.708 \pm 1.96\sqrt{0.708 \times 0.292 / 48} \approx 0.708 \pm 0.129$.
SO WE CAN BE 95% CONFIDENT THAT BETWEEN 57.9% AND 83.7% OF THE STUDENTS WILL BE ABLE TO FIND THE CLASS SCHEDULES PAGE ON THE NEW SITE.
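The completion rate example can be reproduced in a few lines. This is a sketch that assumes the slide used the ordinary normal-approximation (Wald) interval, which matches the slide's stated result; the function name `wald_interval` is an illustrative choice.

```python
import math
from statistics import NormalDist

def wald_interval(successes, n, confidence=0.95):
    """Normal-approximation (Wald) interval for a completion rate."""
    p_hat = successes / n
    # Two-sided critical value, e.g. about 1.96 for 95% confidence.
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    half_width = z * math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - half_width, p_hat + half_width

low, high = wald_interval(34, 48)
print(f"{low:.3f} to {high:.3f}")
```

For 34 successes out of 48 this gives limits of roughly 0.58 and 0.84, in line with the slide's interval up to rounding.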

12 SAURO/LEWIS119CONFIDENCE INTERVALS A SLIGHT ADJUSTMENT RESEARCH HAS SHOWN THAT WHEN THE SAMPLE COMPLETION RATE IS EXTREME (TOO CLOSE TO 0% OR 100%), A MORE ACCURATE FORMULA FOR THE CONFIDENCE INTERVAL IS NEEDED. WHERE: FOR OUR PREVIOUS EXAMPLE, WHERE THE COMPLETION RATE WAS NOT THAT EXTREME (0.708), THE ADJUSTED 95% CONFIDENCE INTERVAL COMPUTES TO BETWEEN 57.8% AND 83.4%, NOT THAT DIFFERENT FROM THE ORIGINAL INTERVAL OF 57.9% TO 83.7%.

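13 SAURO/LEWIS 120 CONFIDENCE INTERVALS CONTINUOUS DATA
WHEN SAMPLE SIZES ARE SMALL AND DATA IS CONTINUOUS (E.G., RATING VALUES INSTEAD OF COMPLETION BOOLEANS), USING THE NORMAL DISTRIBUTION CAN BE VERY INACCURATE, SO THE t-DISTRIBUTION IS USED TO ACCOUNT FOR HOW WIDELY THE SAMPLE DATA FLUCTUATES:
$\bar{x} \pm t \dfrac{s}{\sqrt{n}}$
WHERE $\bar{x}$ IS THE SAMPLE MEAN, $s$ IS THE SAMPLE STANDARD DEVIATION, $n$ IS THE SAMPLE SIZE, AND $t$ IS THE CRITICAL VALUE OF THE t-DISTRIBUTION WITH $n-1$ DEGREES OF FREEDOM FOR THE DESIRED CONFIDENCE LEVEL.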

14 SAURO/LEWIS 121 CONFIDENCE INTERVALS REMEMBER THE t STATISTIC?
FIRST, RECALL THAT THE z STATISTIC IS USED TO ANALYZE A SAMPLE WHEN THE POPULATION’S MEAN AND STANDARD DEVIATION ARE KNOWN:
$z = \dfrac{\bar{x} - \mu}{\sigma / \sqrt{n}}$
WHERE $\bar{x}$ IS THE SAMPLE MEAN, $\mu$ IS THE POPULATION MEAN, $\sigma$ IS THE POPULATION STANDARD DEVIATION, AND $n$ IS THE SAMPLE SIZE.
SO, THE z STATISTIC IS THE NUMBER OF STANDARD ERROR UNITS THAT A SAMPLE’S MEAN IS FROM THE POPULATION’S MEAN, ASSUMING A NORMAL DISTRIBUTION. USING A STANDARD NORMAL DISTRIBUTION TABLE, THE CORRESPONDING p-VALUE CAN BE LOOKED UP, INDICATING THAT THE PROBABILITY IS 1-p THAT A SIZE-n SAMPLE WOULD HAVE A MEAN CLOSER TO μ THAN THE SAMPLE IN QUESTION.

15 SAURO/LEWIS122CONFIDENCE INTERVALS POPULATION CRISIS THE t STATISTIC ALLOWS RESEARCHERS TO USE SAMPLE DATA TO TEST HYPOTHESES ABOUT AN UNKNOWN POPULATION MEAN. THE PARTICULAR ADVANTAGE OF THE t STATISTIC IS THAT IT DOES NOT REQUIRE ANY KNOWLEDGE OF THE STANDARD DEVIATION OF THE POPULATION. THUS, THE t STATISTIC CAN BE USED TO TEST HYPOTHESES ABOUT A COMPLETELY UNKNOWN POPULATION, I.E., BOTH μ (THE POPULATION MEAN) AND σ (THE POPULATION STANDARD DEVIATION) ARE UNKNOWN, AND THE ONLY AVAILABLE INFORMATION ABOUT THE POPULATION COMES FROM THE SAMPLE. ALL THAT IS REQUIRED FOR A HYPOTHESIS TEST WITH t IS A SAMPLE AND A REASONABLE HYPOTHESIS ABOUT THE POPULATION MEAN.

16 SAURO/LEWIS123CONFIDENCE INTERVALS THE t STATISTIC LIKE THE z STATISTIC, THE t STATISTIC FORMS A RATIO. THE NUMERATOR CONSISTS OF THE OBTAINED DIFFERENCE BETWEEN THE SAMPLE MEAN AND THE HYPOTHESIZED POPULATION MEAN. THE DENOMINATOR IS THE ESTIMATED STANDARD ERROR (BASED ON THE SAMPLE’S STANDARD DEVIATION, NOT THE POPULATION’S), WHICH MEASURES HOW MUCH DIFFERENCE IS EXPECTED BY CHANCE. NOTE THAT WHEN LOOKING UP THE p -VALUE IN A t DISTRIBUTION TABLE, THE t STATISTIC’S DEPENDENCE ON THE SAMPLE SIZE REQUIRES THAT YOU USE THE DEGREES OF FREEDOM ( n -1) TO REFERENCE THE CORRECT t STATISTIC.

17 SAURO/LEWIS124CONFIDENCE INTERVALS CONFIDENCE INTERVAL FOR RATING SCALES FOR EXAMPLE, ASSUME THAT THE SUS SCORES FOR A PARTICULAR SOFTWARE SYSTEM ARE LISTED BELOW: SO, WE CAN BE 95% CONFIDENT THAT THE POPULATION’S SUS SCORE FOR THIS SYSTEM IS BETWEEN 79.92 AND 89.37.

18 SAURO/LEWIS 125 CONFIDENCE INTERVALS CONFIDENCE INTERVAL FOR TASK TIMES
TASK TIME DATA TENDS TO BE POSITIVELY SKEWED BECAUSE...
(A) THERE’S A NATURAL LOWER BOUND FOR HOW LONG IT TAKES TO PERFORM A TASK.
(B) SOME USERS WILL TAKE AN EXCEPTIONALLY LONG TIME TO COMPLETE A TASK.
UNDER THESE CIRCUMSTANCES, IT IS MORE INFORMATIVE TO USE THE GEOMETRIC MEAN (I.E., THE EXPONENTIAL OF THE ARITHMETIC MEAN OF THE LOGARITHM OF THE DATA) INSTEAD OF THE ARITHMETIC MEAN.
[WORKED EXAMPLE PANELS: 95% CONFIDENCE INTERVAL FOR THE POPULATION MEAN (USING THE ARITHMETIC MEAN OF THE DATA); 95% CONFIDENCE INTERVAL FOR THE POPULATION MEAN (USING THE GEOMETRIC MEAN OF THE DATA); 95% CONFIDENCE INTERVAL FOR THE LOGARITHM OF THE POPULATION MEAN (USING THE ARITHMETIC MEAN OF THE LOGARITHM OF THE DATA)]
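The log-transform approach can be sketched as follows. The slide's own task-time data did not survive extraction, so the five times below are hypothetical, and the critical value 2.776 (two-sided 95%, 4 degrees of freedom) is taken from a standard t table rather than from the slides.

```python
import math
from statistics import mean, stdev

def geometric_mean_ci(times, t_crit):
    """Geometric mean of task times with a t-based interval built on the log scale."""
    logs = [math.log(t) for t in times]
    center = mean(logs)
    std_err = stdev(logs) / math.sqrt(len(logs))
    # Build the interval on the log scale, then exponentiate back to seconds.
    return (math.exp(center),
            math.exp(center - t_crit * std_err),
            math.exp(center + t_crit * std_err))

times = [40, 36, 53, 56, 110]  # hypothetical task times in seconds
gm, low, high = geometric_mean_ci(times, t_crit=2.776)
print(f"arithmetic mean {mean(times):.1f}, geometric mean {gm:.1f}, "
      f"95% CI {low:.1f} to {high:.1f}")
```

With a long right tail such as the 110-second outlier, the geometric mean sits below the arithmetic mean and the back-transformed interval is asymmetric around it.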

19 SAURO/LEWIS126BENCHMARKS COMPARING TO BENCHMARKS FREQUENTLY, THE GOAL WHEN TESTING A SOFTWARE INTERFACE IS NOT DETERMINING A RELIABLE CONFIDENCE INTERVAL, BUT TESTING AGAINST A PARTICULAR GOAL OR BENCHMARK. FOR INSTANCE, YOU MIGHT WANT TO DETERMINE THAT A CERTAIN MINIMUM COMPLETION RATE WILL OCCUR, THAT A SPECIFIC MAXIMUM TASK TIME IS NOT EXCEEDED, OR THAT A PARTICULAR SATISFACTION SCORE WAS ACHIEVED.

20 SAURO/LEWIS 127 BENCHMARKS TWO-TAILED & ONE-TAILED TESTS
WHEN BOTH SIDES OF A CONFIDENCE INTERVAL MATTER, A TWO-TAILED TEST IS PERFORMED, WHERE THE CONFIDENCE INTERVAL AT CONFIDENCE LEVEL $c$ IS SYMMETRICAL AND THE PROBABILITIES OF VALUES BEING ABOVE THE UPPER LIMIT AND OF VALUES BEING BELOW THE LOWER LIMIT ARE EACH $(1-c)/2$.
WHEN TESTING AGAINST A BENCHMARK, ONLY ONE SIDE OF THE OUTCOME MATTERS, SO A ONE-TAILED TEST IS USED, WHICH MEANS THAT THE SIGNIFICANCE LEVEL $\alpha = 1-c$ MUST BE DOUBLED WHEN LOOKING UP A CRITICAL VALUE IN A TWO-TAILED TABLE (E.G., A ONE-TAILED TEST AT 95% CONFIDENCE USES THE TWO-TAILED 90% CRITICAL VALUE).

21 SAURO/LEWIS 128 BENCHMARKS BINOMIAL DISTRIBUTION
THE BINOMIAL DISTRIBUTION IS THE DISCRETE PROBABILITY DISTRIBUTION OF THE NUMBER OF SUCCESSES IN A SEQUENCE OF INDEPENDENT YES/NO EXPERIMENTS. IF THE PROBABILITY OF A SUCCESS IS p, THEN THE PROBABILITY OF GETTING k SUCCESSES IN n ATTEMPTS IS:
$P(k) = \dbinom{n}{k} p^k (1-p)^{n-k}$

22 SAURO/LEWIS 129 BENCHMARKS BENCHMARKED COMPLETION RATES
FOR REASONABLY SMALL (LESS THAN 30) SAMPLE SIZES, THE BINOMIAL DISTRIBUTION SHOULD BE USED TO DETERMINE WHETHER A BENCHMARK IS MET. THE CONFIDENCE THAT THE BENCHMARK IS MET IS:
$1 - \sum_{k=b}^{n} \dbinom{n}{k} p^k (1-p)^{n-k}$
WHERE $n$ IS THE SAMPLE SIZE, $b$ IS THE NUMBER OF USERS WHO COMPLETED THE TASK, AND $p$ IS THE BENCHMARK COMPLETION RATE.
FOR EXAMPLE, IF b=26 OUT OF n=29 USERS SUCCESSFULLY COMPLETE A CERTAIN TASK (A TEST COMPLETION RATE OF 90%), THEN THE PROBABILITY IS 86% THAT THE POPULATION COMPLETION RATE IS AT LEAST p=80%.
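The slide's Excel calculation is not recoverable, but the 86% figure can be reproduced with an exact binomial tail. This sketch assumes the slide computed one minus the probability of seeing 26 or more successes when the true rate equals the benchmark; the function names are illustrative.

```python
from math import comb

def binomial_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

def benchmark_confidence(successes, n, benchmark):
    """1 minus the chance of a result this good if the true rate equals the benchmark."""
    p_value = sum(binomial_pmf(k, n, benchmark) for k in range(successes, n + 1))
    return 1 - p_value

print(f"{benchmark_confidence(26, 29, 0.80):.0%}")
```

Under that assumption the tail probability is about 0.14, so the confidence rounds to the 86% stated on the slide.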

23 SAURO/LEWIS 130 BENCHMARKS LARGE-SAMPLE BENCHMARKED COMPLETION RATES
FOR LARGER SAMPLE SIZES (AT LEAST 15 SUCCESSES AND AT LEAST 15 FAILURES), A NORMAL APPROXIMATION TO THE BINOMIAL DISTRIBUTION SHOULD BE USED TO DETERMINE WHETHER A BENCHMARK IS MET:
$z = \dfrac{\hat{p} - p}{\sqrt{p(1-p)/n}}$
WHERE $\hat{p}$ IS THE SAMPLE COMPLETION RATE, $p$ IS THE BENCHMARK COMPLETION RATE, AND $n$ IS THE SAMPLE SIZE.
FOR EXAMPLE, IF 139 OF 173 VISITORS TO A WEB SITE COMPLETED A SHIPPING ADDRESS FORM CORRECTLY, THEN THERE IS A 95% CHANCE THAT AT LEAST 75% OF ALL USERS WILL BE ABLE TO DO SO.
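The large-sample example can be checked with the standard library's normal distribution. This is a sketch assuming the usual one-sample z-test for a proportion, with the standard error computed from the benchmark rate; it stands in for the Excel calculation, which is not visible in the transcript.

```python
import math
from statistics import NormalDist

def large_sample_benchmark_confidence(successes, n, benchmark):
    """One-tailed confidence that the population completion rate exceeds the benchmark."""
    p_hat = successes / n
    std_err = math.sqrt(benchmark * (1 - benchmark) / n)
    z = (p_hat - benchmark) / std_err
    return NormalDist().cdf(z)

confidence = large_sample_benchmark_confidence(139, 173, 0.75)
print(f"{confidence:.1%}")
```

For 139 of 173 against a 75% benchmark, z is about 1.62 and the confidence comes out just under 95%, consistent with the slide.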

24 SAURO/LEWIS131BENCHMARKS BENCHMARKED SATISFACTION SCORES TO COMPARE AN INTERFACE’S SATISFACTION SCORE (E.G., FROM A SUS QUESTIONNAIRE) TO A BENCHMARK, THE T-DISTRIBUTION IS UTILIZED. FOR EXAMPLE, RECENT CPR TRAINING APPS HAVE AVERAGED SUS SCORES OF 70.7. A SAMPLE OF 14 USERS TESTED A BETA VERSION OF A NEW CPR TRAINING APPLICATION AND GAVE IT A MEAN SUS SCORE OF 73, WITH A STANDARD DEVIATION OF 11.9. A ONE-TAILED T-TEST WITH 13 DEGREES OF FREEDOM AND A T-VALUE OF 0.723 INDICATES THAT WE CAN BE 76% CONFIDENT THAT THE NEW APP HAS AN AVERAGE GREATER THAN THE INDUSTRY AVERAGE OF 70.7.
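The CPR training app numbers can be verified without a statistics package. This is a minimal sketch: the helper `t_cdf` is not a library call but a hand-rolled numerical integration of the t density using Simpson's rule, written here only so that the example stays within the standard library.

```python
import math

def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    coef = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return coef * (1 + x * x / df) ** (-(df + 1) / 2)

def t_cdf(x, df, steps=2000):
    """P(T <= x), integrating the density from 0 to |x| with Simpson's rule."""
    upper = abs(x)
    h = upper / steps
    total = t_pdf(0.0, df) + t_pdf(upper, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    area = total * h / 3
    return 0.5 + area if x >= 0 else 0.5 - area

def benchmark_t_test(sample_mean, sample_sd, n, benchmark):
    """Return (t, one-tailed confidence that the population mean exceeds the benchmark)."""
    t = (sample_mean - benchmark) / (sample_sd / math.sqrt(n))
    return t, t_cdf(t, n - 1)

t, confidence = benchmark_t_test(73, 11.9, 14, 70.7)
print(f"t = {t:.3f}, confidence = {confidence:.0%}")
```

The computed t-value agrees with the slide's 0.723, and the one-tailed confidence is close to the stated 76%.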

25 SAURO/LEWIS132BENCHMARKS BENCHMARKED TASK TIMES TO COMPENSATE FOR THE POSITIVE SKEWNESS OF THE TIME DATA, THE T-TEST FOR TASK TIMES IS PERFORMED WITH LOGARITHMS. SO, FOR EXAMPLE, THERE IS A 56% PROBABILITY THAT THE POPULATION’S MEAN TASK TIME WOULD BE LESS THAN TWO MINUTES.

26 SAURO/LEWIS133COMPARISONS USABILITY COMPARISON TESTS ASSUME THAT TWO EARLY PROTOTYPES OF AN INTERFACE HAVE BEEN DEVELOPED, ONE USING LEFT NAVIGATION AND THE OTHER USING TOP NAVIGATION. IF INDIVIDUALS IN ONE SAMPLE POPULATION EXPERIENCE NOTICEABLY FEWER NAVIGATION PROBLEMS THAN INDIVIDUALS IN THE OTHER SAMPLE POPULATION, THEN WE WOULD HAVE EVIDENCE THAT ONE APPROACH IS MORE EFFECTIVE THAN THE OTHER. HOWEVER, IT IS ALSO POSSIBLE THAT THE DIFFERENCE BETWEEN THE TWO SAMPLE POPULATIONS IS SIMPLY SAMPLING ERROR.

27 SAURO/LEWIS 134 COMPARISONS WITHIN-SUBJECTS TEST
HYPOTHESIS: THE CALENDAR BUTTON ON THE LEFT NAVIGATION INTERFACE IS FASTER TO ACCESS THAN IT IS ON THE TOP NAVIGATION INTERFACE.
DESIGN: WITHIN-SUBJECTS, WITH RANDOMIZED ORDER OF ASSIGNMENT OF INTERFACE TO SUBJECTS.
LEFT INTERFACE | TOP INTERFACE
625 MS | 647 MS
480 MS | 503 MS
621 MS | 559 MS
633 MS | 586 MS
694 MS | 458 MS
599 MS | 380 MS
505 MS | 477 MS
527 MS | 409 MS
651 MS | 589 MS
505 MS | 472 MS
BASED ON THE TABULATED DATA, THE TOP INTERFACE SEEMS TO BE FASTER (508 MS ON AVERAGE) THAN THE LEFT INTERFACE (584 MS), BUT GIVEN THE NOISE IN THE MEASUREMENTS (I.E., SOME OF THE LEFT INTERFACE TRIALS ARE ACTUALLY FASTER THAN SOME OF THE TOP INTERFACE TRIALS), HOW DO WE KNOW WHETHER THE TOP INTERFACE IS REALLY FASTER?
THIS IS THE FUNDAMENTAL QUESTION UNDERLYING STATISTICAL ANALYSIS: ESTIMATING THE AMOUNT OF EVIDENCE IN SUPPORT OF A HYPOTHESIS, EVEN IN THE PRESENCE OF NOISE.

28 SAURO/LEWIS 135 COMPARISONS WITHIN-SUBJECTS TEST ANALYSIS
THE p-VALUE FOR THE TWO-TAILED t-TEST IS 0.025, WHICH MEANS THAT A DIFFERENCE AT LEAST AS LARGE AS THE ONE OBSERVED BETWEEN THE LEFT AND TOP INTERFACES WOULD ARISE ONLY 2.5% OF THE TIME PURELY BY CHANCE IF THE INTERFACES WERE EQUALLY FAST, LEADING TO THE CONCLUSION THAT THE DIFFERENCE BETWEEN THE INTERFACES IS STATISTICALLY SIGNIFICANT.
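The within-subjects p-value can be reproduced from the ten left/top timing pairs given with the hypothesis. This sketch assumes a paired t-test on the per-subject differences; `t_cdf` is a hand-written Simpson's-rule integration of the t density, not a library function.

```python
import math
from statistics import mean, stdev

LEFT = [625, 480, 621, 633, 694, 599, 505, 527, 651, 505]  # ms, from the slide
TOP = [647, 503, 559, 586, 458, 380, 477, 409, 589, 472]   # ms, from the slide

def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    coef = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return coef * (1 + x * x / df) ** (-(df + 1) / 2)

def t_cdf(x, df, steps=2000):
    """P(T <= x), integrating the density from 0 to |x| with Simpson's rule."""
    upper = abs(x)
    h = upper / steps
    total = t_pdf(0.0, df) + t_pdf(upper, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    area = total * h / 3
    return 0.5 + area if x >= 0 else 0.5 - area

def paired_t_test(a, b):
    """Two-tailed paired t-test on the per-subject differences a[i] - b[i]."""
    diffs = [x - y for x, y in zip(a, b)]
    n = len(diffs)
    t = mean(diffs) / (stdev(diffs) / math.sqrt(n))
    p = 2 * (1 - t_cdf(abs(t), n - 1))
    return t, p

t, p = paired_t_test(LEFT, TOP)
print(f"t = {t:.3f}, p = {p:.3f}")
```

The mean difference is 76 ms, and the resulting two-tailed p-value lands at about 0.025, matching the slide.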

29 SAURO/LEWIS 136 COMPARISONS BETWEEN-SUBJECTS TEST
AN INDEPENDENT-MEASURES OR BETWEEN-SUBJECTS EXPERIMENT DESIGN ALLOWS RESEARCHERS TO EVALUATE THE MEAN DIFFERENCE BETWEEN TWO POPULATIONS USING DATA FROM TWO SEPARATE SAMPLES. AS WITH ALL HYPOTHESIS TESTS, THE GENERAL PURPOSE OF THE INDEPENDENT-MEASURES t-TEST IS TO DETERMINE WHETHER THE SAMPLE MEAN DIFFERENCE OBTAINED IN A RESEARCH STUDY INDICATES A REAL MEAN DIFFERENCE BETWEEN THE TWO POPULATIONS OR WHETHER THE OBTAINED DIFFERENCE IS SIMPLY THE RESULT OF SAMPLING ERROR.

30 SAURO/LEWIS 137 COMPARISONS BETWEEN-SUBJECTS TEST ANALYSIS
IF THE SAME DATA HAD BEEN ACCUMULATED FOR A BETWEEN-SUBJECTS EXPERIMENT, THEN THE p-VALUE FOR THE TWO-TAILED t-TEST WOULD BE 0.047, WHICH MEANS THAT A DIFFERENCE AT LEAST AS LARGE AS THE ONE OBSERVED BETWEEN THE LEFT INTERFACE AND TOP INTERFACE WOULD ARISE ONLY 4.7% OF THE TIME PURELY BY CHANCE IF THE INTERFACES WERE EQUALLY FAST.
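Treating the same twenty timings as two independent groups gives the larger p-value quoted above. This sketch assumes a pooled-variance (Student) t-test, since the slide does not say which variant was used; `t_cdf` is again a hand-written numerical integration rather than a library call.

```python
import math
from statistics import mean, variance

LEFT = [625, 480, 621, 633, 694, 599, 505, 527, 651, 505]  # ms, from the slide
TOP = [647, 503, 559, 586, 458, 380, 477, 409, 589, 472]   # ms, from the slide

def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    coef = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return coef * (1 + x * x / df) ** (-(df + 1) / 2)

def t_cdf(x, df, steps=2000):
    """P(T <= x), integrating the density from 0 to |x| with Simpson's rule."""
    upper = abs(x)
    h = upper / steps
    total = t_pdf(0.0, df) + t_pdf(upper, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    area = total * h / 3
    return 0.5 + area if x >= 0 else 0.5 - area

def independent_t_test(a, b):
    """Two-tailed pooled-variance t-test for two independent samples."""
    n_a, n_b = len(a), len(b)
    df = n_a + n_b - 2
    pooled_var = ((n_a - 1) * variance(a) + (n_b - 1) * variance(b)) / df
    std_err = math.sqrt(pooled_var * (1 / n_a + 1 / n_b))
    t = (mean(a) - mean(b)) / std_err
    p = 2 * (1 - t_cdf(abs(t), df))
    return t, p

t, p = independent_t_test(LEFT, TOP)
print(f"t = {t:.3f}, p = {p:.3f}")
```

The same 76 ms difference now yields a weaker result (p near 0.047) because between-subject variability is no longer cancelled out by pairing.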

31 SAURO/LEWIS138COMPARISONS WEB-SCALE USABILITY RESEARCH THE WEB ENABLES EXPERIMENTS ON A LARGER SCALE, FOR LESS TIME AND MONEY, THAN EVER BEFORE. WEB SITES WITH MILLIONS OF VISITORS (E.G., GOOGLE, AMAZON, FACEBOOK) ARE CAPABLE OF ANSWERING QUESTIONS ABOUT THE DESIGN, USABILITY, AND OVERALL VALUE OF NEW FEATURES SIMPLY BY DEPLOYING THEM AND WATCHING WHAT HAPPENS. CONSIDER THESE TWO VERSIONS OF A WEB PAGE, FOR A SITE THAT SELLS CUSTOMIZED REPORTS ABOUT SEX OFFENDERS LIVING IN YOUR AREA. THE GOAL OF THE PAGE IS TO GET VISITORS TO FILL OUT THE YELLOW FORM AND BUY THE REPORT. BOTH VERSIONS CONTAIN THE SAME INFO; THEY JUST PRESENT IT IN DIFFERENT WAYS. IN FACT, THE VERSION ON THE RIGHT IS A REVISED DESIGN, WHICH WAS INTENDED TO IMPROVE THE DESIGN BY USING TWO FAT COLUMNS, SO THAT MORE CONTENT COULD BE BROUGHT “ABOVE THE FOLD” AND THE USER WOULDN’T HAVE TO DO AS MUCH SCROLLING. WHICH DESIGN IS MORE EFFECTIVE FOR THE END GOAL OF THE WEB SITE – CONVERTING VISITORS INTO SALES?

32 SAURO/LEWIS 139 COMPARISONS A/B TESTING
TO DETERMINE WHICH DESIGN WAS MORE EFFECTIVE, THE DESIGNERS CONDUCTED AN EXPERIMENT: HALF OF THE VISITORS TO THEIR WEB SITE WERE RANDOMLY ASSIGNED TO SEE ONE VERSION OF THE PAGE, AND THE OTHER HALF SAW THE OTHER VERSION. THE USERS WERE THEN TRACKED TO SEE HOW MANY IN EACH GROUP ACTUALLY FILLED OUT THE FORM TO BUY THE REPORT.
IN THIS CASE, THE REVISED DESIGN ACTUALLY FAILED: 244 USERS BOUGHT THE REPORT FROM THE ORIGINAL VERSION, BUT ONLY 114 USERS BOUGHT THE REPORT FROM THE REVISED VERSION.
THE IMPORTANT POINT HERE IS NOT WHICH ASPECTS OF THE DESIGN CAUSED THE FAILURE (WHICH IS UNKNOWN, SINCE SEVERAL THINGS CHANGED IN THE REDESIGN); THE POINT IS THAT THE WEB SITE CONDUCTED A RANDOMIZED EXPERIMENT AND COLLECTED DATA THAT ACTUALLY TESTED THE REVISION. THIS KIND OF EXPERIMENT IS OFTEN CALLED AN A/B TEST.

33 SAURO/LEWIS140COMPARISONS ANOTHER A/B TESTING EXAMPLE IN THIS EXAMPLE, A SHOPPING CART FOR A WEB SITE, A NUMBER OF CHANGES HAVE BEEN MADE BETWEEN THE ORIGINAL VERSION (LEFT) AND THE REVISED VERSION (RIGHT). TESTING THIS REDESIGN WITH AN A/B TEST PRODUCED A STARTLING DIFFERENCE IN REVENUE: USERS WHO SAW THE CART ON THE LEFT SPENT TEN TIMES AS MUCH AS USERS WHO SAW THE CART ON THE RIGHT! THE DESIGNERS OF THIS SITE EXPLORED FURTHER AND DISCOVERED THAT THE PROBLEM WAS THE “COUPON CODE” BOX ON THE RIGHT, WHICH LED USERS TO WONDER WHETHER THEY WERE PAYING TOO MUCH IF THEY DIDN’T HAVE A COUPON, AND ABANDON THE CART. WITHOUT THE COUPON CODE BOX, THE REVISED VERSION ACTUALLY EARNED MORE REVENUE THAN THE ORIGINAL VERSION.

34 SAURO/LEWIS141COMPARISONS MICROSOFT HELP A/B TESTING EXAMPLE AT THE END OF EVERY PAGE IN MICROSOFT’S ONLINE HELP IS A QUESTION ASKING FOR FEEDBACK ABOUT THE HELP ARTICLE; IF THE USER PRESSES ANY OF THE BUTTONS, IT DISPLAYS A TEXTBOX ASKING FOR MORE DETAILS.

35 SAURO/LEWIS 142 COMPARISONS REVISING MICROSOFT HELP
THE PROPOSED REVISION TO THIS INTERFACE AT LEFT WAS MOTIVATED BY TWO ARGUMENTS: (1) IT GIVES MORE FINE-GRAINED QUANTITATIVE FEEDBACK THAN THE YES/NO QUESTION; AND (2) IT IS MORE EFFICIENT FOR THE USER, BECAUSE IT TAKES ONLY ONE CLICK RATHER THAN THE MINIMUM TWO CLICKS OF THE ORIGINAL INTERFACE.
WHEN THESE TWO INTERFACES WERE A/B TESTED ON MICROSOFT’S SITE, HOWEVER, IT TURNED OUT THAT THE 5-STAR INTERFACE PRODUCED AN ORDER OF MAGNITUDE FEWER RATINGS, AND MOST OF THEM WERE EITHER 1 STAR OR 5 STARS, SO THEY WEREN’T EVEN FINE-GRAINED.

36 SAURO/LEWIS 143 COMPARISONS WEB-BASED A/B TESTING
IN THE CONTEXT OF USABILITY STUDIES, A/B TESTING IS SIMILAR TO CONTROLLED EXPERIMENTS.
CHOOSE AN INDEPENDENT VARIABLE WITH TWO CONDITIONS. (MORE CONDITIONS ARE OKAY, E.G., A/B/C TESTING)
CHOOSE DEPENDENT VARIABLE(S) TO MEASURE. (E.G., TIME, ERRORS, SUCCESS RATE, REVENUE)
DURING A TESTING INTERVAL, RANDOMLY ASSIGN ARRIVING USERS TO ONE CONDITION OR THE OTHER. (THE WEB SITE ITSELF DOES THIS!)
DO STATISTICAL TESTING.
A/B TESTING OCCURS WITH REAL USERS ON A DEPLOYED SYSTEM, SO BUGS CAN HAVE REAL CONSEQUENCES. RATHER THAN STARTING WITH A 50/50 SPLIT BETWEEN TEST CONDITIONS, IT’S SAFER TO RAMP UP SLOWLY BY STARTING WITH 99.9/0.1, MOVING TO 99/1, ETC.
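One common way to implement the assignment and ramp-up steps is to hash each user ID into a stable bucket. The slides only say that users are assigned randomly, so the hashing scheme, the function `assign_condition`, and the experiment label below are all illustrative assumptions.

```python
import hashlib

def assign_condition(user_id, treatment_fraction, experiment="menu-test"):
    """Deterministically map a user to condition 'A' or 'B'.

    The hash gives each user a fixed pseudo-random bucket in [0, 1);
    users whose bucket falls below treatment_fraction see condition B.
    """
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    bucket = int(digest[:8], 16) / 0x100000000
    return "B" if bucket < treatment_fraction else "A"

# Ramp-up schedule from the slide: 99.9/0.1, then 99/1, then 50/50.
for fraction in (0.001, 0.01, 0.5):
    in_b = sum(assign_condition(uid, fraction) == "B" for uid in range(100_000))
    print(f"treatment fraction {fraction}: {in_b} of 100000 users in B")
```

Because a user's bucket never changes, raising the treatment fraction only moves additional users into B; nobody who already saw the new design is switched back.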

37 SAURO/LEWIS144COMPARISONS A/A TESTING TO TEST THE INFRASTRUCTURE OF AN EXPERIMENT, A/A TESTING DIVIDES USERS INTO TWO GROUPS WITH THE SAME CONDITION FOR EACH GROUP (I.E., A/B TESTING WITH A SINGLE CONDITION FOR BOTH GROUPS). A/A TESTS ILLUSTRATE HOW DATA FLUCTUATE, WITH EXPERIMENTAL RESULTS THAT MIGHT SEEM SUBSTANTIAL, BUT WHICH ARE NOT STATISTICALLY SIGNIFICANT (IF THE USERS ARE SPLIT CORRECTLY AND THERE ARE NO POTENTIALLY MISLEADING BIASES IN THE EXPERIMENT).

38 SAURO/LEWIS 145 COMPARISONS ISSUES WITH A/B TESTING
THE WEB-SCALE NATURE OF A/B TESTING LEADS TO SEVERAL POTENTIAL ISSUES THAT ARE NOT COMMONLY ENCOUNTERED IN SMALLER-SCALE LAB EXPERIMENTS. REMOTE USABILITY TESTING, WHERE THE USER’S BEHAVIOR IS ACTUALLY MONITORED, IS STILL IN THE EARLY STAGES.
REMOTE SYNCHRONOUS TESTING, USING WEBCAMS, HAS BEEN SHOWN TO BE JUST AS EFFECTIVE AS FACE-TO-FACE TESTING.
REMOTE ASYNCHRONOUS TESTING, WHERE USERS REPORT CRITICAL USABILITY PROBLEMS THEMSELVES, TENDS TO SLOW THE USERS DOWN TREMENDOUSLY AND RESULT IN FEWER REPORTED ERRORS.
AN ALTERNATIVE REMOTE ASYNCHRONOUS TESTING APPROACH, WITH INSTRUMENTATION INSTALLED ON THE WEB SITE TO TRACK EACH USER’S ACTIONS, SHOWS THE DETAILS OF THE INTERACTION, BUT REVEALS LITTLE ABOUT THE USER’S GOALS OR INTENTIONS.

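39 SAURO/LEWIS 146 SAMPLE SIZES DETERMINING SAMPLE SIZE
WHEN CONDUCTING A USABILITY TEST, HOW LARGE SHOULD YOU MAKE THE SAMPLE SIZE? THE MARGIN OF ERROR OF A t-BASED CONFIDENCE INTERVAL IS $d = t \cdot s / \sqrt{n}$, WHICH REARRANGES TO:
$n = \dfrac{t^2 s^2}{d^2}$
WHERE $s$ IS THE ESTIMATED STANDARD DEVIATION, $d$ IS THE DESIRED MARGIN OF ERROR, AND $t$ IS THE CRITICAL VALUE FOR THE DESIRED CONFIDENCE LEVEL.
UNLIKE THE z-VALUE, HOWEVER, WHICH USES A NORMAL DISTRIBUTION, ESTIMATING THE t-VALUE COMPLICATES MATTERS BY ALSO BEING DEPENDENT ON THE DEGREES OF FREEDOM (FOR A ONE-SAMPLE t-TEST, df = n-1). TO OVERCOME THIS PROBLEM, AN ITERATIVE PROCEDURE IS SUGGESTED…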

40 SAURO/LEWIS 147 SAMPLE SIZES DETERMINING SAMPLE SIZE: ITERATIVE PROCEDURE
1. USE THE Z-SCORE WITH THE DESIRED LEVEL OF CONFIDENCE (FROM A UNIT NORMAL TABLE) AS AN INITIAL ESTIMATE OF THE T-VALUE.
2. SOLVE THE ABOVE EQUATION FOR N.
3. USE A T-DISTRIBUTION TABLE TO FIND THE T-SCORE FOR THAT VALUE OF N (WITH DF = N-1).
4. RECALCULATE N BY USING THIS NEW T-VALUE IN THE EQUATION ABOVE.
5. REVISE THE T-SCORE FROM THE T-DISTRIBUTION TABLE.
6. CONTINUE THIS ITERATION UNTIL TWO CONSECUTIVE CYCLES YIELD THE SAME N VALUE.

41 SAURO/LEWIS 148 SAMPLE SIZES SAMPLE SIZE EXAMPLE
ASSUME THAT YOU HAVE BEEN USING A 100-POINT ITEM AS A POST-TASK MEASURE OF EASE-OF-USE IN PAST USABILITY TESTS. ONE OF THE TASKS THAT YOU ROUTINELY CONDUCT IS SOFTWARE INSTALLATION. FOR THE MOST RECENT USABILITY STUDY OF THE CURRENT VERSION OF THE SOFTWARE PACKAGE, THE VARIANCE OF THIS MEASUREMENT ON THE 100-POINT SCALE IS 25 (I.E., s = 5). YOU’RE PLANNING YOUR FIRST USABILITY STUDY WITH A NEW VERSION OF THE SOFTWARE, AND YOU WANT TO GET AN ESTIMATE OF THIS MEASURE WITH 90% CONFIDENCE AND TO BE WITHIN ±2.5 POINTS OF THE TRUE VALUE. LET’S CALCULATE HOW MANY PARTICIPANTS YOU NEED TO RUN IN THE STUDY.

42 SAURO/LEWIS 149 SAMPLE SIZES SAMPLE SIZE EXAMPLE (CONTINUED)
FOR TWO-SIDED TESTING WITH A 90% CONFIDENCE INTERVAL (I.E., 5% IN EACH TAIL), A UNIT NORMAL TABLE INDICATES THAT A z-VALUE OF 1.645 WOULD MAKE A GOOD FIRST ESTIMATE FOR THE t-VALUE. USING THE ABOVE FORMULA, THIS YIELDS AN n-VALUE OF 10.8241, WHICH ROUNDS UP TO 11.
SWITCHING TO A t-DISTRIBUTION TABLE, n = 11 (I.E., df = 10) GIVES US A t-VALUE OF 1.812 FOR A 2-TAILED 90% CONFIDENCE INTERVAL, WHICH PRODUCES AN n-VALUE OF 13.133376 IN THE FORMULA, ROUNDING UP TO 14.
USING n = 14 (df = 13) YIELDS A t-VALUE OF 1.771, YIELDING AN n-VALUE OF 12.545764, ROUNDING UP TO 13.
USING n = 13 (df = 12) YIELDS A t-VALUE OF 1.782, YIELDING AN n-VALUE OF 12.702096, AGAIN ROUNDING UP TO 13.
THEREFORE, THE FINAL SAMPLE SIZE ESTIMATE FOR THIS STUDY IS 13 PARTICIPANTS.
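The iteration can be written as a short loop. This is a sketch assuming the sample size formula n = t²s²/d² implied by the worked numbers; the lookup table holds only the three critical values quoted in the example, so the function is limited to this scenario unless a fuller t table is supplied.

```python
import math

# Critical values for a two-tailed 90% interval, keyed by degrees of freedom.
# Only the entries quoted in the worked example are included.
T_TABLE_90 = {10: 1.812, 12: 1.782, 13: 1.771}

def sample_size(s, d, z_start, t_table, max_iter=20):
    """Iterate n = ceil(t^2 * s^2 / d^2) until two consecutive cycles agree."""
    critical = z_start  # step 1: start from the z-score
    previous = None
    for _ in range(max_iter):
        n = math.ceil(critical**2 * s**2 / d**2)
        if n == previous:
            return n
        previous = n
        critical = t_table[n - 1]  # look up t for df = n - 1
    raise RuntimeError("sample size did not converge")

print(sample_size(s=5, d=2.5, z_start=1.645, t_table=T_TABLE_90))
```

The loop visits n = 11, 14, 13, and 13 in turn, stopping at 13 participants as in the example.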

43 SAURO/LEWIS150SAMPLE SIZES WEAK ARGUMENTS FOR LARGE SAMPLES “IF THE POPULATION IS LARGE, THEN THE SAMPLE NEEDS TO BE LARGE.” THE VARIANCE IN STATISTICAL SAMPLING IS DETERMINED BY THE SAMPLE SIZE, NOT THE SIZE OF THE OVERALL POPULATION. THE EVALUATION OF A DESIGN ELEMENT’S QUALITY IS INDEPENDENT OF HOW MANY PEOPLE ARE GOING TO USE IT. “THE MORE FEATURES IN THE INTERFACE, THE LARGER THE SAMPLE SIZE.” WHEN THE INTERFACE IS LOADED WITH FEATURES, MORE TESTS ARE NEEDED, NOT MORE USERS IN EACH TEST. TEST SUBJECTS WILL BE OVERWHELMED IF ASKED TO EVALUATE TOO MANY FEATURES. “THE INTERFACE IS BEING DESIGNED TO ACCOMMODATE MANY TARGET AUDIENCES.” THIS ONLY REQUIRES LARGER SAMPLE SIZES IF THE DIFFERENT TARGET AUDIENCES WILL USE THE INTERFACE IN VERY DIFFERENT WAYS (E.G., BUYERS VS. SELLERS, TEACHERS VS. STUDENTS, DOCTORS VS. PATIENTS).

44 SAURO/LEWIS151USABILITY QUESTIONNAIRES USING STANDARDIZED QUESTIONNAIRES FOR USABILITY STUDIES OFFERS SEVERAL ADVANTAGES. OBJECTIVITY USABILITY PRACTITIONERS ARE ABLE TO INDEPENDENTLY VERIFY THE MEASUREMENT STATEMENTS OF OTHERS. REPLICABILITY STUDIES CAN EASILY BE REPLICATED, IMPROVING THEIR RELIABILITY. QUANTIFICATION RESULTS CAN BE REPORTED IN FINER DETAIL AND MORE OBJECTIVITY. ECONOMY DEVELOPING STANDARDIZED MEASURES TAKES WORK, BUT REUSING THEM IS INEXPENSIVE. COMMUNICATION STANDARDIZED MEASURES FACILITATE COMMUNICATION BETWEEN PRACTITIONERS.

45 SAURO/LEWIS 152 USABILITY QUESTIONNAIRES POST-STUDY USABILITY QUESTIONNAIRES
THE PSSUQ IS A 16-ITEM SURVEY THAT MEASURES USERS’ PERCEIVED SATISFACTION WITH A PRODUCT OR SYSTEM.
The Post-Study System Usability Questionnaire (Version 3)
Response scale: 1 (Strongly Agree) to 7 (Strongly Disagree), or NA
1. Overall, I am satisfied with how easy it is to use this system.
2. It was simple to use this system.
3. I was able to complete the tasks and scenarios quickly using this system.
4. I felt comfortable using this system.
5. It was easy to learn to use this system.
6. I believe I could become productive quickly using this system.
7. The system gave error messages that clearly told me how to fix problems.
8. Whenever I made a mistake using the system, I could recover easily and quickly.
9. The information (such as on-line help, on-screen messages, and other documentation) provided with this system was clear.
10. It was easy to find the information I needed.
11. The information was effective in helping me complete the tasks and scenarios.
12. The organization of information on the system screens was clear.
13. The interface of this system was pleasant.
14. I liked using the interface of this system.
15. This system has all the functions and capabilities I expect it to have.
16. Overall, I am satisfied with this system.
AN OVERALL SATISFACTION SCORE IS OBTAINED BY AVERAGING THE SUB-SCALES OF SYSTEM QUALITY (ITEMS 1-6), INFORMATION QUALITY (ITEMS 7-12), AND INTERFACE QUALITY (ITEMS 13-16).
THE PSSUQ IS SUSCEPTIBLE TO “ACQUIESCENCE BIAS”, THE FACT THAT PEOPLE ARE MORE LIKELY TO AGREE WITH A STATEMENT THAN TO DISAGREE WITH IT.

46 SAURO/LEWIS 153 USABILITY QUESTIONNAIRES
INTERPRETING QUESTIONNAIRE RESULTS
PSYCHOMETRIC ANALYSIS OF USABILITY QUESTIONNAIRES IS CONDUCTED TO DETERMINE THEIR RELIABILITY, VALIDITY, AND SENSITIVITY.

PSSUQ-3 Norms (Means and 99% Confidence Intervals)
Lower Limit   Mean   Upper Limit   Item
2.60          2.85   3.09          1. Overall, I am satisfied with how easy it is to use this system.
2.45          2.69   2.93          2. It was simple to use this system.
2.86          3.16   3.45          3. I was able to complete the tasks and scenarios quickly using this system.
2.40          2.66   2.91          4. I felt comfortable using this system.
2.07          2.27   2.48          5. It was easy to learn to use this system.
2.54          2.86   3.17          6. I believe I could become productive quickly using this system.
3.36          3.70   4.05          7. The system gave error messages that clearly told me how to fix problems.
2.93          3.21   3.49          8. Whenever I made a mistake using the system, I could recover easily and quickly.
2.65          2.96   3.27          9. The information (such as on-line help, on-screen messages, and other documentation) provided with this system was clear.
2.79          3.09   3.38          10. It was easy to find the information I needed.
2.46          2.74   3.01          11. The information was effective in helping me complete the tasks and scenarios.
2.41          2.66   2.92          12. The organization of information on the system screens was clear.
2.06          2.28   2.49          13. The interface of this system was pleasant.
2.18          2.42   2.66          14. I liked using the interface of this system.
2.51          2.79   3.07          15. This system has all the functions and capabilities I expect it to have.
2.55          2.82   3.09          16. Overall, I am satisfied with this system.

FOR EXAMPLE, THE PSSUQ-3 NORMS ABOVE SHOW THAT MOST ITEMS HAVE MEANS THAT FALL BELOW THE SCALE MIDPOINT OF 4, INDICATING THAT THE SCALE MIDPOINT SHOULD NOT BE USED EXCLUSIVELY AS A REFERENCE FROM WHICH TO JUDGE PARTICIPANTS’ PERCEPTIONS ON USABILITY.
ALSO NOTE THE RELATIVELY POOR RATINGS ASSOCIATED WITH ITEM 7, WHICH REFLECT THE DIFFICULTY OF PROVIDING USABLE ERROR MESSAGES IN A SOFTWARE PRODUCT, AS WELL AS THE OVERALL DISSATISFACTION THAT SUCH ERRORS CAUSE IN USERS.
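
One way to use the norms as a reference point, instead of the scale midpoint, is sketched below. It is a rough screening aid under stated assumptions: a study's item mean is simply checked against the norm's 99% confidence limits, and the study's own sampling error is ignored. The names `PSSUQ3_NORMS` and `compare_to_norm` are hypothetical, and the values are copied from the norms table.

```python
# (lower limit, mean, upper limit) of the 99% confidence interval for each
# PSSUQ-3 item, taken from the norms table.
PSSUQ3_NORMS = {
    1: (2.60, 2.85, 3.09),
    2: (2.45, 2.69, 2.93),
    3: (2.86, 3.16, 3.45),
    4: (2.40, 2.66, 2.91),
    5: (2.07, 2.27, 2.48),
    6: (2.54, 2.86, 3.17),
    7: (3.36, 3.70, 4.05),
    8: (2.93, 3.21, 3.49),
    9: (2.65, 2.96, 3.27),
    10: (2.79, 3.09, 3.38),
    11: (2.46, 2.74, 3.01),
    12: (2.41, 2.66, 2.92),
    13: (2.06, 2.28, 2.49),
    14: (2.18, 2.42, 2.66),
    15: (2.51, 2.79, 3.07),
    16: (2.55, 2.82, 3.09),
}


def compare_to_norm(item, observed_mean):
    """Classify a study's item mean against the norm's confidence limits.

    Lower PSSUQ ratings are better (1 = strongly agree), so a mean below
    the lower limit is better than the norm.
    """
    lower, _, upper = PSSUQ3_NORMS[item]
    if observed_mean < lower:
        return "better than norm"
    if observed_mean > upper:
        return "worse than norm"
    return "within norm"


# A mean of 3.0 on item 5 sits below the midpoint of 4, yet above the norm's
# upper limit of 2.48, so it is worse than typical.
item5_result = compare_to_norm(5, 3.0)
# A mean of 3.9 on item 7 looks weak next to other items but is within norm.
item7_result = compare_to_norm(7, 3.9)
```

The item 5 case illustrates the point made on the slide: a rating that looks favorable against the midpoint can still be worse than what products typically receive.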

47 SAURO/LEWIS 154 USABILITY QUESTIONNAIRES
POST-TASK USABILITY QUESTIONNAIRES
WHILE POST-STUDY SURVEYS PROVIDE INFORMATION REGARDING THE GENERAL SATISFACTION OF USERS WITH AN INTERFACE, BRIEF MINI-SURVEYS OF USER REACTION TO SPECIFIC TASKS IN SPECIFIC SCENARIOS ARE OFTEN MORE USEFUL WHEN ATTEMPTING TO DIAGNOSE MORE FOCUSED PROBLEMS.

The After-Scenario Questionnaire (Version 1)
Each item is rated from 1 (Strongly Agree) to 7 (Strongly Disagree), or NA.
1. Overall, I am satisfied with the ease of completing the tasks in this scenario.
2. Overall, I am satisfied with the amount of time it took to complete the tasks in this scenario.
3. Overall, I am satisfied with the support information (online help, messages, documentation) when completing the tasks.

EXAMPLE SCENARIOS AND TASKS FOR OFFICE SOFTWARE SYSTEMS:
MAIL SCENARIO #1: OPEN A NOTE, SEND REPLY, DELETE NOTE
MAIL SCENARIO #2: OPEN A NOTE, FORWARD W/REPLY, SAVE RESPONSE, DELETE ORIGINAL
ADDRESS SCENARIO: CREATE NEW LISTING, MODIFY OLD LISTING, DELETE UNMODIFIED LISTING
FILE SCENARIO: RENAME FILE, COPY FILE, DELETE FILE
EDITOR SCENARIO: LOCATE DOCUMENT, EDIT DOCUMENT, OPEN NOTE, COPY NOTE’S TEXT INTO DOCUMENT, SAVE DOCUMENT, PRINT DOCUMENT

48 SAURO/LEWIS 155 USABILITY QUESTIONNAIRES
TRIANGULATION
ANY GIVEN RESEARCH METHOD HAS ADVANTAGES AND LIMITATIONS.
LAB EXPERIMENTS ARE ABSTRACT AND OBTRUSIVE, AND MAY NOT BE REPRESENTATIVE OF THE REAL WORLD.
FIELD STUDIES CANNOT BE CONTROLLED, SO IT’S HARD TO MAKE STRONG, PRECISE CLAIMS REGARDING COMPARATIVE USABILITY.
SELF-REPORTING (VIA QUESTIONNAIRES) IS OFTEN BIASED BY REACTIVITY (E.G., THE SUBJECTS TRY TO BE POLITE OR TO SAY WHAT THEY THINK THEY SHOULD SAY, INSTEAD OF THE TRUTH).
ONE WAY TO DEAL WITH THIS PROBLEM IS VIA TRIANGULATION, USING MULTIPLE METHODS TO TACKLE THE SAME RESEARCH QUESTION. IF THEY ALL SUPPORT YOUR CLAIM, THEN YOU HAVE MUCH STRONGER EVIDENCE, WITHOUT AS MANY BIASES.

