Quantitative Evaluation John Kelleher, IT Sligo
Definition
  Without measurement, success is undefined
  Formal usability study to compare two designs on measurable aspects
    time required
    number of errors
    effectiveness for achieving very specific tasks
Methods
  Performance/Predictive Modeling
    GOMS/KLM
    Fitts’ Law
  Controlled Experiments & Statistical Analysis
GOMS Model
  Card, Moran & Newell (1983)
  Models the knowledge and cognitive processes involved when users interact with systems.
  Goals: the particular state the user wants to achieve
  Operators: the cognitive processes and physical actions that need to be performed in order to attain those goals
  Methods: learned procedures for accomplishing the goals, consisting of the exact sequence of steps required
  Selection rules: used to determine which method to select when more than one is available for a given stage of a task
GOMS: Example of deleting word in MS Word
  Goal: delete a word in a sentence
  Method for accomplishing goal of deleting a word using menu option:
    Step 1: Recall that word to be deleted has to be highlighted
    Step 2: Recall that command is ‘cut’
    Step 3: Recall that command ‘cut’ is in edit menu
    Step 4: Accomplish goal of selecting and executing the ‘cut’ command
    Step 5: Return with goal accomplished
GOMS: Example of deleting word in MS Word
  Method for accomplishing goal of deleting a word using delete key:
    Step 1: Recall where to position cursor in relation to word to be deleted
    Step 2: Recall which key is delete key
    Step 3: Press ‘delete’ key to delete each letter
    Step 4: Return with goal accomplished
GOMS: Example of deleting word in MS Word
  Operators to use in above methods:
    Click mouse
    Drag cursor over text
    Select menu
    Move cursor to command
    Press keyboard key
  Selection rules to decide which method to use:
    1: Delete text using mouse and selecting from menu if large amount of text is to be deleted
    2: Delete text using delete key if small number of letters is to be deleted
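The two methods and the selection rules above can be sketched as plain data plus one function. This is only an illustration: the names `Method`, `MENU_CUT`, `DELETE_KEY` and `select_method` are made up here, and the 10-character threshold is an assumption, since the slides only distinguish a "large amount of text" from a "small number of letters".

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Method:
    """A GOMS method: a named, learned sequence of steps."""
    name: str
    steps: tuple


MENU_CUT = Method("menu cut", (
    "Recall that word to be deleted has to be highlighted",
    "Recall that command is 'cut'",
    "Recall that command 'cut' is in edit menu",
    "Select and execute the 'cut' command",
))

DELETE_KEY = Method("delete key", (
    "Recall where to position cursor in relation to word",
    "Recall which key is delete key",
    "Press 'delete' key to delete each letter",
))

# Assumed cut-off between "small" and "large"; not given in the slides.
LARGE_TEXT_THRESHOLD = 10


def select_method(num_chars):
    """Selection rule: menu cut for large deletions, delete key for small ones."""
    if num_chars > LARGE_TEXT_THRESHOLD:
        return MENU_CUT
    return DELETE_KEY


print(select_method(4).name)   # delete key
print(select_method(60).name)  # menu cut
```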
Keystroke Level Model
  Well-known analytic evaluation technique
  Derived from MHP (Model Human Processor; Card, Moran & Newell, 1983)
  Provides detailed quantitative (numerical) information on user performance
  Sufficient for predicting speed of interaction with a user interface
  Basic time prediction components empirically derived
KLM Constants
  Operator  Description                                                   Time (sec)
  K         Pressing a single key or button                               0.35 (average)
              Skilled typist (55 wpm)                                     0.22
              Average typist (40 wpm)                                     0.28
              User unfamiliar with the keyboard                           1.20
              Pressing shift or control key                               0.08
  P         Point with a mouse or other device to a target on a display   1.10
              Clicking the mouse or similar device                        0.20
  H         Homing hands on the keyboard or other device                  0.40
  D         Draw a line using a mouse                                     Variable, depending on the length of the line
  M         Mentally prepare to do something (e.g. make a decision)       1.35
  R(t)      System response time, counted only if it causes the user      t
              to wait when carrying out their task
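A KLM prediction is simply the sum of the operator times for the sequence of actions a task requires. The sketch below uses the constants from the table; the `CLICK` label, the function name `klm_estimate` and the example operator sequence are assumptions made for illustration. The variable operators D and R(t) are left out.

```python
# Operator times in seconds, copied from the KLM constants table.
# "CLICK" is our own label for the table's "clicking the mouse" row.
KLM_SECONDS = {
    "K": 0.35,      # average keystroke; pass key_time to override
    "P": 1.10,      # point with a mouse at a target
    "CLICK": 0.20,  # click the mouse button
    "H": 0.40,      # home hands on keyboard or mouse
    "M": 1.35,      # mental preparation
}


def klm_estimate(operators, key_time=None):
    """Sum the operator times for a sequence such as ["H", "M", "P"]."""
    times = dict(KLM_SECONDS)
    if key_time is not None:
        times["K"] = key_time
    return sum(times[op] for op in operators)


# Hypothetical sequence: move hand to mouse, think, point at a menu,
# click, point at a menu item, click.
menu_selection = ["H", "M", "P", "CLICK", "P", "CLICK"]
print(round(klm_estimate(menu_selection), 2))  # 4.35
```

Comparing two candidate designs then amounts to writing out both operator sequences and comparing the two totals.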
Task in Text Editor Using GOMS
  Create new file
  Type in “Hello, World.”
  Save document as “Hello”
  Print document
  Exit editor
  Assumptions
    System response time is 0, or comparable across systems (constant)
    Skilled typist (55 wpm) (K ≈ 0.2)
    Editor is started, hands in lap
All Mouse
KLM
  Applicability
    User interface with a limited number of features
    Repetitive task execution
    Really only useful for comparative study among alternatives, albeit sensitive to minor changes
    Project Ernestine
  Caveats
    Assumes expert behaviour: no errors tolerated
    User already knows the sequence of operations that he or she is going to perform
    Time estimates are best followed up by empirical studies
    Ambiguity regarding the M operator
    Assumes serial processing
Fitts’ Law
  Predicts time taken to reach a target using a pointing device
  T = k log2(D/S + 0.5), k ~ 100 msec.
  where
    T = time to move the hand to a target
    D = distance between hand and target
    S = size of target
  Highlights corners of screen as good targets
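As a quick check on the formula, it can be evaluated directly. This is a minimal sketch using the slide's form of the law with k = 0.1 sec; the function name `fitts_time` and the example distances and sizes are assumptions. D and S must be in the same units, and the result is only meaningful when D/S + 0.5 is at least 1.

```python
import math


def fitts_time(distance, size, k=0.1):
    """Predicted movement time in seconds: T = k * log2(D/S + 0.5)."""
    if distance < 0 or size <= 0:
        raise ValueError("distance must be >= 0 and size must be > 0")
    return k * math.log2(distance / size + 0.5)


# Same distance, two target sizes (units are arbitrary but must match).
print(round(fitts_time(150, 20), 3))  # 0.3
print(fitts_time(150, 10) > fitts_time(150, 20))  # True: smaller target is slower
```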
Performance measures
  Time: easy to measure and suitable for statistical analysis, e.g. learning time, task completion time.
  Errors: show where problems exist within a system and suggest the cause of a difficulty.
  Patterns of system use: study the patterns of use in different sections, e.g. preference for and avoidance of sections in a system.
  Amount of work done in a given time.
Other measures
  Subjective impression measures
    Attitude measures: use questionnaires or interviews
    Rated aesthetics
    Rated ease of learning
    Stated decision to purchase
  Composite measures
    Weighted averages of the above
    E.g. efficiency = throughput / number of errors
Controlled experiments
  Designed to test predictions arising from an explicit hypothesis that arises out of an underlying theory
  Allows comparison of systems, fine-tuning of details ...
  Strives for
    lucid and testable hypothesis
    quantitative measurement
    measure of confidence in results obtained (statistics)
    replicability of experiment
    control of variables and conditions
    removal of experimenter bias
Ben Shneiderman (Univ. Maryland US)
  Experiments have:
    Two Parents:
      ‘a practical problem’
      ‘a theoretical foundation’
    Three Children:
      ‘Help in resolving the practical problems’
      ‘refinements to the theory’
      ‘advice to future experimenters who work on the same problem’
Designing Experiments
  Formulating the hypotheses
  Developing predictions from the hypotheses
  Choosing a means to test the predictions
  Identifying all the variables that might affect the results of the experiment
  Deciding which are the independent variables, which are the dependent variables, and which variables need to be controlled by some means
Usability Laboratory
Designing Experiments (contd.)
  Designing the experimental task and method
  Subject selection
  Deciding the experimental design, data collection method and controlling confounding variables
  Deciding on the appropriate statistical or other analysis
  Carrying out a pilot study
The Experimental Method
  a) Begin with a lucid, testable hypothesis
    Example 1: “there is no difference in the number of cavities in children and teenagers using Crest and No-teeth toothpaste”
The Experimental Method
    Example 2: “there is no difference in user performance (time and error rate) when selecting a single item from a pop-up or a pull-down menu, regardless of the subject’s previous expertise in using a mouse or using the different menu types”
The Experimental Method
  b) Explicitly state the independent variables that are to be altered
    Independent variables
      the things you manipulate independent of how a subject behaves
      determine a modification to the conditions the subjects undergo
      may arise from subjects being classified into different groups
    In toothpaste experiment
      toothpaste type: uses Crest or No-teeth toothpaste
      age: <= 11 years or > 11 years
    In menu experiment
      menu type: pop-up or pull-down
      menu length: 3, 6, 9, 12, 15
      subject type: expert or novice
The Experimental Method
  c) Carefully choose the dependent variables that will be measured
    Dependent variables
      Measures to demonstrate the effects of the independent variables
    Properties
      Readily observable
      Stable and reliable, so that they do not vary under constant experimental conditions
      Sensitive to the effects of the independent variables
      Readily related to some scale of measurement
Dependent variables
  Some commonly used dependent variables
    Number of errors made
    Time taken to complete a given task
    Time taken to recover from an error
  In menu experiment
    time to select an item
    selection errors made
  In toothpaste experiment
    number of cavities
    frequency of brushing
What is an experiment?
  Three criteria
    The experimenter must systematically manipulate one or more independent variables in the domain under investigation
    The manipulation must be made under controlled conditions, such that all variables which could affect the outcome of the experiment are controlled (see confounding variables, next)
    The experimenter must measure some un-manipulated feature that changes, or is assumed to change, as a function of the manipulated independent variable
Confounding variables
  Variables that are not independent variables but are permitted to vary along with them in the experiment
  “The logic of experiments is to hold variables-not-of-interest constant among conditions, systematically manipulate independent variables, and observe the effects of the manipulation on the dependent variables.”
Sources of variation
  Variations in the task performed
  The effect of the treatment (i.e. the user interface improvements that we made)
  Individual differences between experimental subjects (e.g. IQ)
  Different stimuli for each task
  Distractions during the trial (sneezing, dropping things)
  Motivation of the subject
  Accidental hints or intervention by the experimenter
  Other random factors
Examples of Confounding
  Order effects
    Tasks done early in testing are slower and more prone to error.
    Tasks done late in testing may be affected by user fatigue.
  Carry-over effects
    A difference occurs if one condition follows another, e.g. learning text editor commands.
  Experience factors
    People in one condition have more or less relevant experience than those in others.
  Experimenter/subject bias
    The experimenter systematically treats some subjects differently from others, or subjects have different motivation levels.
  Other uncontrolled variables
    Time of day, system load.
Confounding Prevention
  Randomization
    Negates the order effect.
    Random assignment to conditions is used to ensure that any effect due to unknown differences among users or conditions is random.
  Counterbalancing
    Addresses order and carry-over effects.
    Test half of the users in condition 1 first, and the other half in condition 2 first.
    Different permutations of condition order can be used.
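Randomization and counterbalancing can be combined when drawing up a test schedule: shuffle the users, then deal out the possible condition orders in turn. The sketch below is one way to do this, not a prescribed procedure; the function name `counterbalance` and the participant labels are assumptions.

```python
import itertools
import random


def counterbalance(participants, conditions, seed=None):
    """Randomly assign each participant one ordering of the conditions,
    cycling through all permutations so the orders are used equally often."""
    rng = random.Random(seed)
    shuffled = list(participants)
    rng.shuffle(shuffled)                      # randomization
    orders = list(itertools.permutations(conditions))
    return {p: orders[i % len(orders)]         # counterbalancing
            for i, p in enumerate(shuffled)}


# Hypothetical participants P1..P4 and two conditions.
plan = counterbalance(["P1", "P2", "P3", "P4"], ["1", "2"], seed=42)
for participant, order in sorted(plan.items()):
    print(participant, "->", ", ".join(order))
```

With two conditions, half of the users get condition 1 first and half get condition 2 first. With more conditions the number of permutations grows quickly, so the orders are only balanced when the number of users is a multiple of the number of permutations.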
Allocation of participants
  Judiciously select and assign subjects to groups to control variability
  a) Between-Groups Experiment
    Two groups of test users, same tasks for both groups.
    Randomly assign users to two equally-sized groups.
    Group A uses only system A, group B only system B.
  b) Within-Groups Experiment
    One group of test users.
    Each user performs equivalent tasks on both systems.
    Randomly assign users to two equally-sized pools.
    Pool A uses system A first, pool B system B first.
  c) Matched-pairs
Example Designs
  Between Groups
    System A   System B
    John       Dave
    James      May
    Mary       Ann
    Stuart     Phil
    Requires more participants
    No transfer of learning effects
    Less arduous on participants
    Large individual variation in user skills
  Within Groups
    Participant  Sequence
    Elizabeth    A,B
    Michael      B,A
    Steven
    Richard
    Is more powerful statistically (can compare the same person across different conditions, thus isolating effects of individual differences)
    Requires fewer participants than between-groups
    Learning effects
    Fatigue effects
Experimental Details
  Order of tasks
    choose one simple order (simple -> complex), unless doing a within-groups experiment
  Training
    depends on how the real system will be used
  What if someone doesn’t finish?
    assign a very large time & large # of errors
  Pilot study
    helps you fix problems with the study
    do two: first with colleagues, then with real users
Sample Size Depends on desired confidence level and confidence interval. Confidence level of 95% often used for research, 80% ok for practical development. Rule of thumb: 16-20 test users.
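To see how the confidence level and the width of the confidence interval drive the number of users, one common textbook approximation (not taken from the slides) is n = (z * sigma / E)^2, where sigma is the expected standard deviation of the measure and E is the acceptable margin of error. The sketch below assumes roughly normal data and a known sigma; the values sigma = 10 and E = 5 are made up for illustration.

```python
import math
from statistics import NormalDist


def sample_size(sigma, margin, confidence=0.95):
    """Users needed so the confidence interval of a mean is within +/- margin.

    Normal approximation: n = (z * sigma / margin) ** 2, rounded up.
    """
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    return math.ceil((z * sigma / margin) ** 2)


# Hypothetical: task times vary with sigma = 10 min, we accept +/- 5 min.
print(sample_size(sigma=10, margin=5, confidence=0.95))  # 16
print(sample_size(sigma=10, margin=5, confidence=0.80))  # 7
```

Relaxing the confidence level from 95% to 80% cuts the required number of users by more than half in this example, which is why the lower level can be acceptable for practical development.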
Analysing the numbers
  Example: trying to get task time <= 30 min.
    test gives: 20, 15, 35, 80, 10, 20
    mean (average) = 30
    looks good!
    wrong answer, not certain of anything
    always chart results
  Factors contributing to our uncertainty
    small number of test users (n = 6)
    results are very variable (sample standard deviation ≈ 26)
    std. dev. measures dispersal from the mean
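The uncertainty in this example can be made concrete by computing the sample standard deviation and a 95% confidence interval for the mean from the six times listed. This is a sketch under the usual assumption of roughly normal data; the critical value 2.571 is the two-sided 95% Student's t value for 5 degrees of freedom, hard-coded here to keep the example dependency-free.

```python
import math
import statistics

times = [20, 15, 35, 80, 10, 20]   # task times in minutes, from the slide

n = len(times)
mean = statistics.mean(times)
sd = statistics.stdev(times)       # sample standard deviation (divides by n - 1)

T_CRIT_95_DF5 = 2.571              # two-sided 95% t value, 5 degrees of freedom
half_width = T_CRIT_95_DF5 * sd / math.sqrt(n)
ci = (mean - half_width, mean + half_width)

print(f"mean = {mean:.1f} min, sd = {sd:.1f} min")
print(f"95% CI for the mean: {ci[0]:.1f} to {ci[1]:.1f} min")
```

The interval extends far above the 30-minute target, so six users with this much spread cannot show that the target has been met.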
Experimental Evaluation
  Advantages
    Powerful method (depending on the effects investigated)
    Quantitative data for statistical analysis
    Can compare different groups of users
    Reliability and validity good
    Replicable
  Disadvantages
    High resource demands
    Requires knowledge of experimental method
    Time spent on experiments can mean evaluation is difficult to integrate into design cycle
    Tasks can be artificial and restricted
    Cannot always generalise to full system in typical working situation
    Not all human behaviour variables can be controlled
    Little recognition of work, time, motivational & social context
    Subject’s ideas, thoughts, beliefs largely ignored
  (Preece Ch 31 pp641 - 649)
  This method involves users carrying out specified tasks under controlled conditions and may make use of a mixture of some of the other methods used so far. For example, questionnaires/interviews might be used to establish the users’ previous experience and, after the ‘experiment’, to elicit, say, their subjective judgements of the interface. The experiment itself might make use of techniques such as observation (including timing of performance), talk-aloud, data-logging, and so on.
  An important consideration in the design of the experiment is cost: the cost of setting up the controlled conditions, finding and paying the subjects, running the experiment, analysing the results, and so on.
  Advantages of the method are that the results are usually reliable and valid (assuming a good design) and can be replicated any number of times. It can be used to compare the performance and reactions of different groups of users who have been subjected to the same experimental conditions.
  Disadvantages are that it can be costly, it requires people who are knowledgeable about experimental method, and it can be difficult to fit into the design cycle because of the time it normally takes. Again, the experimental tasks can sometimes appear artificial and restricted, so it is difficult to generalise and to know for sure how the interface, and users, will perform in a real, typical working environment.
Summary
  Allows comparison of alternative designs
  Collects objective, quantitative data (bottom-line data)
  Needs a significant number of test users (16-20)
  Usable only later in the development process
  Requires administrator expertise
  Cannot provide why-information (process data)
  Formal studies can reveal detailed information but take extensive time/effort
  Applicability:
    system location dangerous or impractical
    for constrained single user systems
    to allow controlled manipulation of use
Summary (contd.)
  Suitable...
    system location dangerous or impractical
    for constrained single user systems
    to allow controlled manipulation of use
  Advantages and disadvantages
    sophisticated & expensive equipment
    uninterrupted environment
    Hawthorne principle