Mini-UPA, 2009: Rating Scales: What the Research Says
Joe Dumas, UX Consultant; Tom Tullis, Fidelity Investments

Presentation transcript:

The Scope of the Session
- Discussion of literature about rating scales in usability methods, primarily usability testing
- Brief review of recommendations from older literature
- Focus on recent studies
- Recommendations for practitioners

Table of Contents
- Types of rating scales
- Guidelines from past studies
- How to evaluate a rating scale
- Guidelines from recent studies
- Additional advantages of rating scales

Types of Rating Scales

Formats
- One question format
- Before-after format
- Multiple question format

One Question Formats
- Original Likert scale format: "I think that I would like to use this system frequently."
  ___ Strongly Disagree   ___ Disagree   ___ Neither agree nor disagree   ___ Agree   ___ Strongly Agree
  (Rensis Likert)

One Question Formats
- Likert-like scales: "Characters on the screen are:" with the end points labeled "Hard to read" and "Easy to read"

One Question Formats
- One more Likert-like scale (used in SUMI): "I would recommend this software to my colleagues."
  __ Agree   __ Undecided   __ Disagree

One Question Formats
- Subjective Mental Effort Scale (SMEQ)

One Question Formats
- Semantic differential
- Magnitude estimation: use any positive number

Before-After Ratings
- Before the task: "How easy or difficult do you expect this task to be?" (end points: Very easy / Very difficult)
- After the task: "How easy or difficult was the task to do?" (end points: Very easy / Very difficult)

Multiple Question Formats (Selected List)
- System Usability Scale (SUS) – 10 ratings
- *Questionnaire for User-Interface Satisfaction (QUIS) – 71 ratings (long form), 26 (short form)
- *Software Usability Measurement Inventory (SUMI) – 50 ratings
- After-Scenario Questionnaire (ASQ) – three ratings
* Requires a license

More Multiple Question Formats
- Post-Study System Usability Questionnaire (PSSUQ) – 19 ratings; the electronic version is called the Computer System Usability Questionnaire (CSUQ)
- *Website Analysis and MeasureMent Inventory (WAMMI) – 20 ratings of website usability
* Requires a license
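
Because SUS (from the previous slide) is the most widely used of these questionnaires, a worked scoring example may help. The sketch below implements the standard SUS scoring rule (odd items contribute their score minus 1, even items contribute 5 minus their score, and the sum is multiplied by 2.5); the ten responses are invented for illustration.

    # Sketch: scoring a single SUS questionnaire (standard scoring rule).
    def sus_score(responses):
        """Return the 0-100 SUS score for ten item responses (each 1-5)."""
        if len(responses) != 10:
            raise ValueError("SUS has exactly 10 items")
        total = 0
        for item, r in enumerate(responses, start=1):
            # Odd-numbered items are positively worded, even-numbered negatively.
            total += (r - 1) if item % 2 == 1 else (5 - r)
        return total * 2.5  # rescales the 0-40 sum to 0-100

    # Invented responses to items 1-10
    print(sus_score([4, 2, 5, 1, 4, 2, 5, 2, 4, 1]))  # 85.0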

Guidelines from Past Studies

Guidelines
- Have 5-9 levels in a rating
  - You gain no additional information by having more than 10 levels
- Include a neutral point in the middle of the scale
  - Otherwise you lose information by forcing some participants to take sides
  - People from some Asian cultures are more likely to choose the midpoint

Guidelines
- Use positive integers as numbers
  - 1-7 instead of -3 to +3 (participants are less likely to go below 0 than they are to use 1-3)
  - Or don't show numbers at all
- Use word labels for at least the end points
  - It is hard to create labels for every point beyond 5 levels
  - Having labels on the end points only also makes the data more "interval-like"

Guidelines
- Most word labels produce a bipolar scale
  - In a 1-to-7 scale from easy to difficult, what is increasing with the numbers? Is ease the absence of difficulty?
  - This may be one reason why participants are reluctant to move toward the difficult end: it is a different concept than lack of ease
- One solution: a scale from "not at all easy" to "very easy"

Evaluating a Rating Scale

Statistical Criteria
- Is it valid? Does it measure what it's supposed to measure?
  - For example, does it correlate with other usability measures?
- Is it sensitive?
  - Can it discriminate between tasks or products with small samples?
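
As a rough illustration of the validity check above (correlation with other usability measures), the sketch below correlates per-participant task times with post-task ease ratings using a Spearman correlation. All numbers are invented, not data from the studies the deck cites.

    # Sketch: does a post-task ease rating correlate with task time? (invented data)
    from scipy.stats import spearmanr

    times_sec = [35, 42, 51, 60, 75, 90, 110, 130]   # task completion times
    ease      = [ 7,  6,  6,  5,  5,  4,   3,   2]   # 1 = very difficult ... 7 = very easy

    rho, p = spearmanr(times_sec, ease)
    print(f"Spearman rho = {rho:.2f}, p = {p:.4f}")  # a strong negative rho here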

Practical Criteria
- Is it easy for the participant to understand and use?
  - Do they get what it means?
- Is it easy for the tester to present (online or paper) and score?
  - Do you need a widget to present it?
  - Can scoring be done automatically?

Guidelines from Recent Studies

Post-Task Ratings
- The simpler the better
- Tedesco and Tullis found this format the most sensitive: "Overall this task was:" (Very Easy ... Very Difficult)
- Sauro and Dumas found SMEQ just as sensitive as Likert

More on Post-Task Ratings
- They provide diagnostic information about usability issues with tasks
- They correlate moderately well with other measures, especially time, and their correlations are higher than those for post-test ratings

More on Post-Task Ratings
- Even post-task ratings may be inflated (Teague et al., 2001): ease ratings made during a task were significantly lower than ratings made right after the task, and higher still when given only after the task
- (Chart: ease ratings for the concurrent during-task, concurrent after-task, and post-task-only conditions)

Post-Test Ratings
- Home-grown questionnaires perform more poorly than standardized ones
- Tullis and Stetson and others have found SUS the most sensitive; many testers are using it
- Some of the standardized questionnaires (SUMI and WAMMI) have industry norms to compare against
  - But no one knows what the database of norms contains

More on Post-Test Ratings
- The lowest correlations among all measures used in testing are those with post-test ratings (Sauro and Lewis)
- Why? They are tapping into factors that don't affect other measures, such as demand characteristics, the need to please, the need to appear competent, lack of understanding of what an "overall" rating means, etc.

Examine the Distribution
- See how the average would miss how bimodal the distribution is. Some participants find it very hard to use. Why?
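
A quick way to apply this advice is to plot the ratings rather than only averaging them. The sketch below uses invented ratings to show how a mean near the middle of the scale can hide a bimodal split.

    # Sketch: examine the distribution of ratings instead of only the mean (invented data).
    import matplotlib.pyplot as plt

    ratings = [1, 2, 2, 1, 2, 6, 7, 6, 7, 7, 6, 2]            # 1-7 ease ratings
    print(f"Mean rating: {sum(ratings) / len(ratings):.1f}")  # ~4.1, which hides the split

    plt.hist(ratings, bins=range(1, 9), align="left", rwidth=0.8, edgecolor="black")
    plt.xticks(range(1, 8))
    plt.xlabel("Ease rating (1 = very difficult, 7 = very easy)")
    plt.ylabel("Number of participants")
    plt.title("A bimodal distribution the average would miss")
    plt.show()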

Low Sensitivity with Small Samples
- Three recent studies have all shown that post-task and post-test ratings do not discriminate well with small sample sizes
- For sample sizes typical of laboratory formative tests, ratings are not reliable
- Ratings can be used as an opportunity to get participants to talk about why they have chosen a value
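
The low-power problem described above can be illustrated with a small simulation (not taken from the cited studies; all ratings are invented): draw repeated small samples from two products whose ratings genuinely differ and count how often a test detects the difference.

    # Sketch: sensitivity of ratings at small sample sizes (invented data).
    import random
    from scipy.stats import mannwhitneyu

    random.seed(1)
    product_a = [random.choice([3, 4, 4, 5, 5, 6]) for _ in range(2000)]  # slightly better
    product_b = [random.choice([2, 3, 4, 4, 5, 5]) for _ in range(2000)]

    for n in (5, 10, 20):
        trials, hits = 500, 0
        for _ in range(trials):
            sample_a = random.sample(product_a, n)
            sample_b = random.sample(product_b, n)
            _, p = mannwhitneyu(sample_a, sample_b, alternative="two-sided")
            hits += p < 0.05
        print(f"n = {n:2d} per group: difference detected in {hits / trials:.0%} of runs")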

The Value of Confidence Intervals
- Actual data from an online study comparing the NASA and Wikipedia sites for finding information on the Apollo space program
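
The slide's chart is not reproduced in this transcript, but a confidence interval for a mean rating is straightforward to compute. The sketch below uses a t-based 95% interval on invented ratings for two sites; non-overlapping intervals are what make a comparison like the NASA vs. Wikipedia one convincing.

    # Sketch: 95% confidence intervals for mean ratings of two sites (invented data).
    import numpy as np
    from scipy import stats

    site_a = np.array([6, 7, 5, 6, 7, 6, 5, 7, 6, 6], dtype=float)
    site_b = np.array([4, 5, 3, 6, 4, 5, 2, 6, 3, 5], dtype=float)

    for name, data in (("Site A", site_a), ("Site B", site_b)):
        mean = data.mean()
        sem = stats.sem(data)  # standard error of the mean
        lo, hi = stats.t.interval(0.95, df=len(data) - 1, loc=mean, scale=sem)
        print(f"{name}: mean = {mean:.2f}, 95% CI = [{lo:.2f}, {hi:.2f}]")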

Little Known Advantages of Rating Scales

Ratings Can Help Prioritize Work
- (Chart: scatterplot of tasks divided into quadrants labeled "Fix it Fast," "Promote It," "Big Opportunity," and "Don't Touch It"; ease axis runs 1 = Difficult ... 7 = Easy)
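
The slide's scatterplot is not reproduced here, but the idea is to plot each task's mean ease rating against a second dimension (assumed below to be task frequency or importance, which the transcript does not name) and read off the four quadrants. The sketch uses invented task data.

    # Sketch: prioritization scatterplot from task ease ratings (invented data).
    # The second axis (frequency/importance) is an assumption; the slide only
    # labels the ease axis (1 = Difficult ... 7 = Easy).
    import matplotlib.pyplot as plt

    tasks = {
        "Check balance":  (6.5, 0.9),
        "Transfer funds": (2.8, 0.8),
        "Update profile": (3.0, 0.2),
        "Export history": (6.0, 0.1),
    }

    for name, (ease, freq) in tasks.items():
        plt.scatter(ease, freq)
        plt.annotate(name, (ease, freq), textcoords="offset points", xytext=(5, 5))

    plt.axvline(4.0, linestyle="--")  # midpoint of the 1-7 ease scale
    plt.axhline(0.5, linestyle="--")  # midpoint of the assumed frequency axis
    plt.xlabel("Mean ease rating (1 = Difficult ... 7 = Easy)")
    plt.ylabel("Task frequency / importance (assumed axis)")
    plt.title("Quadrant labels such as 'Fix it Fast' can be overlaid on this plot")
    plt.show()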

Ratings Can Help Identify "Disconnects"
- This "disconnect" between the accuracy and task ease ratings is worrisome: it indicates users didn't realize they were screwing up on Task 2!

Ratings Can Help You Make Comparisons
- You can be very pleased if you get an average SUS score of 83 (which is the 94th percentile of this distribution), but you should be worried if you get an average SUS score of 48 (the 12th percentile).
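
To make that kind of comparison you need a distribution of SUS scores from other studies. The sketch below computes percentile ranks against a small, entirely made-up set of scores, so its output will not match the 94th/12th percentiles quoted from the slide's own database.

    # Sketch: where does an average SUS score fall in a distribution of scores?
    # The "database" below is made up; real norms come from collections of studies.
    from scipy.stats import percentileofscore

    sus_database = [35, 42, 48, 52, 55, 58, 60, 62, 65, 68,
                    70, 72, 74, 76, 78, 80, 82, 85, 88, 92]

    for score in (83, 48):
        pct = percentileofscore(sus_database, score)
        print(f"An average SUS of {score} sits at roughly the {pct:.0f}th percentile")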

In Closing…
- These slides, a bibliography of readings, and associated examples can be downloaded from:
- Feel free to contact us with questions!