Evaluation: Controlled Experiments Chris North cs3724: HCI.


Presentations: Dan Constantin, Grant Underwood, Mike Gordon. Vote: UI Hall of Fame/Shame?

Next: Apr 4: Proj 2, final implementation. Presentations: UI critique or HW2 results. Thurs: Matt Ketner, Sam Altman. Next Tues: Karen Molye, Steve Kovalak. Next Thurs:

Review: 3 approaches for navigating large information spaces? (Baseline: detail only.) Zoom; Overview+detail; Focus+context.

Review: Visualizing Trees. 2 approaches: Connection, Containment. Hyperbolic: 100s of nodes + structure. TreeMap: 1000s of nodes + attributes. 3D: infovis design is critical, not just VRML.

Process: Design, Evaluate, Develop. Continuous iteration.

UI Evaluation. Early evaluation: Wizard of Oz; role playing and scenarios. Mid evaluation: expert reviews; heuristic evaluation; usability testing; controlled experiments. Late evaluation: data logging; online surveys.

Controlled Experiments: Scientific experiment with real users. Typical HCI goal: which UI is better?

What is Science? Measurement. Modeling.

Scientific Method: 1. Form hypothesis. 2. Collect data. 3. Analyze. 4. Accept/reject hypothesis.

Deep Questions: Is ‘computer science’ science? How can you “prove” a hypothesis with science?

Empirical Experiment. Typical question: Which UI is better in which situations? Lifelines (zooming) vs. Perspective Wall (focus+context).

More Rigorous Question: Does the UI (Lifelines or PerspWall) have an effect on user performance time for task X for such-and-such users? Null hypothesis: no effect, i.e. Lifelines = PerspWall. We want to disprove it: provide a counter-example, show an effect.

Variables. Independent variables (what you vary) and treatments (the variable values): User interface (Lifelines, Perspective Wall, Text UI); Task type (find, count, pattern, compare); Data size, # of items (100, 1000, 1000000). Dependent variables (what you measure): user performance time; errors; subjective satisfaction (survey), retention, learning time; HCI metrics.
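The independent variables and treatments listed above define the cells of an experiment: one cell per combination of treatments. A minimal sketch of enumerating them, assuming a full factorial design over all three variables (the slide does not say every combination is actually run; `independent_vars` and `design_cells` are illustrative names):

```python
from itertools import product

# Independent variables and their treatments, as listed on the slide.
independent_vars = {
    "ui": ["Lifelines", "Perspective Wall", "Text UI"],
    "task": ["find", "count", "pattern", "compare"],
    "data_size": [100, 1000, 1000000],
}

def design_cells(ind_vars):
    """Return every combination of treatments as a list of dicts, one per cell."""
    names = list(ind_vars)
    return [dict(zip(names, combo)) for combo in product(*ind_vars.values())]

cells = design_cells(independent_vars)
print(len(cells))   # 3 UIs * 4 task types * 3 data sizes = 36 cells
print(cells[0])
```

Each cell then needs its own set of measured dependent-variable values, which is why adding treatments quickly raises the number of users required.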

Example: 2 x 3 design, n users per cell. Cells hold measured user performance times (dep var).
Ind Var 1: UI \ Ind Var 2: Task Type    Task1    Task2    Task3
Lifelines
Persp. Wall

Groups. “Between subjects” variable: 1 group of users for each variable treatment. Group 1: 20 users, Lifelines. Group 2: 20 users, PerspWall. Total: 40 users, 20 per cell. “Within subjects” (repeated) variable: all users perform all treatments; counter-balance to control the order effect. Group 1: 20 users, Lifelines then PerspWall. Group 2: 20 users, PerspWall then Lifelines. Total: 40 users, 40 per cell.
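The within-subjects grouping above can be sketched as a random split of users into two equal groups with opposite treatment orders. This is only an illustration: the function name `counterbalance`, the integer user ids, and the fixed seed are assumptions, not part of the lecture.

```python
import random

def counterbalance(user_ids, treatments=("Lifelines", "PerspWall"), seed=0):
    """Randomly split users into two equal groups with opposite treatment orders."""
    users = list(user_ids)
    random.Random(seed).shuffle(users)   # random assignment, reproducible via seed
    half = len(users) // 2
    first, second = treatments
    orders = {}
    for user in users[:half]:
        orders[user] = (first, second)   # group 1: Lifelines then PerspWall
    for user in users[half:]:
        orders[user] = (second, first)   # group 2: PerspWall then Lifelines
    return orders

orders = counterbalance(range(40))
print(orders[0])
```

Because every user sees both UIs, each UI still collects data from all 40 users, while any learning or fatigue effect is spread evenly across the two orders.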

Issues. Fairness: randomized; identical procedures. Bias. User privacy, data security. Legal permissions.

Procedure. For each user: sign legal forms; pre-survey (demographics); instructions (do not reveal true purpose of experiment); training runs; actual runs; post-survey (subjective measures). Repeat for n users.

Data: Measured dependent variables. Spreadsheet: Lifelines task 1, 2, 3; PerspWall task 1, 2, 3.
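A spreadsheet like the one described can be reduced to one average per cell. The sketch below assumes one row per (user, UI, task) measurement; the row layout, the name `cell_averages`, and all timing values are made up for illustration and are not data from the lecture.

```python
from collections import defaultdict
from statistics import mean

# (user, ui, task, seconds) rows. All values here are made up.
rows = [
    (1, "Lifelines", "task1", 36.0),
    (2, "Lifelines", "task1", 38.0),
    (1, "PerspWall", "task1", 29.0),
    (2, "PerspWall", "task1", 31.0),
]

def cell_averages(rows):
    """Group performance times by (ui, task) cell and average each group."""
    cells = defaultdict(list)
    for _user, ui, task, seconds in rows:
        cells[(ui, task)].append(seconds)
    return {cell: mean(times) for cell, times in cells.items()}

averages = cell_averages(rows)
for cell, avg in averages.items():
    print(cell, avg)
```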

Averages: measured user performance times (dep var).
Ind Var 1: UI \ Ind Var 2: Task Type    Task1    Task2    Task3
Lifelines                                37.2     54.5    103.7
Persp. Wall                              29.8     53.2    145.4

PerspWall better than Lifelines? Problem with averages: lossy. Compares only 2 numbers. What about the 40 data values? (Show me the data!) [Chart: average Task1 perf time (secs), Lifelines vs. PerspWall]

The real picture: Need stats that take all data into account. [Chart: perf time (secs), Lifelines vs. PerspWall]

Statistics. t-test: compares 1 dep var on 2 treatments of 1 ind var (2 cells). ANOVA (Analysis of Variance): compares 1 dep var on n treatments of m ind vars (n x m cells). Result: “significant difference” between treatments? p = significance level (confidence); typical cut-off: p < 0.05.
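To make the t-test concrete, here is a minimal sketch of Student's two-sample t statistic computed by hand. It assumes a between-subjects comparison with equal variances, and it uses made-up timing data with 5 users per group; neither the data nor the helper name `pooled_t` comes from the lecture. The constant 2.306 is the standard two-tailed critical value for p < 0.05 at 8 degrees of freedom. In practice a statistics package (for example `scipy.stats.ttest_ind`) would report the p-value directly.

```python
from math import sqrt
from statistics import mean, variance

def pooled_t(a, b):
    """Student's two-sample t statistic (equal variances assumed) and its df."""
    na, nb = len(a), len(b)
    df = na + nb - 2
    pooled_var = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / df
    t = (mean(a) - mean(b)) / sqrt(pooled_var * (1 / na + 1 / nb))
    return t, df

T_CRIT_DF8 = 2.306  # two-tailed critical value for p < 0.05 with df = 8

# Made-up task 1 times (secs). Both pairs have the same averages (37.2 vs 30).
tight_lifelines, tight_perspwall = [36, 37, 37, 38, 38], [29, 30, 30, 30, 31]
wide_lifelines, wide_perspwall = [10, 20, 37, 54, 65], [5, 15, 30, 45, 55]

t_tight, df = pooled_t(tight_lifelines, tight_perspwall)
t_wide, _ = pooled_t(wide_lifelines, wide_perspwall)

print(df, round(t_tight, 2), abs(t_tight) > T_CRIT_DF8)  # low spread: significant
print(df, round(t_wide, 2), abs(t_wide) > T_CRIT_DF8)    # high spread: not significant
```

The two data sets have identical averages, yet only the low-variance one crosses the cut-off, which is why the test has to look at every data value rather than the two means alone.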

p < 0.05: Woohoo! Found a “statistically significant difference”. Averages indicate which is ‘better’. Conclusion: UI has an “effect” on user performance for task1; PerspWall gives better user performance than Lifelines for task1. “95% confident that PerspWall better than Lifelines”, not “PerspWall beats Lifelines 95% of time”. Found a counter-example to the null hypothesis. Null hypothesis: Lifelines = PerspWall. Hence: Lifelines ≠ PerspWall.

p > 0.05: Hence, same? UI has no effect on user performance for task1? Lifelines = PerspWall? NOT! We did not detect a difference, but they could still be different. Did not find a counter-example to the null hypothesis. Provides evidence for Lifelines = PerspWall, but not proof. Boring! Basically found nothing. How? Not enough users. Need better tasks, data, …
