HCI 510 : HCI Methods I HCI Methods –Controlled Experiments.

Slides:



Advertisements
Similar presentations
Evaluation - Controlled Experiments What is experimental design? What is an experimental hypothesis? How do I plan an experiment? Why are statistics used?
Advertisements

Controlled experiments Traditional scientific method Reductionist –clear convincing result on specific issues In HCI: –insights into cognitive process,
Controlled Experiments
Statistical Decision Making
Statistical Tests Karen H. Hagglund, M.S.
Hypothesis testing Week 10 Lecture 2.
Quantitative Evaluation
EPIDEMIOLOGY AND BIOSTATISTICS DEPT Esimating Population Value with Hypothesis Testing.
Evaluating Hypotheses Chapter 9. Descriptive vs. Inferential Statistics n Descriptive l quantitative descriptions of characteristics.
Evaluating Hypotheses Chapter 9 Homework: 1-9. Descriptive vs. Inferential Statistics n Descriptive l quantitative descriptions of characteristics ~
Lecture 9: One Way ANOVA Between Subjects
Inference.ppt - © Aki Taanila1 Sampling Probability sample Non probability sample Statistical inference Sampling error.
PY 427 Statistics 1Fall 2006 Kin Ching Kong, Ph.D Lecture 6 Chicago School of Professional Psychology.
Today Concepts underlying inferential statistics
Statistics made simple Modified from Dr. Tammy Frank’s presentation, NOVA.
Chapter 14 Inferential Data Analysis
Richard M. Jacobs, OSA, Ph.D.
Inferential Statistics
Chapter 12 Inferential Statistics Gay, Mills, and Airasian
Inferential Statistics
Inferential Statistics
Example 10.1 Experimenting with a New Pizza Style at the Pepperoni Pizza Restaurant Concepts in Hypothesis Testing.
Understanding Research Results
AM Recitation 2/10/11.
Hypothesis Testing:.
Testing Hypotheses I Lesson 9. Descriptive vs. Inferential Statistics n Descriptive l quantitative descriptions of characteristics n Inferential Statistics.
© 2008 McGraw-Hill Higher Education The Statistical Imagination Chapter 9. Hypothesis Testing I: The Six Steps of Statistical Inference.
Comparing Means From Two Sets of Data
Evaluation - Controlled Experiments What is experimental design? What is an experimental hypothesis? How do I plan an experiment? Why are statistics used?
1 Today Null and alternative hypotheses 1- and 2-tailed tests Regions of rejection Sampling distributions The Central Limit Theorem Standard errors z-tests.
Copyright © 2012 Wolters Kluwer Health | Lippincott Williams & Wilkins Chapter 17 Inferential Statistics.
Copyright © 2008 Wolters Kluwer Health | Lippincott Williams & Wilkins Chapter 22 Using Inferential Statistics to Test Hypotheses.
Chapter 15 Data Analysis: Testing for Significant Differences.
Introduction To Biological Research. Step-by-step analysis of biological data The statistical analysis of a biological experiment may be broken down into.
User Study Evaluation Human-Computer Interaction.
Associate Professor Arthur Dryver, PhD School of Business Administration, NIDA url:
Educational Research: Competencies for Analysis and Application, 9 th edition. Gay, Mills, & Airasian © 2009 Pearson Education, Inc. All rights reserved.
Statistical analysis Prepared and gathered by Alireza Yousefy(Ph.D)
Chapter 10: Analyzing Experimental Data Inferential statistics are used to determine whether the independent variable had an effect on the dependent variance.
Chapter 9 Fundamentals of Hypothesis Testing: One-Sample Tests.
Section 9-1: Inference for Slope and Correlation Section 9-3: Confidence and Prediction Intervals Visit the Maths Study Centre.
Lecture 16 Section 8.1 Objectives: Testing Statistical Hypotheses − Stating hypotheses statements − Type I and II errors − Conducting a hypothesis test.
Chapter 9 Three Tests of Significance Winston Jackson and Norine Verberg Methods: Doing Social Research, 4e.
Educational Research Chapter 13 Inferential Statistics Gay, Mills, and Airasian 10 th Edition.
Chapter Seventeen. Figure 17.1 Relationship of Hypothesis Testing Related to Differences to the Previous Chapter and the Marketing Research Process Focus.
Review I A student researcher obtains a random sample of UMD students and finds that 55% report using an illegally obtained stimulant to study in the past.
Chapter Eight: Using Statistics to Answer Questions.
Chapter 6: Analyzing and Interpreting Quantitative Data
Tuesday, April 8 n Inferential statistics – Part 2 n Hypothesis testing n Statistical significance n continued….
IMPORTANCE OF STATISTICS MR.CHITHRAVEL.V ASST.PROFESSOR ACN.
URBDP 591 I Lecture 4: Research Question Objectives How do we define a research question? What is a testable hypothesis? How do we test an hypothesis?
26134 Business Statistics Week 4 Tutorial Simple Linear Regression Key concepts in this tutorial are listed below 1. Detecting.
Chapter 13 Understanding research results: statistical inference.
Jump to first page Inferring Sample Findings to the Population and Testing for Differences.
BIOL 582 Lecture Set 2 Inferential Statistics, Hypotheses, and Resampling.
Hypothesis Testing and Statistical Significance
Data Analysis. Qualitative vs. Quantitative Data collection methods can be roughly divided into two groups. It is essential to understand the difference.
Educational Research Inferential Statistics Chapter th Chapter 12- 8th Gay and Airasian.
Extension: How could researchers use a more powerful measure of analysis? Why do you think that researchers do not just rely on descriptive statistics.
Statistical principles: the normal distribution and methods of testing Or, “Explaining the arrangement of things”
26134 Business Statistics Week 4 Tutorial Simple Linear Regression Key concepts in this tutorial are listed below 1. Detecting.
Descriptive Statistics Report Reliability test Validity test & Summated scale Dr. Peerayuth Charoensukmongkol, ICO NIDA Research Methods in Management.
Controlled Experiments Part 2. Basic Statistical Tests Lecture /slide deck produced by Saul Greenberg, University of Calgary, Canada Notice: some material.
Agenda n Probability n Sampling error n Hypothesis Testing n Significance level.
Outline Sampling Measurement Descriptive Statistics:
Controlled Experiments
Controlled experiments Traditional scientific method Reductionist clear convincing result on specific issues In HCI: insights into cognitive process,
Qualitative vs. Quantitative
Hypothesis Testing Review
Elements of a statistical test Statistical null hypotheses
Presentation transcript:

HCI 510 : HCI Methods I HCI Methods –Controlled Experiments

HCI Methods Controlled Experiments –Statistical Analysis –Results Interpretation

Statistical analysis Apply statistical methods to data analysis –confidence limits : the confidence that your conclusion is correct “the hypothesis that computer experience makes no difference is rejected at the.05 level” means: –a 95% chance that your statement is correct –a 5% chance you are wrong

Interpretation Interpret your results –what you believe the results really mean –their implications to your research –their implications to practitioners –how generalisable they are –limitations and critique

Statistical analysis Calculations that tell us –mathematical attributes about our data sets mean, amount of variance,... –how data sets relate to each other whether we are “sampling” from the same or different distributions –the probability that our claims are correct “statistical significance”

Statistical vs practical significance When n is large, even a trivial difference may show up as a statistically significant result –eg menu choice: mean selection time of menu a is 3.00 seconds; menu b is 3.05 seconds Statistical significance does not imply that the difference is important! –a matter of interpretation –statistical significance often abused and used to misinform

Example: Differences between means Given: –two data sets measuring a condition height difference of males and females time to select an item from different menu styles... Question: –is the difference between the means of this data statistically significant? Null hypothesis: –there is no difference between the two means –statistical analysis: can only reject the hypothesis at a certain level of confidence Condition one: 3, 4, 4, 4, 5, 5, 5, 6 Condition two: 4, 4, 5, 5, 6, 6, 7, 7

Example: Is there a significant difference between these means? Condition one: 3, 4, 4, 4, 5, 5, 5, 6 Condition two: 4, 4, 5, 5, 6, 6, 7, Condition Condition mean = 4.5 mean =

Problem with visual inspection of data Will almost always see variation in collected data –Differences between data sets may be due to: normal variation –eg two sets of ten tosses with different but fair dice »differences between data and means are accountable by expected variation real differences between data –eg two sets of ten tosses for with loaded dice and fair dice »differences between data and means are not accountable by expected variation

T-test A simple statistical test –allows one to say something about differences between means at a certain confidence level Null hypothesis of the T-test: –no difference exists between the means of two sets of collected data Possible results: – I am 95% sure that null hypothesis is rejected (there is probably a true difference between the means) –I cannot reject the null hypothesis the means are likely the same

Different types of T-tests Comparing two sets of independent observations –usually different subjects in each group –number per group may differ as well Condition 1 Condition 2 S1–S20 S21–43 Paired observations –usually a single group studied under both experimental conditions –data points of one subject are treated as a pair Condition 1 Condition 2 S1–S20 S1–S20

Different types of T-tests Non-directional vs directional alternatives –non-directional (two-tailed) no expectation that the direction of difference matters –directional (one-tailed) Only interested if the mean of a given condition is greater than the other

T-test... Assumptions of t-tests –data points of each sample are normally distributed but t-test very robust in practice –population variances are equal t-test reasonably robust for differing variances deserves consideration –individual observations of data points in sample are independent must be adhered to Significance level –decide upon the level before you do the test –typically stated at the.05 or.01 level

Two-tailed Unpaired T-test Unpaired t-test DF: 14 Unpaired t Value: Prob. (2-tail):.0824 Group:Count:Mean:Std. Dev.:Std. Error: one two Condition one: 3, 4, 4, 4, 5, 5, 5, 6 Condition two: 4, 4, 5, 5, 6, 6, 7, 7 Looking up critical value of t Use table for two-tailed t-test, at p=.05, df=14 critical value = because t=1.871 < 2.145, there is no significant difference therefore, we cannot reject the null hypothesis i.e., there is no difference between the means df …

Significance levels and errors Type 1 error – reject the null hypothesis when it is, in fact, true Type 2 error – accept the null hypothesis when it is, in fact, false Effects of levels of significance –high confidence level (eg p<.0001) greater chance of Type 2 errors –low confidence level (eg p>.1) greater chance of Type 1 errors You can ‘bias’ your choice depending on consequence of these errors

Type I and Type II Errors FalseTrue Type I error  False  Type II error Decision “Reality” Type 1 error – reject the null hypothesis when it is, in fact, true Type 2 error – accept the null hypothesis when it is, in fact, false

Example: The SpamAssassin Spam Rater A SPAM rater gives each a SPAM likelihood –0: definitely valid … –1: –2: … –9: –10: definitely SPAM  SPAM likelihood Spam Rater  1 1    3 3  7 7 759

Example: The SpamAssassin Spam Rater A SPAM assassin deletes mail above a certain SPAM threshold –what should this threshold be? –‘Null hypothesis’: the arriving mail is SPAM  Spam Rater  1 1    3 3  7 7 75 <=X >X 9

Example: The SpamAssassin Spam Rater Low threshold = many Type I errors –many legitimate s classified as spam –but you receive very few actual spams High threshold = many Type II errors –many spams classified as –but you receive almost all your valid s  Spam Rater  1 1    3 3  7 7 75 <=X >X 9

Which is Worse? Type I errors are considered worse because the null hypothesis is meant to reflect the incumbent theory. BUT You must use your judgement to assess actual risk of being wrong in the context of your study.

Significance levels and errors There is no difference between Pie and traditional pop-up menus What is the consequence of each error type? –Type 1: extra work developing software people must learn a new idiom for no benefit –Type 2: use a less efficient (but already familiar) menu Which error type is preferable? 1.Redesigning a traditional GUI interface Type 2 error is preferable to a Type 1 error 2.Designing a digital mapping application where experts perform extremely frequent menu selections Type 1 error preferable to a Type 2 error New Open Close Save NewOpen Close Save

Scales of Measurements Four major scales of measurements –Nominal –Ordinal –Interval –Ratio

Nominal Scale Classification into named or numbered unordered categories –country of birth, user groups, gender… Allowable manipulations –whether an item belongs in a category –counting items in a category Statistics –number of cases in each category –most frequent category –no means, medians…

Nominal Scale Sources of error –agreement in labeling, vague labels, vague differences in objects Testing for error –agreement between different judges for same object

Ordinal Scale Classification into named or numbered ordered categories –no information on magnitude of differences between categories –e.g. preference, social status, gold/silver/bronze medals Allowable manipulations –as with interval scale, plus –merge adjacent classes –transitive: if A > B > C, then A > C Statistics –median (central value) –percentiles, e.g., 30% were less than B Sources of error –as in nominal

Interval Scale Classification into ordered categories with equal differences between categories –zero only by convention –e.g. temperature (C or F), time of day Allowable manipulations –add, subtract –cannot multiply as this needs an absolute zero Statistics –mean, standard deviation, range, variance Sources of error –instrument calibration, reproducibility and readability –human error, skill…

Ratio Scale Interval scale with absolute, non-arbitrary zero –e.g. temperature (K), length, weight, time periods Allowable manipulations –multiply, divide

Example: Apples Nominal: –apple variety Macintosh, Delicious, Gala… Ordinal: –apple quality US. Extra Fancy U.S. Fancy, U.S. Combination Extra Fancy / Fancy U.S. No. 1 U.S. Early U.S. Utility U.S. Hail

Example: Apples Interval: –apple ‘Liking scale’ Marin, A. Consumers’ evaluation of apple quality. Washington Tree Postharvest Conference After taking at least 2 bites how much do you like the apple? Dislike extremely Neither like or dislike Like extremely Ratio: –apple weight, size, …

Correlation Measures the extent to which two concepts are related –eg years of university training vs computer ownership per capita How? –obtain the two sets of measurements –calculate correlation coefficient +1: positively correlated 0: no correlation (no relation) –1: negatively correlated

Correlation condition 1 condition Condition 1 r 2 =.668

Correlation Dangers –attributing causality a correlation does not imply cause and effect cause may be due to a third “hidden” variable related to both other variables –drawing strong conclusion from small numbers unreliable with small groups be wary of accepting anything more than the direction of correlation unless you have at least 40 subjects

Correlation r 2 =.668 Pickles eaten per month Salary per year (*10,000) Pickles eaten per month Salary per year (*10,000) Possible Conclusions …

Correlation r 2 =.668 Pickles eaten per month Salary per year (*10,000) Pickles eaten per month Salary per year (*10,000) Which conclusion could be correct? -Eating pickles causes your salary to increase -Making more money causes you to eat more pickles -Pickle consumption predicts higher salaries because older people tend to like pickles better than younger people, and older people tend to make more money than younger people

Correlation Cigarette Consumption Crude Male death rate for lung cancer in 1950 per capita consumption of cigarettes in 1930 in various countries. While strong correlation (.73), can you prove that cigarrette smoking causes death from this data? Possible hidden variables: –age –poverty

Other Tests: Regression Calculates a line of “best fit” Use the value of one variable to predict the value of the other e.g., 60% of people with 3 years of university own a computer Condition 1 y =.988x , r2 = condition 1 condition 2 Condition 2

Single Factor Analysis of Variance Compares three or more means e.g. comparing mouse-typing on three keyboards: –Possible results: mouse-typing speed is –fastest on a qwerty keyboard –the same on an alphabetic & dvorak keyboards Qwerty Alphabetic Dvorak S1-S10 S11-S20S21-S30

Analysis of Variance (Anova) Compares relationships between many factors –Provides more informed results considers the interactions between factors –example beginners type at the same speed on all keyboards, touch-typist type fastest on the qwerty Qwerty Alphabetic Dvorak S1-S10 S11-S20 S21-S30 S31-S40 S41-S50 S51-S60 cannot touch type can touch type