DTC Quantitative Research Methods Statistical Inference II: Statistical Testing Thursday 7th November 2014
Hypothesis testing Imagine that we know that the mean income of university graduates is £16,500. We then do a survey of 64 sociology graduates and find that they earn a mean income of £15,400 with a standard deviation of £4,000. Can we say that this is convincing evidence that sociology graduates earn less than other graduates? The null hypothesis here is that sociology graduates earn the same as other graduates. This is a hypothesis of no difference. The alternative hypothesis is that there is a difference. The null hypothesis (or H0) is usually of no difference, and the alternative hypothesis (or Ha) is usually of difference. When we carry out statistical tests, we attempt, as here, to reject the null hypothesis at a 95% level of confidence (or sometimes at a 99% or 99.9% level).
Statistical significance A conclusion (e.g. that a difference or relationship exists) is statistically significant if the probability that the conclusion would be drawn if it is, in fact, erroneous falls below the significance level chosen (in social science research this is often 5% = 0.05 = 1 in 20). The significance level is sometimes referred to as alpha (α).
Hypothesis testing So, thinking about the example again: Imagine that we know that the mean income of university graduates is £16,500. We then do a survey of 64 sociology graduates and find that they earn a mean income of £15,400 with a standard deviation of £4,000. Can we say that this is convincing evidence that sociology graduates earn less than other graduates? If we construct a 95% confidence interval for the population mean income of sociology graduates it will look like this:
15,400 plus or minus 1.96 x (4,000 / √64)
= 15,400 plus or minus 1.96 x (4,000 / 8)
= 15,400 plus or minus 980
= £14,420 to £16,380
The top point of this range is still below the mean income for graduates generally: there is no overlap. This means that there is less than a 5% chance that a difference as big as £1,100 would have occurred if there is no difference between sociology graduates’ mean income and the mean income for all graduates.
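The interval arithmetic above can be checked with a short Python sketch. The variable names are illustrative only, and 1.96 is the usual normal multiplier for a 95% interval, as used on the slide.

```python
import math

sample_mean = 15_400   # mean income of the 64 sociology graduates (£)
sample_sd = 4_000      # sample standard deviation (£)
n = 64                 # sample size
z_95 = 1.96            # normal multiplier for a 95% confidence interval

# Standard error of the mean: sd divided by the square root of n (4000 / 8).
standard_error = sample_sd / math.sqrt(n)

# Half-width of the interval, then the two end points.
margin = z_95 * standard_error
ci_lower = sample_mean - margin
ci_upper = sample_mean + margin

print(f"95% CI: £{ci_lower:,.0f} to £{ci_upper:,.0f}")
```

The upper end point can then be compared directly with the graduate mean of £16,500.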
p-values A p-value quantifies the statistical significance of a result more precisely. It measures how likely a difference or relationship of equal or greater magnitude to that observed would be to have occurred if there is no difference/relationship in the population (i.e. if the null hypothesis is correct).
Back to the example… In the example, the standard error (i.e. the standard deviation of the sample mean) is equal to (4,000 / √64) = 500. Thus the sample mean is 1,100/500 = 2.2 standard errors away from the suggested population mean. Statistical theory tells us that 95% of sample means are within 1.96 standard errors of the population mean. It also tells us that 97.2% of sample means are within 2.2 standard errors of the population mean. Hence the p-value for the difference of 2.2 standard errors (which is a test statistic) is (100-97.2)/100 = 0.028. Since p < 0.05, the difference is statistically significant at the conventional 5% significance level.
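As a rough illustration, the same test statistic and two-tailed p-value can be computed in Python using only the standard library. The helper name normal_cdf is an assumption made for this sketch; it evaluates the standard normal distribution via the error function.

```python
import math

def normal_cdf(x):
    """Cumulative distribution function of the standard normal distribution."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

population_mean = 16_500   # mean income of all graduates (£)
sample_mean = 15_400       # mean income of the sociology sample (£)
sample_sd = 4_000          # sample standard deviation (£)
n = 64                     # sample size

standard_error = sample_sd / math.sqrt(n)

# Test statistic: distance from the suggested population mean in standard errors.
z = (sample_mean - population_mean) / standard_error

# Two-tailed p-value: probability of a sample mean at least this far away
# (in either direction) if the null hypothesis is true.
p_value = 2.0 * (1.0 - normal_cdf(abs(z)))

print(f"z = {z:.1f}, p = {p_value:.3f}")
```

The printed p-value should agree with the 0.028 obtained above from the 97.2% figure.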
Hypothesis testing Theory: You test out particular hypotheses with reference to your sample statistics. However, these hypotheses are about underlying population characteristics (parameters). Procedure: (1) Set up ‘null’ (and ‘alternative’) hypotheses. (2) Note sample size and design. (3) Establish the sampling distribution under the assumption that the null hypothesis is true. (4) Identify the decision rule (i.e. what constitutes acceptance/rejection of the null hypothesis). (5) Compute sample statistic(s), and apply the decision rule (N.B. This is where Type I and Type II errors can occur).
Error Types

                                      Truth about population
Decision (based on hypothesis test)   H0 true             Ha true
Reject H0                             Type I error        Correct decision
Do not reject H0                      Correct decision    Type II error

Note: For a given sample size, reducing the chance of one type of error occurring increases the chance that the other type will!
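A small simulation can make the Type I error rate concrete. The sketch below is illustrative only: it assumes incomes are normally distributed with the mean and standard deviation from the graduate example (an assumption, since real incomes are usually skewed), draws repeated samples from a population in which H0 is true, and counts how often a test at the 5% level rejects it anyway.

```python
import math
import random
import statistics

random.seed(2014)          # fixed seed so the run is repeatable

TRUE_MEAN = 16_500         # H0 is true by construction
TRUE_SD = 4_000            # assumed population standard deviation
N = 64                     # size of each simulated sample
TRIALS = 2_000             # number of simulated surveys
CRITICAL_Z = 1.96          # two-tailed cut-off for the 5% level

rejections = 0
for _ in range(TRIALS):
    sample = [random.gauss(TRUE_MEAN, TRUE_SD) for _ in range(N)]
    standard_error = statistics.stdev(sample) / math.sqrt(N)
    z = (statistics.mean(sample) - TRUE_MEAN) / standard_error
    if abs(z) > CRITICAL_Z:
        rejections += 1    # a Type I error: H0 rejected although it is true

type_i_rate = rejections / TRIALS
print(f"Proportion of false rejections: {type_i_rate:.3f}")
```

The proportion should come out close to the chosen significance level of 0.05, which is what the significance level means in practice.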
(Statistical) Power Power is defined as the probability that a test will correctly reject the null hypothesis, i.e. correctly conclude that there is a difference, relationship, etc. The probability of a Type II error is sometimes labelled beta (β), hence power equals 1-β. The power of a test depends on the size of the effect (which is, of course, unknown!)
What is the point of power? Power also depends on the sample size and the significance level chosen. So if we want to use the usual 5% significance level (to obtain ‘95% confidence’ in our results) and we want to be able to identify an effect of a given size, we can calculate how likely, for a given sample size, we are to find an effect of that size, assuming such an effect exists. If the power of a test is low, there is little point in applying it, which suggests a need for a larger sample.
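To illustrate how power depends on sample size, here is a minimal sketch for a two-sided one-sample z-test. The function name z_test_power and its parameters are hypothetical, and the effect size (£1,100) and standard deviation (£4,000) are borrowed from the graduate income example purely as assumed inputs.

```python
import math
from statistics import NormalDist

def z_test_power(effect, sd, n, alpha=0.05):
    """Approximate power of a two-sided one-sample z-test.

    effect: true difference between the population mean and the H0 value
    sd:     population standard deviation (assumed known)
    n:      sample size
    alpha:  significance level
    """
    std_normal = NormalDist()
    z_crit = std_normal.inv_cdf(1.0 - alpha / 2.0)
    # True effect expressed in standard-error units.
    shift = abs(effect) / (sd / math.sqrt(n))
    # Probability that the test statistic falls in either rejection region.
    return std_normal.cdf(shift - z_crit) + std_normal.cdf(-shift - z_crit)

# Power for a range of sample sizes, at the usual 5% significance level.
for n in (16, 64, 144, 256):
    power = z_test_power(effect=1_100, sd=4_000, n=n)
    print(f"n = {n:3d}: power = {power:.2f}")
```

Under these assumptions power rises steadily with the sample size, which is the reasoning behind choosing a sample size in advance.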
Never innocent… Rather than deciding between ‘guilty’ and ‘innocent’, statistical tests decide between ‘guilty’ and ‘not proven’. In other words, a statistically insignificant or non-significant result (sometimes indicated by NS rather than, say, p > 0.05) does not indicate that a difference or relationship does not exist, but simply that there is insufficient evidence to conclude that one does exist! This leaves open the possibility of a small difference or weak relationship, which the statistical test was insufficiently powerful to identify…
Applying the logic of a statistical test… There are a large number of different statistical tests that use inferential methods to ask questions about different forms of differences/relationships: (1) Is the sample mean sufficiently different from the suggested population mean that it is implausible that the suggested population mean is correct? Test the plausibility of a suggested population mean via a z-test. [This is what we’ve just done]. (2) Are the means from two samples sufficiently different for it to be implausible that the populations from which they come are actually the same? Test via a two-sample t-test or, if comparing more than two (sub-)samples (i.e. more than two groups), test for differences via Analysis of Variance (usually referred to as ANOVA). (3) Are the observed frequencies in a cross-tabulation sufficiently different from what one would have expected to have seen if there were no relationship in the population for the idea that there is no relationship in the population to be implausible? Test this via a chi-square test. In each instance we are asking whether the difference between the actual (observed) data and what one would have expected to have seen, given some hypothesis H0, is sufficiently large that the hypothesis is implausible. Thus we are always trying to disprove a (null) hypothesis.
(Two sample) t-tests Test the null hypothesis, which is: H0: \( \mu_1 = \mu_2 \) or H0: \( \mu_1 - \mu_2 = 0 \), i.e. the equality of means. The alternative hypothesis is: Ha: \( \mu_1 \neq \mu_2 \) or Ha: \( \mu_1 - \mu_2 \neq 0 \)
What does a t-test measure? Note: T = treatment group and C = control group. (The above depicts a comparison in experimental research; in most discussions the groups tend just to be labelled as groups 1 and 2, indicating different groups.)
Example We want to compare the average amounts of television watched by Australian and by British children. We have a sample of Australian and a sample of British children. We could say that what we have and want to do are something like this: [Diagram: we want to compare the population of Australian children with the population of British children; inference runs from the sample of Australian children to the Australian population, and from the sample of British children to the British population.]
t distribution critical values Example (continued) Here the dependent variable is the number of hours of TV watched each night, and the independent variable is nationality (or, perhaps, national context). When we are comparing means, SPSS calls the independent variable the grouping variable and the dependent variable the test variable. For a more detailed view of statistics go all the way to Australia: SurfStat
Example (continued) If the null hypothesis, hypothesising no difference between the two groups, were correct (and children thus watch the same average amount of television in Australia as in Britain), we would expect that if we took repeated samples from the two groups the difference in means between them would generally be small or zero. However, it is highly likely that the difference between any two particular samples will not be exactly zero. Therefore we need to know the sampling distribution of the difference between the two sample means. We use this distribution to determine the probability of getting an observed difference (of a given size) between two sample means from populations with no difference.
If we take a large number of random samples and calculate the difference between each pair of sample means, we will end up with a sampling distribution that has the following properties: (1) It will be a t-distribution. (2) The mean of the difference between sample means will be zero if the null hypothesis is correct: Mean (M1 – M2) = 0. (3) The ‘average’ spread of scores around this mean of zero (the standard error) will be defined by the formula:
\[ S_{DM} = \sqrt{\frac{(N_1 - 1)s_1^2 + (N_2 - 1)s_2^2}{N_1 + N_2 - 2} \times \frac{N_1 + N_2}{N_1 N_2}} \]
This estimate ‘pools’ the variance in the groups; just take it at face value!
Back to the example… When we are choosing the test of significance it is important to note that: We are making an inference from TWO samples (of Australian and of British children). These samples are independent (the number of hours of TV watched by British children doesn’t affect the number of hours watched by Australian children, and vice versa). Therefore we need a two-sample test (what SPSS calls an ‘independent samples’ t-test). The two samples are being compared in terms of an interval-ratio variable (hours of TV watched). Therefore the relevant descriptive statistic is the mean. These facts lead us to select the two-sample t-test for the equality of means as the relevant test of significance.

Table 1. Descriptive statistics for the samples

Descriptive statistic    Australian sample    British sample
Mean                     166 minutes          187 minutes
Standard deviation       29 minutes           30 minutes
Sample size              20                   20
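t-test of independent means: formulae
\[ t = \frac{M_1 - M_2}{S_{DM}} \]
\[ S_{DM} = \sqrt{\frac{(N_1 - 1)s_1^2 + (N_2 - 1)s_2^2}{N_1 + N_2 - 2} \times \frac{N_1 + N_2}{N_1 N_2}} \]
\[ df = N_1 + N_2 - 2 \]
Note: \( \frac{1}{N_1} + \frac{1}{N_2} = \frac{N_1 + N_2}{N_1 N_2} \)
Where: M = mean; SDM = standard error of the difference between means; N = number of subjects in a group; s = sample standard deviation of a group; df = degrees of freedom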
What are ‘degrees of freedom’? Degrees of freedom can be thought of as the ‘sources of variation’ in a particular situation. If we are comparing groups of 20, then within each group there are 19 (independent) sources of difference between the values for that group. Thus for the two groups combined there are 19+19 = 38 degrees of freedom (d.f.)
Example: Calculating the t-value

Descriptive statistic    Australian sample    British sample
Mean                     166 minutes          187 minutes
Std. dev.                29 minutes           30 minutes
Sample size              20                   20

\[ S_{DM} = \sqrt{\frac{(20 - 1)29^2 + (20 - 1)30^2}{20 + 20 - 2} \times \frac{20 + 20}{20 \times 20}} = 9.3 \]
\[ t_{sample} = \frac{166 - 187}{9.3} = -2.3 \]
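The calculation above can be reproduced with a few lines of Python. This is only a sketch: the function name pooled_standard_error is made up for illustration, and it implements the pooled-variance formula for the standard error of the difference between means.

```python
import math

def pooled_standard_error(s1, n1, s2, n2):
    """Standard error of the difference between two means (pooled variance)."""
    pooled_variance = ((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2)
    return math.sqrt(pooled_variance * (n1 + n2) / (n1 * n2))

# Descriptive statistics from the example (minutes of TV watched).
mean_aus, sd_aus, n_aus = 166, 29, 20
mean_gb, sd_gb, n_gb = 187, 30, 20

se_diff = pooled_standard_error(sd_aus, n_aus, sd_gb, n_gb)
t_value = (mean_aus - mean_gb) / se_diff
df = n_aus + n_gb - 2

print(f"SE of difference = {se_diff:.1f}, t = {t_value:.2f}, df = {df}")
```

Without intermediate rounding the t-value is about -2.25, which rounds to the -2.3 shown above.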
Example: Obtaining a p-value for a t-value To obtain the p-value for this t-value (score) we could consult a table of critical values for the t-distribution. Such a table may not have a row of probabilities for 38 degrees of freedom (d.f.). In that case we would (to be cautious) refer to the row for the nearest reported number of degrees of freedom below the desired number. Here that might be 30. For 30 degrees of freedom and a two-tailed test, the tabulated t-scores for p=0.05 and p=0.02 are 2.042 and 2.457. The absolute magnitude of the t-statistic (2.3) falls between these scores, hence the p-value linked to this t-statistic is between 0.02 and 0.05. The result is therefore statistically significant at the 5% (0.05) level but not at the 2% or 1% (0.02 or 0.01) level. Of course, SPSS is set up to calculate exact p-values for test statistics such as the t-statistic (in this case the exact value is p=0.030).
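For readers without SPSS to hand, an exact two-tailed p-value can be approximated numerically. The sketch below is not how SPSS does it: it simply integrates the t-distribution density with Simpson's rule, using only the Python standard library. The function names t_density and two_tailed_p are illustrative, and the t-value is recomputed from the example's summary statistics without intermediate rounding.

```python
import math

def t_density(x, df):
    """Probability density of Student's t-distribution with df degrees of freedom."""
    log_norm = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
                - 0.5 * math.log(df * math.pi))
    return math.exp(log_norm - (df + 1) / 2 * math.log1p(x * x / df))

def two_tailed_p(t, df, steps=2000):
    """Two-tailed p-value via Simpson's rule on [0, |t|] (steps must be even)."""
    upper = abs(t)
    h = upper / steps
    total = t_density(0.0, df) + t_density(upper, df)
    for i in range(1, steps):
        weight = 4 if i % 2 else 2
        total += weight * t_density(i * h, df)
    central_area = total * h / 3.0        # P(0 < T < |t|)
    return 1.0 - 2.0 * central_area       # area in both tails

# Unrounded t-value for the television example (20 children per group).
se_diff = math.sqrt(((19 * 29**2 + 19 * 30**2) / 38) * (40 / 400))
t_value = (166 - 187) / se_diff

p_value = two_tailed_p(t_value, df=38)
print(f"t = {t_value:.2f}, two-tailed p = {p_value:.3f}")
```

The same function can be used to check tabulated critical values, for example two_tailed_p(2.042, 30) should be very close to 0.05.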
Example: Reporting the results “The mean number of minutes of TV watched by the sample of 20 British children is 187 minutes, which is 21 minutes higher than the mean of 166 minutes for the sample of 20 Australian children; this difference is statistically significant at the 0.05 level (t(38)= -2.3, p = 0.03, two-tailed test). Based on these results we can reject the hypothesis that British and Australian children watch the same average amount of television every night.”
Some final thoughts… ANOVA (Analysis of Variance) works on broadly similar principles, but is a technique allowing one to look simultaneously at differences between the means of more than two groups. Both t-tests and ANOVA make an assumption of homogeneity of variance (i.e. that the spread of values in each of the groups being considered is consistent). We will look at ANOVA in more detail later in the module. What is crucial to remember from this session is the set of principles underlying hypothesis testing: That we start with a null hypothesis (of no difference in the population). That, using our sample, we can test whether this is plausible. The p-values that we get (and that we report) show the likelihood of results at least as extreme as those observed, given no difference. Therefore (to simplify), the lower the p-value, the stronger the evidence that there is a real difference between the groups. A reminder: The three things that affect the test statistic are the sample size (of each group), the size of the differences in the means (between groups) and the variability of scores (within each group).