Hypothesis testing. Parametric tests Georgi Iskrov, MBA, MPH, PhD Department of Social Medicine
Outline: Statistical inference; Hypothesis testing; Type I and type II errors; Student t test; ANOVA; Parametric vs non-parametric tests
Importance of biostatistics Diabetes type 2 study Experimental group: Mean blood sugar level: 103 mg/dl Control group: Mean blood sugar level: 107 mg/dl Pancreatic cancer study Experimental group: 1-year survival rate: 23% Control group: 1-year survival rate: 20% Is there a difference? Statistics are needed to quantify differences that are too small to recognize through clinical experience alone.
Statistical inference Diabetes type 2 study Experimental group: Mean blood sugar level: 103 mg/dl Control group: Mean blood sugar level: 107 mg/dl Increased sample size: Experimental group: Mean blood sugar level: 99 mg/dl Control group: Mean blood sugar level: 112 mg/dl
Statistical inference The t-test assesses whether the means of two samples are statistically different from each other. This analysis is appropriate whenever you want to compare the means of two samples or conditions. Mean: the arithmetic average, calculated by adding up all scores and dividing by the number of scores; it is a hypothetical value that does not have to be actually observed in the data set. Assumptions of a t-test: data from a parametric population; not (seriously) skewed; no outliers; independent samples. [Figure: population means µ1, µ2 and sample means X1, X2] If 2 means are statistically different, then the samples are likely to be drawn from 2 different populations, i.e. they really are different.
Statistical inference Diabetes type 2 study Experimental group: Mean blood sugar level: 103 mg/dl Control group: Mean blood sugar level: 107 mg/dl Increased sample size: Experimental group: Mean blood sugar level: 105 mg/dl Control group: Mean blood sugar level: 106 mg/dl
Statistical inference The t-test assesses whether the means of two samples are statistically different from each other. This analysis is appropriate whenever you want to compare the means of two samples or conditions. Mean: the arithmetic average, calculated by adding up all scores and dividing by the number of scores; it is a hypothetical value that does not have to be actually observed in the data set. Assumptions of a t-test: data from a parametric population; not (seriously) skewed; no outliers; independent samples. [Figure: one population mean µ and sample means X1, X2] If 2 samples are taken from the same population, then they should have fairly similar means.
Hypothesis testing The general idea of hypothesis testing involves: Making an initial assumption; Collecting evidence (data); Based on the available evidence (data), deciding whether to reject or not reject the initial assumption. Every hypothesis test — regardless of the population parameter involved — requires the above three steps.
Criminal trial Criminal justice system assumes the defendant is innocent until proven guilty. That is, our initial assumption is that the defendant is innocent. In the practice of statistics, we make our initial assumption when we state our two competing hypotheses – the null hypothesis (H0) and the alternative hypothesis (HA). Here, our hypotheses are: H0: Defendant is not guilty (innocent) HA: Defendant is guilty In statistics, we always assume the null hypothesis is true. That is, the null hypothesis is always our initial assumption.
Null hypothesis – H0 This is the hypothesis under test, denoted as H0. The null hypothesis is usually stated as the absence of a difference or an effect. The null hypothesis says there is no effect. The null hypothesis is rejected if the significance test shows the data are inconsistent with the null hypothesis.
Alternative hypothesis – H1 This is the alternative to the null hypothesis. It is denoted as H', H1, or HA. It is usually the complement of the null hypothesis. If, for example, the null hypothesis says two population means are equal, the alternative says the means are unequal
Criminal trial The prosecution team then collects evidence with the hopes of finding sufficient evidence to make the assumption of innocence refutable. In statistics, the data are the evidence. The jury then makes a decision based on the available evidence: If the jury finds sufficient evidence — beyond a reasonable doubt — to make the assumption of innocence refutable, the jury rejects H0 and deems the defendant guilty. We behave as if the defendant is guilty. If there is insufficient evidence, then the jury does not reject H0. We behave as if the defendant is innocent.
Making the decision Recall that it is either likely or unlikely that we would observe the evidence we did given our initial assumption. If it is likely, we do not reject the null hypothesis. If it is unlikely, then we reject the null hypothesis in favor of the alternative hypothesis. Effectively, then, making the decision reduces to determining likely or unlikely.
Making the decision In statistics, there are two ways to determine whether the evidence is likely or unlikely given the initial assumption: We could take the critical value approach (favored in many of the older textbooks). Or, we could take the P-value approach (what is used most often in research, journal articles, and statistical software).
Making the decision Suppose we find a difference between two groups in survival: patients on a new drug have a survival of 15 months; patients on the old drug have a survival of 18 months. So, the difference is 3 months. Do we accept or reject the hypothesis of no true difference between the groups (the two drugs)? Is a difference of 3 a lot, statistically speaking – a huge difference that is rarely seen? Or is it not much – the sort of thing that happens all the time?
Probability A measure of the likelihood that a particular event will happen. It is expressed by a value between 0 and 1: a probability of 0 means the event cannot happen, and a probability of 1 means it is sure to happen. First, note that we talk about the probability of an event, but what we measure is the rate in a group. If we observe that 5 babies in every 1 000 have congenital heart disease, we say that the probability of a (single) baby being affected is 5 in 1 000 or 0.005.
Making the decision A statistical test tells you how often you would get a difference of 3, simply by chance, if the null hypothesis is correct – no real difference between the two groups. Suppose the test is done and its result is that P = 0.32. This means that you would get a difference of 3 quite often just by the play of chance – 32 times in 100 – even when there is in reality no true difference between the groups.
Making the decision A statistical test tells you how often you would get a difference of 3, simply by chance, if the null hypothesis is correct, that is, if there is no real difference between the two groups. On the other hand, if we did the statistical analysis and P = 0.0001, then we say that you would only get a difference as big as 3 by the play of chance 1 time in 10 000. That is so rare that we want to reject our hypothesis of no difference: there is something different about the new therapy.
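What "by the play of chance" means can be illustrated with a permutation test: shuffle the group labels many times and count how often the shuffled groups differ by at least as much as the observed ones. This is only a sketch, and it swaps in a permutation test rather than any particular test from the slides. The survival times and the function name permutation_p_value are invented for illustration.

```python
import random
from statistics import mean

# Hypothetical survival times in months (invented for illustration only).
new_drug = [12, 14, 15, 15, 16, 18]   # mean 15
old_drug = [13, 16, 18, 19, 20, 22]   # mean 18

observed = abs(mean(old_drug) - mean(new_drug))

def permutation_p_value(a, b, n_shuffles=10_000, seed=0):
    """Share of random relabelings whose mean difference is at least as large as observed."""
    rng = random.Random(seed)          # fixed seed so the sketch is reproducible
    pooled = a + b                     # under H0 the group labels are interchangeable
    target = abs(mean(a) - mean(b))
    hits = 0
    for _ in range(n_shuffles):
        rng.shuffle(pooled)
        diff = abs(mean(pooled[:len(a)]) - mean(pooled[len(a):]))
        if diff >= target:
            hits += 1
    return hits / n_shuffles

p = permutation_p_value(new_drug, old_drug)
print(f"observed difference = {observed} months, estimated P = {p:.3f}")
```

The returned share plays the role of the P-value: the smaller it is, the less plausible it becomes that the observed difference is only chance.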
Hypothesis testing Somewhere between 0.32 and 0.0001 we may not be sure whether to reject the null hypothesis or not. Mostly we reject the null hypothesis when, if the null hypothesis were true, the result we got would have happened less than 5 times in 100 by chance. This is the conventional cutoff of 5% or P < 0.05. This cutoff is commonly used but it is arbitrary, i.e. there is no particular reason why we use 0.05 rather than 0.06 or 0.048 or any other value.
Hypothesis testing Decision outcomes: If the null hypothesis is true, rejecting it is a type I error and not rejecting it is no error. If the null hypothesis is false, rejecting it is no error and not rejecting it is a type II error.
Type I and II errors A type I error is the incorrect rejection of a true null hypothesis (also known as a false positive finding). The probability of a type I error is denoted by the Greek letter α (alpha). A type II error is incorrectly retaining a false null hypothesis (also known as a false negative finding). The probability of a type II error is denoted by the Greek letter β (beta).
Level of significance Level of significance (α) – the threshold for declaring if a result is significant. If the null hypothesis is true, α is the probability of rejecting the null hypothesis. α is decided as part of the research design, while P-value is computed from data. α = 0.05 is most commonly used. Small α value reduces the chance of Type I error, but increases the chance of Type II error. Trade-off based on the consequences of Type I (false-positive) and Type II (false-negative) errors.
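The statement that α is the probability of rejecting a true null hypothesis can be checked by simulation. The sketch below repeats a study many times with the null hypothesis true by construction and counts how often P < α. It assumes a z-test with known population standard deviation for simplicity (not a t-test), and the population values MU, SIGMA and N are hypothetical.

```python
import random
from statistics import NormalDist, mean

ALPHA = 0.05
MU, SIGMA, N = 100.0, 15.0, 30   # hypothetical population mean, SD and sample size

def z_test_p_value(sample, mu0, sigma):
    """Two-sided P-value of a one-sample z-test (population SD assumed known)."""
    z = (mean(sample) - mu0) / (sigma / len(sample) ** 0.5)
    return 2 * (1 - NormalDist().cdf(abs(z)))

rng = random.Random(1)           # fixed seed so the sketch is reproducible
n_studies = 5_000
false_positives = 0
for _ in range(n_studies):
    # H0 is true here: every sample really comes from a population with mean MU.
    sample = [rng.gauss(MU, SIGMA) for _ in range(N)]
    if z_test_p_value(sample, MU, SIGMA) < ALPHA:
        false_positives += 1

type_i_rate = false_positives / n_studies
print(f"share of true null hypotheses rejected: {type_i_rate:.3f}")
```

The printed share should land close to ALPHA; lowering ALPHA lowers it accordingly.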
Power Power – the probability of rejecting a false null hypothesis. Statistical power is inversely related to β or the probability of making a Type II error (power is equal to 1 – β). Power depends on the sample size, variability, significance level and hypothetical effect size. You need a larger sample when you are looking for a small effect and when the standard deviation is large.
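How power depends on sample size, variability, significance level and effect size can be sketched with the closed-form power of a two-sided one-sample z-test. The z-test is an assumption made to keep the formula short; the function name z_test_power and the numbers used are illustrative only.

```python
from statistics import NormalDist

def z_test_power(effect, sigma, n, alpha=0.05):
    """Power of a two-sided one-sample z-test to detect a true mean shift `effect`."""
    std_normal = NormalDist()
    z_crit = std_normal.inv_cdf(1 - alpha / 2)   # critical value for the chosen alpha
    shift = abs(effect) * n ** 0.5 / sigma       # true shift in standard-error units
    # Probability that the test statistic falls in either rejection region.
    return std_normal.cdf(shift - z_crit) + std_normal.cdf(-shift - z_crit)

# Hypothetical scenario: true difference of 2 units, SD of 4 units.
for n in (10, 28, 100):
    print(n, round(z_test_power(effect=2.0, sigma=4.0, n=n), 3))
```

Power rises with the sample size and falls as the standard deviation grows, which is why small effects and noisy measurements call for larger samples.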
Common misconceptions P-value is different from the level of significance α. P-value is computed from data, while α is decided as part of the experimental design. P-value is not the probability of the null hypothesis being true. P-value answers the following question: If the null hypothesis is true, what is the chance that random sampling will lead to a difference as large as or larger than the one observed in the study? A statistically significant result does not necessarily mean that the finding is clinically important. Look at the size of the effect and its precision. Lack of difference may be a meaningful result too!
Choosing a statistical test Choice of a statistical test depends on: Level of measurement for the dependent and independent variables; Number of groups or dependent measures; Number of units of observation; Type of distribution; The population parameter of interest (mean, variance, differences between means and/or variances).
Choosing a statistical test Multiple comparison: two or more data sets to be analyzed, either repeated measurements made on the same individuals or entirely independent samples. Degrees of freedom: the number of scores, items, or other units in the data set that are free to vary. One- and two-tailed tests: a one-tailed test of significance is used for a directional hypothesis; two-tailed tests are used in all other situations. Sample size: the number of cases on which data have been obtained. Which of the basic characteristics of a distribution are more sensitive to the sample size?
Student t-test
1-sample t-test Comparison of a sample mean with a population mean. It is known that the weight of young adult males has a mean value of 70.0 kg with a standard deviation of 4.0 kg. Thus the population mean µ = 70.0 and the population standard deviation σ = 4.0. Data from a random sample of 28 males of similar age but with a specific enzyme defect: mean body weight of 67.0 kg and sample standard deviation of 4.2 kg. Question: Does the studied group have a significantly lower body weight than the general population?
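The body weight example can be worked through as a one-sample t-test. The sketch below uses the sample standard deviation (4.2 kg), as a t-test does, and leaves the known population standard deviation unused. The helper functions t_pdf and t_cdf are illustrative stand-ins for a statistics package: they obtain the P-value by numerically integrating the Student t density.

```python
import math

def t_pdf(x, df):
    """Density of Student's t distribution with `df` degrees of freedom."""
    log_norm = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
                - 0.5 * math.log(df * math.pi))
    return math.exp(log_norm - (df + 1) / 2 * math.log1p(x * x / df))

def t_cdf(x, df, steps=2000):
    """P(T <= x), by Simpson's rule on [0, |x|] plus the symmetry of the density."""
    a = abs(x)
    h = a / steps                          # steps must be even for Simpson's rule
    total = t_pdf(0, df) + t_pdf(a, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    area = total * h / 3                   # P(0 <= T <= |x|)
    return 0.5 + area if x >= 0 else 0.5 - area

mu0 = 70.0                     # population mean from the example
n, x_bar, s = 28, 67.0, 4.2    # sample size, sample mean, sample SD

t_stat = (x_bar - mu0) / (s / math.sqrt(n))
df = n - 1
p_one_sided = t_cdf(t_stat, df)   # alternative hypothesis: mean is lower than 70 kg
print(f"t = {t_stat:.2f}, df = {df}, one-sided P = {p_one_sided:.4f}")
```

A one-sided P-value is used because the question is directional (lower body weight); for a non-directional question it would be doubled.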
2-sample t-test Aim: Compare two means Example: Comparing pulse rate in people taking two different drugs Assumption: Both data sets are sampled from Gaussian distributions with the same population standard deviation Effect size: Difference between two means Null hypothesis: The two population means are identical Meaning of P value: If the two population means are identical, what is the chance of observing such a difference (or a bigger one) between means by chance alone?
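A minimal sketch of the equal-variance (pooled) two-sample t statistic follows. The pulse rates are invented for illustration, and the decision uses the critical value approach with a value taken from a standard t table, not a computed P-value.

```python
import math
from statistics import mean, variance

# Hypothetical pulse rates (beats per minute), invented for illustration.
drug_a = [72, 75, 78, 71, 74, 77, 73, 76]
drug_b = [78, 80, 76, 83, 79, 81, 77, 82]

def pooled_t(a, b):
    """t statistic and degrees of freedom for the equal-variance two-sample t-test."""
    na, nb = len(a), len(b)
    df = na + nb - 2
    # Pooled variance: weighted average of the two sample variances.
    pooled_var = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / df
    se = math.sqrt(pooled_var * (1 / na + 1 / nb))
    return (mean(a) - mean(b)) / se, df

t_stat, df = pooled_t(drug_a, drug_b)
T_CRIT = 2.145   # two-sided 5% critical value for df = 14, from a standard t table
reject_h0 = abs(t_stat) > T_CRIT
print(f"t = {t_stat:.3f}, df = {df}, reject H0 at 5%: {reject_h0}")
```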
Paired t-test Aim: Compare a continuous variable before and after an intervention Example: Comparing pulse rate before and after taking a drug Assumption: The population of paired differences is Gaussian Effect size: Mean of the paired differences Null hypothesis: The population mean of paired differences is zero Meaning of P value: If there is no difference in the population, what is the chance of observing such a difference (or a bigger one) between means by chance alone?
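The paired t-test reduces to a one-sample t-test on the within-person differences. The sketch below shows that reduction; the before and after pulse rates are invented, and the critical value is taken from a standard t table.

```python
import math
from statistics import mean, stdev

# Hypothetical pulse rates for the same 8 people, invented for illustration.
before = [72, 75, 78, 71, 74, 77, 73, 76]
after = [70, 74, 75, 71, 71, 76, 70, 75]

# Work on the paired differences, not on the two group means separately.
diffs = [a - b for a, b in zip(after, before)]

mean_diff = mean(diffs)                        # effect size: mean paired difference
se = stdev(diffs) / math.sqrt(len(diffs))      # standard error of the mean difference
t_stat = mean_diff / se
df = len(diffs) - 1

T_CRIT = 2.365   # two-sided 5% critical value for df = 7, from a standard t table
reject_h0 = abs(t_stat) > T_CRIT
print(f"mean difference = {mean_diff}, t = {t_stat:.3f}, df = {df}, reject H0: {reject_h0}")
```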
One-way ANOVA Aim: Compare three or more means Example: Comparing pulse rate in 3 groups of people, each group taking a different drug Assumption: All data sets are sampled from Gaussian distributions with the same population standard deviation Effect size: Fraction of the total variation explained by variation among group means Null hypothesis: All population means are identical Meaning of P value: If the population means are identical, what is the chance of observing such a difference (or a bigger one) between means by chance alone?
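One-way ANOVA splits the total variation into variation among group means and variation within groups. The sketch below computes the F statistic and the fraction of total variation explained (eta squared) for three invented groups. The function name one_way_anova is illustrative, and the critical value comes from a standard F table.

```python
from statistics import mean

# Hypothetical pulse rates in three drug groups, invented for illustration.
groups = [
    [72, 75, 78, 71, 74],   # drug A
    [78, 80, 76, 83, 79],   # drug B
    [74, 77, 73, 76, 75],   # drug C
]

def one_way_anova(samples):
    """Return F, its two degrees of freedom, and eta squared."""
    all_values = [x for s in samples for x in s]
    grand_mean = mean(all_values)
    # Variation of the group means around the grand mean.
    ss_between = sum(len(s) * (mean(s) - grand_mean) ** 2 for s in samples)
    # Variation of the observations around their own group mean.
    ss_within = sum((x - mean(s)) ** 2 for s in samples for x in s)
    df_between = len(samples) - 1
    df_within = len(all_values) - len(samples)
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    eta_squared = ss_between / (ss_between + ss_within)
    return f_stat, df_between, df_within, eta_squared

f_stat, df_between, df_within, eta_squared = one_way_anova(groups)
F_CRIT = 3.89   # 5% critical value for F(2, 12), from a standard F table
print(f"F({df_between}, {df_within}) = {f_stat:.2f}, eta squared = {eta_squared:.2f}, "
      f"reject H0: {f_stat > F_CRIT}")
```

A significant F only says that not all means are equal; it does not say which groups differ.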
Parametric and non-parametric tests Parametric test – the variable we have measured in the sample is normally distributed in the population to which we plan to generalize our findings Non-parametric test – distribution free, no assumption about the distribution of the variable in the population
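Parametric and non-parametric tests Type of test by study design and measurement scale (non-parametric tests for nominal and ordinal scales; parametric tests for ordinal, interval and ratio scales):
1 group: χ2 goodness of fit test (nominal); Wilcoxon signed rank test (ordinal); 1-sample t-test (parametric).
2 unrelated groups: χ2 test (nominal); Mann–Whitney U test (ordinal); 2-sample t-test (parametric).
2 related groups: McNemar test (nominal); Paired t-test (parametric).
K unrelated groups: Kruskal–Wallis H test (ordinal); ANOVA (parametric).
K related groups: Friedman matched samples test (ordinal); ANOVA with repeated measurements (parametric).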