Hypothesis testing. Association and regression Georgi Iskrov, PhD Department of Social Medicine
Outline: Hypothesis testing; Type I and type II errors; Student t-test; ANOVA; Parametric vs non-parametric tests; Normality tests; Rank-based tests; Chi-square test; Fisher’s exact test; Correlation analysis; Regression analysis.
Importance of biostatistics Diabetes type 2 study Experimental group: Mean blood sugar level: 103 mg/dl Control group: Mean blood sugar level: 107 mg/dl Pancreatic cancer study Experimental group: 1-year survival rate: 23% Control group: 1-year survival rate: 20% Is there a difference? Statistics are needed to quantify differences that are too small to recognize through clinical experience alone.
Statistical inference Diabetes type 2 study Experimental group: Mean blood sugar level: 103 mg/dl Control group: Mean blood sugar level: 107 mg/dl Increased sample size: Experimental group: Mean blood sugar level: 99 mg/dl Control group: Mean blood sugar level: 112 mg/dl
Statistical inference assesses whether the means of two samples are statistically different from each other. This analysis is appropriate whenever you want to compare the means of two samples or conditions. The mean (arithmetic average) is a hypothetical value that can be calculated for a data set by adding up all scores and dividing by the number of scores; it does not have to be a value that is actually observed in the data set. Assumptions of a t-test: the data come from a normally distributed (parametric) population; the distribution is not (seriously) skewed; there are no outliers; the samples are independent. If the two means are statistically different, then the samples are likely to be drawn from two different populations, i.e. they really are different.
Statistical inference Diabetes type 2 study Experimental group: Mean blood sugar level: 103 mg/dl Control group: Mean blood sugar level: 107 mg/dl Increased sample size: Experimental group: Mean blood sugar level: 105 mg/dl Control group: Mean blood sugar level: 106 mg/dl
Statistical inference assesses whether the means of two samples are statistically different from each other. The same assumptions apply as above. If the two samples are taken from the same population, then they should have fairly similar means.
Hypothesis testing The general idea of hypothesis testing involves: Making an initial assumption; Collecting evidence (data); Based on the available evidence (data), deciding whether to reject or not reject the initial assumption. Every hypothesis test — regardless of the population parameter involved — requires the above three steps.
Criminal trial Criminal justice system assumes the defendant is innocent until proven guilty. That is, our initial assumption is that the defendant is innocent. In the practice of statistics, we make our initial assumption when we state our two competing hypotheses – the null hypothesis (H0) and the alternative hypothesis (HA). Here, our hypotheses are: H0: Defendant is not guilty (innocent) HA: Defendant is guilty In statistics, we always assume the null hypothesis is true. That is, the null hypothesis is always our initial assumption.
Null hypothesis – H0 This is the hypothesis under test, denoted as H0. The null hypothesis is usually stated as the absence of a difference or an effect: it says there is no effect. The null hypothesis is rejected if the significance test shows the data are inconsistent with it.
Alternative hypothesis – H1 This is the alternative to the null hypothesis. It is denoted as H', H1, or HA. It is usually the complement of the null hypothesis. If, for example, the null hypothesis says two population means are equal, the alternative says the means are unequal.
Criminal trial The prosecution team then collects evidence with the hopes of finding sufficient evidence to make the assumption of innocence refutable. In statistics, the data are the evidence. The jury then makes a decision based on the available evidence: If the jury finds sufficient evidence — beyond a reasonable doubt — to make the assumption of innocence refutable, the jury rejects H0 and deems the defendant guilty. We behave as if the defendant is guilty. If there is insufficient evidence, then the jury does not reject H0. We behave as if the defendant is innocent.
Making the decision Recall that it is either likely or unlikely that we would observe the evidence we did given our initial assumption. If it is likely, we do not reject the null hypothesis. If it is unlikely, then we reject the null hypothesis in favor of the alternative hypothesis. Effectively, then, making the decision reduces to determining likely or unlikely.
Making the decision In statistics, there are two ways to determine whether the evidence is likely or unlikely given the initial assumption: We could take the critical value approach (favored in many of the older textbooks). Or, we could take the P-value approach (what is used most often in research, journal articles, and statistical software).
Making the decision Suppose we find a difference between two groups in survival: patients on a new drug have a survival of 15 months; patients on the old drug have a survival of 18 months. So, the difference is 3 months. Do we accept or reject the hypothesis of no true difference between the groups (the two drugs)? Is a difference of 3 a lot, statistically speaking – a huge difference that is rarely seen? Or is it not much – the sort of thing that happens all the time?
Making the decision A statistical test tells you how often you would get a difference of 3, simply by chance, if the null hypothesis is correct – no real difference between the two groups. Suppose the test is done and its result is that P = 0.32. This means that you would get a difference of 3 quite often just by the play of chance – 32 times in 100 – even when there is in reality no true difference between the groups.
Making the decision A statistical test tells you how often you’d get a difference of 3, simply by chance, if the null hypothesis is correct – no real difference between the two groups. On the other hand, if we did the statistical analysis and got P = 0.0001, then you’d only get a difference as big as 3 by the play of chance 1 time in 10 000. That’s so rare that we want to reject our hypothesis of no difference: there is something different about the new therapy.
Hypothesis testing Somewhere between 0.32 and 0.0001 we may not be sure whether to reject the null hypothesis or not. Mostly we reject the null hypothesis when, if the null hypothesis were true, the result we got would have happened less than 5 times in 100 by chance. This is the conventional cutoff of 5% or P < 0.05. This cutoff is commonly used but arbitrary: there is no particular reason why we use 0.05 rather than 0.06 or 0.048.
Hypothesis testing

Decision                         Null hypothesis is true    Null hypothesis is false
Reject null hypothesis           Type I error               No error
Do not reject null hypothesis    No error                   Type II error
Type I and II errors A type I error is the incorrect rejection of a true null hypothesis (also known as a false positive finding). The probability of a type I error is denoted by the Greek letter α (alpha). A type II error is incorrectly retaining a false null hypothesis (also known as a false negative finding). The probability of a type II error is denoted by the Greek letter β (beta).
Level of significance Level of significance (α) – the threshold for declaring a result significant. If the null hypothesis is true, α is the probability of rejecting the null hypothesis. α is decided as part of the research design, while the P-value is computed from the data. α = 0.05 is most commonly used. A small α value reduces the chance of a Type I error but increases the chance of a Type II error. The choice is a trade-off based on the consequences of Type I (false-positive) and Type II (false-negative) errors.
Power Power – the probability of rejecting a false null hypothesis. Statistical power is inversely related to β or the probability of making a Type II error (power is equal to 1 – β). Power depends on the sample size, variability, significance level and hypothetical effect size. You need a larger sample when you are looking for a small effect and when the standard deviation is large.
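As an illustration of these relationships, the sketch below solves for the sample size needed by a 2-sample t-test. It assumes the statsmodels library is available; the effect size, α and power values are hypothetical design choices, not values from this lecture.

```python
# Sketch: sample size needed per group for a 2-sample t-test (statsmodels).
# Effect size, alpha and power below are hypothetical design choices.
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()
n_per_group = analysis.solve_power(
    effect_size=0.5,        # assumed standardized effect size (Cohen's d)
    alpha=0.05,             # level of significance
    power=0.80,             # desired power, i.e. 1 - beta
    alternative='two-sided',
)
print(f"Required sample size per group: {n_per_group:.1f}")
```

Shrinking the assumed effect size, or increasing the standard deviation (which shrinks the standardized effect size), pushes the required sample size up, matching the statement above.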
Common mistakes P-value is different from the level of significance α. P-value is computed from data, while α is decided as part of the experimental design. P-value is not the probability of the null hypothesis being true. P-value answers the following question: If the null hypothesis is true, what is the chance that random sampling will lead to a difference as large as or larger than that observed in the study?
Common mistakes Do not focus only on whether a result is statistically significant. Look at the size of the effect and its precision as quantified by the confidence interval. A statistically significant result does not necessarily mean that the finding is scientifically or clinically important. Lack of difference may be a meaningful result too! If you repeat an experiment, expect the P-value to be different: P-values are much less reproducible than commonly assumed.
Choosing a statistical test Choice of a statistical test depends on: Level of measurement for the dependent and independent variables; Number of groups or dependent measures; Number of units of observation; Type of distribution; Population parameter of interest (mean, variance, differences between means and/or variances).
Student t-test
1-sample t-test Comparison of a sample mean with a population mean. It is known that the weight of young adult males has a mean value of 70.0 kg with a standard deviation of 4.0 kg. Thus the population mean µ = 70.0 and the population standard deviation σ = 4.0. Data from a random sample of 28 males of similar ages but with a specific enzyme defect: mean body weight of 67.0 kg and sample standard deviation of 4.2 kg. Question: Does the studied group have a significantly lower mean body weight than the general population?
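A minimal sketch of this 1-sample t-test, computed directly from the summary statistics above and assuming scipy is available:

```python
# 1-sample t-test from the summary statistics in the example above.
from math import sqrt
from scipy import stats

mu0 = 70.0                  # population mean (kg)
xbar, s, n = 67.0, 4.2, 28  # sample mean, sample SD, sample size

t = (xbar - mu0) / (s / sqrt(n))           # t statistic, n - 1 = 27 df
p_lower = stats.t.cdf(t, df=n - 1)         # one-sided: is the group mean LOWER?
p_two_sided = 2 * stats.t.sf(abs(t), df=n - 1)
print(f"t = {t:.3f}, one-sided P = {p_lower:.4f}, two-sided P = {p_two_sided:.4f}")
```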
2-sample t-test Aim: Compare two means Example: Comparing pulse rate in people taking two different drugs Assumption: Both data sets are sampled from Gaussian distributions with the same population standard deviation Effect size: Difference between two means Null hypothesis: The two population means are identical Meaning of P value: If the two population means are identical, what is the chance of observing such a difference (or a bigger one) between means by chance alone?
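A sketch of the 2-sample t-test with scipy; the pulse-rate values are hypothetical:

```python
# 2-sample (unpaired) t-test on hypothetical pulse-rate data.
from scipy import stats

drug_a = [72, 75, 68, 80, 74, 71, 77, 73]  # hypothetical pulse rates, drug A
drug_b = [78, 82, 76, 85, 80, 79, 83, 81]  # hypothetical pulse rates, drug B

# equal_var=True reflects the assumption of a common population SD
t, p = stats.ttest_ind(drug_a, drug_b, equal_var=True)
print(f"t = {t:.3f}, two-sided P = {p:.4f}")
```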
Paired t-test Aim: Compare a continuous variable before and after an intervention Example: Comparing pulse rate before and after taking a drug Assumption: The population of paired differences is Gaussian Effect size: Mean of the paired differences Null hypothesis: The population mean of paired differences is zero Meaning of P value: If there is no difference in the population, what is the chance of observing such a difference (or a bigger one) between means by chance alone?
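A sketch of the paired t-test on hypothetical before/after measurements from the same subjects:

```python
# Paired t-test on hypothetical before/after pulse rates (same subjects).
from scipy import stats

before = [74, 78, 72, 80, 76, 73, 79, 75]  # hypothetical values
after  = [70, 75, 71, 76, 73, 70, 74, 72]

t, p = stats.ttest_rel(before, after)      # H0: population mean of differences = 0
print(f"t = {t:.3f}, two-sided P = {p:.4f}")
```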
One-way ANOVA Aim: Compare three or more means Example: Comparing pulse rate in 3 groups of people, each group taking a different drug Assumption: All data sets are sampled from Gaussian distributions with the same population standard deviation Effect size: Fraction of the total variation explained by variation among group means Null hypothesis: All population means are identical Meaning of P value: If the population means are identical, what is the chance of observing such a difference (or a bigger one) between means by chance alone?
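A sketch of one-way ANOVA across three hypothetical drug groups:

```python
# One-way ANOVA on hypothetical pulse rates from three drug groups.
from scipy import stats

drug_1 = [72, 75, 68, 80, 74, 71]   # hypothetical values
drug_2 = [78, 82, 76, 85, 80, 79]
drug_3 = [70, 73, 69, 75, 72, 71]

f, p = stats.f_oneway(drug_1, drug_2, drug_3)  # H0: all population means identical
print(f"F = {f:.3f}, P = {p:.4f}")
```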
Parametric and non-parametric tests Parametric test – the variable we have measured in the sample is normally distributed in the population to which we plan to generalize our findings Non-parametric test – distribution free, no assumption about the distribution of the variable in the population
Parametric and non-parametric tests

Type of test          Non-parametric (Nominal)     Non-parametric (Ordinal)         Parametric (Ordinal, Interval, Ratio)
1 group               χ2 goodness of fit test      Wilcoxon signed rank test        1-sample t-test
2 unrelated groups    χ2 test                      Mann–Whitney U test              2-sample t-test
2 related groups      McNemar test                 Wilcoxon signed rank test        Paired t-test
K unrelated groups                                 Kruskal–Wallis H test            ANOVA
K related groups                                   Friedman matched samples test    ANOVA with repeated measurements
Normality test Normality tests are used to determine if a data set is well modeled by a normal distribution and to compute how likely it is for a random variable underlying the data set to be normally distributed. In descriptive statistics terms, a normality test measures the goodness of fit of a normal model to the data – if the fit is poor, the data are not well modeled in that respect by a normal distribution, without making a judgment on any underlying variable. In frequentist statistical hypothesis testing, data are tested against the null hypothesis that they are normally distributed.
Normality test Graphical methods An informal approach to testing normality is to compare a histogram of the sample data to a normal probability curve. The empirical distribution of the data (the histogram) should be bell-shaped and resemble the normal distribution. This might be difficult to see if the sample is small.
Normality test Graphical methods
Normality test Frequentist tests Tests of univariate normality include the following: D'Agostino's K-squared test; Jarque–Bera test; Anderson–Darling test; Cramér–von Mises criterion; Lilliefors test; Kolmogorov–Smirnov test; Shapiro–Wilk test; etc.
Normality test Kolmogorov–Smirnov test K–S test is a nonparametric test of the equality of distributions that can be used to compare a sample with a reference distribution (1-sample K–S test), or to compare two samples (2-sample K–S test). K–S statistic quantifies a distance between the empirical distribution of the sample and the cumulative distribution of the reference distribution, or between the empirical distributions of two samples. The null hypothesis is that the sample is drawn from the reference distribution (in the 1-sample case) or that the samples are drawn from the same distribution (in the 2-sample case).
Normality test Kolmogorov–Smirnov test In the special case of testing for normality of the distribution, samples are standardized and compared with a standard normal distribution. This is equivalent to setting the mean and variance of the reference distribution equal to the sample estimates, and it is known that using these to define the specific reference distribution changes the null distribution of the test statistic, making the standard K–S critical values invalid (the Lilliefors test corrects for this).
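A sketch of normality testing in Python on simulated data; it assumes scipy is available and that statsmodels exposes the Lilliefors test in statsmodels.stats.diagnostic:

```python
# Normality tests on simulated data (scipy + statsmodels assumed).
import numpy as np
from scipy import stats
from statsmodels.stats.diagnostic import lilliefors

rng = np.random.default_rng(0)
x = rng.normal(loc=100, scale=15, size=50)   # simulated, truly normal sample

w, p_shapiro = stats.shapiro(x)              # Shapiro-Wilk test
# Naive 1-sample K-S against N(mean, SD) estimated from the same sample;
# as noted above, estimating the parameters changes the null distribution,
# so this P-value is not strictly valid.
d, p_ks = stats.kstest(x, 'norm', args=(x.mean(), x.std(ddof=1)))
# Lilliefors test: the K-S variant corrected for estimated parameters.
d_l, p_lillie = lilliefors(x, dist='norm')
print(p_shapiro, p_ks, p_lillie)
```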
Mann–Whitney U test Ordinal data, independent samples. H0: The two sampled populations are equivalent in location (they have the same mean ranks or medians). The observations from both groups are combined and ranked, with the average rank assigned in the case of ties. If the populations are identical in location, the ranks should be randomly mixed between the two samples.
Mann–Whitney U test Aim: Compare the average ranks or medians of two unrelated groups. Example: Comparing pain relief score of patients undergoing two different physiotherapy programmes. Effect size: Difference between the two medians (mean ranks). Null hypothesis: The two population medians (mean ranks) are identical. Meaning of P value: If the two population medians (mean ranks) are identical, what is the chance of observing such a difference (or a bigger one) between medians (mean ranks) by chance alone?
Kruskal–Wallis H test Ordinal data, independent samples. H0: The K sampled populations are equivalent in location (they have the same mean ranks). The observations from all groups are combined and ranked, with the average rank assigned in the case of ties. If the populations are identical in location, the ranks should be randomly mixed between the K samples.
Wilcoxon signed rank test Ordinal data, two related samples. H0: The two sampled populations are equivalent in location (they have the same mean ranks). The test takes into account information about the magnitude of differences within pairs and gives more weight to pairs that show large differences than to pairs that show small differences. It is based on the ranks of the absolute values of the differences between the two variables.
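A combined sketch of the three rank-based tests above, on hypothetical ordinal scores:

```python
# Rank-based tests on hypothetical ordinal scores (scipy).
from scipy import stats

group_a = [3, 5, 4, 6, 2, 5, 4]   # hypothetical scores, group A
group_b = [6, 7, 5, 8, 6, 7, 5]   # hypothetical scores, group B
group_c = [4, 4, 3, 5, 4, 6, 3]   # hypothetical scores, group C

# Mann-Whitney U: two unrelated groups
u, p_u = stats.mannwhitneyu(group_a, group_b, alternative='two-sided')
# Kruskal-Wallis H: three (or more) unrelated groups
h, p_h = stats.kruskal(group_a, group_b, group_c)
# Wilcoxon signed rank: two related samples (paired scores)
before = [5, 6, 4, 7, 5, 6, 5]
after  = [3, 5, 4, 5, 4, 4, 3]
w, p_w = stats.wilcoxon(before, after)
print(f"Mann-Whitney P = {p_u:.3f}, Kruskal-Wallis P = {p_h:.3f}, Wilcoxon P = {p_w:.3f}")
```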
Is there an association? The chi-square χ2 test is used to check for an association between 2 categorical variables. H0: There is no association between the variables. HA: There is an association between the variables. If two categorical variables are associated, it means the chance that an individual falls into a particular category for one variable depends upon the particular category they fall into for the other variable.
Chi-square χ2 test Let’s say that we want to determine if there is an association between Place of birth and Alcohol consumption. When we test if there is an association between these two variables, we are trying to determine if coming from a particular area makes an individual more likely to consume alcohol. If that is the case, then we can say that Place of birth and Alcohol consumption are related or associated. Assumptions: A large sample of independent observations; All expected counts should be ≥ 1 (no zeros); At least 80% of expected counts should be ≥ 5.
Chi-square χ2 test The following table presents the data on place of birth and alcohol consumption. The two variables of interest, place of birth and alcohol consumption, have r = 4 and c = 2, resulting in 4 x 2 = 8 combinations of categories.

Place of birth    Alcohol    No alcohol
Big city          620        75
Rural             240        41
Small town        130        29
Suburban          190        38
Expected counts For i taking values from 1 to r (number of rows) and j taking values from 1 to c (number of columns), denote:
Ri = total count of observations in the i-th row.
Cj = total count of observations in the j-th column.
Oij = observed count for the cell in the i-th row and the j-th column.
Eij = expected count for the cell in the i-th row and the j-th column if the two variables were independent, i.e. if H0 was true.
These counts are calculated as Eij = (Ri x Cj) / n, where n is the total number of observations.
Expected counts

Place of birth    Alcohol       No alcohol    Total
Big city          O11 = 620     O12 = 75      R1 = 695
Rural             O21 = 240     O22 = 41      R2 = 281
Small town        O31 = 130     O32 = 29      R3 = 159
Suburban          O41 = 190     O42 = 38      R4 = 228
Total             C1 = 1180     C2 = 183      n = 1363

E11 = (695 x 1180) / 1363    E12 = (695 x 183) / 1363
E21 = (281 x 1180) / 1363    E22 = (281 x 183) / 1363
E31 = (159 x 1180) / 1363    E32 = (159 x 183) / 1363
E41 = (228 x 1180) / 1363    E42 = (228 x 183) / 1363
Chi-square χ2 test The test statistic measures the difference between the observed and the expected counts assuming independence: χ2 = Σ (Oij – Eij)2 / Eij, summed over all cells. If the statistic is large, it implies that the observed counts are not close to the counts we would expect to see if the two variables were independent. Thus, a 'large' χ2 gives evidence against H0 and supports HA. To get the corresponding P-value we need to use a χ2 distribution with (r-1) x (c-1) df.
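A sketch of this chi-square test in Python on the place of birth x alcohol table above, assuming scipy:

```python
# Chi-square test of association on the place of birth x alcohol table.
import numpy as np
from scipy.stats import chi2_contingency

observed = np.array([[620, 75],    # Big city
                     [240, 41],    # Rural
                     [130, 29],    # Small town
                     [190, 38]])   # Suburban

chi2, p, dof, expected = chi2_contingency(observed)
print(f"chi2 = {chi2:.3f}, df = {dof}, P = {p:.4f}")  # df = (4-1) x (2-1) = 3
print(expected)   # the Eij values derived above
```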
Beware! Association is not causation. The observed association between two variables might be due to the action of a third, unobserved variable.
Limitations No expected count should be less than 1. No more than 1/5 of the expected counts should be less than 5. To correct for this, you can collect larger samples or combine your data for the smaller expected categories until their combined value is 5 or more. Yates correction: when there is only 1 degree of freedom, the regular chi-square test should not be used. Apply the Yates correction by subtracting 0.5 from the absolute value of each calculated O – E term, then continue as usual with the new corrected values.
Special case In many cases the categorical variables of interest have two levels each. In this case, we can summarize the data using a contingency table having 2 rows and 2 columns (a 2x2 table):

         Column 1    Column 2    Total
Row 1    A           B           R1
Row 2    C           D           R2
Total    C1          C2          n

In this case, the χ2 statistic has a simplified form: χ2 = n x (A x D – B x C)2 / (R1 x R2 x C1 x C2). Under the null hypothesis, the χ2 statistic has a χ2 distribution with (2-1) x (2-1) = 1 degree of freedom.
Special case

Gender    Alcohol    No alcohol    Total
Male      540        52            592
Female    325        31            356
Total     865        83            948
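A sketch checking the simplified 2x2 formula from the previous slide against scipy on this gender x alcohol table (correction=False turns off the Yates correction so the two numbers agree):

```python
# Simplified 2x2 chi-square formula vs scipy, on the gender x alcohol table.
import numpy as np
from scipy.stats import chi2_contingency

a, b, c, d = 540, 52, 325, 31           # cells of the 2x2 table
n = a + b + c + d                       # 948
r1, r2 = a + b, c + d                   # row totals: 592, 356
c1, c2 = a + c, b + d                   # column totals: 865, 83

chi2_manual = n * (a * d - b * c) ** 2 / (r1 * r2 * c1 * c2)
chi2_scipy, p, dof, _ = chi2_contingency(np.array([[a, b], [c, d]]),
                                         correction=False)
print(chi2_manual, chi2_scipy)          # the two statistics should match
```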
Relationship between the χ2 test and the 2-sample test of proportions When do we use the χ2 test and when do we use the 2-sample test of proportions (z-test)? Situation 1: Both categorical variables of interest have exactly 2 levels. Question: Is there a relationship between the variables, or is there a difference in the proportions? Answer: Either the χ2 test or a two-sided 2-sample test of proportions will lead to the same conclusion! In this case, the χ2 statistic = (z-statistic)2, and the P-values of the two tests are equal.
Relationship between the χ2 test and the 2-sample test of proportions Situation 2: Both categorical variables of interest have exactly 2 levels. Question: Is one proportion greater/smaller than the other? Answer: This is a one-sided test and you must use the 2-sample test of proportions (z-test). Situation 3: At least one of the two categorical variables of interest has more than 2 levels. Question: Is there a relationship between the variables? Answer: You must use the χ2 test.
Example

Gender    Smokers    Non-smokers
Male      540        52
Female    325        31

Q1: Is there a difference in the proportion of males and females who smoke?
Solution: Either a chi-square test or a test of 2 proportions is fine.
Test of 2 proportions: H0: pm – pf = 0 vs Ha: pm – pf ≠ 0
Chi-square: H0: There is no relationship between Gender and Smoking. Ha: There is a relationship between Gender and Smoking.
Q2: Is the proportion of males who smoke greater than the proportion of females who smoke?
Solution: Test of 2 proportions, because the alternative is one-sided!
H0: pm – pf = 0 vs Ha: pm – pf > 0
Example

Residence     Smokers    Non-smokers
Big city      620        75
Rural         240        41
Small town    130        29
Suburban      190        38

Q: Is there a relationship between Residence and Smoking? Is there a difference in the proportion of smokers across residence categories?
Solution: Chi-square test, because Residence has more than 2 levels!
H0: There is no relationship between Residence and Smoking.
Ha: There is a relationship between Residence and Smoking.
Fisher exact test This test is only available for 2 x 2 tables. For small n, the probability can be computed exactly by counting all possible tables that can be constructed based on the marginal frequencies. Thus, the Fisher exact test computes the exact probability under the null hypothesis of obtaining the current distribution of frequencies across cells, or one that is more uneven.
Fisher exact test

Gender    Dieting    Non-dieting    Total
Male      1          11             12
Female    9          3              12
Total     10         14             24

Gender    Dieting    Non-dieting    Total
Male      a          c              a + c
Female    b          d              b + d
Total     a + b      c + d          a + b + c + d
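A sketch of the Fisher exact test on the gender x dieting table above, assuming scipy:

```python
# Fisher exact test on the 2x2 gender x dieting table above.
from scipy.stats import fisher_exact

table = [[1, 11],    # Male: dieting, non-dieting
         [9, 3]]     # Female: dieting, non-dieting

odds_ratio, p = fisher_exact(table, alternative='two-sided')
print(f"odds ratio = {odds_ratio:.3f}, P = {p:.4f}")
```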
Correlation Correlation quantifies the linear association between two variables. The direction and magnitude of the correlation is expressed by the correlation coefficient r. Its value can range between –1 and 1: r = 0 => no linear association; if r is positive, the two variables tend to increase or decrease together; if r is negative, the two variables are inversely related; If r is equal to 1 or –1, there is a perfect linear association between the two variables.
Correlation The most widely used type of correlation coefficient is the Pearson r, also called the linear or product-moment correlation. For the Pearson correlation, both X and Y values must be sampled from populations that follow a normal distribution. The Spearman rank correlation rs does not make this assumption. This non-parametric method separately ranks the X and Y values and then calculates the correlation between the two sets of ranks.
Correlation Pearson correlation quantifies the linear relationship between X and Y. As X goes up, does Y go up a consistent amount? Spearman correlation quantifies the monotonic relationship between X and Y. As X goes up, does Y go up as well (by any amount)?
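A sketch contrasting the two coefficients on hypothetical height/weight data:

```python
# Pearson vs Spearman correlation on hypothetical height/weight data.
from scipy import stats

height = [160, 165, 170, 172, 175, 178, 180, 185]   # cm, hypothetical
weight = [55, 60, 66, 64, 70, 74, 75, 82]           # kg, hypothetical

r, p_r = stats.pearsonr(height, weight)        # linear association
rho, p_rho = stats.spearmanr(height, weight)   # monotonic association (via ranks)
print(f"Pearson r = {r:.2f}, Spearman rs = {rho:.2f}")
```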
Coefficient of determination Coefficient of determination r2 is the proportion of the variance in the dependent variable that is predictable from the independent variable(s). Example: r = 0.70 (association between height and weight), so r2 = 0.49: 49% of the variance in weight is explained by / predictable from height, and the remaining 51% is not.
Mistakes Correlation and coincidence Do not attempt to interpret a correlation coefficient without looking at the corresponding scatterplot!
Regression Those who are taller tend to weigh more…
Regression Origin of the term regression – children of tall parents tend to be shorter than their parents and vice versa; children “regressed” toward the mean height of all children. Model – an equation that describes the relationship between variables through parameters (regression coefficients). Simple linear regression: Y = a + b x X (Outcome = Intercept + Slope x Predictor). Multiple regression: Y = a + b1 x X1 + b2 x X2 + … + bn x Xn.
Linear regression

X-axis         Y-axis
independent    dependent
predictor      predicted
carrier        response
input          output
Linear regression y = a + bx, where: y is the predicted (dependent) variable; a is the intercept (the estimated value of y when x = 0); b is the slope (the average change in y for each change of 1 unit in x); x is the predictor (independent variable).
Linear regression The association looks like it could be described by a straight line. There are many straight lines that could be drawn through the data. How to choose among them?
Least squares method Residual – the difference between the actual Y value and the Y value predicted by the model. Least squares method finds the values of the slope and intercept that minimize the sum of the squares of the residuals. Coefficient of determination r2 gives information about the goodness of fit of a model. In regression, r2 is a statistical measure of how well the regression line approximates the real-world data. An r2 of 1 indicates that the regression line perfectly fits the data.
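A sketch of least-squares fitting with scipy.stats.linregress, reusing the hypothetical height/weight data from the correlation example:

```python
# Least-squares simple linear regression on hypothetical height/weight data.
from scipy import stats

height = [160, 165, 170, 172, 175, 178, 180, 185]   # predictor x, cm
weight = [55, 60, 66, 64, 70, 74, 75, 82]           # outcome y, kg

fit = stats.linregress(height, weight)   # minimizes the sum of squared residuals
print(f"weight = {fit.intercept:.1f} + {fit.slope:.2f} x height")
print(f"r2 = {fit.rvalue ** 2:.3f}")     # coefficient of determination
```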