Statistics
What is the point?
- Statistics are used to help make sense of data
- Can I trust this result: is it a real effect, or could it be due to chance?
- Unfortunately, hypothesis testing and P<0.05 have come to be viewed as the holy grail of scientific research
- But why are results such as P=0.051 and P=0.049 treated so differently?
The null hypothesis
- The null hypothesis is a general statement that there is no relationship between two measured phenomena, or no difference among groups
- Example: “There is no difference in the levels of proliferation between the compound-X treated and untreated cells”
- Once the data have been obtained, statistical testing produces a P-value, which is used to decide whether the null hypothesis can be rejected
What is a P-value?
Correct interpretation:
- P<0.05 means that there is a less than 5% chance of observing a difference at least as large as the one you observed if the two population means were identical
Common misinterpretations:
- P<0.05 means that there is a greater than 95% chance that the difference observed is a real difference and a less than 5% chance that it is simply due to chance
- P<0.05 means the experiment has worked, the results are significant
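The first statement on this slide can be made concrete with an exact permutation test: if the null hypothesis were true, the group labels would be interchangeable, so the P-value is the fraction of all possible relabellings that give a difference at least as large as the one observed. The sketch below is only an illustration under assumptions; the function name `permutation_p_value` and the readings are invented here and do not come from the slides.

```python
from itertools import combinations
from statistics import mean


def permutation_p_value(group_a, group_b):
    """Two-sided exact permutation P-value for a difference in means.

    Enumerates every way of splitting the pooled data into two groups of
    the original sizes, so it is only practical for small samples.
    """
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    n_b = len(pooled) - n_a
    total = sum(pooled)
    observed = abs(mean(group_a) - mean(group_b))

    at_least_as_extreme = 0
    n_splits = 0
    for indices in combinations(range(len(pooled)), n_a):
        sum_a = sum(pooled[i] for i in indices)
        difference = abs(sum_a / n_a - (total - sum_a) / n_b)
        # Small tolerance so floating-point noise does not hide exact ties.
        if difference >= observed - 1e-12:
            at_least_as_extreme += 1
        n_splits += 1
    return at_least_as_extreme / n_splits


# Hypothetical proliferation readings (arbitrary units), invented for illustration.
untreated = [10.1, 9.8, 10.4, 10.0, 9.9]
treated = [11.2, 11.0, 10.9, 11.5, 11.1]

p = permutation_p_value(untreated, treated)
print(f"P = {p:.4f}")  # 2 of the 252 possible splits are this extreme
```

Note that the result says how surprising the data would be if the null hypothesis were true; it does not give the probability that the effect is real.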
What does a P-value let you do?
- The widely accepted P-value threshold (known as α) is 0.05
- Although many statisticians would argue 0.01 is a better threshold
- When a P-value <0.05 is obtained, the null hypothesis can be rejected
- However, if P>0.05, it does not mean the null hypothesis can be accepted, just that there is insufficient evidence to reject it
- There is always a chance of an error being made when testing the null hypothesis:
  - Type I error – chance of obtaining a false positive, or incorrectly rejecting a true null hypothesis
  - Type II error – chance of obtaining a false negative, or failing to reject a false null hypothesis
  - Familywise error – chance of making at least one type I error when performing multiple comparisons
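The two error types can be illustrated by simulation. The sketch below is not a method from these slides: to stay within the Python standard library it uses a two-sided z-test with a known standard deviation rather than a t-test, and the function name `rejection_rate` and all of its parameters are hypothetical. With no true difference, the rejection rate estimates the type I error rate (it should be close to α); with a true difference, one minus the rejection rate estimates the type II error rate.

```python
import random
from math import sqrt
from statistics import NormalDist, mean


def rejection_rate(true_difference, n_per_group=10, sigma=1.0,
                   alpha=0.05, n_experiments=2000, seed=0):
    """Fraction of simulated two-group experiments in which P < alpha.

    Assumes normally distributed data with a known, common standard
    deviation `sigma`, so a z-test can be used.
    """
    rng = random.Random(seed)
    standard_normal = NormalDist()
    standard_error = sigma * sqrt(2 / n_per_group)

    rejections = 0
    for _ in range(n_experiments):
        control = [rng.gauss(0.0, sigma) for _ in range(n_per_group)]
        treated = [rng.gauss(true_difference, sigma) for _ in range(n_per_group)]
        z = (mean(treated) - mean(control)) / standard_error
        p_value = 2 * (1 - standard_normal.cdf(abs(z)))
        if p_value < alpha:
            rejections += 1
    return rejections / n_experiments


type_i_rate = rejection_rate(true_difference=0.0)  # null hypothesis is true
power = rejection_rate(true_difference=1.0)        # null hypothesis is false
print(f"Estimated type I error rate:  {type_i_rate:.3f}")
print(f"Estimated type II error rate: {1 - power:.3f}")
```

Raising `n_per_group` leaves the type I error rate near α but lowers the type II error rate, which is why sample size matters when a result is reported as non-significant.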
Parametric vs. nonparametric tests
Misconception: the decision to use a nonparametric test over a parametric test is based solely upon the normality of the data (Gaussian distribution)
Considerations when deciding:
- Is it only approximately Gaussian?
- Will transforming the data make it Gaussian?
- Is the data set too small to detect non-Gaussian distributions?
- Or is the data set so large that normality tests are too sensitive?
- Is the data non-continuous?
Types of variable
An independent variable is experimentally manipulated in order to observe an effect on a dependent variable
Categorical variables are discrete or qualitative variables
- Nominal – variables with two or more categories that do not have an intrinsic order
- Dichotomous – nominal variables which only have two categories or levels, e.g. male or female
- Ordinal – variables with two or more categories that can be ordered or ranked
Continuous variables are quantitative
- Interval – measured along a continuum, with a numerical value, e.g. temperature measured in °C or °F
- Ratio – like an interval variable, but with 0 meaning there is none of that variable, e.g. height or weight
Test selection
Comparison of means (parametric tests compare means; non-parametric tests compare medians)
- Differences between the means of two independent groups: unpaired Student’s t-test (parametric) or Mann-Whitney U test (non-parametric)
- Differences between paired (matched) samples: paired Student’s t-test (parametric) or Wilcoxon signed rank test (non-parametric)
- Differences in the means of 3 or more independent groups for one variable: one-way ANOVA (+ multiple comparisons?) (parametric) or Kruskal-Wallis test (+ multiple comparisons?) (non-parametric)
- Differences between 3 or more groups on the same subject: repeated measures ANOVA (parametric) or Friedman test (non-parametric)
Relationships between variables
- Strength of a relationship between 2 continuous variables: Pearson’s correlation coefficient (parametric) or Spearman’s correlation coefficient (non-parametric)
- Predicting the value of one variable given the value of a predictor variable: linear regression
- Assessing the relationship between 2 categorical variables: chi-squared test
Assessing survival
- Comparing the survival of two groups: Kaplan-Meier + logrank (Mantel-Cox) test or Gehan-Breslow-Wilcoxon test
- Analysing the effect of several risk factors on survival: proportional hazards regression (Cox regression)
Student’s t-test
Spearman’s correlation
Kaplan-Meier curve and log-rank test
Problem: multiple t-tests
- One of the most frequent errors I’ve come across is authors using t-tests when there are >2 groups
- This raises the familywise error rate
- E.g. there are 4 groups, meaning 6 pairwise comparisons in total
- Familywise error rate = 1 – (1 – 0.05)^6 = 0.265
- So that’s a 26.5% chance of obtaining at least one false positive, i.e. a significant result where no real difference exists!
- Therefore, tests should be used which adjust for the familywise error rate, or a correction should be applied to the P-values
- Additionally, authors often state they performed an ANOVA but do not mention a multiple comparisons test
- The ANOVA itself only reports that there is a significant effect, but does not indicate which groups are significantly different. Therefore the multiple comparisons test used should be stated.
Note: “Student” was the pen name of William Sealy Gosset.
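The familywise error rate arithmetic on this slide, and two common P-value corrections, can be sketched as follows. This is an illustration under assumptions: the formula treats the comparisons as independent (pairwise comparisons among the same groups are not strictly independent, so it is an approximation), the slides do not say which correction to use (Bonferroni and Holm are shown only as examples), and the function names and P-values are invented here.

```python
from math import comb


def familywise_error_rate(n_comparisons, alpha=0.05):
    """Chance of at least one false positive across independent comparisons."""
    return 1 - (1 - alpha) ** n_comparisons


def bonferroni(p_values):
    """Bonferroni-adjusted P-values: multiply each by the number of tests."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


def holm(p_values):
    """Holm step-down adjusted P-values, returned in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, i in enumerate(order):
        candidate = min(1.0, (m - rank) * p_values[i])
        # Adjusted values must not decrease as the raw P-values increase.
        running_max = max(running_max, candidate)
        adjusted[i] = running_max
    return adjusted


n_groups = 4
n_pairs = comb(n_groups, 2)
print(n_pairs, round(familywise_error_rate(n_pairs), 3))  # 6 0.265

# Hypothetical raw P-values for the 6 pairwise comparisons.
raw = [0.001, 0.012, 0.03, 0.04, 0.2, 0.6]
for name, adjusted in (("Bonferroni", bonferroni(raw)), ("Holm", holm(raw))):
    print(name, [round(p, 3) for p in adjusted])
```

With these invented values, four comparisons fall below 0.05 before correction, but only one remains below 0.05 after the Bonferroni correction, and Holm is never more conservative than Bonferroni.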
Problem: univariate then multivariate analysis
Is it acceptable to first perform univariate analysis to identify significant effects, and then perform multivariate analysis only on the significant variables? No, although this strategy is commonly used in clinical studies. The results of a univariate analysis can be misleading, reporting a significant effect where none exists (or where only a weak relationship exists), so it is not an appropriate method for selecting variables for multivariate analysis. Instead, multivariate analysis should be performed on all the variables the authors have measured: they were selected as measurements for a reason, so they may be contributing to the effect being measured and should not be excluded from the multivariate analysis.
Interpreting error bars
Do overlapping error bars mean that the difference between the groups is not significant? Not necessarily. Whether error bars overlap is not a foolproof way to judge significant differences, and should only be used as a rule of thumb, as it depends on:
- whether the sample sizes are equal
- whether the error bars are showing standard deviation (SD), standard error of the mean (SEM) or 95% confidence intervals (CI)
  - no conclusion can be drawn from SD bars; overlapping SEM bars indicate P>0.05, and non-overlapping 95% CI bars indicate P<0.05
- whether multiple comparisons are being performed
  - with multiple comparisons following an ANOVA, the significance threshold is normally made more stringent to adjust for the familywise error rate, but the error bars are graphed individually for each group, so nothing can be concluded from error bars when multiple comparisons have been used
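Because the three kinds of error bar differ so much in width, it helps to see how each is computed from the same data. The sketch below is a minimal illustration with invented values and a hypothetical function name, `summarise`. It assumes a normal approximation for the confidence interval (a multiplier of about 1.96 for 95%); for small samples a t-distribution quantile would be more appropriate, but that is not available in the Python standard library.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev


def summarise(values, confidence=0.95):
    """Return the mean with SD, SEM and an approximate confidence interval."""
    n = len(values)
    m = mean(values)
    sd = stdev(values)    # sample standard deviation (n - 1 denominator)
    sem = sd / sqrt(n)    # standard error of the mean
    # Normal-approximation multiplier, about 1.96 for a 95% interval.
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    half_width = z * sem
    return {
        "n": n,
        "mean": m,
        "sd": sd,
        "sem": sem,
        "ci": (m - half_width, m + half_width),
    }


# Hypothetical measurements, invented for illustration.
measurements = [2, 4, 4, 4, 5, 5, 7, 9]
summary = summarise(measurements)
print(f"mean = {summary['mean']:.2f}, n = {summary['n']}")
print(f"SD bar:     ± {summary['sd']:.2f}")
print(f"SEM bar:    ± {summary['sem']:.2f}")
print(f"95% CI bar: {summary['ci'][0]:.2f} to {summary['ci'][1]:.2f}")
```

The SD describes the spread of the data and does not shrink with more samples, whereas the SEM and CI describe the precision of the mean and narrow as n grows, which is why a figure legend must say which one is plotted.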
Guidance to authors
For statistical analyses: when statistical analyses have been performed, the following information should be provided: the name of the statistical test used (and, where the test is only appropriate for normally distributed data, a statement on the normality of the data), the n number for each analysis, the comparisons of interest, the alpha level and the actual P-value for each test (not merely P<0.05). It should be clear which statistical test was used to generate every P-value. Error bars on graphs should be clearly labeled, and it should be stated whether the number following the ± sign is a standard deviation or a standard error. The word ‘significant’ should only be used when referring to statistically significant results, and should be accompanied by the relevant P-value. Significance indicators should be used on graphs and tables, and should be described in the figure or table legend, making clear which groups are being compared.
What to be looking out for:
Authors should be stating the following:
- The software used for analysis: name, version and supplier
- Whether standard deviation or standard error of the mean is being presented
- Which statistical tests were performed:
  - Are all the tests used stated?
  - Are the tests appropriate?
  - When parametric tests (e.g. t-test, ANOVA) have been used, have they tested for normality (bearing in mind sample size)?
  - In many papers, different tests will be needed for different data sets, so which test was used for what should be clearly stated (and clear in the results/figure legends)
- What significance threshold was used
- What the sample size was
- Whether actual P-values are reported (not merely P<0.05)