1
Statistical Analysis and Data Interpretation: What is Important for the Athlete and Statistician
Will G Hopkins
Institute of Sport and Recreation Research, AUT University, Auckland, NZ
2
Background
Research is all about outcomes or effects. Example: the effect of a supplement on performance.
An effect = a relationship between a predictor variable (supplement) and a dependent variable (performance).
We summarize an effect with a value of an effect statistic. You have to understand the various effect statistics and know how to interpret their values in practical terms. That's the first half of this talk.
Research is also all about studying a sample. You can't study everyone! But another sample would give a different value for the effect. So there is uncertainty about the true value of the effect.
We deal with this uncertainty by making an inference about the true value in practical terms. That's the second half of this talk.
3
Effect Statistics and their Magnitudes
Types of variables and models
Effect statistics: difference between means; "slope"; correlation; risk, odds and hazard ratios
Controlling for something
Effect of errors
4
Effect statistics depend on the nature of variables. There are four kinds of variable…
Continuous: mass, distance, time, current; measures derived therefrom, such as force, concentration, voltage.
Counts: such as number of injuries in a season.
Ordinal: values are levels with a sense of rank order, such as a 4-pt Likert scale for injury severity (none, mild, moderate, severe).
Nominal: values are levels representing names, such as injured (no, yes) and type of sport (baseball, football, hockey).
Continuous, counts and ordinals can be treated as numerics, but…
As dependents, counts need special treatment.
If an ordinal has only a few levels or subjects are stacked at one end, analyze it as nominal.
Nominals with >2 levels are best dichotomized by comparing or combining levels appropriately. It is hard to define magnitude when comparing >2 levels at once.
5
Effect statistics also depend on the relationship you model between the dependent and predictor. The model is almost always linear or can be made so.
Linear model: a sum of predictors and/or their products, plus error. Seems too simple, but linear models work well.
Effect statistics for linear models, with examples of variables:

Dependent | Predictor | Effect | Linear model
numeric (Strength) | nominal (Trial) | difference in means | regression; general linear; mixed; generalized linear
numeric (Activity) | numeric (Age) | "slope" (difference per unit of predictor); correlation | regression; general linear; mixed; generalized linear
nominal (Injured N→Y) | nominal (Sex) | diffs or ratios of proportions, odds, rates, mean event time | logistic regression; generalized linear; proportional hazards
nominal (Selected N→Y) | numeric (Fitness) | "slope" (difference or ratio per unit of predictor) | logistic regression; generalized linear; proportional hazards
6
Difference or change in means (dependent: numeric, e.g. Strength; predictor: nominal, e.g. Trial).
You consider the difference or change in the mean for pairwise comparisons of levels of the predictor.
Interpret magnitudes using your clinical or practical experience. Otherwise use the standardized difference or change, also known as Cohen's effect size or Cohen's d statistic.
You express the difference or change in the mean as a fraction of the between-subject standard deviation (mean/SD).
For many measures use the log-transformed dependent variable.
The smallest important effect is ±0.2. Complete (+ive) scale (applied in the sketch below):
<0.2 trivial; 0.2-0.6 small; 0.6-1.2 moderate; 1.2-2.0 large; 2.0-4.0 very large; >4.0 awesome
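A minimal Python sketch of the standardized change and the scale above (hedged: the data and variable names are invented for illustration; for a controlled trial the appropriate SD is that of the subjects in the pre-test, per a later slide):

```python
import numpy as np

# Hypothetical pre- and post-trial strength scores for the same subjects.
pre = np.array([100.0, 104, 95, 103, 101])
post = np.array([105.0, 110, 98, 112, 107])

# Standardized change: mean change as a fraction of the between-subject SD,
# here the SD of the pre-test scores (not the SD of the change scores).
d = (post.mean() - pre.mean()) / pre.std(ddof=1)

def magnitude(d):
    """Qualitative call on |d| using the scale on this slide."""
    for cutoff, label in [(0.2, "trivial"), (0.6, "small"), (1.2, "moderate"),
                          (2.0, "large"), (4.0, "very large")]:
        if abs(d) < cutoff:
            return label
    return "awesome"

print(f"standardized change = {d:.2f} ({magnitude(d)})")
```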
7
Magnitudes of effects on athletic performance need special care.
For team-sport athletes, use the standardized difference or change in means to get the smallest important and other magnitudes.
For solo athletes, the smallest important effect is 0.3 of a top athlete's typical event-to-event variation in performance.
Beware: the smallest effect depends on how performance is measured, because a percent change in an athlete's ability to output power results in different percent changes in performance in different tests. Example: a 1% change in endurance power output produces…
1% in running time-trial speed or time;
~0.4% in road-cycling time-trial time;
~0.3% in rowing-ergometer time-trial time;
~15% in time to exhaustion in a constant-power test lasting ~10 min;
a hard-to-interpret change in any test following a pre-load.
(See the sketch of these conversions below.)
Complete (+ive) scale for solo athletic performance:
<0.3 trivial; 0.3-0.9 small; 0.9-1.6 moderate; 1.6-2.5 large; 2.5-4.0 very large; >4.0 awesome
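A minimal sketch of the conversions above (hedged: the multipliers are simply the approximate values quoted on this slide, and the test names are illustrative):

```python
# Approximate percent change in test performance per 1% change in
# endurance power output, per the examples on this slide.
CONVERSION = {
    "running time trial": 1.0,
    "road-cycling time trial": 0.4,
    "rowing-ergometer time trial": 0.3,
    "constant-power time to exhaustion (~10 min)": 15.0,
}

def performance_change(power_change_pct, test):
    """Convert a % change in power output to a % change in the given test."""
    return power_change_pct * CONVERSION[test]

for test in CONVERSION:
    print(f"1% power -> {performance_change(1.0, test):.1f}% in {test}")
```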
8
"Slope" (difference per unit of predictor) and correlation (dependent: numeric, e.g. Activity; predictor: numeric, e.g. Age; r = -0.57 in the scatterplot on this slide).
A slope is more practical than a correlation. But the unit of the predictor is arbitrary, so it's hard to define the smallest effect for a slope. Example: -2% per year may seem trivial, yet -20% per decade may seem large.
For consistency with interpretation of the correlation, it is better to express the slope as the difference per two SDs of the predictor (see the sketch below).
This fits with the smallest important effect of 0.2 SD for the dependent, but underestimates the magnitude of larger effects.
It is easier to interpret the correlation, using another Cohen scale. The smallest important correlation is ±0.1. Complete (+ive) scale:
<0.1 trivial; 0.1-0.3 small; 0.3-0.5 moderate; 0.5-0.7 large; 0.7-0.9 very large; >0.9 awesome
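A minimal numpy sketch of expressing a slope per two SDs of the predictor (hedged: the age and activity data are invented for illustration):

```python
import numpy as np

# Hypothetical age (years) and activity scores.
age = np.array([20.0, 25, 30, 35, 40, 45, 50, 55])
activity = np.array([9.0, 8.5, 8.8, 7.2, 6.9, 6.0, 6.4, 5.1])

slope, intercept = np.polyfit(age, activity, 1)  # difference per year of age
r = np.corrcoef(age, activity)[0, 1]

# Express the slope as the difference per two SDs of the predictor,
# putting it on a footing comparable with the correlation.
effect_per_2sd = slope * 2 * age.std(ddof=1)

print(f"slope = {slope:.3f}/year = {effect_per_2sd:.2f} per 2 SD of age; r = {r:.2f}")
```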
9
Differences or ratios of proportions, odds, rates, mean event time (dependent: nominal, e.g. Injured N→Y; predictor: nominal, e.g. Sex).
Subjects all start off "N", but different proportions end up "Y" (see the plot of proportion injured vs time on this slide). The proportion on "Y" at a given time = risk = a for males and b for females, so…
Risk difference = a - b. A good measure for an individual, but time dependent.
Example: a - b = 83% - 50% = 33%, so an extra chance of one in three of injury if you are a male.
Smallest effects for common events, complete (+ive) scale:
<10% trivial; 10-30% small; 30-50% moderate; 50-70% large; 70-90% very large; >90% awesome
10
Risk ratio or relative risk = a/b. A good measure for public health, but time dependent. Smallest effect for uncommon events: 1.1 (or 1/1.1), based on the smallest effect for the hazard ratio.
Odds = (proportion on "Y")/(proportion on "N") = a/c for males and b/d for females, where c and d are the proportions still uninjured.
Odds are needed for the calculations with some modeling procedures; using risks doesn't work properly.
Odds ratio = (a/c)/(b/d). Hard to interpret, but it approximates the relative risk when a<10% and b<10% (which is often the case). You can convert it exactly to a relative risk if you know a or b (see the sketch below).
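A minimal sketch of that exact conversion (hedged: the function name is mine; `baseline_risk` is the risk in the reference group, i.e. b above):

```python
def rr_from_or(odds_ratio, baseline_risk):
    """Convert an odds ratio to a relative risk exactly, given the risk in
    the reference group. Derivation: odds in the other group are
    OR * b/(1-b); converting those odds back to a risk and dividing by b
    gives RR = OR / (1 + b*(OR - 1))."""
    return odds_ratio / (1 + baseline_risk * (odds_ratio - 1))

# For rare events the odds ratio approximates the relative risk...
print(rr_from_or(2.0, 0.01))   # ~1.98, close to the OR of 2.0
# ...but for common events it overstates it.
print(rr_from_or(2.0, 0.50))   # ~1.33
```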
11
Hazard or incidence-rate ratio = e/f, where e and f are the instantaneous rates in the two groups (the slopes of the injury curves on this slide).
Hazard = instantaneous risk rate = proportion per infinitesimal of time. The hazard ratio is the best statistical measure.
Hazard ratio = risk ratio = odds ratio for low risks (short times); see the sketch below.
A hazard ratio of 2.5 means 2.5 times as likely to happen in the next moment (although the actual chance is very small!). It also means the mean time to the event differs by a factor of 2.5.
The hazard ratio does not depend on time if the incidence rates are constant. And even if both rates change, it is often OK to assume their ratio is constant (e.g., uninjured males are always twice as likely to get injured as uninjured females)…
…which is the basis of proportional-hazards modeling.
Smallest ratio for uncommon injuries/illness: 1.1 or 1/1.1, because it corresponds to a 10% increase or decrease in the workload of a hospital ward, and because 10% of an afflicted group can blame the risk factor.
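A minimal constant-hazard sketch of those claims (hedged: the hazards are invented; the exponential model is an assumption used only to illustrate the behavior described above):

```python
import numpy as np

# With constant hazards h and hr*h, risk after time t is 1 - exp(-hazard*t)
# and mean time to event is 1/hazard, so mean times differ by exactly the
# hazard ratio.
h_f, hr = 0.02, 2.5        # hypothetical female hazard (per month), hazard ratio
h_m = hr * h_f

for t in [1, 12, 60]:      # months
    risk_m = 1 - np.exp(-h_m * t)
    risk_f = 1 - np.exp(-h_f * t)
    odds_m, odds_f = risk_m / (1 - risk_m), risk_f / (1 - risk_f)
    print(f"t={t:3d}: risk ratio={risk_m / risk_f:.2f}, odds ratio={odds_m / odds_f:.2f}")
# At t=1 both ratios are close to the hazard ratio (2.5); at longer times
# (higher risks) they drift away from it.
```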
12
Difference in mean time to event = t₂ - t₁ (the horizontal separation of the injury curves on this slide).
Possibly the best measure for an individual to understand.
Can be standardized with the SD of time to event. Therefore the smallest effect = 0.2 of an SD, a small effect = 0.2-0.6, etc. Hence this hazard-ratio scale for an individual:
<1.3 trivial; 1.3-2.0 small; 2.0-4.5 moderate; 4.5-13 large; 13-180 very large; >180 awesome
These correspond reasonably well with the scale for risk difference for common injuries or other common events, where most individuals are eventually affected.
13
"Slope" (odds ratio per unit of predictor) (dependent: nominal, e.g. Selected N→Y; predictor: numeric, e.g. Fitness).
Researchers derive and interpret the slope, not a correlation. It has to be modeled as an odds ratio or hazard ratio per unit of the predictor.
Example: odds ratio for selection (or for injury or whatever) = 3.0 per unit of fitness. How to interpret an odds ratio per unit:
Imagine your fitness makes your chances of selection low, say 1%.
Each unit of fitness increases your chances of selection 3.0 times.
So an increase of one unit increases your chances to 3%. An increase of two units increases your chances to 9%, etc.
This simple approach stops working well when chances are >10%. You can work out the exact chances in such cases (but it's harder; see the sketch below).
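A minimal sketch of the exact calculation (hedged: the function name is mine; the point is that the odds ratio multiplies the odds, not the chances):

```python
def exact_chances(baseline_chance, odds_ratio_per_unit, units):
    """Exact chances after increasing the predictor by `units`, given an
    odds ratio per unit: convert chances to odds, multiply, convert back."""
    odds = baseline_chance / (1 - baseline_chance)
    odds *= odds_ratio_per_unit ** units
    return odds / (1 + odds)

# With a 1% baseline and an odds ratio of 3.0 per unit of fitness:
for units in range(5):
    print(f"+{units} units: {100 * exact_chances(0.01, 3.0, units):.1f}%")
# ~1%, 2.9%, 8.3%, 21%, 45% -- close to 1/3/9% at first, but the simple
# "multiply the chances" approximation breaks down above ~10%.
```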
14
Magnitudes of Effects When Controlling for Something
Control for = hold it equal or constant, or adjust for it.
Example: the effect of a supplement on performance adjusted for competitive level. Adjustment is especially important if, for example, athletes using the supplement have a mean competitive level different from those not using it.
You control for something by adding it to the model as a predictor. The effect of the original predictor changes.
No problem for a difference in means or a slope, but the appropriate correlation is now a partial correlation (see the sketch below).
Partial = the effect of the predictor within a virtual subgroup of subjects who all have the same values of the other predictors.
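A minimal sketch of a partial correlation for a single covariate (hedged: the data and the supplement/level scenario values are invented; the residual method shown is one standard way to compute it):

```python
import numpy as np

def partial_correlation(x, y, z):
    """Correlation between x and y after removing the linear effect of the
    covariate z from both: correlate the residuals of x-on-z and y-on-z."""
    x, y, z = (np.asarray(v, float) for v in (x, y, z))
    res_x = x - np.polyval(np.polyfit(z, x, 1), z)
    res_y = y - np.polyval(np.polyfit(z, y, 1), z)
    return np.corrcoef(res_x, res_y)[0, 1]

# Hypothetical data: supplement use, performance, competitive level.
rng = np.random.default_rng(1)
level = rng.normal(size=100)
dose = level + rng.normal(size=100)          # users tend to be higher-level
perf = 0.5 * level + 0.2 * dose + rng.normal(size=100)

print("raw r:", round(np.corrcoef(dose, perf)[0, 1], 2))
print("partial r (controlling for level):",
      round(partial_correlation(dose, perf, level), 2))
```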
15
Magnitudes of Effects When Variables Have Errors
Errors usually take the form of random noise (+ive and -ive) added to a measurement.
If you add a different random value every time you measure the same subject, it's within-subject noise and the measure has poor reliability (reproducibility).
If you add a different random value every time you measure a different subject, it's between-subject noise and the measure has poor validity.
Poor reliability implies poor validity, but not necessarily vice versa.
In general noise reduces the magnitude of an effect and makes it less precise (wider confidence interval); the simulation below illustrates the attenuation.
Exception: poor reliability in a dependent variable doesn't reduce the magnitude of differences or changes in the mean.
Take error into account, especially when assessing unclear outcomes.
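A minimal simulation of the attenuation (hedged: everything here is invented for illustration; it shows random error in a predictor shrinking an observed correlation):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000
true_x = rng.normal(size=n)
y = 0.6 * true_x + 0.8 * rng.normal(size=n)      # true r ~ 0.6

for noise_sd in [0.0, 0.5, 1.0]:
    noisy_x = true_x + noise_sd * rng.normal(size=n)  # add measurement error
    r = np.corrcoef(noisy_x, y)[0, 1]
    print(f"noise SD = {noise_sd}: observed r = {r:.2f}")
# Observed r shrinks as the error grows: ~0.60, ~0.54, ~0.42.
```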
16
Summary
An effect statistic summarizes the outcome of a study and depends on the nature of its dependent and predictor variables.
Numeric dependent, nominal predictor: difference in means.
Numeric dependent, numeric predictor: slope or correlation.
Nominal dependent, nominal predictor: differences or ratios of risk, odds, hazard, or time to event.
Nominal dependent, numeric predictor: ditto, per unit of predictor.
Some smallest important effects: difference in means, 0.20 SD; correlation, 0.10; common events: risk difference, 10%; hazard ratio, 1.3; rare costly events: risk or hazard ratio, 1.1.
Researchers control or adjust for something by including it as another predictor in the analysis.
Errors in variables make an effect less precise and usually reduce its magnitude.
17
Making Inferences about True Values
Hypothesis testing, p values, statistical significance
Confidence limits
Chances of benefit/harm
18
This slideshow is an updated version of the slideshow in: Batterham AM, Hopkins WG (2005). Making meaningful inferences about magnitudes. Sportscience 9, 6-13. See link at sportsci.org.
Other resources:
Hopkins WG (2007). A spreadsheet for deriving a confidence interval, mechanistic inference and clinical inference from a p value. Sportscience 11, 16-20. See sportsci.org.
Hopkins WG, Marshall SW, Batterham AM, Hanin J (2009). Progressive statistics for studies in sports medicine and exercise science. Medicine and Science in Sports and Exercise 41, 3-12. (Also available at sportsci.org: Sportscience 13, 55-70, 2009.)
19
Background
A major aim of research is to make an inference about an effect in a population based on study of a sample.
Null-hypothesis testing via the P value and statistical significance is the traditional but flawed approach to making an inference.
Precision of estimation via confidence limits is an improvement.
But what's missing is some way to make inferences about the clinical, practical or mechanistic significance of an effect.
I will explain how to do it via confidence limits, using values for the smallest beneficial and harmful effect. I will also explain how to do it by calculating and interpreting chances that an effect is beneficial, trivial, and harmful.
20
Hypothesis Testing, P Values and Statistical Significance
Based on the notion that we can disprove, but not prove, things. Therefore, we need a thing to disprove. Let's try the null hypothesis: the population or true effect is zero.
If the value of the observed effect is unlikely under this assumption, we reject (disprove) the null hypothesis. Unlikely is related to (but not equal to) the P value.
P < 0.05 is regarded as unlikely enough to reject the null hypothesis (that is, to conclude the effect is not zero or null). We say the effect is statistically significant at the 0.05 or 5% level. Some folks also say there is a real effect.
P > 0.05 means there is not enough evidence to reject the null. We say the effect is statistically non-significant. Some folks also accept the null and say there is no effect.
21
Problems with this philosophy…
We can disprove things only in pure mathematics, not in real life.
Failure to reject the null doesn't mean we have to accept the null.
In any case, true effects are always "real", never zero. So… the null hypothesis is always false! Therefore, to assume that effects are zero until disproved is illogical and sometimes impractical or unethical.
0.05 is arbitrary.
The P value is not a probability of anything in reality.
Some useful effects aren't statistically significant, and some statistically significant effects aren't useful.
Non-significant is usually misinterpreted as unpublishable, so good data don't get published.
Solution: clinical significance or magnitude-based inferences via confidence limits and chances of benefit and harm. (Statistical significance = null-based inferences.)
22
Clinical Significance via Confidence Limits
Start with confidence limits, which define a range within which we infer the true, population or large-sample value is likely to fall. Likely is usually a probability of 0.95 (for 95% limits).
[The figure on this slide shows the probability distribution of the true value, given the observed value, with the likely range of the true value represented as a confidence interval from the lower to the upper likely limit enclosing an area of 0.95. A sketch of the computation follows.]
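A later slide shows a spreadsheet that backs a confidence interval out of an observed value and its P value. A minimal Python sketch of that kind of calculation (hedged: it assumes a t-distributed sampling error; the function name and example numbers are illustrative):

```python
from scipy import stats

def confidence_limits(observed, p_value, dof, conf_level=0.90):
    """Back out a confidence interval from an observed effect and its
    two-tailed P value: the standard error is |observed| / t, where t is
    the t value corresponding to the P value."""
    se = abs(observed) / stats.t.isf(p_value / 2, dof)
    half_width = stats.t.isf((1 - conf_level) / 2, dof) * se
    return observed - half_width, observed + half_width

print(confidence_limits(1.5, p_value=0.03, dof=18))   # ~(0.4, 2.6)
```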
23
For clinical significance, we interpret the confidence limits in relation to the smallest clinically beneficial and harmful effects. These are usually equal and opposite in sign, and they define regions of beneficial, trivial, and harmful values of the effect statistic.
The next slide is the key to clinical or practical significance. All you need is these two things: the confidence interval and a sense of what is important (e.g., beneficial and harmful). The rest is icing on the cake.
24
Put the confidence interval and these regions together to make a decision about clinically significant, clear or decisive effects. UNDERSTAND THIS SLIDE!
[The figure shows confidence intervals in various positions relative to the harmful, trivial and beneficial regions. Each interval carries two labels: clinically decisive? ("Yes: use it", "Yes: depends", "Yes: don't use it", or "No: need more research") and statistically significant? (Yes or No). The two answers frequently disagree, which is why hypothesis testing is unethical and impractical!]
25
Making a crude call on magnitude: declare the observed magnitude of clinically clear effects (see the sketch below).
[The figure shows confidence intervals relative to the harmful, trivial and beneficial regions, with the resulting calls: beneficial, trivial, harmful, or unclear when the interval spans both beneficial and harmful values.]
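A minimal sketch of the crude call (hedged: the function name and defaults are mine, and the logic is my reading of the figure, assuming positive values are beneficial):

```python
def crude_call(observed, lower, upper, smallest_important=0.2):
    """Crude call on magnitude from an effect and its confidence limits:
    unclear if the interval spans both beneficial and harmful values;
    otherwise declare the observed magnitude of the (clear) effect."""
    beneficial, harmful = smallest_important, -smallest_important
    if lower < harmful and upper > beneficial:
        return "unclear: need more research"
    if observed >= beneficial:
        return "beneficial"
    if observed <= harmful:
        return "harmful"
    return "trivial"

print(crude_call(1.5, 0.4, 2.6, smallest_important=1.0))    # beneficial
print(crude_call(2.4, -0.7, 5.5, smallest_important=1.0))   # beneficial
print(crude_call(0.0, -1.5, 1.5, smallest_important=1.0))   # unclear: need more research
```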
26
Clinical Significance via Clinical Chances
We calculate probabilities that the true effect could be clinically beneficial, trivial, or harmful (P(beneficial), P(trivial), P(harmful)).
These Ps are NOT the proportions of positive, non- and negative responders in the population.
Calculating the Ps is easy: put the observed value, the smallest beneficial/harmful value, and the P value into a spreadsheet at newstats.org. (A sketch of the underlying calculation follows.)
[The figure shows the probability distribution of the true value around the observed value, divided at the smallest beneficial and harmful values into P(harmful) = 0.05, P(trivial) = 0.15 and P(beneficial) = 0.80.]
The Ps allow a more detailed call on magnitude, as follows…
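A minimal sketch of the kind of calculation such a spreadsheet performs (hedged: this is an illustration, not the published spreadsheet; it assumes the sampling distribution of the effect is a t distribution backed out from the two-tailed P value, as on the confidence-limits slide):

```python
from scipy import stats

def clinical_chances(observed, p_value, dof, smallest_important):
    """Chances that the true effect is harmful, trivial or beneficial,
    given the observed value, its two-tailed P value and the smallest
    important (beneficial/harmful) value. Assumes positive = beneficial."""
    se = abs(observed) / stats.t.isf(p_value / 2, dof)
    p_beneficial = stats.t.sf((smallest_important - observed) / se, dof)
    p_harmful = stats.t.cdf((-smallest_important - observed) / se, dof)
    return p_harmful, 1 - p_beneficial - p_harmful, p_beneficial

# Chances (%) that the effect is harmful/trivial/beneficial:
print([round(100 * p) for p in clinical_chances(1.5, 0.03, 18, 1.0)])
# -> [0, 22, 78]
```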
27
Making a more detailed call on magnitudes using chances of benefit and harm.
Chances (%) that the effect is harmful/trivial/beneficial, and the corresponding call:
0.01/0.9/99: almost certainly beneficial
0.1/7/93: probably beneficial
2/33/65: possibly beneficial
1/59/40: possibly trivial
28/70/2: possibly trivial
2/94/4: probably trivial
0.2/97/3: very likely trivial
74/26/0.2: possibly harmful
97/3/0.01: very likely harmful
9/60/31: unclear (could be beneficial and harmful)
28
Use this table for the plain-language version of chances:

The effect… (beneficial/trivial/harmful) | Probability | Chances | Odds
is almost certainly not… | <0.005 | <0.5% | <1:199
is very unlikely to be… | 0.005-0.05 | 0.5-5% | 1:199-1:19
is unlikely to be…, is probably not… | 0.05-0.25 | 5-25% | 1:19-1:3
is possibly (not)…, may (not) be… | 0.25-0.75 | 25-75% | 1:3-3:1
is likely to be…, is probably… | 0.75-0.95 | 75-95% | 3:1-19:1
is very likely to be… | 0.95-0.995 | 95-99.5% | 19:1-199:1
is almost certainly… | >0.995 | >99.5% | >199:1

An effect should be almost certainly not harmful (chances of harm <0.5%) and at least possibly beneficial (chances of benefit >25%) before you decide to use it. But you can tolerate higher chances of harm if the chances of benefit are much higher: e.g., 3% harm and 76% benefit = clearly useful. I use an odds ratio of benefit/harm of >66 in such situations (see the sketch below).
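A minimal sketch of the table and the decision rule (hedged: the function names and the exact form of the rule are my reading of this slide):

```python
TERMS = [(0.005, "almost certainly not"), (0.05, "very unlikely to be"),
         (0.25, "unlikely to be"), (0.75, "possibly"),
         (0.95, "likely to be"), (0.995, "very likely to be")]

def plain_language(p):
    """Plain-language term for a probability, per the table above."""
    for cutoff, term in TERMS:
        if p < cutoff:
            return term
    return "almost certainly"

def use_it(p_harmful, p_beneficial):
    """Use the effect if harm is almost certainly absent (<0.5%) and benefit
    at least possible (>25%), or if the odds ratio of benefit/harm > 66."""
    if p_harmful < 0.005 and p_beneficial > 0.25:
        return True
    odds_benefit = p_beneficial / (1 - p_beneficial)
    odds_harm = p_harmful / (1 - p_harmful)
    return odds_harm > 0 and odds_benefit / odds_harm > 66

print(plain_language(0.93))   # likely to be
print(use_it(0.03, 0.76))     # True: 3% harm, 76% benefit = clearly useful
```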
29
Two examples of use of the spreadsheet for clinical chances:

Example 1: P value 0.03; value of statistic 1.5; conf. level 90%; deg. of freedom 18; threshold values for clinical chances ±1. Confidence limits: 0.4 to 2.6. Chances (% or odds) that the true value of the statistic is: clinically positive, 78% (3:1), likely, probable; clinically trivial, 22% (1:3), unlikely, probably not; clinically negative, 0% (1:2071), almost certainly not.

Example 2: P value 0.20; value of statistic 2.4; conf. level 90%; deg. of freedom 18; threshold values ±1. Confidence limits: -0.7 to 5.5. Chances that the true value is: clinically positive, 78% (3:1), likely, probable; clinically trivial, 19% (1:4), unlikely, probably not; clinically negative, 3% (1:30), very unlikely.

Both these effects are clinically decisive, clear, or significant.
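For what it's worth, the `clinical_chances()` sketch from the earlier slide reproduces these numbers closely (small differences, e.g. 4% vs 3% negative in the second example, presumably reflect the sketch's simplifying assumptions):

```python
# Threshold for clinical chances = ±1, 18 degrees of freedom, as above.
for observed, p in [(1.5, 0.03), (2.4, 0.20)]:
    h, t, b = clinical_chances(observed, p, dof=18, smallest_important=1.0)
    print(f"value {observed}: negative {100*h:.0f}%, trivial {100*t:.0f}%, "
          f"positive {100*b:.0f}%")
# -> value 1.5: negative 0%, trivial 22%, positive 78%
# -> value 2.4: negative 4%, trivial 19%, positive 78%
```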
30
How to Publish Clinical Chances
Example of a table from a randomized controlled trial:

TABLE 1. Differences in improvements in kayaking sprint speed between slow, explosive and control training groups.

Compared groups | Mean improvement (%) and 90% confidence limits | Qualitative outcome(a)
Slow - control | 3.1; ±1.6 | Almost certainly beneficial
Explosive - control | 2.6; ±1.2 | Very likely beneficial
Slow - explosive | 0.5; ±1.4 | Unclear

(a) with reference to a smallest worthwhile change of 0.5%.
31
Problem: what's the smallest clinically important effect?
If you can't answer this question, quit the field.
This problem also applies to hypothesis testing, because the smallest important effect determines the sample size you need to test the null properly.
Example: in many solo sports, a ~0.5% change in power output substantially changes a top athlete's chances of winning.
The default for most other populations and effects is Cohen's set of smallest values. These values apply to clinical, practical and/or mechanistic importance…
Standardized changes or differences in the mean: 0.20 of the between-subject standard deviation. In a controlled trial, it's the SD of all subjects in the pre-test, not the SD of the change scores.
Correlations: 0.10.
Injury or health risk, odds or hazard ratios: 1.1-1.3.
32
Problem: hardly anyone is using these new approaches yet.
So you have to use a spreadsheet to convert a published P value into a more meaningful magnitude-based inference.
If the authors state "P<0.05" you can't do it properly. If they state "P>0.05" or "NS", you can't do it at all.
And it's hard to publish your own research using this approach.
Problem: these approaches, and hypothesis testing, deal with uncertainty about an effect in a population.
Which is OK for effects like correlations or mean differences. But effects like risk of injury or changes in physiology or performance can apply to individuals.
Alas, more information and analyses are needed to make inferences about effects on individuals. Researchers almost always ignore this issue, because they don't know how to deal with it, and/or they don't have enough data to deal with it properly.
33
Summary
Show the observed magnitude of the effect.
Attend to precision of estimation by showing 90% confidence limits of the true value.
Do NOT show P values, do NOT test a hypothesis and do NOT mention statistical significance.
Attend to clinical, practical or mechanistic significance by…
stating, with justification, the smallest worthwhile effect, then…
interpreting the confidence limits in relation to this effect, or…
estimating probabilities that the true effect is beneficial, trivial, and/or harmful (or substantially positive, trivial, and/or negative).
Make a qualitative statement about the clinical or practical significance of the effect, using unlikely, very likely, and so on.
Remember, it applies to populations, not individuals.