1
Making Inferences about Effects of Hypoxia/Altitude
Will Hopkins (will@clear.net.nz, sportsci.org/will), Victoria University, Melbourne, Australia
- Making inferences (decisions or conclusions): sampling variation; true effects; confidence limits; significance tests; magnitude-based inference; individual differences
- Important magnitudes of effect statistics: differences and changes in means; slopes and correlations; differences and ratios for proportions and counts
- Inferences when monitoring individual athletes
Adapted from a lecture for the IOC's post-graduate diploma in sports medicine, available at Sportscience (Analysis and Interpretation: Introduction).
2
Making Inferences (Decisions or Conclusions)
- Every sample gives a different value for a statistic, owing to sampling variation.
- So the value of a sample statistic is only an estimate of the true (right, real, actual, very-large-sample, or population) value. But people want to make an inference about the true value.
- The best inferential statistic for this purpose is the confidence interval: the range within which the true value is likely to fall.
- "Likely" is usually 95%, so there is a 95% chance the true value is included in the confidence interval (and a 5% chance it is not).
- Confidence limits are the lower and upper ends of the interval. The limits represent how small and how large the effect "could" be.
- All effects should be shown with a confidence interval or limits. Example: the altitude camp produced an average increase in peak power of 2.2% (95% confidence interval -0.6 to 5.0%).
- But confidence limits alone don't provide an inference.
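A minimal sketch of where such an interval comes from, using a t distribution. The sample size and standard error are assumptions chosen to roughly reproduce the example above (neither value is given on the slide):

```python
from scipy import stats

mean_change = 2.2   # observed mean increase in peak power (%)
se = 1.34           # standard error of the mean change (assumed)
n = 20              # sample size (assumed)

lower, upper = stats.t.interval(0.95, df=n - 1, loc=mean_change, scale=se)
print(f"{mean_change}% (95% CI {lower:.1f} to {upper:.1f}%)")  # -0.6 to 5.0%
```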
3
Statistical significance is the traditional way to make inferences. It's also known as the null-hypothesis significance test.
- The inference is all about whether the effect could be zero or "null".
- If the 95% confidence interval includes zero, the effect "could be zero": the effect is statistically non-significant (at the 5% level).
- If the confidence interval does not include zero, the effect "couldn't be zero": the effect is statistically significant (at the 5% level).
- A p value is used to decide whether an effect is significant (p<0.05).
[Figure: confidence intervals plotted against the value of the effect statistic (e.g., change in peak power). Intervals spanning the zero or null value are statistically non-significant (e.g., p=0.12); intervals entirely positive or entirely negative are statistically significant (e.g., p=0.02, p=0.003).]
- Researchers using p values should show exact values.
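The link between the p value and the confidence interval can be seen with the same assumed numbers as above: the effect is significant at the 5% level exactly when the 95% CI excludes zero. A sketch:

```python
from scipy import stats

mean_change, se, n = 2.2, 1.34, 20         # assumed example values from above
t_stat = mean_change / se                  # t statistic for the null hypothesis of zero change
p = 2 * stats.t.sf(abs(t_stat), n - 1)     # two-tailed p value
print(f"t = {t_stat:.2f}, p = {p:.2f}")    # p ≈ 0.12: non-significant, and the CI includes zero
```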
4
The exact definition of the p value is hard to understand.
- A useful interpretation: if you observe a positive effect and the p value is 0.08, the probability that the effect is actually negative is 0.04 (half the p value).
- People usually interpret significant as "a real (important) effect" and non-significant as "no real effect".
- These interpretations apply only if the study was done with the right sample size. Even then they are misleading: they don't convey the uncertainty or the magnitude. And you hardly ever know if the sample size is right. And I think the approach is too conservative.
- Attempts to address this problem with post-hoc power calculations are rare, generally wrong, and too hard to understand.
- So the only safe interpretation is whether the effect could be zero (p>0.05) or not (p<0.05).
5
But the issue for the practitioner is not whether the effect could be zero, but whether the effect could be important.
- "Important" has two meanings for clinical or practical effects: beneficial and harmful.
- One approach to this issue with significance testing is to assess the magnitude of an effect only if the effect is significant. So if it's significant, look at the value and decide whether it is beneficial, harmful or even trivial. (Yes, a statistically significant trivial effect is trivial indeed!) But if it's non-significant, you can't conclude anything.
- Yet some non-significant effects are likely to be trivial, and other non-significant effects have a good chance of being beneficial or harmful. And you often get non-significant effects, so you would like some way to deal with them.
- The confidence interval solves the problem, when important values for benefit and harm are taken into account.
6
Clinical inferences with the confidence interval
- The smallest clinically or practically important values define ranges of values of the effect that are beneficial, harmful and trivial.
- Smallest effects for benefit and harm are equal and opposite.
- Infer (decide) the outcome from the confidence interval, as follows:
[Figure: confidence intervals plotted against the value of the effect statistic (e.g., change in peak power), with the smallest clinically harmful and smallest clinically beneficial effects marking off harmful, trivial and beneficial ranges. An interval entirely in the beneficial range is clear: use it, even though p>0.05 (p values fail here). An interval spanning trivial and beneficial values is clear: maybe use it. An interval entirely in the trivial range is clear: don't use it, even though p<0.05. An interval spanning both harm and benefit is unclear: more data needed.]
7
This approach eliminates statistical significance.
- And you should eliminate all mention of hypotheses, hypothesis testing, p values and statistical significance from your manuscripts. If you show whether the effect is significant or not, readers will ignore what the confidence interval tells them about the effect.
- Yes, but reviewers and editors sometimes insist on p values. That's changing: some reviewers and editors are now insisting on interpretation of confidence intervals, regardless of the p value.
- The only issue is what level to make the confidence interval. To be careful about avoiding harm, you can make a conservative 99% confidence interval on the harm side. And to use effects only when there is a reasonable chance of benefit, you can make a 50% interval on the benefit side.
- But that's hard to understand. Consider this equivalent approach…
8
Clinical inferences with chances of benefit and harm
- The uncertainty in an effect can be expressed as the chance that the true effect is beneficial and the risk that it is harmful.
- You would decide to use an effect with a reasonable chance of benefit, provided it had a sufficiently low risk of harm. I have opted for possibly beneficial (>25% chance of benefit) and most unlikely harmful (<0.5% risk of harm).
- An effect with >25% chance of benefit and >0.5% risk of harm is therefore unclear: you'd like to use it, but you daren't. Everything else is either clearly useful or clearly not worth using.
- However, if the chance of benefit is high (e.g., 80%), you could accept a higher risk of harm (e.g., 5%). I have formalized this less conservative approach by comparing the chance of benefit and risk of harm with an odds ratio (odds of benefit to odds of harm). If the odds ratio is >66, you may declare the effect beneficial.
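A hedged sketch of the mechanics: with the effect and standard error from the earlier example, and an assumed smallest important change of 1.0%, the chances that the true effect is beneficial, trivial or harmful follow from the t distribution, and the decision rules above apply directly. All numbers are assumptions for illustration:

```python
from scipy import stats

effect, se, df = 2.2, 1.34, 19   # observed change, standard error, degrees of freedom (assumed)
smallest = 1.0                   # smallest clinically important change (assumed)

p_benefit = stats.t.sf((smallest - effect) / se, df)   # chance the true effect is beneficial
p_harm = stats.t.cdf((-smallest - effect) / se, df)    # risk the true effect is harmful
p_trivial = 1 - p_benefit - p_harm
print(f"benefit {p_benefit:.1%}, trivial {p_trivial:.1%}, harm {p_harm:.1%}")

# Conservative rule: clearly beneficial needs >25% chance of benefit and
# <0.5% risk of harm; otherwise unclear. Less conservative rule: odds of
# benefit to odds of harm >66.
odds_ratio = (p_benefit / (1 - p_benefit)) / (p_harm / (1 - p_harm))
print(f"odds ratio (benefit:harm) = {odds_ratio:.0f}")
```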
9
When an effect has no obvious benefit or harm (e.g., a comparison of males and females), the inference is only about whether the effect could be substantially positive or negative.
- For such non-clinical inferences, use a symmetrical confidence interval, usually 90%, to decide whether the effect is clear.
- You can also use the chances that the effect is substantially positive or negative. One of these has to be <5% for the effect to be clear.
- Examples of ways to report magnitude-based inferences: possibly small benefit; likely moderately harmful; a trivial difference, clear at the 99% level; a trivial-moderate increase [the lower and upper confidence limits]; a small but unclear effect (more data are needed).
- Whatever you report, show confidence limits and interpret the uncertainty in a way readers (especially coaches and athletes) can understand.
10
Magnitude-based inference (MBI) has been criticized recently in an article in Medicine and Science in Sports and Exercise.
- The authors claimed that it is wrong to interpret the confidence interval as the range of possible values of the true effect. But Bayesian statisticians claim that this interpretation is correct, and other statisticians had already shown that magnitude-based inference is a legitimate form of Bayesian inference.
- More importantly, the authors claimed that magnitude-based inference has an unacceptably high rate of Type-I errors when the sample size is small. That is, they said that when the true effect is zero, with MBI you often claim incorrectly that the true effect is substantial.
- But the authors misunderstood inference in MBI. If the true effect is zero, it is obviously not wrong to claim that the effect could be trivial or substantial. And if that worries you, note that with many statistically significant effects the true effect could likewise be trivial or substantial.
11
Alan Batterham (my co-conspirator) and I have recently submitted an article to MSSE, in which we show that the Type-I error rate in non-clinical MBI is always ≤5%, whereas with significance testing the Type-I rate is always ≥5%.
- Significance testing is all "black and white". MBI properly allows for "shades of grey", which are not wrong.
- Furthermore, MBI has higher rates of publication-worthy outcomes with small sample sizes, and the resulting effects, if published, have trivial publication bias.
- With significance testing, if you get p>0.05, it's hard to publish, and publication bias is a major problem.
- But if our article is rejected, I will have to retire!
12
A caution about making an inference…
- Whatever method you use, the inference is about the one and only mean effect in the population. The confidence interval represents the uncertainty in the true effect, not a range of individual differences or individual responses.
- With a large-enough sample size, a treatment could be clearly beneficial (a mean beneficial effect with a narrow confidence interval, or p<0.0001, say), yet the treatment could be harmful for a substantial proportion of the population.
- Individual differences between groups and individual responses to a treatment are best summarized with a standard deviation to go with the mean effect. The mean effect and this SD both need confidence limits.
- Individual differences and responses may be accounted for by including subject characteristics as modifying covariates in the analysis.
- Researchers generally neglect this important issue.
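One way to estimate that SD in a controlled trial is from the difference between the variances of the change scores in the experimental and control groups; this sketch uses assumed numbers:

```python
import math

sd_change_exp = 2.5    # SD of change scores, experimental group (assumed, %)
sd_change_ctrl = 1.5   # SD of change scores, control group (assumed, %)

# Extra variability of change scores in the experimental group reflects
# individual responses (a negative difference would suggest none):
sd_individual = math.sqrt(sd_change_exp**2 - sd_change_ctrl**2)
print(f"individual responses: SD = {sd_individual:.1f}%")   # 2.0%
```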
13
Important Magnitudes of Effect Statistics
- Researchers need the smallest important magnitude of an effect statistic to estimate sample size for a study.
- Practitioners need to know about important magnitudes to monitor their athletes or patients.
- Researchers and practitioners all need to know about important magnitudes to interpret research findings.
- So the remaining slides are all about values for various magnitudes of various effect statistics.
14
Differences or Changes in the Mean
- The most common effect statistic, for numbers with decimals (continuous variables).
- A difference when comparing different groups, e.g., patients vs healthy; a change when tracking the same subjects; a difference in the changes in controlled trials.
- The between-subject standard deviation provides default thresholds for important differences and changes. You think about the effect (the difference or change in the mean) in terms of a fraction or multiple of the SD (mean difference/SD). The effect is then said to be standardized.
- The smallest and other important standardized effects are…?
[Figures: strength shown as means & SDs for patients vs healthy groups, and across a trial at pre, post1 and post2.]
15
Example: the effect of a treatment on strength.
[Figures: pre and post strength distributions illustrating a trivial effect (0.1× SD) and a very large effect (3.0× SD).]
Interpretation of standardized difference or change in means:

  magnitude        Cohen     Hopkins
  trivial          <0.2      <0.2
  small            0.2-0.5   0.2-0.6
  moderate         0.5-0.8   0.6-1.2
  large            >0.8      1.2-2.0
  very large       -         ?2.0-4.0
  extremely large  -         ?>4.0

Complete scale: trivial <0.2, small 0.2-0.6, moderate 0.6-1.2, large 1.2-2.0, very large 2.0-4.0, extremely large >4.0.
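A sketch of standardizing and classifying an effect on the Hopkins scale; the strength scores and the choice of the healthy group's SD for standardizing are assumptions for illustration:

```python
import statistics

patients = [34, 38, 41, 45, 48]   # strength scores (assumed)
healthy = [44, 49, 52, 56, 61]    # strength scores (assumed)

sd = statistics.stdev(healthy)    # between-subject SD (reference group assumed)
d = (statistics.mean(healthy) - statistics.mean(patients)) / sd

scale = [(4.0, "extremely large"), (2.0, "very large"), (1.2, "large"),
         (0.6, "moderate"), (0.2, "small"), (0.0, "trivial")]
label = next(name for cut, name in scale if abs(d) >= cut)
print(f"standardized difference = {d:.2f} ({label})")   # ~1.7: large
```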
16
Cautions with standardizing
- Standardizing works only when the SD comes from a sample that is representative of a well-defined population. The resulting magnitude applies only to that population.
- In a controlled trial, use the baseline (pre) SD, never the SD of change scores.
- Beware of authors who show standard errors of the mean (SEM) rather than standard deviations (SD). SEM = SD/√(sample size), so SEMs on graphs make effects look a lot bigger than they really are.
- Standardization may not be best for effects on means of some special variables: visual-analog scales, Likert scales, solo athletic performance…
17
Visual-analog scales
- The respondents indicate a perception on a line. Example: "Rate your muscle pain by placing a mark on this scale", with the line anchored by "none" at one end and "unbearable" at the other.
- Score the response as percent of the length of the line.
- Magnitude thresholds: ±10%, 30%, 50%, 70%, 90% for small, moderate, large, very large and extremely large differences/changes.
Likert scales
- Example: "How has the training camp affected your performance?", with choices much worse, worse, no effect, better, much better.
- Most Likert-type questions have four to seven choices. Code them as integers (1, 2, 3, 4, 5…) and analyze as numerics.
- Magnitude thresholds are debatable. If you use the thresholds of the visual-analog scale as a guide, the thresholds for a 6-pt scale would be ~0.5, 1.5, 2.5, 3.5 and 4.5.
- Both these scales could also be assessed by standardization.
18
Solo athletic performance
- For fitness tests and performance indicators of team-sport athletes, use standardization.
- But for top solo athletes, an enhancement that results in one extra medal per 10 competitions is the smallest important effect.
- The within-athlete variability that athletes show from one competition to the next determines this effect. Here's why: because of this variability, each of the top athletes has a good chance of winning at each competition.
[Figure: simulated finishing order of top athletes in Race 1, Race 2 and Race 3, illustrating competition-to-competition variability.]
19
Your athlete needs an enhancement that overcomes this variability to give her or him a bigger chance of a medal.
- Simulations show an enhancement of 0.3 of an athlete's typical variability from competition to competition gives one extra win every 10 competitions. Example: if the variability is an SD (coefficient of variation) of 1%, the smallest important enhancement is 0.3%.
- In some early publications I have mistakenly referred to 0.5 of the variability as the smallest effect.
- Small, moderate, large, very large and extremely large effects result in an extra 1, 3, 5, 7 and 9 medals in every 10 competitions.
- The corresponding enhancements as factors of the variability: trivial <0.3, small 0.3-0.9, moderate 0.9-1.6, large 1.6-2.5, very large 2.5-4.0, extremely large >4.0.
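A sketch of the smallest important (and larger) enhancements for a solo athlete whose race-to-race variability is assumed to be a CV of 1%:

```python
cv = 1.0   # race-to-race variability of the athlete, as a CV in % (assumed)

factors = {"small (smallest important)": 0.3, "moderate": 0.9,
           "large": 1.6, "very large": 2.5, "extremely large": 4.0}
for label, factor in factors.items():
    print(f"{label}: enhancement of {factor * cv:.1f}%")   # smallest = 0.3%
```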
20
Beware: the smallest effect on athletic performance in performance tests depends on the method of measurement, because a percent change in an athlete's ability to output power results in different percent changes in performance in different tests.
- These differences are due to the power-duration relationship for performance and the power-speed relationship for different modes of exercise.
- Example: a 1% change in endurance power output produces the following changes…
  - 1% in running time-trial speed or time;
  - ~0.4% in road-cycling time-trial time;
  - 0.3% in rowing-ergometer time-trial time;
  - ~15% in time to exhaustion in a constant-power test;
  - a hard-to-interpret change in any test following a fatiguing pre-load. (But such tests can be interpreted for cycling road races: see Bonetti and Hopkins, Sportscience 14, 63-70, 2010.)
21
Slope (or Gradient)
- Used when the predictor and dependent are both numeric and a straight line fits the trend.
- The unit of the predictor is arbitrary. Example: a 2% per year decline in activity seems trivial, yet 20% per decade seems large.
- So it's best to express a slope as the difference in the dependent per two SDs of the predictor. This gives the difference in the dependent (physical activity) between subjects with a typically low and a typically high value of the predictor (age).
- The SD for standardizing the resulting effect is the standard error of the estimate (the scatter about the line).
[Figure: scatterplot of physical activity vs age with a fitted line, showing the difference in activity over 2 SD of age.]
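A sketch of evaluating a slope this way, with assumed age and activity data: fit a line, express the slope over 2 SD of the predictor, then standardize by the standard error of the estimate:

```python
import numpy as np

age = np.array([25.0, 32, 41, 47, 55, 63, 70])        # predictor (assumed)
activity = np.array([62.0, 60, 52, 49, 44, 38, 30])   # dependent (assumed)

slope, intercept = np.polyfit(age, activity, 1)
effect_per_2sd = slope * 2 * age.std(ddof=1)          # difference in activity over 2 SD of age

residuals = activity - (slope * age + intercept)
see = residuals.std(ddof=2)                           # standard error of the estimate (n - 2 df)
print(f"{effect_per_2sd:.1f} units per 2 SD of age; standardized = {effect_per_2sd / see:.1f}")
```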
22
Correlation Coefficient
- This effect statistic is closely related to the slope. It represents the overall linearity in a scatterplot.
[Figure: example scatterplots for r = 0.00, 0.10, 0.30, 0.50, 0.70, 0.90 and 1.00.]
- Negative values represent negative slopes.
- The value is unaffected by the scaling of the two variables. And it's much easier to calculate than a slope. But a properly calculated slope is easier to interpret clinically.
- Smallest important correlation is ±0.1. Here's the complete scale: trivial <0.1, low 0.1-0.3, moderate 0.3-0.5, high 0.5-0.7, very high 0.7-0.9, extremely high >0.9.
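A sketch of computing and classifying a correlation on this scale, reusing the assumed age and activity data from above:

```python
import numpy as np

age = np.array([25.0, 32, 41, 47, 55, 63, 70])        # assumed data
activity = np.array([62.0, 60, 52, 49, 44, 38, 30])

r = np.corrcoef(age, activity)[0, 1]
scale = [(0.9, "extremely high"), (0.7, "very high"), (0.5, "high"),
         (0.3, "moderate"), (0.1, "low"), (0.0, "trivial")]
label = next(name for cut, name in scale if abs(r) >= cut)
print(f"r = {r:.2f} ({label})")   # near-linear data, so r ≈ -1.0: extremely high
```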
23
Differences and Ratios of Proportions, Risks, Hazards, Odds
- Example: percent of male and female players injured in a season of touch rugby.
[Figure: bar chart of proportion injured (%) by sex: males a = 75%, females b = 36%.]
Risk difference or proportion difference
- A common measure. Example: a - b = 75% - 36% = 39%.
- Problem: the sense of magnitude depends on how big the proportions are. Example: for the same 10% difference, 90% vs 80% doesn't seem big, but 11% vs 1% can be interpreted as huge (11× the risk). So there is no scale of magnitudes for a risk or proportion difference.
- Exception: effects on winning a close match can be expressed as a proportion difference: 55% vs 45% is a 10% difference, or 1 extra match in every 10 matches; 65% vs 35% is 3 extra, and so on. Hence this scale for extra matches won or lost per 10 matches: trivial <1, small 1-3, moderate 3-5, large 5-7, very large 7-9, extremely large >9.
24
But the analyses don't work properly with proportions. We have to use hazards or odds instead of proportions, as I will explain shortly.
Risk ratio (relative risk) or proportion ratio
- Another common measure. Example: a/b = 75/36 = 2.1, which means males are "2.1 times more likely" to be injured, or there is "a 110% increase in risk" of injury for males.
- Problem: if it's a time-dependent measure, the risk ratio changes. If you wait long enough, everyone gets injured, so the risk ratio = 100/100 = 1.00.
- But it works for rare time-dependent risks and for time-independent classifications (e.g., proportion playing a sport).
- Hence we need values for the smallest and other important ratios for such risks and proportions.
25
The smallest ratio is when one event or case in every 10 is due to the effect.
- Example: one in 10 injuries is due to being male. That is, for every 10 injured males, there are 9 injured females.
- If there are N males and N females (injured and uninjured), the injury risks are 10/N and 9/N, and the risk ratio = (10/N)/(9/N) = 10/9 ≈ 1.11.
- For moderate, large, very large and extremely large ratios, for every 10 injured males there are 7, 5, 3 and 1 injured females. The corresponding risk ratios are 10/7, 10/5, 10/3 and 10/1.
- Hence this scale for proportion ratios and low-risk ratios: trivial <1.11, small 1.11-1.43, moderate 1.43-2.0, large 2.0-3.3, very large 3.3-10, extremely large >10; and the inverses for reductions in proportions: 0.9, 0.7, 0.5, 0.3, 0.1.
- But there is still the problem of analyzing proportions properly. Two solutions: hazards instead of risks; odds instead of proportions.
26
Hazard ratio or incidence-rate ratio
- This ratio is used for risks or proportions that change with time, especially for risks or proportions that get large.
- The proportion is converted to an incidence rate, which is known as a hazard.
- Example: the incidence rate for male injuries was 2.0% per week and for female injuries 0.8% per week, so the incidence-rate ratio or hazard ratio = 2.0/0.8 = 2.5.
- You can think of the hazard ratio as the right-now risk ratio: right now, males have 2.5 times greater risk of injury than females.
- The scale of magnitudes is the same as for the ratio of proportions: 1.11, 1.43, 2.0, 3.3, 10, and the inverses 0.9, 0.7, 0.5, 0.3, 0.1.
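A sketch of the conversion from proportions to hazards, using the standard formula hazard = -ln(1 - proportion)/time; the formula is a common one, not given on the slide, and the season proportions and length are assumptions chosen to reproduce the slide's rates:

```python
import math

weeks = 40                      # length of the season (assumed)
p_male, p_female = 0.55, 0.27   # season injury proportions (assumed)

h_male = -math.log(1 - p_male) / weeks      # ~2.0% per week
h_female = -math.log(1 - p_female) / weeks  # ~0.8% per week
print(f"hazard ratio = {h_male / h_female:.1f}")   # ~2.5
```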
27
Odds ratio
- Used for time-independent classifications. Odds are the awkward but only way to model classifications.
- Example: proportions of boys and girls playing a sport.
[Figure: bar chart of proportion playing (%) by sex: boys a = 75% (not playing c = 25%), girls b = 36% (not playing d = 64%).]
- Odds of a boy playing = a/c = 75/25. Odds of a girl playing = b/d = 36/64. Therefore the odds ratio = (75/25)/(36/64) = 5.3.
- Interpret the ratio as "…times more likely" only when the proportions in both groups are small (<10%). The odds ratio is then approximately equal to the proportion ratio.
- To assess magnitude, authors should convert the odds ratio and its confidence limits to the proportion ratio and its confidence limits. Unfortunately they often just leave effects as odds ratios.
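A sketch of the odds-ratio calculation and of converting it back to a proportion ratio; the conversion formula PR = OR / (1 - p_ref + p_ref × OR) is a standard one, not given on the slide:

```python
p_boys, p_girls = 0.75, 0.36   # proportions playing, from the slide's example

odds_ratio = (p_boys / (1 - p_boys)) / (p_girls / (1 - p_girls))
print(f"odds ratio = {odds_ratio:.1f}")   # 5.3

# Convert to the proportion ratio, with girls as the reference group:
pr = odds_ratio / (1 - p_girls + p_girls * odds_ratio)
print(f"proportion ratio = {pr:.1f}")     # ~2.1, i.e. 0.75/0.36
```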
28
Ratio of Counts
- Example: 93 injuries per 1000 player-hours of match play in sport A; 69 injuries per 1000 player-hours of match play in sport B.
- The effect is expressed as a count ratio: 93/69 = 1.35 times more injuries in sport A than in sport B. It can also be expressed as 35% more injuries.
- The scale of magnitudes for the count ratio is the same as for ratios of proportions or hazards: 1.11, 1.43, 2.0, 3.3, 10, and the inverses 0.9, 0.7, 0.5, 0.3, 0.1.
29
Inferences When Monitoring Individual Athletes
- Subjective assessments of the coach, athlete and support personnel provide important evidence. An experienced coach notices important changes.
- Objective assessments of change with an instrument or test are contaminated with error or "noise". The noise is represented by the standard deviation of repeated measurements: the standard (or typical) error of measurement.
- Think of ± twice the error as the equivalent of confidence limits for the athlete's true change.
- Take into account clinically or practically important changes. Example interpretations:
  - "Wow, you've made a moderate improvement!"
  - "No real change since last time." [High reliability needed here.]
  - "It looks like you have made a small improvement, but we can't be sure, because the test is quite noisy." [This is the usual scenario.]
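A sketch of this interpretation for one athlete, with an assumed typical error and smallest important change:

```python
typical_error = 1.0   # typical (standard) error of measurement, % (assumed)
smallest = 0.5        # smallest important change, % (assumed)
observed = 1.2        # athlete's observed change between tests, % (assumed)

# ± twice the typical error acts like confidence limits for the true change:
lower, upper = observed - 2 * typical_error, observed + 2 * typical_error
print(f"true change likely between {lower:.1f}% and {upper:.1f}%")

if lower > smallest:
    print("Wow, a clear improvement!")
elif upper < smallest:
    print("No real change since last time.")
else:
    print("Looks like a small improvement, but the test is too noisy to be sure.")
```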
30
Summary
- Inferential statistics are used to make conclusions about the true value of a simple or effect statistic derived from a sample.
- The inference from a null-hypothesis significance test is about whether the true value of an effect statistic could be null (zero).
- Magnitude-based inference addresses the issue of whether the true value could be important (beneficial and harmful, or substantial).
- Effect magnitudes have key roles in research and practice.
- Small, moderate, large, very large and extremely large magnitude thresholds for various effects:
  - Mean differences: 0.20, 0.60, 1.2, 2.0, 4.0 of the between-athlete SD.
  - Solo performance: 0.3, 0.9, 1.6, 2.5, 4.0 of the race-to-race CV.
  - Correlations: 0.10, 0.30, 0.50, 0.70, 0.90.
  - Team performance: 1, 3, 5, 7, 9 games won/lost in every 10 games.
  - Ratios of proportions, hazards, counts: 1.11, 1.43, 2.0, 3.3, 10, and their inverses 0.9, 0.7, 0.5, 0.3, 0.1.
- Take noise and magnitude thresholds into account when monitoring athletes.