Sample size and study design Brian Healy, PhD
Comments from last time We did not cover confounding Too much in one class/Not enough examples/Superficial level I wanted to show one example for each type of analysis so that you can determine what your data matches. This way you can speak to a statistician knowing the basic ideas. My hope was for you to feel confident enough to learn more about the topics relevant to you Worked example lectures This is not basic biostatistics I did Teach for America
Objectives Type II error How to improve power? Sample size calculation Study design considerations
Review Previous classes we have focused on data analysis AFTER data collection Hypothesis testing allowed us to determine whether there was a statistically significant: Difference between groups Association between two continuous factors Association between two dichotomous factors
Example We know that the mean heart rate for healthy adults is 80 beats per minute and that heart rate has an approximately normal distribution (according to my wife) Some elite athletes, like Lance Armstrong, have a lower heart rate, but it is not known if this is true on average How could we address this question?
Experimental design One way to do this is to collect a sample of normal controls and a sample of elite athletes and compare their means What test would you use? Another way is to collect a sample of elite athletes and compare their mean to the known population mean This is a one sample test Null hypothesis: mean_elite = 80
Question How large a sample of elite athletes should I collect? What is the benefit of having a large sample size? More information More accurate estimate of the population mean What is the disadvantage of a large sample size? Cost Effort required to collect What is the “correct” sample size?
Effect of sample size Let’s say we wanted to estimate the blood pressure of people at MGH If we sampled 3 people, would we have a good estimate of the population mean? How much will the sample mean vary from sample to sample? Does our estimate of the population mean improve if we sample 30 people? Would the sample mean vary more or less from sample to sample? What about 300 people?
Simulation http://onlinestatbook.com/stat_sim/sampling_dist/index.html What is the shape of the distribution of sample means? Where is the curve centered? What happens to the curve as the sample size increases? Technical: Central limit theorem
Standard error of the mean There are two measures of spread in the data Standard deviation: measure of spread of the individual observations The estimate of this is the standard deviation of the observations: $s = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2}$ Standard error: standard deviation of the sample mean The estimate of this is the standard deviation of the observations divided by the square root of the sample size: $SE = s/\sqrt{n}$
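The relationship between sample size and the spread of the sample mean can be checked with a small simulation. The sketch below is only an illustration: the population mean of 80 and standard deviation of 20 are assumed values, and the function name and number of repetitions are hypothetical choices.

```python
import random
from math import sqrt
from statistics import stdev


def sd_of_sample_means(n, mu=80.0, sd=20.0, reps=1000, seed=0):
    """Empirical standard deviation of the sample mean over repeated samples.

    mu and sd are assumed population values used only for illustration.
    """
    rng = random.Random(seed)
    sample_means = []
    for _ in range(reps):
        sample = [rng.gauss(mu, sd) for _ in range(n)]
        sample_means.append(sum(sample) / n)
    return stdev(sample_means)


for n in (3, 30, 300):
    simulated = sd_of_sample_means(n)
    theoretical = 20.0 / sqrt(n)  # sd / sqrt(n) with the assumed sd of 20
    print(f"n={n:3d}  simulated SE={simulated:5.2f}  sd/sqrt(n)={theoretical:5.2f}")
```

The simulated spread of the sample means should track sd/sqrt(n) closely, shrinking as the sample grows from 3 to 30 to 300.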
Technical: Distribution of sample mean under the null If we took repeated samples and calculated the sample mean each time, the sample means would follow an approximately normal distribution Spread in distribution is based on standard error Mean of distribution=80
Type I error We could plot the distribution of the sample means under the null before collecting data Type I error is the probability that you reject the null given that the null is true α = P(reject H0 | H0 is true) Notice that the shaded area (α) is still part of the null curve, but it is in the tail of the distribution
Hypothesis test-review After data collection, we can calculate the p-value If the p-value is less than the pre-specified α-level, we reject the null hypothesis
As the sample size increases, the standard error decreases p-value is based on the standard error As your sample size increases, the p-value decreases if the mean and standard deviation do not change With an extremely large sample, a very small departure from the null is statistically significant What would you think if you found the sample mean heart rate of three elite athletes was 70 beats per minute? Do your thoughts change if you sampled 300 athletes and found the same sample mean?
How much data should we collect? Depends on several factors: Type I error Type II error (power) Difference we are trying to detect (null and alternative hypotheses) Standard deviation Remember this is decided BEFORE the study!!!
Type II error Definition: failing to reject the null hypothesis when the alternative is in fact true This type of error is based on a specific alternative β = P(fail to reject H0 | HA is true)
Power Definition: the probability that you reject the null hypothesis given that the alternative hypothesis is true. This is what we want to happen. Power = P(reject H0 | HA is true) = 1 - β Since this is a good thing, we want this to be high
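Figure: two curves for the sample mean, separated by a cut-off value into a "Fail to reject H0" region and a "Reject H0" region One curve is the distribution of the sample mean under the null hypothesis The location of this curve is μ0 and the spread in the curve is the standard error The other curve is the distribution of the sample mean under the alternative hypothesis, located at μ1
Figure: the same two curves with shaded areas α = P(reject H0 | H0 is true) is the area of the curve at μ0 in the rejection region Power = P(reject H0 | HA is true) is the area of the curve at μ1 in the rejection region β = P(fail to reject H0 | HA is true) is the remaining area of the curve at μ1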
Life is a trade off These two errors are related We usually assume that the type I error is 0.05 and calculate the type II error for a specific alternative If you want to be stricter and falsely reject the null only 1% of the time (α=0.01), the chance of a type II error increases This is analogous to the trade-off between sensitivity and specificity, or false positives and false negatives
Changing the power Note how the power (green) increases as you increase the difference between the null and alternative hypotheses How else do you think we could increase the power?
Another way to increase power is to increase type I error rate Two other ways to increase power involve changing the shape of the distribution Increasing the sample size When the sample size increases, the curve for the sample means tightens Decreasing the variability in the population When there is less variability, the curve for the sample means also tightens
Example For our study, we know that we can enroll 40 elite athletes. We also know that the population mean is 80 beats per minute and the standard deviation is 20 We believe the elite athletes will have a mean of 70 beats per minute How much power would we have to detect this difference at the two-sided 0.05 level? All this information fully defines our curves
Using STATA, we find that we have 88.5% power to detect the difference of 10 beats per minute between the groups at the two-sided 0.05 level using a one sample z-test Question: If we were able to enroll more subjects would our power increase or decrease?
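The same power figure can be reproduced outside of STATA. The sketch below is one way to do it in Python for a two-sided one sample z-test, using only the standard library; the function name is hypothetical and this is not the STATA routine itself.

```python
from math import sqrt
from statistics import NormalDist


def z_test_power(mu0, mu1, sd, n, alpha=0.05):
    """Power of a two-sided one sample z-test against the alternative mean mu1."""
    std_normal = NormalDist()
    standard_error = sd / sqrt(n)
    z_crit = std_normal.inv_cdf(1 - alpha / 2)   # cut-off on the z scale
    shift = (mu1 - mu0) / standard_error         # mean of the z statistic under HA
    upper_tail = 1 - std_normal.cdf(z_crit - shift)
    lower_tail = std_normal.cdf(-z_crit - shift)
    return upper_tail + lower_tail


power = z_test_power(mu0=80, mu1=70, sd=20, n=40)
print(f"Power with n=40: {power:.3f}")
```

With the values from the example (null mean 80, alternative mean 70, standard deviation 20, n=40), this gives a power of about 0.885, in line with the figure quoted above.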
Conclusions For a specific sample size, standard deviation, difference between the means and type I error, we can calculate the power Changing any of the four parameters above will change the power Some under the control of the investigator, but others are not
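A quick way to see how each of the four parameters moves the power is to change one at a time from the heart rate example. In the sketch below the baseline values come from the example, while the alternative values (n=80, sd=10, difference of 15, alpha of 0.01) are hypothetical and chosen only for illustration.

```python
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()


def power(n=40, sd=20.0, difference=10.0, alpha=0.05):
    """Power of a two-sided one sample z-test; defaults match the example."""
    z_crit = STD_NORMAL.inv_cdf(1 - alpha / 2)
    shift = abs(difference) / (sd / sqrt(n))
    return STD_NORMAL.cdf(shift - z_crit) + STD_NORMAL.cdf(-shift - z_crit)


# Each scenario changes exactly one parameter (hypothetical values).
scenarios = {
    "baseline": {},
    "larger sample (n=80)": {"n": 80},
    "less variability (sd=10)": {"sd": 10.0},
    "larger difference (15)": {"difference": 15.0},
    "stricter alpha (0.01)": {"alpha": 0.01},
}

for label, changes in scenarios.items():
    print(f"{label:26s} power={power(**changes):.3f}")
```

The first three changes raise the power, while the stricter alpha lowers it, which is the type I versus type II trade-off in numbers.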
Sample size Up to now we have shown how to find the power given a specific sample size, difference between the means, standard deviation and alpha level. We can specify any four of these five factors and solve for the fifth. Usually the alpha level is required to be two-sided 0.05 How can we calculate the sample size for specific values of the remaining parameters?
Two approaches to sample size Hypothesis testing When you have a specific null AND alternative hypothesis in mind Confidence interval When you want to place an interval around an estimate
Hypothesis testing approach State null and alternative hypothesis Null usually pretty easy Alternative is more difficult, but very important State standard deviation of outcome State desired power and alpha level Power=0.8 Alpha=0.05 for two-sided test State test Use statistical package to calculate sample size
We know the location of the null and alternative curves, but we do not know the shape because the sample size determines the shape. We need to find the sample size that will give the curves the shape so that the α level and power equal the specified values. Figure labels: Alpha=0.025 in each tail (two-sided 0.05), Power=0.8, Beta=0.2
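General form of sample size calculation Here is the general form of the normal sample size
One-sided: $n = \left(\frac{(z_{1-\alpha} + z_{1-\beta})\,\sigma}{\mu_1 - \mu_0}\right)^2$
Two-sided: $n = \left(\frac{(z_{1-\alpha/2} + z_{1-\beta})\,\sigma}{\mu_1 - \mu_0}\right)^2$
$n$: sample size $\sigma$: standard deviation $z_{1-\alpha}$ or $z_{1-\alpha/2}$: related to Type I error $z_{1-\beta}$: related to Type II error $\mu_0$, $\mu_1$: mean under null and alternative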
Hypothesis testing approach State null and alternative hypothesis H0: μ0=80 HA: μ1=70 sd=20 State desired power and alpha level Power=0.8 Alpha=0.05 for two-sided test State test: z-test n = ((1.96 + 0.84) × 20 / (80 - 70))² = 31.36, which rounds up to n=32
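The calculation on this slide can be sketched in a few lines of Python. This is an illustration of the normal (z-test) sample size formula for a two-sided test, not the output of any particular statistical package, and the function name is hypothetical.

```python
from math import ceil
from statistics import NormalDist


def z_test_sample_size(mu0, mu1, sd, power=0.8, alpha=0.05):
    """Sample size for a two-sided one sample z-test (normal formula)."""
    std_normal = NormalDist()
    z_alpha = std_normal.inv_cdf(1 - alpha / 2)  # related to type I error
    z_beta = std_normal.inv_cdf(power)           # related to type II error
    n_exact = ((z_alpha + z_beta) * sd / (mu1 - mu0)) ** 2
    return n_exact, ceil(n_exact)                # always round up


n_exact, n_required = z_test_sample_size(mu0=80, mu1=70, sd=20)
print(f"exact n = {n_exact:.2f}, required n = {n_required}")
```

With unrounded z values the exact result is about 31.40 rather than 31.36 (which uses 1.96 and 0.84), and both round up to the same required sample size of 32.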
Example-more complex In a recently submitted grant, we investigated the sample size required to detect a difference between RRMS and SPMS patients in terms of levels of a marker Preliminary data: RRMS: mean level=0.54 +/- 0.37 SPMS: mean level=0.94 +/- 0.42
Hypothesis testing approach State null and alternative hypothesis H0: meanRRMS=meanSPMS=0.54 HA: meanRRMS=0.54, meanSPMS=0.94, Difference between groups=0.4 sdRRMS=0.37, sdSPMS=0.42 State desired power and alpha level Power=0.8 Alpha=0.05 for two-sided test State test: t-test
Results Use these values in statistical package 17 samples from each group are required Website: http://hedwig.mgh.harvard.edu/sample_size/size.html
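For a rough cross-check of the package result, the normal approximation for comparing two means with unequal standard deviations can be sketched as below. This is an assumption-laden shortcut, not the t-test calculation used for the grant, and the function name is hypothetical.

```python
from math import ceil
from statistics import NormalDist


def two_sample_n_per_group(mean1, mean2, sd1, sd2, power=0.8, alpha=0.05):
    """Per-group sample size for a two-sided two-sample comparison of means.

    Uses the normal approximation with equal group sizes, so it tends to be
    slightly smaller than a t-test based calculation.
    """
    std_normal = NormalDist()
    z_alpha = std_normal.inv_cdf(1 - alpha / 2)
    z_beta = std_normal.inv_cdf(power)
    variance_sum = sd1 ** 2 + sd2 ** 2
    n_exact = (z_alpha + z_beta) ** 2 * variance_sum / (mean2 - mean1) ** 2
    return ceil(n_exact)


n_per_group = two_sample_n_per_group(mean1=0.54, mean2=0.94, sd1=0.37, sd2=0.42)
print(f"Normal approximation: {n_per_group} per group")
```

With the RRMS and SPMS preliminary data this approximation gives 16 per group, one fewer than the 17 per group from the t-test calculation, which accounts for estimating the standard deviations from the data.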
Statistical considerations for grant “Group sample sizes of 17 and 17 achieve at least 80% power to detect a difference of -0.400 between the null hypothesis that both group means are 0.540 and the alternative hypothesis that the mean of group 2 is 0.940 with estimated group standard deviations of 0.370 and 0.420 and with a significance level (alpha) of 0.05 using a two-sided two-sample t-test.”
Technical remarks So we have shown that we can calculate the power for a given sample size and sample size for a given power. We can also change the clinically meaningful difference if we set the sample size and power. In many grant applications, we show the power for a variety of sample sizes and differences in the means in a table so that the grant reviewer can see that there is sufficient power to detect a range of differences with the proposed sample size.
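A power table of the kind described here can be generated with a short script. The sketch below uses the one sample z-test setting from the heart rate example (standard deviation 20, two-sided alpha 0.05); the grid of sample sizes and differences is hypothetical and only meant to show the layout.

```python
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()


def power_table(sample_sizes, differences, sd=20.0, alpha=0.05):
    """Power of a two-sided one sample z-test for each (n, difference) pair."""
    z_crit = STD_NORMAL.inv_cdf(1 - alpha / 2)
    table = {}
    for n in sample_sizes:
        for diff in differences:
            shift = abs(diff) / (sd / sqrt(n))
            table[(n, diff)] = (STD_NORMAL.cdf(shift - z_crit)
                                + STD_NORMAL.cdf(-shift - z_crit))
    return table


sample_sizes = (20, 30, 40, 50)   # hypothetical grid
differences = (5, 10, 15)         # hypothetical differences in beats per minute
table = power_table(sample_sizes, differences)

print("n    " + "".join(f"diff={d:<6}" for d in differences))
for n in sample_sizes:
    print(f"{n:<5}" + "".join(f"{table[(n, d)]:<11.3f}" for d in differences))
```

Reading across a row shows how power grows with the difference, and reading down a column shows how it grows with the sample size.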
Confidence interval approach If we do not have a set alternative, we can choose the sample size based on how close to the truth we want to get In particular we choose the sample size so that the confidence interval is of a certain width
Under a normal distribution, the confidence interval for a single sample mean is $\bar{x} \pm z_{1-\alpha/2}\,\frac{\sigma}{\sqrt{n}}$ We can choose the sample size to provide the specified width of the confidence interval
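Solving the interval width for n gives a direct sample size rule. The sketch below assumes a known standard deviation and a normal-based interval; the example values (standard deviation 20, total width 10 beats per minute) and the function name are hypothetical.

```python
from math import ceil, sqrt
from statistics import NormalDist


def ci_sample_size(sd, total_width, confidence=0.95):
    """Smallest n for which the normal confidence interval is at most total_width wide."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    # Total width = 2 * z * sd / sqrt(n), solved for n and rounded up.
    return ceil((2 * z * sd / total_width) ** 2)


def ci_width(sd, n, confidence=0.95):
    """Total width of the normal confidence interval for a sample of size n."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return 2 * z * sd / sqrt(n)


n = ci_sample_size(sd=20, total_width=10)
print(f"n = {n}, achieved width = {ci_width(20, n):.2f}")
```

For these illustrative inputs the rule gives n = 62, the smallest sample for which the 95% interval is no wider than 10 beats per minute.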
Conclusions Sample size can be calculated if the power, alpha level, difference between the groups and standard deviation are specified For more complex settings than those presented here, statisticians have worked out the sample size calculations, but still need estimates of the hypothesized difference and variability in the data
Study design
Reasons for differences between groups Actual effect-when there is a difference between the two groups (ex. the treatment has an effect) Chance Bias Confounding
Chance When we run a study, we can only take a sample of the population. Our conclusions are based on the sample we have drawn. Just by chance, sometimes we can draw an extreme sample from the population. If we had taken a different sample, we may have drawn different conclusions. We call this sampling variability.
Note on variability Even though your experiments are well controlled, not all subjects will behave exactly the same This is true for almost all experiments If all animals acted EXACTLY the same, we would only need one animal Since one is not enough, we observe a group of mice We call this our sample Based on our sample, we draw a conclusion regarding the entire population
Study design considerations Null hypothesis Outcome variable Explanatory variable Sources of variability Experimental unit Potential correlation Analysis plan Sample size
Example We start with a single group (ex. Genetically identical mice) The group is split into 3 groups that are treated with 3 different interventions An outcome is measured in each individual Questions: What analysis should we do? What is the effect of starting from the same population? Do we need to account for repeated measures?
Original group Condition 1 Condition 3 Condition 2
Generalizability Assume that we have found a difference between our exposure and control group and we have shown that this result is not likely due to chance, bias or confounding. What does this mean for the general population? Specifically, to which group can we apply our results? This is often based on how the sample was originally collected.
Example 2 We want to compare the expression of a marker in patients vs. controls Full sample size is 288 samples Can only run 24 samples (1 plate) per day Questions: What types of analysis should we do? Can we combine across the plates? Could other confounders be important to collect?
Plate 1: 10 patients, 14 controls Estimate of difference in this plate Plate 2: 14 patients, 10 controls Estimate of difference in this plate Plate 3: 12 patients, 12 controls Estimate of difference in this plate We can test if there is a different effect in each plate by investigating the interaction
Example 3 We want to compare the expression of 6 markers We measure the six markers in 5 mice Questions: What types of analysis should we do? How many independent groups do we have? What is the null hypothesis?
Example 4 “In our experiments, we collect 3 measurements. If it is significant, we call it a day. If it is close to significant, we measure 1 more animal” Question: Is this valid? Always more statistically valid if the number is specified BEFORE the experiment
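One way to see why this practice is a problem is to simulate it when the null hypothesis is true. The sketch below is a simplified illustration: it uses a z-test with a known standard deviation of 1 instead of the t-test a lab would likely run, and it treats a p-value between 0.05 and 0.10 as "close to significant"; both are assumptions, as are the function names.

```python
import random
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()


def two_sided_p(values, mu0=0.0, sd=1.0):
    """Two-sided z-test p-value with known sd (a simplifying assumption)."""
    n = len(values)
    z = (sum(values) / n - mu0) / (sd / sqrt(n))
    return 2 * STD_NORMAL.cdf(-abs(z))


def type_one_error_rate(add_one_if_close, reps=50000, seed=1):
    """Fraction of null experiments that end with p < 0.05."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(reps):
        values = [rng.gauss(0.0, 1.0) for _ in range(3)]  # null is true
        p = two_sided_p(values)
        if add_one_if_close and 0.05 <= p < 0.10:
            values.append(rng.gauss(0.0, 1.0))            # "measure 1 more animal"
            p = two_sided_p(values)
        if p < 0.05:
            rejections += 1
    return rejections / reps


print(f"Fixed n=3:              {type_one_error_rate(False):.3f}")
print(f"Add one more if close:  {type_one_error_rate(True):.3f}")
```

With the sample size fixed in advance the false positive rate stays near 0.05, while the "one more animal" rule pushes it above 0.05 even though no real effect exists.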
Spreadsheet formation What to collect Everything that might be important for the analysis Plate Batch Technician All potential sources of variability All potential confounders Collect the most accurate version of each variable that you can If it is continuous, collect it as such. Can always dichotomize later
Spreadsheet formation Easiest to move to a statistical package if One row per measurement One column for the outcome, each predictor and potential confounders No open space
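A spreadsheet laid out this way can be checked for open space before it is handed to a statistician. The sketch below reads a small CSV and reports blank cells; the column names and values are hypothetical placeholders, not real data.

```python
import csv
import io


def blank_cells(csv_text):
    """Return (row number, column name) for every empty cell in the table."""
    reader = csv.DictReader(io.StringIO(csv_text))
    problems = []
    for row_number, row in enumerate(reader, start=2):  # row 1 is the header
        for column, value in row.items():
            if column is None:
                continue  # extra fields beyond the header are not checked here
            if value is None or value.strip() == "":
                problems.append((row_number, column))
    return problems


# Hypothetical layout: one row per measurement, one column per variable.
example = """subject,group,plate,technician,expression
S1,patient,1,A,0.94
S2,control,1,A,0.54
S3,patient,2,B,
"""

print(blank_cells(example))
```

Here the last row is missing its outcome, so the check flags row 4 in the expression column.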
Conclusions Sample size for experiment must be considered BEFORE collecting data Can improve power by reducing standard deviation, increasing sample size or increasing difference between groups Important to consider study design as you develop your analysis plan