The Reasons for the Steps of Descriptive Statistics
Comparing the number of AMY1 (salivary amylase) genes in people from cultures with high starch diets to people from cultures with low starch diets. These ten people were sampled randomly from a much larger random sample.
First, a note on sampling: We take sample measurements from a population of possible measurements in an attempt to estimate some parameters of the population, like the population mean and the population standard deviation. Thus, the descriptive statistics we generate from a sample are themselves estimates, and all estimates are plagued with error and uncertainty.
After obtaining the sample and calculating the sample mean, we square the difference between each measurement in a group and that group's sample mean. Squaring makes every difference positive, so differences above and below the mean cannot cancel each other out. It also amplifies the big differences (giving them more weight, because they may be notable outliers) and minimizes the little differences (which may be chance/accidental differences).
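The squaring step can be sketched in a few lines of Python. The gene counts below are hypothetical values made up for illustration (the slide's data table is not reproduced in this text), and the variable names are this sketch's own.

```python
# Hypothetical AMY1 gene counts for one group (made-up values, not the slide's data).
counts = [4, 6, 6, 7, 9]

sample_mean = sum(counts) / len(counts)

# Difference between each measurement and the sample mean, then its square.
differences = [x - sample_mean for x in counts]
squared_differences = [d ** 2 for d in differences]

# The raw differences sum to (about) zero, so they cannot measure spread;
# the squared differences are all positive and do not cancel.
print("sum of differences:", sum(differences))
print("sum of squared differences:", sum(squared_differences))
```

In this made-up sample, the largest differences (4 and 9 genes) contribute nearly all of the sum of squares, while the measurements close to the mean contribute very little.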
The sample variance is the sum of the squared differences divided by n – 1, which is roughly the average squared difference for the sample. One reason we divide by n – 1 instead of n is that it increases our estimate of the experimental error (noise) just a bit. Measurements tend to sit closer to their own sample mean than to the true population mean, so dividing by n would underestimate the population variance; dividing by n – 1 forces us to be a little more conservative when making generalizations from our sample of measurements. Another reason is that when we calculated the sample mean, we lost a degree of freedom: once the mean is known, only n – 1 of the differences are free to vary.
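As a rough illustration of the n – 1 divisor, the sketch below computes the variance both ways for a hypothetical set of gene counts (made-up values, not the slide's data) and cross-checks the result against Python's standard `statistics` module.

```python
import statistics

# Hypothetical AMY1 gene counts (made-up values, not the slide's data).
counts = [4, 6, 6, 7, 9]
n = len(counts)

sample_mean = sum(counts) / n
sum_of_squares = sum((x - sample_mean) ** 2 for x in counts)

# Sample variance: divide by n - 1 (one degree of freedom was used by the mean).
variance_n_minus_1 = sum_of_squares / (n - 1)

# Dividing by n instead always gives a smaller number.
variance_n = sum_of_squares / n

print("divide by n - 1:", variance_n_minus_1)
print("divide by n:    ", variance_n)

# statistics.variance uses n - 1; statistics.pvariance uses n.
print("statistics.variance: ", statistics.variance(counts))
print("statistics.pvariance:", statistics.pvariance(counts))
```

The gap between the two results shrinks as n grows, so the choice of divisor matters most for small samples like the ten people described here.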
In this next step, we calculate the sample standard deviation by taking the square root of the sample variance. Taking the square root returns the measure of spread to the units of measurement we had at the beginning: in this case, the number of genes a person has, instead of (number of genes)².
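A minimal sketch of this step, again using hypothetical gene counts rather than the slide's data: the standard deviation is just the square root of the sample variance, so it is expressed in genes rather than genes squared.

```python
import math
import statistics

# Hypothetical AMY1 gene counts (made-up values, not the slide's data).
counts = [4, 6, 6, 7, 9]

sample_variance = statistics.variance(counts)  # units: (number of genes) squared
sample_sd = math.sqrt(sample_variance)         # units: number of genes

print("sample variance:", sample_variance)
print("sample standard deviation:", sample_sd)

# statistics.stdev performs the same two steps in one call.
print("statistics.stdev:", statistics.stdev(counts))
```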
The equation for the sample standard error of the mean (SEM) follows from the relationship between the dispersion of individual observations around the population mean (the standard deviation) and the dispersion of sample means around the population mean (the standard error). When enough samples of size n are taken from a population, the mean of their sample means converges on the actual population mean. The standard deviation of those sample means is smaller for larger samples, and it converges on the population’s true standard deviation divided by the square root of the sample size, √n. So to estimate the SEM from a single sample, you take the standard deviation of the sample (your estimate of the population’s standard deviation) and divide by √n. See the figures on the next slide.
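The relationship between the standard deviation and the standard error can be checked with a small simulation. This is a sketch under stated assumptions: the population is modeled as a normal distribution with a hypothetical mean of 6 and standard deviation of 2 (real gene counts are whole numbers, so this is a simplification), and the seed is fixed only to make the run repeatable.

```python
import math
import random
import statistics

random.seed(1)  # fixed seed so the run is repeatable

TRUE_MEAN = 6.0  # hypothetical population mean
TRUE_SD = 2.0    # hypothetical population standard deviation
N = 10           # size of each sample
SAMPLES = 5_000  # number of samples drawn


def draw_sample():
    """Draw one sample of N measurements from the modeled population."""
    return [random.gauss(TRUE_MEAN, TRUE_SD) for _ in range(N)]


sample_means = [statistics.mean(draw_sample()) for _ in range(SAMPLES)]

mean_of_means = statistics.mean(sample_means)
sd_of_means = statistics.stdev(sample_means)
predicted = TRUE_SD / math.sqrt(N)

print("mean of sample means:", mean_of_means)
print("SD of sample means:  ", sd_of_means)
print("true SD / sqrt(n):   ", predicted)

# In practice only one sample is available, so the SEM is estimated from it.
one_sample = draw_sample()
sem_estimate = statistics.stdev(one_sample) / math.sqrt(N)
print("SEM estimated from one sample:", sem_estimate)
```

The single-sample estimate will usually differ somewhat from the predicted value, because the sample standard deviation is itself only an estimate.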
Recall that the sample mean, the sample standard deviation, and the sample standard error of the mean are all estimates of the same parameters for the actual population. The graphs below show what happens to these estimates as the sample size approaches the actual population size. From: Krzywinski, M. & N. Altman. (2013). Points of significance: Importance of being uncertain. Nature Methods 10:809-810.
When the sample size is large (n ≥ 20), the 95% confidence interval (CI) extends roughly 2 × SEM on either side of the sample mean. When the sample size gets smaller than 20, the 95% CI becomes wider than 2 × SEM (Fig. b). This is because the actual method for calculating the 95% CI multiplies the SEM by a critical value of a statistic called t, and that value grows as the sample size shrinks.
Fig. a: Shows that 95% CIs are expected to span/capture the true population mean about 19 out of every 20 times (n = 10 for this example).
Fig. b: Shows the relationship between 95% CIs and SEM for increasing sample sizes.
From: Krzywinski, M. & N. Altman. (2013). Points of significance: Error bars. Nature Methods 10:921-922.
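A 95% CI built from the t statistic can be sketched as follows. The critical values are two-tailed 95% values copied from a standard t table for a few sample sizes; the helper name `ci95` and the gene counts are hypothetical, chosen only for this illustration.

```python
import math
import statistics

# Two-tailed 95% critical values of t from a standard t table.
# Keys are degrees of freedom (n - 1); only a few sample sizes are listed.
T_CRIT_95 = {4: 2.776, 9: 2.262, 19: 2.093, 29: 2.045, 99: 1.984}


def ci95(sample):
    """Return (lower, upper) for the 95% CI of the sample mean."""
    n = len(sample)
    mean = statistics.mean(sample)
    sem = statistics.stdev(sample) / math.sqrt(n)
    half_width = T_CRIT_95[n - 1] * sem  # KeyError if n - 1 is not in the table
    return mean - half_width, mean + half_width


# Hypothetical AMY1 gene counts for ten people (made-up values).
counts = [4, 5, 5, 6, 6, 6, 7, 7, 8, 9]
lower, upper = ci95(counts)
print("95% CI:", lower, "to", upper)

# The multiplier falls toward 2 as the sample size grows.
for df in sorted(T_CRIT_95):
    print("n =", df + 1, "multiplier =", T_CRIT_95[df])
```

For n = 10 the multiplier is 2.262 rather than 2, which is why the 95% CI is noticeably wider than 2 × SEM for small samples.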
Error bars are not intended to let us decide whether two means are significantly different from each other; they simply show the uncertainty in a sample mean. However, comparing the error bars of two sample means can lead us to hypothesize that the two means may be significantly different from each other.
1. Error bar width and the interpretation of spacing depend on the error bar type. n = 10 in both a and b.
2. Size and position of SEM and 95% CI error bars for different p-values. n = 10 in all cases.
From: Krzywinski, M. & N. Altman. (2013). Points of significance: Error bars. Nature Methods 10:921-922.
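To see why error bar spacing alone does not settle significance, the sketch below compares two hypothetical groups of ten gene counts (made-up values). It assumes the pooled-variance (equal-variance) form of the two-sample t-test, which is one common choice and not necessarily the form used on the slides, and it takes the two-tailed 95% critical value for 18 degrees of freedom from a standard t table.

```python
import math
import statistics

# Hypothetical gene counts for two groups of ten people (made-up values).
group_a = [4, 5, 5, 6, 6, 6, 7, 7, 8, 9]
group_b = [5, 6, 6, 7, 7, 7, 8, 8, 9, 10]


def mean_and_sem(sample):
    """Return the sample mean and its standard error."""
    sem = statistics.stdev(sample) / math.sqrt(len(sample))
    return statistics.mean(sample), sem


mean_a, sem_a = mean_and_sem(group_a)
mean_b, sem_b = mean_and_sem(group_b)

# SEM bars overlap only if the top of the lower bar reaches the bottom of the upper bar.
sem_bars_overlap = mean_a + sem_a >= mean_b - sem_b

# Pooled-variance two-sample t statistic (assumes equal variances in both groups).
n_a, n_b = len(group_a), len(group_b)
pooled_var = ((n_a - 1) * statistics.variance(group_a)
              + (n_b - 1) * statistics.variance(group_b)) / (n_a + n_b - 2)
t_stat = (mean_b - mean_a) / math.sqrt(pooled_var * (1 / n_a + 1 / n_b))

T_CRIT_95_DF18 = 2.101  # two-tailed 95% critical t for 18 degrees of freedom
significant = abs(t_stat) > T_CRIT_95_DF18

print("SEM bars overlap:", sem_bars_overlap)
print("t statistic:", t_stat)
print("significant at the 5% level:", significant)
```

With these made-up numbers the SEM bars leave a small gap, yet the t statistic falls short of the critical value, so a gap between SEM bars is at most a hint that a formal test is worth running.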
A note on interpreting 95% CI and SEM Error Bars
Incorrect –
For 95% CIs: “I am 95% confident that the true mean lies somewhere within the error bars.”
For SEM: “I am 68% confident that the true mean lies somewhere within the error bars.”
“If the 95% error bars do not overlap, the means are significantly different.”
Correct –
For 95% CIs: “The true population mean should be captured by the error bars 95% of the time.”
For SEM: “The true population mean should be captured by the error bars 68% of the time.”
“My error bars either captured the true population mean or they didn’t, I can’t be sure.”
“If the error bars do not overlap, the means may be significantly different or they may not, but an additional statistical test (the t-Test in this case) is required for more confidence.”
From Strode and Brokaw (2015). HHMI Teacher’s Guide: Mathematics and Statistics in Biology.
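The "captured X% of the time" reading can be illustrated with a simulation that repeats the sampling many times and counts how often each kind of error bar contains the true mean. This is a sketch under assumptions: the population is modeled as normal with a hypothetical mean of 6 and standard deviation of 2, each sample has n = 10, and the critical value 2.262 is the two-tailed 95% value of t for 9 degrees of freedom from a standard t table.

```python
import math
import random
import statistics

random.seed(2)  # fixed seed so the run is repeatable

TRUE_MEAN = 6.0    # hypothetical population mean
TRUE_SD = 2.0      # hypothetical population standard deviation
N = 10             # sample size
T_CRIT_95 = 2.262  # two-tailed 95% critical t for 9 degrees of freedom
TRIALS = 5_000     # number of repeated samples

ci_hits = 0
sem_hits = 0
for _ in range(TRIALS):
    sample = [random.gauss(TRUE_MEAN, TRUE_SD) for _ in range(N)]
    mean = statistics.mean(sample)
    sem = statistics.stdev(sample) / math.sqrt(N)
    miss = abs(mean - TRUE_MEAN)
    if miss <= T_CRIT_95 * sem:  # true mean lies inside the 95% CI bars
        ci_hits += 1
    if miss <= sem:              # true mean lies inside the SEM bars
        sem_hits += 1

ci_coverage = ci_hits / TRIALS
sem_coverage = sem_hits / TRIALS
print("fraction of 95% CIs capturing the true mean:", ci_coverage)
print("fraction of SEM bars capturing the true mean:", sem_coverage)
```

Any single interval in the loop either contains the true mean or it does not; the percentages describe how the procedure behaves over many repeats. At a sample size this small, the SEM bars may capture the true mean slightly less often than 68%, since that figure is a large-sample approximation.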