Topic 1: Statistical Analysis
Warm-Up: What is the importance of standard deviation with regard to the mean? What do error bars indicate? What percentage of values fall within 1 standard deviation of the mean? And within 2? 1. Standard deviation measures the spread of data around the mean. It can be used either as a measure of variation within a data set or of the reliability of a measurement such as the mean. 2. Error bars are a graphical representation of variability within a data set. 3. About 68% within 1 standard deviation and about 95% within 2 (for normally distributed data).
Error Bars State that error bars are a graphical representation of the variability of data. There is almost always variation in biological data.
Graphs Despite the variety of graphs used in business and the popular press, there are only a few basic styles used in biology, and generally straightforward criteria for which to use in each situation. The object of graphing is to depict numeric data visually, so it is important to avoid visual elements that do not add to seeing the data, and to choose a graph design that visually shows the comparisons you intend to make.
Bar Graphs Bar graphs: These are best used to show numeric data that represent discrete items or experiments. Bars imply that there are no intermediate values (contrast with lines below), and in many (but not all) cases the order of the bars along the X-axis will be arbitrary.
Bar Graphs Side-by-side - Bar graphs may contain just a single series of data, but when they contain more than one, the additional series can be arranged in two ways. In a side-by-side graph, the bars of each series sit next to one another, which allows the series to be visually compared on an item-by-item basis.
Bar Graphs Stacked - Sometimes the numeric values for an item accumulate between series, and the important visual comparison is between items rather than series. In this case, a stacked bar graph is more appropriate. In this example, the bars are oriented horizontally, because the flow of time is often represented horizontally, and the X-axis is now the dependent variable. As a general rule, horizontal bars should only be used if there is a reason to do so.
Bar Graphs Error bars - When the numeric value of a bar is a mean, it is often important to show variability. A common way of doing this is with error bars: lines extending above and below the top of the bar to show some aspect of variability, such as the standard deviation, the standard error of the mean, or the 95% confidence interval of the mean. The error bars can extend up away from the top of the bar only, or both above and below (in that case the bar should have no fill).
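As a rough illustration of two of the quantities an error bar might show, the sketch below computes the standard deviation and the standard error of the mean for one bar. The measurements and variable names are made up for illustration, and only Python's standard library is assumed.

```python
import math
import statistics

# Hypothetical replicate measurements for one bar (illustrative values only)
measurements = [4.1, 3.8, 4.4, 4.0, 3.7]

mean = statistics.mean(measurements)
sd = statistics.stdev(measurements)        # sample standard deviation (n - 1 divisor)
sem = sd / math.sqrt(len(measurements))    # standard error of the mean

# Lower and upper ends of an error bar drawn as mean ± SD or mean ± SEM
sd_bar = (mean - sd, mean + sd)
sem_bar = (mean - sem, mean + sem)

print(f"mean = {mean:.2f}, SD = {sd:.2f}, SEM = {sem:.2f}")
```

The SEM bar is always shorter than the SD bar for the same data, so a graph should state which one its error bars represent.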
Graphs Floating error bars - The same graph can be constructed without the bars: the error bars remain, but the mean is now represented by a symbol. The choice between this and the graph above is not straightforward, and different disciplines characteristically use one or the other. The bar graph above visually stresses comparison of the means over comparison of the variation; this example stresses comparison of the variation and de-emphasizes comparison of the means.
Box Plots Box plots - Sometimes it is useful to show a visual representation of variability in data without resorting to parametric measures of variation. A box plot depicts the median, rather than the mean (although many graph programs substitute the mean), and the quartiles (the 25% of the data above the median and the 25% below the median). This example adds thin lines including 90% of the data, and those individual data points that are outliers. Note that these are not confidence intervals; they are measures of the actual data.
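The values a box plot is built from can be sketched with Python's standard library. The data set below is hypothetical, and the "inclusive" quartile method is only one of several conventions; graphing programs may compute quartiles slightly differently.

```python
import statistics

# Hypothetical data set (illustrative values only), already sorted for readability
data = [2, 4, 4, 5, 7, 9, 10, 12, 15]

median = statistics.median(data)

# Quartile cut points; "inclusive" is one common convention among several
q1, q2, q3 = statistics.quantiles(data, n=4, method="inclusive")
iqr = q3 - q1  # interquartile range: the height of the box

print(f"median = {median}, Q1 = {q1}, Q3 = {q3}, IQR = {iqr}")
```

The box spans Q1 to Q3 with a line at the median, so it describes the data themselves rather than an estimate of the mean.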
Line Graph Line graph: Line graphs best represent data that are samples from continuous phenomena. The visual implication of the line is that intermediate points exist, but were not sampled. Values taken over time or through space fit this criterion, as do observations at different dosages (assuming that the dosage could be varied continuously). The order of the data along the X-axis is of course not arbitrary with a line graph. In this example, there are error bars for the individual samples. The samples are also connected by straight lines; they could also be connected by spline curves, which would give a smoother appearance, but which are no better predictors of intermediate values.
2. Calculating Mean The arithmetic mean is another name for the average of a set of scores. The mean can be found by dividing the sum of the scores by the number of scores. For example, the mean of 5, 8, 2, and 1 can be found by first adding up the numbers. 5 + 8 + 2 + 1 = 16. The mean is then found by taking this sum and dividing it by the number of scores. Our data set 5, 8, 2, and 1 has 4 different numbers, hence the mean is 16 ÷ 4 = 4.
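The same calculation can be written as a short Python sketch; the variable names are chosen here for illustration.

```python
scores = [5, 8, 2, 1]

total = sum(scores)          # 5 + 8 + 2 + 1 = 16
mean = total / len(scores)   # 16 / 4 = 4.0

print(mean)
```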
2. Calculating Standard Deviation Variance and Standard Deviation - The variance and standard deviation of a data set measure the spread of the data about the mean of the data set. The variance of a sample of size n, represented by $s^2$, is given by: $s^2 = \frac{\sum (x - \bar{x})^2}{n - 1}$, where $\bar{x}$ is the sample mean. The standard deviation (s) is the square root of the variance.
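A minimal sketch of the calculation, reusing the scores 5, 8, 2, and 1 from the mean example and spelling out each step of the sample variance (the n - 1 form); the variable names are illustrative.

```python
import math

scores = [5, 8, 2, 1]
n = len(scores)
mean = sum(scores) / n                                   # 4.0

# Squared distance of each score from the mean: 1, 16, 4, 9
squared_deviations = [(x - mean) ** 2 for x in scores]

variance = sum(squared_deviations) / (n - 1)             # 30 / 3 = 10.0
std_dev = math.sqrt(variance)                            # about 3.16

print(f"variance = {variance}, standard deviation = {std_dev:.2f}")
```

Python's built-in `statistics.variance` and `statistics.stdev` use the same n - 1 divisor and can be used to check the result.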
Standard Deviation 3. State that the term standard deviation is used to summarize the spread of values around the mean, and that 68% of the values fall within one standard deviation of the mean. For normally distributed data, about 68% of the values fall within one standard deviation of the mean, and about 95% fall within two standard deviations of the mean.
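These percentages can be checked against the normal distribution itself. The sketch below uses `NormalDist` from Python's standard library (available from Python 3.8); the helper name `fraction_within` is made up for illustration.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0, sigma=1)

def fraction_within(k):
    """Fraction of a normal distribution lying within k standard deviations of the mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)

within_one = fraction_within(1)   # roughly 0.68
within_two = fraction_within(2)   # roughly 0.95

print(f"within 1 SD: {within_one:.1%}, within 2 SD: {within_two:.1%}")
```

The figures apply to normally distributed data; a strongly skewed data set need not follow them.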
Why use Standard Deviation? 4. Explain how the standard deviation is useful for comparing the means and the spread of data between two or more samples. A small standard deviation means that the data are clustered closely around the mean value. A large standard deviation indicates a wider spread around the mean. Standard deviation can be used to compare the means and spread of two or more data sets.
t-Test 5. Deduce the significance of the difference between two sets of data using calculated values for t and the appropriate tables.
t-Test The t-test assesses whether the means of two groups are statistically different from each other. The larger the difference between the two means, the larger t is. The larger the standard deviations, the smaller t is.
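Both tendencies can be seen in a small sketch. The slides do not give a formula for t, so the function below assumes the common equal-variance (pooled) form for two independent samples; the function name and the summary values are hypothetical.

```python
import math

def pooled_t(mean1, sd1, n1, mean2, sd2, n2):
    """t statistic for two independent samples, assuming equal variances."""
    pooled_var = ((n1 - 1) * sd1**2 + (n2 - 1) * sd2**2) / (n1 + n2 - 2)
    standard_error = math.sqrt(pooled_var * (1 / n1 + 1 / n2))
    return (mean1 - mean2) / standard_error

# Arguments are: mean, standard deviation, sample size for each group
base = pooled_t(12.0, 2.0, 10, 10.0, 2.0, 10)
bigger_difference = pooled_t(14.0, 2.0, 10, 10.0, 2.0, 10)  # means further apart
bigger_spread = pooled_t(12.0, 4.0, 10, 10.0, 4.0, 10)      # larger standard deviations

print(f"{base:.2f} {bigger_difference:.2f} {bigger_spread:.2f}")
```

Doubling the gap between the means doubles t, while doubling both standard deviations halves it.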
t-Test Assumptions: normally distributed data; equal variances; large sample size (at least 10 individuals).
t-Test Normality assumption. The data come from a distribution that has one of those nice bell-shaped curves known as a normal distribution. People worry about violating the assumption of normality because data often look skewed. Fortunately, it has been shown that if the sample size is even moderate for each group, quite severe departures from normality don't seem to affect the conclusions reached.
t-Test Equality of variance. The standard deviations of the two groups should be pretty close to equal. Some researchers have argued that equality of variance is actually more important than the assumption of normality.
t-Test 1. Enter the values in a graphic display calculator or a spreadsheet program, with values for the two populations entered separately. 2. Use the calculator function keys or computer software to calculate t. 3. Find the number of degrees of freedom. This will be the total number of values in both populations, minus 2.
t-Test 4. Find the critical value for t either using the computer software or a table of values of t. The level of significance (P) chosen should be 0.05 (5%) and the appropriate row should be selected according to the number of degrees of freedom. 5. Compare the calculated value of t with the critical value. If the critical value is exceeded, there is evidence of a significant difference between the means, at the 5% level.
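The whole procedure can be sketched in Python as follows. The two data sets are invented for illustration, t is computed in the equal-variance (pooled) form as an assumption, and the critical value 2.101 is the two-tailed table value for 18 degrees of freedom at P = 0.05, typed in by hand rather than computed.

```python
import math
import statistics

# Step 1: hypothetical measurements from two groups (illustrative values only)
group_a = [12, 14, 13, 15, 14, 13, 12, 16, 14, 15]
group_b = [10, 11, 12, 10, 13, 11, 12, 11, 10, 12]

# Step 2: calculate t, assuming equal variances (pooled form)
n_a, n_b = len(group_a), len(group_b)
var_a = statistics.variance(group_a)
var_b = statistics.variance(group_b)
pooled_var = ((n_a - 1) * var_a + (n_b - 1) * var_b) / (n_a + n_b - 2)
standard_error = math.sqrt(pooled_var * (1 / n_a + 1 / n_b))
t = (statistics.mean(group_a) - statistics.mean(group_b)) / standard_error

# Step 3: degrees of freedom = total number of values minus 2
degrees_of_freedom = n_a + n_b - 2

# Step 4: critical value read from a t table (two-tailed, P = 0.05, 18 df)
critical_value = 2.101

# Step 5: compare the calculated t with the critical value
significant = abs(t) > critical_value

print(f"t = {t:.2f}, df = {degrees_of_freedom}, significant at 5%: {significant}")
```

With a different number of values the degrees of freedom change, so the critical value must be looked up again in the appropriate row of the table.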
Correlations 6. Explain that the existence of a correlation does not establish that there is a causal relationship between two variables. A correlation alone cannot be validly used to infer a causal relationship between variables. This does not mean that correlations cannot indicate causal relations. However, the causes underlying the correlation, if any, may be indirect and unknown. Consequently, establishing a correlation between two variables is not a sufficient condition to establish a causal relationship (in either direction).
Correlations Here is a simple example: hot weather may cause both crime and ice-cream purchases. Therefore crime is correlated with ice-cream purchases. But crime does not cause ice-cream purchases and ice-cream purchases do not cause crime.
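A small simulation can make the point concrete. Everything in the sketch below is invented for illustration: temperature is generated at random, and both ice-cream sales and crime are made to depend on temperature plus noise, with no link between the two built in.

```python
import math
import random

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

rng = random.Random(0)  # fixed seed so the run is repeatable

# Hypothetical model: temperature drives both variables; neither affects the other
temperature = [rng.uniform(0, 35) for _ in range(200)]
ice_cream = [3 * t + rng.gauss(0, 5) for t in temperature]
crime = [2 * t + rng.gauss(0, 5) for t in temperature]

r_raw = pearson_r(ice_cream, crime)

# Subtract the modelled effect of temperature, then correlate what is left
ice_cream_resid = [y - 3 * t for y, t in zip(ice_cream, temperature)]
crime_resid = [y - 2 * t for y, t in zip(crime, temperature)]
r_resid = pearson_r(ice_cream_resid, crime_resid)

print(f"raw correlation: {r_raw:.2f}, after removing temperature: {r_resid:.2f}")
```

The raw correlation is strong even though the model contains no direct link, and it largely disappears once the shared cause is accounted for.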
Correlations A correlation between age and height in children is fairly causally transparent, but a correlation between mood and health in people is less so. Does improved mood lead to improved health? Or does good health lead to good mood? Or does some other factor underlie both? Or is it pure coincidence? In other words, a correlation can be taken as evidence for a possible causal relationship, but cannot indicate what the causal relationship, if any, might be.