Data Analysis: Simple Statistical Tests Modified for AP Biology Statistics Unit Lesson
Sampling a Population
When a sample is drawn at random from a general population, certain characteristics of that sample need to be determined. Based on those characteristics, the conclusion reached at the end of the study may be assumed to be representative of the population.
Why Choose a Statistical Analysis?
Choose an estimator function for the characteristic (of the population) to study, and apply this function to the sample to obtain an estimate. Then use the appropriate statistical test to determine whether this estimate could be explained by chance alone.
The Null Hypothesis
The hypothesis that the estimate is based solely on chance is called the null hypothesis (H0). Thus, the null hypothesis is true if the observed data (in the sample) do not differ from what would be expected on the basis of chance alone. The complement of the null hypothesis is called the alternative hypothesis.
The Alternative Hypothesis
The alternative hypothesis, denoted by H1 or Ha, is the hypothesis that sample observations are influenced by some non-random cause. “For example, suppose we wanted to determine whether a coin was fair and balanced. A null hypothesis might be that half the flips would result in Heads and half, in Tails. The alternative hypothesis might be that the number of Heads and Tails would be very different. Symbolically, these hypotheses would be expressed as
H0: p = 0.5
Ha: p ≠ 0.5”
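As a rough illustration of the coin example, the sketch below computes an exact two-sided p-value under H0: p = 0.5. The function name `binomial_two_sided_p` and the data (60 Heads in 100 flips) are hypothetical and are not taken from the lesson; the two-sided rule used here (add up every outcome that is no more likely than the observed one) is one common convention, not the only one.

```python
from math import comb

def binomial_two_sided_p(heads, flips, p=0.5):
    """Exact two-sided p-value for a coin-flip count under H0: P(Heads) = p.

    Sums the probability of every outcome that is no more likely
    than the observed one.
    """
    probs = [comb(flips, k) * p**k * (1 - p)**(flips - k)
             for k in range(flips + 1)]
    observed = probs[heads]
    # A tiny tolerance guards against floating-point ties.
    total = sum(pr for pr in probs if pr <= observed * (1 + 1e-9))
    return min(1.0, total)

# Hypothetical data: 60 Heads in 100 flips.
p_value = binomial_two_sided_p(60, 100)
print(p_value)
```

A p-value above 0.05 here would mean that 60 Heads in 100 flips is not, by the usual cutoff, enough evidence to call the coin unfair.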
Chi-Square Statistics Example
A common analysis is whether Disease X occurs as much among people in Group A as it does among people in Group B.
People are often sorted into groups based on their exposure to some disease risk factor.
We then perform a test of the association between exposure and disease in the two groups.
Hypothetical outbreak of Salmonella on a cruise ship
All 300 people on the cruise ship were interviewed, and 60 of them had symptoms consistent with Salmonella.
Questionnaires indicated many of the case-patients ate tomatoes from the salad bar.
The Study and the Tested Population
Research Question: Is there a statistically significant difference in the amount of illness between those who ate tomatoes (41/130) and those who did not (19/170)?
Null H0: Salmonella infection occurs as much among people in Group A (ate tomatoes) as it does among people in Group B (did not eat tomatoes).
Alternative H1: Salmonella infection occurs much more among people in Group A than it does among people in Group B.
Table 2a. Cohort study: Exposure to tomatoes and Salmonella infection
               Salmonella?
               Yes     No      Total
Tomatoes       41      89      130
No Tomatoes    19      151     170
Total          60      240     300
Characteristics of the Study
To conduct a chi-square test the following conditions must be met:
There must be at least a total of 30 observations (people) in the table.
Each cell must contain a count of 5 or more.
To conduct a chi-square test we compare the observed data (from study results) with the data we would expect to see (calculated).
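The two conditions can be checked mechanically. The sketch below is a minimal illustration; the function name `chi_square_conditions_met` is made up for this example, and the observed counts are the study figures (41 of 130 and 19 of 170 ill), with the "not ill" counts obtained by subtraction.

```python
# Observed counts: rows are Tomatoes / No Tomatoes,
# columns are Salmonella Yes / No (89 = 130 - 41, 151 = 170 - 19).
observed = [[41, 89],
            [19, 151]]

def chi_square_conditions_met(table, min_total=30, min_cell=5):
    """Check the two conditions listed in the lesson for a chi-square test."""
    total = sum(sum(row) for row in table)
    every_cell_ok = all(cell >= min_cell for row in table for cell in row)
    return total >= min_total and every_cell_ok

print(chi_square_conditions_met(observed))  # True
```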
Table 2b. How to calculate the Expected Values: Total Size
               Salmonella?
               Yes     No      Total
Tomatoes       ?       ?       130
No Tomatoes    ?       ?       170
Total          60      240     300
The totals give an overall distribution of people who ate tomatoes and became sick and those who did not.
Based on these distributions we can fill in the empty cells with the expected values.
Calculating the Expected Values
Expected Value = (Row Total x Column Total) / Grand Total
For the first cell, people who ate tomatoes and became ill: Expected value = (130 x 60) / 300 = 26
The same formula can be used to calculate the expected values for each of the other cells.
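The expected-value formula can be applied to every cell at once. The sketch below is one possible way to do it in plain Python; the function name `expected_counts` is an assumption made for this example, and the observed counts are the study figures (41 of 130 and 19 of 170 ill).

```python
def expected_counts(table):
    """Expected count per cell = row total x column total / grand total."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    return [[r * c / grand_total for c in col_totals] for r in row_totals]

# Rows: Tomatoes / No Tomatoes; columns: Salmonella Yes / No.
observed = [[41, 89],
            [19, 151]]
expected = expected_counts(observed)
print(expected)  # [[26.0, 104.0], [34.0, 136.0]]
```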
Table 2c. Complete expected values for exposure to tomatoes
               Salmonella?
               Yes                      No                        Total
Tomatoes       (130 x 60)/300 = 26      (130 x 240)/300 = 104     130
No Tomatoes    (170 x 60)/300 = 34      (170 x 240)/300 = 136     170
Total          60                       240                       300
Formula for each cell of the table: (Observed - Expected)² / Expected
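Table 2d. Expected values for exposure to tomatoes
               Salmonella?
               Yes                      No                         Total
Tomatoes       (41 - 26)²/26 = 8.7      (89 - 104)²/104 = 2.2      130
No Tomatoes    (19 - 34)²/34 = 6.6      (151 - 136)²/136 = 1.7     170
Total          60                       240                        300
The chi-square (χ²) for this example is: 8.7 + 2.2 + 6.6 + 1.7 = 19.2 (about 19.1 if the cells are not rounded before adding).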
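The chi-square calculation can be reproduced in a few lines. The sketch below is only an illustration (the function name `chi_square_statistic` is made up here); it also shows why the slide's 19.2 differs slightly from the unrounded sum: 19.2 comes from rounding each cell to one decimal place before adding.

```python
def chi_square_statistic(observed, expected):
    """Sum of (observed - expected)^2 / expected over every cell."""
    return sum((o - e) ** 2 / e
               for o_row, e_row in zip(observed, expected)
               for o, e in zip(o_row, e_row))

# Rows: Tomatoes / No Tomatoes; columns: Salmonella Yes / No.
observed = [[41, 89], [19, 151]]
expected = [[26.0, 104.0], [34.0, 136.0]]

chi2 = chi_square_statistic(observed, expected)
print(round(chi2, 2))  # 19.09

# Rounding each cell's contribution to one decimal first gives the slide's value.
chi2_rounded = sum(round((o - e) ** 2 / e, 1)
                   for o_row, e_row in zip(observed, expected)
                   for o, e in zip(o_row, e_row))
print(round(chi2_rounded, 1))  # 19.2
```

Either value is far above the cutoff used later in the lesson, so the conclusion does not depend on the rounding.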
Analyze the Chi-Square Test
In general, the higher the chi-square value, the greater the likelihood that there is a statistically significant difference between the two groups you are comparing.
To decide whether the difference is significant, you need to look up the p-value in a chi-square table.
P-Values
Using our hypothetical cruise ship Salmonella outbreak: 32% of people who ate tomatoes got Salmonella as compared with 11% of people who did not eat tomatoes.
How do we know whether the difference between 32% and 11% is a “real” difference? In other words, how do we know that our chi-square value (calculated as 19.2) indicates a statistically significant difference?
The p-value is our indicator.
P-Values
Many statistical tests give both a numeric result (e.g. a chi-square value) and a p-value.
The p-value ranges between 0 and 1.
What does the p-value tell you? The p-value is the probability of getting a result at least as extreme as the one you got, assuming that the two groups you are comparing are actually the same.
P-Values
Start by assuming there is no difference in outcomes between the groups.
Look at the test statistic and p-value to see if they indicate otherwise.
A low p-value means that (assuming the groups are the same) the probability of observing these results by chance is very small. The difference between the two groups is statistically significant.
A high p-value means the data give little evidence that the two groups differ.
A p-value of 1 means the observed counts match the expected counts exactly, so no difference was observed between the two groups.
P-Values <0.05
Generally, if the p-value is less than 0.05, the difference observed is considered statistically significant, i.e. a difference this large would be unlikely to arise by chance alone.
1) The chi-square value is calculated as 19.2
2) There are two groups
3) Degrees of freedom = (rows - 1) x (columns - 1) = (2 - 1) x (2 - 1) = 1
If p-value > 0.05 there is not a significant difference between groups.
If p-value < 0.05 there is a significant difference between groups.
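Instead of a printed chi-square table, the p-value for 1 degree of freedom can be computed directly. The sketch below relies on the fact that a chi-square variable with 1 degree of freedom is the square of a standard normal variable, so its upper-tail probability is erfc(sqrt(χ²/2)); the function name `chi_square_p_value_df1` is made up for this example, and the shortcut does not apply to other degrees of freedom.

```python
from math import erfc, sqrt

def chi_square_p_value_df1(chi2):
    """Upper-tail probability of a chi-square value with 1 degree of freedom."""
    return erfc(sqrt(chi2 / 2))

p_value = chi_square_p_value_df1(19.2)   # chi-square value from the lesson
significant = p_value < 0.05             # decision rule from the lesson

print(p_value)
print(significant)  # True
```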
Null H0: Salmonella infection occurs as much among people in Group A as it does among people in Group B.
χ² = 19.2
Reject H0 because 19.2 is greater than 3.84 (the critical value for p-value = 0.05 with 1 degree of freedom), so p-value < 0.05.
There is a statistically significant difference between the two groups. The Salmonella outbreak might have been due to contaminated tomatoes at the salad bar.
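The cutoff of 3.84 can be checked without a table. As a sketch, and only for 1 degree of freedom, the critical value is the square of the two-sided standard normal cutoff (about 1.96 for 0.05); the function name `critical_value_df1` is an assumption made for this example.

```python
from statistics import NormalDist

def critical_value_df1(alpha=0.05):
    """Chi-square critical value for 1 degree of freedom at level alpha."""
    z = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided normal cutoff
    return z ** 2

critical = critical_value_df1()
chi2 = 19.2                      # chi-square value from the lesson
reject_null = chi2 > critical

print(round(critical, 2))  # 3.84
print(reject_null)         # True
```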