Chi-Square Goodness-of-Fit Test

Slides:



Advertisements
Similar presentations
Chi-Square Tests 3/14/12 Testing the distribution of a single categorical variable :  2 goodness of fit Testing for an association between two categorical.
Advertisements

Statistics: Unlocking the Power of Data Lock 5 Testing Goodness-of- Fit for a Single Categorical Variable Kari Lock Morgan Section 7.1.
Multinomial Experiments Goodness of Fit Tests We have just seen an example of comparing two proportions. For that analysis, we used the normal distribution.
Chi-Square Analysis Mendel’s Peas and the Goodness of Fit Test.
Chapter 11 Inference for Distributions of Categorical Data
CHAPTER 11 Inference for Distributions of Categorical Data
Chapter 26: Comparing Counts. To analyze categorical data, we construct two-way tables and examine the counts of percents of the explanatory and response.
11.4 Hardy-Wineberg Equilibrium. Equation - used to predict genotype frequencies in a population Predicted genotype frequencies are compared with Actual.
Chapter 11: Inference for Distributions of Categorical Data.
Multinomial Experiments Goodness of Fit Tests We have just seen an example of comparing two proportions. For that analysis, we used the normal distribution.
The Practice of Statistics, 5th Edition Starnes, Tabor, Yates, Moore Bedford Freeman Worth Publishers CHAPTER 11 Inference for Distributions of Categorical.
Chapter 11: Inference for Distributions of Categorical Data Section 11.1 Chi-Square Goodness-of-Fit Tests.
Testing for an Association between two Categorical Variables
Statistics: Unlocking the Power of Data Lock 5 STAT 101 Dr. Kari Lock Morgan 10/30/12 Chi-Square Tests SECTIONS 7.1, 7.2 Testing the distribution of a.
Statistics: Unlocking the Power of Data Lock 5 STAT 250 Dr. Kari Lock Morgan SECTION 7.1 Testing the distribution of a single categorical variable : χ.
Lecture 11. The chi-square test for goodness of fit.
Chapter 13- Inference For Tables: Chi-square Procedures Section Test for goodness of fit Section Inference for Two-Way tables Presented By:
Statistics: Unlocking the Power of Data Lock 5 STAT 250 Dr. Kari Lock Morgan SECTION 7.1 Testing the distribution of a single categorical variable : 
Statistics: Unlocking the Power of Data Lock 5 STAT 250 Dr. Kari Lock Morgan SECTION 7.2 χ 2 test for association (7.2) Testing for an Association between.
11.1 Chi-Square Tests for Goodness of Fit Objectives SWBAT: STATE appropriate hypotheses and COMPUTE expected counts for a chi- square test for goodness.
CHAPTER 11: INFERENCE FOR DISTRIBUTIONS OF CATEGORICAL DATA 11.1 CHI-SQUARE TESTS FOR GOODNESS OF FIT OUTCOME: I WILL STATE APPROPRIATE HYPOTHESES AND.
Chapter 11: Categorical Data n Chi-square goodness of fit test allows us to examine a single distribution of a categorical variable in a population. n.
Class Seven Turn In: Chapter 18: 32, 34, 36 Chapter 19: 26, 34, 44 Quiz 3 For Class Eight: Chapter 20: 18, 20, 24 Chapter 22: 34, 36 Read Chapters 23 &
The Chi-Square Distribution  Chi-square tests for ….. goodness of fit, and independence 1.
 Check the Random, Large Sample Size and Independent conditions before performing a chi-square test  Use a chi-square test for homogeneity to determine.
Chi Square Test of Homogeneity. Are the different types of M&M’s distributed the same across the different colors? PlainPeanutPeanut Butter Crispy Brown7447.
Check your understanding: p. 684
CHAPTER 11 Inference for Distributions of Categorical Data
CHAPTER 11 Inference for Distributions of Categorical Data
11.1 Chi-Square Tests for Goodness of Fit
Warm Up Check your understanding p. 687 CONDITIONS:
Chapter 11: Inference for Distributions of Categorical Data
Chapter 11: Inference for Distributions of Categorical Data
Chapter 11 Goodness-of-Fit and Contingency Tables
Chi-Square Goodness of Fit
Analysis of count data 1.
Chapter 11: Inference for Distributions of Categorical Data
Chapter 11: Inference for Distributions of Categorical Data
Chi-Square Goodness-of-Fit Tests
CHAPTER 11 Inference for Distributions of Categorical Data
Inference for Relationships
CHAPTER 11 Inference for Distributions of Categorical Data
Analyzing the Association Between Categorical Variables
WELCOME BACK! days From: Tuesday, January 3, 2017
Chapter 11: Inference for Distributions of Categorical Data
Chapter 11: Inference for Distributions of Categorical Data
Chapter 11: Inference for Distributions of Categorical Data
Chapter 11: Inference for Distributions of Categorical Data
CHAPTER 11 Inference for Distributions of Categorical Data
CHAPTER 11 Inference for Distributions of Categorical Data
Chapter 11: Inference for Distributions of Categorical Data
Chapter 11: Inference for Distributions of Categorical Data
CHAPTER 11 Inference for Distributions of Categorical Data
Chapter 11: Inference for Distributions of Categorical Data
Chapter 11: Inference for Distributions of Categorical Data
UNIT V CHISQUARE DISTRIBUTION
CHAPTER 11 Inference for Distributions of Categorical Data
CHAPTER 11 Inference for Distributions of Categorical Data
Chapter 11: Inference for Distributions of Categorical Data
Inference for Distributions of Categorical Data
CHAPTER 11 Inference for Distributions of Categorical Data
Chapter 11: Inference for Distributions of Categorical Data
CHAPTER 11 Inference for Distributions of Categorical Data
CHAPTER 11 Inference for Distributions of Categorical Data
Chapter 11 Inference for Distributions of Categorical Data
Chapter 11: Inference for Distributions of Categorical Data
Chapter 13: Chi-Square Procedures
Chapter 11: Inference for Distributions of Categorical Data
Inference for Distributions of Categorical Data
Chapter 11: Inference for Distributions of Categorical Data
Presentation transcript:

Chi-Square Goodness-of-Fit Test STAT 250 Dr. Kari Lock Morgan Chi-Square Goodness-of-Fit Test SECTION 7.1 Testing the distribution of a single categorical variable : χ2 goodness of fit (7.1)

Multiple Categories So far, we’ve learned how to do inference for categorical variables with only two categories In lab, you learned how to do inference for a quantitative variable and a categorical variable with multiple categories Today and next class we’ll learn how to do hypothesis tests for categorical variables with multiple categories

Are children with ADHD younger than their peers? Question of the Day Are children with ADHD younger than their peers? pheromones = subconscious chemical signals

ADHD or Just Young? In British Columbia, Canada, the cutoff date for entering school in any year is December 31st, so those born late in the year are almost a year younger than those born early in the year Is it possible that younger students are being over- diagnosed with ADHD? Are children diagnosed with ADHD more likely to be born late in the year, and so be younger than their peers? Data: Sample of kids aged 6-12 in British Columbia who have been diagnosed with ADHD Morrow, R., et. al., “Influence of relative age on diagnosis and treatment of attention-deficit/hyperactivity disorder in children,” Canadian Medical Association Journal, April 17, 2012; 184(7): 755-762.

Categories For simplicity, we’ll break the year into four categories: January – March April – June July – September October – December Are birthdays for kids with ADHD evenly distributed among these four time periods?

Hypothesis Testing State hypotheses Calculate a statistic, based on your sample data Create a distribution of this statistic, as it would be observed if the null hypothesis were true Measure how extreme your test statistic from (2) is, as compared to the distribution generated in (3), to get a p-value Make a conclusion

Hypotheses H0: p1 = p2 = p3 = p4 = 1/4 Ha: At least one pi ≠ 1/4 Define parameters as p1 = Proportion of ADHD births in January – March p2 = Proportion of ADHD births in April – June p3 = Proportion of ADHD births in July – Sep p4 = Proportion of ADHD births in Oct – Dec H0: p1 = p2 = p3 = p4 = 1/4 Ha: At least one pi ≠ 1/4

Observed Counts The observed counts are the actual counts observed in the study Data for boys: Jan-Mar Apr-Jun July-Sep Oct-Dec Observed 6880 7982 9161 8945

Expected Counts The expected counts are the expected counts if the null hypothesis were true For each cell, the expected count is the sample size (n) times the null proportion, pi

Expected Counts n = 32,968 boys with ADHD H0: p1 = p2 = p3 = p4 = ¼ Expected count for each category, if the null hypothesis were true: 32,968(1/4) = 8242 Jan-Mar Apr-Jun July-Sep Oct-Dec Observed 6880 7982 9161 8945 Expected 8242

Chi-Square Statistic Jan-Mar Apr-Jun July-Sep Oct-Dec Observed 6880 7982 9161 8945 Expected 8242 Need a way to measure how far the observed counts are from the expected counts… Use the chi-square statistic :

Chi-Square Statistic Observed Expected Contribution to χ2 Jan-Mar 6880 8242 Apr-Jun 7982 Jul-Sep 9161 Oct-Dec 8945

StatKey More Advanced Randomization Tests: χ2 Goodness-of-Fit

A distribution of the test statistic assuming H0 is true What Next? We have a test statistic. What else do we need to perform the hypothesis test? A distribution of the test statistic assuming H0 is true How do we get this? Two options: Simulation Distributional Theory

Upper-Tail p-value To calculate the p-value for a chi-square test, we always look in the upper tail Why? Values of the χ2 are always positive The higher the χ2 statistic is, the farther the observed counts are from the expected counts, and the stronger the evidence against the null

Randomization Distribution Can we reject the null? Yes No

Chi-Square (χ2) Distribution If each of the expected counts are at least 5, AND if the null hypothesis is true, then the χ2 statistic follows a χ2 –distribution, with degrees of freedom equal to df = number of categories – 1 ADHD Birth Month: df = 4 – 1 = 3

Chi-Square Distribution

χ2 Null Distribution

p-value

Conclusion This is a TINY p-value!!! We have incredibly strong evidence that birthdays of boys with ADHD are not evenly distributed throughout the year.

Chi-Square Test for Goodness of Fit State null hypothesized proportions for each category, pi. Alternative is that at least one of the proportions is different than specified in the null. Calculate the expected counts for each cell as npi . Calculate the χ2 statistic: Compute the p-value as the proportion above the χ2 statistic for either a randomization distribution or a χ2 distribution with df = (# of categories – 1) if expected counts all > 5 Make a conclusion. n = 1

Births Equally Likely? Maybe all birthdays (not just for boys with ADHD) are not evenly distributed throughout the year We have the overall proportion of births for all boys during each time period in British Columbia: January – March: 0.244 April – June: 0.258 July – September: 0.257 October – December: 0.241

Hypotheses Define parameters as p1 = Proportion of ADHD births in January – March p2 = Proportion of ADHD births in April – June p3 = Proportion of ADHD births in July – Sep p4 = Proportion of ADHD births in Oct – Dec H0: p1 = 0.244, p2 = 0.258, p3 = 0.257, p4 = 0.241 Ha: At least one pi is not as specified in H0

ADHD or Just Young? (BOYS) Birthday Proportion of Births ADHD Expected Counts Jan-Mar 0.244 6880 Apr-Jun 0.258 7982 Jul-Sep 0.257 9161 Oct-Dec 0.241 8945 n = 32,968 The expected count for the Jan-Mar cell is 8044.2 8505.7 8472.8 7945.3

ADHD or Just Young? (BOYS) Birthday Proportion of Births ADHD Expected Counts Contribution to χ2 Jan-Mar 0.244 6880 Apr-Jun 0.258 7982 8505.7 Jul-Sep 0.257 9161 8472.8 Oct-Dec 0.241 8945 7945.3 The contribution to χ2 for the Oct-Dec cell is 32.2 55.9 125.8 168.5

ADHD or Just Young? (BOYS) Birth Date Proportion of Births ADHD Diagnoses Expected Counts Contribution to χ2 Jan-Mar 0.244 6880 168.5 Apr-Jun 0.258 7982 8505.7 32.2 Jul-Sep 0.257 9161 8472.8 55.9 Oct-Dec 0.241 8945 7945.3 n = 32,968 Expected: 8044.2, 8505.7, 8472.8, 7945.3 Chi-square: 168.5 + 32.2 + 55.9 + 125.8 = 382.4

ADHD or Just Young (Boys) p-value ≈ 0 We have VERY (!!!) strong evidence that boys with ADHD do not have the same distribution of birthdays as all boys in British Columbia.

A Deeper Understanding The p-value only tells you that the counts provide evidence against the null distribution If you want to know more, look at Which cells contribute most to the χ2 statistic? Are the observed or expected counts higher?

ADHD or Just Young? (BOYS) Birth Date Proportion of Births ADHD Diagnoses Expected Counts Contribution to χ2 Jan-Mar 0.244 6880 8044.2 168.5 Apr-Jun 0.258 7982 8505.7 32.2 Jul-Sep 0.257 9161 8472.8 55.9 Oct-Dec 0.241 8945 7945.3 125.8 n = 32,968 Expected: 8044.2, 8505.7, 8472.8, 7945.3 Chi-square: 168.5 + 32.2 + 55.9 + 125.8 = 382.4 Big contributions to χ2 : beginning and end of the year Jan-Feb: Many fewer ADHD than would be expected Oct-Dec: Many more ADHD cases than would be expected

ADHD and Birth Year Stop and think about the implications of this! In British Columbia, 6.9% of all boys 6-12 are diagnosed with ADHD 5.5% of boys 6-12 receive medication for ADHD How many of these diagnoses are simply due to the fact that these kids are younger than their peers??? ARE YOU CONCERNED?

ADHD or Just Young? (Girls) Want more practice? Here is the data for girls. (χ2 = 236.8) Birth Date Proportion of Births ADHD Diagnoses Jan-Mar 0.243 1960 Apr-Jun 0.258 2358 Jul-Sep 0.257 2859 Oct-Dec 0.242 2904

To Do HW 7.1 (due Wednesday, 4/12)

Mendel’s Pea Experiment In 1866, Gregor Mendel, the “father of genetics” published the results of his experiments on peas He found that his experimental distribution of peas closely matched the theoretical distribution predicted by his theory of genetics (involving alleles, and dominant and recessive genes) Source: Mendel, Gregor. (1866). Versuche über Pflanzen-Hybriden. Verh. Naturforsch. Ver. Brünn 4: 3–47 (in English in 1901, Experiments in Plant Hybridization, J. R. Hortic. Soc. 26: 1–32)

Mendel’s Pea Experiment Mate SSYY with ssyy: 1st Generation: all Ss Yy Mate 1st Generation: => 2nd Generation  Second Generation Phenotype Theoretical Proportion Round, Yellow 9/16 Round, Green 3/16 Wrinkled, Yellow Wrinkled, Green 1/16 S, Y: Dominant s, y: Recessive

Mendel’s Pea Experiment Phenotype Theoretical Proportion Observed Counts Round, Yellow 9/16 315 Round, Green 3/16 101 Wrinkled, Yellow 108 Wrinkled, Green 1/16 32 Let’s test this data against the null hypothesis of each pi equal to the theoretical value, based on genetics

Mendel’s Pea Experiment Phenotype Null pi Observed Counts Expected Counts Round, Yellow 9/16 315 Round, Green 3/16 101 Wrinkled, Yellow 108 Wrinkled, Green 1/16 32 The expected count for the round, yellow phenotype is 177.2 310.5 312.75 318.25 n = 556 Expected = 312.75, 104.25, 104.25, 34.75 chisquare = 0.47

Mendel’s Pea Experiment Phenotype Null pi Observed Counts Expected Counts Contribution to χ2 Round, Yellow 9/16 315 312.75 Round, Green 3/16 101 104.25 Wrinkled, Yellow 108 Wrinkled, Green 1/16 32 34.75 Phenotype Null pi Observed Counts Expected Counts Contribution to χ2 Round, Yellow 9/16 315 312.75 0.016 Round, Green 3/16 101 104.25 0.101 Wrinkled, Yellow 108 0.135 Wrinkled, Green 1/16 32 34.75 0.1218 The contribution to the χ2 statistic for the round, yellow phenotype is 0.012 0.014 0.016 0.018 n = 556 Expected = 312.75, 104.25, 104.25, 34.75 chisquare = 0.47

Mendel’s Pea Experiment χ2 = 0.47 Two options: Simulate a randomization distribution Compare to a χ2 distribution with 4 – 1 = 3 df

Mendel’s Pea Experiment p-value = 0.925 Does this prove Mendel’s theory of genetics? Or at least prove that his theoretical proportions for pea phenotypes were correct? Yes No

To Do HW 7.1 (due Wednesday, 4/12)