Analysis of frequency counts with Chi square


The autumn term lectures defined a level of measurement called categorical, but they did not cover the kinds of statistics that are possible when a variable is measured at the categorical level.
Dr David Field

Summary

- Categorical data and frequency counts
- One-variable chi-square
  - testing the null hypothesis that frequencies in the sample are equally divided among the categories
  - varying the null hypothesis
- Two-variable chi-square
  - testing the null hypothesis that status on one categorical variable is independent of status on another categorical variable
- Limitations and assumptions of chi-square

Andy Field chapter 18 covers chi-square. There is also a guide online at http://davidmlane.com/hyperstat/ (chi-square is topic 16 in the list).

Categorical data (18.2)

- Each participant is a member of a single category, and the categories cannot be meaningfully placed in order, e.g. nationality = French, German, Italian
  - Sometimes chi-square is used with ordered categories, e.g. age bands
- To perform statistical tests with categorical data, each participant must be a member of only one category
  - Category membership must be mutually exclusive: you can't be both a smoker and a non-smoker
  - This allows frequency counts in each category to be calculated

Chi square

If you can express the data as frequency counts in several categories, then chi-square can be used to test for differences between the categories. You will also see chi-square written as the Greek letter chi with a superscript 2: χ².

Chi square with a single categorical variable

Suppose we are interested in which drink is most popular. We ask a sample of 100 people if they prefer to drink coffee, tea, or water.
- Each respondent is only allowed to select one answer
- This is important: if each person can have membership of more than one category you can't use chi-square
By default, the null hypothesis for chi-square is that each of the categories is equally frequent in the underlying population
- it is possible to modify this (see later)

One variable chi-square example

Let's say that the preferences expressed by the sample of 100 people result in the following observed frequency counts:

  tea      39
  coffee   30
  water    31
  SUM     100

The null hypothesis assumes that each category is equally frequent, and thus provides a model that the data can be used to test. Based on the null hypothesis, the expected frequency counts would be 100 / 3 = 33.3 per category. The chi-square statistic works out the probability that the observed frequencies could be obtained by random sampling from a population where the null hypothesis is true.

One variable chi-square example

  Observed   Expected   Difference   Difference squared   Divide by expected
  39         33.3
  30         33.3
  31         33.3
  100

Here is a table of the expected and observed frequencies. Each row is one of the categories from the previous slide, i.e. tea, coffee and water.

One variable chi-square example

  Observed   Expected   Difference   Difference squared   Divide by expected
  39         33.3        5.7
  30         33.3       -3.3
  31         33.3       -2.3
  100

Begin by quantifying the difference between the expected frequencies (null hypothesis) and the observed frequencies.

One variable chi-square example

  Observed   Expected   Difference   Difference squared   Divide by expected
  39         33.3        5.7          32.49
  30         33.3       -3.3          10.89
  31         33.3       -2.3           5.29
  100

The second step is to square the difference scores. Squaring has two important effects. First, it removes any minus signs, so that we deal only with the magnitude of the difference, not its direction. This tells us that chi-square is insensitive to the direction of differences. Second, squaring increases (exaggerates) the contribution to the eventual statistic made by larger differences, and reduces the emphasis on small differences between observed and expected. This is intuitively useful because small differences are more likely than large differences to occur due to sampling error.

One variable chi-square example

  Observed   Expected   Difference   Difference squared   Divide by expected
  39         33.3        5.7          32.49                0.98
  30         33.3       -3.3          10.89                0.33
  31         33.3       -2.3           5.29                0.16
  100

The next step is to divide each squared difference score by the expected score. This is a bit like dividing by an estimate of the sample variance in the t-test formula. Consider a difference score of 5. If the expected score was 180, then an observed score of 175 or 185 (either of which would produce a difference of 5) does not seem very different from the null hypothesis (expected) value of 180. On the other hand, if the expected score was 15, then the observed scores that could produce a difference of 5 are 10 or 20, and a difference of 5 now seems like a big difference. By dividing the squared difference by the expected value we express the squared difference as a proportion of the expected score. This puts all squared difference scores onto a common scale, regardless of the initial sample sizes and expected scores.

One variable chi-square example

  Observed   Expected   Difference   Difference squared   Divide by expected
  39         33.3        5.7          32.49                0.98
  30         33.3       -3.3          10.89                0.33
  31         33.3       -2.3           5.29                0.16
  100                                                SUM   1.47

Finally, one sums the figures in the last column to arrive at the value of chi-square. Chi-square is the sum of all the (squared, scaled) differences between the expected and obtained frequencies. Therefore, it is a measure of how similar the observed and expected frequencies are. A bigger value of chi-square indicates a greater difference from the null hypothesis.
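The whole table of calculations above can be reproduced in a few lines. This is an illustrative sketch using Python's scipy library (the lecture itself uses SPSS); note that exact arithmetic gives a statistic of 1.46, with the slides' 1.47 arising from rounding the expected frequencies to 33.3.

```python
from scipy.stats import chisquare

# Observed drink preferences from the sample of 100 people
observed = [39, 30, 31]  # tea, coffee, water

# With no expected frequencies supplied, chisquare tests the null
# hypothesis that all categories are equally frequent (100 / 3 each)
stat, p = chisquare(observed)

print(round(stat, 2))  # 1.46 (the slides obtain 1.47 by rounding)
print(round(p, 2))     # 0.48
```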

Converting chi square to a p value

SPSS will do this for you. Chi-square has degrees of freedom equal to the number of categories minus 1 (2 in the example): if you knew the frequencies of preference for tea and coffee, and the sample size, the frequency of preference for water would not be free to vary.

"The chi square value of 1.47, df = 2, had an associated p value of 0.48, so the null hypothesis that preferences for drinking tea, coffee and water in the population are equal cannot be rejected."
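The conversion itself is just the upper-tail probability of the chi-square distribution with the appropriate degrees of freedom. A sketch in Python (an illustrative aside, not part of the lecture):

```python
from scipy.stats import chi2

chi_square_value = 1.47
df = 2  # number of categories minus 1

# Survival function = P(chi-square >= observed value) under the null
p = chi2.sf(chi_square_value, df)
print(round(p, 2))  # 0.48
```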

One variable chi square with unequal expected frequencies

By default, the expected frequencies are just the sample size divided equally among the number of categories. But sometimes this is inappropriate. For example, we know that the percentage of the UK population that smokes is less than 50%. Let's assume for purposes of illustration that 25% of the UK population are smokers. We might hypothesise that the smoking rate in Glasgow is higher than the UK average rate; the null hypothesis is that it is the same.

One variable chi square with unequal expected frequencies

We ask 200 adults in Glasgow if they smoke: 80 say yes, 120 say no. We know that the UK average rate is 25%, and 80 is rather more than 25% of 200. Chi-square can be used to assess the probability of the above frequencies being obtained by random sampling if the real smoking rate in Glasgow was actually 25%.

One variable chi-square example with unequal expected frequencies

  Observed   Expected   Difference   Difference squared   Divide by expected
  120        150        -30          900                    6
   80         50         30          900                   18
  200                                               SUM    24

The difference between this table and the previous example is that the expected frequencies have been entered based on a null hypothesis of a 25/75 split, rather than an equal split between categories. Notice that although the squared differences are the same in both rows, the contributions to the overall chi-square statistic are unequal because of the division by the two different expected values.

One variable chi square with unequal expected frequencies “80 of the sample of 200 people from Glasgow classified themselves as smokers. This resulted in a chi square value of 24.0, df = 1 with an associated p value of < 0.001, so the null hypothesis that smoking rates in Glasgow are equal to the UK average of 25% can be rejected.”
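In scipy this corresponds to supplying the unequal expected frequencies explicitly (an illustrative sketch; the lecture itself uses SPSS):

```python
from scipy.stats import chisquare

observed = [80, 120]                 # smokers, non-smokers in the Glasgow sample
expected = [0.25 * 200, 0.75 * 200]  # 25/75 split under the null: 50, 150

stat, p = chisquare(observed, f_exp=expected)
print(stat)       # 24.0
print(p < 0.001)  # True: reject the null hypothesis
```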

Chi square with two variables (18.3)

Usually, it is more interesting to use chi-square to ask about the relationship between 2 categorical variables. For example, what is the relationship between gender and smoking?
- gender can be male or female
- smoking can be smoker or non-smoker
If you have smoking data from just men, you can only use chi-square to ask if the proportion of smokers and non-smokers is different. If you have smoking data from men and women, you can use chi-square to ask if the proportion of men who smoke differs from the proportion of women who smoke.

What 2*2 chi square does not do

It is important to realise that in the 2*2 chi-square, having a big imbalance between the number of men and the number of women will not increase the value of the chi-square statistic. Also, having a big imbalance between the number of smokers and non-smokers will not increase the value of the chi-square statistic. This contrasts with the one-variable chi-square, where an imbalance in the numbers of men vs. women, or smokers vs. non-smokers, does increase the value of chi-square. The value of chi-square for two variables is high if smoking frequency is contingent on gender, and low if smoking frequency is independent of gender.

The key to understanding 2*2 chi square

The key to understanding the 2*2 chi-square is how the expected frequencies are calculated. The expected frequencies provide the null hypothesis, or null model, that the chi-square statistic tests. If there are 200 participants, the simplest null model would be to expect 50 female smokers, 50 male smokers, 50 female non-smokers, and 50 male non-smokers. But we already know that it is implausible to expect an equal split of smokers and non-smokers: the expected frequencies will have to allow for the imbalance of smokers vs. non-smokers, and a possible imbalance of men vs. women in the sample. A sample with 20 male smokers, 10 female smokers, 80 male non-smokers and 40 female non-smokers has an imbalance of gender and smoking status, but smoking status does not depend on gender (20% of each gender smokes) and there is no deviation from the null model.

The contingency table of observed frequencies

                  Men   Women   Row totals
  Smoke            13     31        44
  Don't smoke      29     86       115
  Column totals    42    117       159

The first step in calculating the value of chi-square for a two-variable design is to set up a contingency table. In the example, this tabulates the number of male smokers, female smokers, male non-smokers, and female non-smokers in the sample. Each entry in the contingency table is called a "cell"; there are four cells in this table. In more complicated chi-square designs you can have more than four cells, for example a 2 * 4 chi-square has 8 cells. The first step in the calculation is to produce the row and column totals. These tell us that there are 42 men and 117 women in our sample, and 44 smokers and 115 non-smokers. Notice that there are very uneven numbers of men and women in the sample, but in a 2*2 chi-square this inequality won't contribute to the value of chi-square. The value of the chi-square statistic will be determined by the relative proportions of men and women that smoke.

Calculating the expected frequencies The key step in the calculation of chi-square is to estimate the frequency counts that would occur in each cell if the null hypothesis that the row frequencies and column frequencies do not depend upon each other were true To calculate the expected frequency of the male smokers cell, we first need to calculate the proportion of participants that are male, without considering if they smoke or not This proportion is 42 males out of 159 (the total number of participants) 42 / 159 = 0.26

Calculating the expected frequencies

If the null hypothesis is true, and the proportion of female smokers and male smokers is equal, then the proportion of the smokers in the sample that are male should equal the overall proportion of the sample that is male: total number of smokers in the sample (44) * proportion of the sample that is male (0.26). 44 * (42/159) = 11.62

Calculating the expected frequencies

                  Men     Women   Row totals
  Smoke            13      31        44
  (expected)       11.62   32.37
  Don't smoke      29      86       115
  (expected)       30.37   84.62
  Column totals    42     117       159

The expected number of male smokers is (42/159) * 44 = 11.62
The expected number of female smokers is (117/159) * 44 = 32.37 (the proportion of the sample that is female is 117/159 = 0.74)
The expected number of male non-smokers is (42/159) * 115 = 30.37
The expected number of female non-smokers is (117/159) * 115 = 84.62
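The four formulas above share one pattern: each expected frequency is (row total * column total) / grand total. A sketch with numpy (illustrative only; note the slides truncate 32.377 and 30.377 down to 32.37 and 30.37 rather than rounding):

```python
import numpy as np

# Observed contingency table: rows = smoke / don't smoke, columns = men / women
observed = np.array([[13, 31],
                     [29, 86]])

row_totals = observed.sum(axis=1)  # [44, 115]
col_totals = observed.sum(axis=0)  # [42, 117]
grand_total = observed.sum()       # 159

# Under independence, expected = (row total * column total) / grand total
expected = np.outer(row_totals, col_totals) / grand_total
print(np.round(expected, 2))
# [[11.62 32.38]
#  [30.38 84.62]]
```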

Calculating the value of chi square

Each cell in the contingency table makes a contribution to the total chi-square. For each cell, calculate (Observed − Expected) and square it, then divide by the Expected value. Do this for each cell individually and add up the results.

Calculating chi square

                  Men     Women   Row totals
  Smoke            13      31        44
  (expected)       11.62   32.37
  Don't smoke      29      86       115
  (expected)       30.37   84.62
  Column totals    42     117       159

(13 − 11.62)² = 1.90;  1.90 / 11.62 = 0.16

The first thing to note about this contingency table is that the observed and expected values are all very close to each other, so intuitively the value of chi-square is going to be low. The worked example shows the contribution to the overall chi-square from the male smokers cell. To calculate the overall chi-square for this table you need to repeat the process for all four cells and add up the results.

Converting chi-square to a p value

The degrees of freedom for a two-way chi-square depend upon the number of categories in the contingency table: (number of columns − 1) * (number of rows − 1). SPSS will calculate the df and p value for you.

"The chi square value of 0.31, df = 1, had an associated p value of 0.58, so the null hypothesis that the proportion of men and women that smoke is equal cannot be rejected."

Also see 18.5.7
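scipy bundles the whole procedure (expected frequencies, statistic, df and p) into one call. An illustrative sketch; correction=False is needed to reproduce the plain chi-square shown in the slides, because by default scipy applies Yates' continuity correction to 2*2 tables:

```python
from scipy.stats import chi2_contingency

observed = [[13, 31],   # smokers: men, women
            [29, 86]]   # non-smokers: men, women

# Returns the statistic, p value, degrees of freedom, and expected table
stat, p, df, expected = chi2_contingency(observed, correction=False)
print(round(stat, 2), df, round(p, 2))  # 0.31 1 0.58
```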

Larger contingency tables

You can perform chi-square on larger contingency tables. For example, we might be interested in whether the proportion of smokers vs. non-smokers differs according to age, where age is a 3-level categorical variable:
- 20-29 years old
- 30-39 years old
- 40-49 years old
This results in a 2 * 3 contingency table. However, there is some uncertainty as to what a significant chi-square means in this case.

Partitioning chi-square

A statistically significant 2 * 3 chi-square might have occurred for any of these 3 reasons:
1. The proportion of 20-29 year olds who smoke differs from the proportion of 30-39 year olds who smoke
2. The proportion of 20-29 year olds who smoke differs from the proportion of 40-49 year olds who smoke
3. The proportion of 30-39 year olds who smoke differs from the proportion of 40-49 year olds who smoke
Or all 3 of the above might be true, or 2 of the above might be true. As a researcher, you will want to distinguish between these possibilities.

Partitioning chi-square

The solution is to break the 2 * 3 contingency table into smaller 2 * 2 contingency tables to test each of the comparisons in the list:
1. The proportion of 20-29 year olds who smoke differs from the proportion of 30-39 year olds who smoke
2. The proportion of 20-29 year olds who smoke differs from the proportion of 40-49 year olds who smoke
3. The proportion of 30-39 year olds who smoke differs from the proportion of 40-49 year olds who smoke
Run 3 separate 2 * 2 chi-square tests.

Partitioning chi-square

However, running 3 tests gives 3 chances of a type 1 error occurring. To maintain the probability of a type 1 error at the conventional level of 5%, you divide the alpha level by the number of chi-square tests you run. Effectively, you share the 5% risk of rejecting the null hypothesis due to sampling error equally among the tests you perform:
- For a single chi-square, it is significant if SPSS reports that p is less than 0.05
- For two chi-square tests, each is individually significant at the 0.05 level if SPSS reports that its p is less than 0.025
- For three chi-square tests, each is individually significant at the 0.05 level if SPSS reports that its p is less than 0.0167
You will encounter this procedure in year 2 ANOVA as "correction for multiple comparisons". It is common to all situations where you have to perform multiple null hypothesis tests. It might not be necessary to run all the 2*2 chi-square tests that are possible based on a larger contingency table: you could argue that you only need to run 2 or 3, based on inspection of the differences between expected and observed cell values. This will make it more likely that you will get a significant result. If you have to divide your alpha level among 20 different chi-square tests, nothing will ever be significant.
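The adjustment is simple arithmetic; here is a sketch using hypothetical p values (the three values below are made up purely for illustration):

```python
# Share the 5% type 1 error risk equally among the tests performed
alpha = 0.05
p_values = [0.030, 0.010, 0.200]        # hypothetical results of three 2*2 tests
adjusted_alpha = alpha / len(p_values)  # 0.05 / 3 = 0.0167

for p in p_values:
    verdict = "significant" if p < adjusted_alpha else "not significant"
    print(p, verdict)  # only the test with p = 0.010 passes the adjusted threshold
```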

Warnings about chi-square

- The expected frequency count in any cell must not be less than 5; if this occurs then chi-square is not reliable
  - If the contingency table is 2 * 2 or 2 * 3 you can use the Fisher exact probability test instead (SPSS will report this)
  - For bigger contingency tables the only solution is to "collapse" across categories, but only where this is meaningful. If you began with age categories 0-4, 5-10, 11-15, 16-20 you could collapse to 0-10 and 11-20, which would increase the expected frequencies in each cell
- Finally, remember that the total of the frequencies must equal the number of participants: each person must be a member of only one cell in the table
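For a 2*2 table where expected counts fall below 5, scipy provides the Fisher exact test directly. A sketch with made-up counts (purely illustrative; in SPSS the Fisher result appears alongside the chi-square output):

```python
from scipy.stats import fisher_exact

# A small 2*2 table (made-up counts) where expected frequencies are below 5,
# so the chi-square approximation would be unreliable
table = [[3, 1],
         [1, 3]]

odds_ratio, p = fisher_exact(table)  # exact two-sided test by default
print(round(p, 2))  # 0.49
```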