The χ² (chi-squared) test for independence
IB Math Studies SL, West Hall High School
A random sample of 200 teachers in higher education, secondary schools and primary schools gave the following numbers of men and women in each sector:

          Higher Education   Secondary Education   Primary Education
Male      21                 39                    20
Female    13                 55                    52

We might want to find out whether or not there is an association between ‘age-group taught’ and ‘gender’. One way of finding out is to perform a χ² (chi-squared) test for independence.

To set up the test, we first set up a null hypothesis, H0, and an alternative hypothesis, H1. H0 always states that the data sets are independent, and H1 always states that they are related.

In this case, H0 could be “The age-group taught is independent of gender”. H1 could be “There is an association between age-group taught and gender.”
We put the data into a table. The elements in the table are our observed data, and the table is known as a contingency table.

          Higher Education   Secondary Education   Primary Education
Male      21                 39                    20
Female    13                 55                    52
We then add the row and column totals to the contingency table.

          Higher Education   Secondary Education   Primary Education   TOTAL
Male      21                 39                    20                  80
Female    13                 55                    52                  120
TOTAL     34                 94                    72                  200
From the observed data we can calculate the expected frequencies. The expected frequency for each cell will be:

(row total × column total) / total sample size

For example, the first two expected values in the Male row are 80 × 34 / 200 = 13.6 and 80 × 94 / 200 = 37.6.

          Higher Education   Secondary Education   Primary Education   TOTAL
Male      13.6               37.6                                      80
Female                                                                 120
TOTAL     34                 94                    72                  200
In fact, for this table we only need to work out two of the expected values; the rest follow from the totals. This gives us the degrees of freedom for this table: 2.

          Higher Education   Secondary Education   Primary Education   TOTAL
Male      13.6               37.6                  28.8                80
Female    20.4               56.4                  43.2                120
TOTAL     34                 94                    72                  200
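The expected-frequency rule (row total × column total, divided by the total sample size) can be sketched in a few lines of Python. This is only an illustration of the rule on the teacher data; the function name `expected_frequencies` is made up for this example and is not part of the original slides.

```python
def expected_frequencies(observed):
    """Return the expected frequencies for a contingency table.

    Each expected value is (row total * column total) / grand total.
    `observed` is a list of rows, without the totals.
    """
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    grand_total = sum(row_totals)
    return [
        [r * c / grand_total for c in col_totals]
        for r in row_totals
    ]


# Observed data from the teacher example (rows: Male, Female).
observed = [
    [21, 39, 20],
    [13, 55, 52],
]

expected = expected_frequencies(observed)
for row in expected:
    print([round(value, 1) for value in row])
# [13.6, 37.6, 28.8]
# [20.4, 56.4, 43.2]
```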
You can always find the degrees of freedom by going back to the original table (without the totals). Cross off one column and one row; the number of cells left is the degrees of freedom.

          Higher Education   Secondary Education   Primary Education
Male      21                 39                    20
Female    13                 55                    52

df = (No. of columns - 1) × (No. of rows - 1)

For this table, df = (3 - 1) × (2 - 1) = 2.
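The same count can be done mechanically from the shape of the table. The helper below is a small sketch (the name `degrees_of_freedom` is chosen here for illustration) and assumes the table is given without its totals.

```python
def degrees_of_freedom(observed):
    """(number of rows - 1) * (number of columns - 1), totals excluded."""
    rows = len(observed)
    columns = len(observed[0])
    return (rows - 1) * (columns - 1)


# Teacher example: 2 rows (Male, Female) and 3 columns (sectors).
teachers = [
    [21, 39, 20],
    [13, 55, 52],
]
print(degrees_of_freedom(teachers))  # 2
```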
Contingency table (observed data):

          Higher Education   Secondary Education   Primary Education
Male      21                 39                    20
Female    13                 55                    52

Expected frequencies:

          Higher Education   Secondary Education   Primary Education
Male      13.6               37.6                  28.8
Female    20.4               56.4                  43.2

Now we are ready to calculate the χ² value using the formula:

$$\chi^2_{calc} = \sum \frac{(f_o - f_e)^2}{f_e}$$

where $f_o$ is the observed value and $f_e$ is the expected value.

Finally, look at the critical value that you have been given. If the χ²calc value is less than the critical value, we accept H0, the null hypothesis. If the χ²calc value is more than the critical value, we do not accept the null hypothesis, so we accept H1.

In this case the χ²calc value is 11.3, and the critical value at 5% is 5.991. So we do not accept H0, the null hypothesis. There is an association between age-group taught and gender.
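As a check on the arithmetic, the statistic can be computed by summing (fo - fe)² / fe over every cell of the table. The sketch below uses the observed and expected tables from the slide and the given 5% critical value of 5.991; the function and variable names are illustrative only.

```python
def chi_squared(observed, expected):
    """Sum (f_o - f_e)^2 / f_e over every cell of the table."""
    total = 0.0
    for obs_row, exp_row in zip(observed, expected):
        for f_o, f_e in zip(obs_row, exp_row):
            total += (f_o - f_e) ** 2 / f_e
    return total


# Teacher example (rows: Male, Female).
observed = [[21, 39, 20], [13, 55, 52]]
expected = [[13.6, 37.6, 28.8], [20.4, 56.4, 43.2]]

CRITICAL_VALUE = 5.991  # given critical value at 5% for df = 2

chi2_calc = chi_squared(observed, expected)
reject_h0 = chi2_calc > CRITICAL_VALUE

print(round(chi2_calc, 1), reject_h0)  # 11.3 True
```

Because the calculated value is larger than the critical value, the code reaches the same decision as the slide: H0 is not accepted.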
Using the critical value:

If the χ²calc value is less than the critical value, we accept H0, the null hypothesis. (The χ²calc value is small: there is nothing of significance going on!)

If the χ²calc value is more than the critical value, we do not accept the null hypothesis, so we accept H1. (The χ²calc value is large: there is something of significance going on!)

Using the p-value:

If the p-value is less than the significance level, we do not accept H0, the null hypothesis. We accept H1. (The probability of this happening just by chance is small: there is probably something of significance going on!)

If the p-value is more than the significance level, we accept H0, the null hypothesis. (The probability of this happening just by chance is large: there is probably nothing of significance going on!)
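The p-value rule can be tried out on the teacher example. The sketch below relies on a fact that holds only when there are 2 degrees of freedom: the upper-tail probability of the χ² distribution is then exp(-χ²/2). For other degrees of freedom the GDC or a statistics library is needed. The function names and the 5% significance level are assumptions made for this illustration.

```python
import math


def p_value_df2(chi2_calc):
    """Upper-tail probability for a chi-squared value when df = 2.

    Valid for 2 degrees of freedom only, where P(X > x) = exp(-x / 2).
    """
    return math.exp(-chi2_calc / 2)


def decide(p_value, significance_level=0.05):
    """Apply the p-value rule: a small p-value means H0 is not accepted."""
    if p_value < significance_level:
        return "do not accept H0"
    return "accept H0"


# Teacher example: chi-squared value of 11.3 with df = 2.
p = p_value_df2(11.3)
print(round(p, 4), decide(p))
```

The p-value comes out well below 0.05, which agrees with the decision reached from the critical value.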
You can do all this on the GDC:

1. Enter the data into a matrix: press MATRIX, move to [EDIT] and press ENTER.
2. Enter the size of your matrix; in this case 2 x 3 (2 rows, 3 columns).
3. Enter your data, pressing ENTER after every value.
4. Press STAT, move to [TESTS], scroll up to find the χ² test and press ENTER.
5. You will now see where your table of expected values will be stored; change it if you wish. Otherwise scroll down to Calculate and press ENTER.

In the output, χ² is given to you, p is the probability and df is the degrees of freedom.

To see the table of expected values, press MATRIX, select the matrix where the expected values were stored and press ENTER.

Finally, look at the critical value that you have been given. If the χ²calc value is less than the critical value, we accept the null hypothesis. If the χ²calc value is more than the critical value, we do not accept the null hypothesis, so we accept H1.
Suppose we collect data on the favourite colour of car for men and women.

          Black   White   Red   Blue
Male      51      22      33    24
Female    45      36      22    27

We may want to find out whether favourite colour of car and gender are independent or related. One way of finding out is to perform a χ² (chi-squared) test for independence.

To set up the test, we first set up a null hypothesis, H0, and an alternative hypothesis, H1. H0 always states that the data sets are independent, and H1 always states that they are related.

In this case, H0 could be “The favourite colour of car is independent of gender”. H1 could be “There is an association between favourite colour of car and gender.”
          Black   White   Red   Blue   TOTAL
Male      51      22      33    24     130
Female    45      36      22    27     130
TOTAL     96      58      55    51     260
From the observed data we can calculate the expected frequencies. The expected frequency for each cell will be:

(row total × column total) / total sample size

          Black   White   Red    Blue   TOTAL
Male      48      29      27.5          130
Female                                  130
TOTAL     96      58      55     51     260
In fact, for this table we only need to work out three of the expected values; the rest follow from the totals. This gives us the degrees of freedom for this table: 3.

          Black   White   Red    Blue   TOTAL
Male      48      29      27.5   25.5   130
Female    48      29      27.5   25.5   130
TOTAL     96      58      55     51     260
You can always find the degrees of freedom by going back to the original table (without the totals). Cross off one column and one row; the number of cells left is the degrees of freedom.

          Black   White   Red   Blue
Male      51      22      33    24
Female    45      36      22    27

df = (No. of columns - 1) × (No. of rows - 1)

For this table, df = (4 - 1) × (2 - 1) = 3.
Contingency table (observed data):

          Black   White   Red   Blue
Male      51      22      33    24
Female    45      36      22    27

Expected frequencies:

          Black   White   Red    Blue
Male      48      29      27.5   25.5
Female    48      29      27.5   25.5

Now we are ready to calculate the χ² value using the formula:

$$\chi^2_{calc} = \sum \frac{(f_o - f_e)^2}{f_e}$$

where $f_o$ is the observed value and $f_e$ is the expected value.

Finally, look at the critical value that you have been given. If the χ²calc value is less than the critical value, we accept H0, the null hypothesis. If the χ²calc value is more than the critical value, we do not accept the null hypothesis, so we accept H1.

In this case the χ²calc value is 6.13, and the critical value at 5% is 7.815. So we accept H0, the null hypothesis. There is no association between favourite colour of car and gender.
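For readers who want to check the 6.13, the sum can be written out cell by cell. Two things are inferred here rather than stated outright in the slides: the Female/Red count of 22 comes from the column total (55 - 33), and every expected value is half its column total because both row totals are 130.

```latex
\begin{aligned}
\chi^2_{calc}
  &= \frac{(51-48)^2}{48} + \frac{(22-29)^2}{29}
   + \frac{(33-27.5)^2}{27.5} + \frac{(24-25.5)^2}{25.5} \\
  &\quad + \frac{(45-48)^2}{48} + \frac{(36-29)^2}{29}
   + \frac{(22-27.5)^2}{27.5} + \frac{(27-25.5)^2}{25.5} \\
  &\approx 0.188 + 1.690 + 1.100 + 0.088 + 0.188 + 1.690 + 1.100 + 0.088 \\
  &\approx 6.13
\end{aligned}
```

Since 6.13 is less than the critical value of 7.815, the sum supports the decision to accept H0.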
The entries in the contingency table must be frequencies. The expected frequencies must not be less than 1, and no more than 20% of the entries can be between 1 and 5. Otherwise the test is invalid.
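These conditions can be checked before running the test. The sketch below is one possible reading of the rule: it treats "between 1 and 5" as values of at least 1 but less than 5, which is an assumption about the boundary, and the function name is made up for this example.

```python
def expected_frequencies_valid(expected):
    """Check the validity conditions on a table of expected frequencies.

    - no expected frequency may be less than 1
    - no more than 20% of them may lie between 1 and 5
      (assumed here to mean 1 <= value < 5)
    """
    values = [value for row in expected for value in row]
    if any(value < 1 for value in values):
        return False
    small = sum(1 for value in values if 1 <= value < 5)
    return small / len(values) <= 0.20


# Expected frequencies from the teacher example: all are well above 5.
teacher_expected = [[13.6, 37.6, 28.8], [20.4, 56.4, 43.2]]
print(expected_frequencies_valid(teacher_expected))  # True
```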