The Chi-Squared Test
International Baccalaureate Higher Level

Learning outcomes
This work will help you:
Perform a goodness of fit test
Perform a test for independence on contingency tables
The Chi-squared Distribution
The $\chi^2$ distribution has one parameter $\nu$ (the Greek letter nu, pronounced 'new') and a constant $c$ that depends on $\nu$. The shape of the distribution is given by the probability density function (p.d.f.)
$$f(x) = c\,x^{\nu/2 - 1} e^{-x/2}, \quad x > 0.$$
If X is distributed in this way we write $X \sim \chi^2(\nu)$.
The p.d.f. is very complicated and is not required for the IB course, but the main features of its graph are summarised below.
Some features of the $\chi^2$ distribution are:
It is reverse J-shaped for $\nu = 1$ and $\nu = 2$.
It is positively skewed for $\nu > 2$.
The larger the value of $\nu$, the more symmetrical the distribution becomes.
When $\nu$ is large, the distribution becomes approximately normal.
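The features listed above can be checked numerically. The following is a minimal sketch, assuming the standard normalising constant $c = 1 / (2^{\nu/2}\,\Gamma(\nu/2))$, which the slides do not state; the function name `chi2_pdf` is chosen for illustration only.

```python
import math

def chi2_pdf(x, nu):
    """Density of the chi-squared distribution with nu degrees of freedom (x > 0)."""
    # Assumed normalising constant: 1 / (2^(nu/2) * Gamma(nu/2)).
    c = 1.0 / (2 ** (nu / 2) * math.gamma(nu / 2))
    return c * x ** (nu / 2 - 1) * math.exp(-x / 2)

# Tabulate the density at a few points for several values of nu.
xs = [0.5, 1.0, 2.0, 4.0, 8.0]
for nu in (1, 2, 4, 10):
    row = ", ".join(f"{chi2_pdf(x, nu):.4f}" for x in xs)
    print(f"nu = {nu:2d}: {row}")
```

For $\nu = 1$ and $\nu = 2$ the printed values fall steadily as x increases (the reverse J shape), while for larger $\nu$ they rise to a peak and then fall away with a long right tail.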
The Significance Test
There are two tests:
A test for independence (or for association). This is conducted if you have some practical data with two variables and you want to know whether they are independent or whether there is an association between them. We make two hypotheses: the null hypothesis H0 is that the factors are independent, and the alternative hypothesis H1 is that they are not.
A goodness of fit test. This is used if you have some practical data and you want to know how well it fits a statistical distribution such as a normal or binomial distribution. We make two hypotheses: the null hypothesis H0 is that a particular distribution does provide a model for the data, and the alternative hypothesis H1 is that it does not.
Critical values and levels of significance
The Chi-squared test is a one-tailed test. The idea is that you want to know whether the calculated test statistic lies in the main part of the distribution or in the upper-tail critical (or rejection) region. The boundary of the critical region is called the critical value. If the test statistic lies in the critical region, we reject H0.
The critical value depends on the level of significance of the test and on the number of degrees of freedom. Often a 5% or a 1% level of significance is used, and the critical values can be found from tables.
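To illustrate how a tabulated critical value relates to the significance level, here is a hedged sketch that computes the upper-tail probability of a $\chi^2$ distribution and then searches for the boundary of the critical region by bisection. It uses the standard finite-series forms of the tail probability for whole-number degrees of freedom; the names `chi2_sf` and `chi2_critical` are illustrative, and in the course itself the values are simply read from tables or a calculator.

```python
import math

def chi2_sf(x, df):
    """Upper-tail probability P(X > x) for X ~ chi-squared(df), df a positive integer."""
    if x <= 0:
        return 1.0
    if df % 2 == 0:
        # Even df: exp(-x/2) * sum of (x/2)^j / j! for j = 0 .. df/2 - 1.
        term, total = 1.0, 1.0
        for j in range(1, df // 2):
            term *= (x / 2) / j
            total += term
        return math.exp(-x / 2) * total
    # Odd df: normal-tail term plus a finite correction series.
    total = 0.0
    term = math.sqrt(2 * x / math.pi) * math.exp(-x / 2)
    for r in range(1, (df - 1) // 2 + 1):
        total += term
        term *= x / (2 * r + 1)
    return math.erfc(math.sqrt(x / 2)) + total

def chi2_critical(alpha, df):
    """Value c with P(X > c) = alpha, found by bisection on [0, 200]."""
    lo, hi = 0.0, 200.0
    for _ in range(100):
        mid = (lo + hi) / 2
        if chi2_sf(mid, df) > alpha:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

for df in (1, 2, 4, 6):
    print(df, round(chi2_critical(0.05, df), 3), round(chi2_critical(0.01, df), 3))
```

A smaller significance level pushes the critical value further into the upper tail, so the 1% value is always larger than the 5% value for the same degrees of freedom.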
Steps for carrying out a test
Step 1: Write down the null hypothesis H0 and the alternative hypothesis H1.
Step 2: Calculate a table of expected values.
Step 3: Calculate the test statistic.
Step 4: Find the critical value from your calculator (or from tables).
Step 5: Draw a conclusion depending on whether the test statistic is in the critical region or not.

The test statistic
$$\chi^2_{calc} = \sum \frac{(f_o - f_e)^2}{f_e}$$
where $f_e$ is the expected frequency and $f_o$ is the observed frequency.
The $\chi^2$ distribution can be used as an approximation for the distribution of the test statistic, provided that none of the expected frequencies ($f_e$) falls below 5.
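Step 3 and the check on small expected frequencies can be sketched in a few lines. The function names and the coin-toss numbers below are hypothetical, chosen only to illustrate the arithmetic; they do not come from the slides.

```python
def chi2_statistic(observed, expected):
    """Sum of (fo - fe)^2 / fe over all categories."""
    if len(observed) != len(expected):
        raise ValueError("observed and expected must have the same length")
    return sum((fo - fe) ** 2 / fe for fo, fe in zip(observed, expected))

def small_expected(expected, threshold=5):
    """Expected frequencies below the threshold; these categories need combining."""
    return [fe for fe in expected if fe < threshold]

# Hypothetical illustration: 100 tosses of a coin, 55 heads and 45 tails,
# tested against a fair coin (expected 50 of each).
observed = [55, 45]
expected = [50, 50]

print(chi2_statistic(observed, expected))
print(small_expected(expected))
```

If `small_expected` returns a non-empty list, adjacent categories would normally be combined before the statistic is calculated.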
A headmaster of a large school wants to check on the number of students who are absent during one term. The results are shown in the table below.

Day of the week        Mon   Tues  Weds  Thurs  Fri   Total
Number of absentees    250   171   160   183    236   1000

Test the hypothesis that the number of absentees is independent of the day of the week. Test at the 5% level. What conclusions might the headmaster draw?
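One way to approach this exercise is sketched below, assuming that "independent of the day of the week" means the 1000 absences are expected to be spread evenly over the five days. The critical value 9.488 is the tabulated 5% value for 4 degrees of freedom.

```python
# Observed absentees, Monday to Friday (from the table above).
observed = [250, 171, 160, 183, 236]

# Assumption: under H0 absences are spread evenly over the five days.
total = sum(observed)
expected = [total / len(observed)] * len(observed)

chi2_calc = sum((fo - fe) ** 2 / fe for fo, fe in zip(observed, expected))
df = len(observed) - 1      # one constraint: the totals must agree
critical_5pct = 9.488       # tabulated value for 4 degrees of freedom

print(f"chi2_calc = {chi2_calc:.2f}, df = {df}")
print("Reject H0" if chi2_calc > critical_5pct else "Do not reject H0")
```

Under these assumptions the statistic is far inside the critical region, with the largest contributions coming from Monday and Friday.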
A bag contains red, yellow and green balls in the ratio 3:4:5. A ball is drawn at random from the bag, its colour is noted, and it is then replaced in the bag. In 240 trials the results are as follows.

Colour      Red   Yellow  Green  Total
Frequency   68    74      98     240

Perform a test at the 5% level to determine whether the differences between the observed and expected frequencies are significant.
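A possible working for this exercise is sketched here. The expected frequencies come from sharing the 240 trials in the ratio 3:4:5, and 5.991 is the tabulated 5% critical value for 2 degrees of freedom.

```python
observed = {"Red": 68, "Yellow": 74, "Green": 98}
ratio = {"Red": 3, "Yellow": 4, "Green": 5}

# Share the total number of trials in the stated ratio.
n = sum(observed.values())
parts = sum(ratio.values())
expected = {colour: n * ratio[colour] / parts for colour in ratio}

chi2_calc = sum(
    (observed[colour] - expected[colour]) ** 2 / expected[colour]
    for colour in observed
)
df = len(observed) - 1      # three categories, one constraint on the total
critical_5pct = 5.991       # tabulated value for 2 degrees of freedom

print(expected)
print(f"chi2_calc = {chi2_calc:.4f}, df = {df}")
print("Reject H0" if chi2_calc > critical_5pct else "Do not reject H0")
```

With these figures the statistic lies well below the critical value, so the differences would not be judged significant at the 5% level.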
The table below shows the result of planting seeds in rows of 6 and the number of seeds that germinate in each row after a two-week period. Test at the 10% level whether the data can be modelled by a binomial distribution.

Number of seeds that germinate (x)   1    2    3    4    5    6
Frequency (f)                        15   26   21   14   10   9
The number of telephone calls received by an operator at a hotel between the hours of 9.00 a.m. and 10.00 p.m. over a 100 day period is shown in the table below.

Number of phone calls   1    2    3    4    5
Number of days          25   36   16   11   8

Determine whether a Poisson distribution with mean 2 can model the above distribution. Test at the 10% level.
A survey is carried out at a supermarket till. When the till opens, the number of customers up to and including the first person to use one of the carrier bags provided by the supermarket is recorded. This is repeated on 100 consecutive days. The data are summarised in the table below.
Number of customers: 1 2 3 4 >4
Frequency (f): 79 15
It is thought that this distribution may be modelled by a geometric distribution with parameter p, where p is the probability that a person uses a supermarket carrier bag.
Calculate the mean and hence obtain an estimate of p.
Carry out a test at the 5% significance level of goodness of fit of the model to the data.
The heights, measured in cm, of a group of students are given in the table below. Determine whether the data can be modelled by a normal distribution. Test at the 5% level.
Height in cm: 146-150 151-155 156-160 161-165 166-170 171-175
Frequency: 10 17 20 14 9
Chi-Squared Test for independence on contingency tables
Example
A driving school examined the results of 100 candidates who were taking their driving test for the first time. They found that of the 40 men, 28 passed, and of the 60 women, 34 passed. Do these results indicate, at the 5% significance level, a relationship between the sex of the candidate and the ability to pass first time?
Solution
The results can be shown in a table, known as a 2 × 2 (read '2 by 2') contingency table.

Observed data: Results of first-time candidates
          Pass   Fail   Totals
Male      28     12     40
Female    34     26     60
Totals    62     38     100
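H0: There is no relationship between the sex of the candidate and the ability to pass first time; the attributes are independent.
H1: There is a relationship between the sex of the candidate and the ability to pass first time; the attributes are not independent.
To calculate the expected frequencies: under H0 the events are independent, so P(male and pass) = P(male) × P(pass). Therefore
$$\text{expected frequency} = \frac{\text{row total} \times \text{column total}}{\text{grand total}}$$
For example, the expected frequency of men who pass is 40 × 62 / 100 = 24.8.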
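The rule for expected frequencies can be written as a short function. This is a minimal sketch using the driving-test data from the example; the function name `expected_frequencies` is illustrative.

```python
def expected_frequencies(observed):
    """Expected counts under independence: row total * column total / grand total."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    grand_total = sum(row_totals)
    return [[r * c / grand_total for c in col_totals] for r in row_totals]

# Driving-test data: rows are Male, Female; columns are Pass, Fail.
observed = [[28, 12],
            [34, 26]]

for row in expected_frequencies(observed):
    print([round(fe, 1) for fe in row])
```

Each row and column of the expected table has the same total as the observed table, which is why only one of the four cells needs to be calculated directly.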
We could work through this procedure to give the other expected frequencies, but this is unnecessary, as the other frequencies can be found by using the fact that the sub-totals and totals must agree with those in the observed data.

Expected frequencies: Results of first-time candidates
          Pass   Fail   Totals
Male      24.8   15.2   40
Female    37.2   22.8   60
Totals    62     38     100

Degrees of freedom ($\nu$): the number of independent variables. Once one expected frequency is known, the others are determined by agreement of the totals, so $\nu = 1$.
From the tables, $\chi^2_{5\%}(1) = 3.841$. We test at 5% and reject H0 if $\chi^2_{calc} > 3.841$.

$f_o$    $f_e$    $(f_o - f_e)^2 / f_e$
28     24.8   0.4129
12     15.2   0.6737
34     37.2   0.2753
26     22.8   0.4491
Total         1.8110

Therefore $\chi^2_{calc} = 1.8110$.
As 1.8110 < 3.841, we do not reject H0 and conclude that these results do not indicate a relationship between the sex of the candidate and the ability to pass first time. Equivalently, the p-value is greater than 0.05.
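The arithmetic in this table, and the remark about the p-value, can be reproduced with a short sketch. For 1 degree of freedom the upper-tail probability has the closed form erfc(√(x/2)), which is used here; that formula is a standard result rather than something stated in the slides.

```python
import math

# Observed and expected frequencies from the driving-test example,
# in the order Male/Pass, Male/Fail, Female/Pass, Female/Fail.
observed = [28, 12, 34, 26]
expected = [24.8, 15.2, 37.2, 22.8]

contributions = [(fo - fe) ** 2 / fe for fo, fe in zip(observed, expected)]
chi2_calc = sum(contributions)

critical_5pct = 3.841       # tabulated value for 1 degree of freedom
# For 1 degree of freedom, P(X > x) = erfc(sqrt(x / 2)).
p_value = math.erfc(math.sqrt(chi2_calc / 2))

print([round(c, 4) for c in contributions])
print(f"chi2_calc = {chi2_calc:.4f}, p-value = {p_value:.3f}")
print("Reject H0" if chi2_calc > critical_5pct else "Do not reject H0")
```

Comparing the statistic with the critical value and comparing the p-value with 0.05 always lead to the same decision.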
Contingency tables (h rows and k columns)
Example
In the principality of Viewmania a survey of 200 families known to be regular television viewers was undertaken. They were asked which of the three television channels they watched most during an average week. A summary of their replies is given in the following table, together with the region in which they lived.

Channel watched most   North   East   South   West
CCB1                   29      16     42      23
CCB2                   6       11     26      7
VIT                    15      3      12      10

Find the expected frequencies on the hypothesis that there is no association between the channel watched most and the region. Use the $\chi^2$ distribution and a 5% level of significance to test the above hypothesis.
Solution
H0: There is no association between the channel watched most and the region.
H1: There is an association between the channel watched most and the region.
The observed frequencies are first totalled, and then the expected frequencies under H0 are calculated from
$$\text{expected frequency} = \frac{\text{row total} \times \text{column total}}{\text{grand total}}$$

Observed data:
          North   East   South   West   Totals
CCB1      29      16     42      23     110
CCB2      6       11     26      7      50
VIT       15      3      12      10     40
Totals    50      30     80      40     200

This is a 3 × 4 contingency table.
Expected data
Expected frequency for the northern viewers of CCB1 = 110 × 50 / 200 = 27.5.
This process is continued for the North, East and South entries of the CCB1 and CCB2 rows. The remaining frequencies are found by ensuring that the totals and the sub-totals agree.

          North   East   South   West   Totals
CCB1      27.5    16.5   44      22     110
CCB2      12.5    7.5    20      10     50
VIT       10      6      16      8      40
Totals    50      30     80      40     200

Degrees of freedom: once 6 expected frequencies have been found, the others are known automatically (by agreement of the totals). The number of independent variables is $\nu = (3 - 1)(4 - 1) = 6$, and we consider the $\chi^2(6)$ distribution.
From the tables, $\chi^2_{5\%}(6) = 12.592$. We test at 5% and reject H0 if $\chi^2_{calc} > 12.592$.

$f_o$    $f_e$    $(f_o - f_e)^2 / f_e$
29     27.5   0.0818
16     16.5   0.0152
42     44     0.0909
23     22     0.0455
6      12.5   3.3800
11     7.5    1.6333
26     20     1.8000
7      10     0.9000
15     10     2.5000
3      6      1.5000
12     16     1.0000
10     8      0.5000
Total         13.447

Therefore $\chi^2_{calc} = 13.447$.
As 13.447 > 12.592, we reject H0 and conclude that there is an association between the channel watched most and the region. Equivalently, the p-value is less than 0.05.
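The whole procedure for an h × k table can be gathered into one function. The following is a sketch applied to the Viewmania data; the function name `chi2_independence` is illustrative, and 12.592 is the tabulated 5% critical value for 6 degrees of freedom.

```python
def chi2_independence(observed):
    """Return (statistic, degrees of freedom, expected table) for an h x k table."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    grand_total = sum(row_totals)
    expected = [[r * c / grand_total for c in col_totals] for r in row_totals]
    statistic = sum(
        (fo - fe) ** 2 / fe
        for obs_row, exp_row in zip(observed, expected)
        for fo, fe in zip(obs_row, exp_row)
    )
    df = (len(row_totals) - 1) * (len(col_totals) - 1)
    return statistic, df, expected

# Rows: CCB1, CCB2, VIT. Columns: North, East, South, West.
observed = [[29, 16, 42, 23],
            [6, 11, 26, 7],
            [15, 3, 12, 10]]

statistic, df, expected = chi2_independence(observed)
critical_5pct = 12.592      # tabulated value for 6 degrees of freedom

for row in expected:
    print(row)
print(f"chi2_calc = {statistic:.3f}, df = {df}")
print("Reject H0" if statistic > critical_5pct else "Do not reject H0")
```

The degrees of freedom are computed as (h − 1)(k − 1), which agrees with counting the six expected frequencies that have to be calculated directly.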
Example
A university sociology department believes that students with a good grade in A-level General Studies tend to do well on Sociology degree courses. To check this it collected information on a random sample of 100 students who had just graduated and who had also taken General Studies at A-level. The students' performance in General Studies was divided into two categories, those with A or B and 'others'. Their degree classes were recorded as Class I, Class II, Class III and Fail. The data are given in the table below.

                Class I   Class II   Class III   Fail   Totals
Grade A or B    11        22         6           1      40
Others          4         28         24          4      60
Total           15        50         30          5      100

H0: degree class is independent of General Studies A-level performance.
H1: degree class is not independent of General Studies A-level performance.
Expected data
                Class I   Class II   Class III   Fail   Totals
Grade A or B    6         20         12          2      40
Others          9         30         18          3      60
Total           15        50         30          5      100

The expected frequencies in the Fail column (2 and 3) are below 5, so the Class III and Fail columns are combined.

New Observed and (Expected) data
                Class I   Class II   Class III and Fail   Totals
Grade A or B    11 (6)    22 (20)    7 (14)               40
Others          4 (9)     28 (30)    28 (21)              60
Total           15        50         35                   100
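The slides stop at the combined table. As a hedged sketch of how the test might be completed, the code below pools the last two columns, recomputes the expected frequencies and evaluates the statistic. The 5% significance level is an assumption, since none is stated for this example, and the observed count of 4 for Others/Fail is inferred from the row and column totals.

```python
def expected_frequencies(table):
    """Expected counts under independence: row total * column total / grand total."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    return [[r * c / grand_total for c in col_totals] for r in row_totals]

def combine_last_two(table):
    """Merge the final two columns of every row into a single column."""
    return [row[:-2] + [row[-2] + row[-1]] for row in table]

# Rows: Grade A or B, Others. Columns: Class I, Class II, Class III, Fail.
# The 4 in the Others/Fail cell is inferred from the totals.
observed = [[11, 22, 6, 1],
            [4, 28, 24, 4]]

pooled = combine_last_two(observed)
expected = expected_frequencies(pooled)
statistic = sum(
    (fo - fe) ** 2 / fe
    for obs_row, exp_row in zip(pooled, expected)
    for fo, fe in zip(obs_row, exp_row)
)
df = (len(pooled) - 1) * (len(pooled[0]) - 1)
critical_5pct = 5.991       # tabulated value for 2 degrees of freedom; 5% level assumed

print(pooled)
print(expected)
print(f"chi2_calc = {statistic:.4f}, df = {df}")
print("Reject H0" if statistic > critical_5pct else "Do not reject H0")
```

Combining columns reduces the degrees of freedom from 3 to 2, so the critical value must be taken for the pooled table, not the original one.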